[giow] (0) Fix the UTF-8 decoder error handling to handle a few errors I'd missed, including in particular surrogate halves. This may be a mistake; if I'm forgetting something please let me know so I can fix it. (e.g. did we decide not to catch surrogates or something?)

Fixing http://www.w3.org/Bugs/Public/show_bug.cgi?id=11298

git-svn-id: http://svn.whatwg.org/webapps@5942 340c8d12-0b0e-0410-8428-c7bf67bfef74
Hixie committed Mar 4, 2011
1 parent 2ab06fe commit 74e3b6c
Showing 3 changed files with 60 additions and 36 deletions.
32 changes: 20 additions & 12 deletions complete.html
@@ -3692,39 +3692,47 @@ <h3 id=utf-8><span class=secno>2.4 </span>UTF-8</h3>

<dl class=switch><dt>One byte in the range FE to FF</dt>


<dt><a href=#overlong-form title="overlong form">Overlong forms</a> (e.g. F0 80 80 A0)</dt>

<dt>One byte in the range C0 to C1, followed by one byte in the range 80 to BF</dt>
<dt>One byte in the range C0 to C1, followed by one byte in the range 80 to BF</dt> <!-- overlong ASCII (redundant with the previous line, really, but worth calling out separately as it's especially dangerous to miss this case) -->


<dt>One byte in the range F0 to F4, followed by three bytes in the range 80 to BF that represent a code point above U+10FFFF</dt>

<dt>One byte in the range F5 to F7, followed by three bytes in the range 80 to BF</dt>
<dt>One byte in the range F5 to F7, followed by three bytes in the range 80 to BF</dt> <!-- above U+10FFFF -->

<dt>One byte in the range F8 to FB, followed by four bytes in the range 80 to BF</dt> <!-- above U+10FFFF -->

<dt>One byte in the range FC to FD, followed by five bytes in the range 80 to BF</dt> <!-- above U+10FFFF -->


<dt>One byte in the range C0 to FD that is not followed by a byte in the range 80 to BF</dt> <!-- too short -->

<dt>One byte in the range F8 to FB, followed by four bytes in the range 80 to BF</dt>
<dt>One byte in the range E0 to FD, followed by a byte in the range 80 to BF that is not followed by a byte in the range 80 to BF</dt> <!-- too short -->

<dt>One byte in the range FC to FD, followed by five bytes in the range 80 to BF</dt>
<dt>One byte in the range F0 to FD, followed by two bytes in the range 80 to BF, the last of which is not followed by a byte in the range 80 to BF</dt> <!-- too short -->

<dt>One byte in the range E0 to FD, followed by a byte in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
<dt>One byte in the range F8 to FD, followed by three bytes in the range 80 to BF, the last of which is not followed by a byte in the range 80 to BF</dt> <!-- too short -->

<dt>One byte in the range F0 to FD, followed by two bytes in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
<dt>One byte in the range FC to FD, followed by four bytes in the range 80 to BF, the last of which is not followed by a byte in the range 80 to BF</dt> <!-- too short -->

<dt>One byte in the range F8 to FD, followed by three bytes in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>

<dt>One byte in the range FC to FD, followed by four bytes in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
<dt>Any byte sequence that represents a code point in the range U+D800 to U+DFFF</dt> <!-- surrogate halves -->


<dd>The whole sequence must be replaced by a single U+FFFD
<dd>The whole matched sequence must be replaced by a single U+FFFD
REPLACEMENT CHARACTER.</dd>


<dt>One byte in the range 80 to BF not preceded by a byte in the range 80 to FD</dt>

<dt>A sequence of bytes in the range 80 to BF that does not follow a byte in the range C0 to FD</dt>
<dt>One byte in the range 80 to BF preceded by a byte that is part of a complete UTF-8 sequence that does not include this byte</dt>

<dt>One byte in the range C0 to FD not followed by a byte in the range 80 to BF</dt>
<dt>One byte in the range 80 to BF preceded by a byte that is part of a sequence that has been replaced by a U+FFFD REPLACEMENT CHARACTER, either alone or as part of a sequence</dt>

<dd>Each such byte must be replaced with a U+FFFD REPLACEMENT CHARACTER.</dd>

<dd>Each byte must be replaced with a U+FFFD REPLACEMENT CHARACTER.</dd>

</dl><p>For the purposes of the above requirements, an <dfn id=overlong-form>overlong
form</dfn> in UTF-8 is a sequence that encodes a code point using
32 changes: 20 additions & 12 deletions index
@@ -3672,39 +3672,47 @@ a.setAttribute('href', 'http://example.com/'); // change the content attribute d

<dl class=switch><dt>One byte in the range FE to FF</dt>


<dt><a href=#overlong-form title="overlong form">Overlong forms</a> (e.g. F0 80 80 A0)</dt>

<dt>One byte in the range C0 to C1, followed by one byte in the range 80 to BF</dt>
<dt>One byte in the range C0 to C1, followed by one byte in the range 80 to BF</dt> <!-- overlong ASCII (redundant with the previous line, really, but worth calling out separately as it's especially dangerous to miss this case) -->


<dt>One byte in the range F0 to F4, followed by three bytes in the range 80 to BF that represent a code point above U+10FFFF</dt>

<dt>One byte in the range F5 to F7, followed by three bytes in the range 80 to BF</dt>
<dt>One byte in the range F5 to F7, followed by three bytes in the range 80 to BF</dt> <!-- above U+10FFFF -->

<dt>One byte in the range F8 to FB, followed by four bytes in the range 80 to BF</dt> <!-- above U+10FFFF -->

<dt>One byte in the range FC to FD, followed by five bytes in the range 80 to BF</dt> <!-- above U+10FFFF -->


<dt>One byte in the range C0 to FD that is not followed by a byte in the range 80 to BF</dt> <!-- too short -->

<dt>One byte in the range F8 to FB, followed by four bytes in the range 80 to BF</dt>
<dt>One byte in the range E0 to FD, followed by a byte in the range 80 to BF that is not followed by a byte in the range 80 to BF</dt> <!-- too short -->

<dt>One byte in the range FC to FD, followed by five bytes in the range 80 to BF</dt>
<dt>One byte in the range F0 to FD, followed by two bytes in the range 80 to BF, the last of which is not followed by a byte in the range 80 to BF</dt> <!-- too short -->

<dt>One byte in the range E0 to FD, followed by a byte in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
<dt>One byte in the range F8 to FD, followed by three bytes in the range 80 to BF, the last of which is not followed by a byte in the range 80 to BF</dt> <!-- too short -->

<dt>One byte in the range F0 to FD, followed by two bytes in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
<dt>One byte in the range FC to FD, followed by four bytes in the range 80 to BF, the last of which is not followed by a byte in the range 80 to BF</dt> <!-- too short -->

<dt>One byte in the range F8 to FD, followed by three bytes in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>

<dt>One byte in the range FC to FD, followed by four bytes in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
<dt>Any byte sequence that represents a code point in the range U+D800 to U+DFFF</dt> <!-- surrogate halves -->


<dd>The whole sequence must be replaced by a single U+FFFD
<dd>The whole matched sequence must be replaced by a single U+FFFD
REPLACEMENT CHARACTER.</dd>


<dt>One byte in the range 80 to BF not preceded by a byte in the range 80 to FD</dt>

<dt>A sequence of bytes in the range 80 to BF that does not follow a byte in the range C0 to FD</dt>
<dt>One byte in the range 80 to BF preceded by a byte that is part of a complete UTF-8 sequence that does not include this byte</dt>

<dt>One byte in the range C0 to FD not followed by a byte in the range 80 to BF</dt>
<dt>One byte in the range 80 to BF preceded by a byte that is part of a sequence that has been replaced by a U+FFFD REPLACEMENT CHARACTER, either alone or as part of a sequence</dt>

<dd>Each such byte must be replaced with a U+FFFD REPLACEMENT CHARACTER.</dd>

<dd>Each byte must be replaced with a U+FFFD REPLACEMENT CHARACTER.</dd>

</dl><p>For the purposes of the above requirements, an <dfn id=overlong-form>overlong
form</dfn> in UTF-8 is a sequence that encodes a code point using
32 changes: 20 additions & 12 deletions source
@@ -2663,39 +2663,47 @@ a.setAttribute('href', 'http://example.com/'); // change the content attribute d

<dt>One byte in the range FE to FF</dt>


<dt><span title="overlong form">Overlong forms</span> (e.g. F0 80 80 A0)</dt>

<dt>One byte in the range C0 to C1, followed by one byte in the range 80 to BF</dt>
<dt>One byte in the range C0 to C1, followed by one byte in the range 80 to BF</dt> <!-- overlong ASCII (redundant with the previous line, really, but worth calling out separately as it's especially dangerous to miss this case) -->


<dt>One byte in the range F0 to F4, followed by three bytes in the range 80 to BF that represent a code point above U+10FFFF</dt>

<dt>One byte in the range F5 to F7, followed by three bytes in the range 80 to BF</dt>
<dt>One byte in the range F5 to F7, followed by three bytes in the range 80 to BF</dt> <!-- above U+10FFFF -->

<dt>One byte in the range F8 to FB, followed by four bytes in the range 80 to BF</dt> <!-- above U+10FFFF -->

<dt>One byte in the range FC to FD, followed by five bytes in the range 80 to BF</dt> <!-- above U+10FFFF -->


<dt>One byte in the range C0 to FD that is not followed by a byte in the range 80 to BF</dt> <!-- too short -->

<dt>One byte in the range F8 to FB, followed by four bytes in the range 80 to BF</dt>
<dt>One byte in the range E0 to FD, followed by a byte in the range 80 to BF that is not followed by a byte in the range 80 to BF</dt> <!-- too short -->

<dt>One byte in the range FC to FD, followed by five bytes in the range 80 to BF</dt>
<dt>One byte in the range F0 to FD, followed by two bytes in the range 80 to BF, the last of which is not followed by a byte in the range 80 to BF</dt> <!-- too short -->

<dt>One byte in the range E0 to FD, followed by a byte in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
<dt>One byte in the range F8 to FD, followed by three bytes in the range 80 to BF, the last of which is not followed by a byte in the range 80 to BF</dt> <!-- too short -->

<dt>One byte in the range F0 to FD, followed by two bytes in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
<dt>One byte in the range FC to FD, followed by four bytes in the range 80 to BF, the last of which is not followed by a byte in the range 80 to BF</dt> <!-- too short -->

<dt>One byte in the range F8 to FD, followed by three bytes in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>

<dt>One byte in the range FC to FD, followed by four bytes in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
<dt>Any byte sequence that represents a code point in the range U+D800 to U+DFFF</dt> <!-- surrogate halves -->


<dd>The whole sequence must be replaced by a single U+FFFD
<dd>The whole matched sequence must be replaced by a single U+FFFD
REPLACEMENT CHARACTER.</dd>


<dt>One byte in the range 80 to BF not preceded by a byte in the range 80 to FD</dt>

<dt>A sequence of bytes in the range 80 to BF that does not follow a byte in the range C0 to FD</dt>
<dt>One byte in the range 80 to BF preceded by a byte that is part of a complete UTF-8 sequence that does not include this byte</dt>

<dt>One byte in the range C0 to FD not followed by a byte in the range 80 to BF</dt>
<dt>One byte in the range 80 to BF preceded by a byte that is part of a sequence that has been replaced by a U+FFFD REPLACEMENT CHARACTER, either alone or as part of a sequence</dt>

<dd>Each such byte must be replaced with a U+FFFD REPLACEMENT CHARACTER.</dd>

<dd>Each byte must be replaced with a U+FFFD REPLACEMENT CHARACTER.</dd>

</dl>
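
For readers following the rules in this hunk, here is a minimal Python sketch of the checks behind the "whole matched sequence" cases (overlong forms, surrogate halves, and code points above U+10FFFF). It is not taken from the spec or from any implementation: the names `expected_length`, `MIN_CODE_POINT`, and `classify` are hypothetical, and it assumes the input is exactly one lead byte followed by the right number of bytes in the range 80 to BF. The too-short and stray-continuation rules are not covered.

```python
# Hypothetical helpers, for illustration only; not part of the spec text.

def expected_length(lead: int) -> int:
    """Sequence length implied by a lead byte, or 0 if it is not a lead byte."""
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0  # 00-7F (ASCII), 80-BF (continuation), FE-FF (never valid)


# Smallest code point that genuinely needs a sequence of each length;
# anything smaller encoded at that length is an overlong form.
MIN_CODE_POINT = {2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def classify(seq: bytes) -> str:
    """Classify one structurally complete multi-byte sequence."""
    n = expected_length(seq[0]) if seq else 0
    if n == 0 or len(seq) != n or any(not 0x80 <= b <= 0xBF for b in seq[1:]):
        return "not a complete sequence"

    # The lead byte carries (7 - n) payload bits; each trailing byte carries 6.
    cp = seq[0] & (0x7F >> n)
    for b in seq[1:]:
        cp = (cp << 6) | (b & 0x3F)

    if cp < MIN_CODE_POINT[n]:
        return "overlong"
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    if cp > 0x10FFFF:
        return "above U+10FFFF"
    return "ok"


print(classify(b"\xF0\x80\x80\xA0"))  # the overlong example from the diff
print(classify(b"\xED\xA0\x80"))      # encodes U+D800, a surrogate half
```

Under the rules in the diff, every result other than "ok" for a complete sequence corresponds to replacing the whole matched sequence with a single U+FFFD REPLACEMENT CHARACTER.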
