HTML Standard Tracker

SVN: 3871
Bug: (none listed)
Comment: [Conformance Checkers] [Tools] Make surrogates in UTF-8 and character references turn into U+FFFD to prevent UTF-16 environments having hard-to-handle bugs.
Time (UTC): 2009-09-16 09:22
@@ -76730,37 +76730,39 @@ interface <dfn>MessagePort</dfn> {
 
   <p>One leading U+FEFF BYTE ORDER MARK character must be ignored if
   any are present.</p>
 
   <p class="note">The requirement to strip a U+FEFF BYTE ORDER MARK
   character regardless of whether that character was used to determine
   the byte order is a <span>willful violation</span> of Unicode,
   motivated by a desire to increase the resilience of user agents in
   the face of na&iuml;ve transcoders.</p>
 
-  <p>All U+0000 NULL characters in the input must be replaced by
-  U+FFFD REPLACEMENT CHARACTERs. Any occurrences of such characters is
-  a <span>parse error</span>.</p>
+  <p>All U+0000 NULL characters and characters in the range U+D800 to
+  U+DFFF<!-- surrogates not allowed e.g. in UTF-8, and we don't want
+  them to suddenly turn into codepoints when they go through a UTF-16
+  pipe --> in the input must be replaced by U+FFFD REPLACEMENT
+  CHARACTERs. Any occurrences of such characters is a <span>parse
+  error</span>.</p>
 
   <p>Any occurrences of any characters in the ranges U+0001 to U+0008,
   <!-- HT, LF allowed --> <!-- U+000B is in the next list --> <!-- FF,
   CR allowed --> U+000E to U+001F, <!-- ASCII allowed --> U+007F
-  <!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+D800
-  to U+DFFF<!-- surrogates not allowed -->, U+FDD0 to U+FDEF, and
-  characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE,
-  U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF,
-  U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE,
-  U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF,
-  U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and
-  U+10FFFF are <span title="parse error">parse errors</span>. (These
-  are all control characters or permanently undefined Unicode
-  characters.)</p>
+  <!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+FDD0
+  to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
+  U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE,
+  U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF,
+  U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE,
+  U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF,
+  U+10FFFE, and U+10FFFF are <span title="parse error">parse
+  errors</span>. (These are all control characters or permanently
+  undefined Unicode characters.)</p>
 
   <p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
   characters are treated specially. Any CR characters that are
   followed by LF characters must be removed, and any CR characters not
   followed by LF characters must be converted to LF characters. Thus,
   newlines in HTML DOMs are represented by LF characters, and there
   are never any CR characters in the input to the
   <span>tokenization</span> stage.</p>
 
   <p>The <dfn>next input character</dfn> is the first character in the
@@ -78850,40 +78852,42 @@ interface <dfn>MessagePort</dfn> {
       <tr><td>0x98 <td>U+02DC <td>SMALL TILDE ('&#x02DC;')
       <tr><td>0x99 <td>U+2122 <td>TRADE MARK SIGN ('&#x2122;')
       <tr><td>0x9A <td>U+0161 <td>LATIN SMALL LETTER S WITH CARON ('&#x0161;')
       <tr><td>0x9B <td>U+203A <td>SINGLE RIGHT-POINTING ANGLE QUOTATION MARK ('&#x203A;')
       <tr><td>0x9C <td>U+0153 <td>LATIN SMALL LIGATURE OE ('&#x0153;')
       <tr><td>0x9D <td>U+009D <td>&lt;control>
       <tr><td>0x9E <td>U+017E <td>LATIN SMALL LETTER Z WITH CARON ('&#x017E;')
       <tr><td>0x9F <td>U+0178 <td>LATIN CAPITAL LETTER Y WITH DIAERESIS ('&#x0178;')
     </table>
 
-    <p>Otherwise, if the number is greater than 0x10FFFF, then this is
-    a <span>parse error</span>. Return a U+FFFD REPLACEMENT
-    CHARACTER.</p>
+    <p>Otherwise, if the number is in the range 0xD800 to 0xDFFF<!--
+    surrogates not allowed; see the comment in the "preprocessing the
+    input stream" section for details --> or is greater than 0x10FFFF,
+    then this is a <span>parse error</span>. Return a U+FFFD
+    REPLACEMENT CHARACTER.</p>
 
     <p>Otherwise, return a character token for the Unicode character
     whose code point is that number.
 
     <!-- this is the same as the equivalent list in the input stream
     section -->
     If the number is in the range 0x0001 to 0x0008, <!-- HT, LF
     allowed --> <!-- U+000B is in the next list --> <!-- FF, CR
     allowed --> 0x000E to 0x001F, <!-- ASCII allowed --> 0x007F <!--to
-    0x0084, (0x0085 NEL not allowed), 0x0086--> to 0x009F, 0xD800 to
-    0xDFFF<!-- surrogates not allowed -->, 0xFDD0 to 0xFDEF, or is one
-    of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF,
-    0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE,
-    0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF,
-    0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
-    0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or
-    0x10FFFF, then this is a <span>parse error</span>.</p>
+    0x0084, (0x0085 NEL not allowed), 0x0086--> to 0x009F, 0xFDD0 to
+    0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF,
+    0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
+    0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF,
+    0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE,
+    0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF,
+    0x10FFFE, or 0x10FFFF, then this is a <span>parse
+    error</span>.</p>
 
    </dd>
 
 
    <dt>Anything else</dt>
 
    <dd>
 
     <p>Consume the maximum number of characters possible, with the
     consumed characters matching one of the identifiers in the first
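
The two changed passages above can be read as a small algorithm, so a rough sketch may help reviewers check the new behaviour. This is not the specification's or any parser's actual implementation; the names `preprocess`, `resolve_numeric_reference`, `is_surrogate`, and `is_flagged` are hypothetical, and a "parse error" is modelled simply as a counter or a boolean. The numeric-reference helper covers only the two "Otherwise" branches visible in the second hunk and assumes a non-negative number; the table-driven cases that precede them (for example the 0x80 to 0x9F remapping) are deliberately left out.

```python
from typing import Tuple

REPLACEMENT = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER


def is_surrogate(cp: int) -> bool:
    """True for code points in the range U+D800 to U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_flagged(cp: int) -> bool:
    """Code points the diff lists as parse errors but does not replace."""
    return (
        0x0001 <= cp <= 0x0008
        or cp == 0x000B
        or 0x000E <= cp <= 0x001F
        or 0x007F <= cp <= 0x009F
        or 0xFDD0 <= cp <= 0xFDEF
        # Low 16 bits are FFFE or FFFF: U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF.
        or (cp & 0xFFFE) == 0xFFFE
    )


def preprocess(text: str) -> Tuple[str, int]:
    """Return (cleaned text, parse error count) for a decoded input stream."""
    errors = 0
    if text.startswith("\uFEFF"):
        text = text[1:]  # only one leading BOM is ignored
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        cp = ord(ch)
        if cp == 0x0000 or is_surrogate(cp):
            # NULL and surrogates become U+FFFD; each one is a parse error.
            out.append(REPLACEMENT)
            errors += 1
        elif ch == "\r":
            # CR LF and lone CR both end up as a single LF.
            out.append("\n")
            if i + 1 < len(text) and text[i + 1] == "\n":
                i += 1
        else:
            if is_flagged(cp):
                errors += 1  # flagged, but the character is kept
            out.append(ch)
        i += 1
    return "".join(out), errors


def resolve_numeric_reference(number: int) -> Tuple[str, bool]:
    """Return (character, is_parse_error) for a numeric character reference."""
    if is_surrogate(number) or number > 0x10FFFF:
        return REPLACEMENT, True
    return chr(number), is_flagged(number)
```

With this sketch, `preprocess("x\x00y\ud800z")` yields the text with both the NULL and the lone surrogate replaced and an error count of 2, while `resolve_numeric_reference(0xD800)` returns U+FFFD instead of a surrogate, which is the behaviour change this revision introduces.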
