[giow] (2) Strip a leading BOM from scripts in workers, if any. Also, use more of the encoding spec.

Fixing https://www.w3.org/Bugs/Public/show_bug.cgi?id=17839
Affected topics: DOM APIs, HTML, HTML Syntax and Parsing, Offline Web Applications, Workers
git-svn-id: http://svn.whatwg.org/webapps@7782 340c8d12-0b0e-0410-8428-c7bf67bfef74
whatwg · Mar 29, 2013 · 5b130bc
1 parent 0fbdead
Showing 3 changed files with 176 additions and 252 deletions.
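
As a rough illustration of the distinction this commit introduces, the Python sketch below contrasts the UTF-8 decoder (no BOM handling) with the UTF-8 decode algorithm (strips one leading BOM, then runs the decoder), and adds a generic decode in which a BOM overrides the fallback encoding. The function names are hypothetical, Python's `errors="replace"` is assumed to approximate the encoding spec's U+FFFD error handling, and the BOM list mirrors the three active rows of the table removed from the script element section. It is a sketch under those assumptions, not the spec's normative algorithm.

```python
import codecs


def utf8_decoder(data: bytes) -> str:
    """Sketch of the UTF-8 decoder: invalid sequences become U+FFFD.

    A leading BOM is NOT stripped here; it decodes to U+FEFF.
    """
    return data.decode("utf-8", errors="replace")


def utf8_decode(data: bytes) -> str:
    """Sketch of UTF-8 decode: strip one leading BOM, then run the decoder."""
    if data.startswith(codecs.BOM_UTF8):  # bytes EF BB BF
        data = data[len(codecs.BOM_UTF8):]
    return utf8_decoder(data)


def decode(data: bytes, fallback: str) -> str:
    """Sketch of the generic decode: a BOM, if present, overrides `fallback`."""
    boms = (
        (codecs.BOM_UTF16_BE, "utf-16-be"),  # bytes FE FF
        (codecs.BOM_UTF16_LE, "utf-16-le"),  # bytes FF FE
        (codecs.BOM_UTF8, "utf-8"),          # bytes EF BB BF
    )
    for bom, encoding in boms:
        if data.startswith(bom):
            # The BOM selects the encoding and is removed from the output.
            return data[len(bom):].decode(encoding, errors="replace")
    # No BOM found: use the caller-supplied fallback encoding.
    return data.decode(fallback, errors="replace")
```

Under these assumptions, `utf8_decode` corresponds to what worker scripts, cache manifests, and event streams use after this change, while `utf8_decoder` corresponds to the cookie-string and WebSocket close-reason cases, where no BOM is stripped.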
diff --git a/complete.html b/complete.html
@@ -3068,9 +3068,6 @@ <h4 id=encoding-terminology><span class=secno>2.1.6 </span>Character encodings</
   <p class=note>This complexity results from the historical decision to define the DOM API in
   terms of 16 bit (UTF-16) <a href=#code-unit title="code unit">code units</a>, rather than in terms of <a href=#unicode-character title="Unicode character">Unicode characters</a>.</p>
 
-  <p>When a byte stream is to be <dfn id=decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</dfn>, the user agent
-  must return the result of running the <a href=#utf-8-decoder>utf-8 decoder</a> on that byte stream.</p>
-
 
 
 
@@ -3385,10 +3382,17 @@ <h4 id=dependencies><span class=secno>2.2.2 </span>Dependencies</h4>
     <ul class=brief><li><dfn id=getting-an-encoding>Getting an encoding</dfn>
 
      <li>The <dfn id=encoder>encoder</dfn> and <dfn id=decoder>decoder</dfn> algorithms for various encodings, including
-     the <dfn id=utf-8-encoder>utf-8 encoder</dfn> and <dfn id=utf-8-decoder>utf-8 decoder</dfn>
+     the <dfn id=utf-8-encoder>UTF-8 encoder</dfn> and <dfn id=utf-8-decoder>UTF-8 decoder</dfn>
+
+     <li>The generic <dfn id=decode>decode</dfn> algorithm which takes a byte stream and an encoding and
+     returns a character stream
 
-    </ul><p class=note>The <a href=#utf-8-decoder>utf-8 decoder</a> is distinct from the <i>utf-8 decode
-    algorithm</i>. The latter is not used by this specification.</p>
+     <li>The <dfn id=utf-8-decode>UTF-8 decode</dfn> algorithm which takes a byte stream and returns a character
+     stream, additionally stripping one leading UTF-8 Byte Order Mark (BOM), if any
+
+    </ul><p class=note>The <a href=#utf-8-decoder>UTF-8 decoder</a> is distinct from the <i>UTF-8 decode
+    algorithm</i>. The latter first strips a Byte Order Mark (BOM), if any, and then invokes the
+    former.</p>
 
    </dd>
 
@@ -8446,7 +8450,7 @@ <h4 id=resource-metadata-management><span class=secno>3.1.3 </span><dfn>Resource
   <code><a href=#document>Document</a></code>'s <a href=#origin>origin</a> is not a scheme/host/port tuple, the user agent must
   throw a <code><a href=#securityerror>SecurityError</a></code> exception. Otherwise, the user agent must first <a href=#obtain-the-storage-mutex>obtain
   the storage mutex</a> and then return the cookie-string for <a href="#the-document's-address">the document's address</a>
-  for a "non-HTTP" API, <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</a>. <a href=#refsCOOKIES>[COOKIES]</a>
+  for a "non-HTTP" API, decoded using the <a href=#utf-8-decoder>UTF-8 decoder</a>. <a href=#refsCOOKIES>[COOKIES]</a>
   <a class=fingerprint href=#fingerprint><img alt="(This is a fingerprinting vector.)" height=64 src=http://images.whatwg.org/fingerprint.png width=46></a>
   </p>
 
@@ -14643,38 +14647,7 @@ <h4 id=the-script-element><span class=secno>4.3.1 </span>The <dfn id=script><cod
 
           <p>To obtain the Unicode string, the user agent run the following steps:</p>
 
-          <ol><li><p>For each of the rows in the following table, starting with the first one and going
-           down, if the file has as many or more bytes available than the number of bytes in the
-           first column, and the first bytes of the file match the bytes given in the first column,
-           then set <var title="">character encoding</var> to the encoding given in the cell in the
-           second column of that row, and jump to the bottom step in this series of steps:</p>
-
-            <!-- this table is present in several forms in this file; keep them in sync -->
-            <table id=table-script-bom><thead><tr><th>Bytes in Hexadecimal
-               <th>Encoding
-             <tbody><!-- nobody uses this
-              <tr>
-               <td>00 00 FE FF
-               <td>UTF-32BE
-              <tr>
-               <td>FF FE 00 00
-               <td>UTF-32LE
-    --><tr><td>FE FF
-               <td>Big-endian UTF-16
-              <tr><td>FF FE
-               <td>Little-endian UTF-16
-              <tr><td>EF BB BF
-               <td>UTF-8
-    <!-- nobody uses this
-              <tr>
-               <td>DD 73 66 73
-               <td>UTF-EBCDIC
-    -->
-            </table><p class=note>This step looks for Unicode Byte Order Marks (BOMs).</p>
-
-           </li>
-
-           <li><p>If the resource's <a href=#content-type title=Content-Type>Content Type metadata</a>, if any,
+          <ol><li><p>If the resource's <a href=#content-type title=Content-Type>Content Type metadata</a>, if any,
            specifies a character encoding, and the user agent supports that encoding, then let <var title="">character encoding</var> be that encoding, and jump to the bottom step in this
            series of steps.</li>
 
@@ -14685,9 +14658,20 @@ <h4 id=the-script-element><span class=secno>4.3.1 </span>The <dfn id=script><cod
            <li><p>Let <var title="">character encoding</var> be <var><a href="#the-script-block's-fallback-character-encoding">the script block's fallback
            character encoding</a></var>.</li>
 
-           <li><p>Convert the file to Unicode using <var>character encoding</var>, following the
-           rules for doing so given by the specification for <var><a href="#the-script-block's-type">the script block's
-           type</a></var>.</li>
+           <li>
+
+            <p>If the specification for <var><a href="#the-script-block's-type">the script block's type</a></var> gives specific rules for
+            decoding files in that format to Unicode, follow them, using <var>character
+            encoding</var> as the character encoding specified by higher-level protocols, if
+            necessary.</p> <!-- e.g. XML -->
+
+            <p>Otherwise, <a href=#decode>decode</a> the file to Unicode, using <var>character
+            encoding</var> as the fallback encoding.</p>
+
+            <p class=note>The <a href=#decode>decode</a> algorithm overrides <var>character
+            encoding</var> if the file contains a BOM.</p>
+
+           </li>
 
           </ol></dd>
 
@@ -68758,11 +68742,17 @@ <h5 id=parsing-cache-manifests><span class=secno>6.7.3.3 </span>Parsing cache ma
   <p>When a user agent is to <dfn id=parse-a-manifest>parse a manifest</dfn>, it means that the user agent must run the
   following steps:</p>
 
-  <ol><li><p>Decode the byte stream corresponding with the manifest to be parsed <a href=#decoded-as-utf-8,-with-error-handling title="decoded
-   as UTF-8, with error handling">as UTF-8, with error handling</a>. <!--All U+0000 NULL
-   characters must be replaced by U+FFFD REPLACEMENT CHARACTERs. (this isn't black-box testable
-   since neither U+0000 nor U+FFFD are valid anywhere in the syntax and thus both will be treated
-   the same anyway)--></li>
+  <ol><li>
+
+    <p><a href=#utf-8-decode>UTF-8 decode</a> the byte stream corresponding with the manifest to be parsed.</p>
+
+    <p class=note>The <a href=#utf-8-decode>UTF-8 decode</a> algorithm strips a leading BOM, if any.</p>
+
+    <!--All U+0000 NULL characters must be replaced by U+FFFD REPLACEMENT CHARACTERs. (this isn't
+    black-box testable since neither U+0000 nor U+FFFD are valid anywhere in the syntax and thus
+    both will be treated the same anyway)-->
+
+   </li>
 
    <li><p>Let <var title="">base URL</var> be the <a href=#absolute-url>absolute URL</a> representing the
    manifest.</li>
@@ -68792,9 +68782,6 @@ <h5 id=parsing-cache-manifests><span class=secno>6.7.3.3 </span>Parsing cache ma
    <li><p>Let <var title="">position</var> be a pointer into <var title="">input</var>, initially
    pointing at the first character.</li>
 
-   <li><p>If <var title="">position</var> is pointing at a U+FEFF BYTE ORDER MARK (BOM) character,
-   then advance <var title="">position</var> to the next character.</li>
-
    <li><p>If the characters starting from <var title="">position</var> are "CACHE", followed by a
    U+0020 SPACE character, followed by "MANIFEST", then advance <var title="">position</var> to the
    next character after those. Otherwise, this isn't a cache manifest; abort this algorithm with a
@@ -78794,9 +78781,8 @@ <h4 id=processing-model-6><span class=secno>9.2.4 </span>Processing model</h4>
     a simple event</a> named <code title=event-error>error</code> at that object. Abort these
     steps.</p>
 
-    <p>If the attempt succeeds, then let <var title="">source</var> be the script resource
-    <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error handling</a>.
-    </p>
+    <p>If the attempt succeeds, then let <var title="">source</var> be the result of running the
+    <a href=#utf-8-decode>UTF-8 decode</a> algorithm on the script resource.</p>
 
     <p>Let <var title="">language</var> be JavaScript.</p>
 
@@ -79479,10 +79465,8 @@ <h4 id=importing-scripts-and-libraries><span class=secno>9.3.1 </span>Importing
       <code><a href=#networkerror>NetworkError</a></code> exception and abort all these
       steps.</p>
 
-      <p>If the attempt succeeds, then let <var title="">source</var> be
-      the script resource <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
-      handling</a>.
-      </p>
+      <p>If the attempt succeeds, then let <var title="">source</var> be the result of running the
+      <a href=#utf-8-decode>UTF-8 decode</a> algorithm on the script resource.</p>
 
       <p>Let <var title="">language</var> be JavaScript.</p>
 
@@ -80101,11 +80085,10 @@ <h4 id=parsing-an-event-stream><span class=secno>10.2.4 </span>Parsing an event
 
   <h4 id=event-stream-interpretation><span class=secno>10.2.5 </span>Interpreting an event stream</h4>
 
-  <p>Streams must be <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
-  handling</a>.
-  </p>
+  <p>Streams must be decoded using the <a href=#utf-8-decode>UTF-8 decode</a> algorithm.</p>
 
-  <p>One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present.</p>
+  <p class=note>The <a href=#utf-8-decode>UTF-8 decode</a> algorithm strips one leading UTF-8 Byte Order Mark
+  (BOM), if any.</p>
 
   <p>The stream must then be parsed by reading everything line by line, with a U+000D CARRIAGE
   RETURN U+000A LINE FEED (CRLF) character pair, a single U+000A LINE FEED (LF) character not
@@ -81115,9 +81098,9 @@ <h4 id=the-websocket-interface><span class=secno>10.3.2 </span>The <code><a href
    action, whose <code title=dom-CloseEvent-wasClean><a href=#dom-closeevent-wasclean>wasClean</a></code> attribute is initialized to
    true if the connection closed <i title="">cleanly</i> and false otherwise, whose <code title=dom-CloseEvent-code><a href=#dom-closeevent-code>code</a></code> attribute is initialized to <i><a href=#the-websocket-connection-close-code>the WebSocket connection
    close code</a></i>, and whose <code title=dom-CloseEvent-reason><a href=#dom-closeevent-reason>reason</a></code> attribute is
-   initialized to <i><a href=#the-websocket-connection-close-reason>the WebSocket connection close reason</a></i> <a href=#decoded-as-utf-8,-with-error-handling>decoded as UTF-8, with error
-   handling</a>, and <a href=#concept-event-dispatch title=concept-event-dispatch>dispatch</a> the event at the
-   <code><a href=#websocket>WebSocket</a></code> object. <a href=#refsWSP>[WSP]</a></li>
+   initialized to the result of applying the <a href=#utf-8-decoder>UTF-8 decoder</a> to <i><a href=#the-websocket-connection-close-reason>the WebSocket
+   connection close reason</a></i>, and <a href=#concept-event-dispatch title=concept-event-dispatch>dispatch</a> the event
+   at the <code><a href=#websocket>WebSocket</a></code> object. <a href=#refsWSP>[WSP]</a></li>
 
   </ol><div class=warning>
 
@@ -84062,6 +84045,7 @@ <h4 id=overview-of-the-parsing-model><span class=secno>12.2.1 </span>Overview of
 
   <h4 id=the-input-byte-stream><span class=secno>12.2.2 </span>The <dfn>input byte stream</dfn></h4>
 
+<!--CLEANUP-->
   <p>The stream of Unicode code points that comprises the input to the
   tokenization stage will be initially seen by the user agent as a
   stream of bytes (typically coming over the network or from the local
@@ -84079,24 +84063,21 @@ <h4 id=the-input-byte-stream><span class=secno>12.2.2 </span>The <dfn>input byte
   <p>Given a character encoding, the bytes in the <a href=#the-input-byte-stream>input byte
   stream</a> must be converted to Unicode code points for the
   tokenizer's <a href=#input-stream>input stream</a>, as described by the rules for
-  that encoding, except that the leading U+FEFF BYTE ORDER MARK
-  character, if any, must not be stripped by the encoding layer (it is
-  stripped by the rule below).</p> <!-- this is to prevent two leading
-  BOMs from being both stripped, once by the decoder, and once by the
-  parser -->
-
-  <p>Bytes or sequences of bytes in the original byte stream that
-  could not be converted to Unicode code points must be converted to
-  U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
-  UTF-8, the bytes must be <a href=#decoded-as-utf-8,-with-error-handling title="decoded as UTF-8, with error
-  handling">decoded with the error handling</a> defined in this
-  specification.</p>
+  that encoding's <a href=#decoder>decoder</a>.</p>
 
   <p class=note>Bytes or sequences of bytes in the original byte
   stream that did not conform to the encoding specification (e.g.
   invalid UTF-8 byte sequences in a UTF-8 input byte stream) are
   errors that conformance checkers are expected to report.</p>
 
+  <p class=note>Leading Byte Order Marks (BOMs) are not stripped by the decoder algorithms, they
+  are stripped by the algorithm below.</p>
+
+  <p class=warning>The decoder algorithms describe how to handle invalid input; for security
+  reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte
+  sequences are handled can result in, amongst other problems, script injection vulnerabilities
+  ("XSS").</p>
+
 
   <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Determining the character encoding</h5>
 
@@ -84688,8 +84669,8 @@ <h5 id=character-encodings><span class=secno>12.2.2.2 </span>Character encodings
   UTF-32 in its algorithms; support and use of these encodings can thus lead to unexpected behavior
   in implementations of this specification.</p>
 
-  <p>When a user agent is to use the self-describing UTF-16 encoding but no BOM has been found, user
-  agents must default to little-endian UTF-16.</p>
+  <p>When a user agent is to use the self-describing UTF-16 encoding but no Byte Order Mark (BOM)
+  has been found, user agents must default to little-endian UTF-16.</p>
 
   <p class=note>The requirement to default UTF-16 to little-endian rather than big-endian is a
   <a href=#willful-violation>willful violation</a> of RFC 2781, motivated by a desire for compatibility with legacy