[e] (0) Clean up how we refer to UTF-16.
Fixing http://www.w3.org/Bugs/Public/show_bug.cgi?id=13396

git-svn-id: http://svn.whatwg.org/webapps@6498 340c8d12-0b0e-0410-8428-c7bf67bfef74
Hixie committed Aug 17, 2011
1 parent 61872ab commit c84194e
Showing 3 changed files with 72 additions and 53 deletions.
40 changes: 23 additions & 17 deletions complete.html
@@ -3343,6 +3343,10 @@ <h4 id=character-encodings><span class=secno>2.1.6 </span>Character encodings</h
different <meta charset> elements applying in each case.
-->

<p>The term <dfn id=a-utf-16-encoding>a UTF-16 encoding</dfn> refers to any variant of
UTF-16: self-describing UTF-16 with a BOM, ambiguous UTF-16 without
a BOM, raw UTF-16LE, and raw UTF-16BE. <a href=#refsRFC2781>[RFC2781]</a></p>

<p>The term <dfn id=unicode-character>Unicode character</dfn> is used to mean a <i title="">Unicode scalar value</i> (i.e. any Unicode code point that
is not a surrogate code point). <a href=#refsUNICODE>[UNICODE]</a></p>
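The new term groups every UTF-16 variant under one name, and several steps touched by this commit replace such an encoding with UTF-8. A minimal Python sketch of that grouping follows. The label strings and both function names are hypothetical and chosen for illustration only; the spec text names the variants but does not define these identifiers.

```python
# Hypothetical label set: the spec names the variants (UTF-16 with or
# without a BOM, raw UTF-16LE, raw UTF-16BE) but not these exact strings.
UTF16_LABELS = {"utf-16", "utf-16le", "utf-16be"}


def is_utf16_encoding(label: str) -> bool:
    """Return True if the label names any UTF-16 variant."""
    return label.strip().lower() in UTF16_LABELS


def effective_encoding(label: str) -> str:
    """Mirror the steps that change a UTF-16 encoding to UTF-8.

    Any other label is returned normalized but otherwise unchanged.
    """
    if is_utf16_encoding(label):
        return "utf-8"
    return label.strip().lower()
```

With a single predicate, each algorithm step can test for "a UTF-16 encoding" without listing the variants again, which is the same effect the cross-reference has in the spec text.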

@@ -6627,7 +6631,8 @@ <h4 id=terminology-0><span class=secno>2.6.1 </span>Terminology</h4>
component contains no unescaped non-ASCII characters. <a href=#refsRFC3987>[RFC3987]</a></li>

<li><p>The <a href=#url>URL</a> is a valid IRI reference and the <a href="#document's-character-encoding" title="document's character encoding">character encoding</a> of
the URL's <code><a href=#document>Document</a></code> is UTF-8 or UTF-16. <a href=#refsRFC3987>[RFC3987]</a></li>
the URL's <code><a href=#document>Document</a></code> is UTF-8 or <a href=#a-utf-16-encoding>a UTF-16
encoding</a>. <a href=#refsRFC3987>[RFC3987]</a></li>

</ul><p>A string is a <dfn id=valid-non-empty-url>valid non-empty URL</dfn> if it is a
<a href=#valid-url>valid URL</a> but it is not the empty string.</p>
@@ -6819,8 +6824,8 @@ <h4 id=resolving-urls><span class=secno>2.6.3 </span>Resolving URLs</h4>

</dl></li>

<li><p>If <var title="">encoding</var> is a UTF-16 encoding, then
change the value of <var title="">encoding</var> to UTF-8.</li>
<li><p>If <var title="">encoding</var> is <a href=#a-utf-16-encoding>a UTF-16
encoding</a>, then change the value of <var title="">encoding</var> to UTF-8.</li>

<li>

@@ -84216,9 +84221,8 @@ <h5 id=determining-the-character-encoding><span class=secno>13.2.2.1 </span>Dete
<li><p>If <var title="">need pragma</var> is true but <var title="">got pragma</var> is false, then jump to the second
step of the overall "two step" algorithm.</li>

<li><p>If <var title="">charset</var> is a UTF-16 encoding,
change the value of <var title="">charset</var> to
UTF-8.</li>
<li><p>If <var title="">charset</var> is <a href=#a-utf-16-encoding>a UTF-16
encoding</a>, change the value of <var title="">charset</var> to UTF-8.</li>

<li><p>If <var title="">charset</var> is not a supported
character encoding, then jump to the second step of the
@@ -84650,12 +84654,14 @@ <h5 id=character-encodings-0><span class=secno>13.2.2.2 </span>Character encodin
violation</a> of the W3C Character Model specification, motivated
by a desire for compatibility with legacy content. <a href=#refsCHARMOD>[CHARMOD]</a></p>

<p>When a user agent is to use the UTF-16 encoding but no BOM has
been found, user agents must default to UTF-16LE.</p>
<p>When a user agent is to use the self-describing UTF-16 encoding
but no BOM has been found, user agents must default to little-endian
UTF-16.</p>

<p class=note>The requirement to default UTF-16 to LE rather than
BE is a <a href=#willful-violation>willful violation</a> of RFC 2781, motivated by a
desire for compatibility with legacy content. <a href=#refsRFC2781>[RFC2781]</a></p>
<p class=note>The requirement to default UTF-16 to little-endian
rather than big-endian is a <a href=#willful-violation>willful violation</a> of RFC
2781, motivated by a desire for compatibility with legacy content.
<a href=#refsRFC2781>[RFC2781]</a></p>

<hr><p>User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
encodings. <a href=#refsCESU8>[CESU8]</a> <a href=#refsUTF7>[UTF7]</a> <a href=#refsBOCU1>[BOCU1]</a> <a href=#refsSCSU>[SCSU]</a></p>
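The little-endian default stated in this hunk can be sketched in Python as follows. The function name `decode_utf16` is hypothetical, and the sketch assumes the caller has already decided that the input is self-describing UTF-16; it is an illustration of the rule, not the spec's decoding algorithm.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, honoring a BOM and defaulting to little-endian."""
    if data.startswith(codecs.BOM_UTF16_LE):
        # BOM FF FE: little-endian, skip the two BOM bytes.
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        # BOM FE FF: big-endian, skip the two BOM bytes.
        return data[2:].decode("utf-16-be")
    # No BOM found: default to little-endian, as the requirement above states.
    return data.decode("utf-16-le")
```

Choosing the explicit `utf-16-le` and `utf-16-be` codecs keeps the byte order decision in the sketch itself, so the no-BOM case is handled by the last line rather than by the codec.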
@@ -84771,13 +84777,13 @@ <h5 id=changing-the-encoding-while-parsing><span class=secno>13.2.2.4 </span>Cha
earlier section failed to find the right encoding.</li>

<li>If the encoding that is already being used to interpret the
input stream is a UTF-16 encoding, then set the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> to
input stream is <a href=#a-utf-16-encoding>a UTF-16 encoding</a>, then set the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> to
<i>certain</i> and abort these steps. The new encoding is ignored;
if it was anything but the same encoding, then it would be clearly
incorrect.</li>

<li>If the new encoding is a UTF-16 encoding, change it to
UTF-8.</li>
<li>If the new encoding is <a href=#a-utf-16-encoding>a UTF-16 encoding</a>, change
it to UTF-8.</li>

<li>If all the bytes up to the last byte converted by the current
decoder have the same Unicode interpretations in both the current
@@ -88176,7 +88182,7 @@ <h6 id=the-before-head-insertion-mode><span class=secno>13.2.5.4.3 </span>The "<

<p id=meta-charset-during-parse>If the element has a <code title=attr-meta-charset><a href=#attr-meta-charset>charset</a></code> attribute, and its value
is either a supported <a href=#ascii-compatible-character-encoding>ASCII-compatible character
encoding</a> or a UTF-16 encoding, and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is currently
encoding</a> or <a href=#a-utf-16-encoding>a UTF-16 encoding</a>, and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is currently
<i>tentative</i>, then <a href=#change-the-encoding>change the encoding</a> to the
encoding given by the value of the <code title=attr-meta-charset><a href=#attr-meta-charset>charset</a></code> attribute.</p>

@@ -88186,8 +88192,8 @@ <h6 id=the-before-head-insertion-mode><span class=secno>13.2.5.4.3 </span>The "<
<code title=attr-meta-content><a href=#attr-meta-content>content</a></code> attribute, and
applying the <a href=#algorithm-for-extracting-an-encoding-from-a-meta-element>algorithm for extracting an encoding from a
<code>meta</code> element</a> to that attribute's value returns
a supported <a href=#ascii-compatible-character-encoding>ASCII-compatible character encoding</a> or a
UTF-16 encoding, and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is currently
a supported <a href=#ascii-compatible-character-encoding>ASCII-compatible character encoding</a> or
<a href=#a-utf-16-encoding>a UTF-16 encoding</a>, and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is currently
<i>tentative</i>, then <a href=#change-the-encoding>change the encoding</a> to the
extracted encoding.</p>

40 changes: 23 additions & 17 deletions index
@@ -3240,6 +3240,10 @@ a.setAttribute('href', 'http://example.com/'); // change the content attribute d
different <meta charset> elements applying in each case.
-->

<p>The term <dfn id=a-utf-16-encoding>a UTF-16 encoding</dfn> refers to any variant of
UTF-16: self-describing UTF-16 with a BOM, ambiguous UTF-16 without
a BOM, raw UTF-16LE, and raw UTF-16BE. <a href=#refsRFC2781>[RFC2781]</a></p>

<p>The term <dfn id=unicode-character>Unicode character</dfn> is used to mean a <i title="">Unicode scalar value</i> (i.e. any Unicode code point that
is not a surrogate code point). <a href=#refsUNICODE>[UNICODE]</a></p>

@@ -6491,7 +6495,8 @@ a.setAttribute('href', 'http://example.com/'); // change the content attribute d
component contains no unescaped non-ASCII characters. <a href=#refsRFC3987>[RFC3987]</a></li>

<li><p>The <a href=#url>URL</a> is a valid IRI reference and the <a href="#document's-character-encoding" title="document's character encoding">character encoding</a> of
the URL's <code><a href=#document>Document</a></code> is UTF-8 or UTF-16. <a href=#refsRFC3987>[RFC3987]</a></li>
the URL's <code><a href=#document>Document</a></code> is UTF-8 or <a href=#a-utf-16-encoding>a UTF-16
encoding</a>. <a href=#refsRFC3987>[RFC3987]</a></li>

</ul><p>A string is a <dfn id=valid-non-empty-url>valid non-empty URL</dfn> if it is a
<a href=#valid-url>valid URL</a> but it is not the empty string.</p>
@@ -6683,8 +6688,8 @@ a.setAttribute('href', 'http://example.com/'); // change the content attribute d

</dl></li>

<li><p>If <var title="">encoding</var> is a UTF-16 encoding, then
change the value of <var title="">encoding</var> to UTF-8.</li>
<li><p>If <var title="">encoding</var> is <a href=#a-utf-16-encoding>a UTF-16
encoding</a>, then change the value of <var title="">encoding</var> to UTF-8.</li>

<li>

@@ -79663,9 +79668,8 @@ interface <dfn id=messagechannel>MessageChannel</dfn> {
<li><p>If <var title="">need pragma</var> is true but <var title="">got pragma</var> is false, then jump to the second
step of the overall "two step" algorithm.</li>

<li><p>If <var title="">charset</var> is a UTF-16 encoding,
change the value of <var title="">charset</var> to
UTF-8.</li>
<li><p>If <var title="">charset</var> is <a href=#a-utf-16-encoding>a UTF-16
encoding</a>, change the value of <var title="">charset</var> to UTF-8.</li>

<li><p>If <var title="">charset</var> is not a supported
character encoding, then jump to the second step of the
@@ -80097,12 +80101,14 @@ interface <dfn id=messagechannel>MessageChannel</dfn> {
violation</a> of the W3C Character Model specification, motivated
by a desire for compatibility with legacy content. <a href=#refsCHARMOD>[CHARMOD]</a></p>

<p>When a user agent is to use the UTF-16 encoding but no BOM has
been found, user agents must default to UTF-16LE.</p>
<p>When a user agent is to use the self-describing UTF-16 encoding
but no BOM has been found, user agents must default to little-endian
UTF-16.</p>

<p class=note>The requirement to default UTF-16 to LE rather than
BE is a <a href=#willful-violation>willful violation</a> of RFC 2781, motivated by a
desire for compatibility with legacy content. <a href=#refsRFC2781>[RFC2781]</a></p>
<p class=note>The requirement to default UTF-16 to little-endian
rather than big-endian is a <a href=#willful-violation>willful violation</a> of RFC
2781, motivated by a desire for compatibility with legacy content.
<a href=#refsRFC2781>[RFC2781]</a></p>

<hr><p>User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
encodings. <a href=#refsCESU8>[CESU8]</a> <a href=#refsUTF7>[UTF7]</a> <a href=#refsBOCU1>[BOCU1]</a> <a href=#refsSCSU>[SCSU]</a></p>
@@ -80218,13 +80224,13 @@ interface <dfn id=messagechannel>MessageChannel</dfn> {
earlier section failed to find the right encoding.</li>

<li>If the encoding that is already being used to interpret the
input stream is a UTF-16 encoding, then set the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> to
input stream is <a href=#a-utf-16-encoding>a UTF-16 encoding</a>, then set the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> to
<i>certain</i> and abort these steps. The new encoding is ignored;
if it was anything but the same encoding, then it would be clearly
incorrect.</li>

<li>If the new encoding is a UTF-16 encoding, change it to
UTF-8.</li>
<li>If the new encoding is <a href=#a-utf-16-encoding>a UTF-16 encoding</a>, change
it to UTF-8.</li>

<li>If all the bytes up to the last byte converted by the current
decoder have the same Unicode interpretations in both the current
@@ -83623,7 +83629,7 @@ document.body.appendChild(text);

<p id=meta-charset-during-parse>If the element has a <code title=attr-meta-charset><a href=#attr-meta-charset>charset</a></code> attribute, and its value
is either a supported <a href=#ascii-compatible-character-encoding>ASCII-compatible character
encoding</a> or a UTF-16 encoding, and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is currently
encoding</a> or <a href=#a-utf-16-encoding>a UTF-16 encoding</a>, and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is currently
<i>tentative</i>, then <a href=#change-the-encoding>change the encoding</a> to the
encoding given by the value of the <code title=attr-meta-charset><a href=#attr-meta-charset>charset</a></code> attribute.</p>

@@ -83633,8 +83639,8 @@ document.body.appendChild(text);
<code title=attr-meta-content><a href=#attr-meta-content>content</a></code> attribute, and
applying the <a href=#algorithm-for-extracting-an-encoding-from-a-meta-element>algorithm for extracting an encoding from a
<code>meta</code> element</a> to that attribute's value returns
a supported <a href=#ascii-compatible-character-encoding>ASCII-compatible character encoding</a> or a
UTF-16 encoding, and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is currently
a supported <a href=#ascii-compatible-character-encoding>ASCII-compatible character encoding</a> or
<a href=#a-utf-16-encoding>a UTF-16 encoding</a>, and the <a href=#concept-encoding-confidence title=concept-encoding-confidence>confidence</a> is currently
<i>tentative</i>, then <a href=#change-the-encoding>change the encoding</a> to the
extracted encoding.</p>

45 changes: 26 additions & 19 deletions source
@@ -2202,6 +2202,11 @@ a.setAttribute('href', 'http://example.com/'); // change the content attribute d
different <meta charset> elements applying in each case.
-->

<p>The term <dfn>a UTF-16 encoding</dfn> refers to any variant of
UTF-16: self-describing UTF-16 with a BOM, ambiguous UTF-16 without
a BOM, raw UTF-16LE, and raw UTF-16BE. <a
href="#refsRFC2781">[RFC2781]</a></p>

<p>The term <dfn>Unicode character</dfn> is used to mean a <i
title="">Unicode scalar value</i> (i.e. any Unicode code point that
is not a surrogate code point). <a
@@ -6212,8 +6217,8 @@ is conforming depends on which specs apply, and leaves it at that. -->

<li><p>The <span>URL</span> is a valid IRI reference and the <span
title="document's character encoding">character encoding</span> of
the URL's <code>Document</code> is UTF-8 or UTF-16. <a
href="#refsRFC3987">[RFC3987]</a></p></li>
the URL's <code>Document</code> is UTF-8 or <span>a UTF-16
encoding</span>. <a href="#refsRFC3987">[RFC3987]</a></p></li>

</ul>

@@ -6435,8 +6440,9 @@ is conforming depends on which specs apply, and leaves it at that. -->

</li>

<li><p>If <var title="">encoding</var> is a UTF-16 encoding, then
change the value of <var title="">encoding</var> to UTF-8.</p></li>
<li><p>If <var title="">encoding</var> is <span>a UTF-16
encoding</span>, then change the value of <var
title="">encoding</var> to UTF-8.</p></li>

<li>

@@ -95332,9 +95338,9 @@ interface <dfn>WindowLocalStorage</dfn> {
title="">got pragma</var> is false, then jump to the second
step of the overall "two step" algorithm.</p></li>

<li><p>If <var title="">charset</var> is a UTF-16 encoding,
change the value of <var title="">charset</var> to
UTF-8.</p></li>
<li><p>If <var title="">charset</var> is <span>a UTF-16
encoding</span>, change the value of <var
title="">charset</var> to UTF-8.</p></li>

<li><p>If <var title="">charset</var> is not a supported
character encoding, then jump to the second step of the
@@ -95876,13 +95882,14 @@ interface <dfn>WindowLocalStorage</dfn> {
by a desire for compatibility with legacy content. <a
href="#refsCHARMOD">[CHARMOD]</a></p>

<p>When a user agent is to use the UTF-16 encoding but no BOM has
been found, user agents must default to UTF-16LE.</p>
<p>When a user agent is to use the self-describing UTF-16 encoding
but no BOM has been found, user agents must default to little-endian
UTF-16.</p>

<p class="note">The requirement to default UTF-16 to LE rather than
BE is a <span>willful violation</span> of RFC 2781, motivated by a
desire for compatibility with legacy content. <a
href="#refsRFC2781">[RFC2781]</a></p>
<p class="note">The requirement to default UTF-16 to little-endian
rather than big-endian is a <span>willful violation</span> of RFC
2781, motivated by a desire for compatibility with legacy content.
<a href="#refsRFC2781">[RFC2781]</a></p>

<hr>

@@ -96006,14 +96013,14 @@ interface <dfn>WindowLocalStorage</dfn> {
earlier section failed to find the right encoding.</li>

<li>If the encoding that is already being used to interpret the
input stream is a UTF-16 encoding, then set the <span
input stream is <span>a UTF-16 encoding</span>, then set the <span
title="concept-encoding-confidence">confidence</span> to
<i>certain</i> and abort these steps. The new encoding is ignored;
if it was anything but the same encoding, then it would be clearly
incorrect.</li>

<li>If the new encoding is a UTF-16 encoding, change it to
UTF-8.</li>
<li>If the new encoding is <span>a UTF-16 encoding</span>, change
it to UTF-8.</li>

<li>If all the bytes up to the last byte converted by the current
decoder have the same Unicode interpretations in both the current
@@ -99925,7 +99932,7 @@ document.body.appendChild(text);
<p id="meta-charset-during-parse">If the element has a <code
title="attr-meta-charset">charset</code> attribute, and its value
is either a supported <span>ASCII-compatible character
encoding</span> or a UTF-16 encoding, and the <span
encoding</span> or <span>a UTF-16 encoding</span>, and the <span
title="concept-encoding-confidence">confidence</span> is currently
<i>tentative</i>, then <span>change the encoding</span> to the
encoding given by the value of the <code
@@ -99938,8 +99945,8 @@ document.body.appendChild(text);
<code title="attr-meta-content">content</code> attribute, and
applying the <span>algorithm for extracting an encoding from a
<code>meta</code> element</span> to that attribute's value returns
a supported <span>ASCII-compatible character encoding</span> or a
UTF-16 encoding, and the <span
a supported <span>ASCII-compatible character encoding</span> or
<span>a UTF-16 encoding</span>, and the <span
title="concept-encoding-confidence">confidence</span> is currently
<i>tentative</i>, then <span>change the encoding</span> to the
extracted encoding.</p>
