Commit

[e] (0) Reword the stuff about authors not using encodings to make more sense.

git-svn-id: http://svn.whatwg.org/webapps@4307 340c8d12-0b0e-0410-8428-c7bf67bfef74
Hixie committed Oct 23, 2009
1 parent c6657ec commit 1954b35
Showing 3 changed files with 104 additions and 93 deletions.
65 changes: 34 additions & 31 deletions complete.html
@@ -2075,12 +2075,11 @@ <h4 id=character-encodings><span class=secno>2.1.6 </span>Character encodings</h
correspond to single-byte sequences that map to the same Unicode
characters as those bytes in ANSI_X3.4-1968 (US-ASCII). <a href=#refsRFC1345>[RFC1345]</a></p>

<p class=note>This includes such encodings as Shift_JIS and
variants of ISO-2022, even though it is possible in these encodings
for bytes like 0x70 to be part of longer sequences that are
unrelated to their interpretation as ASCII. It excludes such
encodings as UTF-7, UTF-16, HZ-GB-2312, GSM03.38, and EBCDIC
variants.</p>
<p class=note>This includes such encodings as Shift_JIS,
HZ-GB-2312, and variants of ISO-2022, even though it is possible in
these encodings for bytes like 0x70 to be part of longer sequences
that are unrelated to their interpretation as ASCII. It excludes
such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p>

<!--
We'll have to change that if anyone comes up with a way to have a
@@ -11881,47 +11880,51 @@ <h5 id=charset><span class=secno>4.2.5.5 </span>Specifying the document's charac
state</a>, then the character encoding used must be an
<a href=#ascii-compatible-character-encoding>ASCII-compatible character encoding</a>.</p>

<p>Authors should not use JIS_C6226-1983<!-- aka JIS-X-0208,
x-JIS0208 -->, JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!--
has crazy handling of ASCII "~" -->, encodings based on ISO-2022<!--
<p>Authors are encouraged to use UTF-8. Conformance checkers may
advise authors against using legacy encodings.</p>

<div class=impl>

<p>Authoring tools should default to using UTF-8 for newly-created
documents.</p>

</div>

<p>Encodings in which a series of bytes in the range 0x20 to 0x7E
can encode characters other than the corresponding characters in the
range U+0020 to U+007E represent a potential security vulnerability:
a user agent that does not support the encoding (or does not support
the label used to declare the encoding, or does not use the same
mechanism to detect the encoding of unlabelled content as another
user agent) might end up interpreting technically benign plain text
content as HTML tags and JavaScript. In particular, this applies to
encodings in which the bytes corresponding to "<code title="">&lt;script&gt;</code>" in ASCII can encode a different
string. Authors should not use such encodings, which are known to
include JIS_C6226-1983<!-- aka JIS-X-0208, x-JIS0208 -->,
JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!-- has crazy
handling of ASCII "~" -->, encodings based on ISO-2022<!--
http://krijnhoetmer.nl/irc-logs/whatwg/20090628#l-422 and
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-October/023797.html
-->, and encodings based on EBCDIC. Authors should not use UTF-32.
Authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings.
-->, and encodings based on EBCDIC. Furthermore, authors must not use
the CESU-8, UTF-7, BOCU-1 and SCSU encodings, which also fall into
this category, because these encodings were never intended for use
for Web content.
<a href=#refsRFC1345>[RFC1345]</a><!-- for the JIS types -->
<a href=#refsRFC1842>[RFC1842]</a><!-- HZ-GB-2312 -->
<a href=#refsRFC1468>[RFC1468]</a><!-- ISO-2022-JP -->
<a href=#refsRFC2237>[RFC2237]</a><!-- ISO-2022-JP-1 -->
<a href=#refsRFC1554>[RFC1554]</a><!-- ISO-2022-JP-2 -->
<a href=#refsRFC1922>[RFC1922]</a><!-- ISO-2022-CN and ISO-2022-CN-EXT -->
<a href=#refsRFC1557>[RFC1557]</a><!-- ISO-2022-KR -->
<a href=#refsUNICODE>[UNICODE]</a>
<a href=#refsCESU8>[CESU8]</a>
<a href=#refsUTF7>[UTF7]</a>
<a href=#refsBOCU1>[BOCU1]</a>
<a href=#refsSCSU>[SCSU]</a>
<!-- no idea what to reference for EBCDIC, so... -->
</p>

<p class=note>Most of these encodings are discouraged because of
security concerns. If a hostile user can contribute text to a site
using these encodings, bugs in the site's whitelisting filter or in
a user agent can easily lead to the filter interpreting the
contribution as "safe" while the user agent interprets the same
contribution as containing a <code><a href=#script>script</a></code> element. This would
enable cross-site scripting attacks. By avoiding these encodings,
and always providing a <a href=#character-encoding-declaration>character encoding declaration</a>,
an author is less likely to run into this kind of problem.</p>

<p>Authors are encouraged to use UTF-8. Conformance checkers may
advise authors against using legacy encodings.</p>

<div class=impl>

<p>Authoring tools should default to using UTF-8 for newly-created
documents.</p>

</div>
<p>Authors should not use UTF-32, as the HTML5 encoding detection
algorithms intentionally do not distinguish it from UTF-16. <a href=#refsUNICODE>[UNICODE]</a></p>

<p class=note>Using non-UTF-8 encodings can have unexpected
results on form submission and URL encodings, which use the
65 changes: 34 additions & 31 deletions index
@@ -1885,12 +1885,11 @@ a.setAttribute('href', 'http://example.com/'); // change the content attribute d
correspond to single-byte sequences that map to the same Unicode
characters as those bytes in ANSI_X3.4-1968 (US-ASCII). <a href=#refsRFC1345>[RFC1345]</a></p>

<p class=note>This includes such encodings as Shift_JIS and
variants of ISO-2022, even though it is possible in these encodings
for bytes like 0x70 to be part of longer sequences that are
unrelated to their interpretation as ASCII. It excludes such
encodings as UTF-7, UTF-16, HZ-GB-2312, GSM03.38, and EBCDIC
variants.</p>
<p class=note>This includes such encodings as Shift_JIS,
HZ-GB-2312, and variants of ISO-2022, even though it is possible in
these encodings for bytes like 0x70 to be part of longer sequences
that are unrelated to their interpretation as ASCII. It excludes
such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p>

<!--
We'll have to change that if anyone comes up with a way to have a
@@ -11691,47 +11690,51 @@ people expect to have work and what is necessary.
state</a>, then the character encoding used must be an
<a href=#ascii-compatible-character-encoding>ASCII-compatible character encoding</a>.</p>

<p>Authors should not use JIS_C6226-1983<!-- aka JIS-X-0208,
x-JIS0208 -->, JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!--
has crazy handling of ASCII "~" -->, encodings based on ISO-2022<!--
<p>Authors are encouraged to use UTF-8. Conformance checkers may
advise authors against using legacy encodings.</p>

<div class=impl>

<p>Authoring tools should default to using UTF-8 for newly-created
documents.</p>

</div>

<p>Encodings in which a series of bytes in the range 0x20 to 0x7E
can encode characters other than the corresponding characters in the
range U+0020 to U+007E represent a potential security vulnerability:
a user agent that does not support the encoding (or does not support
the label used to declare the encoding, or does not use the same
mechanism to detect the encoding of unlabelled content as another
user agent) might end up interpreting technically benign plain text
content as HTML tags and JavaScript. In particular, this applies to
encodings in which the bytes corresponding to "<code title="">&lt;script&gt;</code>" in ASCII can encode a different
string. Authors should not use such encodings, which are known to
include JIS_C6226-1983<!-- aka JIS-X-0208, x-JIS0208 -->,
JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!-- has crazy
handling of ASCII "~" -->, encodings based on ISO-2022<!--
http://krijnhoetmer.nl/irc-logs/whatwg/20090628#l-422 and
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-October/023797.html
-->, and encodings based on EBCDIC. Authors should not use UTF-32.
Authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings.
-->, and encodings based on EBCDIC. Furthermore, authors must not use
the CESU-8, UTF-7, BOCU-1 and SCSU encodings, which also fall into
this category, because these encodings were never intended for use
for Web content.
<a href=#refsRFC1345>[RFC1345]</a><!-- for the JIS types -->
<a href=#refsRFC1842>[RFC1842]</a><!-- HZ-GB-2312 -->
<a href=#refsRFC1468>[RFC1468]</a><!-- ISO-2022-JP -->
<a href=#refsRFC2237>[RFC2237]</a><!-- ISO-2022-JP-1 -->
<a href=#refsRFC1554>[RFC1554]</a><!-- ISO-2022-JP-2 -->
<a href=#refsRFC1922>[RFC1922]</a><!-- ISO-2022-CN and ISO-2022-CN-EXT -->
<a href=#refsRFC1557>[RFC1557]</a><!-- ISO-2022-KR -->
<a href=#refsUNICODE>[UNICODE]</a>
<a href=#refsCESU8>[CESU8]</a>
<a href=#refsUTF7>[UTF7]</a>
<a href=#refsBOCU1>[BOCU1]</a>
<a href=#refsSCSU>[SCSU]</a>
<!-- no idea what to reference for EBCDIC, so... -->
</p>

<p class=note>Most of these encodings are discouraged because of
security concerns. If a hostile user can contribute text to a site
using these encodings, bugs in the site's whitelisting filter or in
a user agent can easily lead to the filter interpreting the
contribution as "safe" while the user agent interprets the same
contribution as containing a <code><a href=#script>script</a></code> element. This would
enable cross-site scripting attacks. By avoiding these encodings,
and always providing a <a href=#character-encoding-declaration>character encoding declaration</a>,
an author is less likely to run into this kind of problem.</p>

<p>Authors are encouraged to use UTF-8. Conformance checkers may
advise authors against using legacy encodings.</p>

<div class=impl>

<p>Authoring tools should default to using UTF-8 for newly-created
documents.</p>

</div>
<p>Authors should not use UTF-32, as the HTML5 encoding detection
algorithms intentionally do not distinguish it from UTF-16. <a href=#refsUNICODE>[UNICODE]</a></p>

<p class=note>Using non-UTF-8 encodings can have unexpected
results on form submission and URL encodings, which use the
67 changes: 36 additions & 31 deletions source
@@ -901,12 +901,11 @@ a.setAttribute('href', 'http://example.com/'); // change the content attribute d
characters as those bytes in ANSI_X3.4-1968 (US-ASCII). <a
href="#refsRFC1345">[RFC1345]</a></p>

<p class="note">This includes such encodings as Shift_JIS and
variants of ISO-2022, even though it is possible in these encodings
for bytes like 0x70 to be part of longer sequences that are
unrelated to their interpretation as ASCII. It excludes such
encodings as UTF-7, UTF-16, HZ-GB-2312, GSM03.38, and EBCDIC
variants.</p>
<p class="note">This includes such encodings as Shift_JIS,
HZ-GB-2312, and variants of ISO-2022, even though it is possible in
these encodings for bytes like 0x70 to be part of longer sequences
that are unrelated to their interpretation as ASCII. It excludes
such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p>

<!--
We'll have to change that if anyone comes up with a way to have a
@@ -12376,47 +12375,53 @@ people expect to have work and what is necessary.
state</span>, then the character encoding used must be an
<span>ASCII-compatible character encoding</span>.</p>

<p>Authors should not use JIS_C6226-1983<!-- aka JIS-X-0208,
x-JIS0208 -->, JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!--
has crazy handling of ASCII "~" -->, encodings based on ISO-2022<!--
<p>Authors are encouraged to use UTF-8. Conformance checkers may
advise authors against using legacy encodings.</p>

<div class="impl">

<p>Authoring tools should default to using UTF-8 for newly-created
documents.</p>

</div>

<p>Encodings in which a series of bytes in the range 0x20 to 0x7E
can encode characters other than the corresponding characters in the
range U+0020 to U+007E represent a potential security vulnerability:
a user agent that does not support the encoding (or does not support
the label used to declare the encoding, or does not use the same
mechanism to detect the encoding of unlabelled content as another
user agent) might end up interpreting technically benign plain text
content as HTML tags and JavaScript. In particular, this applies to
encodings in which the bytes corresponding to "<code
title="">&lt;script></code>" in ASCII can encode a different
string. Authors should not use such encodings, which are known to
include JIS_C6226-1983<!-- aka JIS-X-0208, x-JIS0208 -->,
JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!-- has crazy
handling of ASCII "~" -->, encodings based on ISO-2022<!--
http://krijnhoetmer.nl/irc-logs/whatwg/20090628#l-422 and
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-October/023797.html
-->, and encodings based on EBCDIC. Authors should not use UTF-32.
Authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings.
-->, and encodings based on EBCDIC. Furthermore, authors must not use
the CESU-8, UTF-7, BOCU-1 and SCSU encodings, which also fall into
this category, because these encodings were never intended for use
for Web content.
<a href="#refsRFC1345">[RFC1345]</a><!-- for the JIS types -->
<a href="#refsRFC1842">[RFC1842]</a><!-- HZ-GB-2312 -->
<a href="#refsRFC1468">[RFC1468]</a><!-- ISO-2022-JP -->
<a href="#refsRFC2237">[RFC2237]</a><!-- ISO-2022-JP-1 -->
<a href="#refsRFC1554">[RFC1554]</a><!-- ISO-2022-JP-2 -->
<a href="#refsRFC1922">[RFC1922]</a><!-- ISO-2022-CN and ISO-2022-CN-EXT -->
<a href="#refsRFC1557">[RFC1557]</a><!-- ISO-2022-KR -->
<a href="#refsUNICODE">[UNICODE]</a>
<a href="#refsCESU8">[CESU8]</a>
<a href="#refsUTF7">[UTF7]</a>
<a href="#refsBOCU1">[BOCU1]</a>
<a href="#refsSCSU">[SCSU]</a>
<!-- no idea what to reference for EBCDIC, so... -->
</p>

<p class="note">Most of these encodings are discouraged because of
security concerns. If a hostile user can contribute text to a site
using these encodings, bugs in the site's whitelisting filter or in
a user agent can easily lead to the filter interpreting the
contribution as "safe" while the user agent interprets the same
contribution as containing a <code>script</code> element. This would
enable cross-site scripting attacks. By avoiding these encodings,
and always providing a <span>character encoding declaration</span>,
an author is less likely to run into this kind of problem.</p>

<p>Authors are encouraged to use UTF-8. Conformance checkers may
advise authors against using legacy encodings.</p>

<div class="impl">

<p>Authoring tools should default to using UTF-8 for newly-created
documents.</p>

</div>
<p>Authors should not use UTF-32, as the HTML5 encoding detection
algorithms intentionally do not distinguish it from UTF-16. <a
href="#refsUNICODE">[UNICODE]</a></p>

<p class="note">Using non-UTF-8 encodings can have unexpected
results on form submission and URL encodings, which use the
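Illustration (not part of the commit): the reworded paragraph warns about encodings in which bytes in the range 0x20 to 0x7E can encode characters other than the corresponding ASCII ones. UTF-7, one of the encodings the text forbids, is a convenient way to see why. The sketch below assumes a hypothetical filter, `looks_safe_as_ascii`, that inspects the raw bytes as ASCII; the payload and the filter are invented for this example, and the only real API used is Python's built-in `utf-7` codec.

```python
# Every byte of this payload is in the range 0x20 to 0x7E, and none of
# them is "<" or ">", so a byte-level ASCII check finds nothing suspicious.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"


def looks_safe_as_ascii(data: bytes) -> bool:
    """Hypothetical filter: accept printable ASCII with no angle brackets."""
    printable = all(0x20 <= byte <= 0x7E for byte in data)
    return printable and b"<" not in data and b">" not in data


# What the filter "sees" when it treats the bytes as ASCII.
as_ascii = payload.decode("ascii")

# What a consumer sees if it decodes the very same bytes as UTF-7.
as_utf7 = payload.decode("utf-7")

print(looks_safe_as_ascii(payload))  # the filter accepts the bytes
print(as_ascii)                      # no markup visible in the ASCII reading
print(as_utf7)                       # a script element in the UTF-7 reading
```

The same bytes yield benign text under one interpretation and a `script` element under the other, which is the mismatch between filter and user agent that the removed note described as enabling cross-site scripting.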