Commit
[e] (0) Try to tidy up some more of the Unicode/code unit mess with a probably over-reaching definition (there's over 2000 uses of the word 'character' in the text, so I didn't check that all of them use this new definition... hopefully it works out; otherwise, we'll just have to try something else again).

Fixing http://www.w3.org/Bugs/Public/show_bug.cgi?id=13676

git-svn-id: http://svn.whatwg.org/webapps@6648 340c8d12-0b0e-0410-8428-c7bf67bfef74
Hixie committed Oct 6, 2011
1 parent 9eeedee commit ee9e809
Showing 3 changed files with 59 additions and 19 deletions.
25 changes: 19 additions & 6 deletions complete.html
@@ -3365,6 +3365,22 @@ <h4 id=character-encodings><span class=secno>2.1.6 </span>Character encodings</h
<p>The term <dfn id=unicode-character>Unicode character</dfn> is used to mean a <i title="">Unicode scalar value</i> (i.e. any Unicode code point that
is not a surrogate code point). <a href=#refsUNICODE>[UNICODE]</a></p>

+<p>The term <dfn id=character>character</dfn>, when not qualified as
+<em>Unicode</em> character, means a <a href=#unicode-character>Unicode character</a>
+where possible, or a surrogate code point when not: when an
+algorithm that processes strings is defined in terms of characters,
+a pair of <span title="code unit">code units</span> consisting of a
+high surrogate followed by a low surrogate must be treated as a
+single character, but isolated surrogates must each be treated as a
+single character also.</p>
+
+<p>The <dfn id=code-point-length>code-point length</dfn> of a string is the number of
+<span title="code unit">code units</span> in that string. <a href=#refsWEBIDL>[WEBIDL]</a></p>
+
+<p class=note>This complexity results from the historical decision
+to define the DOM API in terms of 16 bit (UTF-16) <span title="code
+unit">code units</span>, rather than in terms of <a href=#unicode-character title="Unicode character">Unicode characters</a>.</p>
+



@@ -4457,9 +4473,6 @@ <h4 id=common-parser-idioms><span class=secno>2.5.1 </span>Common parser idioms<
whitespace</dfn> from a string, the user agent must remove all <a href=#space-character title="space character">space characters</a> that are at the
start or end of the string.</p>

-<p>The <dfn id=code-point-length>code-point length</dfn> of a string is the number of
-<span title="code unit">code units</span> in that string. <a href=#refsWEBIDL>[WEBIDL]</a></p>
-
<p>When a user agent has to <dfn id=strictly-split-a-string>strictly split a string</dfn> on a
particular delimiter character <var title="">delimiter</var>, it
must use the following algorithm:</p>
@@ -33917,9 +33930,9 @@ <h6 id=parsing-0><span class=secno>4.8.10.13.3 </span>Parsing</h6>

</ol><p>The <dfn id=webvtt-cue-text-tokenizer>WebVTT cue text tokenizer</dfn> is as follows. It emits
a token, which is either a string (whose value is a sequence of
-Unicode characters), a start tag (with a tag name, a list of
-classes, and optionally an annotation), an end tag (with a tag
-name), or a timestamp tag (with a tag value).</p>
+characters), a start tag (with a tag name, a list of classes, and
+optionally an annotation), an end tag (with a tag name), or a
+timestamp tag (with a tag value).</p>

<ol><li><p>Let <var title="">input</var> and <var title="">position</var> be the same variables as those of the same
name in the algorithm that invoked these steps.</li>
25 changes: 19 additions & 6 deletions index
@@ -3365,6 +3365,22 @@ a.setAttribute('href', 'http://example.com/'); // change the content attribute d
<p>The term <dfn id=unicode-character>Unicode character</dfn> is used to mean a <i title="">Unicode scalar value</i> (i.e. any Unicode code point that
is not a surrogate code point). <a href=#refsUNICODE>[UNICODE]</a></p>

+<p>The term <dfn id=character>character</dfn>, when not qualified as
+<em>Unicode</em> character, means a <a href=#unicode-character>Unicode character</a>
+where possible, or a surrogate code point when not: when an
+algorithm that processes strings is defined in terms of characters,
+a pair of <span title="code unit">code units</span> consisting of a
+high surrogate followed by a low surrogate must be treated as a
+single character, but isolated surrogates must each be treated as a
+single character also.</p>
+
+<p>The <dfn id=code-point-length>code-point length</dfn> of a string is the number of
+<span title="code unit">code units</span> in that string. <a href=#refsWEBIDL>[WEBIDL]</a></p>
+
+<p class=note>This complexity results from the historical decision
+to define the DOM API in terms of 16 bit (UTF-16) <span title="code
+unit">code units</span>, rather than in terms of <a href=#unicode-character title="Unicode character">Unicode characters</a>.</p>
+



@@ -4457,9 +4473,6 @@ a.setAttribute('href', 'http://example.com/'); // change the content attribute d
whitespace</dfn> from a string, the user agent must remove all <a href=#space-character title="space character">space characters</a> that are at the
start or end of the string.</p>

-<p>The <dfn id=code-point-length>code-point length</dfn> of a string is the number of
-<span title="code unit">code units</span> in that string. <a href=#refsWEBIDL>[WEBIDL]</a></p>
-
<p>When a user agent has to <dfn id=strictly-split-a-string>strictly split a string</dfn> on a
particular delimiter character <var title="">delimiter</var>, it
must use the following algorithm:</p>
@@ -33917,9 +33930,9 @@ The General Relativistic Field Equations</pre>

</ol><p>The <dfn id=webvtt-cue-text-tokenizer>WebVTT cue text tokenizer</dfn> is as follows. It emits
a token, which is either a string (whose value is a sequence of
-Unicode characters), a start tag (with a tag name, a list of
-classes, and optionally an annotation), an end tag (with a tag
-name), or a timestamp tag (with a tag value).</p>
+characters), a start tag (with a tag name, a list of classes, and
+optionally an annotation), an end tag (with a tag name), or a
+timestamp tag (with a tag value).</p>

<ol><li><p>Let <var title="">input</var> and <var title="">position</var> be the same variables as those of the same
name in the algorithm that invoked these steps.</li>
28 changes: 21 additions & 7 deletions source
@@ -2242,6 +2242,24 @@ a.setAttribute('href', 'http://example.com/'); // change the content attribute d
is not a surrogate code point). <a
href="#refsUNICODE">[UNICODE]</a></p>

+<p>The term <dfn>character</dfn>, when not qualified as
+<em>Unicode</em> character, means a <span>Unicode character</span>
+where possible, or a surrogate code point when not: when an
+algorithm that processes strings is defined in terms of characters,
+a pair of <span title="code unit">code units</span> consisting of a
+high surrogate followed by a low surrogate must be treated as a
+single character, but isolated surrogates must each be treated as a
+single character also.</p>
+
+<p>The <dfn>code-point length</dfn> of a string is the number of
+<span title="code unit">code units</span> in that string. <a
+href="#refsWEBIDL">[WEBIDL]</a></p>
+
+<p class="note">This complexity results from the historical decision
+to define the DOM API in terms of 16 bit (UTF-16) <span title="code
+unit">code units</span>, rather than in terms of <span
+title="Unicode character">Unicode characters</span>.</p>
+


<!--END dev-html-->
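
Reader's note on the hunk above: the new definition of "character" can be illustrated with a short TypeScript sketch. This is not part of the commit or of the specification. The function name splitIntoCharacters and the sample string are hypothetical, and the sketch assumes the string is exposed as a sequence of 16 bit code units, as ECMAScript strings are.

```typescript
// Hypothetical helper, not taken from the specification: splits a string
// into "characters" in the sense of the definition added by this commit.
function splitIntoCharacters(s: string): string[] {
  const out: string[] = [];
  let i = 0;
  while (i < s.length) {
    const unit = s.charCodeAt(i);
    const isHighSurrogate = unit >= 0xd800 && unit <= 0xdbff;
    if (isHighSurrogate && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      const isLowSurrogate = next >= 0xdc00 && next <= 0xdfff;
      if (isLowSurrogate) {
        // High surrogate followed by low surrogate: one character.
        out.push(s.slice(i, i + 2));
        i += 2;
        continue;
      }
    }
    // Any other code unit, including an isolated surrogate, is one character.
    out.push(s.charAt(i));
    i += 1;
  }
  return out;
}

// "a", then a surrogate pair (U+1F600), then an isolated high surrogate.
const sample = "a\uD83D\uDE00\uD83D";
const characters = splitIntoCharacters(sample);

// The number of code units, which the text above calls the code-point length.
const codeUnitCount = sample.length;
```

For this sample, splitIntoCharacters yields three characters while sample.length reports four code units, which is the gap between the two notions that the note about the DOM API describes.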
@@ -3519,10 +3537,6 @@ is conforming depends on which specs apply, and leaves it at that. -->
title="space character">space characters</span> that are at the
start or end of the string.</p>

-<p>The <dfn>code-point length</dfn> of a string is the number of
-<span title="code unit">code units</span> in that string. <a
-href="#refsWEBIDL">[WEBIDL]</a></p>
-
<p>When a user agent has to <dfn>strictly split a string</dfn> on a
particular delimiter character <var title="">delimiter</var>, it
must use the following algorithm:</p>
@@ -37228,9 +37242,9 @@ The General Relativistic Field Equations</pre>

<p>The <dfn>WebVTT cue text tokenizer</dfn> is as follows. It emits
a token, which is either a string (whose value is a sequence of
-Unicode characters), a start tag (with a tag name, a list of
-classes, and optionally an annotation), an end tag (with a tag
-name), or a timestamp tag (with a tag value).</p>
+characters), a start tag (with a tag name, a list of classes, and
+optionally an annotation), an end tag (with a tag name), or a
+timestamp tag (with a tag value).</p>

<ol>


0 comments on commit ee9e809