Short URL: http://html5.org/r/6007
| SVN | Bug | Comment | Time (UTC) |
|---|---|---|---|
| 6007 | 8207 | apply wg decision | 2011-04-14 22:17 |
Index: source
===================================================================
--- source (revision 6006)
+++ source (revision 6007)
@@ -5609,10 +5609,23 @@
<h3>URLs</h3>
+ <p>This specification defines the term <span>URL</span>, and defines
+ various algorithms for dealing with URLs, because for historical
+ reasons the rules defined by the URI and IRI specifications are not
+ a complete description of what HTML user agents need to implement to
+ be compatible with Web content.</p>
+
+ <p class="note">The term "URL" in this specification is used in a
+ manner distinct from the precise technical meaning it is given in
+ RFC 3986. Readers familiar with that RFC will find it easier to read
+ <em>this</em> specification if they pretend the term "URL" as used
+ herein is really called something else altogether. This is a
+ <span>willful violation</span> of RFC 3986. <a
+ href="#refsRFC3986">[RFC3986]</a></p>
+
+
<h4>Terminology</h4>
- <!-- see also: svn diff -r3244:3245 source -->
-
<p>A <dfn>URL</dfn> is a string used to identify a resource.</p>
<p>A <span>URL</span> is a <dfn>valid URL</dfn> if at least one of
@@ -5650,29 +5663,176 @@
whitespace">stripping leading and trailing whitespace</span> from
it, it is a <span>valid non-empty URL</span>.</p>
+ <p>This specification defines the URL
+ <dfn><code>about:legacy-compat</code></dfn> as a reserved, though
+ unresolvable, <code title="">about:</code> URI, for use in <span
+ title="syntax-doctype">DOCTYPE</span>s in <span>HTML
+ documents</span> when needed for compatibility with XML tools. <a
+ href="#refsABOUT">[ABOUT]</a></p>
+
+ <p>This specification defines the URL
+ <dfn><code>about:srcdoc</code></dfn> as a reserved, though
+ unresolvable, <code title="">about:</code> URI, that is used as
+ <span>the document's address</span> of <span title="an iframe srcdoc
+ document"><code>iframe</code> <code
+ title="attr-iframe-srcdoc">srcdoc</code> documents</span>. <a
+ href="#refsABOUT">[ABOUT]</a></p>
+
+
<div class="impl">
+ <h4>Parsing URLs</h4>
+
<p>To <dfn>parse a URL</dfn> <var title="">url</var> into its
- component parts, the user agent must use the <span class="XXX">parse
- an address</span> algorithm defined by the IRI specification. <a
+ component parts, the user agent must use the following steps:</p>
+
+ <ol>
+
+ <li><p>Strip leading and trailing <span title="space
+ character">space characters</span> from <var
+ title="">url</var>.</p></li>
+
+ <li>
+
+ <p>Parse <var title="">url</var> in the manner defined by RFC
+ 3986, with the following exceptions:</p>
+
+ <ul>
+
+ <li>Add all characters with code points less than or equal to
+ U+0020 or greater than or equal to U+007F to the
+ <unreserved> production.</li>
+
+ <li>Add the characters U+0022, U+003C, U+003E, U+005B .. U+005E,
+ U+0060, and U+007B .. U+007D to the <unreserved>
+ production.
+ <!--
+ 0022 QUOTATION MARK
+ 003C LESS-THAN SIGN
+ 003E GREATER-THAN SIGN
+ 005B LEFT SQUARE BRACKET
+ 005C REVERSE SOLIDUS
+ 005D RIGHT SQUARE BRACKET
+ 005E CIRCUMFLEX ACCENT
+ 0060 GRAVE ACCENT
+ 007B LEFT CURLY BRACKET
+ 007C VERTICAL LINE
+ 007D RIGHT CURLY BRACKET
+ -->
+ </li>
+
+ <li>Add a single U+0025 PERCENT SIGN character as a second
+ alternative way of matching the <pct-encoded> production,
+ except when the <pct-encoded> is used in the
+ <reg-name> production.</li>
+
+ <li>Add the U+0023 NUMBER SIGN character to the characters
+ allowed in the <fragment> production.</li>
+
+ <!-- some browsers also have other differences, e.g. Mozilla
+ seems to treat ";" as if it was not in sub-delims, if the scheme
+ is "ftp". -->
+
+ </ul>
+
+ </li>
+
+ <li>
+
+ <p>If <var title="">url</var> doesn't match the
+ <URI-reference> production, even after the above changes are
+ made to the ABNF definitions, then parsing the URL fails with an
+ error. <a href="#refsRFC3986">[RFC3986]</a></p>
+
+ <p>Otherwise, parsing <var title="">url</var> was successful; the
+ components of the URL are substrings of <var title="">url</var>
+ defined as follows:</p>
+
+ <dl>
+
+ <dt><dfn title="url-scheme"><scheme></dfn></dt>
+
+ <dd><p>The substring matched by the <scheme> production, if any.</p></dd>
+
+
+ <dt><dfn title="url-host"><host></dfn></dt>
+
+ <dd><p>The substring matched by the <host> production, if any.</p></dd>
+
+
+ <dt><dfn title="url-port"><port></dfn></dt>
+
+ <dd><p>The substring matched by the <port> production, if any.</p></dd>
+
+
+ <dt><dfn title="url-hostport"><hostport></dfn></dt>
+
+ <dd><p>If there is a <scheme> component and a <port>
+ component and the port given by the <port> component is
+ different from the default port defined for the protocol given by
+ the <scheme> component, then <hostport> is the
+ substring that starts with the substring matched by the
+ <host> production and ends with the substring matched by the
+ <port> production, and includes the colon in between the
+ two. Otherwise, it is the same as the <host> component.</p></dd>
+
+
+ <dt><dfn title="url-path"><path></dfn></dt>
+
+ <dd>
+
+ <p>The substring matched by one of the following productions, if
+ one of them was matched:</p>
+
+ <ul class="brief">
+ <li><path-abempty></li>
+ <li><path-absolute></li>
+ <li><path-noscheme></li>
+ <li><path-rootless></li>
+ <li><path-empty></li>
+ </ul>
+
+ </dd>
+
+
+ <dt><dfn title="url-query"><query></dfn></dt>
+
+ <dd><p>The substring matched by the <query> production, if any.</p></dd>
+
+
+ <dt><dfn title="url-fragment"><fragment></dfn></dt>
+
+ <dd><p>The substring matched by the <fragment> production, if any.</p></dd>
+
+
+ <dt><dfn title="url-host-specific"><host-specific></dfn></dt>
+
+ <dd><p>The substring that <em>follows</em> the substring matched
+ by the <authority> production, or the whole string if the
+ <authority> production wasn't matched.</p></dd>
+
+ </dl>
+
+ </li>
+
+ </ol>
+
+ <p class="note">These parsing rules are a <span>willful
+ violation</span> of RFC 3986 and RFC 3987 (which do not define error
+ handling), motivated by a desire to handle legacy content. <a
+ href="#refsRFC3986">[RFC3986]</a> <a
href="#refsRFC3987">[RFC3987]</a></p>
- <p>Parsing a URL can fail. If it does not, then it results in the
- following components, again as defined by the IRI specification:</p>
+ </div>
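
The <hostport> rule in the parsing steps can be sketched roughly as follows. This is an illustration only, not part of the revision: the function name `hostport`, the `DEFAULT_PORTS` table, and the treatment of schemes with no known default port are all assumptions made for the example.

```python
from typing import Optional

# Assumption: a small illustrative subset of scheme default ports.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def hostport(scheme: Optional[str], host: str, port: Optional[str]) -> str:
    """Return the <hostport> component from already-parsed components.

    The port is included only when a scheme and a non-empty port are both
    present and the port differs from the scheme's default port.
    Assumption: a scheme missing from DEFAULT_PORTS has no default, so any
    explicit port is kept.
    """
    if scheme and port and int(port) != DEFAULT_PORTS.get(scheme.lower()):
        return f"{host}:{port}"
    return host
```

With these assumptions, `hostport("http", "example.com", "80")` yields just the host, while a port of `"8080"` is kept along with the colon.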
- <ul class="brief">
- <li><dfn title="url-scheme"><scheme></dfn></li>
- <li><dfn title="url-host"><host></dfn></li>
- <li><dfn title="url-port"><port></dfn></li>
- <li><dfn title="url-hostport"><hostport></dfn></li>
- <li><dfn title="url-path"><path></dfn></li>
- <li><dfn title="url-query"><query></dfn></li>
- <li><dfn title="url-fragment"><fragment></dfn></li>
- <li><dfn title="url-host-specific"><host-specific></dfn></li>
- </ul>
- <hr>
+ <h4>Resolving URLs</h4>
+ <p>Resolving a URL is the process of taking a relative URL and
+ obtaining the absolute URL that it implies.</p>
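
As a rough illustration of what resolving means, the sketch below uses Python's standard `urllib.parse.urljoin`, which follows the RFC 3986 relative resolution rules closely. It does not perform the additional HTML-specific steps described in this revision, and the wrapper name `resolve` is hypothetical.

```python
from urllib.parse import urljoin


def resolve(base: str, relative: str) -> str:
    """Resolve a relative reference against an absolute base URL.

    Sketch only: delegates to urljoin and skips the host, path, query,
    and backslash handling that this specification layers on top.
    """
    return urljoin(base, relative)


print(resolve("http://example.com/a/b/c", "../d?x=1"))
# prints http://example.com/a/d?x=1
```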
+
+ <div class="impl">
+
<p>To <dfn>resolve a URL</dfn> to an <span>absolute URL</span>
relative to either another <span>absolute URL</span> or an element,
the user agent must use the following steps. Resolving a URL can
@@ -5791,14 +5951,136 @@
</li>
- <li><p>Return the result of applying the <span class="XXX">resolve
- an address</span> algorithm defined by the IRI specification to
- resolve <var title="">url</var> relative to <var
- title="">base</var> using encoding <var title="">encoding</var>. <a
- href="#refsRFC3987">[RFC3987]</a></p></li>
+ <li><p><span title="parse a URL">Parse</span> <var
+ title="">url</var> into its component parts.</p></li>
+ <li>
+
+ <p>If parsing <var title="">url</var> resulted in a <span
+ title="url-host"><host></span> component, then replace the
+ matching substring of <var title="">url</var> with the string that
+ results from expanding any sequences of percent-encoded octets in
+ that component that are valid UTF-8 sequences into Unicode
+ characters as defined by UTF-8.</p>
+
+ <p>If any percent-encoded octets in that component are not valid
+ UTF-8 sequences, then return an error and abort these steps.</p>
+
+ <p>Apply the IDNA ToASCII algorithm to the matching substring,
+ with both the AllowUnassigned and UseSTD3ASCIIRules flags
+ set. Replace the matching substring with the result of the ToASCII
+ algorithm.</p>
+
+ <p>If ToASCII fails to convert one of the components of the
+ string, e.g. because it is too long or because it contains invalid
+ characters, then return an error and abort these steps. <a
+ href="#refsRFC3490">[RFC3490]</a></p>
+
+ </li>
+
+ <li>
+
+ <p>If parsing <var title="">url</var> resulted in a <span
+ title="url-path"><path></span> component, then replace the
+ matching substring of <var title="">url</var> with the string that
+ results from applying the following steps to each character other
+ than U+0025 PERCENT SIGN (%) that doesn't match the original
+ <path> production defined in RFC 3986:</p>
+
+ <ol>
+
+ <li>Encode the character into a sequence of octets as defined by
+ UTF-8.</li>
+
+ <li>Replace the character with the percent-encoded form of those
+ octets. <a href="#refsRFC3986">[RFC3986]</a></li>
+
+ </ol>
+
+ <div class="example">
+
+ <p>For instance if <var title="">url</var> was "<code
+ title="">//example.com/a^b☺c%FFd%z/?e</code>", then the
+ <span title="url-path"><path></span> component's substring
+ would be "<code title="">/a^b☺c%FFd%z/</code>" and the two
+ characters that would have to be escaped would be "<code
+ title="">^</code>" and "<code title="">☺</code>". The
+ result after this step was applied would therefore be that <var
+ title="">url</var> now had the value "<code
+ title="">//example.com/a%5Eb%E2%98%BAc%FFd%z/?e</code>".</p>
+
+ </div>
+
+ </li>
+
+ <li>
+
+ <p>If parsing <var title="">url</var> resulted in a <span
+ title="url-query"><query></span> component, then replace the
+ matching substring of <var title="">url</var> with the string that
+ results from applying the following steps to each character other
+ than U+0025 PERCENT SIGN (%) that doesn't match the original
+ <query> production defined in RFC 3986:</p>
+
+ <ol>
+
+ <li>If the character in question cannot be expressed in the
+ encoding <var title="">encoding</var>, then replace it with a
+ single 0x3F octet (an ASCII question mark) and skip the remaining
+ substeps for this character.</li>
+
+ <li>Encode the character into a sequence of octets as defined by
+ the encoding <var title="">encoding</var>.</li>
+
+ <li>Replace the character with the percent-encoded form of those
+ octets. <a href="#refsRFC3986">[RFC3986]</a></li>
+
+ </ol>
+
+ </li>
+
+ <li><p>Apply the algorithm described in RFC 3986 section 5.2
+ Relative Resolution, using <var title="">url</var> as the
+ potentially relative URI reference (<var title="">R</var>), and
+ <var title="">base</var> as the base URI (<var
+ title="">Base</var>). <a href="#refsRFC3986">[RFC3986]</a></p></li>
+
+ <li>
+
+ <p>Apply any relevant conformance criteria of RFC 3986 and RFC
+ 3987, returning an error and aborting these steps if
+ appropriate. <a href="#refsRFC3986">[RFC3986]</a> <a
+ href="#refsRFC3987">[RFC3987]</a></p>
+
+ <p class="example">For instance, if an absolute URI that would be
+ returned by the above algorithm violates the restrictions specific
+ to its scheme, e.g. a <code title="">data:</code> URI using the
+ "<code title="">//</code>" server-based naming authority syntax,
+ then user agents are to treat this as an error instead.<!-- RFC
+ 3986, 3.1 Scheme --></p>
+
+ </li>
+
+ <li><p>Let <var title="">result</var> be the target URI (<var
+ title="">T</var>) returned by the Relative Resolution
+ algorithm.</p></li>
+
+ <li><p>If <var title="">result</var> uses a scheme with a
+ server-based naming authority, replace all U+005C REVERSE SOLIDUS
+ (\) characters in <var title="">result</var> with U+002F SOLIDUS
+ (/) characters.</p></li>
+
+ <li><p>Return <var title="">result</var>.</p></li>
+
</ol>
+ <p class="note">Some of the steps in these rules, for example the
+ processing of U+005C REVERSE SOLIDUS (\) characters, are a
+ <span>willful violation</span> of RFC 3986 and RFC 3987, motivated
+ by a desire to handle legacy content. <a
+ href="#refsRFC3986">[RFC3986]</a> <a
+ href="#refsRFC3987">[RFC3987]</a></p>
+
</div>
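
The <path> and <query> escaping steps above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the names `escape_path`, `escape_query`, `PATH_SAFE`, and `QUERY_SAFE` are hypothetical, and the safe sets are derived by hand from the unmodified RFC 3986 `pchar` and `query` productions. It assumes the component substring has already been extracted by the parsing steps.

```python
import string

# Characters matched by the original RFC 3986 productions, excluding
# pct-encoded; U+0025 PERCENT SIGN is always left untouched.
_UNRESERVED = string.ascii_letters + string.digits + "-._~"
_SUB_DELIMS = "!$&'()*+,;="
PATH_SAFE = set(_UNRESERVED + _SUB_DELIMS + ":@/")
QUERY_SAFE = PATH_SAFE | {"?"}


def _pct(octets: bytes) -> str:
    """Percent-encode every octet using uppercase hex digits."""
    return "".join(f"%{b:02X}" for b in octets)


def escape_path(path: str) -> str:
    """Escape disallowed <path> characters via their UTF-8 octets."""
    return "".join(
        ch if ch == "%" or ch in PATH_SAFE else _pct(ch.encode("utf-8"))
        for ch in path
    )


def escape_query(query: str, encoding: str) -> str:
    """Escape disallowed <query> characters using the given encoding.

    A character the encoding cannot express becomes a literal "?".
    """
    out = []
    for ch in query:
        if ch == "%" or ch in QUERY_SAFE:
            out.append(ch)
            continue
        try:
            out.append(_pct(ch.encode(encoding)))
        except UnicodeEncodeError:
            out.append("?")
    return "".join(out)
```

Applied to the path from the example in the steps, `escape_path("/a^b☺c%FFd%z/")` gives `/a%5Eb%E2%98%BAc%FFd%z/`: the existing `%FF` and the malformed `%z` pass through unchanged because percent signs are never re-escaped.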
<p>A <span>URL</span> is an <dfn>absolute URL</dfn> if <span
@@ -5818,32 +6100,8 @@
immediately after the <span title="url-scheme"><scheme></span>
component and they are both U+002F SOLIDUS characters (//).</p>
- <hr>
- <p>This specification defines the URL
- <dfn><code>about:legacy-compat</code></dfn> as a reserved, though
- unresolvable, <code title="">about:</code> URI, for use in <span
- title="syntax-doctype">DOCTYPE</span>s in <span>HTML
- documents</span> when needed for compatibility with XML tools. <a
- href="#refsABOUT">[ABOUT]</a></p>
- <p>This specification defines the URL
- <dfn><code>about:srcdoc</code></dfn> as a reserved, though
- unresolvable, <code title="">about:</code> URI, that is used as
- <span>the document's address</span> of <span title="an iframe srcdoc
- document"><code>iframe</code> <code
- title="attr-iframe-srcdoc">srcdoc</code> documents</span>. <a
- href="#refsABOUT">[ABOUT]</a></p>
-
- <p class="note">The term "URL" in this specification is used in a
- manner distinct from the precise technical meaning it is given in
- RFC 3986. Readers familiar with that RFC will find it easier to read
- <em>this</em> specification if they pretend the term "URL" as used
- herein is really called something else altogether. This is a
- <span>willful violation</span> of RFC 3986. <a
- href="#refsRFC3986">[RFC3986]</a></p>
-
-
<div class="impl">
<h4>Dynamic changes to base URLs</h4>