HTML Standard Tracker


SVN   Bug   Comment            Time (UTC)
6007  8207  apply wg decision  2011-04-14 22:17
@@ -5602,23 +5602,36 @@ is conforming depends on which specs apply, and leaves it at that. -->
   the empty string, a string consisting of only <span title="space
   character">space characters</span>, or is a media query that matches
   the user's environment according to the definitions given in the
   Media Queries specification. <a href="#refsMQ">[MQ]</a></p>
 
 
 
 
   <h3>URLs</h3>
 
-  <h4>Terminology</h4>
+  <p>This specification defines the term <span>URL</span>, and defines
+  various algorithms for dealing with URLs, because for historical
+  reasons the rules defined by the URI and IRI specifications are not
+  a complete description of what HTML user agents need to implement to
+  be compatible with Web content.</p>
 
-  <!-- see also: svn diff -r3244:3245 source -->
+  <p class="note">The term "URL" in this specification is used in a
+  manner distinct from the precise technical meaning it is given in
+  RFC 3986. Readers familiar with that RFC will find it easier to read
+  <em>this</em> specification if they pretend the term "URL" as used
+  herein is really called something else altogether. This is a
+  <span>willful violation</span> of RFC 3986. <a
+  href="#refsRFC3986">[RFC3986]</a></p>
+
+
+  <h4>Terminology</h4>
 
   <p>A <dfn>URL</dfn> is a string used to identify a resource.</p>
 
   <p>A <span>URL</span> is a <dfn>valid URL</dfn> if at least one of
   the following conditions holds:</p>
 
   <ul>
 
    <li><p>The <span>URL</span> is a valid URI reference <a
    href="#refsRFC3986">[RFC3986]</a>.</p></li>
@@ -5643,42 +5656,189 @@ is conforming depends on which specs apply, and leaves it at that. -->
   <p>A string is a <dfn>valid URL potentially surrounded by
   spaces</dfn> if, after <span title="strip leading and trailing
   whitespace">stripping leading and trailing whitespace</span> from
   it, it is a <span>valid URL</span>.</p>
 
   <p>A string is a <dfn>valid non-empty URL potentially surrounded by
   spaces</dfn> if, after <span title="strip leading and trailing
   whitespace">stripping leading and trailing whitespace</span> from
   it, it is a <span>valid non-empty URL</span>.</p>
 
+  <p>This specification defines the URL
+  <dfn><code>about:legacy-compat</code></dfn> as a reserved, though
+  unresolvable, <code title="">about:</code> URI, for use in <span
+  title="syntax-doctype">DOCTYPE</span>s in <span>HTML
+  documents</span> when needed for compatibility with XML tools. <a
+  href="#refsABOUT">[ABOUT]</a></p>
+
+  <p>This specification defines the URL
+  <dfn><code>about:srcdoc</code></dfn> as a reserved, though
+  unresolvable, <code title="">about:</code> URI, that is used as
+  <span>the document's address</span> of <span title="an iframe srcdoc
+  document"><code>iframe</code> <code
+  title="attr-iframe-srcdoc">srcdoc</code> documents</span>. <a
+  href="#refsABOUT">[ABOUT]</a></p>
+
+
   <div class="impl">
 
+  <h4>Parsing URLs</h4>
+
   <p>To <dfn>parse a URL</dfn> <var title="">url</var> into its
-  component parts, the user agent must use the <span class="XXX">parse
-  an address</span> algorithm defined by the IRI specification. <a
+  component parts, the user agent must use the following steps:</p>
+
+  <ol>
+
+   <li><p>Strip leading and trailing <span title="space
+   character">space characters</span> from <var
+   title="">url</var>.</p></li>
+
+   <li>
+
+    <p>Parse <var title="">url</var> in the manner defined by RFC
+    3986, with the following exceptions:</p>
+
+    <ul>
+
+     <li>Add all characters with code points less than or equal to
+     U+0020 or greater than or equal to U+007F to the
+     &lt;unreserved&gt; production.</li>
+
+     <li>Add the characters U+0022, U+003C, U+003E, U+005B .. U+005E,
+     U+0060, and U+007B .. U+007D to the &lt;unreserved&gt;
+     production.
+      <!--
+       0022 QUOTATION MARK
+       003C LESS-THAN SIGN
+       003E GREATER-THAN SIGN
+       005B LEFT SQUARE BRACKET
+       005C REVERSE SOLIDUS
+       005D RIGHT SQUARE BRACKET
+       005E CIRCUMFLEX ACCENT
+       0060 GRAVE ACCENT
+       007B LEFT CURLY BRACKET
+       007C VERTICAL LINE
+       007D RIGHT CURLY BRACKET
+      -->
+     </li>
+
+     <li>Add a single U+0025 PERCENT SIGN character as a second
+     alternative way of matching the &lt;pct-encoded&gt; production,
+     except when the &lt;pct-encoded&gt; is used in the
+     &lt;reg-name&gt; production.</li>
+
+     <li>Add the U+0023 NUMBER SIGN character to the characters
+     allowed in the &lt;fragment&gt; production.</li>
+
+     <!-- some browsers also have other differences, e.g. Mozilla
+     seems to treat ";" as if it was not in sub-delims, if the scheme
+     is "ftp". -->
+
+    </ul>
+
+   </li>
+
+   <li>
+
+    <p>If <var title="">url</var> doesn't match the
+    &lt;URI-reference&gt; production, even after the above changes are
+    made to the ABNF definitions, then parsing the URL fails with an
+    error. <a href="#refsRFC3986">[RFC3986]</a></p>
+
+    <p>Otherwise, parsing <var title="">url</var> was successful; the
+    components of the URL are substrings of <var title="">url</var>
+    defined as follows:</p>
+
+    <dl>
+
+     <dt><dfn title="url-scheme">&lt;scheme&gt;</dfn></dt>
+
+     <dd><p>The substring matched by the &lt;scheme&gt; production, if any.</p></dd>
+
+
+     <dt><dfn title="url-host">&lt;host&gt;</dfn></dt>
+
+     <dd><p>The substring matched by the &lt;host&gt; production, if any.</p></dd>
+
+
+     <dt><dfn title="url-port">&lt;port&gt;</dfn></dt>
+
+     <dd><p>The substring matched by the &lt;port&gt; production, if any.</p></dd>
+
+
+     <dt><dfn title="url-hostport">&lt;hostport&gt;</dfn></dt>
+
+     <dd><p>If there is a &lt;scheme&gt; component and a &lt;port&gt;
+     component and the port given by the &lt;port&gt; component is
+     different than the default port defined for the protocol given by
+     the &lt;scheme&gt; component, then &lt;hostport&gt; is the
+     substring that starts with the substring matched by the
+     &lt;host&gt; production and ends with the substring matched by the
+     &lt;port&gt; production, and includes the colon in between the
+     two. Otherwise, it is the same as the &lt;host&gt; component.</p></dd>
+
+
+     <dt><dfn title="url-path">&lt;path&gt;</dfn></dt>
+
+     <dd>
+
+      <p>The substring matched by one of the following productions, if
+      one of them was matched:</p>
+
+      <ul class="brief">
+       <li>&lt;path-abempty&gt;</li>
+       <li>&lt;path-absolute&gt;</li>
+       <li>&lt;path-noscheme&gt;</li>
+       <li>&lt;path-rootless&gt;</li>
+       <li>&lt;path-empty&gt;</li>
+      </ul>
+
+     </dd>
+
+
+     <dt><dfn title="url-query">&lt;query&gt;</dfn></dt>
+
+     <dd><p>The substring matched by the &lt;query&gt; production, if any.</p></dd>
+
+
+     <dt><dfn title="url-fragment">&lt;fragment&gt;</dfn></dt>
+
+     <dd><p>The substring matched by the &lt;fragment&gt; production, if any.</p></dd>
+
+
+     <dt><dfn title="url-host-specific">&lt;host-specific&gt;</dfn></dt>
+
+     <dd><p>The substring that <em>follows</em> the substring matched
+     by the &lt;authority&gt; production, or the whole string if the
+     &lt;authority&gt; production wasn't matched.</p></dd>
+
+    </dl>
+
+   </li>
+
+  </ol>
+
+  <p class="note">These parsing rules are a <span>willful
+  violation</span> of RFC 3986 and RFC 3987 (which do not define error
+  handling), motivated by a desire to handle legacy content. <a
+  href="#refsRFC3986">[RFC3986]</a> <a
   href="#refsRFC3987">[RFC3987]</a></p>
 
-  <p>Parsing a URL can fail. If it does not, then it results in the
-  following components, again as defined by the IRI specification:</p>
+  </div>
 
-  <ul class="brief">
-   <li><dfn title="url-scheme">&lt;scheme&gt;</dfn></li>
-   <li><dfn title="url-host">&lt;host&gt;</dfn></li>
-   <li><dfn title="url-port">&lt;port&gt;</dfn></li>
-   <li><dfn title="url-hostport">&lt;hostport&gt;</dfn></li>
-   <li><dfn title="url-path">&lt;path&gt;</dfn></li>
-   <li><dfn title="url-query">&lt;query&gt;</dfn></li>
-   <li><dfn title="url-fragment">&lt;fragment&gt;</dfn></li>
-   <li><dfn title="url-host-specific">&lt;host-specific&gt;</dfn></li>
-  </ul>
 
-  <hr>
+  <h4>Resolving URLs</h4>
+
+  <p>Resolving a URL is the process of taking a relative URL and
+  obtaining the absolute URL that it implies.</p>
+
+  <div class="impl">
 
   <p>To <dfn>resolve a URL</dfn> to an <span>absolute URL</span>
   relative to either another <span>absolute URL</span> or an element,
   the user agent must use the following steps. Resolving a URL can
   result in an error, in which case the URL is not resolvable.</p>
 
   <ol>
 
    <li><p>Let <var title="">url</var> be the <span>URL</span> being
    resolved.</p></li>
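Reviewer sketch (not part of the commit): the component definitions added under "Parsing URLs" in the hunk above can be approximated in Python as follows. This is a rough illustration under stated assumptions, not the algorithm the diff specifies. It splits the string with the lenient regular expression from RFC 3986 Appendix B rather than the modified ABNF, so it never reports a parse failure. The names `parse_url`, `ParsedURL`, `SPACE_CHARACTERS` and `DEFAULT_PORTS` are hypothetical, the default-port table is an assumed sample, and a scheme missing from that table is treated as having no default port.

```python
import re
from typing import NamedTuple, Optional

# HTML "space characters": space, tab, LF, FF, CR.
SPACE_CHARACTERS = " \t\n\f\r"

# Component-splitting regex from RFC 3986, Appendix B. The fragment group
# is "(.*)", so it also accepts "#" inside the fragment.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
    re.DOTALL,
)

# Assumed sample of default ports, for illustration only.
DEFAULT_PORTS = {"ftp": 21, "http": 80, "https": 443}


class ParsedURL(NamedTuple):
    scheme: Optional[str]
    host: Optional[str]
    port: Optional[str]
    hostport: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]
    host_specific: str


def parse_url(url: str) -> ParsedURL:
    # Step 1: strip leading and trailing space characters.
    url = url.strip(SPACE_CHARACTERS)
    m = URI_REFERENCE.match(url)
    assert m is not None  # every group is optional, so this always matches

    scheme, authority = m.group(2), m.group(4)
    host = port = None
    if authority is not None:
        host_and_port = authority.rpartition("@")[2]  # drop any userinfo
        hm = re.fullmatch(r"(.*?)(?::([0-9]*))?", host_and_port, re.DOTALL)
        host, port = hm.group(1), hm.group(2)

    # <hostport> keeps the port only when it differs from the scheme default.
    hostport = host
    if scheme and port and int(port) != DEFAULT_PORTS.get(scheme.lower()):
        hostport = f"{host}:{port}"

    # <host-specific> is everything after the authority, or the whole string.
    host_specific = url[m.end(3):] if m.group(3) is not None else url

    return ParsedURL(scheme, host, port, hostport, m.group(5),
                     m.group(7), m.group(9), host_specific)
```

Because the Appendix B expression accepts any string, a faithful implementation would still need the grammar check that makes parsing fail when the string does not match &lt;URI-reference&gt;.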
@@ -5784,71 +5944,169 @@ is conforming depends on which specs apply, and leaves it at that. -->
      <code title="attr-xml-base">xml:base</code> attributes).</p></li>
 
      <li><p>The <span>document base URL</span> is the result of the
      previous step if it was successful; otherwise it is <var
      title="">fallback base url</var>.</p></li>
 
     </ol>
 
    </li>
 
-   <li><p>Return the result of applying the <span class="XXX">resolve
-   an address</span> algorithm defined by the IRI specification to
-   resolve <var title="">url</var> relative to <var
-   title="">base</var> using encoding <var title="">encoding</var>. <a
-   href="#refsRFC3987">[RFC3987]</a></p></li>
+   <li><p><span title="parse a URL">Parse</span> <var
+   title="">url</var> into its component parts.</p></li>
+
+   <li>
+
+    <p>If parsing <var title="">url</var> resulted in a <span
+    title="url-host">&lt;host&gt;</span> component, then replace the
+    matching substring of <var title="">url</var> with the string that
+    results from expanding any sequences of percent-encoded octets in
+    that component that are valid UTF-8 sequences into Unicode
+    characters as defined by UTF-8.</p>
+
+    <p>If any percent-encoded octets in that component are not valid
+    UTF-8 sequences, then return an error and abort these steps.</p>
+
+    <p>Apply the IDNA ToASCII algorithm to the matching substring,
+    with both the AllowUnassigned and UseSTD3ASCIIRules flags
+    set. Replace the matching substring with the result of the ToASCII
+    algorithm.</p>
+
+    <p>If ToASCII fails to convert one of the components of the
+    string, e.g. because it is too long or because it contains invalid
+    characters, then return an error and abort these steps. <a
+    href="#refsRFC3490">[RFC3490]</a></p>
+
+   </li>
+
+   <li>
+
+    <p>If parsing <var title="">url</var> resulted in a <span
+    title="url-path">&lt;path&gt;</span> component, then replace the
+    matching substring of <var title="">url</var> with the string that
+    results from applying the following steps to each character other
+    than U+0025 PERCENT SIGN (%) that doesn't match the original
+    &lt;path&gt; production defined in RFC 3986:</p>
+
+    <ol>
+
+     <li>Encode the character into a sequence of octets as defined by
+     UTF-8.</li>
+
+     <li>Replace the character with the percent-encoded form of those
+     octets. <a href="#refsRFC3986">[RFC3986]</a></li>
+
+    </ol>
+
+    <div class="example">
+
+     <p>For instance if <var title="">url</var> was "<code
+     title="">//example.com/a^b&#x263a;c%FFd%z/?e</code>", then the
+     <span title="url-path">&lt;path&gt;</span> component's substring
+     would be "<code title="">/a^b&#x263a;c%FFd%z/</code>" and the two
+     characters that would have to be escaped would be "<code
+     title="">^</code>" and "<code title="">&#x263a;</code>". The
+     result after this step was applied would therefore be that <var
+     title="">url</var> now had the value "<code
+     title="">//example.com/a%5Eb%E2%98%BAc%FFd%z/?e</code>".</p>
+
+    </div>
+
+   </li>
+
+   <li>
+
+    <p>If parsing <var title="">url</var> resulted in a <span
+    title="url-query">&lt;query&gt;</span> component, then replace the
+    matching substring of <var title="">url</var> with the string that
+    results from applying the following steps to each character other
+    than U+0025 PERCENT SIGN (%) that doesn't match the original
+    &lt;query&gt; production defined in RFC 3986:</p>
+
+    <ol>
+
+     <li>If the character in question cannot be expressed in the
+     encoding <var title="">encoding</var>, then replace it with a
+     single 0x3F octet (an ASCII question mark) and skip the remaining
+     substeps for this character.</li>
+
+     <li>Encode the character into a sequence of octets as defined by
+     the encoding <var title="">encoding</var>.</li>
+
+     <li>Replace the character with the percent-encoded form of those
+     octets. <a href="#refsRFC3986">[RFC3986]</a></li>
+
+    </ol>
+
+   </li>
+
+   <li><p>Apply the algorithm described in RFC 3986 section 5.2
+   Relative Resolution, using <var title="">url</var> as the
+   potentially relative URI reference (<var title="">R</var>), and
+   <var title="">base</var> as the base URI (<var
+   title="">Base</var>). <a href="#refsRFC3986">[RFC3986]</a></p></li>
+
+   <li>
+
+    <p>Apply any relevant conformance criteria of RFC 3986 and RFC
+    3987, returning an error and aborting these steps if
+    appropriate. <a href="#refsRFC3986">[RFC3986]</a> <a
+    href="#refsRFC3987">[RFC3987]</a></p>
+
+    <p class="example">For instance, if an absolute URI that would be
+    returned by the above algorithm violates the restrictions specific
+    to its scheme, e.g. a <code title="">data:</code> URI using the
+    "<code title="">//</code>" server-based naming authority syntax,
+    then user agents are to treat this as an error instead.<!-- RFC
+    3986, 3.1 Scheme --></p>
+
+   </li>
+
+   <li><p>Let <var title="">result</var> be the target URI (<var
+   title="">T</var>) returned by the Relative Resolution
+   algorithm.</p></li>
+
+   <li><p>If <var title="">result</var> uses a scheme with a
+   server-based naming authority, replace all U+005C REVERSE SOLIDUS
+   (\) characters in <var title="">result</var> with U+002F SOLIDUS
+   (/) characters.</p></li>
+
+   <li><p>Return <var title="">result</var>.</p></li>
 
   </ol>
 
+  <p class="note">Some of the steps in these rules, for example the
+  processing of U+005C REVERSE SOLIDUS (\) characters, are a
+  <span>willful violation</span> of RFC 3986 and RFC 3987, motivated
+  by a desire to handle legacy content. <a
+  href="#refsRFC3986">[RFC3986]</a> <a
+  href="#refsRFC3987">[RFC3987]</a></p>
+
   </div>
 
   <p>A <span>URL</span> is an <dfn>absolute URL</dfn> if <span
   title="resolve a url">resolving</span> it results in the same output
   regardless of what it is resolved relative to, and that output is
   not a failure.</p>
 
   <p>An <span>absolute URL</span> is a <dfn>hierarchical URL</dfn> if,
   when <span title="resolve a url">resolved</span> and then <span
   title="parse a url">parsed</span>, there is a character immediately
   after the <span title="url-scheme">&lt;scheme&gt;</span> component
   and it is a U+002F SOLIDUS character (/).</p>
 
   <p>An <span>absolute URL</span> is an <dfn>authority-based URL</dfn>
   if, when <span title="resolve a url">resolved</span> and then <span
   title="parse a url">parsed</span>, there are two characters
   immediately after the <span title="url-scheme">&lt;scheme&gt;</span>
   component and they are both U+002F SOLIDUS characters (//).</p>
 
-  <hr>
-
-  <p>This specification defines the URL
-  <dfn><code>about:legacy-compat</code></dfn> as a reserved, though
-  unresolvable, <code title="">about:</code> URI, for use in <span
-  title="syntax-doctype">DOCTYPE</span>s in <span>HTML
-  documents</span> when needed for compatibility with XML tools. <a
-  href="#refsABOUT">[ABOUT]</a></p>
-
-  <p>This specification defines the URL
-  <dfn><code>about:srcdoc</code></dfn> as a reserved, though
-  unresolvable, <code title="">about:</code> URI, that is used as
-  <span>the document's address</span> of <span title="an iframe srcdoc
-  document"><code>iframe</code> <code
-  title="attr-iframe-srcdoc">srcdoc</code> documents</span>. <a
-  href="#refsABOUT">[ABOUT]</a></p>
-
-  <p class="note">The term "URL" in this specification is used in a
-  manner distinct from the precise technical meaning it is given in
-  RFC 3986. Readers familiar with that RFC will find it easier to read
-  <em>this</em> specification if they pretend the term "URL" as used
-  herein is really called something else altogether. This is a
-  <span>willful violation</span> of RFC 3986. <a
-  href="#refsRFC3986">[RFC3986]</a></p>
 
 
   <div class="impl">
 
   <h4>Dynamic changes to base URLs</h4>
 
   <p>When an <code title="attr-xml-base">xml:base</code> attribute
   changes, the attribute's element, and all descendant elements, are
   <span>affected by a base URL change</span>.</p>
 
