Commit
[giow] (3) New encoding defaults based on more data.
Fixing https://www.w3.org/Bugs/Public/show_bug.cgi?id=21087
Affected topics: HTML Syntax and Parsing

git-svn-id: http://svn.whatwg.org/webapps@7958 340c8d12-0b0e-0410-8428-c7bf67bfef74
Hixie committed Jun 12, 2013
1 parent 518a15d commit c0a1047
Showing 3 changed files with 847 additions and 155 deletions.
333 changes: 282 additions & 51 deletions complete.html
@@ -256,7 +256,7 @@

<header class=head id=head><p><a class=logo href=http://www.whatwg.org/><img alt=WHATWG height=101 src=/images/logo width=101></a></p>
<hgroup><h1 class=allcaps>HTML</h1>
<h2 class="no-num no-toc">Living Standard &mdash; Last Updated 11 June 2013</h2>
<h2 class="no-num no-toc">Living Standard &mdash; Last Updated 12 June 2013</h2>
</hgroup><dl><dt><strong>Web developer edition:</strong></dt>
<dd><strong><a href=http://developers.whatwg.org/>http://developers.whatwg.org/</a></strong></dd>
<dt>Multiple-page version:</dt>
@@ -84717,102 +84717,333 @@ <h5 id=determining-the-character-encoding><span class=secno>12.2.2.1 </span>Dete
to frequent). The following table gives suggested defaults based on the user's locale, for
compatibility with legacy content. Locales are identified by BCP 47 language tags. <a href=#refsBCP47>[BCP47]</a></p>

<!-- based on mozilla 1.9.1 localizations:
http://mxr.mozilla.org/l10n-mozilla1.9.1/find?string=global%2Fintl.properties&tree=l10n-mozilla1.9.1&hint= -->
<!-- based on three sources:
1. mozilla 1.9.1 localizations: http://mxr.mozilla.org/l10n-mozilla1.9.1/find?string=global%2Fintl.properties&tree=l10n-mozilla1.9.1&hint=
2. windows vista encodings: http://msdn.microsoft.com/en-us/goglobal/bb896001
3. chrome encodings: https://code.google.com/p/chromium/codesearch#search/&q=IDS_DEFAULT_ENCODING
several assumptions were made in this process; amongst them:
- ISO-8859-1 and Windows-1252 are the same (supported by encoding.spec.whatwg.org)
- ISO-8859-9 and Windows-1254 are the same (supported by encoding.spec.whatwg.org)
- Windows-31J and Shift_JIS are the same (supported by encoding.spec.whatwg.org)
- Windows-932 is close enough to Shift_JIS to be treated as equivalent (supported by wikipedia)
- Windows-936 is basically a subset of GBK, which is basically a subset of GB18030 (supported by wikipedia)
- Windows-950 is basically the same as Big5 (supported by wikipedia)
- Firefox's UTF-8 defaults are all bogus
-->

<table><thead><tr><th>Locale language
<table><thead><tr><th colspan=2>Locale language
<th>Suggested default encoding
<tbody><tr><td>ar
<td>UTF-8
<tbody><!-- af, Afrikaans, uses windows-1252: Windows Vista and Firefox agreed --><!-- am, Amharic, uses windows-1252: Firefox and Chrome agreed --><tr><td>ar
<td>Arabic
<td>windows-1256 <!-- Windows Vista and Chrome agreed -->

<!-- arn-CL, Mapudungun (Chile), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- az, Azeri, is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1254 -->

<tr><td>be
<td>ISO-8859-5
<!-- az-Cyrl-AZ, Azeri (Cyrillic, Azerbaijan), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->

<!-- ba-RU, Bashkir (Russia), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->

<!-- be, Belarusian, is not listed here because Windows Vista wanted windows-1251, Chrome wanted <none>, and Firefox wanted ISO-8859-5 -->

<!-- be-BY, Belarusian (Belarus), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->

<tr><td>bg
<td>windows-1251
<td>Bulgarian
<td>windows-1251 <!-- Windows Vista, Chrome, and Firefox agreed -->

<tr><td>cs<!-- -CZ -->
<td>ISO-8859-2
<!-- bn, Bengali, uses windows-1252: Firefox and Chrome agreed -->

<tr><td>cy
<td>UTF-8
<!-- br-FR, Breton (France), uses windows-1252: Windows Vista and Firefox agreed -->

<tr><td>fa<!-- -IR -->
<td>UTF-8
<!-- bs-Cyrl-BA, Bosnian (Cyrillic, Bosnia and Herzegovina), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->

<!-- bs-Latn-BA, Bosnian (Latin, Bosnia and Herzegovina), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1250 -->

<!-- ca, Catalan, uses windows-1252: Windows Vista, Chrome, and Firefox agreed -->

<!-- co-FR, Corsican (France), uses windows-1252: Windows Vista and Firefox agreed -->

<tr><td>cs
<td>Czech
<td>windows-1250 <!-- Windows Vista and Chrome agreed (but disagreed with Firefox, which thought the encoding should be ISO-8859-2) -->

<!-- cy-GB, Welsh (United Kingdom), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- da, Danish, uses windows-1252: Windows Vista, Chrome, and Firefox agreed -->

<!-- de, German, uses windows-1252: Windows Vista, Chrome, and Firefox agreed -->

<!-- el, Greek, is not listed here because Windows Vista wanted windows-1253, Chrome wanted ISO-8859-7, and Firefox wanted windows-1252 -->

<!-- el-GR, Greek (Greece), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1253 -->

<!-- en, English, uses windows-1252: Windows Vista and Firefox agreed -->

<!-- es, Spanish, uses windows-1252: Windows Vista, Chrome, and Firefox agreed -->

<tr><td>et
<td>Estonian
<td>windows-1257 <!-- Windows Vista and Chrome agreed -->

<!-- eu, Basque, uses windows-1252: Windows Vista and Firefox agreed -->

<tr><td>fa
<td>Persian
<td>windows-1256 <!-- Windows Vista and Chrome agreed -->

<!-- fi, Finnish, uses windows-1252: Windows Vista, Chrome, and Firefox agreed -->

<!-- fil, Filipino, uses windows-1252: Firefox and Chrome agreed -->

<!-- fo, Faroese, uses windows-1252: Windows Vista and Firefox agreed -->

<!-- fr, French, uses windows-1252: Windows Vista, Chrome, and Firefox agreed -->

<!-- fy-NL, Frisian (Netherlands), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- ga-IE, Irish (Ireland), uses windows-1252: Windows Vista and Firefox agreed -->

<tr><td>he<!-- -IL -->
<td>windows-1255
<!-- gl, Galician, uses windows-1252: Windows Vista and Firefox agreed -->

<!-- gsw-FR, Alsatian (France), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- gu, Gujarati, uses windows-1252: Firefox and Chrome agreed -->

<!-- ha-Latn-NG, Hausa (Latin, Nigeria), uses windows-1252: Windows Vista and Firefox agreed -->

<tr><td>he
<td>Hebrew
<td>windows-1255 <!-- Windows Vista, Chrome, and Firefox agreed -->

<!-- hi, Hindi, uses windows-1252: Firefox and Chrome agreed -->

<tr><td>hr
<td>UTF-8
<td>Croatian
<td>windows-1250 <!-- Windows Vista and Chrome agreed -->

<tr><td>hu<!-- -HU -->
<td>ISO-8859-2
<tr><td>hu
<td>Hungarian
<td>ISO-8859-2 <!-- Chrome and Firefox agreed (but disagreed with Windows Vista, which thought the encoding should be windows-1250) -->

<tr><td>ja <!-- and ja-JP-mac -->
<td>Windows-31J <!-- Shift_JIS -->
<!-- hu-HU, Hungarian (Hungary), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1250 -->

<tr><td>kk
<td>UTF-8
<!-- id, Indonesian, uses windows-1252: Windows Vista, Chrome, and Firefox agreed -->

<!-- ig-NG, Igbo (Nigeria), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- is, Icelandic, uses windows-1252: Windows Vista and Firefox agreed -->

<!-- it, Italian, uses windows-1252: Windows Vista, Chrome, and Firefox agreed -->

<!-- iu-Latn-CA, Inuktitut (Latin, Canada), uses windows-1252: Windows Vista and Firefox agreed -->

<tr><td>ja
<td>Japanese
<td>Shift_JIS <!-- Windows Vista, Chrome, and Firefox agreed -->

<!-- kk, Kazakh, is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->

<!-- kl-GL, Greenlandic (Greenland), uses windows-1252: Windows Vista and Firefox agreed -->

<tr><td>ko<!-- -KR -->
<td>windows-949 <!-- EUC-KR -->
<!-- kn, Kannada, uses windows-1252: Firefox and Chrome agreed -->

<tr><td>ko
<td>Korean
<td>windows-949 <!-- Windows Vista, Chrome, and Firefox agreed -->

<tr><td>ku
<td>windows-1254 <!-- ISO-8859-9 -->
<td>Kurdish
<td>windows-1254 <!-- Best guess -->

<!-- ky, Kyrgyz, is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->

<!-- lb-LU, Luxembourgish (Luxembourg), uses windows-1252: Windows Vista and Firefox agreed -->

<tr><td>lt
<td>windows-1257
<td>Lithuanian
<td>windows-1257 <!-- Windows Vista, Chrome, and Firefox agreed -->

<tr><td>lv<!-- -LV -->
<td>ISO-8859-13
<tr><td>lv
<td>Latvian
<td>windows-1257 <!-- Windows Vista and Chrome agreed (but disagreed with Firefox, which thought the encoding should be ISO-8859-13) -->

<tr><td>mk<!-- -MK -->
<td>UTF-8
<!-- mk, Macedonian, is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->

<tr><td>or
<td>UTF-8
<!-- ml, Malayalam, uses windows-1252: Firefox and Chrome agreed -->

<tr><td>pl<!-- -PL -->
<td>ISO-8859-2
<!-- mn, Mongolian, is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->

<tr><td>ro
<td>UTF-8
<!-- moh-CA, Mohawk (Mohawk), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- mr, Marathi, uses windows-1252: Firefox and Chrome agreed -->

<!-- ms, Malay, uses windows-1252: Windows Vista and Firefox agreed -->

<!-- nb, Norwegian Bokm&aring;l, uses windows-1252: Firefox and Chrome agreed -->

<!-- nl, Dutch, uses windows-1252: Windows Vista, Chrome, and Firefox agreed -->

<!-- nn-NO, Norwegian, Nynorsk (Norway), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- no, Norwegian, uses windows-1252: Windows Vista and Firefox agreed -->

<!-- nso-ZA, Sesotho sa Leboa (South Africa), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- oc-FR, Occitan (France), uses windows-1252: Windows Vista and Firefox agreed -->

<tr><td>pl
<td>Polish
<td>ISO-8859-2 <!-- Chrome and Firefox agreed (but disagreed with Windows Vista, which thought the encoding should be windows-1250) -->

<!-- pl-PL, Polish (Poland), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1250 -->

<!-- prs-AF, Dari (Afghanistan), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1256 -->

<!-- pt, Portuguese, uses windows-1252: Windows Vista and Firefox agreed -->

<!-- qut-GT, K'iche (Guatemala), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- quz-BO, Quechua (Bolivia), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- quz-EC, Quechua (Ecuador), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- quz-PE, Quechua (Peru), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- rm-CH, Romansh (Switzerland), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- ro, Romanian, is not listed here because Windows Vista wanted windows-1250, Chrome wanted ISO-8859-2, and Firefox wanted <none> -->

<!-- ro-RO, Romanian (Romania), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1250 -->

<tr><td>ru
<td>windows-1251
<td>Russian
<td>windows-1251 <!-- Windows Vista, Chrome, and Firefox agreed -->

<!-- rw-RW, Kinyarwanda (Rwanda), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- sah-RU, Yakut (Russia), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->

<!-- se-FI, Sami, Northern (Finland), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- se-NO, Sami, Northern (Norway), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- se-SE, Sami, Northern (Sweden), uses windows-1252: Windows Vista and Firefox agreed -->

<tr><td>sk
<td>windows-1250
<td>Slovak
<td>windows-1250 <!-- Windows Vista, Chrome, and Firefox agreed -->

<tr><td>sl
<td>ISO-8859-2
<td>Slovenian
<td>ISO-8859-2 <!-- Chrome and Firefox agreed (but disagreed with Windows Vista, which thought the encoding should be windows-1250) -->

<!-- sl-SI, Slovenian (Slovenia), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1250 -->

<!-- sma-NO, Sami, Southern (Norway), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- sma-SE, Sami, Southern (Sweden), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- smj-NO, Sami, Lule (Norway), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- smj-SE, Sami, Lule (Sweden), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- smn-FI, Sami, Inari (Finland), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- sms-FI, Sami, Skolt (Finland), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- sq, Albanian, is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1250 -->

<tr><td>sr
<td>UTF-8
<td>Serbian
<td>windows-1251 <!-- Windows Vista and Chrome agreed -->

<!-- sr-Latn-BA, Serbian (Latin, Bosnia and Herzegovina), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1250 -->

<!-- sr-Latn-SP, Serbian (Latin, Serbia), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1250 -->

<!-- sv, Swedish, uses windows-1252: Windows Vista, Chrome, and Firefox agreed -->

<!-- sw, Kiswahili, uses windows-1252: Windows Vista, Chrome, and Firefox agreed -->

<!-- ta, Tamil, uses windows-1252: Firefox and Chrome agreed -->

<!-- te, Telugu, uses windows-1252: Firefox and Chrome agreed -->

<!-- tg-Cyrl-TJ, Tajik (Cyrillic, Tajikistan), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->

<tr><td>th
<td>windows-874 <!-- TIS-620 -->
<td>Thai
<td>windows-874 <!-- Windows Vista, Chrome, and Firefox agreed -->

<!-- tk-TM, Turkmen (Turkmenistan), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1250 -->

<!-- tn-ZA, Setswana (South Africa), uses windows-1252: Windows Vista and Firefox agreed -->

<tr><td>tr
<td>Turkish
<td>windows-1254 <!-- Windows Vista, Chrome, and Firefox agreed -->

<tr><td>tr<!-- -TR -->
<td>windows-1254 <!-- ISO-8859-9 -->
<!-- tt, Tatar, is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->

<!-- tzm-Latn-DZ, Tamazight (Latin, Algeria), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- ug-CN, Uighur (PRC), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1256 -->

<tr><td>uk
<td>windows-1251
<td>Ukrainian
<td>windows-1251 <!-- Windows Vista, Chrome, and Firefox agreed -->

<!-- ur, Urdu, is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1256 -->

<!-- uz, Uzbek, is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1254 -->

<!-- uz-Cyrl-UZ, Uzbek (Cyrillic, Uzbekistan), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->

<tr><td>vi
<td>UTF-8
<td>Vietnamese
<td>windows-1258 <!-- Windows Vista and Chrome agreed -->

<!-- wee-DE, Lower Sorbian (Germany), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- wen-DE, Upper Sorbian (Germany), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- wo-SN, Wolof (Senegal), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- xh-ZA, isiXhosa (South Africa), uses windows-1252: Windows Vista and Firefox agreed -->

<!-- yo-NG, Yoruba (Nigeria), uses windows-1252: Windows Vista and Firefox agreed -->

<tr><td>zh-CN
<td>GB18030
<td>Chinese (People's Republic of China)
<td>GB18030 <!-- Windows Vista, Chrome, and Firefox agreed -->

<!-- zh-HK, Chinese (Hong Kong S.A.R.), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted Big5 -->

<!-- zh-Hans, Chinese (Simplified), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted GB18030 -->

<!-- zh-Hant, Chinese (Traditional), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted Big5 -->

<!-- zh-MO, Chinese (Macao S.A.R.), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted Big5 -->

<!-- zh-SG, Chinese (Singapore), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted GB18030 -->

<tr><td>zh-TW
<td>Big5
<td>Chinese (Taiwan)
<td>Big5 <!-- Windows Vista, Chrome, and Firefox agreed -->

<tr><td>All other locales
<!-- zu-ZA, isiZulu (South Africa), uses windows-1252: Windows Vista and Firefox agreed -->

<tr><td colspan=2>All other locales
<td>windows-1252

</table></li>
</table><p class=tablenote><small>The contents of this table are derived from the intersection of
Windows, Chrome, and Firefox defaults. For locales where these disagreed, user agents are
encouraged to try using UTF-8, and to report if another encoding is more successful.</small></p>


</li>

</ol><p>The <a href="#document's-character-encoding">document's character encoding</a> must immediately be set to the value returned
from this algorithm, at the same time as the user agent uses the returned value to select the
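
For readers who want to see how the revised table might be consumed, here is a minimal Python sketch of a locale-to-encoding lookup. The mapping is copied from the rows this commit adds. The function name `suggested_default_encoding`, the case-insensitive matching, and the fallback that strips trailing subtags (so that `hu-HU` resolves via `hu`) are assumptions made for illustration; the spec text in this diff does not say how a user agent should match a locale against the table.

```python
# Suggested default encodings, keyed by lowercased BCP 47 language tag.
# Values are taken from the rows added in this commit.
SUGGESTED_DEFAULTS = {
    "ar": "windows-1256",
    "bg": "windows-1251",
    "cs": "windows-1250",
    "et": "windows-1257",
    "fa": "windows-1256",
    "he": "windows-1255",
    "hr": "windows-1250",
    "hu": "ISO-8859-2",
    "ja": "Shift_JIS",
    "ko": "windows-949",
    "ku": "windows-1254",
    "lt": "windows-1257",
    "lv": "windows-1257",
    "pl": "ISO-8859-2",
    "ru": "windows-1251",
    "sk": "windows-1250",
    "sl": "ISO-8859-2",
    "sr": "windows-1251",
    "th": "windows-874",
    "tr": "windows-1254",
    "uk": "windows-1251",
    "vi": "windows-1258",
    "zh-cn": "GB18030",
    "zh-tw": "Big5",
}

# The table's final row: "All other locales".
FALLBACK_ENCODING = "windows-1252"


def suggested_default_encoding(locale: str) -> str:
    """Return the table's suggested default encoding for a locale tag.

    Assumption (not from the spec): try the full tag first, then drop
    trailing subtags one at a time, and only then use the fallback row.
    """
    tag = locale.strip().replace("_", "-").lower()
    while tag:
        if tag in SUGGESTED_DEFAULTS:
            return SUGGESTED_DEFAULTS[tag]
        # Drop the last subtag; becomes "" when no "-" remains.
        tag, _, _ = tag.rpartition("-")
    return FALLBACK_ENCODING


if __name__ == "__main__":
    for example in ("ru", "zh-TW", "hu-HU", "zh-HK", "en-US"):
        print(example, suggested_default_encoding(example))
```

Under this matching rule, `zh-HK` falls through to `windows-1252` because neither `zh-hk` nor `zh` appears in the table. That mirrors the table as written, but a real user agent may well want different behavior for such locales.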
