From Wikipedia, the free encyclopedia

While Hypertext Markup Language (HTML) has been in use since 1991, HTML 4.0 from December 1997 was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the information's integrity, and universal browser display.

Specifying the document's character encoding

There are two general ways to specify which character encoding is used in the document.

First, the web server can include the character encoding or "charset" in the Hypertext Transfer Protocol (HTTP) Content-Type header, which would typically look like this:[1]

Content-Type: text/html; charset=utf-8

This method gives the HTTP server a convenient way to alter the document's encoding according to content negotiation; certain HTTP server software can do this, for example Apache with the module mod_charset_lite.[2]
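
As an illustration of how a client might read the charset from such a header, here is a minimal Python sketch. The helper name charset_from_content_type is hypothetical; it relies only on the standard library's email.message.Message, which parses header parameters, and it is not meant as a complete HTTP implementation.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

A value of None means the header carries no encoding label, so the document itself (or detection) has to supply one.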

Second, a declaration can be included within the document itself.

For HTML it is possible to include this information inside the head element near the top of the document:[3]

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

HTML5 also allows the following syntax to mean exactly the same:[3]

<meta charset="utf-8">

XHTML documents have a third option: to express the character encoding via XML declaration, as follows:[4]

<?xml version="1.0" encoding="utf-8"?>

With this second approach, because the character encoding cannot be known until the declaration is parsed, there is a problem knowing which character encoding is used in the document up to and including the declaration itself. If the character encoding is an ASCII extension then the content up to and including the declaration itself should be pure ASCII and this will work correctly. For character encodings that are not ASCII extensions (i.e. not a superset of ASCII), such as UTF-16BE and UTF-16LE, a processor of HTML, such as a web browser, should be able to parse the declaration in some cases through the use of heuristics.

Encoding detection algorithm

As of HTML5 the recommended charset is UTF-8.[3] An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:

  1. Explicit user instruction
  2. An explicit meta tag within the first 1024 bytes of the document
  3. A byte order mark (BOM) within the first three bytes of the document
  4. The HTTP Content-Type or other transport layer information
  5. Analysis of the document bytes looking for specific sequences or ranges of byte values,[5] and other tentative detection mechanisms.
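
The inputs listed above can be combined in a much-simplified sketch like the following. Several things here are assumptions rather than statements from the text: the order in which the sources are consulted (BOM, then user override, then HTTP label, then meta tag), the regular expression standing in for the standard's meta prescan, the function name guess_encoding, and the windows-1252 fallback. Byte-pattern analysis (item 5) is left out entirely.

```python
import re

# Byte order marks and the encodings they signal.
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
)

# Rough stand-in for the meta prescan; real parsers are far more careful.
META_CHARSET = re.compile(rb"""<meta[^>]*?charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE)


def guess_encoding(data, http_charset=None, user_override=None, fallback="windows-1252"):
    """Pick an encoding label for raw document bytes (assumed priority order)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    if user_override:
        return user_override.lower()
    if http_charset:
        return http_charset.lower()
    # Only the first 1024 bytes are searched for a meta declaration.
    match = META_CHARSET.search(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return fallback  # hypothetical default, not taken from the text


print(guess_encoding(b'<meta charset="utf-8"><title>x</title>'))  # utf-8
```

The sketch returns a label only; mapping labels to actual decoders is a separate step.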

If the wrong ASCII-compatible encoding is assumed, characters outside the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for English-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In Chinese, Japanese, and Korean (CJK) language environments, where several different multi-byte encodings are in use, auto-detection is also often employed. Finally, browsers usually permit the user to override an incorrect charset label manually.

It is increasingly common for multilingual websites and websites in non-Western languages to use UTF-8, which allows use of the same encoding for all languages. UTF-16 or UTF-32, which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a byte-oriented ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
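
The size argument can be checked with a small Python sketch. The sample markup and the helper name encoded_sizes are made up for illustration; the "-le" codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# 17 ASCII characters of markup around a five-letter Greek word.
sample = '<p lang="el">\u03bb\u03cc\u03b3\u03bf\u03c2</p>'
print(encoded_sizes(sample))
# {'utf-8': 27, 'utf-16-le': 44, 'utf-32-le': 88}
```

Because the markup itself is ASCII, UTF-8 stays the smallest here even though each Greek letter takes two bytes.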

Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some platform-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers on different platforms or with different native languages will not see the page as intended.
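
A short Python sketch of this situation follows. The choice of Windows-1252 as the creator's platform default and Windows-1251 as another reader's default is an assumption made only for the example.

```python
original = "caf\u00e9"              # the word "café"
stored = original.encode("cp1252")  # unlabelled bytes, creator's default (assumed)

same_platform = stored.decode("cp1252")   # reader sharing the creator's default
other_platform = stored.decode("cp1251")  # reader with a Cyrillic default

print(same_platform == original)   # True
print(other_platform == original)  # False: byte 0xE9 is a different letter there
```

The first reader sees the intended text only because both sides happen to share a default, not because the encoding was declared.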

Permitted encodings

The WHATWG Encoding Standard, referenced by recent HTML standards (the current WHATWG HTML Living Standard, as well as the formerly competing W3C HTML 5.0 and 5.1) specifies a list of encodings which browsers must support. The HTML standards forbid support of other encodings.[6][7][8] The Encoding Standard further stipulates that new formats, new protocols (even when existing formats are used) and authors of new documents are required to use UTF-8 exclusively.[9]

Besides UTF-8, the HTML standard itself explicitly lists further encodings, with reference to the Encoding Standard.[8] The following notes apply to individual entries:

  1. Windows-874: Also specified for TIS-620, ISO-8859-11 and related labels.[9]
  2. Windows-1252: Also specified for ASCII, ISO-8859-1 and related labels.[9]
  3. Windows-1254: Also specified for ISO-8859-9 and related labels.[9]
  4. GB 18030: Specified with 0xA3A0 as a duplicate encoding of the ideographic space (U+3000) for compatibility reasons, and as such excluding U+E5E5 (a private use character).[10][11] Also specified with 0x80 accepted as an alternative encoding of the euro sign (U+20AC; see Windows-936).[12] Otherwise, follows the mappings from the 2005 standard.[11]
  5. Big5: Hong Kong Supplementary Character Set variant,[13] although most of the HKSCS extensions (those with lead bytes less than 0xA1) are not included by the encoder, only by the decoder.[14]
  6. Shift JIS: The specification includes IBM and NEC extensions,[15] and is more precisely Windows-31J.[13]
  7. ISO-2022-JP: The specification uses the same index as used for Shift JIS (insofar as is within reach), i.e. includes NEC extensions. Half-width kana is converted to fullwidth by the encoder,[16] but accepted using an escape sequence (ESC 0x28 0x49) by the decoder.[17] Shift Out and Shift In (0x0E and 0x0F) are excluded entirely to prevent attacks.[17][18]
  8. EUC-KR: Actually Unified Hangul Code (Windows-949), which is a superset that covers the entire Hangul Syllables block.[13][19]
  9. UTF-16BE: Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in UTF-8.[20]
  10. UTF-16LE: For compatibility with deployed content, also specified for the plain UTF-16 label,[21] although a byte order mark (BOM), if present, takes priority over any label.[22] Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in UTF-8.[20]
  11. x-user-defined: Maps 0x00 through 0x7F to U+0000 through U+007F, and 0x80 through 0xFF to U+F780 through U+F7FF (a Private Use Area range), such that the low 8 bits of the code point always match the original byte.[23]
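
The mapping in the last note above (its cited section is titled x-user-defined) is simple enough to sketch directly. The function name decode_x_user_defined is hypothetical, and this is an illustration of the stated byte-to-code-point rule rather than a conforming decoder.

```python
def decode_x_user_defined(data):
    """Map bytes 0x00-0x7F to U+0000-U+007F and 0x80-0xFF to U+F780-U+F7FF."""
    return "".join(
        chr(byte) if byte < 0x80 else chr(0xF780 + (byte - 0x80))
        for byte in data
    )


raw = b"A\x80\xff"
decoded = decode_x_user_defined(raw)
# The low 8 bits of every code point give back the original byte.
print([hex(ord(ch)) for ch in decoded])  # ['0x41', '0xf780', '0xf7ff']
```

Masking each code point with 0xFF therefore recovers the original byte stream, which is the property the note describes.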

Additional encodings are listed in the Encoding Standard, and support for them is therefore also required.[9] The following notes apply to individual entries:

  1. ISO-8859-8-I: Uses the same encoder and decoder as ISO-8859-8, but is not subject to the visual-order behaviour which is used for documents labelled as ISO-8859-8.[24]
  2. KOI8-U: Titled KOI8-U and specified for both KOI8-U and KOI8-RU labels;[9] follows KOI8-RU in positions 0xAE and 0xBE (i.e. includes Ў/ў)[25][26] but KOI8-U in positions 0x93–9F.[25]
  3. GBK: Also specified for GB2312 and related labels. Handled the same as GB 18030 for decoding purposes.[27] For encoding purposes, labelling as GBK (or GB 2312) excludes four-byte codes, and favours the one-byte 0x80 representation for U+20AC.[10]
  4. EUC-JP: The specification uses the same index as used for Shift JIS (insofar as is within reach of the EUC code set 1), i.e. includes NEC extensions. JIS X 0212 is included for decoding only.[28]

The HTML standard also gives explicit examples of forbidden encodings.[8]

The standard also defines a "replacement" decoder, which maps all content labelled as certain encodings to the replacement character (U+FFFD, �), refusing to process it at all. This is intended to prevent attacks (e.g. cross-site scripting) which may exploit a difference between the client and server in what encodings are supported in order to mask malicious content.[29] Although the same security concern applies to ISO-2022-JP and UTF-16, which also allow sequences of ASCII bytes to be interpreted differently, this approach was not seen as feasible for them since they are comparatively more frequently used in deployed content.[30] The labels that receive this treatment are listed in the Encoding Standard.[31]

Character references

In addition to native character encodings, characters can also be encoded as character references, which can be numeric character references (decimal or hexadecimal) or character entity references. Character entity references are also sometimes referred to as named entities, or HTML entities for HTML. HTML's usage of character references derives from SGML.

HTML character references

A numeric character reference in HTML refers to a character by its Universal Character Set/Unicode code point, and uses the format

&#nnnn;

or

&#xhhhh;

where nnnn is the code point in decimal form, and hhhh is the code point in hexadecimal form. The x must be lowercase in XML documents. The nnnn or hhhh may be any number of digits and may include leading zeros. The hhhh may mix uppercase and lowercase, though uppercase is the usual style.
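
To make the two formats concrete, the following Python sketch builds numeric character references and decodes them again. The helper numeric_reference is hypothetical; html.unescape is the standard library function for resolving character references.

```python
import html


def numeric_reference(char, hexadecimal=False):
    """Return the decimal or hexadecimal numeric character reference for one character."""
    code_point = ord(char)
    return f"&#x{code_point:X};" if hexadecimal else f"&#{code_point};"


print(numeric_reference("\u03bb"))                    # &#955;
print(numeric_reference("\u03bb", hexadecimal=True))  # &#x3BB;

# Leading zeros and mixed-case hex digits resolve to the same character.
print(html.unescape("&#955;") == html.unescape("&#x003bB;") == "\u03bb")  # True
```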

Not all web browsers or email clients used by receivers of HTML documents, or text editors used by authors of HTML documents, will be able to render all HTML characters. Most modern software is able to display most or all of the characters for the user's language, and will draw a box or other clear indicator for characters it cannot render.

For codes from 0 to 127, the original 7-bit ASCII standard set, most of these characters can be used without a character reference. Codes from 160 to 255 can all be created using character entity names. Only a few higher-numbered codes can be created using entity names, but all can be created by decimal number character reference.

Character entity references can also have the format &name; where name is a case-sensitive alphanumeric string. For example, "λ" can also be encoded as &lambda; in an HTML document. The character entity references &lt;, &gt;, &quot; and &amp; are predefined in HTML and SGML, because <, >, " and & are already used to delimit markup. This notably did not include XML's &apos; (') entity prior to HTML5. For a list of all named HTML character entity references along with the versions in which they were introduced, see List of XML and HTML character entity references.
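
Python's standard library carries tables of these names, which gives a quick way to check the points above. In this sketch, html.entities.name2codepoint holds the HTML 4.01 names and html.entities.html5 the HTML5 ones; treating them as stand-ins for "before HTML5" and "HTML5" is the only assumption.

```python
import html
from html.entities import html5, name2codepoint

# Named reference for the example character.
print(html.unescape("&lambda;") == "\u03bb")  # True

# Names are case-sensitive: lambda and Lambda are different characters.
print(hex(name2codepoint["lambda"]), hex(name2codepoint["Lambda"]))  # 0x3bb 0x39b

# apos is absent from the HTML 4.01 table but present in the HTML5 table.
print("apos" in name2codepoint, "apos;" in html5)  # False True
```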

Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for markup delimiting characters as mentioned above, and for a few special characters (or none at all if a native Unicode encoding like UTF-8 is used). Incorrect HTML entity escaping may also open up security vulnerabilities for injection attacks such as cross-site scripting. If HTML attributes are left unquoted, certain characters, most importantly whitespace, such as space and tab, must be escaped using entities. Other languages related to HTML have their own methods of escaping characters.
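
A minimal sketch of escaping only the markup-delimiting characters before inserting untrusted text into a page is shown below. The function names render_comment and render_link are hypothetical; html.escape is the standard library helper, and by default it also escapes both kinds of quotation mark. Attribute values are always quoted here, which avoids the extra escaping that unquoted attributes would need.

```python
import html


def render_comment(user_text):
    """Wrap untrusted text in a paragraph, escaping <, >, & and quotes."""
    return "<p>" + html.escape(user_text) + "</p>"


def render_link(url, label):
    """Build an anchor with a quoted href; both parts are escaped."""
    return '<a href="' + html.escape(url, quote=True) + '">' + html.escape(label) + "</a>"


print(render_comment("<script>alert(1)</script>"))
# <p>&lt;script&gt;alert(1)&lt;/script&gt;</p>
```

Non-ASCII text passes through unchanged, so with UTF-8 as the page encoding no further character references are needed.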

XML character references

Unlike traditional HTML with its large range of character entity references, in XML there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:[32]

&amp; & ampersand U+0026
&lt; < less-than sign U+003C
&gt; > greater-than sign U+003E
&quot; " quotation mark U+0022
&apos; ' apostrophe U+0027

All other character entity references have to be defined before they can be used. For example, use of &eacute; (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the x in hexadecimal numeric references be in lowercase: for example &#xA1b; rather than &#XA1b;. XHTML, which is an XML application, supports the HTML entity set, along with XML's predefined entities.
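
These rules can be observed with any XML parser. The sketch below uses Python's xml.etree.ElementTree; the helper name parses is hypothetical and simply reports whether a string is accepted.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if xml_text is accepted by the XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The five predefined entities need no declaration.
print(ET.fromstring("<p>&amp;&lt;&gt;&quot;&apos;</p>").text)  # &<>"'

print(parses("<p>&eacute;</p>"))  # False: entity not defined
print(parses("<p>&#xE9;</p>"))    # True: lowercase x
print(parses("<p>&#XE9;</p>"))    # False: uppercase X is not allowed in XML
```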

References

  1. Fielding, R.; Reschke, J. (June 2014), "Content-Type", in Fielding, R.; Reschke, J. (eds.), Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content, IETF, doi:10.17487/RFC7231, S2CID 14399078, retrieved 30 July 2014
  2. "Apache Module mod_charset_lite".
  3. "Specifying the document's character encoding", HTML5, World Wide Web Consortium, 14 December 2017, retrieved 28 May 2018
  4. Bray, T.; Paoli, J.; Sperberg-McQueen, C.; Maler, E.; Yergeau, F. (26 November 2008), "Prolog and Document Type Declaration", XML, W3C, retrieved 8 March 2010
  5. "HTML5 prescan a byte stream to determine its encoding".
  6. "8.2.2.3. Character encodings". HTML 5.1 Standard. W3C.
  7. "8.2.2.3. Character encodings". HTML 5 Standard. W3C.
  8. "12.2.3.3 Character encodings". HTML Living Standard. WHATWG.
  9. van Kesteren, Anne. "4.2: Names and labels". Encoding Standard. WHATWG.
  10. van Kesteren, Anne. "10.2.2. gb18030 encoder". Encoding Standard. WHATWG.
  11. van Kesteren, Anne. "5. Indexes (§ index gb18030)". Encoding Standard. WHATWG.
  12. van Kesteren, Anne. "10.2.1. gb18030 decoder". Encoding Standard. WHATWG.
  13. Mozilla Foundation. "Notable Differences from IANA Naming". Crate encoding_rs. docs.rs.
  14. van Kesteren, Anne. "5. Indexes (§ index Big5 pointer)". Encoding Standard. WHATWG.
  15. van Kesteren, Anne. "5. Indexes (§ Index jis0208)". Encoding Standard. WHATWG.
  16. van Kesteren, Anne. "5. Indexes (§ Index ISO-2022-JP katakana)". Encoding Standard. WHATWG.
  17. van Kesteren, Anne. "12.2.1. ISO-2022-JP decoder". Encoding Standard. WHATWG.
  18. van Kesteren, Anne. "12.2.2. ISO-2022-JP encoder". Encoding Standard. WHATWG.
  19. van Kesteren, Anne. "5. Indexes (§ index EUC-KR)". Encoding Standard. WHATWG.
  20. van Kesteren, Anne. "4.3. Output encodings". Encoding Standard. WHATWG.
  21. van Kesteren, Anne. "14.4. UTF-16LE". Encoding Standard. WHATWG.
  22. van Kesteren, Anne. "6. Hooks for standards (§ decode)". Encoding Standard. WHATWG.
  23. van Kesteren, Anne. "14.5. x-user-defined". Encoding Standard. WHATWG.
  24. van Kesteren, Anne. "9. Legacy single-byte encodings (§ Note)". Encoding Standard. WHATWG.
  25. van Kesteren, Anne. "index KOI8-U visualization". Encoding Standard. WHATWG.
  26. "Bug 17053: Support KOI8-RU mapping for KOI8-U". W3C Bugzilla. 19 August 2015.
  27. van Kesteren, Anne. "10.1. GBK". Encoding Standard. WHATWG.
  28. van Kesteren, Anne. "5. Indexes (§ Index jis0212)". Encoding Standard. WHATWG.
  29. van Kesteren, Anne. "14.1: replacement". Encoding Standard. WHATWG.
  30. van Kesteren, Anne. "2: Security background". Encoding Standard. WHATWG.
  31. van Kesteren, Anne. "4.2: Names and labels (§ replacement)". Encoding Standard. WHATWG.
  32. Bray, T.; Paoli, J.; Sperberg-McQueen, C.; Maler, E.; Yergeau, F. (26 November 2008), "Character and Entity References", XML, W3C, retrieved 8 March 2010