5 HTML Document Representation

This chapter discusses how HTML documents are represented on the computer and the Internet.

The section on the document character set addresses the question of which abstract characters may be part of an HTML document. Such characters include the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.

The section on character encodings addresses the question of how those characters may be represented in a file or during transmission over the Internet. As some character encodings cannot directly represent all characters an author may want to include in a document, HTML offers other mechanisms, called character references, for referring to any character.

Since human languages use a great number of characters, and there is a great variety of ways to represent them, care must be taken so that documents can be understood by user agents around the world.

5.1 The document character set

To promote interoperability, SGML requires that each application (including HTML) specify its document character set. A document character set consists of:

  • A repertoire: a set of abstract characters, such as the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.
  • Code positions: a set of integer references to characters in the repertoire.

Each SGML document (including each HTML document) is a sequence of characters from the repertoire. Computer systems identify each character by its code position; for example, in the ASCII character set, code positions 65, 66, and 67 refer to the characters 'A', 'B', and 'C', respectively.
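As a small illustration of code positions, the following Python sketch uses the built-in ord and chr functions to map between characters and their integer codes. Python strings use Unicode code points, which coincide with ASCII for these values; the variable names are illustrative only.

```python
# Map characters to their integer code positions and back.
codes = [ord(ch) for ch in "ABC"]        # character -> code position
chars = "".join(chr(c) for c in codes)   # code position -> character

print(codes)
print(chars)
```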

The ASCII character set is not sufficient for a global information system such as the Web, so HTML uses a much more complete character set called the Universal Character Set (UCS), defined in [ISO10646]. This standard defines a repertoire of thousands of characters used by communities all over the world.

The character set defined in [ISO10646] is character-by-character equivalent to Unicode 2.0 ([UNICODE]). Both of these standards are updated from time to time with new characters, and the amendments should be consulted at the respective Web sites. In this specification, references to ISO/IEC-10646 or Unicode imply the same document character set. However, the HTML specification also refers to the Unicode specification for other issues, such as the bidirectional text algorithm.

The document character set, however, does not suffice to allow user agents to correctly interpret HTML documents as they are typically exchanged, that is, encoded as a sequence of bytes in a file or during a network transmission. User agents must also know the specific character encoding that was used to transform the document character stream into a byte stream.

5.2 Character encodings

What this specification calls a character encoding is known by different names in other specifications (which may cause some confusion). However, the concept is largely the same across the Internet. Also, protocol headers, attributes, and parameters referring to character encodings share the same name, "charset", and use the same values from the [IANA] registry (see [CHARSETS] for a complete list).

The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters. This conversion fits naturally with the scheme of Web activity: servers send HTML documents to user agents as a stream of bytes; user agents interpret them as a sequence of characters. The conversion method can range from simple one-to-one correspondence to complex switching schemes or algorithms.
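The byte-to-character conversion described here can be sketched in Python, whose bytes.decode method takes an encoding label. This is only an illustration: the sample text and variable names are assumptions, and Python's codec labels (such as "euc_jp") are its own spellings of the registered names.

```python
# The same characters serialized with two different character encodings.
text = "水 A"
as_euc_jp = text.encode("euc_jp")
as_utf8 = text.encode("utf-8")

# The byte streams differ, but decoding each with its correct label
# recovers the same sequence of characters.
decoded_euc = as_euc_jp.decode("euc_jp")
decoded_utf8 = as_utf8.decode("utf-8")

print(as_euc_jp != as_utf8, decoded_euc == decoded_utf8)
```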

A simple one-byte-per-character encoding technique is not sufficient for text strings over a character repertoire as large as that of [ISO10646]. In addition to encodings of the entire character set (such as UCS-4), there are several different encodings of parts of [ISO10646].

5.2.1 Choosing an encoding

Authoring tools (e.g., text editors) may encode HTML documents in the character encoding of their choice, and the choice largely depends on the conventions used by the system software. These tools may employ any convenient encoding that covers most of the characters contained in the document, provided the encoding is correctly labeled. Occasional characters that fall outside this encoding may still be represented by character references. Character references always refer to the document character set, not to the character encoding.
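As a hedged illustration of how an authoring tool might fall back to character references, Python's str.encode accepts the error handler "xmlcharrefreplace", which writes any character the target encoding cannot represent as a decimal numeric character reference. The sample text is an assumption chosen for the example.

```python
text = "Å and И"

# "Å" exists in ISO-8859-1 and is encoded directly; the Cyrillic "И" does not,
# so it is replaced by a numeric character reference to its code position.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(encoded)
```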

Servers and proxies may change a character encoding (called transcoding) on the fly to meet the requests of user agents (see section 14.2 of [RFC2068], the "Accept-Charset" HTTP request header). Servers and proxies do not have to serve a document in a character encoding that covers the entire document character set.

Commonly used character encodings on the Web include ISO-8859-1 (also referred to as "Latin-1"; usable for most Western European languages), ISO-8859-5 (which supports Cyrillic), SHIFT_JIS (a Japanese encoding), EUC-JP (another Japanese encoding), and UTF-8 (an encoding of ISO 10646 using a different number of bytes for different characters). Names of character encodings are case-insensitive, so that, for example, "SHIFT_JIS", "Shift_JIS", and "shift_jis" are equivalent.
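Both points can be checked with a short Python sketch. It assumes Python's codecs registry as a stand-in for a user agent's encoding table; codecs.lookup resolves labels without regard to case.

```python
import codecs

# Encoding labels are matched case-insensitively: all three resolve to one codec.
names = {codecs.lookup(label).name for label in ("SHIFT_JIS", "Shift_JIS", "shift_jis")}

# UTF-8 uses a different number of bytes for different characters.
lengths = {ch: len(ch.encode("utf-8")) for ch in ("A", "И", "水")}

print(names)
print(lengths)
```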

This specification does not mandate which character encodings a user agent must support.

Conforming user agents must correctly map to Unicode all characters in any character encoding that they recognize.

Notes on specific encodings  

When HTML text is transmitted in UTF-16 (charset=UTF-16), text data should be transmitted in network byte order ("big-endian", high-order byte first) in accordance with [ISO10646], Section 6.3, and [UNICODE], clause C3, page 3-1.

Furthermore, to maximize the chances of proper interpretation, it is recommended that documents transmitted as UTF-16 always begin with a ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called the Byte Order Mark (BOM)) which, when byte-reversed, becomes hexadecimal FFFE, a code that is guaranteed never to be assigned to a character. Thus, a user agent receiving hexadecimal FFFE as the first bytes of a text would know that the bytes have to be reversed for the remainder of the text.
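The byte-order check can be sketched as follows. This is a minimal illustration, not a prescribed algorithm: the function name decode_utf16 is hypothetical, and falling back to big-endian when no BOM is present is an assumption based on the network byte order recommendation.

```python
import codecs

def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, using a leading BOM to pick the byte order."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF: already in network order
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE: bytes are reversed
        return data[2:].decode("utf-16-le")
    # No BOM: assume network byte order (big-endian).
    return data.decode("utf-16-be")

print(decode_utf16(b"\xff\xfeA\x00"))
```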

The UTF-1 transformation format of [ISO10646] (registered by IANA as ISO-10646-UTF-1) should not be used. For information about ISO 8859-8 and the bidirectional algorithm, see the section on bidirectionality and character encoding.

5.2.2 Specifying the character encoding

How does a server determine which character encoding applies to a document it serves? Some servers examine the first few bytes of the document, or check against a database of known files and encodings. Many modern Web servers give administrators more control over charset configuration than older servers do. Web server administrators should use these mechanisms to send a "charset" parameter whenever possible, but should take care not to label a document with the wrong "charset" parameter value.

How does a user agent know which character encoding has been used? The server should provide this information. The most straightforward way for a server to inform the user agent about the character encoding of the document is to use the "charset" parameter of the "Content-Type" header field of the HTTP protocol ([RFC2068], sections 3.4 and 14.18). For example, the following HTTP header announces that the character encoding is EUC-JP:

 Content-Type: text/html; charset=EUC-JP

For the definition of text/html, see the section on conformance.

The HTTP protocol ([RFC2068], section 3.7.1) mentions ISO-8859-1 as the default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers do not allow a "charset" parameter to be sent, and others may not be configured to send it. Therefore, user agents must not assume any default value for the "charset" parameter.
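A minimal sketch of reading the "charset" parameter without assuming a default, using the standard library's email.message.Message to parse the header value. The function name charset_from_content_type is hypothetical; it returns None when the parameter is missing so that the caller can consult other sources.

```python
from email.message import Message

def charset_from_content_type(value: str):
    """Return the "charset" parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_param("charset")   # None when the parameter is absent

print(charset_from_content_type("text/html; charset=EUC-JP"))
print(charset_from_content_type("text/html"))
```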

To address server or configuration limitations, HTML documents may include explicit information about the document's character encoding; the META element can be used to provide user agents with this information.

For example, to specify that the character encoding of the current document is "EUC-JP", a document should include the following META declaration:

 <META http-equiv="Content-Type" content="text/html; charset=EUC-JP">

The META declaration must only be used when the character encoding is organized such that ASCII characters stand for themselves (at least until the META element is parsed). META declarations should appear as early as possible in the HEAD element.
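Because ASCII characters stand for themselves in such encodings, a user agent can scan the raw bytes for the declaration before it knows the encoding. The sketch below is a deliberately simplified illustration: the pattern and the function name sniff_meta_charset are hypothetical and do not cover every way a META declaration may be written.

```python
import re

# Simplified, hypothetical pattern: a META tag whose content attribute
# carries a "charset=..." value.
META_CHARSET = re.compile(
    rb"""<meta[^>]+content\s*=\s*["'][^"'>]*charset\s*=\s*([\w.:-]+)""",
    re.IGNORECASE,
)

def sniff_meta_charset(head: bytes):
    """Return the charset named in a META declaration, or None if not found."""
    match = META_CHARSET.search(head)
    return match.group(1).decode("ascii") if match else None

sample = b'<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">'
print(sniff_meta_charset(sample))
```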

For cases where neither the HTTP protocol nor the META element provides information about the character encoding of a document, HTML also provides the charset attribute on several elements. By combining these mechanisms, an author can greatly improve the chances that, when the user retrieves a resource, the user agent will recognize the character encoding.

To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

  1. An HTTP "charset" parameter in a "Content-Type" field.
  2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
  3. The charset attribute set on an element that designates an external resource.
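The priority order can be expressed as a short sketch. The function pick_encoding and its parameters are hypothetical; it assumes the three values have already been extracted from their respective sources and are None when a source gives no information.

```python
def pick_encoding(http_charset=None, meta_charset=None, link_charset=None):
    """Return the first available charset, from highest priority to lowest."""
    for candidate in (http_charset, meta_charset, link_charset):
        if candidate:
            return candidate
    return None   # no declaration: heuristics or user settings may apply

print(pick_encoding(http_charset=None, meta_charset="EUC-JP", link_charset="ISO-8859-1"))
```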

In addition to this list of priorities, the user agent may use heuristics and user settings. For example, many user agents use a heuristic to distinguish the various encodings used for Japanese text. Also, user agents typically have a user-definable, local default character encoding, which they apply in the absence of other indicators.

User agents may provide a mechanism that allows users to override incorrect "charset" information. However, if a user agent offers such a mechanism, it should only offer it for browsing and not for editing, to avoid the creation of Web pages marked with an incorrect "charset" parameter.

Note. If, for a specific application, it becomes necessary to refer to characters outside [ISO10646], the characters should be assigned to a private zone to avoid conflicts with present or future versions of the standard. This is highly discouraged, however, for reasons of portability.

5.3 Character references

A given character encoding may not be able to express all characters of the document character set. For such encodings, or when hardware or software configurations do not allow users to input some document characters directly, authors may use SGML character references. Character references are a character encoding-independent mechanism for entering any character from the document character set.

Character references in HTML may take two forms:

  • Numeric character references (either decimal or hexadecimal).
  • Character entity references.

Character references within comments have no special meaning; they are comment data only.

Note. HTML provides other ways to present character data, in particular inline images.

Note. In SGML, it is possible to omit the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be omitted (e.g., in the middle of a word). We suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.

5.3.1 Numeric character references

Numeric character references specify the code position of a character in the document character set. Numeric character references may take two forms:

  • The syntax "&#D;", where D is a decimal number, refers to the Unicode character with decimal number D.
  • The syntax "&#xH;" or "&#XH;", where H is a hexadecimal number, refers to the Unicode character with hexadecimal number H. Hexadecimal numbers in numeric character references are case-insensitive.

Here are some examples of numeric character references:

  • &#229; (in decimal) represents the letter "a" with a small circle above it (used, for example, in Norwegian).
  • &#xE5; (in hexadecimal) represents the same character.
  • &#Xe5; (in hexadecimal) represents the same character as well.
  • &#1048; (in decimal) represents the Cyrillic capital letter "I".
  • &#x6C34; (in hexadecimal) represents the Chinese character for "water".
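These examples can be checked with Python's html.unescape, which resolves numeric character references to Unicode characters. The helper to_hex_ref is a hypothetical name for the reverse direction, producing the hexadecimal form.

```python
import html

refs = ["&#229;", "&#xE5;", "&#Xe5;", "&#1048;", "&#x6C34;"]
decoded = [html.unescape(ref) for ref in refs]

def to_hex_ref(ch: str) -> str:
    """Write a single character as a hexadecimal numeric character reference."""
    return f"&#x{ord(ch):X};"

print(decoded)
print(to_hex_ref("水"))
```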

Note. Although the hexadecimal representation is not defined in [ISO8879], it is expected to be in the revision, as described in [WEBSGML]. This convention is particularly useful since character standards generally use hexadecimal representations.

5.3.2 Character entity references

To give authors a more intuitive way of referring to characters in the document character set, HTML offers a set of character entity references. Character entity references use symbolic names so that authors need not remember code positions. For example, the character entity reference &aring; refers to the lowercase "a" character topped with a ring; "&aring;" is easier to remember than &#229;.

HTML 4.0 does not define a character entity reference for every character. For instance, there is no character entity reference for the Cyrillic capital letter "I". See the full list of character references defined in HTML 4.0.

Character entity references are case-sensitive. Thus, &Aring; refers to a different character (uppercase A with a ring) than &aring; (lowercase a with a ring).
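As an illustration, Python's html.unescape resolves these two entity names to distinct characters, which shows that the case of the name matters. This assumes that Python's entity table agrees with the HTML 4.0 list for these names.

```python
import html

lower = html.unescape("&aring;")   # lowercase a with ring
upper = html.unescape("&Aring;")   # uppercase A with ring

print(lower, upper, lower != upper)
```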

Four character entity references deserve special mention since they are frequently used to escape special characters:

  • "&lt;" represents the < sign.
  • "&gt;" represents the > sign.
  • "&amp;" represents the & sign.
  • "&quot;" represents the " mark.

Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter).

Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.

Some authors use the character entity reference "&quot;" to encode instances of the double quote mark (") since that character may be used to delimit attribute values.
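A short sketch of this escaping using Python's html.escape, which replaces "&", "<", and ">" and, with quote=True, the double quote mark as well. The sample string is an assumption made for the example.

```python
import html

raw = 'if a < b && c > "d"'
escaped = html.escape(raw, quote=True)

# Unescaping restores the original text.
restored = html.unescape(escaped)

print(escaped)
```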

5.4 Undisplayable characters

A user agent may not be able to render all characters in a document meaningfully, for instance, because the user agent lacks a suitable font, or because a character has a value that cannot be expressed in the user agent's internal character encoding, etc.

Because there are many different things that may be done in such cases, this document does not prescribe any specific behavior. Depending on the implementation, undisplayable characters may also be handled by the underlying display system and not the application itself. In the absence of more sophisticated behavior, for example behavior tailored to the needs of a particular script or language, we recommend the following behavior for user agents:

  1. Adopt a clearly visible, but unobtrusive mechanism to alert the user of missing resources.
  2. If missing characters are presented using their numeric representation, use the hexadecimal (not decimal) form, since this is the form used in character set standards.
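One possible way to follow the second recommendation is sketched below. The function render_with_placeholders, the bracketed "U+" notation, and the idea of passing in a set of displayable characters are all assumptions made for illustration; a real user agent would consult its fonts instead.

```python
def render_with_placeholders(text: str, displayable: set) -> str:
    """Replace characters that cannot be displayed with a hexadecimal marker."""
    out = []
    for ch in text:
        if ch in displayable:
            out.append(ch)
        else:
            out.append(f"[U+{ord(ch):04X}]")   # hexadecimal, not decimal
    return "".join(out)

print(render_with_placeholders("A水", {"A"}))
```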