8 Language information and text direction

This section of the document discusses two important issues that affect the internationalization of HTML: specifying the language (the lang attribute) and the direction (the dir attribute) of text in a document.

8.1 Specifying the language of content: the lang attribute

Attribute definitions
lang = language-code [CI]
This attribute specifies the base language of an element's attribute values and text content. The default value of this attribute is unknown.

Language information specified via the lang attribute may be used by a user agent to control rendering in a variety of ways. Some situations where author-supplied language information may be helpful include:

  • Help search engines
  • Help speech synthesizers
  • Help user agents in the choice of glyph variants for high quality typography
  • Help the user agent in the selection of a set of quotation marks
  • Help the user agent make decisions about hyphenation, ligatures, and spacing
  • Help spell checkers and grammar checkers

The lang attribute specifies the language of element content and attribute values; whether it is relevant for a given attribute depends on the syntax and semantics of the attribute and the operation involved.

The intent of the lang attribute is to allow user agents to render content more meaningfully, based on accepted cultural practice for a given language. This does not imply that user agents should render characters that are atypical for a particular language in less meaningful ways; user agents must make a best attempt to render all characters, regardless of the value specified by lang.

For instance, if a character from the Greek alphabet appears in the midst of Russian text:

  <P><Q lang="ru">"This was the result of extra &gamma;-radiation,"</Q> he explained.</P>

a user agent (1) should try to render the Russian content in an appropriate manner (e.g., in its handling of the quotation marks) and (2) must make a best attempt to render the gamma character even though it is not a Russian character.

Please consult the section on undisplayable characters for related information.

8.1.1 Language codes

The value of the lang attribute is a language code that identifies a natural language spoken, written, or otherwise used for the communication of information among people. Computer languages are explicitly excluded from language codes.

[RFC1766] defines and explains the language codes that must be used in HTML documents.

Briefly, language codes consist of a primary code and a possibly empty series of subcodes:

  language-code = primary-code ( "-" subcode )*

Here are some examples of language codes:

  • "en": English
  • "en-US": the U.S. version of English.
  • "en-cockney": the Cockney version of English.
  • "i-navajo": the Navajo language spoken by some Native Americans.
  • "x-klingon": the primary code "x" indicates an experimental language code.

Two-letter primary codes are reserved for [ISO639] language abbreviations. Two-letter codes include fr (French), de (German), it (Italian), nl (Dutch), el (Greek), es (Spanish), pt (Portuguese), ar (Arabic), he (Hebrew), ru (Russian), zh (Chinese), ja (Japanese), hi (Hindi), ur (Urdu), and sa (Sanskrit).

Any two-letter subcode is understood to be an [ISO3166] country code.
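
The grammar above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the limit of one to eight ASCII letters per code is taken from RFC 1766 rather than from this section.

```python
import re

# One code or subcode: 1 to 8 ASCII letters (RFC 1766 limit, an assumption here).
_SUBTAG = re.compile(r"[A-Za-z]{1,8}")


def parse_language_code(code):
    """Split a language code into (primary, subcodes), lowercased.

    Raises ValueError when any part is empty or not 1-8 ASCII letters.
    """
    parts = code.split("-")
    for part in parts:
        if not _SUBTAG.fullmatch(part):
            raise ValueError(f"invalid language code: {code!r}")
    primary, *subcodes = (part.lower() for part in parts)
    return primary, subcodes


def country_subcode(code):
    """Return the first two-letter subcode (read as a country code), or None."""
    _, subcodes = parse_language_code(code)
    for subcode in subcodes:
        if len(subcode) == 2:
            return subcode.upper()
    return None


print(parse_language_code("en-US"))
print(country_subcode("en-cockney"))
```

Lowercasing before comparison reflects the [CI] (case-insensitive) marking of the attribute value.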

8.1.2 Inheritance of language codes

An element inherits language code information in the following order of precedence (from highest to lowest):

  • The lang attribute set for the element itself.
  • The closest parent element that has the lang attribute set (i.e., the lang attribute is inherited).
  • The HTTP "Content-Language" header (which may be configured in a server). For example:
     Content-Language: en-cockney

  • User agent default values and user preferences.

In this example, the primary language of the document is French ("fr"). One paragraph is declared to be in Spanish ("es"), after which the primary language returns to French. The following paragraph includes an embedded Japanese ("ja") phrase, after which the primary language returns to French.

  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"
     "http://www.w3.org/TR/REC-html40/strict.dtd">
  <HTML lang="fr">
  <HEAD>
  <TITLE>Un document multilingue</TITLE>
  </HEAD>
  <BODY>
  ...interpreted as French...
  <P lang="es">...interpreted as Spanish...
  <P>...interpreted as French again...
  <P>...French text containing <EM lang="ja">a fragment of Japanese</EM>, after which French begins again...
  </BODY>
  </HTML>

Note. Table cells may inherit lang values not from their parent but from the first cell in a span. Please consult the section on alignment inheritance for details.
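
The order of precedence described in this section can be sketched as a small lookup. The function and parameter names below are hypothetical, and the sketch ignores the table-cell exception mentioned in the note.

```python
def effective_language(element_langs, content_language=None, user_agent_default=None):
    """Resolve the language that applies to an element.

    element_langs lists the lang values from the element itself outward to
    the root (None where the attribute is absent), so earlier entries have
    higher precedence.
    """
    for lang in element_langs:
        if lang is not None:
            return lang
    # No lang attribute on the element or any ancestor: fall back to HTTP.
    if content_language is not None:
        return content_language
    # Last resort: user agent defaults and user preferences.
    return user_agent_default


# EM lang="ja" inside P inside BODY inside HTML lang="fr"
print(effective_language(["ja", None, None, "fr"]))
# A plain P inside BODY inside HTML lang="fr"
print(effective_language([None, None, "fr"]))
```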

8.1.3 Interpretation of language codes

In the context of HTML, a language code should be interpreted by user agents as a hierarchy of tokens rather than a single token. When a user agent adjusts rendering according to language information (say, by comparing style sheet language codes and lang values), it should always favor an exact match, but should also consider matching primary codes to be sufficient. Thus, if the lang attribute value "en-US" is set for the HTML element, a user agent should prefer style information that matches "en-US" first, then the more general value "en".

Note. Language code hierarchies do not guarantee that all languages with a common prefix will be understood by those fluent in one or more of those languages. They do allow a user to request this commonality when it is true for that user.
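
A minimal sketch of this matching rule follows. The names are hypothetical, `styles` stands for whatever language-keyed style data a user agent holds, and trying intermediate prefixes of codes with more than one subcode is an assumption that goes slightly beyond the text.

```python
def match_candidates(lang):
    """Return the codes to try, from the exact code down to the primary code."""
    parts = lang.lower().split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


def select_style(lang, styles):
    """Pick the style registered for the closest matching language code.

    styles is a hypothetical mapping from lowercase language codes to style data.
    """
    for candidate in match_candidates(lang):
        if candidate in styles:
            return styles[candidate]
    return None


styles = {"en": "generic English quotes", "en-us": "U.S. English quotes"}
print(match_candidates("en-US"))
print(select_style("en-US", styles))
print(select_style("en-GB", styles))
```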

8.2 Specifying the direction of text and tables: the dir attribute

Attribute definitions

dir = LTR | RTL [CI]
This attribute specifies the base direction of directionally neutral text (i.e., text that does not have inherent directionality as defined in [UNICODE]) and the directionality of tables. Possible values:
  • LTR: Left-to-right text or table.
  • RTL: Right-to-left text or table.

In addition to specifying the language of a document with the lang attribute, authors may need to specify the base directionality (left-to-right or right-to-left) of portions of a document's text, of table structure, etc. This is done with the dir attribute.

The [UNICODE] specification assigns directionality to characters and defines a (complex) algorithm for determining the proper directionality of text. If a document does not contain a displayable right-to-left character, a conforming user agent is not required to apply the [UNICODE] bidirectional algorithm. If a document contains right-to-left characters, and if the user agent displays these characters, the user agent must use the bidirectional algorithm.

Although Unicode specifies special characters that deal with text direction, HTML offers higher-level markup constructs that do the same thing: the dir attribute (not to be confused with the DIR element) and the BDO element. Thus, to express a Hebrew quotation, it is more intuitive to write

  <Q lang="he" dir="rtl">...a Hebrew quotation...</Q>

than the equivalent with Unicode references:

  &#x202B;&#x05F4;...a Hebrew quotation...&#x05F4;&#x202C;

User agents must not use the lang attribute to determine text directionality.

The dir attribute is inherited and may be overridden. Please consult the section on the inheritance of text direction information for details.

8.2.1 Introduction to the bidirectional algorithm

The following example illustrates the expected behavior of the bidirectional algorithm. It involves English, which is written left to right, and Hebrew, which is written right to left.

Consider the following text:

  english1 IVRIT2 english3 IVRIT4 English5 IVRIT6

The characters in this example (and in all related examples) are stored in the computer in the order in which they are displayed here: the first character is "e", the second is "n", and the last is "6".

Suppose the predominant language of the document containing this paragraph is English. This means that the base direction is left-to-right. The correct presentation of this line would be:

 english1 2TIRVI english3 4TIRVI English5 6TIRVI
          <-----          <-----          <-----
            H               H               H
 ---------------------------------------------->
                        E

The dotted lines indicate the structure of the sentence: English predominates and some Hebrew text is embedded. Achieving the correct presentation requires no additional markup, since the Hebrew fragments are reversed correctly by user agents applying the bidirectional algorithm.

If, on the other hand, the predominant language of the document is Hebrew, the base direction is right-to-left. The correct presentation is therefore:

 6TIRVI English5 4TIRVI english3 2TIRVI english1
        ------->        ------->        ------->
           E               E               E
 <----------------------------------------------
                        H

In this case, the whole sentence has been presented right-to-left, and the embedded English sequences have been properly reversed by the bidirectional algorithm.
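
The two presentations above can be reproduced with a toy model. This is not the Unicode bidirectional algorithm: it is a hedged sketch that assumes every word beginning with "IVRIT" is right-to-left, that all other words are left-to-right, and that right-to-left words never sit next to each other. All names are hypothetical.

```python
def is_rtl_word(word):
    # Assumption for this toy model: the placeholder words starting with
    # "IVRIT" stand for right-to-left (Hebrew) text; everything else is LTR.
    return word.startswith("IVRIT")


def toy_visual_order(stored_text, base_direction):
    """Return the left-to-right screen order of the words in stored_text."""
    words = stored_text.split()
    # The characters of a right-to-left word appear reversed on screen.
    shown = [w[::-1] if is_rtl_word(w) else w for w in words]
    if base_direction == "rtl":
        # With a right-to-left base direction the first stored word sits at
        # the right edge, so the order of the words is reversed as well.
        shown.reverse()
    return " ".join(shown)


stored = "english1 IVRIT2 english3 IVRIT4 English5 IVRIT6"
print(toy_visual_order(stored, "ltr"))
print(toy_visual_order(stored, "rtl"))
```

The stored text is identical in both calls; only the base direction changes the presentation.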

8.2.2 Inheritance of text direction information

The Unicode bidirectional algorithm requires a base text direction for text blocks. To specify the base direction of a block-level element, set the element's dir attribute. The default value of the dir attribute is "ltr" (left-to-right text).

When the dir attribute is set for a block-level element, it remains in effect for the duration of the element and any nested block-level elements. Setting the dir attribute on a nested element overrides the inherited value.

To set the base text direction for an entire document, set the dir attribute on the HTML element.

For example:

  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"
     "http://www.w3.org/TR/REC-html40/strict.dtd">
  <HTML dir="RTL">
  <HEAD>
  <TITLE>...a right-to-left title...</TITLE>
  </HEAD>
  ...right-to-left text...
  <P dir="ltr">...left-to-right text...</P>
  <P>...right-to-left text again...</P>
  </HTML>

Inline elements, on the other hand, do not inherit the dir attribute. This means that an inline element without a dir attribute does not open an additional level of embedding with respect to the bidirectional algorithm. (Here, an element is considered to be block-level or inline based on its default presentation. Note that the INS and DEL elements can be block-level or inline depending on their context.)
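
For block-level elements, the inheritance rule amounts to taking the nearest dir value and falling back to "ltr". A minimal sketch, with a hypothetical function name and input shape:

```python
def base_direction(block_dirs):
    """Resolve the base direction of a block-level element.

    block_dirs lists the dir values of the element and its block-level
    ancestors, innermost first (None where the attribute is absent).
    """
    for value in block_dirs:
        if value is not None:
            return value.lower()  # the attribute value is case-insensitive
    return "ltr"  # default stated in the text


# <P dir="ltr"> inside <HTML dir="RTL">
print(base_direction(["ltr", "RTL"]))
# <P> without dir inside <HTML dir="RTL">
print(base_direction([None, "RTL"]))
```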

8.2.3 Setting the direction of embedded text

The [UNICODE] bidirectional algorithm automatically reverses embedded character sequences according to their inherent directionality (as illustrated by the previous examples). However, in general only one level of embedding can be accounted for. To achieve additional levels of embedded direction changes, use the dir attribute on an inline element.

Consider the text of the previous example:

 english1 IVRIT2 english3 IVRIT4 English5 IVRIT6

Suppose the predominant language of the document containing this paragraph is English. Furthermore, the above English sentence contains a Hebrew section extending from IVRIT2 through IVRIT4, and the Hebrew section contains an English fragment (english3). The desired presentation of the text is thus:

 english1 4TIRVI english3 2TIRVI English5 6TIRVI
                 ------->
                    E
          <---------------------
                    H
 ---------------------------------------------->
                    E

To achieve two embedded direction changes, we must supply additional information, which we do by delimiting the second embedding explicitly. In this example, we use the SPAN element and the dir attribute to mark up the text:

 english1 <SPAN dir="RTL">IVRIT2 english3 IVRIT4</SPAN> English5 IVRIT6

Authors may also use special Unicode characters to achieve multiple embedded direction changes. To achieve left-to-right embedding, surround the embedded text with the characters LEFT-TO-RIGHT EMBEDDING ("LRE", hexadecimal 202A) and POP DIRECTIONAL FORMATTING ("PDF", hexadecimal 202C). To achieve right-to-left embedding, surround the embedded text with the characters RIGHT-TO-LEFT EMBEDDING ("RLE", hexadecimal 202B) and PDF.
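
As an illustration of the character-based method, the following sketch wraps the embedded fragment of the running example in formatting characters. The helper name `embed` is hypothetical; RLE is the abbreviation Unicode uses for RIGHT-TO-LEFT EMBEDDING.

```python
import unicodedata

LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def embed(text, direction):
    """Surround text with the embedding character for direction and a closing PDF."""
    opener = {"ltr": LRE, "rtl": RLE}[direction.lower()]
    return opener + text + PDF


# Plain-text counterpart of the SPAN dir="RTL" markup in the example above.
marked = "english1 " + embed("IVRIT2 english3 IVRIT4", "rtl") + " English5 IVRIT6"

for char in (LRE, RLE, PDF):
    print(f"U+{ord(char):04X}", unicodedata.name(char))
```

The formatting characters are invisible, so printing their names is an easy way to confirm which ones a string contains.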

Using HTML directionality markup with Unicode characters. Authors and designers of authoring software should be aware that conflicts can arise if the dir attribute is used on inline elements (including BDO) concurrently with the corresponding [UNICODE] formatting characters. Preferably, one or the other should be used exclusively. The markup method offers a better guarantee of the document's structural integrity and alleviates some problems when editing bidirectional HTML text with a simple text editor, but some software may be better suited to using the [UNICODE] characters. If both methods are used, great care should be taken to ensure proper nesting of markup and formatting characters; otherwise, rendering results are undefined.

8.2.4 Overriding the bidirectional algorithm: the BDO element

  <!ELEMENT BDO - - (%inline;)*          -- I18N BiDi over-ride -->
  <!ATTLIST BDO
    %coreattrs;                          -- id, class, style, title --
    lang        %LanguageCode; #IMPLIED  -- language code --
    dir         (ltr|rtl)      #REQUIRED -- directionality --
    >

Start tag: required, End tag: required

Attribute definitions

dir = LTR | RTL [CI]
This mandatory attribute specifies the base direction of the element's text content. This direction overrides the inherent directionality of characters as defined in [UNICODE]. Possible values:
  • LTR: Left-to-right text.
  • RTL: Right-to-left text.

Attributes defined elsewhere

The bidirectional algorithm and the dir attribute generally suffice to manage embedded direction changes. However, situations may arise in which the bidirectional algorithm results in incorrect presentation. The BDO element allows authors to turn off the bidirectional algorithm for selected fragments of text.

Consider a document with the same text fragment:

 english1 IVRIT2 english3 IVRIT4 English5 IVRIT6

but assume that this text has already been put in visual order. One reason for this may be that the MIME standard ([RFC2045], [RFC1556]) favors visual order, i.e., right-to-left character sequences are inserted right-to-left in the byte stream. In an e-mail, the above might be formatted, including line breaks, as:

  english1 2TIRVI english3
 4TIRVI English5 6TIRVI

This conflicts with the [UNICODE] bidirectional algorithm, because that algorithm would invert 2TIRVI, 4TIRVI, and 6TIRVI a second time, displaying the Hebrew words left-to-right instead of right-to-left.

The solution in this case is to override the bidirectional algorithm by putting the e-mail excerpt in a PRE element (to preserve line breaks) and each line in a BDO element whose dir attribute is set to LTR:

  <PRE>
  <BDO dir="LTR">english1 2TIRVI english3</BDO>
  <BDO dir="LTR">4TIRVI English5 6TIRVI</BDO>
  </PRE>

This tells the bidirectional algorithm "Leave me left-to-right!" and would produce the desired presentation:

 english1 2TIRVI english3
 4TIRVI English5 6TIRVI

The BDO element should be used in scenarios where absolute control over sequence order is required (e.g., multi-language part numbers). The dir attribute is mandatory for this element.

Authors may also use special Unicode characters to override the bidirectional algorithm: LEFT-TO-RIGHT OVERRIDE (hexadecimal 202D) or RIGHT-TO-LEFT OVERRIDE (hexadecimal 202E). The POP DIRECTIONAL FORMATTING character (hexadecimal 202C) ends either bidirectional override.

Note. Recall that conflicts can arise if the dir attribute is used on inline elements (including BDO) concurrently with the corresponding [UNICODE] formatting characters.
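
A plain-text counterpart of the PRE and BDO markup can be sketched as follows. The helper name is hypothetical. Each line is wrapped separately, mirroring the one BDO element per line used above, because explicit formatting characters do not carry over a paragraph boundary in the Unicode algorithm.

```python
LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def override_lines(text, direction="ltr"):
    """Wrap each line of text in an override character and a closing PDF."""
    opener = {"ltr": LRO, "rtl": RLO}[direction.lower()]
    return "\n".join(opener + line + PDF for line in text.split("\n"))


# The e-mail excerpt that is already in visual order.
email = "english1 2TIRVI english3\n4TIRVI English5 6TIRVI"
forced = override_lines(email)
print(forced.encode("unicode_escape").decode("ascii"))
```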

Bidirectionality and character encoding. According to [RFC1555] and [RFC1556], there are special conventions for the use of "charset" parameter values to indicate bidirectional treatment in MIME mail, in particular to distinguish between visual, implicit, and explicit directionality. The parameter value "ISO-8859-8" (for Hebrew) denotes visual encoding, "ISO-8859-8-i" denotes implicit bidirectionality, and "ISO-8859-8-e" denotes explicit directionality.

Because HTML uses the Unicode bidirectional algorithm, conforming documents encoded using ISO 8859-8 must be labeled as "ISO-8859-8-i". Explicit directional control is also possible with HTML, but it cannot be expressed with ISO 8859-8, so "ISO-8859-8-e" should not be used.

The value "ISO-8859-8" implies that the document is formatted visually, misusing some markup (such as TABLE with right alignment and no line wrapping) to ensure reasonable display on older user agents that do not handle bidirectionality. Such documents do not conform to the present specification. If necessary, they can be made to conform (while still being displayed correctly on older user agents) by adding BDO markup where necessary. Contrary to what is said in [RFC1555] and [RFC1556], ISO-8859-6 (Arabic) is not visual ordering.
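
The three labels and their meanings can be captured in a small lookup table. This is only a sketch with hypothetical names; it encodes the conventions stated above and nothing more.

```python
HEBREW_CHARSET_ORDER = {
    "iso-8859-8": "visual",
    "iso-8859-8-i": "implicit",
    "iso-8859-8-e": "explicit",
}


def hebrew_charset_order(label):
    """Return the directionality convention a charset label denotes, or None."""
    return HEBREW_CHARSET_ORDER.get(label.strip().lower())


def is_conforming_html_label(label):
    """True only for the label that conforming ISO 8859-8 HTML documents use."""
    return hebrew_charset_order(label) == "implicit"


for label in ("ISO-8859-8", "ISO-8859-8-i", "ISO-8859-8-e"):
    print(label, hebrew_charset_order(label), is_conforming_html_label(label))
```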

8.2.5 Character references for directionality and joining control

Since ambiguities sometimes arise as to the directionality of certain characters (e.g., punctuation), the [UNICODE] specification includes characters to enable their proper resolution. Unicode also includes some characters to control joining behavior where this is necessary (e.g., some situations with Arabic letters). HTML 4.0 includes character references for these characters.

The following DTD excerpt presents some of the directional entities:

  <!ENTITY zwnj CDATA "&#8204;"--=zero width non-joiner-->
  <!ENTITY zwj  CDATA "&#8205;"--=zero width joiner-->
  <!ENTITY lrm  CDATA "&#8206;"--=left-to-right mark-->
  <!ENTITY rlm  CDATA "&#8207;"--=right-to-left mark-->

The zwnj entity is used to block joining behavior in contexts where joining would occur but should not. The zwj entity does the opposite; it forces joining where it would not occur but should. For example, the Arabic letter "HEH" is used to abbreviate "Hijri", the name of the Islamic calendar system. Since the isolated form of "HEH" looks like the digit five as used in Arabic script, the initial form of the letter "HEH" is used to prevent it from being confused with a final digit five in a year. However, there is no following context (i.e., a joining letter) to which the "HEH" can join. The zwj character provides that context.

Similarly, in Persian texts, there are cases where a letter that would normally join a subsequent letter in a cursive connection should not. The zwnj character is used to block joining in such cases.

The other characters, lrm and rlm, are used to force the directionality of directionally neutral characters. For example, if a double quotation mark comes between an Arabic (right-to-left) and a Latin (left-to-right) letter, the direction of the quotation mark is not clear (is it quoting the Arabic text or the Latin text?). The lrm and rlm characters have a directional property but no width and no word/line break property. Please consult [UNICODE] for more details.
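
The four entities and their directional properties can be inspected with Python's standard library, as a quick sanity check rather than anything a user agent would do. The helper name `describe` is hypothetical; `html.unescape` and `unicodedata` are standard modules.

```python
import html
import unicodedata


def describe(entity_name):
    """Return (code point, Unicode name, bidi class) for a named entity."""
    char = html.unescape(f"&{entity_name};")
    return ord(char), unicodedata.name(char), unicodedata.bidirectional(char)


for name in ("zwnj", "zwj", "lrm", "rlm"):
    print(name, describe(name))

# A double quotation mark is directionally neutral, which is why a mark
# such as lrm or rlm may be needed next to it.
print(unicodedata.bidirectional('"'))
```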

Mirrored character glyphs. In general, the bidirectional algorithm does not mirror character glyphs but leaves them unaffected. An exception are characters such as parentheses (see [UNICODE], table 4-7). In cases where mirroring is desired, for example for Egyptian hieroglyphs, Greek boustrophedon, or special design effects, this should be controlled with styles.
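
Which characters fall under the mirroring exception can be checked against the Unicode character database. A minimal sketch using the standard `unicodedata` module; the wrapper name is hypothetical.

```python
import unicodedata


def is_mirrored(char):
    """True if Unicode marks char as mirrored in right-to-left text."""
    return bool(unicodedata.mirrored(char))


for char in "()a":
    print(repr(char), is_mirrored(char))
```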

8.2.6 The effect of style sheets on bidirectionality

In general, using style sheets to change an element's visual rendering from block-level to inline or vice versa is straightforward. However, because the bidirectional algorithm relies on the distinction between inline and block-level elements, special care must be taken during the transformation.

When an inline element that does not have a dir attribute is transformed to the style of a block-level element by a style sheet, it inherits the dir attribute from its closest parent block element to define the base direction of the block.

When a block element that does not have a dir attribute is transformed to the style of an inline element by a style sheet, the resulting presentation should be equivalent, in terms of bidirectional formatting, to the formatting obtained by explicitly adding a dir attribute (assigned the inherited value) to the transformed element.