9 Text

The following sections discuss issues surrounding the structuring of text. Elements that present text (alignment elements, font elements, style sheets, etc.) are discussed elsewhere in this specification. For information about characters, see the section on the document character set.

9.1 White space

The document character set includes a wide variety of white space characters. Many of these are typographic elements used in some applications to produce particular visual spacing effects. In HTML, only the following characters are defined as white space characters:

  • ASCII space (&#x0020;)
  • ASCII tab (&#x0009;)
  • ASCII form feed (&#x000C;)
  • Zero-width space (&#x200B;)

Line breaks are also white space characters. Note that although &#x2028; and &#x2029; are defined in [ISO10646] to separate lines and paragraphs, respectively, they do not constitute line breaks in HTML, nor does this specification include them in the more general category of white space characters.

This specification does not define the behavior, rendering or otherwise, of space characters other than those explicitly identified here as white space characters. For this reason, authors should use appropriate elements and style sheets, rather than space characters, to achieve visual formatting effects that involve white space.

For all HTML elements except PRE, sequences of white space separate "words" (we use the term "word" here to mean "a sequence of non-white space characters"). When formatting text, user agents should identify these words and lay them out according to the conventions of the particular written language (script) and target medium.

This layout may involve putting space between words (called inter-word space), but conventions for inter-word space vary from script to script. For example, in Latin scripts, inter-word space is typically rendered as an ASCII space (&#x0020;), while in Thai it is a zero-width word separator (&#x200B;). In Japanese and Chinese, inter-word space is typically not rendered at all.

Note that a sequence of white space characters between words in the source document may result in entirely different rendered inter-word spacing (except in the case of the PRE element). In particular, user agents should collapse input white space sequences when producing output inter-word space. This can and should be done even in the absence of language information (from the lang attribute, the HTTP "Content-Language" header field (see [RFC2068], section 14.13), user agent settings, etc.).
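The collapsing rule described above can be illustrated with a minimal Python sketch. It is not part of the specification: the function names and the inter_word parameter are hypothetical, and the set of white space characters is simply the one listed in this section plus the line break characters CR and LF.

```python
import re

# White space as defined in this section: space, tab, form feed,
# zero-width space, plus the line break characters CR and LF.
# Note that the no-break space (U+00A0) is deliberately absent.
_WHITE_SPACE_RUN = re.compile(r"[\u0020\u0009\u000C\u200B\u000D\u000A]+")


def split_words(text):
    """Return the "words" (runs of non-white space characters) in text."""
    return [word for word in _WHITE_SPACE_RUN.split(text) if word]


def collapse_white_space(text, inter_word=" "):
    """Join the words of text with one inter-word separator.

    inter_word is a hypothetical parameter standing in for the script
    convention: " " for Latin scripts, "\u200B" for Thai, "" for
    Japanese or Chinese.
    """
    return inter_word.join(split_words(text))
```

For example, collapse_white_space("We  offer\t\nfree support ") yields "We offer free support", whatever mixture of spaces, tabs and line breaks appeared in the source. This does not apply inside PRE, where white space is significant.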

The PRE element is used for preformatted text, where white space is significant.

In order to avoid problems with SGML line break rules and inconsistencies among existing implementations, authors should not rely on user agents to render white space immediately after a start tag or immediately before an end tag. Thus, authors, and in particular authoring tools, should write, for example:

  <P>We offer free <A>technical support</A> for registered users.</P>

and should not write:

  <P>We offer free<A> technical support </A>for registered users.</P>

9.2 Structured Text

9.2.1 Phrase elements: EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE, ABBR, and ACRONYM

  <!ENTITY % phrase "EM | STRONG | DFN | CODE |
                     SAMP | KBD | VAR | CITE | ABBR | ACRONYM" >
  <!ELEMENT (%fontstyle;|%phrase;) - - (%inline;)*>
  <!ATTLIST (%fontstyle;|%phrase;)
    %attrs;                              -- %coreattrs, %i18n, %events --
    >

Start tag: required, End tag: required

Phrase elements add structural information to text fragments. The usual meanings of the phrase elements are as follows:

EM:
Indicates emphasis.
STRONG:
Indicates stronger emphasis.
CITE:
Contains a citation or a reference to other sources.
DFN:
Indicates that this is the defining instance of the enclosed term.
CODE:
Designates a fragment of computer code.
SAMP:
Designates sample output from programs, scripts, etc.
KBD:
Indicates text to be entered by the user.
VAR:
Indicates an instance of a variable or program argument.
ABBR:
Indicates an abbreviated form (e.g., WWW, HTTP, URI, Mass., etc.).
ACRONYM:
Indicates an acronym (e.g., WAC, radar, etc.).

The EM and STRONG elements are used to indicate emphasis. The other phrase elements have particular significance in technical documents. The following examples illustrate some of the phrase elements:

  As <CITE>Harry Truman</CITE> said,
  <Q lang="en-US">The buck stops here.</Q>
  For details, see <CITE>[ISO-0000]</CITE>.
  In future correspondence, please refer to the following reference number: <STRONG>1-234-55</STRONG>

The presentation of phrase elements depends on the user agent. Generally, visual user agents present EM text in italics and STRONG text in bold. Speech synthesizer user agents may change the synthesis parameters, such as volume, pitch and rate, accordingly.

The ABBR and ACRONYM elements allow authors to indicate clearly where abbreviations and acronyms occur. Western languages make extensive use of acronyms such as "GmbH", "NATO" and "FBI", as well as abbreviations like "M.", "Inc.", "et al.", "etc.". Both Chinese and Japanese use analogous abbreviation mechanisms, in which a long name is subsequently referred to by a subset of the Han characters from the original occurrence. Marking up these constructs provides useful information to user agents and tools such as spell checkers, speech synthesizers, translation systems and search engine indexers.

The content of the ABBR and ACRONYM elements specifies the abbreviated expression itself, as it would normally appear in running text. The title attribute of these elements may be used to provide the full or expanded form of the expression.

Here are some sample uses of the ABBR element:

  <P>
  <ABBR title="World Wide Web">WWW</ABBR>
  <ABBR lang="fr" title="Soci&eacute;t&eacute; Nationale des Chemins de Fer">SNCF</ABBR>
  <ABBR lang="es" title="Do&ntilde;a">Do&ntilde;a</ABBR>
  <ABBR title="Abbreviation">abbr</ABBR>.

Note that abbreviations and acronyms often have idiosyncratic pronunciations. For example, while "USA" and "BBC" are typically pronounced letter by letter, "NATO" and "UNESCO" are pronounced phonetically. Still other abbreviated forms (e.g., "URI" and "SQL") are spelled out by some people and pronounced as words by others. When necessary, authors should use style sheets to specify the pronunciation of an abbreviated form.

9.2.2 Quotations: the BLOCKQUOTE and Q elements

  <!ELEMENT BLOCKQUOTE - - (%block;|SCRIPT)+ -- long quotation -->
  <!ATTLIST BLOCKQUOTE
    %attrs;                              -- %coreattrs, %i18n, %events --
    cite        %URI;          #IMPLIED  -- URI for source document or message --
    >
  <!ELEMENT Q - - (%inline;)*            -- short inline quotation -->
  <!ATTLIST Q
    %attrs;                              -- %coreattrs, %i18n, %events --
    cite        %URI;          #IMPLIED  -- URI for source document or message --
    >

Start tag: required, End tag: required

Attribute definitions

cite = uri [CT]
The value of this attribute is a URI that designates a source document or message. This attribute is intended to give information about the source from which the quotation was taken.

These two elements designate quoted text. BLOCKQUOTE is for long quotations (block-level content) and Q is intended for short quotations (inline content) that do not require paragraph breaks.

This example formats an excerpt from "The Two Towers", by J. R. R. Tolkien, using the BLOCKQUOTE element.

  <BLOCKQUOTE cite="http://www.mycom.com/tolkien/twotowers.shtml">
  <P>They went in single file, running like hounds on a strong scent,
  and an eager light was in their eyes. Nearly due west the broad
  swath of the marching Orcs tramped its ugly slot; the sweet grass
  of Rohan had been bruised and blackened as they passed.</P>
  </BLOCKQUOTE>

Rendering quotations

Visual user agents generally render BLOCKQUOTE as an indented block.

Visual user agents must ensure that the content of the Q element is rendered with delimiting quotation marks. Authors should not put quotation marks at the beginning and end of the content of a Q element.

User agents should render quotation marks in a language-sensitive manner (see the lang attribute). Many languages use different quotation styles for outer and inner (nested) quotations, and user agents should respect these conventions.

The following example illustrates nested quotations with the Q element.

  John said, <Q lang="en">I saw Lucy at lunch, she told me <Q lang="en">Mary wants you to get some ice cream on your way home.</Q> I think I will get some at Ben and Jerry's, on Gloucester Road.</Q>

Since both quotations are in English, user agents should render them appropriately, for example with single quotation marks around the inner quotation and double quotation marks around the outer quotation:

  John said, "I saw Lucy at lunch, she told me 'Mary wants you to get some ice cream on your way home.' I think I will get some at Ben and Jerry's, on Gloucester Road."
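The nesting rule for Q can be sketched in Python as follows. This is an illustrative assumption about how a user agent might choose quotation marks, not a prescribed algorithm: the class and function names are hypothetical, only the English convention (double marks outside, single marks inside) is modelled, and the lang attribute is ignored.

```python
from html.parser import HTMLParser

# Assumed English convention: double marks for the outer quotation,
# single marks for the next level, alternating on deeper nesting.
ENGLISH_QUOTES = [("\u201C", "\u201D"), ("\u2018", "\u2019")]


class QuoteRenderer(HTMLParser):
    """Hypothetical renderer that replaces Q tags with quotation marks."""

    def __init__(self, quotes=ENGLISH_QUOTES):
        super().__init__()
        self.quotes = quotes
        self.depth = 0      # number of Q elements currently open
        self.out = []       # rendered text fragments

    def _pair(self, depth):
        return self.quotes[depth % len(self.quotes)]

    def handle_starttag(self, tag, attrs):
        if tag == "q":      # HTMLParser reports tag names in lower case
            self.out.append(self._pair(self.depth)[0])
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "q" and self.depth > 0:
            self.depth -= 1
            self.out.append(self._pair(self.depth)[1])

    def handle_data(self, data):
        self.out.append(data)


def render_quotes(markup):
    """Return markup as plain text with Q elements shown in quotation marks."""
    parser = QuoteRenderer()
    parser.feed(markup)
    parser.close()          # flush any text still buffered by the parser
    return "".join(parser.out)
```

A language-sensitive implementation would select the list of quotation mark pairs from the lang attribute of each Q element instead of using a fixed English list.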

Note. We recommend that style sheet implementations provide a mechanism for inserting quotation marks before and after a quotation delimited by BLOCKQUOTE, in a manner appropriate to the current language context and the degree of nesting of quotations.

However, as some authors have used BLOCKQUOTE merely as a mechanism to indent text, user agents should not insert quotation marks in the default style, in order to preserve the intention of those authors.

The use of BLOCKQUOTE to indent text is deprecated in favor of style sheets.

9.2.3 Subscripts and superscripts: the SUB and SUP elements

  <!ELEMENT (SUB|SUP) - - (%inline;)*    -- subscript, superscript -->
  <!ATTLIST (SUB|SUP)
    %attrs;                              -- %coreattrs, %i18n, %events --
    >

Start tag: required, End tag: required

Many scripts (e.g., French) require superscripts or subscripts for proper rendering. The SUB and SUP elements should be used to mark up text in these cases.

  H<sub>2</sub>O
  E = mc<sup>2</sup>
  <SPAN lang="fr">M<sup>lle</sup> Dupont</SPAN>

9.3 Lines and paragraphs

Authors traditionally divide their thoughts and arguments into sequences of paragraphs. The organization of information into paragraphs is not affected by how the paragraphs are presented: paragraphs that are double-justified contain the same thoughts as those that are left-justified.

The HTML markup for defining a paragraph is straightforward: the P element defines a paragraph.

The visual presentation of paragraphs is not so simple. A number of issues, both stylistic and technical, must be addressed:

  • Treatment of white space
  • Line breaking and word wrapping
  • Justification
  • Hyphenation
  • Written language conventions and text directionality
  • Formatting of paragraphs with respect to surrounding content

These issues are addressed below. Paragraph alignment and floating objects are discussed later in this document.

9.3.1 Paragraphs: the P element

  <!ELEMENT P - O (%inline;)*            -- paragraph -->
  <!ATTLIST P
    %attrs;                              -- %coreattrs, %i18n, %events --
    >

Start tag: required, End tag: optional

The P element represents a paragraph. It cannot contain block-level elements (including P itself).

We discourage authors from using empty P elements. User agents should ignore empty P elements.

9.3.2 Controlling line breaks

A line break is defined to be a carriage return (&#x000D;), a line feed (&#x000A;), or a carriage return/line feed pair. All line breaks constitute white space.
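As a minimal sketch of this definition, the following Python fragment recognizes the three forms of line break and treats a carriage return/line feed pair as a single break. The function names are hypothetical.

```python
import re

# The pair CR LF is listed first so that it is matched as one line break
# rather than as a carriage return followed by a separate line feed.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def count_line_breaks(text):
    """Count line breaks, treating a CR/LF pair as one break."""
    return len(_LINE_BREAK.findall(text))


def split_lines(text):
    """Split text at every line break (CR, LF, or a CR/LF pair)."""
    return _LINE_BREAK.split(text)
```

Because U+2028 and U+2029 do not constitute line breaks in HTML, they are intentionally left out of the pattern.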

For more information about SGML's specification of line breaks, see the notes on line breaks in the appendix.

Forcing a line break: the BR element

  <!ELEMENT BR - O EMPTY                 -- forced line break -->
  <!ATTLIST BR
    %coreattrs;                          -- id, class, style, title --
    >

Start tag: required, End tag: forbidden

The BR element forcibly breaks (ends) the current line of text.

For visual user agents, the clear attribute can be used to determine whether markup following the BR element flows around images and other objects floated to the left or right margin, or whether it starts below such objects. Further details are given in the section on alignment and floating objects. Authors are advised to use style sheets to control text flow around floating images and other objects.

With respect to bidirectional formatting, the BR element should behave the same way the [ISO10646] LINE SEPARATOR character behaves in the bidirectional algorithm.

Prohibiting a line break

Sometimes authors may want to prevent a line break from occurring between two particular words. The &nbsp; entity (&#160; or &#xA0;) acts as a no-break space, at which user agents should not cause a line break.

9.3.3 Hyphenation

In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a user agent as just another character. The soft hyphen tells the user agent where a line break can occur.

Browsers that interpret soft hyphens must observe the following semantics: if a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen character. For operations such as searching and sorting, the soft hyphen should always be ignored.

In HTML, the plain hyphen is represented by the "-" character (&#45; or &#x2D;). The soft hyphen is represented by the character entity reference &shy; (&#173; or &#xAD;).
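The soft hyphen semantics above can be sketched in Python as follows. This is an illustration under simplifying assumptions (one character per column, a single word, a greedy choice of break point), and the function names are hypothetical.

```python
SOFT_HYPHEN = "\u00AD"


def without_soft_hyphens(text):
    """Text as displayed when no break occurs, and as used for searching."""
    return text.replace(SOFT_HYPHEN, "")


def break_at_soft_hyphen(text, width):
    """Return (first_line, remainder) for a line of at most width columns.

    The break is made at the last soft hyphen for which the first line,
    including the visible hyphen, still fits. If the text fits without a
    break, or no soft hyphen allows a fitting break, the text is returned
    unbroken with all soft hyphens hidden.
    """
    if len(without_soft_hyphens(text)) <= width:
        return without_soft_hyphens(text), ""
    parts = text.split(SOFT_HYPHEN)
    for i in range(len(parts) - 1, 0, -1):
        first_line = "".join(parts[:i]) + "-"   # hyphen shown at the break
        if len(first_line) <= width:
            # Soft hyphens are kept in the remainder so it can be broken again.
            return first_line, SOFT_HYPHEN.join(parts[i:])
    return without_soft_hyphens(text), ""
```

For instance, with the word "hy&shy;phen&shy;ation" and a width of 8 columns the first line is "hyphen-", while a search for "hyphenation" should be matched against without_soft_hyphens(text) so that the soft hyphens are ignored.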

9.3.4 Preformatted text: the PRE element

  <!ENTITY % pre.exclusion "IMG|OBJECT|BIG|SMALL|SUB|SUP">
  <!ELEMENT PRE - - (%inline;)* -(%pre.exclusion;) -- preformatted text -->
  <!ATTLIST PRE
    %attrs;                              -- %coreattrs, %i18n, %events --
    >

Start tag: required, End tag: required

Attribute definitions

width = number [CN]
This attribute provides a hint to visual user agents about the desired width of the formatted block. The user agent can use this information to select an appropriate font size or to indent the content appropriately. The desired width is expressed in number of characters. This attribute is not widely supported.

The PRE element tells visual user agents that the enclosed text is "preformatted". When handling preformatted text, visual user agents:

  • May leave white space intact.
  • May render text with a fixed-pitch font.
  • May disable automatic word wrap.
  • Must not disable bidirectional processing.

Non-visual user agents are not required to respect extra white space in the content of a PRE element.

For more information about SGML's specification of line breaks, see the notes on line breaks in the appendix.

The DTD fragment above indicates which elements may not appear within a PRE declaration. This is the same as in HTML 3.2, and is intended to preserve constant line spacing and column alignment for text rendered in a fixed-pitch font. Authors are discouraged from altering this behavior through style sheets.

The following example shows a preformatted stanza from Shelley's poem To a Skylark:

  <PRE>
         Higher still and higher
           From the earth thou springest
         Like a cloud of fire;
           The blue deep thou wingest,
  And singing still dost soar, and soaring ever singest.
  </PRE>

Here is how this is typically rendered:

         Higher still and higher
           From the earth thou springest
         Like a cloud of fire;
           The blue deep thou wingest,
  And singing still dost soar, and soaring ever singest.

The horizontal tab character

The horizontal tab character (decimal 9 in [ISO10646] and [ISO88591]) is usually interpreted by visual user agents as the smallest non-zero number of spaces necessary to line characters up along tab stops that are every 8 characters. Using horizontal tabs in preformatted text is discouraged, since it is common practice when editing to set the tab spacing to other values, which leads to misaligned documents.
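The tab stop interpretation described above can be written out as a short Python sketch. The function name and the tab_width parameter are hypothetical; the default of 8 reflects the usual behavior described in the text.

```python
def expand_tabs(line, tab_width=8):
    """Replace each tab with the smallest non-zero number of spaces that
    moves the following character to the next tab stop."""
    out = []
    column = 0
    for ch in line:
        if ch == "\t":
            # Always between 1 and tab_width spaces, never zero.
            spaces = tab_width - (column % tab_width)
            out.append(" " * spaces)
            column += spaces
        else:
            out.append(ch)
            column += 1
    return "".join(out)
```

For single-line input this gives the same result as Python's built-in str.expandtabs(8). Calling expand_tabs("ab\tc", 4) shows how the same source text aligns differently when an editor uses tab stops every 4 characters, which is the misalignment the text warns about.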

9.3.5 Visual rendering of paragraphs

Note. The following section is an informative description of how some current visual user agents format paragraphs. Style sheets allow better control of paragraph formatting.

How paragraphs are rendered visually depends on the user agent. Paragraphs are usually rendered flush left with a ragged right margin. Other defaults are appropriate for right-to-left scripts.

HTML user agents have traditionally rendered paragraphs with white space before and after, e.g.,

  At the same time, there began to take form a system of numbering,
  the calendar, hieroglyphic writing, and a technically advanced
  art, all of which later influenced other peoples.

  Within the framework of this development and cultural progress,
  the Preclassic era is divided into Early, Middle and Late periods,
  to which can be added a transitional or Protoclassic period, some
  features of which later characterized the civilizations of America.

This contrasts with the style used in novels, which indents the first line of each paragraph and uses the regular line spacing between the last line of the current paragraph and the first line of the next, e.g.,

     At the same time, there began to take form a system of
  numbering, the calendar, hieroglyphic writing, and a technically
  advanced art, all of which later influenced other peoples.
     Within the framework of this development and cultural progress,
  the Preclassic era is divided into Early, Middle and Late periods,
  to which can be added a transitional or Protoclassic period, some
  features of which later characterized the civilizations of America.

Following the precedent set by the NCSA Mosaic browser in 1993, user agents generally do not justify both margins, in part because it is hard to do this well without sophisticated hyphenation routines. The advent of style sheets and of anti-aliased fonts with subpixel positioning promises to offer HTML authors richer choices than were previously possible.

Style sheets provide rich control over the size and style of a font, the margins, the space before and after a paragraph, the first line indent, justification and many other details. The user agent's default style sheet renders P elements in the familiar form illustrated above. One could, in principle, override this to render paragraphs without the breaks that conventionally distinguish successive paragraphs. In general, since this may confuse readers, this practice is discouraged.

By convention, visual HTML user agents wrap text lines to fit within the available margins. Wrapping algorithms depend on the script being formatted.

In Western scripts, for example, text should only be wrapped at white space. Early user agents incorrectly wrapped lines just after the start tag or just before the end tag of an element, which resulted in dangling punctuation. For example, consider this sentence:

  The <A href="cih78">Statue of Liberty</A>, which is ...

Wrapping the line just before the end tag of the A element causes the comma to be stranded at the beginning of the next line:

  The Statue of Liberty
  , which is ...

This is an error, since there is no white space at that point in the markup.
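The wrapping rule for Western scripts can be sketched as a greedy word wrapper in Python. This is an illustrative assumption rather than the algorithm of any particular user agent: the function name is hypothetical, tags are simply stripped with a regular expression, and each character is assumed to occupy one column.

```python
import re

_TAG = re.compile(r"<[^>]*>")
# Break opportunities exist only at HTML white space; the no-break
# space (U+00A0) is not in this set, so it never permits a break.
_WHITE_SPACE_RUN = re.compile(r"[\u0020\u0009\u000C\u200B\u000D\u000A]+")


def wrap_markup_text(markup, width):
    """Greedily wrap the text content of markup to lines of width columns."""
    # Removing a tag joins the text on either side of it, so a tag
    # boundary on its own never becomes a place to break the line.
    text = _TAG.sub("", markup)
    words = [w for w in _WHITE_SPACE_RUN.split(text) if w]
    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        if not current or len(candidate) <= width:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines
```

With the sentence above, "Liberty" and the comma that follows the end tag form a single word, so the comma always stays on the same line as "Liberty" regardless of the width chosen.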

9.4 Marking document changes: the INS and DEL elements

  <!-- INS/DEL are handled by inclusion on BODY -->
  <!ELEMENT (INS|DEL) - - (%flow;)*      -- inserted text, deleted text -->
  <!ATTLIST (INS|DEL)
    %attrs;                              -- %coreattrs, %i18n, %events --
    cite        %URI;          #IMPLIED  -- information about the reason for the change --
    datetime    %Datetime;     #IMPLIED  -- date and time of change --
    >

Start tag: required, End tag: required

Attribute definitions

cite = uri [CT]
The value of this attribute is a URI that designates a source document or message. This attribute is intended to point to information explaining why a document was changed.
datetime = datetime [CS]
The value of this attribute specifies the date and time when the change was made.

The INS and DEL elements are used to mark up sections of the document that have been inserted or deleted with respect to a different version of the document (e.g., in draft legislation where lawmakers need to view the changes).

These two elements are unusual for HTML in that they may serve as either block-level or inline elements (but not both). They may contain one or more words within a paragraph, or one or more block-level elements such as paragraphs, lists and tables.

This example could come from a bill that changes the number of deputies a county sheriff may employ from 3 to 5.

  <P>
  A sheriff may employ <DEL>3</DEL><INS>5</INS> deputies.
  </P>

The INS and DEL elements must not contain block-level content when they behave as inline elements.

ILLEGAL EXAMPLE:
The following is not legal HTML.

  <P>
  <INS><DIV>...block-level content...</DIV></INS>
  </P>

User agents should render inserted and deleted text in ways that make the change obvious. For instance, inserted text may appear in a special font, and deleted text may not be shown at all, or may be shown struck through or with special markings, etc.

Both of the following examples correspond to November 5, 1994, 8:15:30 am, US Eastern Standard Time.

  1994-11-05T13:15:30Z
  1994-11-05T08:15:30-05:00
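That the two values above denote the same instant can be checked with a short Python sketch. The helper name is hypothetical, and the code handles only the two forms shown here, not every value the datetime attribute may take.

```python
from datetime import datetime


def parse_html_datetime(value):
    """Parse a value such as 1994-11-05T13:15:30Z into an aware datetime."""
    # datetime.fromisoformat() accepts a trailing "Z" only in newer Python
    # versions, so it is rewritten as an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


utc_form = parse_html_datetime("1994-11-05T13:15:30Z")
est_form = parse_html_datetime("1994-11-05T08:15:30-05:00")
same_instant = utc_form == est_form   # aware datetimes compare by instant
```

Here same_instant is True, because datetime values that carry a time zone offset are compared as points in time rather than by their local clock readings.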

Used with the INS element, such a datetime value gives:

  <INS datetime="1994-11-05T08:15:30-05:00"
       cite="http://www.foo.org/mydoc/comments.shtml">
  Moreover, the latest figures from the marketing department suggest that this is a useful practice.
  </INS>

The document "http://www.foo.org/mydoc/comments.shtml" would contain comments about why this information was inserted into the document.

Authors may also comment on inserted or deleted text by means of the title attribute of the INS and DEL elements. User agents may present this information to the user (e.g., as a pop-up note). For example:

  <INS datetime="1994-11-05T08:15:30-05:00"
       title="Changed as a result of Michael A.'s comments in the meeting.">
  Moreover, the latest figures from the marketing department suggest that this is a useful practice.
  </INS>