§ 2.3. HTML
Hypertext Markup Language (HTML) is an application of SGML (Standard Generalized Markup Language), a standard for document processing, and of MIME (Multipurpose Internet Mail Extensions), a standard for the content of mail messages.[2] Although often incorrectly identified with it, HTML is not the same as the WWW standard: the WWW uses different formats for data exchange and transmission as well as for many other applications. The formats (protocols) supported by the WWW concern not only the display or transmission of texts, but also the correct reproduction of audio files and images, the recognition of certain compressed files, etc. HTML is important for the transmission of texts and their correct display on the screen: as a format subordinate to MIME, it is used exclusively for the unambiguous exchange and unambiguous representation of data, i.e. for compatibility (interoperability) between the various systems operating in the network; therefore all so-called network browsers must support the HTML format. From version 2.0 the HTML standard adopted ISO 8859-1 as the default coding (character set) of HTML texts. In practice this means that every user should be able to read the texts offered on the Internet correctly, and that the display of characters in the upper range should not cause any problems, since every browser should support ISO 8859-1 (so only the first table!) and many other characters (graphics, mathematics, [Modern] Greek, etc.).[3] Unfortunately, many of the usual browsers display, if anything at all, only ISO 8859-1 and a few other characters from the HTML character set (see Appendix), which in turn means that so far we can make use of only a limited part of the character palette.[4]
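To make the interoperability argument concrete, the following sketch (in modern Python, used here purely as illustration) shows what a fixed 8-bit character set guarantees: each of the 256 byte values of ISO 8859-1 denotes one and the same character on every system, so a Latin-1 text survives any transmission that keeps its bytes intact.

    latin1_bytes = bytes([0xFC, 0xF1, 0xDF])           # the bytes for ü, ñ, ß in ISO 8859-1
    text = latin1_bytes.decode("iso-8859-1")           # always 'üñß', on any system
    assert text == "üñß"
    assert text.encode("iso-8859-1") == latin1_bytes   # the mapping is reversible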

§ 3. Character sets of 16 bits and more
It can be seen from the foregoing that none of the previous attempts satisfies the requirement of error-free and unambiguous data processing in the broadest sense (transmission, exchange, indexing, etc.). The main reason for the failure of these attempts is above all the wide range of available standards as well as their inconsistent introduction and adoption (program implementation). It should also be noted that none of the 8-bit standards can offer a sufficient selection of characters. For this reason, various software developers and institutions worked on the design of "more complete" character sets at an early stage. From a technical point of view, it has long been possible to equip operating systems for normal PCs with 16-bit or even 32-bit coding, which would ultimately make so-called wide characters available to the user, i.e. immense stocks of characters of all possible alphabets.

§ 3.1. WordPerfect
One of the first programs ever to use 16-bit coding is the well-known word processing system WordPerfect (WP), which is also one of the few that, thanks to this special feature, allows one to work with texts in different languages without creating ambiguities. Since version 5.0, WP has had 13 tables, each containing up to 256 fixed characters (the 13th table can be freely defined); newer versions have expanded these tables and added more. It is very important that every user of WP can exchange data without danger, since the characters are defined identically for every WP user. For example, Greek is fixed on Table 8, and a Greek α is always character number 1 in that character set (i.e. 8:1); likewise, a certain Ukrainian letter is always 10:67 (i.e. character number 67 in Table 10: Cyrillic), and an ö is always 1:177 (Table 1 contains the letters of the Latin national alphabets with diacritics). WP, too, had the option of offering a full 16-bit character set, but did not exhaust it: only about 2,000 characters are used out of a possible 65,536. All in all, the WP system appears to come very close to the desired goal of flawless data exchange, but since it is based on in-house considerations and was developed without taking other software companies into account, it is not foreseeable that this character encoding will ever become established.
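The following sketch illustrates WP's two-level addressing with the (table, index) pairs cited above; the dictionary and helper function are hypothetical illustrations for this article, not WP's own interface.

    # The (table, index) pairs are those cited in the text above.
    WP_CHARS = {
        (8, 1):   "Greek alpha",         # Table 8: Greek
        (10, 67): "a Ukrainian letter",  # Table 10: Cyrillic
        (1, 177): "ö",                   # Table 1: Latin with diacritics
    }

    def wp_code(table: int, index: int) -> str:
        """Render a WP character address in the 'table:index' notation used above."""
        return f"{table}:{index}"

    for (table, index), name in WP_CHARS.items():
        print(wp_code(table, index), "->", name)   # e.g. 8:1 -> Greek alpha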

§ 3.2. Unicode
The first serious attempt at an international, vendor-independent 16-bit coding was made by the Unicode Consortium. The experiment evolved into a standard that, with 65,536 different fixed characters, could satisfy almost any encoding requirement. Unfortunately, this standard is so far supported by very few programs (I do not know of a single one!)[5] and operating systems.
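A brief illustration of what "fixed characters" means here: Unicode assigns each character one invariable number (the code point), independent of program, font, or platform. The sketch below (again modern Python, for illustration only) shows the fixed code points and their 16-bit serialization.

    for ch in "αöñ":
        print(ch, f"U+{ord(ch):04X}")              # α U+03B1, ö U+00F6, ñ U+00F1
    assert "α".encode("utf-16-be") == b"\x03\xb1"  # one fixed 16-bit unit, big-endian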

§ 3.3. 32-bit coding: ISO 10646
Although it looks as if no developer takes wide characters seriously or considers them necessary at all, there is already a well-advanced attempt at a final solution: 32-bit coding, embodied in the ISO 10646 standard, which is intended to define the coding of around two billion characters. However, we are still a long way from the actual introduction of this standard, which saves us from further discussion (see further information at www.pls.com/dcstug/unicode.html).

§ 4. HTML and WWW: New ways of text transmission
The HTML format makes it possible to manipulate or edit the text design and the content of documents for the WWW. As a result, every document designed for the WWW must adhere to the HTML standard; otherwise the documents would be presented in a confusing way and the advantage of general public access or, in today's jargon, global data exchange would be lost. WWW documents therefore have a special format that concerns not only the character coding (see § 3.2), but also all possible control characters for the display of various text formats such as font, size, tables, e-mail addresses, image embedding, references, etc. An exact picture of what has been said can be obtained by comparing Figs. 3 and 4, which show a simple WWW page from two different perspectives.

Fig. 3: Example of a web page viewed with a browser

Fig. 4: The HTML source of the same page

§ 4.1. The special character problem of HTML in the WWW
For us linguists it only really gets interesting when it comes to special characters, i.e. all characters that go beyond the ASCII standard, such as ñ, ü, ä, ß. These must be marked in a special way so that the browser can display them: the Spanish ñ is encoded as &ntilde;, the German ü as &uuml;, the ä as &auml;, etc. Regrettably, the set of characters that can be used in HTML documents is fixed (see the tables in the appendix) and correspondingly limited. There is thus no possibility of creating new characters on the basis of this coding method, as would be conceivable by simple proportion: if the á is coded as &aacute;, the é as &eacute; and the Ú as &Uacute;, then it ought also to be possible to create a ć by the code &cacute;, an ń by &nacute;, etc. However, this is not the case. It becomes more difficult with Greek characters, since only those used for Modern Greek are provided. A fully accented Greek text therefore cannot be displayed on the WWW, and so far there is no generally applicable solution to this problem (the same applies to Cyrillic, Arabic, etc.). Although this combinatorial character encoding forms a theoretically open system and would be ideally suited to a huge expansion, it unfortunately remains unused. Since the "foreign scripts", as noted, can only be displayed to a limited extent, other solutions have to be sought, e.g. saving all the required characters as images so that they can be inserted into the text as in the "good old days". However, this solution cannot be used across the board either, since the texts would become very large due to the internal references required by the HTML format and would therefore be difficult to handle on the heavily loaded Internet networks. Another solution is the one currently being tested as part of the TITUS project. It exploits a function of various browsers that allows fonts to be freely selected: at this point, self-created fonts can be called up instead of system-internal fonts. However, this solution only works as long as the programs (i.e. the browsers) allow the user to select fonts. In addition, it immediately runs into the problem of standardization: not all WWW readers have access to the fonts,[6] the documents have to be formatted in a certain (not uncomplicated) way, and ultimately the self-developed fonts are not supported by all operating systems (currently only MS-Windows and Apple-Macintosh, but not UNIX).
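The entity mechanism can be sketched as follows, using Python's html module; note that this module implements today's much larger entity list, whereas in the HTML 2.0 era described here only the "Added Latin 1" names existed, so a reference such as &cacute; was simply undefined.

    import html

    print(html.unescape("Espa&ntilde;a, M&uuml;nchen"))   # España, München
    # Numeric character references address code positions directly and so
    # sidestep the fixed list of entity names entirely:
    print(html.unescape("&#241; &#252;"))                 # ñ ü
    print(html.escape("a < b & c"))                       # a &lt; b &amp; c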

§ 4.2. Other text-related problems with HTML on the WWW
It is not only the special HTML coding that causes difficulties for the scientific user. It must be made clear that a network application designed for international exchange and also for linguistic purposes on the basis of a clear and fully sufficient coding (such as UNICODE or, better, ISO 10646) has to meet three further requirements: the possibility of data input, the possibility of data display, and the possibility of data processing and management. These prerequisites cannot be taken for granted, as is shown by programs that support foreign fonts but leave access to these functions to the ability of the user. A word processing program that offers wide characters should also provide help with input (i.e., how do I create an arbitrary character?), for example by delivering predefined keyboard layouts, but also by allowing characters to be easily mapped to any keyboard position. The wide characters should also be displayable on the screen and not be replaced by black boxes or the like. It is also conceivable that not all scripts can be displayed at the same time, or that the user does not want this, so one could possibly fall back on a scientific transliteration of certain alphabets; under no circumstances, however, should one forego the possibility of an exact representation of an alphabet in the original. What remains is data management. It would make sense if indexing routines and fast search tools were implemented in one and the same processing program. If this is not the case, one should at least have access to additional database programs that support wide characters. The same desiderata also apply to the WWW and thus ultimately to the HTML format, since as editors we should be able to enter and edit linguistic texts, and as readers we should be able to view the texts and at the same time evaluate them scientifically, with quick access both to the text itself and to various indices.

§ 5. The various formats in comparison
According to what was said in § 4.2, it has not been possible to date to distribute a text containing characters beyond the ASCII standard in an adequate way through the WWW. But it will certainly become feasible in the not too distant future.

§ 5.1. Special characters in the WWW
Notwithstanding the huge restrictions resulting from the lack of coding, there are very interesting attempts to offer "foreign" texts on the WWW, as is the case with the Avesta Web Server by Joseph H. PETERSON (see http://kasson.cfa.org/~jpeterso/avesta.html). The following figures show the problems mentioned so far from a practical point of view: Fig. 5 shows a page of the Avesta Web Server; Fig. 6 shows the source of the same text.

Fig. 5: Avesta web server page

Fig. 6: HTML source of the same Avesta text

§ 5.4. Conclusions. Minimum requirements
Although all the coding attempts presented so far should suffice for linguistic processing if only they were used consistently, scientific (and perhaps industrial) requirements go beyond them and demand a coding that not only has a sufficient supply of all written symbols ever used by humans, but also contains corresponding assignments so that the languages (or alphabets) are treated uniquely, in order to avoid mix-ups and ambiguities. Such requirements would be superfluous if the ISO 10646 standard were to prevail and if operating systems and programs were to adopt and support the conditions set out in § 4.2. We also still have to wait for multilingual programs that can be used to the same extent on all three functional levels, i.e. input, display, and data processing/management; so far we have had to content ourselves with programs that are suitable either for word processing alone or as a database alone, but without allowing interaction between the two functions. The procedure for further scientific cooperation must therefore be reconsidered as long as we still have to work under the old conditions. Unfortunately, experience has shown that it is not easy to convince everyone that a uniform system is necessary: everyone insists on sticking to their old, familiar system, even if it offers less useful options than another's and at the same time requires much more complicated handling. These considerations should give rise to a standard for our most urgent needs, so that we can load our common data into our usual word processing program, or add it to a database, with the help of a simple conversion routine.

§ 6. Data exchange format
The format to be proposed must of course meet a few requirements in order to guarantee unlimited interchangeability.

§ 6.1. Requirements for the coding to be used
Since we have seen how problematic the use of 8-bit coding can be (the upper character range is defined differently from system to system), we must for the time being restrict ourselves to 7-bit coding, which is the same everywhere. In fact, the characters in the lower range (at least from 32 to 127; see Fig. 1) always remain the same, and it cannot happen that one of them is misinterpreted or transmitted differently from the original. So much for the basic character set to be used. In order to represent the sometimes complicated letter combinations we need, we must create a unique coding that makes the intended character exactly retrievable at any time. It is not absolutely necessary to aim at a one-to-one visual reproduction of the data; the goal is simply a reversible, unambiguous coding. Reversibility must make conversion into any format possible; unambiguity itself has two effects: first, that regardless of the particular system or program, one and the same character is always represented by one and the same code, and second, that any confusion or misinterpretation is absolutely excluded.

§ 6.2. Prerequisites for electronic transmission and transfer
The new ways of transmission are all electronic in nature, and adapting our data to this new medium has the advantage of keeping it intact in its entirety. The Internet, the global information superhighway, only accepts information represented as a sequence of bytes; in addition, this octet sequence must be assigned to a coded character set so that the transmitted characters can be decoded. This circumstance in turn forces us to continue working with 7-bit coding, since, as noted, programs cannot generally be relied upon to support 8-bit coding (ISO 8859-1) (see § 2.3).
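As a concrete illustration of the requirements of § 6.1, the following minimal sketch implements a miniature reversible 7-bit coding. The digraphs chosen here (a: for ā, s' for ś, m. for ṃ) are invented for the example and are not the TITUS conventions; what matters is only the principle that every code corresponds to exactly one character and back.

    # A miniature reversible 7-bit coding. NOTE: the scheme is unambiguous
    # only as long as the digraphs never occur as plain text -- exactly the
    # kind of condition a real transcription standard must fix.
    ENCODE = {"\u0101": "a:", "\u015b": "s'", "\u1e43": "m."}  # ā, ś, ṃ
    DECODE = {v: k for k, v in ENCODE.items()}                 # the inverse mapping

    def to_ascii(text: str) -> str:
        """Replace each special character by its 7-bit digraph."""
        return "".join(ENCODE.get(ch, ch) for ch in text)

    def from_ascii(text: str) -> str:
        """Invert to_ascii: scan for digraphs and restore the originals."""
        out, i = [], 0
        while i < len(text):
            if text[i:i + 2] in DECODE:
                out.append(DECODE[text[i:i + 2]])
                i += 2
            else:
                out.append(text[i])
                i += 1
        return "".join(out)

    word = "śamā"                                      # sample with non-ASCII letters
    assert from_ascii(to_ascii(word)) == word          # the round trip is lossless
    assert all(ord(c) < 128 for c in to_ascii(word))   # the coded form is pure 7-bit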

§ 7. Coding proposal: TITUS transcription systems
The following tables present transcription systems for different disciplines,[7] in which only the principle of unambiguous coding prevails, bringing with it the desired convertibility.

§ 7.1 TITUS transcription system for Old Indian (Sanskrit)
The undertaking of creating a pure ASCII coding for Sanskrit is almost as old as the introduction of ASCII coding itself. We now know various coding systems that offer the same thing, but each with a different standardization. The most common systems are listed in the following table: KH stands for the Kyoto-Harvard system, PSZ for that of Peter Schreiner (Zurich), which is very common among TeX users, and FV for that of Frans Velthuis. All of these systems were ultimately designed for classical Sanskrit and do not take accented texts into account. The TITUS transcription system has significant advantages in that it makes it possible to designate not only the bare letters but also the accent combinations. In addition, the TITUS transcription system adds further special characters that are urgently needed for a linguistic analysis of Old Indian materials.[8]
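By way of illustration, here is a fragment of such a comparison (the full table is not reproduced here): the Kyoto-Harvard and Velthuis values below are the commonly documented ones, while the PSZ and TITUS columns are omitted rather than guessed.

    SANSKRIT = {
        #  character    KH     FV (Velthuis)
        "\u0101": ("A",  "aa"),   # ā  long a
        "\u1e5b": ("R",  ".r"),   # ṛ  vocalic r
        "\u015b": ("z",  '"s'),   # ś  palatal sibilant
        "\u1e63": ("S",  ".s"),   # ṣ  retroflex sibilant
        "\u1e43": ("M",  ".m"),   # ṃ  anusvara
        "\u1e25": ("H",  ".h"),   # ḥ  visarga
    }
    for ch, (kh, fv) in SANSKRIT.items():
        print(f"{ch}   KH: {kh:2}   FV: {fv}")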

§ 7.2 TITUS transcription system for Old Iranian (Indo-Iranian Studies)

§ 7.3 TITUS transcription system for Armenian


Remarks:

Note 1: An overall representation of the ISO 8859-X tables can be found at www.cs.tu-berlin.de/~czyborra/charsets/.

Note 2: For SGML see Ch. GOLDFARB: The SGML Handbook, OUP; on MIME see D. GOLDSMITH: Using Unicode with MIME, http://ds.internic.net/rfc/rfc1641; N. BORENSTEIN / N. FREED: MIME (Multipurpose Internet Mail Extensions) Part 1, http://ds.internic.net/rfc/rfc1521.ps; K. MOORE: MIME (Multipurpose Internet Mail Extensions) Part 2, http://ds.internic.net/rfc/rfc1522.txt.

Note 3: The table on p. xxx shows the special characters supported in HTML format. See also the HTML specification, IETF RFC 1866, at ftp://ds.internic.net/rfc/rfc1866.txt.

Note 4: Something similar applies to the TCP protocol, which supports ISO 8859-1; usually, however, programmers ignore the upper tier, and so TCP is used as a 7-bit protocol although it is actually an extended 8-bit protocol.

Note 5: I only recently found out about MASS (Multilingual Application Support Service), which presents itself as a word processing program with UNICODE implemented; see www.iss.nus.sg/RND/MLP/Projects/MASS/MASS.html. References to other products are available at www.pls.com/dcstug/index.html.

Note 6: They are available at www.rz.uni-frankfurt.de/titus/software/d-softwa.htm#ttfonts.

Note 7: The Greek corpus is almost completely available in the so-called beta format. See the TLG server at www.uci.edu:80/~tlg/, for the beta format see www.tlg.uci.edu/~tlg/BetaCode.html.

Note 8: J. GIPPERT offers an example of such an analysis in Fs. J. Schindler.


Sources used:

  • Jost GIPPERT: "The project of an Indo-European thesaurus", LDV-Forum, Forum of the Society for Linguistic Data Processing, Vol. 12/1, June 1995, pp. 35-49; also at www.rz.uni-frankfurt.de/titus/public_html/texte/titusldv.htm.
  • "From the cuneiform tablet to the text database", Forschung Frankfurt 4/1995, pp. 47-56; also at www.rz.uni-frankfurt.de/titus/texte/forschffm/049546.htm
  • www.rz.uni-frankfurt.de/titus/public_html
  • www.ebt.com:8080/docs/multilingual-www.html
  • www.w3.org/hypertext/WWW/International/Overview.html
  • www.infocom.net/~bbs/iso8859.html
  • www.cs.tu-berlin.de/~czyborra/charsets/
  • http://www.yajima.kuis.kyoto-u.ac.jp/staffs/yasuoka/CJK.html
  • ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/FAQ-ISO-8859-1
  • ftp://ds.internic.net/rfc/rfc1866.txt

§ 8. Appendix: The HTML Coded Character Set (see ftp://ds.internic.net/rfc/rfc1866.txt)

This list details the code positions and characters of the HTML document character set, specified in 9.5, "SGML Declaration for HTML". This coded character set is based on [ISO-8859-1].

    REFERENCE        DESCRIPTION
    --------------   -----------
    &#00; - &#08;    Unused
    &#09;            Horizontal tab
    &#10;            Line feed
    &#11; - &#12;    Unused
    &#13;            Carriage Return
    &#14; - &#31;    Unused
    &#32;            Space
    &#33;            Exclamation mark
    ...
    &#46;            Period (full stop)
    &#47;            Solidus (slash)
    &#48; - &#57;    Digits 0-9
    &#58;            Colon
    &#59;            Semi-colon
    ...
    &#62;            Greater than
    &#63;            Question mark
    ...
Proposed Entities

The HTML DTD references the "Added Latin 1" entity set, which only supplies named entities for a subset of the non-ASCII characters in [ISO-8859-1], namely the accented characters. The following entities should be supported so that all ISO 8859-1 characters may be referenced symbolically. The names for these entities are taken from the appendixes of [SGML].
