Ay! Teck!: Text is not text

One of the fascinating things about computers is that not even something as simple as a text document is universal.

One would think that text was so simple that all systems could agree on one and the same standard.

However, right from the beginning of time there was confusion about which character to use for "new line". The Unix world used the ASCII character 0A (LF, Line Feed). Mac OS instead used the ASCII character 0D (CR, Carriage Return). DOS and Windows decided to go for both belt and braces and used CR+LF. Mac OS X then switched to the Unix-style LF to confuse its users. This is one of the reasons you sometimes see documents with very large distances between paragraphs or no paragraph breaks at all.
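To make the three conventions concrete, here is a small Python sketch. The function name `normalise_newlines` and the sample byte strings are made up for illustration; the only assumption is that the input uses LF, CR or CR+LF as its line ending.

```python
def normalise_newlines(data: bytes) -> bytes:
    """Convert CR+LF (DOS/Windows) and bare CR (old Mac OS) to LF (Unix)."""
    # Replace CR+LF first; otherwise each pair would turn into two line feeds.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# The same two-line text written with each convention (illustrative data).
unix_text = b"one\ntwo\n"
old_mac_text = b"one\rtwo\r"
windows_text = b"one\r\ntwo\r\n"

for raw in (unix_text, old_mac_text, windows_text):
    print(normalise_newlines(raw))
```

The order of the two replacements matters: handling the bare CR first would split every Windows line ending into two new lines, which is exactly the "very large distances between paragraphs" effect.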

So much for the new lines.

Then the problem was the rest of the characters.

In one respect it was a pity that the English language dominated early computing, because there was no need to handle accented characters. ASCII contains just the unaccented letters a-z and A-Z, plus digits and punctuation; no é or å or 福. To write accented characters, ASCII was extended to what your web browser probably calls Western (because it is really wild), and many other encodings were introduced for writing other languages.

The first reasonably successful attempt at standardising on one encoding came in 1991 with Unicode, which encompasses most of the major writing systems in the world. However, Unicode is still not the default in most cases, more than 15 years later. If you save a text file in Windows or Mac OS X, the default encoding is based on the language of the computer. If you type a text file in Notepad.exe in French and send it to a Chinese colleague, it is very possible that the accents will turn up as gobbledygook in their version of Notepad.
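The Notepad scenario can be imitated in a few lines of Python. This is only a sketch: it assumes the French machine saves in Windows-1252 (`cp1252`) and the Chinese machine reads the bytes as GBK (`gbk`), which are plausible defaults but not something every system uses. The sample sentence is made up.

```python
text = "déjà vu"

# What the French machine writes to disk (assumed encoding: Windows-1252).
raw = text.encode("cp1252")

# What the Chinese machine shows when it guesses GBK for the same bytes.
# errors="replace" substitutes a placeholder for byte sequences GBK rejects.
garbled = raw.decode("gbk", errors="replace")

# What it would show if it knew the right encoding.
correct = raw.decode("cp1252")

print(repr(garbled))
print(repr(correct))
```

The bytes on disk are identical in both cases; only the guess about what they mean differs.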

Is the solution then not to force everyone to use Unicode?

It could have been. However, there is more than one Unicode encoding, and each has different advantages.

If you save an English file in Unicode UTF-8, it is no bigger than if you save it in "Western" encoding. However, a Russian text takes about twice as much space in UTF-8 as in a dedicated Russian encoding, and a Chinese or Japanese text takes 50% more than in a native Chinese or Japanese encoding.

If you instead use Unicode UTF-16, Chinese and Japanese texts do not take more space than in a native encoding, but English text takes twice as much as in Western encoding. And most HTML pages and programming languages consist mostly of English characters, so UTF-8 makes more sense there.
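The size trade-off is easy to check by counting bytes. The sketch below uses made-up sample strings and assumes `cp1252`, `cp1251` and `gbk` as the "native" encodings for English, Russian and Chinese; `utf-16-le` is used instead of `utf-16` so that the two-byte byte order mark does not skew the count on such short strings.

```python
def encoded_sizes(text: str, native_codec: str) -> dict:
    """Return the byte length of text in its native codec, UTF-8 and UTF-16."""
    codecs = (native_codec, "utf-8", "utf-16-le")
    return {codec: len(text.encode(codec)) for codec in codecs}


# Illustrative samples paired with an assumed native encoding.
samples = {
    "English": ("Hello, world", "cp1252"),
    "Russian": ("Привет, мир", "cp1251"),
    "Chinese": ("你好世界", "gbk"),
}

for language, (text, native_codec) in samples.items():
    print(language, encoded_sizes(text, native_codec))
```

The Russian sample comes out slightly under double in UTF-8 because the comma and space still take one byte each; text that is purely Cyrillic would be exactly double.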

So next time you double click on a text file and it displays with surprising paragraphs and gobbledygook characters, don't blame the author - blame the fact that the first computers were not built by Indians. There are more than a dozen different alphabets in India, so they would probably have been able to figure out a better solution much earlier on in the process.


Sunday, 28 October 2007
