Re: Stuck in the past
@James 47 -- Gravis said "exploit", not "squander."
I recently experienced a Damascene conversion and, like many such converts, I am now set on a course of indiscriminate and aggressive proselytising. Ladies and gentlemen, place your ears in the amenable-to-bended position, and stand by to be swept along by the next great one-and-only true movement.
The beginning
In the …
The fact that some programmer, in an attempt to show the "benefit of Unicode", should use a 'double' variable for PI and only give 6 figures tells you they should be executed and their programs not!
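For anyone who wants to see the crime in miniature, a quick sketch (the six-figure value is my illustration, not the original offender's exact code):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double pi_lazy   = 3.14159;           /* six figures: wastes most of a double */
        double pi_proper = 3.141592653589793; /* what a double can actually hold */

        /* the error is ~2.7e-6, enormous next to a double's ~1e-16 precision */
        printf("error: %g\n", fabs(pi_proper - pi_lazy));
        return 0;
    }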
But yes, you speak the truth - UTF-8 is better for all practical purposes because it won't break old software/code, and yet it allows all the characters you (and your customers/users) might want. Subject to matching system fonts - a rant for another day...
SMS messages are usually sent in GSM-7, aka SMSCII, a modified form of ASCII with some code points moved around and some characters represented by two-septet escape sequences. It also includes some accented characters and enough of the Greek alphabet to write in capitals in Greek, making up the remainder with Latin characters that look like Greek ones. This way, 160 7-bit characters fit into 140 8-bit bytes. And you get to use the << and >> operators.
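For anyone who doubts the << and >> part, here is roughly what the packing looks like - a sketch of my own, not the reference GSM 03.38 code:

    #include <stddef.h>
    #include <string.h>

    /* Pack 7-bit GSM septets into octets, LSB first, so that
     * 160 septets (160 * 7 = 1120 bits) fill exactly 140 bytes. */
    size_t pack_7bit(const unsigned char *septets, size_t n, unsigned char *out)
    {
        size_t bits = 0;
        memset(out, 0, (n * 7 + 7) / 8);
        for (size_t i = 0; i < n; i++) {
            size_t byte = bits / 8, off = bits % 8;
            out[byte] |= (unsigned char)(septets[i] << off);
            if (off > 1)   /* septet straddles a byte boundary */
                out[byte + 1] |= (unsigned char)(septets[i] >> (8 - off));
            bits += 7;
        }
        return (bits + 7) / 8;
    }

Packing "hello" (septets 68 65 6C 6C 6F) gives the classic E8 32 9B FD 06.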
Alternatively they can be sent in UCS-2, which is as near to UTF-16 as makes no difference; but then the message is limited to 70 characters (140 bytes at two bytes each).
There is no UTF-8 mode, though...
I use UTF-8 for everything (some Welsh characters aren't supported in the usual European sets) - but could someone please give Microsoft a good slapping? I just wasted ages trying to get data containing Welsh characters (ŵ and ŷ - see, el Reg can handle them) from an Excel spreadsheet via CSV into a MySQL DB - nightmare! Excel output to CSV can't do UTF-8. I ended up pasting into OpenOffice, then exporting.
Back in 2009 I posted a comment to the Reg containing astral plane characters (code points with a value above 0xffff). I got back an apologetic email saying that I'd broken their database and they'd had to remove them from the comment.
Some time later I found a bug in Thunderbird's treatment of astral plane characters. I tried to file a bug. Then I had to file a bug on Bugzilla complaining that it didn't handle astral plane characters properly... which was quite hard, as Bugzilla's bug tracker is also Bugzilla.
(All of these stem from the same underlying problem, which is MySQL assuming 16-bit Unicode. This is why 16-bit Unicode must die. MySQL too, of course.)
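For anyone following along at home, "astral plane" just means a code point above 0xffff, which is exactly where BMP-only (16-bit) software falls over. A sketch of the two encodings, with U+1F600 picked purely as an example:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t cp = 0x1F600;  /* an astral-plane code point */

        /* UTF-8: four bytes, no special cases */
        unsigned char u8[4] = {
            (unsigned char)(0xF0 |  (cp >> 18)),
            (unsigned char)(0x80 | ((cp >> 12) & 0x3F)),
            (unsigned char)(0x80 | ((cp >>  6) & 0x3F)),
            (unsigned char)(0x80 |  (cp        & 0x3F))
        };

        /* UTF-16: a surrogate pair -- the bit that BMP-only code forgets */
        unsigned hi = 0xD800 | ((cp - 0x10000) >> 10);
        unsigned lo = 0xDC00 | ((cp - 0x10000) & 0x3FF);

        printf("UTF-8:  %02X %02X %02X %02X\n", u8[0], u8[1], u8[2], u8[3]);
        printf("UTF-16: %04X %04X\n", hi, lo);
        return 0;
    }

That prints F0 9F 98 80 for UTF-8 and the surrogate pair D83D DE00 for UTF-16; anything that stores one 16-bit unit per character mangles the latter.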
If "Rate This Article" still existed, that line alone would have got an 11 for this article.
Can't see how to sneak it in to the office conversation yet, but I'll give it serious thought.
[Yes there are computer people that don't read El Reg. Unbelievable but true.]
The comment about printing the £ (I hated the # sign) reminded me of my DOS days. I solved it by writing a small 90-byte machine-code routine (most bytes were my credit line), loaded through config.sys, that redirected the print code to see the £ code rather than the hash code. Staff often asked me what the line
"Money added to system"
meant when they switched the machine on, but then I always did have a weird sense of humor.
When I was but a lad, the BBC internally used a variant of the CEEFAX system to carry presentation messages ("next item is", "coming out three seconds early", etc.) around the country on a video display line that was stripped out before the signal went to the transmitter.
What the character set PROM didn't have was a £ sign.
Instead of using a separate PROM or even - $deity$ help us - an EPROM, the BBC designs department in its infinite wisdom built a whole chunk of logic that recognised the £ code and told the character generator to use the top half of a C and the bottom half of an E...
I don't recall ever seeing a message that used the £ sign...
As originally specified, UTF-8 only represents code points up to 31 bits in length, so the alternative of every character taking a flat 32 bits still remains another valid, if wasteful, option.
UTF-8 is somewhat wasteful as well, often requiring three bytes instead of two, or two bytes instead of one; stateful encodings can do much better.
One has a choice: either stateful and shorter, but potentially fragile and non-self-synchronising; or UTF-8. Me? I choose UTF-8. But then I spend a large part of my programming life dealing with radio-based comms protocols, which means - by definition - I am rather strange.
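The self-synchronising property is easy to demonstrate: every byte announces what it is, so you can land anywhere in a buffer and walk back to a character boundary - something no stateful encoding can offer. A rough sketch of my own, not lifted from any particular library:

    #include <stddef.h>

    /* Length of a UTF-8 sequence from its lead byte (0 = continuation/invalid).
     * The original 31-bit design allowed 5- and 6-byte forms; RFC 3629
     * caps sequences at 4 bytes, which is what this follows. */
    int utf8_len(unsigned char b)
    {
        if (b < 0x80) return 1;   /* 0xxxxxxx: plain ASCII   */
        if (b < 0xC0) return 0;   /* 10xxxxxx: continuation  */
        if (b < 0xE0) return 2;   /* 110xxxxx                */
        if (b < 0xF0) return 3;   /* 1110xxxx                */
        if (b < 0xF8) return 4;   /* 11110xxx                */
        return 0;                 /* invalid under RFC 3629  */
    }

    /* Resynchronise: from any offset, back up over continuation
     * bytes to find the start of the current character. */
    size_t utf8_sync_back(const unsigned char *s, size_t pos)
    {
        while (pos > 0 && (s[pos] & 0xC0) == 0x80)
            pos--;
        return pos;
    }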
Oh, and it doesn't help that I spent a lot of time in my formative years having to deal with 5 channel paper tape...
Unicode is a good standard and it was written by clever guys. There's nothing wrong with Unicode's approach of mapping each character to a code point, with a separate step to encode those code points into bytes. Far better than the ugly mess of codepages that preceded Unicode.
UTF-8 is part of Unicode and it's a damn good encoding.
Well, that's a common problem with ElReg: there are many authors who have never seen anything other than the little area they work in, and believe that the whole world is like that.
They believe that email is as complex as Exchange, they believe that IPv6 is somehow amazingly difficult, and they believe that the world is still using UTF-16.
It's a bit like the people from Krikkit who, due to their dark night skies, have never seen even a glimpse of the worlds out there.
> There's nothing wrong with Unicode's approach of mapping each character to a code point
Actually, there is plenty wrong with that: you suddenly need the whole Cartesian product of the diacritics and the base characters.
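To put bytes on that: Unicode dodges the full Cartesian product with combining marks, at the price of two different byte sequences for the same visible character - which is why normalisation (NFC/NFD) exists. A small illustration (the byte values are the genuine UTF-8 encodings):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *precomposed = "\xC3\xA9";  /* U+00E9, e-acute as one code point */
        const char *decomposed  = "e\xCC\x81"; /* U+0065 + combining acute U+0301   */

        /* Same character to a reader, different bytes to strcmp */
        printf("equal as bytes? %s\n",
               strcmp(precomposed, decomposed) == 0 ? "yes" : "no");
        return 0;
    }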
The only one I would trust to come up with a "good Unicode" would be Knuth.
I remember some very old ICL and Digital machines with 6-bit bytes (being a pedant, I am using "byte" to mean the number of bits used to represent a character). One guy here still cannot type in lower-case, and I'm pretty sure he'd have a stroke if you sent him a document without a single upper-case letter in it.
But I also remember at least one of those DEC machines had a machine-code square-root instruction (although I seem to remember it being split in two to allow time-slicing).
It now makes me smile a bit that we have huge monster machines running bare-metal hypervisors, with each user having a virtual machine running its own virtual copy of Windows, loading its own virtual copy of Excel. In the past, a single machine loaded one copy of 2020, and all the users shared it. No need to load 150 copies of the same thing, all repeating the same housekeeping tasks.
"a machine-code square-root instruction (although seem to remember it being split into 2 to allow time slicing)."
VAX, perhaps. VAXes (and lots of others) can have "page faults" (or maybe other exceptions) part way through the processing of an instruction. If it's a page fault, the relevant data is loaded into memory by the OS, and the faulting instruction is resumed. If it was a potentially long-running CISC instruction (such as POLY), it may or may not need to be restarted from the beginning: if the "first part done" bit is set, the partially-executed instruction picks up where it left off; if not, it restarts from scratch.
And why am I telling you this?
Because you need to know.
I spent a lot of time fighting with text encodings when designing the .mobi file format for Mobipocket (and later Kindle). The conclusion was also that UTF-8 wins everywhere. The self-sync feature is superb. As for the "hassle" of handling a variable-length character encoding, you soon realize that:
- in most cases, you need the length of your string in bytes (for memory allocation, string copying, ...)
- cases where you need to decode UTF-8 to code points are rare, mostly when displaying those characters; and then you usually display the whole string from first to last byte, so decoding the code points in byte order is not wasteful.
- the typical case where you need to know the characters is parsing, BUT ALL keywords and ALL control characters in ALL computer languages are below code point 128, so you can actually parse UTF-8 as if it were ASCII and never care about the multi-byte encoding outside of string literals (see the sketch after this list).
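A sketch of what that buys you, assuming a language where every delimiter is ASCII (a hypothetical scanner, not Mobipocket's actual code):

    #include <stddef.h>

    /* Return the index just past the closing quote of a string literal.
     * Every byte of a multi-byte UTF-8 sequence has its high bit set, so
     * it can never be mistaken for '"' (0x22) or '\\' (0x5C): we never
     * need to decode anything. */
    size_t skip_string_literal(const unsigned char *src, size_t i, size_t len)
    {
        i++;                              /* skip the opening quote */
        while (i < len && src[i] != '"') {
            if (src[i] == '\\' && i + 1 < len)
                i++;                      /* skip the escaped byte */
            i++;                          /* multi-byte UTF-8 sails through untouched */
        }
        return i < len ? i + 1 : len;     /* past the closing quote */
    }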
So yes, UTF-8 everywhere!
I hate to be a pedant (actually, that's a lie), but it's not strictly true that all control characters are below codepoint 128. There is the ECMA-35/ISO-2022 C1 control set designated at 128-159, mirroring the C0 control set with the high bit set. This is obviously incompatible with UTF-8 though, and so not available when you have UTF-8 designated with "ESC % G".
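To spell out the incompatibility: in UTF-8 the C1 controls U+0080-U+009F come out as two-byte sequences, so the raw single bytes 0x80-0x9F never mean a control character - they can only ever be continuation bytes. A quick check (mine, not ECMA's):

    #include <stdio.h>

    int main(void)
    {
        /* U+0085 NEL, a C1 control, encoded as UTF-8: */
        unsigned char nel[2] = { 0xC2, 0x85 };  /* 110xxxxx 10xxxxxx */
        printf("U+0085 as UTF-8: %02X %02X\n", nel[0], nel[1]);
        return 0;
    }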
I'm not sure about your last point there. Lots of programming languages allow non-ASCII characters in identifiers (for example: http://golang.org/ref/spec#Identifiers), so, assuming you are not going to allow all non-ASCII characters in identifiers (Go allows 'ĉ' but not '£'), your lexer does need to identify characters. Also, you might want character constants to work beyond ASCII.
However, you typically don't need to decode UTF-8 in order to identify the end of a string constant or comment.
Discovered last week that Windows Notepad won't display all pasted UTF-8 characters - but it does preserve the binary values. So saving from Word in "TXT" UTF-8 format with an HTM suffix does appear correctly on a browser page.
Very useful for a hobby task that indexes public Facebook and YouTube postings, which can be written in just about any language. A quick screen scrape of Google Translate then combines a translation with the original.
Not just shot once, but repeatedly.
One of the principles of Unicode is to separate the character from the representation of the character. In other words, ASCII 65 (decimal) is "A". How your system chooses to display "A" is up to the system. The character is transmitted as decimal 65 no matter what the display representation is.
Unicode promptly goes on to rubbish this ideal.
Pre-Unicode Asian character sets had "full-width" representations of ASCII characters, so displays that mixed ASCII and Japanese kept their formatting: the full-width forms had the same width as the Japanese characters, while the usual ASCII characters were narrower and hence broke formatting.
Unfortunately this lives on in Unicode, shattering the idea that the display of a character is independent of its code point, because there are now two different Unicode code points that both print out a Latin-1 "A" (and likewise the rest of the alphabet, the numbers and the punctuation). In reality, the full-width "A" should not be U+FF21; it should be decimal 65, with the renderer deciding whether it should be full width or not.
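For anyone who wants to see the duplication concretely (code point values straight from the Unicode charts):

    #include <stdio.h>

    int main(void)
    {
        /* Two code points, one letter 'A':                              */
        /*   U+0041 LATIN CAPITAL LETTER A           -> UTF-8: 41        */
        /*   U+FF21 FULLWIDTH LATIN CAPITAL LETTER A -> UTF-8: EF BC A1  */
        printf("ASCII:      \x41 (U+0041)\n");
        printf("Full-width: \xEF\xBC\xA1 (U+FF21)\n");
        return 0;
    }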
This has caused me more than one problem in the past with things that sometimes correctly handle the full-width and ASCII mix and sometimes don't.