Hmmm...
Down with Unicode! Why 16 bits per character is a right pain in the ASCII
I recently experienced a Damascene conversion and, like many such converts, I am now set on a course of indiscriminate and aggressive proselytising. Ladies and gentlemen, place your ears in the amenable-to-bended position, and stand by to be swept along by the next great one-and-only true movement. The beginning In the …
-
-
Friday 4th October 2013 19:40 GMT Homer 1
Re: Hmmm...
Yes, that was my reaction too, but then I admit near-total ignorance on the subject, beyond what I've just read.
Simplistically, it seems the best solution is to implement a sufficiently large encoding length to accommodate all possible characters, which was supposedly the goal of 16-bit Unicode, except Becker naively assumed that "16 bits ought to be enough for anyone" (to paraphrase a well-known fallacy).
Again, simplistically, the answer to these "enough for anyone" fallacies would seem to be dynamic allocation, as in dynamic arrays or linked lists, which is in fact what UTF-8 does, although in its case the dynamic allocation pertains to the encoding length of each member rather than the overall length of the array, if I'm reading the descriptions correctly.
UTF-16 does that too, apparently, thus defeating its original objective, but it suffers from ASCII and endianness compatibility issues and, probably more than anything else, from Microsoft's typically half-baked implementation.
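If I'm reading it right, that per-character "dynamic allocation" is easy to see in action (a Python sketch, purely for illustration):

```python
# UTF-8 spends 1 to 4 bytes per code point, growing only when needed
assert len("A".encode("utf-8")) == 1             # plain ASCII
assert len("£".encode("utf-8")) == 2             # Latin-1 supplement
assert len("€".encode("utf-8")) == 3             # BMP beyond U+07FF
assert len("\U00010348".encode("utf-8")) == 4    # outside the BMP
```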
So UTF-8 it is, then.
-
-
-
Friday 4th October 2013 09:40 GMT Marco van de Voort
Embarcadero has in the past usually followed Windows policy fairly closely on these kinds of issues. Recently they seem to be orienting themselves more towards Objective-C and the Mac (because of their iOS offerings; only their mobile offerings are LLVM), but that is also UTF-16.
My guess is Embarcadero will adapt if their core targets adapt. Ranting against them (with baseless sentiment) is therefore useless.
That is also my problem with this whole rant. The main problem is not UTF-16, but that there are two standards, with two opposing camps. Even if you think UTF-8 is superior, if your core platforms are UTF-16 oriented, you will spend your days tilting at windmills.
-
This post has been deleted by its author
-
-
-
Saturday 5th October 2013 21:55 GMT AOD
What else to expect from something as backwards as Delphi? Pining for the days of Turbo Pascal is like pining for the days of Lisp Machines, only without sense and good taste.
If you're going to have a little rant, please get your facts straight. Delphi evolved from Turbo Pascal but it is a distinct product, and a very sophisticated one at that. Please enlighten us as to why you regard Delphi as backward. Do you have direct development experience with it that you can share, or is it just that it's non-MS and therefore can't be any good?
From experience I can tell you that when it was introduced, it brought features that gave the competition a swift kick to the happy sack, including but not limited to:
A WYSIWYG menu editor for designing your forms. The sad equivalent in VB3 was truly pitiful.
Decent object-oriented support in a strongly typed language (Object Pascal).
Support for building applications as a single EXE. No more DLLs to fling around the place if you preferred not to.
-
Monday 7th October 2013 11:32 GMT Philip Santilhano
Delphi backwards? I have a few choice Unicode characters for you!
Calling Delphi backwards ("backward" it should probably be!) shows a true ignorance of the language.
ASCII, 16-bit Unicode, UTF-8... I just wish there were one standard that was universally accepted, and not rooted in the days of 8-bit machines.
-
-
Friday 4th October 2013 09:27 GMT MartinSullivan
There's UTF-8 and utf8 in Perl
The sainted Larry claims he can keep them separate in his head, but it baffles many a poor soul like me. And it has caused me to produce the odd bit of wombat doo-doo in my time.
http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8-vs.-UTF8
I'm a former T.61 expert. Let's not go there.
-
-
Friday 4th October 2013 14:06 GMT Frumious Bandersnatch
Re: There's UTF-8 and utf8 in Perl
And there it is. My fledgling interest is learning perl. stone. cold. dead. Life is just too short to deal with so much silly.
Don't let it put you off. Unicode in Perl more or less "just works". The only times I've had problems with it have been in trying to correctly convert stuff from other code pages and broken MS document formats. That, and sometimes forgetting to tell my database that the incoming data is UTF-8 rather than ASCII (though sometimes Perl needs a hint, too, to tell it not to do a spurious conversion).
Speaking of MS documents, I find it really incredible to come across HTML on the web that obviously came from MS Word initially and that has completely messed-up rendering of some trivial glyphs (like em dashes and currency symbols). I find it hard to believe that in this day and age Word can't even convert to HTML properly. OK, so maybe the problem isn't with Word, but with the options the user selected for the conversion, but still...
-
Friday 4th October 2013 14:12 GMT Gordon 11
Re: There's UTF-8 and utf8 in Perl
Don't let it put you off. Unicode in Perl more or less "just works".
Agreed. I wrote a script recently then, at the end, remembered that some bits of the data would be coming in with things like (un)"intelligent quotes". I set about looking at what I'd need to do, only to discover that it was all being handled correctly without me having to do anything special at all.
There are a great number of Perl modules which do "what you need".
-
Saturday 5th October 2013 04:28 GMT Allan George Dyer
Re: There's UTF-8 and utf8 in Perl
@Frumious Bandersnatch, "completely messed up rendering of some trivial glyphs (like em dash and currency symbols)" - my guess would be the HTML was saved in cp1252, and the browser guessed (or was told) it was iso8859-1. They are almost the same, apart from those glyphs.
I was going to add a rant, but I couldn't decide whether it was against Microsoft's "embrace, extend, extinguish", incorrectly configured web servers, or browsers silently "being helpful" and changing the encoding they apply so that you can never figure out whether you've configured your web server correctly. Basically, Verity's right.
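The near-miss is easy to reproduce (a Python sketch, assuming only the stdlib codecs):

```python
# Word's em dash survives a cp1252 save, but a browser that guesses
# iso8859-1 turns it into an invisible C1 control character
garbled = "\u2014".encode("cp1252").decode("latin-1")
assert garbled == "\x97"        # a C1 control code, not a dash
assert garbled != "\u2014"
```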
-
Saturday 5th October 2013 11:37 GMT John Smith 19
" Frumious Bandersnatch Ignore"
"Speaking of MS documents, I find it really incredible to come across HTML on the web that obviously came from MS Word initially and that has completely messed up rendering of some trivial glyphs (like em dash and currency symbols)."
So that's the source of that annoying little f**k up.
Word --> HTML.
Thanks for that. I've always wondered. I thought it was something to do with IE not liking any web server but IIS.
Still f**king annoying.
-
-
Monday 7th October 2013 08:32 GMT Anonymous Coward
Re: There's UTF-8 and utf8 in Perl
And there it is. My fledgling interest is learning perl. stone. cold. dead. Life is just too short to deal with so much silly.
Nah, Perl 5 isn't so bad. I still find it easier to bash out a quick hack in Perl than in, say, Python, or some other slightly less baroque language like Ruby... CPAN is something of a killer app.
Now, Perl 6 on the other hand... I don't know if its designers coined the word "twigil" and the concept it describes, but whoever did surely deserves a terrible, lasting punishment.
-
-
Tuesday 8th October 2013 17:42 GMT Michael Wojcik
Re: There's UTF-8 and utf8 in Perl
I'm no fan of Perl, but the utf8 / UTF-8 distinction is probably the best solution to a real problem.
Perl's original UTF-8 implementation ("utf8") was created before the format was standardized. Broadly speaking, it follows Postel's Interoperability Principle, and allows many sequences that were forbidden by the standard when it was finalized. That made it easier for people to start using UTF-8 with Perl.
Those sequences - such as non-minimal encodings - have bad security implications. They make it too easy to slip malicious data past poorly-designed filters (i.e., most filters), for example.
The later UTF-8 implementation follows the spec. It's good to have an implementation that follows the spec, and it's especially good when that implementation is a lot safer than the overly-permissive one it supersedes. But if Perl had simply dropped "utf8", it would have broken at least some old programs; and if it had made "utf8" a synonym for "UTF-8", some old data would have been rejected.
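The security point is easy to demonstrate: a spec-conforming decoder (Python's strict one, in this sketch) refuses overlong sequences outright:

```python
# 0xC0 0xAF is an overlong (non-minimal) encoding of '/' (0x2F).
# A permissive decoder that accepted it would let "../" sneak past
# a filter that only looks for the literal byte 0x2F.
try:
    b"\xc0\xaf".decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False

assert not accepted
assert b"\x2f".decode("utf-8") == "/"   # the minimal form is fine
```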
-
-
-
Saturday 5th October 2013 15:56 GMT Kristian Walsh
Re: I don't have a gripe against utf-8
Indeed. UTF-8 strings will sort in codepoint order if you give them to strcmp(), which is as good and bad as its behaviour in ASCII. However, anyone who thinks that strcmp() sorts strings in "alphabetical" order is at best living in a dream-world, or at worst, a hopeless xenophobe.
As for the other "problem", that of length: strlen() returns the length of a UTF-8 string in bytes, and outside of font rendering engines, that is all you ever need to know to write proper text-processing code. Any argument to the contrary is based on a misplaced notion that somehow a byte is a character (I blame C).
"Character" is a very slippery definition: it is language-sensitive ("rijstafel" is 9 letters long if you're English; only eight if you're Dutch), and doesn't always correspond to the number of symbols the user sees anyway. When the five codes 's','o','e','u','r' are rendered as the four glyphs "sœur", how many characters really are in the string? (Both answers are equally right and wrong, btw.)
-
Tuesday 8th October 2013 17:47 GMT Michael Wojcik
Re: I don't have a gripe against utf-8
Any argument to the contrary is based on a misplaced notion that somehow a byte is a character (I blame C).
It's not C's fault. In C, a byte is a character (ISO 9899-1999 3.7.1). It's the fault of programmers who don't understand that a "character" in C is not the same as a "character" in some arbitrary natural-language writing system. (Generally, these are the same people who don't understand that a "byte" in C is not an octet.)
-
-
-
-
-
This post has been deleted by its author
-
-
-
Friday 4th October 2013 11:12 GMT Philip Lewis
Re: @Verity
Indeed, one doesn't, and one is tempted to bemoan this lamentable situation. There was a time past when one could gleefully indulge in the subjective (and indeed the reflexive) willy nilly as it were, to one's own inner satisfaction and general merriment of all and sundry. One likes to avail oneself of these forms, if only for the inner satisfaction of reaffirming their very existence. Alas, such linguistic beauties have fallen away in our language, shunned by the masses for whom grammar is a small town in Eastern Prussia.
-
-
This post has been deleted by its author
-
-
-
-
Friday 4th October 2013 12:10 GMT Andrew Yeomans
The historical accident of little-endian
On a purely technical basis, little-endian representations of numbers are much easier to parse and handle. I mean proper numbers, not the arbitrary computer representations. Take the number 12345675679274658. Quick now: is that one quadrillion, twelve quadrillion, 123 trillion, or what? You are going to have to do a right-to-left scan of the number to find out.
The Arabs had it all sorted out, with little-ended numbers (written right-to-left, of course). But when the West appropriated the idea a few centuries ago, they omitted to reflect them when converting between the Arabic right-to-left and Western left-to-right writing directions. So we've ended up with the current confusion.
Oh well, it could have been worse. We might have been using Roman numerals still, with no zero, if it hadn't been for the Arabs.
-
Friday 4th October 2013 14:23 GMT Frumious Bandersnatch
Re: The historical accident of little-endian
On a purely technical basis, little-endian representations of numbers are much easier to parse and handle. I mean proper numbers, not the arbitrary computer representations. Take the number 12345675679274658. Quick now: is that one quadrillion, twelve quadrillion, 123 trillion, or what? You are going to have to do a right-to-left scan of the number to find out.
Huh? That makes no sense:
* easier to parse? in all the (human, natural) languages that I know of, we start with the biggest quantity and work down (even in expressions like "four score and 7", "vingt et un" and "eleventy one")
* is that quadrillion, ... : you don't have to scan right to left---you just count how many digits there are (and last I checked, counting left to right gives the same answer as counting the other way)
You should have icon privileges revoked for such a silly post.
-
-
-
Saturday 5th October 2013 11:22 GMT ratfox
4th of October, 2013
To be honest, all proper coders know that log files should be formatted as in log_2013_10_04_23_59_59.txt
And every reader of XKCD knows this too.
-
-
-
Friday 4th October 2013 15:05 GMT Daniel B.
Re: The historical accident of little-endian
"In all the (human, natural) languages that I know of, we start with the biggest quantity and work down (even in expressions like "four score and 7", "vingt et un" and "eleventy one")"
German uses little-endian for numbers < 100, though. "Zwei und vierzig". Quick, what number is that?
-
Friday 4th October 2013 17:36 GMT frobnicate
Re: The historical accident of little-endian
Historically, numerals in almost all languages are little-endian, from "thirteen" (3+10) to "five and twenty". Operations like addition are performed from the least to the most significant digit, and new digits are added at the most significant side. It is unnatural to do this right-to-left in an otherwise left-to-right writing system. Because of this, one often finds oneself in the pain of printing a column of numbers right-adjusted (the only reasonable way to do it, so that scale is immediately visible).
Compare this with another ridiculous right-to-left vestige: the mathematical notation for function composition, f(g(x)), so cumbersome that mathematicians who compose functions a lot (e.g., in category theory) adopt notation from programmers and write "g;f". But that at least we can blame on the bad vodka Euler had. Fibonacci and his ilk, who gave us big-endian numerals, have no excuse.
PS: the argument about "starting with the biggest quantity" makes no sense, because the Arabs, who invented the thing, read from right to left and hence start with the least significant digit. Which was no hindrance to Arab mathematics.
-
-
Saturday 5th October 2013 22:11 GMT Destroy All Monsters
Re: The historical accident of little-endian
Compare this with another ridiculous right-to-left vestige: the mathematical notation for function composition: f(g(x)), so cumbersome that mathematicians composing functions a lot (e.g., in category theory) adopt notation from programmers
I think you cannot into math.
It's written f∘g (x), with the ∘ generally being a bog-standard multiplication sign.
"Notation from programmers", indeed. Pchao.
-
-
-
Monday 7th October 2013 17:25 GMT cordwainer 1
Re: The historical accident of little-endian
Even as a non-programmer gasping for air attempting to follow these comments, I at least understand the difference between scanning that number right-to-left as opposed to left-to-right.
If one were going to approach it as a mathematical amateur - i.e., insert the commas that mark off the 1000s - one cannot count off three digits at a time from the left. One must start at the right and insert a comma every three digits.
Yes, one can count them all, but why would one? Doing it right-to-left is how the average person would divide up the number so it makes sense.
For example, it's how most non-programmers and non-mathematicians approach a number such as 10000000. Quick, is it 1 million or 10 million? Go right-to-left three digits at a time, and you'll know a lot faster than if you try to approach it left-to-right.
Anyway, that's what I got from the comment, and so when you write, "...that makes no sense", I have to say, "Uh, yes, it does make sense". It may not be how YOU do it, and it may not be how an "expert" does it. But it does make sense.
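For what it's worth, right-to-left grouping is exactly what number-formatting routines do; a Python sketch:

```python
# thousands separators are inserted counting from the
# least-significant digit, i.e. right to left
assert f"{10000000:,}" == "10,000,000"
assert f"{12345675679274658:,}" == "12,345,675,679,274,658"
```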
-
Tuesday 8th October 2013 18:15 GMT Michael Wojcik
Re: The historical accident of little-endian
I have to say, "Uh, yes, it does make sense". It may not be how YOU do it, and it may not be how an "expert" does it. But it does make sense.
Very well, how about "makes sense, but is wildly overstated"?
Go right-to-left 3 digits at a time, and you'll know a lot faster than if you try to approach it left-to-right
"a lot faster" is a ridiculous exaggeration. If I need to know the magnitude of some number written as a string of digits in Arabic notation - a task which I must admit does not come upon me all that often - and it's too long to simply apprehend the number digits at a glance, I'm perfectly happy to count left-to-right and convert it to scientific notation in my head. Problem solved. Going right-to-left with a three-digit stride is unlikely to be significantly faster.
More importantly, how often does this come up for the vast majority of people? Who devotes a significant portion of their life to visually determining the magnitude of printed numbers?
That's my argument with Andrew Yeomans; in his original post, he claimed "little endian representations of numbers are much easier to parse and handle". I've yet to see anyone making any sort of argument that could justify that adverb "much" - at best it's a trivial advantage - but in any event I suspect Yeomans spent as much time composing that post as he's lost in the past year, perhaps in the past decade, to inefficiencies in his number-parsing responsibilities, whatever those might be.
-
-
-
-
Friday 4th October 2013 12:47 GMT Roland6
Re: A good article, but...
A total absence of the work done in ISO on character sets in the late 80's and early 90's resulting in ISO 10646. A project I was involved with in the late 80's was to do with multiple character set handling on DEC VT220/240's, so I got very familiar with ISO 646, 2022, 8859 and 6429... To me both Unicode and UTF-8 left things to be desired, even though they were much simpler...
-
Friday 4th October 2013 14:14 GMT Irongut
Don't get me started on endianness. I regularly work with a file format that includes both big- and little-endian numbers in the same data structure! What a fecking nightmare that is. I have to drag the spec out to check my code every time; there is no way to know which number should be in which format otherwise.
-
Friday 4th October 2013 10:09 GMT Steve Davies 3
Getting rid of UTF????
Welcome to the wonderful world of gibberish
It gets worse
I've seen HTML with the charset declared as UTF-8 but with the body encoded as EBCDIC. Doh!
Seriously, as someone who writes software that is used in many countries, it is SOP to use UTF-8 for everything. By insisting on that, at least we don't have to mess with the horrible Microsoft code pages for languages like Kazakh and Uzbek.
We switch to UTF-16 for China and Japan, but I will agree with you that the -16 implementations are broken. At least with UTF-8 you don't have to worry too much about endianness, but with -16 you do, and many implementations only work half the time... :)
-
Friday 4th October 2013 15:14 GMT Michael H.F. Wilkinson
Re: Getting rid of UTF????
Hmm, EBCDIC
Now that takes me back
Back to the days of our CDC computer with its 6-bit bytes organized into 60 bit words using A STUPID FORM OF ASCII MORE-OR-LESS BUT WITH ONLY CAPITALS
Bliss? no, not at all. At least we no longer had to work with punched cards
Icon? Closest thing to "old git in reverie mode" icon
-
-
Friday 4th October 2013 22:13 GMT Anonymous Coward
Re: Getting rid of UTF????
"raise you DEC RADIX-50"
ANY SYMBOL WHOSE NAME CANNOT BE EXPRESSED IN THE CHARACTERS A TO Z OR 0 TO 9 OR DOT SPACE AND DOLLAR IS NOT WORTH A SHEET.
OF GREEN AND WHITE LINE PRINTER PAPER.
OBVIOUSLY.
NOW WHERE DID I LEAVE MY TELETYPE RIBBON. I THINK IT WAS BY THAT SHINY NEW LA36.
http://wickensonline.co.uk/declegacy/
-
-
-
-
-
-
Sunday 6th October 2013 16:10 GMT rleigh
Re: Agreement
While the BOM shouldn't matter, in many places it does in practice. A couple of examples:
Shell scripts starting with #!/bin/sh (or perl, python, etc.). The presence of the BOM changes the starting bytes of the file, making the shebang non-functional. Every tool handling shebangs would need patching to cope with this variant.
Concatenation of files containing BOMs. This leaves you with BOMs spread throughout the data stream. You then need to make sure that every tool handling the data can filter out or ignore BOMs. You can't usually do that either since you might have non-UTF8 binary data in the stream and stripping them out after the fact would mangle the data.
For these and other reasons, the simplest and most reliable solution is to never ever put BOMs in UTF-8 data. Shame on Microsoft for saving UTF-8 text with BOMs by default...
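A Python sketch of the breakage (assuming nothing beyond the standard codecs):

```python
# a UTF-8 BOM pushes '#!' off the first two bytes of the file
script = "\ufeff#!/bin/sh\necho hello\n".encode("utf-8")
assert script[:3] == b"\xef\xbb\xbf"   # the BOM, as bytes on disk
assert not script.startswith(b"#!")    # the kernel sees no shebang

# concatenating such files scatters BOMs through the stream
combined = script + script
assert combined.count(b"\xef\xbb\xbf") == 2
```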
-
-
-
Friday 4th October 2013 10:22 GMT MacroRodent
Re: Make 'em pay
UTF-8 actually works pretty well for languages that use some variant of the Latin alphabet: a 2-byte sequence is needed every few characters, but the text does not actually expand much. As a Finnish speaker, with my ä:s and ö:s, I can live with it. But I could imagine the Chinese rebelling again. Don't they need 3 or 4 bytes per character all the time?
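Roughly, yes; a quick Python sketch of the costs:

```python
# Finnish vowels cost 2 bytes each in UTF-8; CJK ideographs cost 3
assert len("ä".encode("utf-8")) == 2
assert len("ö".encode("utf-8")) == 2
assert all(len(ch.encode("utf-8")) == 3 for ch in "你好世界")
```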
-
Friday 4th October 2013 18:29 GMT Ken Hagan
Re: Make 'em pay
"But I could imagine the Chinese rebelling again. Don't they need 3 or 4 bytes per character all the time?"
It's more like 3 or 4 bytes per syllable, and so actually they may have lower bandwidth costs. But text in any language is a prime candidate for compression during transmission, so everyone's bandwidth costs should be fairly similar for messages with the same semantic content.
-
Sunday 6th October 2013 02:08 GMT James Anderson
Re: Make 'em pay
They should pay, as it's them who created the problem. A cold war standoff between the People's Republic of China and the rest of the Chinese-speaking world led to the whole kit and caboodle being coded twice: once as "Traditional Chinese" as written in Hong Kong, Taiwan and Singapore, and once as a "People's Script" as used in the PRC.
Worse, having discovered they could play with the standard -- well, they continued to play. The premier with the big glasses whose name everybody forgets insisted that his family's rendition of his name got added in. In retaliation, the capitalist faction, as represented by the Hongkong and Shanghai Bank, got its trademark calligraphic rendering of the characters for Shanghai and Hong Kong its very own code points.
Can you imagine the outcry if "Oor Alec" demanded a separate set of code points so an independent Scotland could use an alphabet free and independent of the English? Or if a certain hamburger company asked for their rendering of the letter M to be given its very own code point?
-
Friday 4th October 2013 10:24 GMT Paul Crawford
Re: Make 'em pay
No, it is down to reverse compatibility which is a BIG THING given the millions of lines of code written pre-Unicode/UTF-8.
Basically, in order to work, the single-byte values have to map to the old ASCII set (which is 7-bit due to the old parity issues from the serial-comms days), and the 2/3/4-byte sequences cover everything else (including the "extended ASCII" of the original IBM PC, with the £ symbol and similar, which you might think is 'imperial').
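The compatibility property in a nutshell (a Python sketch):

```python
# pure ASCII text is byte-for-byte identical in UTF-8...
s = "plain old ASCII"
assert s.encode("utf-8") == s.encode("ascii")

# ...and every byte of a multi-byte sequence has the high bit set,
# so 7-bit-clean legacy code never mistakes one for an ASCII character
assert all(b >= 0x80 for b in "£".encode("utf-8"))
```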
-
Friday 4th October 2013 15:36 GMT John Sanders
Re: Make 'em pay
We Westerners invented modern computing (it was mainly British and American scientists), and the restrictions of the day made provisioning resources for languages that use 100,000+ characters a silly proposition.
I cannot imagine coding in Chinese; neither, apparently, can the Chinese.
-
-
-
Friday 4th October 2013 10:34 GMT sorry, what?
ASCIIart and technological backwaters
Yes, you can write any language you choose (even those that they have on all the Star Trek, Star Wars and Stargate displays) using ASCII as long as you are happy to have one or two characters per page and construct it all as ASCIIart.
I had been a Java bean for over a decade, up until the start of the year when I found a job with an M$-based outfit. Something I rapidly spotted was how backwards so much of the M$ technology is. Don't get me wrong, there are some cool things too, but this discussion about UTF-8 as if it were something new and wondrous, and how tricky it is to use with certain platforms and languages, seems like something from the late 90s, not the 2010s!
-
-
Friday 4th October 2013 10:15 GMT Paul Crawford
Cardinal sin of computing
The fact that some programmer, in an attempt to show the "benefit of Unicode", should use a 'double' variable for PI and only give 6 figures tells you they should be executed and their programs not!
But yes, you speak the truth - UTF-8 is better for all practical reasons because it won't break old software/code and yet it allows all characters you (and your customers/users) might want. Subject to matching system fonts - a rant for another day...
-
-
Sunday 6th October 2013 19:26 GMT A J Stiles
SMSs are NOT UTF-8!
SMS messages are usually sent in GSM-7 (aka SMSCII), a modified form of ASCII with some code points moved around and some characters represented by two-septet escape sequences; it also includes some accented characters and enough of the Greek alphabet to be able to write Greek in capitals, making up the remainder with Latin characters that look like Greek ones. This way, 160 7-bit characters can fit into 140 8-bit bytes. And you get to use the << and >> operators.
Alternatively they can be sent in UCS-2, which is as near enough to UTF-16 as makes no difference; but then the message is limited to 70 characters.
There is no UTF-8 mode, though .....
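The arithmetic behind those limits, for what it's worth:

```python
# GSM-7 packs 7-bit characters into 8-bit octets
assert 140 * 8 // 7 == 160   # septets that fit in a 140-octet SMS
# UCS-2 spends two octets per character
assert 140 // 2 == 70
```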
-
Friday 4th October 2013 10:39 GMT Pen-y-gors
Good idea
I use UTF-8 for everything (some Welsh characters aren't supported in the usual European sets) - but could someone please give Microsoft a good slapping? I just wasted ages trying to get data containing Welsh characters (ŵ and ŷ - see, el Reg can handle them) from an Excel spreadsheet via CSV into a MySQL DB - nightmare! Excel output to CSV can't do UTF-8. I ended up pasting into OpenOffice, then exporting.
-
Friday 4th October 2013 11:11 GMT David Given
Re: Good idea
Back in 2009 I posted a comment to the Reg containing astral plane characters (code points with a value above 0xffff). I got back an apologetic email saying that I'd broken their database and they'd had to remove them from the comment.
Some time later I found a bug in Thunderbird's treatment of astral plane characters. I tried to file a bug. Then I had to file a bug on Bugzilla complaining that it didn't handle astral plane characters properly... which was quite hard, as Bugzilla's bug tracker is also Bugzilla.
(All of these stem from the same underlying problem, which is MySQL assuming 16-bit Unicode. This is why 16-bit Unicode must die. MySQL too, of course.)
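The astral-plane trap in miniature (a Python sketch): MySQL's old "utf8" column type stored at most three bytes per character, while anything above U+FFFF needs four in UTF-8 (and a surrogate pair in UTF-16):

```python
clef = "\U0001d11e"          # MUSICAL SYMBOL G CLEF, an astral character
assert ord(clef) > 0xFFFF
assert len(clef.encode("utf-8")) == 4      # one byte too many for old "utf8"
assert len(clef.encode("utf-16-le")) == 4  # i.e. a surrogate pair
```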
-
-
-
Friday 4th October 2013 22:20 GMT Anonymous Coward
Re: "All those extra holes made it easier to air cool in-memory databases."
If "Rate This Article" still existed, that line alone would have got an 11 for this article.
Can't see how to sneak it in to the office conversation yet, but I'll give it serious thought.
[Yes there are computer people that don't read El Reg. Unbelievable but true.]
-
-
Friday 4th October 2013 11:32 GMT miket82
Machine code £
The comment about printing the £ (I hated the # sign) reminded me of my DOS days. I solved it by writing a small 90-byte machine-code routine (most of the bytes were my credit line), loaded through config.sys, that redirected the print code to substitute the £ code for the hash code. Staff often asked me what the line
"Money added to system"
meant when they switched the machine on but then I always did have a weird sense of humor.
-
Sunday 6th October 2013 16:20 GMT Neil Barnes
Re: Machine code £
When I was but a lad, the BBC used internally a variant of the CEEFAX system to carry presentation messages (next item is, coming out three seconds early, etc.) around the country on a video display line that was stripped out before the signal went to the transmitter.
What the character set PROM didn't have was a £ sign.
Instead of using a separate PROM or even $deity$ help us an EPROM, the BBC designs department in its infinite wisdom built a whole chunk of logic that recognised the £ code and told the character generator to use the top half of a C and the bottom half of an E...
I don't recall ever seeing a message that used the £ sign...
-
-
Friday 4th October 2013 11:32 GMT John Savard
Fixed Length
As UTF-8 was originally specified to represent code points up to 31 bits in length, the alternative of every character taking 32 bits still remains another valid, if wasteful, option.
UTF-8 is somewhat wasteful as well, often requiring three bytes instead of two, or two bytes instead of one; stateful encodings can do much better.
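The trade-off in numbers (a Python sketch):

```python
s = "naïve £5"
assert len(s.encode("utf-32-le")) == 4 * len(s)   # fixed cost: 4 bytes each
assert len(s.encode("utf-8")) == 10               # ï and £ take 2, the rest 1
```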
-
Friday 4th October 2013 11:45 GMT Frederic Bloggs
Re: Fixed Length
One has a choice: either stateful (shorter, but potentially fragile and non-self-synchronising) or UTF-8. Me? I choose UTF-8. But then I spend a large part of my programming life dealing with radio-based comms protocols, which means - by definition - that I am rather strange.
Oh, and it doesn't help that I spent a lot of time in my formative years having to deal with 5 channel paper tape...
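Self-synchronisation in miniature; a Python sketch (resync here is my own toy helper, not a library function):

```python
def resync(data: bytes) -> bytes:
    """Skip orphaned continuation bytes (10xxxxxx) at the start."""
    i = 0
    while i < len(data) and 0x80 <= data[i] < 0xC0:
        i += 1
    return data[i:]

# drop into a stream one byte late: '€' is 0xE2 0x82 0xAC
stream = "€42".encode("utf-8")[1:]
assert resync(stream) == b"42"   # back in sync at the next character
```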
-
-
Friday 4th October 2013 11:41 GMT artbristol
Should be titled "Down with UTF-16"
Unicode is a good standard and it was written by clever guys. There's nothing wrong with Unicode's approach of mapping each character to a code point, and adding an intermediate step requiring encoding it into bytes. Far better than the ugly mess of codepages that preceded Unicode.
UTF-8 is part of Unicode and it's a damn good encoding.
-
Friday 4th October 2013 16:05 GMT Christian Berger
Re: Should be titled "Down with UTF-16"
Well that's a common problem with ElReg, there are many authors who have never seen anything else than the little area they work in and believe that the whole world is like this.
They believe that E-Mail is as complex as Exchange, they believe that somehow IPv6 is amazingly difficult, and they believe that the world is still using UTF-16.
It's a bit like the people of Krikkit who, due to their dark night skies, have never seen even a glimpse of the worlds out there.
-
Sunday 6th October 2013 10:10 GMT Destroy All Monsters
Re: Should be titled "Down with UTF-16"
> There's nothing wrong with Unicode's approach of mapping each character to a code point
Actually there is plenty wrong with that, because then you suddenly need the whole Cartesian product of diacritics and base characters.
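To be fair, the standard dodged the full Cartesian product by keeping combining marks alongside the precomposed characters; a Python sketch using the stdlib unicodedata:

```python
import unicodedata

composed = "\u00e9"                                   # é as one code point
decomposed = unicodedata.normalize("NFD", composed)   # e + combining acute
assert decomposed == "e\u0301"
assert unicodedata.normalize("NFC", decomposed) == composed
```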
The only one who I would trust to come up with a "good Unicode" would be Knuth.
-
-
-
-
Friday 4th October 2013 13:44 GMT BristolBachelor
Re: Control Data had it right
I remember some very old ICL and Digital machines with 6-bit bytes (being a pedant I am using Byte as the number of bits to represent a character). One guy here still cannot type in lower-case, and I'm pretty sure he'd have a stroke if you sent him a document without a single upper-case letter in it.
But I also remember at least one of those Dec machines had a machine-code square-root instruction (although seem to remember it being split into 2 to allow time slicing).
It now makes me smile a bit that we have huge monster machines running bare-metal hypervisors, with each user having a virtual machine running its own virtual copy of Windows, loading its own virtual copy of Excel. In the past, a single machine loaded one copy of 2020, and all the users shared it. No need to load 150 copies of the same thing, all repeating the same housekeeping tasks.
-
Friday 4th October 2013 22:28 GMT Anonymous Coward
Re: Control Data had it right
"a machine-code square-root instruction (although seem to remember it being split into 2 to allow time slicing)."
VAX, perhaps. VAXes (and lots of others) can take "page faults" (or other exceptions) part-way through the processing of an instruction. If it's a page fault, the relevant data is loaded into memory by the OS and the faulting instruction is resumed. A potentially long-running CISC instruction (such as POLY) may or may not need to be restarted from the beginning: if the "first part done" bit is set, the instruction resumes where it left off; if it isn't set, the instruction restarts from the beginning, because its first part hadn't completed.
And why am I telling you this?
Because you need to know.
-
-
-
-
Friday 4th October 2013 12:20 GMT clean_state
Bravo!
I spent a lot of time fighting with text encodings when designing the .mobi file format for Mobipocket (and later Kindle). The conclusion was also that UTF-8 wins everywhere. The self-sync feature is superb. As for the "hassle" of handling a variable-length character encoding, you soon realize that:
- in most cases, you need the length of your string in bytes (for memory allocation, string copying, ...)
- cases where you need to decode UTF-8 to code points are rare, mostly when you display those characters and then you usually display the whole string from first to last byte so going through the bytes in sequence to decode the code points is not wasteful.
- the typical case when you do need to know the characters is parsing, BUT all keywords and all control characters in all computer languages are below code point 128, so you can actually parse UTF-8 as if it were ASCII and never care about the multi-byte encoding outside of string literals.
So yes, UTF-8 everywhere!
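That last point can be made concrete with a short sketch (a hypothetical `split` helper, Java assumed): every byte of a multi-byte UTF-8 sequence has its high bit set, so a scan for an ASCII delimiter such as ',' can treat the byte array as if it were ASCII and will never match in the middle of a multi-byte character.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitUtf8 {
    // Split a UTF-8 byte array on an ASCII delimiter without decoding it.
    // Safe because every byte of a multi-byte sequence is >= 0x80, so an
    // ASCII delimiter byte can never occur inside one.
    static List<String> split(byte[] utf8, byte delim) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < utf8.length; i++) {
            if (utf8[i] == delim) {
                parts.add(new String(utf8, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        parts.add(new String(utf8, start, utf8.length - start, StandardCharsets.UTF_8));
        return parts;
    }

    public static void main(String[] args) {
        // Multi-byte characters pass through the byte-level scan untouched.
        byte[] line = "naïve,café,日本語".getBytes(StandardCharsets.UTF_8);
        System.out.println(split(line, (byte) ','));  // [naïve, café, 日本語]
    }
}
```

The same property is what lets byte-oriented tools (grep, strtok-style parsers) work on UTF-8 unmodified, as long as the delimiters they care about are ASCII.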
-
Friday 4th October 2013 23:16 GMT rleigh
Re: Bravo!
I hate to be a pedant (actually, that's a lie), but it's not strictly true that all control characters are below codepoint 128. There is the ECMA-35/ISO-2022 C1 control set designated at 128-159, mirroring the C0 control set with the high bit set. This is obviously incompatible with UTF-8 though, and so not available when you have UTF-8 designated with "ESC % G".
-
Monday 7th October 2013 13:21 GMT Anonymous Coward
Re: Bravo!
I'm not sure about your last point there. Lots of programming languages allow non-ASCII characters in identifiers (for example: http://golang.org/ref/spec#Identifiers), so, assuming you are not going to allow all non-ASCII characters in identifiers (Go allows 'ĉ' but not '£'), your lexer does need to identify characters. Also, you might want character constants to work beyond ASCII.
However, you typically don't need to decode UTF-8 in order to identify the end of a string constant or comment.
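As a sketch of that lexing point (this is not Go's actual lexer, just an illustration): Java's Character class already knows the Unicode general categories, so a lexer can accept letters such as 'ĉ' while rejecting symbols such as '£' without carrying decoding tables of its own.

```java
public class IdentChar {
    // Rough sketch of a Go-style rule for the first character of an
    // identifier: a Unicode letter or an underscore. '£' is a currency
    // symbol (category Sc), not a letter, so it is rejected.
    static boolean canStartIdentifier(int codePoint) {
        return codePoint == '_' || Character.isLetter(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(canStartIdentifier('ĉ'));  // true
        System.out.println(canStartIdentifier('£'));  // false
        System.out.println(canStartIdentifier('7'));  // false: digits cannot start one
    }
}
```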
-
-
Friday 4th October 2013 13:09 GMT Anonymous Coward
Global posts
Discovered last week that Windows Notepad won't display all pasted UTF-8 characters - but it does preserve the binary values. So saving from Word in "TXT" UTF-8 format with an HTM suffix does appear correctly on a browser page.
Very useful for a hobby task that indexes public Facebook and YouTube postings, which can be written in just about any language. A quick screen scrape of Google Translate then combines a translation with the original.
-
Friday 4th October 2013 14:57 GMT An0n C0w4rd
Unicode needs to be taken out back and shot
Not just shot once, but repeatedly.
One of the principles of Unicode is to separate the character from the representation of the character. In other words, ASCII 65 (decimal) is "A". How your system chooses to display "A" is up to the system. The character is transmitted as decimal 65 no matter what the display representation is.
Unicode promptly goes on to rubbish this ideal.
Pre-Unicode Asian character sets had "full-width" representations of ASCII characters, so that displays mixing ASCII and Japanese kept their alignment: the full-width forms matched the width of the Japanese characters, while ordinary ASCII characters were narrower and broke the formatting.
Unfortunately this lives on in Unicode, shattering the idea that the display of the character is independent of the code point of the character because there are now two different Unicode code points that both print out a Latin-1 "A" (and also the rest of the alphabet and numbers and punctuation). In reality, the full width "A" should not be U+FF21, it should be decimal 65 with the renderer deciding if it should be full width or not.
This has caused me more than one problem in the past with things that sometimes correctly handle the full-width and ASCII mix and sometimes don't.
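For what it's worth, Unicode does acknowledge the duplication: the full-width forms exist for round-trip compatibility with the older Asian encodings, and NFKC compatibility normalisation folds them back onto their ASCII counterparts, which is the usual fix before comparing. A minimal Java sketch:

```java
import java.text.Normalizer;

public class FullWidth {
    public static void main(String[] args) {
        String fullWidth = "\uFF21";  // 'Ａ', FULLWIDTH LATIN CAPITAL LETTER A
        String ascii = "A";           // U+0041

        // Distinct code points, so naive comparisons fail...
        System.out.println(fullWidth.equals(ascii));  // false

        // ...but NFKC normalisation maps the full-width form back to
        // plain ASCII.
        String folded = Normalizer.normalize(fullWidth, Normalizer.Form.NFKC);
        System.out.println(folded.equals(ascii));     // true
    }
}
```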
-
Friday 4th October 2013 15:54 GMT albaleo
Re: Unicode needs to be taken out back and shot
"the full width "A" should not be U+FF21, it should be decimal 65 with the renderer deciding if it should be full width or not."
I'm not sure I agree. How does the renderer decide? For example, in the following (if it displays), where English and Japanese are mixed, and the second upper-case A is part of a Japanese name.
A great future at Aテック.
-
Saturday 5th October 2013 01:29 GMT Steve Knox
Re: Unicode needs to be taken out back and shot
Rule 1: If the 'A' is part of a word which contains Japanese characters, use full-width to be compatible with the rest of the word to which it belongs. This covers your example.
However, it does not cover all other possibilities.
Rule 2: If the 'A' is part of a word consisting entirely of English characters, but which is nonetheless part of a sentence which primarily consists of Japanese words, use the full-width.
This rule may be and may need to be generalized to paragraph, section, even document level depending on the particular use case.
Otherwise, proportional should be acceptable, if not preferred.
NB to be fully international and general it would probably be best to replace "Japanese" and "English" with "full/fixed-width alphabet" and "variable/proportional-width alphabet" (or some similar even more appropriate terminology) in the preceding.
-
Saturday 5th October 2013 14:39 GMT Anonymous Coward
Re: Unicode needs to be taken out back and shot
Not sure your rules apply either, really... let's have a quick browse of the arcade cabinets and control panels section of Yahoo Auctions, for example. There are some power supplies for sale; one seller has written DCパック but someone else has written DCパック. Some people use full-width Latin for the starting prices, some people don't. Some people have even managed to mix half- and full-width Latin in the same word or number. My wife's computer seems to default to full-width Latin for everything, whereas the input method on my machine doesn't use full width for anything unless I go all the way down to the bottom of the candidates in the selection window.
-
-
-
-
Friday 4th October 2013 15:15 GMT Anonymous Coward
Java :D
Remember, dear Java coders, to specify a character set whenever you convert a String to bytes, or when using a reader or writer that implicitly does so on an underlying byte stream. Otherwise your default platform encoding will be used instead, and who knows if that is set to the same thing across multiple servers, or servers and clients...
And yes, you'll soon come to curse them for making UnsupportedEncodingException a checked exception, as if there was something you could do to recover from it.
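Since Java 7 the java.nio.charset.StandardCharsets constants address both complaints: the getBytes(Charset) and new String(byte[], Charset) overloads take a Charset object rather than a name, so the checked UnsupportedEncodingException never arises. A minimal sketch:

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    public static void main(String[] args) {
        String s = "héllo";

        // Explicit charset: the same bytes on every platform.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);  // 6: 'é' takes two bytes in UTF-8

        // Round-trip with the same explicit charset.
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(back.equals(s));  // true

        // By contrast, s.getBytes() with no argument uses the platform
        // default encoding and can produce different bytes on different
        // machines -- exactly the multi-server hazard described above.
    }
}
```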
-
Friday 4th October 2013 15:46 GMT Christian Berger
Wait? There are still people using 16-Bit characters?!
I'm sorry, but the last time I've seen those in the wild was around 2000. Back then Microsoft had a short spell with that. Nobody uses 16 bit character codes for any sort of external representation any more as people have found out it doesn't work for the very thing they were invented for, eliminating code pages.
-
Friday 4th October 2013 16:12 GMT Stevie
Bah!
Eschew all of the above and use only FIELDATA.
a) It does everything you need for programming a real computer, and does it in 3/4 the space by simply acknowledging that Johnny Foreigner doesn't matter.
2) It is also a friendly encoding scheme in that it only has caps. If everyone is shouting, no-one is. Thus a major annoyance on the intarwebs is removed as if it had never been.
#) Brought to you by Univac, proper computers for real programmers. Remember: If you can pick it up without a crane, it isn't a real computer, it's a toy. Don't put your important software on toy computers.
Plus: OS2200, a mature, secure operating system with utilities people actually fixed in a timely manner, so there are *no* "known bugs" still dragging their arse into theatre twenty-five years on, and no buffer-overrun attack scripts available on the web for the asking. Unix or Windows? Don't make me laugh.
-
Friday 4th October 2013 17:23 GMT Herby
Could be worse?
We could all be using a Baudot coding scheme. They use 5 bits per character and include LTRS and FIGS shifts. Only the alphabet was encoded in the LTRS shift, and "special" characters were encoded in the FIGS shift.
A total of 26 LTRS, 26 FIGS, and CR, LF, FIGS, LTRS, SPACE, NULL.
A nice total of 55 actual code points, as you couldn't count FIGS, LTRS, or NULL.
Of course if you go back further, you were limited to a 48 character set for such mundane things as coding in FORTRAN. The character set had 26 letters, 10 digits, space, and '@', '=', '(', ')', '*', '$', ',', '.', '/', '+', '-'. Sometimes you replaced '@' with a single quote (').
If it was good enough for FORTRAN, it was good enough for me.
-
-
-
Friday 4th October 2013 19:22 GMT ThomH
Re: Down with Unicode (and UTF-8)!
Being a 1997 attempt to fix the problems stemming from a belief that "[t]he Unicode standard is a fixed-width scheme ... [that] uses 16-bit encoding", it was immediately irrelevant because UTF-8 had been presented in 1993. It's also modal, so it lacks self-synchronisation, and it complicates things by defining character sets per language. As the paper acknowledges, 'a' is present separately as an English character, a French character, a German character, etc., with the intention being that all those different 'a's are mapped back to the same thing after the fact.
-
-
Friday 4th October 2013 23:15 GMT Anonymous Coward
Looking forward to the next one ...
I'm looking forward to the next post on timezones. If you can handle strings then you are, possibly, ready to handle world time. Any computing structure as fundamental as time (in most real-world systems, anyway) which can be screwed by politicians deciding they want to do it differently this year is always good for a giggle.
-
Friday 4th October 2013 23:16 GMT Anonymous Coward
Best part of article
"code pages were something horrible and fussy that one hoped to get away with ignoring"
I work for an American software company selling software to countries that make heavy use of non-ASCII data.
It's a nightmare because, as you say, American devs just try to ignore the problem and happily map byte arrays to strings with no thought to what will happen outside the US or Western Europe.
Our software developed in Java is generally OK, but the older stuff from the nineties developed in C is nothing but a headache.
-
Saturday 5th October 2013 18:05 GMT Roland6
Re: Best part of article
Years back, a challenge we had on an international project was getting hold of a Japanese version of Windows (i.e. a version of Windows that used and supported 16-bit character sets), because MS in their wisdom didn't supply it as standard to resellers and SIs in the USA and Europe. I take it that things haven't improved significantly since then.
-
-
Saturday 5th October 2013 03:53 GMT raving angry loony
Cunning linguists
I sometimes work with linguists and their need for systems that can be used with several dozen languages from all over the world. They generally despair at the state of computing and character identification for non-English (or even non-Latin-script) languages. As far as they're concerned, every system so far seems to have been created by quasi-illiterate, monolingual English speakers or, worse, people with only a beginner's understanding of the languages they're supposed to be transcribing.
-
Saturday 5th October 2013 05:03 GMT Henry Wertz 1
UTF-8 and internationalization
Well, UTF-32 (4-byte Unicode) does accommodate all characters in a flat space (it even has a "user-defined" space, de facto split up so you can have, say, Klingon and Lord of the Rings fonts installed with their proper character codes). Yes indeed.
Here's the "meat" of UTF-8... Wikipedia has a nice table which I cannot paste here, but the short of it is that UTF-8 was originally defined as 1 to 6 bytes per character (since RFC 3629 it is capped at 4), with the longest lengths reserved for characters "at the end" of the Unicode code space; in practice most characters are 1-3 bytes. Unicode encodes some accented characters as a base character plus one or more combining-mark code points, each of which UTF-8 then encodes separately. Byte 1 is 0xxxxxxx for a 7-bit character and always starts with 11 for a multi-byte character (110xxxxx indicates a 2-byte character, through 1111110x for a 6-byte one; these lengths encode 7, 11, 16, 21, 26, and 31 bits of a 32-bit Unicode character). Extra bytes are all 10xxxxxx.
That said *shrug*, as a programmer I find Android and Linux both have plenty good internationalization APIs available, and I avail of them so I don't have to worry about the details.
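The bit patterns above can be made concrete with a small hand-rolled encoder. This is a sketch restricted to the modern four-byte limit, and it assumes the input is a valid Unicode scalar value (no surrogates, no range checks):

```java
public class Utf8Encode {
    // Encode a single code point by hand, mirroring the lead-byte patterns
    // described above: 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx, with
    // continuation bytes of the form 10xxxxxx.
    static byte[] encode(int cp) {
        if (cp < 0x80) {           // 1 byte: 7 bits of payload
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {   // 2 bytes: 11 bits
            return new byte[] { (byte) (0xC0 | (cp >> 6)),
                                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) { // 3 bytes: 16 bits
            return new byte[] { (byte) (0xE0 | (cp >> 12)),
                                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                (byte) (0x80 | (cp & 0x3F)) };
        } else {                   // 4 bytes: 21 bits
            return new byte[] { (byte) (0xF0 | (cp >> 18)),
                                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        // '€' (U+20AC) should come out as the three bytes E2 82 AC.
        for (byte b : encode(0x20AC)) System.out.printf("%02X ", b);
        System.out.println();  // E2 82 AC
    }
}
```

Comparing the output against String.getBytes(StandardCharsets.UTF_8) for the same code point is an easy sanity check.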
-
Saturday 5th October 2013 13:41 GMT Bagelsmonster
Smug mode
You're all light-weights. Some of us had to build our own hardware to get the job done.
http://en.wikipedia.org/wiki/Multilanguage_Electronic_Phototypesetting_System
MEPS is now software-only. We are currently just shy of 600 languages published in print and on jw.org. To put that in context, un.org is in 6 languages.
-
Saturday 5th October 2013 18:13 GMT Anonymous Coward
Benefit of UTF-8 is backwards compatibility
Everybody learned that characters are 8 bits and most programming languages, libraries, file formats, and other software have been designed around this assumption.
The brilliant thing about UTF-8 is that with minimal (or often no) modification to anything, everything that supported 8 bit characters also inherently, automatically "supports" multibyte characters. Programmers CAN, for the most part, ignore the fact that multibyte characters even exist.
As somebody who was forced to spend years being constantly annoyed by Microsoft's "widechar" software and APIs, UTF-8 is an awe-inspiring solution to the problem.
-
-
Sunday 6th October 2013 16:10 GMT Justigar
Re: EBCDIC
You say that jestingly, but we still have dealings with this.
We get data from a client in EBCDIC, and we have to convert it to ASCII before it gets sent out again. It arrives, once a week, every week, on media that is older than I am. People think I'm playing Space Invaders in the corner when I have to transfer the data...
-
-
Sunday 6th October 2013 16:11 GMT ponga
Bah, UTF-8 is just a continuation of the usual Anglo-Saxon cultural imperialism: A-Z as first-class citizens, with Johnny Overseas characters such as é, è, ö, ä, å, ç, æ, ø and ß treated as a regrettable necessity, if dealt with at all.
Frankly, I'm holding out for an encoding where everyone uses multibyte characters to represent all text, including classic ASCII: that's the only way American software is ever going to be fully usable outside the good ole US of A. (Hmmmm... make that a necessary but insufficient condition.)