Down with Unicode! Why 16 bits per character is a right pain in the ASCII

I recently experienced a Damascene conversion and, like many such converts, I am now set on a course of indiscriminate and aggressive proselytising. Ladies and gentlemen, place your ears in the amenable-to-bended position, and stand by to be swept along by the next great one-and-only true movement. …

COMMENTS

This topic is closed for new posts.


      1. Steve Knox

        Re: Stuck in the past

        @James 47 -- Gravis said "exploit", not "squander."

        1. Anonymous Coward
          Anonymous Coward

          Re: Stuck in the past

          My RAM does feel exploited ...

          ... oh ...

  1. Paul Crawford Silver badge

    Cardinal sin of computing

    The fact that some programmer, in an attempt to show the "benefit of Unicode", should use a 'double' variable for PI and only give 6 figures tells you they should be executed and their programs not!

    But yes, you speak the truth - UTF-8 is better for all practical reasons because it won't break old software/code and yet it allows all characters you (and your customers/users) might want. Subject to matching system fonts - a rant for another day...

  2. kurkosdr

    For the love of (deity), make your code use UTF-8. The world has decided. Text files are UTF-8, websites are (or should be) UTF-8, emails should be UTF-8 and SMSes are UTF-8. So, use UTF-8.

    1. Natalie Gritpants

      Unfortunately RDS isn't, so when playing my MP3s via a personal FM transmitter I get weird characters in the track title on the car radio.

    2. MacroRodent
      FAIL

      SMS character sets

      "SMSes are UTF-8"

      Actually, they aren't. Multiple different character sets are allowed, none of them UTF-8. See http://en.wikipedia.org/wiki/GSM_03.38 for a description of the mess. Read it and groan...

      1. Ramazan

        Re: SMS character sets

        Actually, SMSes use UCS-2BE (!), not UTF-8.

    3. JeeBee

      SMSs with a non-ASCII character in them are sent in UCS-2 (which is not exactly UTF-16), and they're a lot shorter per message part. Once you've dealt with surrogate pair issues (encoding 3/4-byte characters in UCS-2) you may never regain your sanity.
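
      The surrogate arithmetic itself is mechanical, at least. A minimal sketch in Java (Character.toChars does the same job in the standard library):

        // Encode a code point above U+FFFF as a UTF-16 surrogate pair.
        static char[] toSurrogatePair(int codePoint) {
            int v = codePoint - 0x10000;                // 20 bits left to spread over two chars
            char high = (char) (0xD800 + (v >>> 10));   // top 10 bits
            char low  = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
            return new char[] { high, low };            // e.g. U+1F600 -> D83D DE00
        }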

    4. A J Stiles

      SMSs are NOT UTF-8!

      SMS messages are usually sent in GSM-7 (aka SMSCII), a modified form of ASCII with some code points moved around and some characters represented by 2-byte sequences. It also includes some accented characters and enough of the Greek alphabet to write in capitals in Greek, making up the remainder with Latin characters that look like Greek ones. This way, 160 7-bit characters fit into 140 8-bit bytes. And you get to use the << and >> operators.

      Alternatively they can be sent in UCS-2, which is as near enough to UTF-16 as makes no difference; but then the message is limited to 70 characters.

      There is no UTF-8 mode, though .....
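
      The 7-bit packing mentioned above is just those shift operators at work. A rough sketch in Java (it assumes the septet values are already in an array; the real rules live in GSM 03.38):

        // Pack 7-bit septets into octets, LSB first: 160 septets fit into 140 bytes.
        static byte[] packSeptets(byte[] septets) {
            byte[] out = new byte[(septets.length * 7 + 7) / 8];
            int bitPos = 0;
            for (byte s : septets) {
                int index = bitPos / 8;
                int shift = bitPos % 8;
                out[index] |= (s & 0x7F) << shift;               // low bits into this octet
                if (shift > 1) {
                    out[index + 1] |= (s & 0x7F) >> (8 - shift); // high bits spill into the next one
                }
                bitPos += 7;
            }
            return out;
        }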

  3. MrMur
    Devil

    All I can say is....

    try dealing with LMBCS

  4. Pen-y-gors

    Good idea

    I use UTF-8 for everything (some Welsh characters aren't supported in the usual European sets) - but could someone please give Microsoft a good slapping - I just wasted ages trying to get data containing Welsh characters (ŵ and ŷ - see, el Reg can handle them) from an Excel spreadsheet via CSV into a MySQL DB - nightmare! Excel output to CSV can't do UTF-8. I ended up pasting into OpenOffice, then exporting.

    1. David Given

      Re: Good idea

      Back in 2009 I posted a comment to the Reg containing astral plane characters (code points with a value above 0xffff). I got back an apologetic email saying that I'd broken their database and they'd had to remove them from the comment.

      Some time later I found a bug in Thunderbird's treatment of astral plane characters. I tried to file a bug. Then I had to file a bug on Bugzilla complaining that it didn't handle astral plane characters properly... which was quite hard, as Bugzilla's bug tracker is also Bugzilla.

      (All of these stem from the same underlying problem, which is MySQL assuming 16-bit Unicode. This is why 16-bit Unicode must die. MySQL too, of course.)

      1. David Given

        Re: Good idea

        ...just tried to post something with astral plane characters and got back 'The post contains some characters we can't support'. MySQL FTW!

      2. JeeBee

        Re: Good idea

        MySQL's "utf8" type doesn't support 4-byte utf8, you need to use 'utf8mb4' for that!

  5. rcorrect

    It would be really interesting if computers had been invented in China.

    1. Mage Silver badge

      Computers Invented in China?

      I thought indeed they were.

      Mechanical with Rods & Beads. c.f. adding MDCXIVL and XLVI

  6. Mage Silver badge

    Wonderful.

    Why did no-one point out to me earlier that Unicode isn't 16 bits any more?

    I had noticed however that if you wanted Web Sites, Browsers & SQL all to talk nice that UTF-8 was best.

  7. Anonymous Coward
    Anonymous Coward

    Java

    The Java devs could have saved themselves a lot of bother by just calling "char" an unsigned short and being done with it. Now they just look silly, calling a character a "code point" and a UTF-16 code unit a "char".

    Mind, their punishment was to be swallowed whole by Oracle. Rough justice.
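
    For anyone who hasn't been bitten yet, a small sketch of the difference (the musical clef is just a convenient astral-plane character):

      public class CharVsCodePoint {
          public static void main(String[] args) {
              String s = "\uD834\uDD1E";                                 // U+1D11E MUSICAL SYMBOL G CLEF
              System.out.println(s.length());                            // 2 -- UTF-16 code units ("char"s)
              System.out.println(s.codePointCount(0, s.length()));       // 1 -- actual characters
              System.out.println(Integer.toHexString(s.codePointAt(0))); // 1d11e
          }
      }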

    1. Richard 12 Silver badge

      Re: Java

      I hate the datatype "char" and refuse point blank to use it.

      I use quint8/qint8 or uint8/int8 for an 8-bit unsigned/signed value (depending on whether I'm Cute at the time).

      "char" should be banned. It's confused.

  8. Steve Knox
    Happy

    "All those extra holes made it easier to air cool in-memory databases."

    Like Lego Technics, the holes just make them cooler!

    1. Anonymous Coward
      Anonymous Coward

      Re: "All those extra holes made it easier to air cool in-memory databases."

      If "Rate This Article" still existed, that line alone would have got an 11 for this article.

      Can't see how to sneak it in to the office conversation yet, but I'll give it serious thought.

      [Yes there are computer people that don't read El Reg. Unbelievable but true.]

  9. Joe Harrison

    Got to keep it

    If every character needs 16 bits rather than 8 then the NSA will only be able to store half as much of our stuff

  10. miket82

    Machine code £

    The comment about printing the £ (I hated the # sign) reminded me of my DOS days. I solved it by writing a small 90-byte machine code routine (most bytes were my credit line), loaded through config.sys, which redirected the print code to the £ rather than the hash. Staff often asked me what the line

    "Money added to system"

    meant when they switched the machine on but then I always did have a weird sense of humor.

    1. Neil Barnes Silver badge

      Re: Machine code £

      When I was but a lad, the BBC used internally a variant of the CEEFAX system to carry presentation messages (next item is, coming out three seconds early, etc.) around the country on a video display line that was stripped out before the signal went to the transmitter.

      What the character set PROM didn't have was a £ sign.

      Instead of using a separate PROM or even $deity$ help us an EPROM, the BBC designs department in its infinite wisdom built a whole chunk of logic that recognised the £ code and told the character generator to use the top half of a C and the bottom half of an E...

      I don't recall ever seeing a message that used the £ sign...

  11. John Savard

    Fixed Length

    As UTF-8 (as originally defined) can represent code points of up to 31 bits, the alternative of every character taking a fixed 32 bits still remains another valid, if wasteful, option.

    UTF-8 is somewhat wasteful as well, often requiring three bytes instead of two, or two bytes instead of one; stateful encodings can do much better.
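
    For a feel of the trade-off, a quick sketch comparing encoded sizes (UTF-16BE stands in for the 2-byte case; the fixed 32-bit figure is just code points times four):

      import java.nio.charset.StandardCharsets;

      public class EncodingSizes {
          public static void main(String[] args) {
              String[] samples = { "plain ASCII", "naïve café", "Ψάρι και πατάτες", "日本語テキスト" };
              for (String s : samples) {
                  int codePoints = s.codePointCount(0, s.length());
                  System.out.printf("%s: UTF-8 %d, UTF-16 %d, fixed 32-bit %d bytes%n",
                      s,
                      s.getBytes(StandardCharsets.UTF_8).length,
                      s.getBytes(StandardCharsets.UTF_16BE).length, // the BE variant writes no BOM
                      codePoints * 4);
              }
          }
      }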

    1. Frederic Bloggs
      Holmes

      Re: Fixed Length

      One has a choice: either stateful, shorter, but potentially fragile and non-self synchronising or UTF-8. Me? I choose UTF-8. But then I spend a large part of my programming life dealing with radio based comms protocols which means - by definition - I am rather strange.

      Oh, and it doesn't help that I spent a lot of time in my formative years having to deal with 5 channel paper tape...

  12. artbristol

    Should be titled "Down with UTF-16"

    Unicode is a good standard and it was written by clever guys. There's nothing wrong with Unicode's approach of mapping each character to a code point, with the encoding into bytes as a separate, intermediate step. Far better than the ugly mess of code pages that preceded Unicode.

    UTF-8 is part of Unicode and it's a damn good encoding.

    1. Christian Berger

      Re: Should be titled "Down with UTF-16"

      Well, that's a common problem with El Reg: there are many authors who have never seen anything other than the little area they work in, and believe that the whole world is like this.

      They believe that e-mail is as complex as Exchange, they believe that somehow IPv6 is amazingly difficult, and they believe that the world is still using UTF-16.

      It's a bit like the people from Krikkit who, due to their dark night skies, have never seen even a glimpse of the worlds out there.

      1. Roland6 Silver badge

        Re: Should be titled "Down with UTF-16"

        MS Exchange? I didn't realise it was complex - obviously I've spent too much time working on enterprise systems.

    2. Destroy All Monsters Silver badge

      Re: Should be titled "Down with UTF-16"

      > There's nothing wrong with Unicode's approach of mapping each character to a code point

      Actually there is plenty wrong with that, because then you suddenly need the whole Cartesian product of diacritics and base characters.

      The only one who I would trust to come up with a "good Unicode" would be Knuth.

  13. Tromos

    Control Data had it right

    The old CDC mainframes used a 6-bit character set - and no multibyte codes (until some idiot went and wanted lower case put in too).

    1. Lars Silver badge
      Pint

      Re: Control Data had it right

      In those days memory was expensive, so some machines had 4 bits for numbers and 6 or 8 for characters.

      1. BristolBachelor Gold badge

        Re: Control Data had it right

        I remember some very old ICL and Digital machines with 6-bit bytes (being a pedant I am using Byte as the number of bits to represent a character). One guy here still cannot type in lower-case, and I'm pretty sure he'd have a stroke if you sent him a document without a single upper-case letter in it.

        But I also remember that at least one of those DEC machines had a machine-code square-root instruction (although I seem to remember it being split into two to allow time slicing).

        It now makes me smile a bit that we have huge monster machines running bare-metal hypervisors, with each user having a virtual machine running its own virtual copy of Windows, loading its own virtual copy of Excel. In the past, a single machine loaded one copy of 2020, and all the users shared it. No need to load 150 copies of the same thing, all repeating the same housekeeping tasks.

        1. Anonymous Coward
          Anonymous Coward

          Re: Control Data had it right

          "a machine-code square-root instruction (although seem to remember it being split into 2 to allow time slicing)."

          VAX, perhaps. VAXes (and lots of others) can take "page faults" (or other exceptions) part way through the processing of an instruction. If it's a page fault, the relevant data is loaded into memory by the OS, and the faulting instruction is resumed. If it was a potentially long-running CISC instruction (such as POLY), it may or may not need to be restarted from the beginning: if the "first part done" bit is set, execution resumes where it left off; if it isn't, the instruction restarts from the beginning.

          And why am I telling you this?

          Because you need to know.

  14. clean_state

    Bravo!

    I spent a lot of time fighting with text encodings when designing the .mobi file format for Mobipocket (and later Kindle). The conclusion was also that UTF-8 wins everywhere. The self-sync feature is superb. As for the "hassle" of handling a variable-length character encoding, you soon realize that:

    - in most cases, you need the length of your string in bytes (for memory allocation, string copying, ...)

    - cases where you need to decode UTF-8 into code points are rare - mostly when you display the text, and then you usually walk the whole string from first byte to last anyway, so decoding the code points in sequence is not wasteful.

    - the typical case where you do need to know the characters is parsing, BUT, ALL keywords and ALL control characters in ALL computer languages sit below code point 128, so you can actually parse UTF-8 as if it were ASCII and never care about the multi-byte encoding outside of string literals (there's a sketch of this below).

    So yes, UTF-8 everywhere!
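
    A sketch of that parsing point - finding the end of a string literal without ever decoding a multi-byte sequence (the method name and sample text are made up):

      import java.nio.charset.StandardCharsets;

      public class AsciiDrivenScan {
          // Returns the index just past the closing quote, or -1 if unterminated.
          // 'start' must point at the opening '"' in the UTF-8 byte array.
          static int skipStringLiteral(byte[] src, int start) {
              for (int i = start + 1; i < src.length; i++) {
                  int b = src[i] & 0xFF;
                  if (b == '\\') { i++; continue; }   // skip the escaped byte
                  if (b == '"')  { return i + 1; }
                  // Bytes >= 0x80 are UTF-8 lead/continuation bytes; no ASCII byte can
                  // occur inside a multi-byte sequence, so they need no special handling.
              }
              return -1;
          }

          public static void main(String[] args) {
              byte[] code = "x = \"πß£\"; y = 2".getBytes(StandardCharsets.UTF_8);
              System.out.println(skipStringLiteral(code, 4)); // prints 12 (just past the closing quote)
          }
      }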

    1. rleigh

      Re: Bravo!

      I hate to be a pedant (actually, that's a lie), but it's not strictly true that all control characters are below codepoint 128. There is the ECMA-35/ISO-2022 C1 control set designated at 128-159, mirroring the C0 control set with the high bit set. This is obviously incompatible with UTF-8 though, and so not available when you have UTF-8 designated with "ESC % G".

    2. Anonymous Coward
      Anonymous Coward

      Re: Bravo!

      I'm not sure about your last point there. Lots of programming languages allow non-ASCII characters in identifiers (for example: http://golang.org/ref/spec#Identifiers), so, assuming you are not going to allow all non-ASCII characters in identifiers (Go allows 'ĉ' but not '£'), your lexer does need to identify characters. Also, you might want character constants to work beyond ASCII.

      However, you typically don't need to decode UTF-8 in order to identify the end of a string constant or comment.
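
      A rough approximation of the Go-style identifier rule, to show where the lexer really does need code points rather than bytes (Character.isLetter only approximates Go's actual letter categories):

        public class IdentChars {
            static boolean isIdentifierStart(int codePoint) {
                // Unicode letters and '_' may start an identifier; symbols may not.
                return codePoint == '_' || Character.isLetter(codePoint);
            }

            public static void main(String[] args) {
                System.out.println(isIdentifierStart("ĉ".codePointAt(0))); // true  -- a letter
                System.out.println(isIdentifierStart("£".codePointAt(0))); // false -- a currency symbol
            }
        }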

  15. Admiral Grace Hopper

    The Youth Of Today

    Some of us still dream in EBCDIC.

  16. Anonymous Coward
    Anonymous Coward

    "public static final double π = 3.14159;"

    No it's not.

    1. Uncle Slacky Silver badge
      Joke

      Re: "public static final double π = 3.14159;"

      It is...for suitably small values of π...

    2. Suricou Raven

      Re: "public static final double π = 3.14159;"

      public static final double π = 3.14159;//ish

    3. Annihilator
      Boffin

      Re: "public static final double π = 3.14159;"

      "Scientists, scientists, please. Looking for some order. Some order, please, with the eyes forward and the hands neatly folded and the paying attention ... PI IS EXACTLY THREE!!"

      1. Adam 1

        Re: "public static final double π = 3.14159;"

        I would like three pies

      2. Primus Secundus Tertius

        Re: "public static final double π = 3.14159;"

        public static final double π = 355/113;

  17. Anonymous Coward
    Anonymous Coward

    Joel Spolsky

    "he royally patronises programmers"

    He sure does. All the time. And not just about Unicode.

  18. disgruntled yank

    Mr. U will not be missed

    (see ee cummings).

    I must say that Perl makes it not too painful to deal with Unicode.

    I am slightly disappointed with Ms. Stob, though, for not riffing on The U and the Non-U...

  19. Anonymous Coward
    Anonymous Coward

    Global posts

    Discovered last week that Windows Notepad won't display all pasted UTF-8 characters - but it does preserve the binary values. So saving from Word in "TXT" UTF-8 format with an HTM suffix does appear correctly on a browser page.

    Very useful for a hobby task that indexes public Facebook and YouTube postings, which can be written in just about any language. A quick screen scrape of Google Translate then combines a translation with the original.

  20. joeldillon

    Err....Qt has always tended to use UTF-8, not plain old UCS-2, for file i/o...

    (Also, "one 16-bit value equals one character" wasn't true even from the start, even without taking Chinese into account; consider polytonic Greek, for example)

  21. Anonymous Coward
    Anonymous Coward

    Go Forth

    I've been using UTF-8 in my Forth code for years; it's nice to be able to use maths and logical symbols in function names (all right, "words", as we right-minded Forth programmers call them).

  22. An0n C0w4rd

    Unicode needs to be taken out back and shot

    Not just shot once, but repeatedly.

    One of the principles of Unicode is to separate the character from the representation of the character. In other words, ASCII 65 (decimal) is "A". How your system chooses to display "A" is up to the system. The character is transmitted as decimal 65 no matter what the display representation is.

    Unicode promptly goes on to rubbish this ideal.

    Pre-Unicode Asian fonts had "full-width" representations of the ASCII characters, so displays that mixed ASCII and Japanese kept their formatting: the full-width forms had the same width as the Japanese characters, while the usual ASCII characters were narrower and hence broke the layout.

    Unfortunately this lives on in Unicode, shattering the idea that the display of the character is independent of the code point of the character because there are now two different Unicode code points that both print out a Latin-1 "A" (and also the rest of the alphabet and numbers and punctuation). In reality, the full width "A" should not be U+FF21, it should be decimal 65 with the renderer deciding if it should be full width or not.

    This has caused me more than one problem in the past with things that sometimes correctly handle the full-width and ASCII mix and sometimes don't.
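
    For what it's worth, Unicode's partial answer is the compatibility decomposition: NFKC folds the full-width forms back onto the ordinary code points. A small sketch:

      import java.text.Normalizer;

      public class FullWidthFold {
          public static void main(String[] args) {
              String fullWidth = "\uFF21\uFF22\uFF23";   // ＡＢＣ, the full-width Latin letters
              String folded = Normalizer.normalize(fullWidth, Normalizer.Form.NFKC);
              System.out.println(folded);                  // ABC -- back to U+0041..U+0043
              System.out.println(fullWidth.equals("ABC")); // false -- distinct code points
              System.out.println(folded.equals("ABC"));    // true
          }
      }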
