Down with Unicode! Why 16 bits per character is a right pain in the ASCII

I recently experienced a Damascene conversion and, like many such converts, I am now set on a course of indiscriminate and aggressive proselytising. Ladies and gentlemen, place your ears in the amenable-to-bended position, and stand by to be swept along by the next great one-and-only true movement. The beginning In the …

COMMENTS

This topic is closed for new posts.
  1. J. R. Hartley

    Hmmm...

    1. Homer 1
      Alien

      Re: Hmmm...

      Yes, that was my reaction too, but then I admit near-total ignorance on the subject, beyond what I've just read.

      Simplistically it seems the best solution is to implement a sufficiently large encoding length to accommodate all possible characters, which was supposedly the goal of the original 16-bit Unicode, except Becker naively assumed that "16 bits ought to be enough for anyone" (to paraphrase a well-known fallacy).

      Again, simplistically, the answer to these "enough for anyone" fallacies would seem to be dynamic allocation, as in dynamic arrays or linked lists, which is in fact what UTF-8 does, although in its case the dynamic allocation pertains to the encoding length of each member rather than the overall length of the array, if I'm reading the descriptions correctly.
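
      A quick sketch of that per-character "dynamic allocation" (illustrated in Python; the characters chosen are arbitrary): each code point costs one to four bytes depending on its value.

```python
# UTF-8 spends 1-4 bytes per code point, depending on its value:
# the "dynamic allocation" applies to each character, not the string.
for ch in ["A", "\u00e9", "\u20ac", "\U0001f600"]:  # A, é, €, 😀
    print("U+%04X -> %d byte(s)" % (ord(ch), len(ch.encode("utf-8"))))
```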

      UTF-16 does that too, apparently, thus defeating its original objective, but suffers from ASCII and endian compatibility issues and, probably more than anything else, Microsoft's typically retarded implementation.

      So UTF-8 it is, then.

      1. ratfox
        Go

        Re: Hmmm...

        My reaction was: AMEN, sister!

  2. Aaron Miller

    "...my fellow Delphi users should notice that Embarcadero has dropped support for the UTF8String type..."

    What else to expect from something as backwards as Delphi? Pining for the days of Turbo Pascal is like pining for the days of Lisp Machines, only without sense and good taste.

    1. Marco van de Voort

      Embarcadero in the past usually closely followed Windows policy in these kinds of issues. Recently they seem to orient themselves more toward Objective-C and the Mac (because of their iOS offerings; only their mobile offerings are LLVM), but that is also UTF-16.

      My guess is Embarcadero will adapt if their core targets adapt. Ranting against them (with baseless sentiment) is therefore useless.

      That is also my problem with this whole rant. The main problem is not UTF-16, but there being two standards, with two opposing camps. Even if you think UTF-8 is superior, if your core platforms are UTF-16 oriented, you will spend your days fighting windmills.

      1. This post has been deleted by its author

    2. Just_this_guy
      Happy

      Ah, Turbo Pascal...

      1. DropBear
        Childcatcher

        Bah, humbug!

        Turbo What now...? BASIC is the bee's knees...!

    3. Pirate Dave Silver badge

      @Aaron Miller

      If El Reg would allow it, I'd give my whole day's quota of downvotes to your comment. Blaspheme not against the mighty Turbo Pascal, for it was Holy and did give many of us reason to stay in CompSci instead of switching to History or Psych.

      Cretin.

    4. Anonymous Coward
      Anonymous Coward

      > Pining for the days of Turbo Pascal is like pining for the days of Lisp Machines

      You better leave LISP out of this, understood?

    5. AOD

      What else to expect from something as backwards as Delphi? Pining for the days of Turbo Pascal is like pining for the days of Lisp Machines, only without sense and good taste.

      If you're going to have a little rant, please get your facts straight. Delphi evolved from Turbo Pascal but it is a distinct product, and a very sophisticated one at that. Please enlighten us as to why you regard Delphi as backward? Do you have direct development experience with it that you can share, or is it just that it's non-MS and therefore can't be any good?

      From experience I can tell you that when it was introduced, it brought features that gave the competition a swift kick to the happy sack, including but not limited to:

      A WYSIWYG menu editor for designing your forms. The sad equivalent in VB3 was truly pitiful.

      Decent object-oriented support in a strongly typed language (Object Pascal).

      Support for building applications as a single EXE. No more DLLs to fling around the place if you preferred not to.

    6. Philip Santilhano

      Delphi backwards? I have a few choice Unicode characters for you!

      Calling Delphi backwards ("backward" it should probably be!) shows a true ignorance of the language.

      ASCII, 16-bit Unicode, UTF-8... I just wish there were a standard that was universally accepted, and not rooted in the days of 8-bit machines.

  3. MartinSullivan

    There's UTF-8 and utf8 in Perl

    The sainted Larry claims he can keep them separate in his head, but it baffles many a poor soul like me. And it has caused me to produce the odd bit of wombat-do-do in my time.

    http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8-vs.-UTF8

    I'm a former T.61 expert. Let's not go there.

    1. Dan 55 Silver badge
      Trollface

      Re: There's UTF-8 and utf8 in Perl

      Of course Larry can keep "UTF-8" and "utf8" apart in his head. For those who don't know, Perl is the language where if your cat walks across the keyboard then the text generated can be successfully run as a Turing-complete Perl program.

      1. Allan George Dyer
        Trollface

        Re: There's UTF-8 and utf8 in Perl

        @Dan 55 - and it will do something useful in the real world. That is how brilliant Perl is!!

        I take your Troll, and raise it.

    2. silent_count
      Pint

      Re: There's UTF-8 and utf8 in Perl

      And there it is. My fledgling interest is learning perl. stone. cold. dead. Life is just too short to deal with so much silly.

      A beer for you, Mr Sullivan. You've saved me more pain than I knew was lurking in my future.

      1. Frumious Bandersnatch

        Re: There's UTF-8 and utf8 in Perl

        And there it is. My fledgling interest is learning perl. stone. cold. dead. Life is just too short to deal with so much silly.

        Don't let it put you off. Unicode in Perl more or less "just works". The only times I've had problems with it have been in trying to correctly convert stuff from other code pages and broken MS document formats. That, and sometimes forgetting to tell my database that the incoming data is UTF-8 rather than ASCII (though sometimes Perl needs a hint, too, to tell it not to do a spurious conversion).

        Speaking of MS documents, I find it really incredible to come across HTML on the web that obviously came from MS Word initially and that has completely messed up rendering of some trivial glyphs (like em dash and currency symbols). I find it hard to believe that in this day and age, Word can't even convert to HTML properly. OK, so maybe the problem isn't with Word, but with the options the user selected for the conversion, but still...

        1. Gordon 11

          Re: There's UTF-8 and utf8 in Perl

          Don't let it put you off. Unicode in Perl more or less "just works".

          Agreed. I wrote a script recently then, at the end, remembered that some bits of the data would be coming in with things like (un)"intelligent quotes". I set about looking at what I'd need to do, only to discover that it was all being handled correctly without me having to do anything special at all.

          There is a great number of modules in Perl which do "what you need".

          1. Anonymous Coward
            Anonymous Coward

            Re: There's UTF-8 and utf8 in Perl

            "There is a great number of modules in Perl..."

            Yeh... ... yeh there is.

        2. Allan George Dyer

          Re: There's UTF-8 and utf8 in Perl

          @Frumious Bandersnatch, "completely messed up rendering of some trivial glyphs (like em dash and currency symbols)" - my guess would be the HTML was saved in cp1252, and the browser guessed (or was told) it was iso8859-1. They are almost the same, apart from those glyphs.

          I was going to add a rant, but I couldn't decide whether it was against Microsoft's "embrace, extend, extinguish", incorrectly configured web servers or browsers silently "being helpful" and changing the encoding they're applying so that you can never figure out whether you've configured your web server correctly. Basically, Verity's right.

        3. John Smith 19 Gold badge
          Unhappy

          " Frumious Bandersnatch Ignore"

          "Speaking of MS documents, I find it really incredible to come across HTML on the web that obviously came from MS Word initially and that has completely messed up rendering of some trivial glyphs (like em dash and currency symbols)."

          So that's the source of that annoying little f**k up.

          Word --> HTML.

          Thanks for that. I've always wondered. I thought it was something to do with IE not liking any web server but IIS.

          Still f**king annoying.

      2. Anonymous Coward
        Anonymous Coward

        Re: There's UTF-8 and utf8 in Perl

        And there it is. My fledgling interest is learning perl. stone. cold. dead. Life is just too short to deal with so much silly.

        Nah, Perl 5 isn't so bad. I still find it easier to bash out a quick hack in Perl than, say Python, or some other slightly less baroque language like Ruby... CPAN is something of a killer app.

        Now Perl 6 on the other hand... I don't know if the designers coined the word "Twigil" and the concept it describes, but who ever did surely deserves a terrible, lasting punishment.

    3. Michael Wojcik Silver badge

      Re: There's UTF-8 and utf8 in Perl

      I'm no fan of Perl, but the utf8 / UTF-8 distinction is probably the best solution to a real problem.

      Perl's original UTF-8 implementation ("utf8") was created before the format was standardized. Broadly speaking, it follows Postel's Interoperability Principle, and allows many sequences that were forbidden by the standard when it was finalized. That made it easier for people to start using UTF-8 with Perl.

      Those sequences - such as non-minimal encodings - have bad security implications. They make it too easy to slip malicious data past poorly-designed filters (i.e., most filters), for example.

      The later UTF-8 implementation follows the spec. It's good to have an implementation that follows the spec, and it's especially good when that implementation is a lot safer than the overly-permissive one it supersedes. But if Perl had simply dropped "utf8", it would have broken at least some old programs; and if it had made "utf8" a synonym for "UTF-8", some old data would have been rejected.
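
      For the curious, the non-minimal-encoding problem in miniature (shown in Python rather than Perl, purely for illustration): a spec-following decoder refuses the classic overlong two-byte encoding of '/', which lax decoders accept and badly-written filters miss.

```python
# 0xC0 0xAF is an "overlong" (non-minimal) two-byte encoding of '/'
# (U+002F). A lax decoder yields '/', letting "../" slip past path
# filters; a spec-conformant decoder rejects the sequence as malformed.
overlong_slash = b"\xc0\xaf"
try:
    overlong_slash.decode("utf-8")
    print("accepted (lax decoder)")
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```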

  4. cyborg
    Trollface

    UTF? WTF!

  5. Neil Barnes Silver badge

    I don't have a gripe against utf-8

    In fact it strikes me as a pretty good idea.

    It's just that standard C doesn't have the tools to talk to it!

    1. John Hughes

      Re: I don't have a gripe against utf-8

      "It's just that standard C doesn't have to tools to talk to it!"

      The whole point of utf-8 is that you don't need special tools to talk to it.

      1. Kristian Walsh Silver badge

        Re: I don't have a gripe against utf-8

        Indeed. UTF-8 strings will sort in codepoint order if you give them to strcmp(), which is as good and bad as its behaviour in ASCII. However, anyone who thinks that strcmp() sorts strings in "alphabetical" order is at best living in a dream-world, or at worst, a hopeless xenophobe.

        As for the other "problem", that of length: strlen() returns the length of a UTF-8 string in bytes, and outside of font rendering engines, that is all you ever need to know to write proper text-processing code. Any argument to the contrary is based on a misplaced notion that somehow a byte is a character (I blame C).

        "Character" is a very slippery defitinion, is language sensitive ("rijstafel" is 9 letters long if you're English; only eight if you're Dutch), and doesn't always correspond to the number of symbols the user sees anyway: when the five codes 's','o','e','u','r' arerendered as the four glyphs "sœur", how many characters really are in the string? (both answers are equally right and wrong, btw)

        1. Michael Wojcik Silver badge

          Re: I don't have a gripe against utf-8

          Any argument to the contrary is based on a misplaced notion that somehow a byte is a character (I blame C).

          It's not C's fault. In C, a byte is a character (ISO/IEC 9899:1999, 3.7.1). It's the fault of programmers who don't understand that a "character" in C is not the same as a "character" in some arbitrary natural-language writing system. (Generally, these are the same people who don't understand that a "byte" in C is not an octet.)

  6. deive
    Coat

    utf-8 ftw

    that is all.

  7. monkeyfish

    Linux users ... who regarded GUIs in general as a barely satisfactory system for marshalling their half dozen terminal sessions.

    Classic.

    1. This post has been deleted by its author

    2. bob, mon!
      Linux

      ... their half dozen terminal sessions.

      By sheerest coincidence, I have six tabs going in my Konsole shell window at the moment. That's on this machine.

      And I code in C and Python, for stdin, stdout, and stderr.

      1. Michael H.F. Wilkinson Silver badge
        Joke

        Re: ... their half dozen terminal sessions.

        HALF dozen? HALF dozen??

        I trust you mean half dozen on each desktop!!!

        1. Anonymous Coward
          Anonymous Coward

          Re: ... their half dozen terminal sessions.

          > I trust you mean half dozen on each desktop!!!

          Come on, Wilkinson. This is not the 80's any more. These are the days of retina displays.

          That's at least a dozen terminals per desktop. Not counting tabbed sessions.

          You've got to embrace progress.

          1. Anonymous Coward
            Anonymous Coward

            Re: ... their half dozen terminal sessions.

            screen ftw. Some of my sessions are now firm family friends.

      2. John Gamble

        Re: ... their half dozen terminal sessions.

        I ... rarely had more than three sessions going at a time. I feel so inadequate now.

      3. HippyFreetard

        Re: ... their half dozen terminal sessions.

        I only have three open, but they're covered in dvtm :)

    3. flambard

      Hey hold it

      I use Gimp now too!!

  8. Mike Bell

    Fair enough

    You'll notice that the 4th line of HTML defining this page is

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

    like most websites these days

  9. frank ly
    Happy

    @Verity

    Thank you for using 'one' as a pronoun. One doesn't get to see it often nowadays, if at all.

    1. Philip Lewis
      Headmaster

      Re: @Verity

      Indeed, one doesn't, and one is tempted to bemoan this lamentable situation. There was a time past when one could gleefully indulge in the subjective (and indeed the reflexive) willy nilly as it were, to one's own inner satisfaction and general merriment of all and sundry. One likes to avail oneself of these forms, if only for the inner satisfaction of reaffirming their very existence. Alas, such linguistic beauties have fallen away in our language, shunned by the masses for whom grammar is a small town in Eastern Prussia.

      1. SpamBot

        Re: @Verity

        Surely any small town in Eastern Prussia is now in Western Poland and spelled entirely differently?

        1. CN Hill

          Re: @Verity

          True, but you can't spell the new name in UTF-8.

        2. This post has been deleted by its author

        3. Radbruch1929
          Headmaster

          Re: @Verity

          I believe they are now in north-eastern Poland and north western Russia (Kaliningrad oblast).

          1. Destroy All Monsters Silver badge

            Re: @Verity

            I thought all of those had been slash-and-burned by the advance of Stalin's army (and possibly the retreat of Hitler's, too) so why name them at all?

        4. Anonymous Coward
          Anonymous Coward

          Re: @Verity

          Former Eastern Prussia is actually North Eastern Poland. Except for the part which is now the Russian Kaliningrad enclave.

      2. Someone Else Silver badge
        Coat

        @ Phillip Lewis Re: @Verity

        Alas, such linguistic beauties have fallen away in our language, shunned by the masses for whom grammar is a small town in Eastern Prussia.

        Nah, grammar is the wife of grampar...

      3. Philip Lewis

        Re: @Verity

        Sadly one cannot edit one's own posts here at The Reg. "subjective" should of course have been spelled "subjunctive". One apologises, humbly.

        Which reminds me, where is my silver badge Mr. Moderator?

  10. Dan 55 Silver badge

    A good article, but I'm rather disappointed that it passed up the chance to mention how endianness can feck things up. Little endian (x86/Windows) being COMPLETELY WRONG of course.

    1. Anonymous Blowhard

      Would it help if systems switched from big-endian to little-endian, and vice versa, on each reboot? Many problems could then be fixed by turning it off and on again.

      1. phuzz Silver badge

        At least it has a bloody end, don't get me started on bloody middle-endian US date formats :(

    2. Andrew Yeomans
      Headmaster

      The historical accident of little-endian

      On a purely technical basis, little endian representations of numbers are much easier to parse and handle. I'm meaning proper numbers, not the arbitrary computer representations. Take the number 12345675679274658. Quick now, is that one quadrillion, twelve quadrillion, 123 trillion, or what? You are going to have to do a right-to-left scan of the number to find out.

      The Arabs had it all sorted out, with little-ended numbers (written right-to-left of course). But when the West appropriated the idea a few centuries ago, they omitted to reverse the digits when converting between the Arabic right-to-left and Western left-to-right writing directions. So we've ended up with the current confusion.

      Oh well, it could have been worse. We might have been using Roman numerals still, with no zero, if it hadn't been for the Arabs.

      1. Frumious Bandersnatch

        Re: The historical accident of little-endian

        On a purely technical basis, little endian representations of numbers are much easier to parse and handle. I'm meaning proper numbers, not the arbitrary computer representations. Take the number 12345675679274658. Quick now, is that one quadrillion, twelve quadrillion, 123 trillion, or what? You are going to have to do a right-to-left scan of the number to find out.

        Huh? That makes no sense:

        * easier to parse? in all the (human, natural) languages that I know of, we start with the biggest quantity and work down (even in expressions like "four score and 7", "vingt et un" and "eleventy one")

        * is that quadrillion, ... : you don't have to scan right to left---you just count how many digits there are (and last I checked, counting left to right gives the same answer as counting the other way)

        You should have icon privileges revoked for such a silly post.

        1. John H Woods Silver badge
          Joke

          Re: The historical accident of little-endian

          "In all the (human, natural) languages that I know of, we start with the biggest quantity and work down" said Frumious Bandersnatch on the 4th of October, 2013.

          1. Frumious Bandersnatch

            Re: The historical accident of little-endian

            said Frumious Bandersnatch on the 4th of October, 2013

            Ah... touché! I had to read it several times to figure out what the problem was. I do prefer the Japanese date system, but the point is well taken.

            1. ratfox

              4th of October, 2013

              To be honest, all proper coders know that log files should be formatted as in log_2013_10_04_23_59_59.txt

              And every reader of XKCD knows this too.
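
              The reason that format wins, in one line of Python (sample filenames invented for illustration): year-first timestamps sort lexicographically in chronological order.

```python
# Big-endian (year-first) timestamps mean plain string sorting is
# already chronological sorting -- no date parsing required.
names = ["log_2013_10_04_23_59_59.txt",
         "log_2012_12_31_00_00_00.txt",
         "log_2013_01_15_08_30_00.txt"]
print(sorted(names))  # oldest first
```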

        2. Daniel B.

          Re: The historical accident of little-endian

          "In all the (human, natural) languages that I know of, we start with the biggest quantity and work down (even in expressions like "four score and 7", "vingt et un" and "eleventy one")"

          German uses little-endian for numbers < 100, though. "Zwei und vierzig". Quick, what number is that?

          1. Someone Else Silver badge
            Devil

            @ Daniel B. Re: The historical accident of little-endian

            "Zwei und vierzig". Quick, what number is that?

            Trick question; it's not a number, it's the answer to the Universe...

          2. Allan George Dyer

            Re: The historical accident of little-endian

            Also used for time in English as recently as the middle-20th century, such as, "five and twenty to four", as my Aunt used to say. How quaint.

        3. frobnicate

          Re: The historical accident of little-endian

          Historically, numerals in almost all languages are little-endian, from "thirteen" (3+10) to "five and twenty". Operations, like addition, are performed from least to most significant digits and the new digits are added at the most-significant side. It is unnatural to do this right-to-left in the otherwise left-to-right oriented writing system. Because of this, *one* often finds oneself in the pain of printing a column of numbers right-adjusted (the only reasonable way to do this, so that scale is immediately visible).

          Compare this with another ridiculous right-to-left vestige: the mathematical notation for function composition, f(g(x)), so cumbersome that mathematicians who compose functions a lot (e.g., in category theory) adopt notation from programmers and write "g;f". But that at least we can blame on the bad vodka Euler had. Fibonacci and his ilk who gave us big-endian numerals have no excuse.

          PS: the argument about "starting with the biggest quantity" makes no sense, because Arabs, who invented the thing, read from right to left and hence start with the least significant digit. Which put no hindrance on Arabian mathematics.

          1. Anonymous Coward
            Anonymous Coward

            Re: The historical accident of little-endian

            > Compare this with another ridiculous right-to-left vestige: the mathematical notation for function composition: f(g(x)), so cumbersome that mathematicians composing functions a lot (e.g., in category theory) adopt notation from programmers

            Or RPN.

            1. Destroy All Monsters Silver badge
              Facepalm

              Re: The historical accident of little-endian

              Compare this with another ridiculous right-to-left vestige: the mathematical notation for function composition: f(g(x)), so cumbersome that mathematicians composing functions a lot (e.g., in category theory) adopt notation from programmers

              I think you cannot into math.

              It's written f∘g (x), with the ∘ generally being a bog-standard multiplication sign.

              "Notation from programmers", indeed. Pchao.

        4. Anonymous Coward
          Anonymous Coward

          Re: The historical accident of little-endian

          > in all the (human, natural) languages that I know of, we start with the biggest quantity and work down

          I assume you do not speak German, Czech, Arabic, or (stuffy) Hebrew then?

        5. Anonymous Coward
          Anonymous Coward

          Re: The historical accident of little-endian

          in all the (human, natural) languages that I know of, we start with the biggest quantity and work down (even in expressions like "four score and 7", "vingt et un" and "eleventy one")

          The time here is currently twenty to nine, as it happens.

        6. cordwainer 1

          Re: The historical accident of little-endian

          Even as a non-programmer gasping for air attempting to follow these comments, I at least understand the difference between scanning that number right-to-left as opposed to left-to-right.

          If one were going to approach it as a mathematical amateur - i.e., insert the commas that mark off the 1000s - one cannot count off from the left three numbers at a time. One must start at the right and insert the commas every three numbers.

          Yes, one can count them all, but why would one? Doing it right-to-left is how the average person would divide up the number so it made sense.

          For example, it's how most non-programmers and non-mathematicians approach a number such as 10000000. Quick, is it 1 million or 10 million? Go right-to-left 3 digits at a time, and you'll know a lot faster than if you try to approach it left-to-right.

          Anyway, that's what I got from the comment, and so when you write, "...that makes no sense", I have to say, "Uh, yes, it does make sense". It may not be how YOU do it, and it may not be how an "expert" does it. But it does make sense.

          1. Michael Wojcik Silver badge

            Re: The historical accident of little-endian

            I have to say, "Uh, yes, it does make sense". It may not be how YOU do it, and it may not be how an "expert" does it. But it does make sense.

            Very well, how about "makes sense, but is wildly overstated"?

            Go right-to-left 3 digits at a time, and you'll know a lot faster than if you try to approach it left-to-right

            "a lot faster" is a ridiculous exaggeration. If I need to know the magnitude of some number written as a string of digits in Arabic notation - a task which I must admit does not come upon me all that often - and it's too long to simply apprehend the number digits at a glance, I'm perfectly happy to count left-to-right and convert it to scientific notation in my head. Problem solved. Going right-to-left with a three-digit stride is unlikely to be significantly faster.

            More importantly, how often does this come up for the vast majority of people? Who devotes a significant portion of their life to visually determining the magnitude of printed numbers?

            That's my argument with Andrew Yeomans; in his original post, he claimed "little endian representations of numbers are much easier to parse and handle". I've yet to see anyone making any sort of argument that could justify that adverb "much" - at best it's a trivial advantage - but in any event I suspect Yeomans spent as much time composing that post as he's lost in the past year, perhaps in the past decade, to inefficiencies in his number-parsing responsibilities, whatever those might be.

      2. JLV
        Joke

        Re: The historical accident of little-endian

        >Oh well, it could have been worse. We might have been using Roman numerals still, with no zero, if it hadn't been for the Arabs.

        Well then, "intro to unit testing" blogs would all have the obligatory conversion to Arab numbers ;-)

      3. 尼尔

        Re: The historical accident of little-endian

        Well, looking through an old passport with Arabic writing in it I was able to confirm that while text goes right to left, numbers go left to right.

      4. harmjschoonhoven
        Megaphone

        Re: The historical accident of little-endian

        Arabic text is written right-to-left, but numbers are written left-to-right, in the opposite direction of the script (Teach Yourself Arabic by J.R. Smart, page 33).

    3. Roland6 Silver badge

      Re: A good article, but...

      A total absence of the work done in ISO on character sets in the late 80's and early 90's, resulting in ISO 10646. A project I was involved with in the late 80's was to do with multiple character set handling on DEC VT220/240s, so I got very familiar with ISO 646, 2022, 8859 and 6429... To me both Unicode and UTF-8 left something to be desired, even though they were much simpler...

    4. Irongut

      Don't get me started on endianness. I regularly work with a file format that includes both big and little endian numbers in the same data structure! What a fecking nightmare that is. I have to drag the spec out to check my code every time; there is no way to know which number should be in which format otherwise.

    5. Frumious Bandersnatch

      Little endian (x86/Windows) being COMPLETELY WRONG of course.

      All my machines here (bar one) are little-endian. They're all running Linux, so it's not an OS-specific thing. You have to blame the CPU manufacturers.

      1. Ben 56

        Re: Little endian (x86/Windows) being COMPLETELY WRONG of course.

        Unless you happen to be using Java.

        1. Roland6 Silver badge

          Re: Little endian (x86/Windows) being COMPLETELY WRONG of course.

          >Unless you happen to be using Java.

          That depends upon whether you are using Java on Unix or Java on Windows and whether both are sharing the same backend DB...

    6. Someone Else Silver badge
      Coat

      @Dan 55

      I assume you also mean that little endian (x86/Mac) and little endian (x86/Linux) are equally completely wrong...

  11. Steve Crook

    Ahhh yes, windows and Unicode

    It brings back all sorts of memories. None of them pleasant. But then I can't say I had many (any?) pleasant memories coding Windows GUIs either, direct to API or using the horror that was MFC. Unicode was just part of the shit I had to put up with.

  12. Steve Davies 3 Silver badge
    Boffin

    Getting rid of UTF????

    Welcome to the wonderful world of gibberish

    It gets worse

    I've seen HTML with the charset declared as UTF-8 but with the body encoded as EBCDIC. Doh!

    Seriously, as someone who writes software that is used in many countries, it is SOP to use UTF-8 for everything. By insisting on that, at least we don't have to mess with the horrible Microsoft code pages for languages like Kazakh and Uzbek.

    We switch to UTF-16 for China and Japan but I will agree with you there that the -16 implementations are broken. At least with UTF-8 you didn't have to worry too much about endianness but with -16 you do and many implementations only work half the time... :)
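
    The endianness trap in a nutshell (a Python sketch; the sample text is arbitrary): the same two characters produce different bytes in the two UTF-16 flavours, which is why a BOM or an out-of-band agreement is needed at all, while UTF-8 has only one byte order.

```python
s = "A\u00e9"  # "Aé"
print(s.encode("utf-16-be").hex())  # 004100e9
print(s.encode("utf-16-le").hex())  # 4100e900 -- same text, other order
print(s.encode("utf-8").hex())      # 41c3a9   -- one order, no BOM needed
```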

    1. Michael H.F. Wilkinson Silver badge
      Holmes

      Re: Getting rid of UTF????

      Hmm, EBCDIC

      Now that takes me back

      Back to the days of our CDC computer with its 6-bit bytes organized into 60 bit words using A STUPID FORM OF ASCII MORE-OR-LESS BUT WITH ONLY CAPITALS

      Bliss? no, not at all. At least we no longer had to work with punched cards

      Icon? Closest thing to "old git in reverie mode" icon

      1. Someone Else Silver badge
        Coat

        Re: Getting rid of UTF????

        I'll see your 6-bit CDC characters and raise you DEC RADIX-50!

        And then, I'll get me coat....

        1. Anonymous Coward
          Anonymous Coward

          Re: Getting rid of UTF????

          "raise you DEC RADIX-50"

          ANY SYMBOL WHOSE NAME CANNOT BE EXPRESSED IN THE CHARACTERS A TO Z OR 0 TO 9 OR DOT SPACE AND DOLLAR IS NOT WORTH A SHEET.

          OF GREEN AND WHITE LINE PRINTER PAPER.

          OBVIOUSLY.

          NOW WHERE DID I LEAVE MY TELETYPE RIBBON. I THINK IT WAS BY THAT SHINY NEW LA36.

          http://wickensonline.co.uk/declegacy/

  13. ByeLaw101

    Agreement

    I hate having to navigate around the different encoding types, especially when the customer isn't sure how they encoded the data in the first place! Even UTF-8 has its issues... with or without BOM?

    Great article, made me laugh.

    1. MacroRodent

      Re: Agreement

      Even UTF-8 has its issues... with or without BOM?

      Isn't the BOM just a NOP in UTF-8 (http://en.wikipedia.org/wiki/Byte_order_mark)? It can be present or not, and doesn't matter either way.

      1. Frumious Bandersnatch

        Re: Agreement

        Isn't BOM just a NOP in UTF-8?

        Not if you use it for steganography...

      2. rleigh

        Re: Agreement

        While the BOM shouldn't matter, in many places it does in practice. A couple of examples:

        Shell scripts starting with #!/bin/sh (or perl, python, etc.). The presence of the BOM changes the starting bytes of the file, making the shebang non-functional. Every tool handling shebangs would need patching to cope with this variant.

        Concatenation of files containing BOMs. This leaves you with BOMs spread throughout the data stream. You then need to make sure that every tool handling the data can filter out or ignore BOMs. You can't usually do that either since you might have non-UTF8 binary data in the stream and stripping them out after the fact would mangle the data.

        For these and other reasons, the simplest and most reliable solution is to never ever put BOMs in UTF-8 data. Shame on Microsoft for saving UTF-8 text with BOMs by default...
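
        rleigh's shebang point is easy to demonstrate; a minimal Python sketch (the script contents here are purely illustrative):

        ```python
        import codecs

        # The UTF-8 "BOM" is just the three bytes EF BB BF. The kernel's shebang
        # handling only looks at the first two bytes of the file, so prefixing a
        # BOM means the file no longer starts with b'#!'.
        script = b"#!/bin/sh\necho hello\n"
        with_bom = codecs.BOM_UTF8 + script

        print(script[:2])     # b'#!' - recognised as a shebang
        print(with_bom[:3])   # b'\xef\xbb\xbf' - the shebang check now fails
        ```

        The same reasoning applies to concatenation: `cat a.txt b.txt` leaves b's BOM stranded in the middle of the stream.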

  14. albaleo

    Make 'em pay

    Isn't utf-8 just a new form of the Flanders imperialism? One-byte characters for us chaps, and some other variable number for the funny writing people. It's just a plot to rule the world by keeping their bandwidth costs high.

    1. MacroRodent

      Re: Make 'em pay

      UTF-8 actually works pretty well for languages that use some variant of the Latin alphabet: a 2-byte sequence is needed every few characters, but the text does not actually expand much. As a Finnish speaker, with my ä:s and ö:s, I can live with it. But I could imagine the Chinese rebelling again. Don't they need 3 or 4 bytes per character all the time?
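
      MacroRodent's byte counts are easy to check; a quick Python sketch:

      ```python
      # UTF-8 expansion depends on the script: ASCII letters stay at 1 byte,
      # Latin letters with diacritics (Finnish ä, ö) take 2, and most CJK
      # characters take 3.
      for ch in "a\u00e4\u00f6\u4e2d\u6587":   # a, ä, ö, 中, 文
          print(ch, len(ch.encode("utf-8")))
      ```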

      1. John 62

        Re: Make 'em pay

        Well, it's China's own fault for using a writing system with an entropy that demands 3 bytes all on its own, without help from other scripts.

      2. MondoMan
        Joke

        Re: Make the Finns pay

        I thought the Finnish language elegantly boosts its bytes-per-word encoding needs by doubling so many letters even *before* a word gets encoded :)

      3. Ken Hagan Gold badge

        Re: Make 'em pay

        "But I could imagine the Chinese rebelling again. Don't they need 3 or 4 bytes per character all the time?"

        It's more like 3 or 4 bytes per syllable, and so actually they may have lower bandwidth costs. But text in any language is a prime candidate for compression during transmission, so everyone's bandwidth costs should be fairly similar for messages with the same semantic content.
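
        Ken Hagan's compression argument can be sketched in Python; the sample strings below are illustrative, not a real corpus, so the exact ratios will vary:

        ```python
        import zlib

        # Repetitive text in any script deflates well; after compression the
        # raw 1-byte-vs-3-bytes-per-character gap largely washes out.
        english = ("the quick brown fox jumps over the lazy dog " * 50).encode("utf-8")
        chinese = ("\u654f\u6377\u7684\u72d0\u72f8\u8df3\u8fc7\u61d2\u72d7 " * 50).encode("utf-8")

        print(len(english), len(zlib.compress(english)))
        print(len(chinese), len(zlib.compress(chinese)))
        ```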

      4. James Anderson

        Re: Make 'em pay

        They should pay, as it's them who created the problem. A cold war standoff between the Plutocratic Republic of China and the rest of the Chinese-speaking world led to the whole kit and kaboodle being coded twice: once as "traditional Chinese" as written in Hong Kong, Taiwan and Singapore, and once as a "People's Script" as used in the PRC.

        Worse, having discovered they could play with the standard -- well, they continued to play. The premier with the big glasses whose name everybody forgets insisted that his family's rendition of his name got added in. In retaliation, the capitalist faction, as represented by the Hong Kong and Shanghai bank, got code points for their trademark calligraphic renderings of the characters for Shanghai and Hong Kong.

        Can you imagine the outcry if "Oor Alec" demanded a separate set of code points so an independent Scotland could use an alphabet free and independent of the English, or if a certain hamburger company asked for their rendering of the letter M to be given its very own code point?

    2. Paul Crawford Silver badge

      Re: Make 'em pay

      No, it is down to backward compatibility, which is a BIG THING given the millions of lines of code written pre-Unicode/UTF-8.

      Basically, in order to work, the single-byte options have to map to the old ASCII set (which is 7-bit due to the old parity issues from the serial comms days), and those extending to 2/3/4 bytes cover everything else (including the "extended ASCII" of the original IBM PC, with the £ symbol and similar, which you might think is 'imperial').
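
      That backward-compatibility property is mechanical, and easy to verify in Python:

      ```python
      # For pure ASCII, the UTF-8 bytes are exactly the ASCII bytes, so
      # byte-oriented pre-Unicode code keeps working unchanged.
      ascii_text = "plain old ASCII"
      print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))   # True

      # Every byte of a multi-byte UTF-8 sequence has its high bit set, so no
      # part of a £ or a CJK character can ever be mistaken for an ASCII byte.
      print(all(b >= 0x80 for b in "\u00a3\u4e2d".encode("utf-8")))     # True
      ```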

      1. Roland6 Silver badge

        Re: Make 'em pay

        >given the millions of lines of code written pre-Unicode/UTF-8.

        I wasn't aware of a compiler that actually parsed Unicode or UTF-8; they all seemed to just support ASCII, and hence required much use of backslash sequences etc. if you wanted to use 'fancy' characters.

    3. John Sanders
      Boffin

      Re: Make 'em pay

      We Westerners invented modern computing (it was mainly British and American scientists), and the restrictions of the day made provisioning resources for languages that use 100,000+ characters a silly proposition.

      I cannot imagine coding in Chinese; neither, apparently, can the Chinese.

  15. Anonymous Coward
    Anonymous Coward

    If you can't say what you want to say using ASCII, you are not trying hard enough.

    1. sorry, what?
      WTF?

      ASCIIart and technological backwaters

      Yes, you can write any language you choose (even those they have on all the Star Trek, Star Wars and Stargate displays) using ASCII, as long as you are happy to have one or two characters per page and construct it all as ASCII art.

      I had been a Java bean for over a decade, up until the start of the year, when I found a job with an M$-based outfit. Something I rapidly spotted was how backwards so much of the M$ technology is. Don't get me wrong, there are some cool things too, but this discussion about UTF-8 as if it were something new and wondrous, and how tricky it is to use with certain platforms and languages, seems like something from the late 90s, not the 2010s!

    2. Dan 55 Silver badge
      Joke

      And if they don't understand, just REPEAT IT IN CAPITALS.

      1. Frumious Bandersnatch

        And if they don't understand, just REPEAT IT IN CAPITALS.

        I prefer to just xor it with spaces.

  16. Gravis Ultrasound
    Trollface

    Stuck in the past

    What's the deal with UTF-8 vs UTF-16?

    Even mobiles have 64-bit processors now. Is it too much to ask for a 64-bit character set? The current crop of software developers never learned to fully exploit the available computer resources.

    1. James 47

      Re: Stuck in the past

      You're not a FireFox user then

      1. Steve Knox

        Re: Stuck in the past

        @James 47 -- Gravis said "exploit", not "squander."

        1. Anonymous Coward
          Anonymous Coward

          Re: Stuck in the past

          My RAM does feel exploited ...

          ... oh ...

  17. Paul Crawford Silver badge

    Cardinal sin of computing

    The fact that some programmer, in an attempt to show the "benefit of Unicode", should use a 'double' variable for PI and only give 6 figures tells you they should be executed and their programs not!

    But yes, you speak the truth - UTF-8 is better for all practical reasons because it won't break old software/code and yet it allows all characters you (and your customers/users) might want. Subject to matching system fonts - a rant for another day...

  18. kurkosdr

    For the love of (deity), make sure your code uses UTF-8. The world has decided. Text files are UTF-8, websites are (or should be) UTF-8, emails should be UTF-8 and SMSes are UTF-8. So, use UTF-8.

    1. Natalie Gritpants

      Unfortunately RDS isn't, so when playing my MP3s via a personal FM transmitter I get weird characters in the track title on the car radio.

    2. MacroRodent
      FAIL

      SMS character sets

      SMSes are UTF-8

      Actually, they aren't. Multiple different character sets are allowed, none of them UTF-8. See http://en.wikipedia.org/wiki/GSM_03.38 for a description of the mess. Read it and groan...

      1. Ramazan

        Re: SMS character sets

        Actually, SMSes use UCS-2BE (!), not UTF-8.

    3. JeeBee

      SMSs with a non-ASCII character in them are sent in UCS-2 (which is not exactly UTF-16) and they're a lot shorter per message part. Once you've dealt with surrogate-pair issues (encoding 3/4-byte characters in UCS-2) you may never regain sanity.

    4. A J Stiles

      SMSs are NOT UTF-8!

      SMS messages are usually sent in GSM-7, aka SMSCII, a modified form of ASCII with some code points moved around and some characters represented by two-septet escape sequences. It also includes some accented characters and enough of the Greek alphabet to be able to write in capitals in Greek, making up the remainder with Latin characters that look like Greek ones. This way, 160 7-bit characters can fit into 140 8-bit bytes. And you get to use the << and >> operators.

      Alternatively they can be sent in UCS-2, which is near enough to UTF-16 as makes no difference; but then the message is limited to 70 characters.

      There is no UTF-8 mode, though .....
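
      The "160 7-bit characters into 140 bytes" arithmetic comes from GSM 03.38's septet packing. A rough Python sketch of just the bit-packing idea (the real spec adds padding rules, the escape table and user-data headers, which this ignores):

      ```python
      def pack_septets(septets):
          """Pack 7-bit values into octets, least significant bits first,
          the way GSM 03.38 does: 160 * 7 bits == 140 * 8 bits exactly."""
          acc, nbits, out = 0, 0, bytearray()
          for s in septets:
              acc |= (s & 0x7F) << nbits
              nbits += 7
              while nbits >= 8:
                  out.append(acc & 0xFF)
                  acc >>= 8
                  nbits -= 8
          if nbits:                      # flush any leftover bits
              out.append(acc & 0xFF)
          return bytes(out)

      # A full-length message: 160 septets fit into exactly 140 octets.
      print(len(pack_septets([0x41] * 160)))   # 140
      ```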

  19. MrMur
    Devil

    All I can say is....

    try dealing with LMBCS

  20. Pen-y-gors

    Good idea

    I use UTF-8 for everything (some Welsh characters aren't supported in the usual European sets) - but could someone please give Microsoft a good slapping? I just wasted ages trying to get data containing Welsh characters (ŵ and ŷ - see, El Reg can handle them) from an Excel spreadsheet via CSV into a MySQL DB. Nightmare! Excel's CSV output can't do UTF-8. I ended up pasting into OpenOffice, then exporting.

    1. David Given

      Re: Good idea

      Back in 2009 I posted a comment to the Reg containing astral plane characters (code points with a value above 0xffff). I got back an apologetic email saying that I'd broken their database and they'd had to remove them from the comment.

      Some time later I found a bug in Thunderbird's treatment of astral plane characters. I tried to file a bug. Then I had to file a bug on Bugzilla complaining that it didn't handle astral plane characters properly... which was quite hard, as Bugzilla's bug tracker is also Bugzilla.

      (All of these stem from the same underlying problem, which is MySQL assuming 16-bit Unicode. This is why 16-bit Unicode must die. MySQL too, of course.)

      1. David Given

        Re: Good idea

        ...just tried to post something with astral plane characters and got back 'The post contains some characters we can't support'. MySQL FTW!

      2. JeeBee

        Re: Good idea

        MySQL's "utf8" type doesn't support 4-byte utf8, you need to use 'utf8mb4' for that!

  21. rcorrect

    It would be really interesting if computers had been invented in China.

    1. Mage Silver badge

      Computers Invented in China?

      I thought indeed they were.

      Mechanical, with rods & beads. Cf. adding MDCXIVL and XLVI

  22. Mage Silver badge

    Wonderful.

    Why did no-one point out to me earlier that Unicode isn't 16 bits any more?

    I had noticed however that if you wanted Web Sites, Browsers & SQL all to talk nice that UTF-8 was best.

  23. Anonymous Coward
    Anonymous Coward

    Java

    The Java devs could have saved themselves a lot of bother by just calling "char" an unsigned short and being done with it. Now they just look silly by calling a character a "codepoint" and a UTF-16 unit a "char".

    Mind, their punishment was to be swallowed whole by Oracle. Rough justice.

    1. Richard 12 Silver badge

      Re: Java

      I hate the datatype "char" and refuse point blank to use it.

      I use quint8/qint8 or uint8/int8 for an 8-bit unsigned/signed value (depending on whether I'm Cute at the time).

      "char" should be banned. It's confused.

  24. Steve Knox
    Happy

    "All those extra holes made it easier to air cool in-memory databases."

    Like Lego Technics, the holes just make them cooler!

    1. Anonymous Coward
      Anonymous Coward

      Re: "All those extra holes made it easier to air cool in-memory databases."

      If "Rate This Article" still existed, that line alone would have got an 11 for this article.

      Can't see how to sneak it in to the office conversation yet, but I'll give it serious thought.

      [Yes there are computer people that don't read El Reg. Unbelievable but true.]

  25. Joe Harrison

    Got to keep it

    If every character needs 16 bits rather than 8 then the NSA will only be able to store half as much of our stuff

  26. miket82

    Machine code £

    The comment about printing the £ (I hated the # sign) reminded me of my DOS days. I solved it by writing a small 90-byte machine-code routine (most bytes were my credit line), loaded through config.sys, that redirected the print code to see the £ code rather than the hash code. Staff often asked me what the line

    "Money added to system"

    meant when they switched the machine on but then I always did have a weird sense of humor.

    1. Neil Barnes Silver badge

      Re: Machine code £

      When I was but a lad, the BBC used internally a variant of the CEEFAX system to carry presentation messages (next item is, coming out three seconds early, etc.) around the country on a video display line that was stripped out before the signal went to the transmitter.

      What the character set PROM didn't have was a £ sign.

      Instead of using a separate PROM or even $deity$ help us an EPROM, the BBC designs department in its infinite wisdom built a whole chunk of logic that recognised the £ code and told the character generator to use the top half of a C and the bottom half of an E...

      I don't recall ever seeing a message that used the £ sign...

  27. John Savard

    Fixed Length

    As UTF-8 can represent code points up to 31 bits long, the alternative of every character taking a fixed 32 bits still remains another valid, if wasteful, option.

    UTF-8 is somewhat wasteful as well, often requiring three bytes instead of two, or two bytes instead of one; stateful encodings can do much better.

    1. Frederic Bloggs
      Holmes

      Re: Fixed Length

      One has a choice: either stateful (shorter, but potentially fragile and non-self-synchronising) or UTF-8. Me? I choose UTF-8. But then I spend a large part of my programming life dealing with radio-based comms protocols, which means - by definition - I am rather strange.

      Oh, and it doesn't help that I spent a lot of time in my formative years having to deal with 5 channel paper tape...

  28. artbristol

    Should be titled "Down with UTF-16"

    Unicode is a good standard and it was written by clever guys. There's nothing wrong with Unicode's approach of mapping each character to a code point, and adding an intermediate step requiring encoding it into bytes. Far better than the ugly mess of codepages that preceded Unicode.

    UTF-8 is part of Unicode and it's a damn good encoding.

    1. Christian Berger

      Re: Should be titled "Down with UTF-16"

      Well, that's a common problem with El Reg: there are many authors who have never seen anything other than the little area they work in, and believe that the whole world is like this.

      They believe that E-Mail is as complex as Exchange, they believe that somehow IPv6 is amazingly difficult, and they believe that the world is still using UTF-16.

      It's a bit like the people from Krikkit who, due to their dark night skies, have never seen even a glimpse of the worlds out there.

      1. Roland6 Silver badge

        Re: Should be titled "Down with UTF-16"

        MS Exchange? I didn't realise it was complex - obviously spent too much time working on enterprise systems.

    2. Destroy All Monsters Silver badge

      Re: Should be titled "Down with UTF-16"

      > There's nothing wrong with Unicode's approach of mapping each character to a code point

      Actually, there is plenty wrong with that, because then you suddenly need the whole Cartesian product of diacritics and base characters.

      The only one who I would trust to come up with a "good Unicode" would be Knuth.

  29. Tromos

    Control Data had it right

    The old CDC mainframes used a 6-bit character set - and no multibyte codes (until some idiot went and wanted lower case put in too).

    1. Lars Silver badge
      Pint

      Re: Control Data had it right

      In those days memory was expensive, so some machines had 4 bits for numbers and 6 or 8 for characters.

      1. BristolBachelor Gold badge

        Re: Control Data had it right

        I remember some very old ICL and Digital machines with 6-bit bytes (being a pedant, I am using "byte" for the number of bits needed to represent a character). One guy here still cannot type in lower case, and I'm pretty sure he'd have a stroke if you sent him a document without a single upper-case letter in it.

        But I also remember at least one of those DEC machines had a machine-code square-root instruction (although I seem to remember it being split into two to allow time slicing).

        It now makes me smile a bit that we have huge monster machines running bare-metal hypervisors, with each user having a virtual machine running its own virtual copy of Windows, loading its own virtual copy of Excel. In the past, a single machine loaded one copy of 2020, and all the users shared it. No need to load 150 copies of the same thing, all repeating the same housekeeping tasks.

        1. Anonymous Coward
          Anonymous Coward

          Re: Control Data had it right

          "a machine-code square-root instruction (although seem to remember it being split into 2 to allow time slicing)."

          VAX, perhaps. VAXes (and lots of others) can take page faults (or other exceptions) part way through the processing of an instruction. If it's a page fault, the relevant data is loaded into memory by the OS and the faulting instruction is resumed. A potentially long-running CISC instruction (such as POLY) may or may not need to be restarted from the beginning: if the "first part done" bit is set, the instruction resumes where it left off; otherwise it restarts from scratch.

          And why am I telling you this?

          Because you need to know.

  30. clean_state

    Bravo!

    I spent a lot of time fighting with text encodings when designing the .mobi file format for Mobipocket (and later Kindle). The conclusion was also that UTF-8 wins everywhere. The self-sync feature is superb. As for the "hassle" of handling a variable-length character encoding, you soon realize that:

    - in most cases, you need the length of your string in bytes (for memory allocation, string copying, ...)

    - cases where you need to decode UTF-8 to code points are rare: mostly when you display those characters, and then you usually display the whole string from first to last byte, so going through the bytes in sequence to decode the code points is not wasteful.

    - the typical case where you need to know the characters is parsing, BUT ALL keywords and ALL control characters in ALL computer languages are below code point 128, so you can actually parse UTF-8 as if it were ASCII and never care about the multi-byte encoding outside of string literals.

    So yes, UTF-8 everywhere!
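
    clean_state's parsing point can be demonstrated directly; a minimal Python sketch with a made-up comma-separated record:

    ```python
    def split_fields(data: bytes) -> list[bytes]:
        # Every byte of a multi-byte UTF-8 sequence has its high bit set, so an
        # ASCII delimiter like b',' can never occur inside one. We can split the
        # raw bytes without decoding anything, exactly as if it were ASCII.
        return data.split(b",")

    line = "name,V\u00e4in\u00f6,\u4e2d\u6587".encode("utf-8")
    print([f.decode("utf-8") for f in split_fields(line)])   # ['name', 'Väinö', '中文']
    ```

    This is UTF-8's self-synchronisation in action: no byte of a multi-byte sequence can be confused with an ASCII byte, so byte-level scanners stay correct.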

    1. rleigh

      Re: Bravo!

      I hate to be a pedant (actually, that's a lie), but it's not strictly true that all control characters are below codepoint 128. There is the ECMA-35/ISO-2022 C1 control set designated at 128-159, mirroring the C0 control set with the high bit set. This is obviously incompatible with UTF-8 though, and so not available when you have UTF-8 designated with "ESC % G".

    2. Anonymous Coward
      Anonymous Coward

      Re: Bravo!

      I'm not sure about your last point there. Lots of programming languages allow non-ASCII characters in identifiers (for example: http://golang.org/ref/spec#Identifiers), so, assuming you are not going to allow all non-ASCII characters in identifiers (Go allows 'ĉ' but not '£'), your lexer does need to identify characters. Also, you might want character constants to work beyond ASCII.

      However, you typically don't need to decode UTF-8 in order to identify the end of a string constant or comment.

  31. Admiral Grace Hopper

    The Youth Of Today

    Some of us still dream in EBCDIC.

  32. Anonymous Coward
    Anonymous Coward

    "public static final double π = 3.14159;"

    No it's not.

    1. Uncle Slacky Silver badge
      Joke

      Re: "public static final double π = 3.14159;"

      It is...for suitably small values of π...

    2. Suricou Raven

      Re: "public static final double π = 3.14159;"

      public static final double π = 3.14159;//ish

    3. Annihilator
      Boffin

      Re: "public static final double π = 3.14159;"

      "Scientists, scientists, please. Looking for some order. Some order, please, with the eyes forward and the hands neatly folded and the paying attention ... PI IS EXACTLY THREE!!"

      1. Adam 1

        Re: "public static final double π = 3.14159;"

        I would like three pies

      2. Primus Secundus Tertius

        Re: "public static final double π = 3.14159;"

        public static final double π = 355.0 / 113;

  33. Anonymous Coward
    Anonymous Coward

    Joel Spolsky

    "he royally patronises programmers"

    He sure does. All the time. And not just about Unicode.

  34. disgruntled yank

    Mr. U will not be missed

    (see ee cummings).

    I must say that Perl makes it not too painful to deal with Unicode.

    I am slightly disappointed with Ms. Stob, though, for not riffing on The U and the Non-U...

  35. Anonymous Coward
    Anonymous Coward

    Global posts

    Discovered last week that Windows Notepad won't display all pasted UTF-8 characters - but it does preserve the binary values. So saving from Word in "TXT" UTF-8 format with an HTM suffix does appear correctly on a browser page.

    Very useful for a hobby task that indexes public Facebook and YouTube postings, which can be written in just about any language. A quick screen scrape of Google Translate then combines a translation with the original.

  36. joeldillon

    Err... Qt has always tended to use UTF-8, not plain old UCS-2, for file I/O...

    (Also, "one 16-bit value equals one character" wasn't true even from the start, even without taking Chinese into account; consider polytonic Greek, for example.)

  37. Anonymous Coward
    Anonymous Coward

    Go Forth

    I've been using UTF-8 in my Forth code for years; it's nice to be able to use maths and logical symbols in function names (all right, "words", as we right-minded Forth programmers call them).

  38. An0n C0w4rd

    Unicode needs to be taken out back and shot

    Not just shot once, but repeatedly.

    One of the principles of Unicode is to separate the character from the representation of the character. In other words, ASCII 65 (decimal) is "A". How your system chooses to display "A" is up to the system. The character is transmitted as decimal 65 no matter what the display representation is.

    Unicode promptly goes on to rubbish this ideal.

    Pre-Unicode Asian character sets had "full-width" representations of ASCII characters, so displays that mixed ASCII and Japanese characters kept their formatting, the characters all having the same width, while the usual narrower ASCII characters would have broken the formatting.

    Unfortunately this lives on in Unicode, shattering the idea that the display of a character is independent of its code point, because there are now two different Unicode code points that both print a Latin-1 "A" (and likewise the rest of the alphabet, the numbers and the punctuation). In reality, the full-width "A" should not be U+FF21; it should be decimal 65, with the renderer deciding whether it should be full width or not.

    This has caused me more than one problem in the past with things that sometimes correctly handle the full-width and ASCII mix and sometimes don't.
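
    The duplication is easy to see from Python, and Unicode's own compatibility normalisation (NFKC) exists precisely to fold these presentation variants back together:

    ```python
    import unicodedata

    # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A is a distinct code point from
    # plain U+0041, kept only for round-trip compatibility with legacy CJK
    # encodings.
    full, plain = "\uFF21", "A"
    print(full == plain)                                   # False: two code points
    print(unicodedata.normalize("NFKC", full) == plain)    # True: NFKC folds it back
    ```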

    1. albaleo

      Re: Unicode needs to be taken out back and shot

      "the full width "A" should not be U+FF21, it should be decimal 65 with the renderer deciding if it should be full width or not."

      I'm not sure I agree. How does the renderer decide? For example, in the following (if it displays), where English and Japanese are mixed, and the second upper-case A is part of a Japanese name.

      A great future at Aテック.

      1. Steve Knox
        Pirate

        Re: Unicode needs to be taken out back and shot

        Rule 1: If the 'A' is part of a word which contains Japanese characters, use full-width to be compatible with the rest of the word to which it belongs. This covers your example.

        However, it does not cover all other possibilities.

        Rule 2: If the 'A' is part of a word consisting entirely of English characters, but which is nonetheless part of a sentence which primarily consists of Japanese words, use the full-width.

        This rule may be and may need to be generalized to paragraph, section, even document level depending on the particular use case.

        Otherwise, proportional should be acceptable, if not preferred.

        NB to be fully international and general it would probably be best to replace "Japanese" and "English" with "full/fixed-width alphabet" and "variable/proportional-width alphabet" (or some similar even more appropriate terminology) in the preceding.

        1. Anonymous Coward
          Anonymous Coward

          Re: Unicode needs to be taken out back and shot

          Not sure your rules apply either, really... let's have a quick browse of the arcade cabinets and control panels section of Yahoo Auctions, which I'm browsing, for example. There are some power supplies for sale: one seller has written DCパック but someone else has written ＤＣパック. Some people use full-width Latin for the starting prices, some don't. Some people have even managed to mix up half- and full-width Latin in the same word or number. My wife's computer seems to default to using full-width Latin for everything, whereas the input method on my machine doesn't seem to use full width for anything unless I go all the way down to the bottom of the candidates in the selection window.

          1. Roland6 Silver badge

            Re: Unicode needs to be taken out back and shot

            Upvoted this conversation thread as very interesting and enlightening.

  39. J.G.Harston Silver badge

    Unicode? Hah! I remember fighting with BIG-5.

  40. Anonymous Coward
    Anonymous Coward

    Java :D

    Remember, dear Java coders, to specify a character set whenever you convert a String to bytes, or when using a reader or writer that implicitly does so on an underlying byte stream. Otherwise your default platform encoding will be used instead, and who knows if that is set to the same thing across multiple servers, or servers and clients...

    And yes, you'll soon come to curse them for making UnsupportedEncodingException a checked exception, as if there was something you could do to recover from it.
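
    The same default-encoding trap exists well beyond Java (String.getBytes() and new String(bytes) without a charset argument both use the platform default). The defensive habit, sketched here in Python rather than Java, is to name the encoding at every boundary:

    ```python
    # Explicit on the way out...
    data = "Qazaq: \u049a\u0430\u0437\u0430\u049b".encode("utf-8")

    # ...and explicit on the way back in. Relying on a platform default means
    # two servers (or a server and a client) can silently disagree.
    text = data.decode("utf-8")
    print(text)
    ```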

    1. Roland6 Silver badge

      Re: Java :D

      And to ensure that the character set specified is the same as that used by the DBMS (which can often use a different character set to the host platform...

  41. Anonymous Coward
    Anonymous Coward

    Has Joel signed on?

    Or Raymond Chen? http://blogs.msdn.com/b/oldnewthing/

    Or Michael Kaplan? http://blogs.msdn.com/b/michkap/

    If they can get those three to sign on, then it might go somewhere.

  42. Christian Berger

    Wait, there are still people using 16-bit characters?!

    I'm sorry, but the last time I saw those in the wild was around 2000. Back then Microsoft had a short spell with that. Nobody uses 16-bit character codes for any sort of external representation any more, as people have found out they don't work for the very thing they were invented for: eliminating code pages.

  43. Tim 11

    UTF-8 IS unicode

    I know you said you were refusing to be rigorous in use of the correct terminology, but a side effect of this is that the title of your article is completely wrong. You're complaining about UTF-16, and if you use UTF-8 you are also using Unicode.

  44. Stevie

    Bah!

    Eschew all of the above and use only FIELDATA.

    a) It does everything you need for programming a real computer, and does it in 3/4 the space by simply acknowledging that Johnny Foreigner doesn't matter.

    2) It is also a friendly encoding scheme in that it only has caps. If everyone is shouting, no-one is. Thus a major annoyance on the intarwebs is removed as if it had never been.

    #) Brought to you by Univac, proper computers for real programmers. Remember: If you can pick it up without a crane, it isn't a real computer, it's a toy. Don't put your important software on toy computers.

    Plus: OS 2200 - a mature, secure operating system with utilities people actually fixed in a timely manner, so there are *no* "known bugs" still dragging their arse into theatre twenty-five years on, and no buffer-overrun attack scripts available on the web for the asking. Unix or Windows? Don't make me laugh.

  45. Annihilator
    Happy

    " It currently contains around 110,000 characters. You will have noticed that this is considerably above the original two-byte limit"

    Na, not *considerably* above. Just a lowly bit above it.

  46. Herby

    Could be worse?

    We could all be using a Baudot coding scheme. It uses 5 bits per character and includes LTRS and FIGS shifts. Only the alphabet was encoded in the LTRS shift, and "special" characters were encoded in the FIGS shift.

    A total of 26 LTRS, 26 FIGS, and CR, LF, FIGS, LTRS, SPACE, NULL.

    A nice total of 55 actual code points, as you couldn't count FIGS, LTRS, or NULL.

    Of course if you go back further, you were limited to a 48 character set for such mundane things as coding in FORTRAN. The character set had 26 letters, 10 digits, space, and '@', '=', '(', ')', '*', '$', ',', '.', '/', '+', '-'. Sometimes you replaced '@' with a single quote (').

    If it was good enough for FORTRAN, it was good enough for me.

  47. Will Godfrey Silver badge
    Happy

    But Wait!

    Does this all mean that ASCII art is on the rebound?

    1. ratfox

      Re: But Wait!

      You can hardly call this ASCII

      ಠ_ಠ

  48. Daniel von Asmuth
    Coffee/keyboard

    Royal Mail

    I noticed as a boy that there are two kinds of postage stamps: the first kind has the name of the country of origin written on it in some language; the other kind is British, and may feature a portrait of Her Majesty.

    1. Richard Plinston

      Re: Royal Mail

      I noticed as a boy .. a portrait of her Majesty.

      When I was a young boy I noticed that it was His Majesty.

  49. Mikey D

    Down with Unicode (and UTF-8)!

    What happened to Multicode?

    http://faculty.kfupm.edu.sa/COE/mudawar/publications/1997.Multicode.IEEEComputer.pdf

    1. ThomH

      Re: Down with Unicode (and UTF-8)!

      Being a 1997 attempt to fix the problems stemming from a belief that "[t]he Unicode standard is a fixed-width scheme ... [that] uses 16-bit encoding", it was immediately irrelevant because UTF-8 had been presented in 1993. It's also modal, so lacks self synchronisation, and complicates things by defining character sets by language. As the paper acknowledges, 'a' is present separately as an English character, a French character, a German character, etc, etc, with the intention being that all those different 'a's are mapped back to the same thing after the fact.

    2. Primus Secundus Tertius

      Re: Down with Unicode (and UTF-8)!

      I have been thinking about inventing duocode, in the hope it could become a single standard. Everything with the uni- prefix forks off fifty fanciful fellow-versions.

  50. bofh80

    This guy needs a new job

    In journalism. Oh my this is refreshing, please more from this guy, give him a spot and pay for his articles. Much better! :P

    /ducks

  51. Anonymous Coward
    Anonymous Coward

    Character encodings in real life

    Surprised that no one else has posted this: https://en.wikipedia.org/wiki/File:Letter_to_Russia_with_krokozyabry.jpg

    1. Destroy All Monsters Silver badge

      Re: Character encodings in real life

      Not bad

      I can now add MOJIBAKE (character gibberish shite output crud) to my vocabulary.

      Really, that should be in the Hacker's Dictionary.

  52. PyLETS

    ˙ƃuılıǝɔ ǝɥʇ ƃuolɐ ƃuıʞlɐʍ puɐ sʇooq ɔıʇǝuƃɐɯ ʎɯ ƃuıɹɐǝʍ ɯ,ı uǝɥʍ qoʇs ʎʇıɹǝʌ ƃuıpɐǝɹ pǝʎoɾuǝ sʎɐʍlɐ ǝʌɐɥ ı

    1. Anonymous Coward
      Anonymous Coward

      Should I upvote that?

      Or would it be misinterpreted?

      1. Stuart Moore

        Re: Should I upvote that?

        It currently has 6 upvotes.

        I needed to think very hard about whether that was 'six' or 'nine'. If I'm honest I'm still not sure...

  53. Anonymous Coward
    Anonymous Coward

    Looking forward to the next one ...

    I'm looking forward to the next post on timezones. If you can handle strings then you are, possibly, ready to handle world time. Any computer structure as fundamental as time (in most real-world systems anyway), which can be screwed by politicians deciding they want to do it differently this year, is always good for a giggle.

  54. Anonymous Coward
    Anonymous Coward

    Best part of article

    "code pages were something horrible and fussy that one hoped to get away with ignoring"

    I work for an American software company selling software to countries that make heavy use of non-ASCII data.

    It's a nightmare because, as you say, American devs just try to ignore the problem and happily map byte arrays to strings with no thought for what will happen outside the US or Western Europe.

    Our software developed in Java is generally OK, but the older stuff from the nineties developed in C is nothing but a headache.

    1. Roland6 Silver badge

      Re: Best part of article

      Years back, a challenge we had on an international project was getting hold of a Japanese version of Windows (ie. a version of Windows that used and supported 16-bit character sets), because MS in their wisdom didn't supply it as standard to resellers & SIs in the USA and Europe. I take it that things haven't improved significantly since then.

  55. Zack Mollusc

    Solved!

    Well, duh! The answer is simple, adopt my own patented system entitled Uniercode. It uses 17 bits to represent each character, thus solving the problem forever.

    1. Roland6 Silver badge
      Joke

      Re: Solved!

      I think the solution is UTFv6, this uses 128-bits per character, so has more than enough space for all current and future character sets...

  56. raving angry loony

    Cunning linguists

    I sometimes work with linguists and their need for systems that can be used with several dozen languages from all over the world. They generally despair at the state of computing and character identification for non-English (or even non-latin-character based) languages. As far as they're concerned every system so far seems to have been created by quasi-illiterate uni-lingual English speakers or worse, people with only a beginner's understanding of the languages they're supposed to be transcribing.

  57. Henry Wertz 1 Gold badge

    UTF-8 and internationalization

    Well, UTF-32 (4-byte Unicode) does accommodate all characters in a flat space (it even has a "user defined" space, de facto split up so you can have, say, Klingon and Lord of the Rings fonts installed with their proper character codes). Yes indeed.

    Here's the "meat" of UTF-8... wikipedia has a nice table which I cannot paste here, but the short of it is UTF-8 is 1 to 6 bytes, but the longest lengths are for characters "at the end" of the unicode code space, in practice most characters are 1-3 bytes. Unicode will encode some characters that are a character with an extra mark or two on it as a 2-byte character and one or more 2- byte modifier characters, which will be encoded in 3 bytes by UTF-8. byte 1 is 0xxxxxxx for a 7-bit character and always starts with 11xxxxxx for a multi-byte character. (It is 110xxxxx to indicate a 2-byte character through 1111110x for a 6-byte character. These lengths encode 7, 11, 16, 21, 26, and 31 bits of a 32-bit Unicode character. extra bytes are all 10xxxxxx.

    That said, *shrug*, as a programmer I find Android and Linux both have plenty of good internationalization APIs available, and I avail myself of them so I don't have to worry about the details.
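    The byte layout described above can be sketched as a toy encoder. This follows the original pre-2003 scheme with sequences up to 6 bytes; modern UTF-8 is capped at 4 bytes and code point U+10FFFF, so for valid Unicode the two agree:

```python
def utf8_encode(cp):
    """Encode one code point with the original (pre-2003) UTF-8 scheme,
    which allowed up to 6 bytes and code points up to 2**31 - 1."""
    if cp < 0x80:
        return bytes([cp])          # 7 bits: 0xxxxxxx
    # sequence lengths 2..6 carry 11, 16, 21, 26, and 31 payload bits
    for nbytes, bits in ((2, 11), (3, 16), (4, 21), (5, 26), (6, 31)):
        if cp < (1 << bits):
            # leading byte: nbytes ones, a zero, then the top payload bits
            prefix = (0xFF << (8 - nbytes)) & 0xFF       # 0xC0, 0xE0, 0xF0, ...
            first = prefix | (cp >> (6 * (nbytes - 1)))
            # continuation bytes: 10xxxxxx, six payload bits each
            rest = [0x80 | ((cp >> (6 * i)) & 0x3F)
                    for i in range(nbytes - 2, -1, -1)]
            return bytes([first] + rest)
    raise ValueError("code point out of range")

print(utf8_encode(ord('A')).hex())   # 41 -- plain ASCII passes through
print(utf8_encode(0x20AC).hex())     # e282ac -- EURO SIGN needs 3 bytes
```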

  58. Bagelsmonster

    Smug mode

    You're all light-weights. Some of us had to build our own hardware to get the job done.

    http://en.wikipedia.org/wiki/Multilanguage_Electronic_Phototypesetting_System

    MEPS is now software only. We are currently just shy of 600 languages published in print and on jw.org. To put that in context, un.org is in 6 languages.

    1. Roland6 Silver badge

      Re: Smug mode

      Where's the YouTube video of the MEPS computer in action!

  59. AncientJohn

    As a long retired programmer who started out using binary I love the comments on Verity's articles.

    1. Anonymous Coward
      Anonymous Coward

      > As a long retired programmer who started out using binary

      Bah! Binary!

      In *my* day we couldn't afford two symbols. So unary it was, young lad.

  60. Anonymous Coward
    Anonymous Coward

    This is still a problem?

    All of my machines have been fully UTF-8 for ages...

    1. Real Ale is Best

      Re: This is still a problem?

      I think it's still a problem for Americans who still think the world is written in ASCII on Letter sized paper.

  61. Anonymous Coward
    Anonymous Coward

    Benefit of UTF-8 is backwards compatibility

    Everybody learned that characters are 8 bits and most programming languages, libraries, file formats, and other software have been designed around this assumption.

    The brilliant thing about UTF-8 is that with minimal (or often no) modification to anything, everything that supported 8 bit characters also inherently, automatically "supports" multibyte characters. Programmers CAN, for the most part, ignore the fact that multibyte characters even exist.

    As somebody who was forced to spend years being constantly annoyed by Microsoft's "widechar" software and APIs, UTF-8 is an awe-inspiring solution to the problem.
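    A quick illustration of why this works: ASCII text is already valid UTF-8, and no byte of a multi-byte sequence falls in the ASCII range, so byte-oriented code that searches for ASCII delimiters keeps working untouched:

```python
# ASCII text is byte-for-byte identical in ASCII and UTF-8...
assert "Hello, world!".encode("ascii") == "Hello, world!".encode("utf-8")

# ...and every byte of a multi-byte sequence has the high bit set, so it can
# never be mistaken for an ASCII byte such as '/', NUL, or a quote mark.
path = "dossiers/année/fichier.txt".encode("utf-8")
parts = path.split(b"/")             # splitting on an ASCII byte stays safe
assert parts[1].decode("utf-8") == "année"
print([p.decode("utf-8") for p in parts])  # prints ['dossiers', 'année', 'fichier.txt']
```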

  62. Andy Davies

    We never had these problems with EBCDIC - although variable length coding of characters goes back at least as far as Morse code.

    1. Destroy All Monsters Silver badge

      I thought EBCDIC was a single permaproblem?

      Yay for a non-continuous mapping of roman characters. About on the same level as the original PC design.

  63. William Higinbotham

    EBCDIC

    Remember Esoteric IBM EBCDIC? en.wikipedia.org/wiki/EBCDIC

    1. Justigar

      Re: EBCDIC

      You say that jestingly, but we still have dealings with this.

      We get data from a client in EBCDIC and we have to convert it to ASCII before it gets sent out again. It arrives, once a week, every week, on media that is older than I am. People think I'm playing Space Invaders in the corner when I have to transfer the data....

      1. Anonymous Coward
        Anonymous Coward

        Re: EBCDIC

        Of course, it's not always possible to convert from EBCDIC to ASCII, because of '¦', '¬', ...

    2. Michael Wojcik Silver badge

      Re: EBCDIC

      "Esoteric"? The trade in EBCDIC systems continues unabated, and there's a goodly amount of EBCDIC data processed on ASCII machines, for that matter.

  64. Dick Pountain

    the horror, the horror....

    One can't begin to appreciate the true horror of Unicode until one has tried converting Word documents into MOBI to make Kindle books using Calibre.

  65. ozzee

    Old argument.

    http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML017/0687.html

  66. ponga

    Bah, UTF-8 is just a continuation of the usual anglo-saxon cultural imperialism: A-Z as first class citizens, with Johnny Overseas characters such as é, è, ö, ä, å, ç, æ, ø and ß treated as a regrettable necessity, if dealt with at all.

    Frankly, I'm holding out for an encoding where everyone uses multibyte characters to represent all text, including classic ASCII: that's the only way American software is ever going to be fully usable outside the good ole US of A. (Hmmmm... make that a necessary but insufficient condition.)

  67. organiser

    UTF-32

    UTF-16 is ancient and impractical. UTF-8 is intended for data interchange, not for processing. UTF-32 is the real and true Unicode encoding these days.

  68. jdieter

    you forgot EBCDIC

    Good old IBM

This topic is closed for new posts.