Re: Stuck in the past
@James 47 -- Gravis said "exploit", not "squander."
I recently experienced a Damascene conversion and, like many such converts, I am now set on a course of indiscriminate and aggressive proselytising. Ladies and gentlemen, place your ears in the amenable-to-bended position, and stand by to be swept along by the next great one-and-only true movement.
The beginning
In the …
The fact that some programmer, in an attempt to show the "benefit of Unicode", should use a 'double' variable for PI and only give 6 figures tells you they should be executed and their programs not!
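For anyone who wants to see the crime in miniature, a quick sketch (the six-figure value is my illustration, not the original offender's exact code):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double pi_lazy   = 3.14159;           /* six figures: wastes most of a double */
        double pi_proper = 3.141592653589793; /* what a double can actually hold */

        /* the error is ~2.7e-6, enormous next to a double's ~1e-16 precision */
        printf("error: %g\n", fabs(pi_proper - pi_lazy));
        return 0;
    }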
But yes, you speak the truth - UTF-8 is better for all practical purposes because it won't break old software/code, and yet it allows all the characters you (and your customers/users) might want. Subject to matching system fonts - a rant for another day...
SMS messages are usually sent in GSM-7, aka SMSCII, a modified form of ASCII with some code points moved around and some characters represented by two-septet escape sequences. It also includes some accented characters and enough of the Greek alphabet to write in capitals in Greek, making up the remainder with Latin characters that look like Greek ones. This way, 160 7-bit characters fit into 140 8-bit bytes. And you get to use the << and >> operators.
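For anyone who doubts the << and >> part, here is roughly what the packing looks like - a sketch of my own, not the reference GSM 03.38 code:

    #include <stddef.h>
    #include <string.h>

    /* Pack 7-bit GSM septets into octets, LSB first, so that
     * 160 septets (160 * 7 = 1120 bits) fill exactly 140 bytes. */
    size_t pack_7bit(const unsigned char *septets, size_t n, unsigned char *out)
    {
        size_t bits = 0;
        memset(out, 0, (n * 7 + 7) / 8);
        for (size_t i = 0; i < n; i++) {
            size_t byte = bits / 8, off = bits % 8;
            out[byte] |= (unsigned char)(septets[i] << off);
            if (off > 1)   /* septet straddles a byte boundary */
                out[byte + 1] |= (unsigned char)(septets[i] >> (8 - off));
            bits += 7;
        }
        return (bits + 7) / 8;
    }

Packing "hello" (septets 68 65 6C 6C 6F) gives the classic E8 32 9B FD 06.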
Alternatively they can be sent in UCS-2, which is as near to UTF-16 as makes no difference; but then the message is limited to 70 characters (140 bytes at two bytes each).
There is no UTF-8 mode, though...
I use UTF-8 for everything (some Welsh characters aren't supported in the usual European sets) - but could someone please give Microsoft a good slapping? I just wasted ages trying to get data containing Welsh characters (ŵ and ŷ - see, el Reg can handle them) from an Excel spreadsheet via CSV into a MySQL DB - nightmare! Excel output to CSV can't do UTF-8. I ended up pasting into OpenOffice, then exporting.
Back in 2009 I posted a comment to the Reg containing astral plane characters (code points with a value above 0xffff). I got back an apologetic email saying that I'd broken their database and they'd had to remove them from the comment.
Some time later I found a bug in Thunderbird's treatment of astral plane characters. I tried to file a bug. Then I had to file a bug on Bugzilla complaining that it didn't handle astral plane characters properly... which was quite hard, as Bugzilla's bug tracker is also Bugzilla.
(All of these stem from the same underlying problem, which is MySQL assuming 16-bit Unicode. This is why 16-bit Unicode must die. MySQL too, of course.)
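For anyone following along at home, "astral plane" just means a code point above 0xffff, which is exactly where BMP-only (16-bit) software falls over. A sketch of the two encodings, with U+1F600 picked purely as an example:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t cp = 0x1F600;  /* an astral-plane code point */

        /* UTF-8: four bytes, no special cases */
        unsigned char u8[4] = {
            (unsigned char)(0xF0 |  (cp >> 18)),
            (unsigned char)(0x80 | ((cp >> 12) & 0x3F)),
            (unsigned char)(0x80 | ((cp >>  6) & 0x3F)),
            (unsigned char)(0x80 |  (cp        & 0x3F))
        };

        /* UTF-16: a surrogate pair -- the bit that BMP-only code forgets */
        unsigned hi = 0xD800 | ((cp - 0x10000) >> 10);
        unsigned lo = 0xDC00 | ((cp - 0x10000) & 0x3FF);

        printf("UTF-8:  %02X %02X %02X %02X\n", u8[0], u8[1], u8[2], u8[3]);
        printf("UTF-16: %04X %04X\n", hi, lo);
        return 0;
    }

That prints F0 9F 98 80 for UTF-8 and the surrogate pair D83D DE00 for UTF-16; anything that stores one 16-bit unit per character mangles the latter.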
If "Rate This Article" still existed, that line alone would have got an 11 for this article.
Can't see how to sneak it in to the office conversation yet, but I'll give it serious thought.
[Yes there are computer people that don't read El Reg. Unbelievable but true.]
The comment about printing the £ (I hated the # sign) reminded me of my DOS days. I solved it by writing a small 90-byte machine-code routine (most bytes were my credit line), loaded through config.sys, that redirected the print code to see the £ code rather than the hash code. Staff often asked me what the line
"Money added to system"
meant when they switched the machine on, but then I always did have a weird sense of humor.
When I was but a lad, the BBC internally used a variant of the CEEFAX system to carry presentation messages ("next item is", "coming out three seconds early", etc.) around the country on a video display line that was stripped out before the signal went to the transmitter.
What the character set PROM didn't have was a £ sign.
Instead of using a separate PROM or even - $deity$ help us - an EPROM, the BBC designs department in its infinite wisdom built a whole chunk of logic that recognised the £ code and told the character generator to use the top half of a C and the bottom half of an E...
I don't recall ever seeing a message that used the £ sign...
As originally specified, UTF-8 only represents code points up to 31 bits in length, so the alternative of every character taking a flat 32 bits still remains another valid, if wasteful, option.
UTF-8 is somewhat wasteful as well, often requiring three bytes instead of two, or two bytes instead of one; stateful encodings can do much better.
One has a choice: either stateful and shorter, but potentially fragile and non-self-synchronising; or UTF-8. Me? I choose UTF-8. But then I spend a large part of my programming life dealing with radio-based comms protocols, which means - by definition - I am rather strange.
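The self-synchronising property is easy to demonstrate: every byte announces what it is, so you can land anywhere in a buffer and walk back to a character boundary - something no stateful encoding can offer. A rough sketch of my own, not lifted from any particular library:

    #include <stddef.h>

    /* Length of a UTF-8 sequence from its lead byte (0 = continuation/invalid).
     * The original 31-bit design allowed 5- and 6-byte forms; RFC 3629
     * caps sequences at 4 bytes, which is what this follows. */
    int utf8_len(unsigned char b)
    {
        if (b < 0x80) return 1;   /* 0xxxxxxx: plain ASCII   */
        if (b < 0xC0) return 0;   /* 10xxxxxx: continuation  */
        if (b < 0xE0) return 2;   /* 110xxxxx                */
        if (b < 0xF0) return 3;   /* 1110xxxx                */
        if (b < 0xF8) return 4;   /* 11110xxx                */
        return 0;                 /* invalid under RFC 3629  */
    }

    /* Resynchronise: from any offset, back up over continuation
     * bytes to find the start of the current character. */
    size_t utf8_sync_back(const unsigned char *s, size_t pos)
    {
        while (pos > 0 && (s[pos] & 0xC0) == 0x80)
            pos--;
        return pos;
    }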
Oh, and it doesn't help that I spent a lot of time in my formative years having to deal with 5 channel paper tape...
Unicode is a good standard and it was written by clever guys. There's nothing wrong with Unicode's approach of mapping each character to a code point, with a separate step to encode those code points into bytes. Far better than the ugly mess of codepages that preceded Unicode.
UTF-8 is part of Unicode and it's a damn good encoding.
Well, that's a common problem with ElReg: there are many authors who have never seen anything other than the little area they work in, and believe that the whole world is like that.
They believe that email is as complex as Exchange, they believe that IPv6 is somehow amazingly difficult, and they believe that the world is still using UTF-16.
It's a bit like the people from Krikkit who, due to their dark night skies, have never seen even a glimpse of the worlds out there.
> There's nothing wrong with Unicode's approach of mapping each character to a code point
Actually, there is plenty wrong with that: you suddenly need the whole Cartesian product of the diacritics and the base characters.
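To put bytes on that: Unicode dodges the full Cartesian product with combining marks, at the price of two different byte sequences for the same visible character - which is why normalisation (NFC/NFD) exists. A small illustration (the byte values are the genuine UTF-8 encodings):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *precomposed = "\xC3\xA9";  /* U+00E9, e-acute as one code point */
        const char *decomposed  = "e\xCC\x81"; /* U+0065 + combining acute U+0301   */

        /* Same character to a reader, different bytes to strcmp */
        printf("equal as bytes? %s\n",
               strcmp(precomposed, decomposed) == 0 ? "yes" : "no");
        return 0;
    }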
The only one I would trust to come up with a "good Unicode" would be Knuth.
I remember some very old ICL and Digital machines with 6-bit bytes (being a pedant, I am using "byte" to mean the number of bits used to represent a character). One guy here still cannot type in lower-case, and I'm pretty sure he'd have a stroke if you sent him a document without a single upper-case letter in it.
But I also remember at least one of those DEC machines had a machine-code square-root instruction (although I seem to remember it being split in two to allow time-slicing).
It now makes me smile a bit that we have huge monster machines running bare-metal hypervisors, with each user having a virtual machine running its own virtual copy of Windows, loading its own virtual copy of Excel. In the past, a single machine loaded one copy of 2020, and all the users shared it. No need to load 150 copies of the same thing, all repeating the same housekeeping tasks.
"a machine-code square-root instruction (although seem to remember it being split into 2 to allow time slicing)."
VAX, perhaps. VAXes (and lots of others) can have "page faults" (or maybe other exceptions) part way through the processing of an instruction. If it's a page fault, the relevant data is loaded into memory by the OS, and the faulting instruction is resumed. If it was a potentially long-running CISC instruction (such as POLY), it may or may not need to be restarted from the beginning: if the "first part done" bit is set, the partially-executed instruction picks up where it left off; if not, it restarts from scratch.
And why am I telling you this?
Because you need to know.
I spent a lot of time fighting with text encodings when designing the .mobi file format for Mobipocket (and later Kindle). The conclusion was also that UTF-8 wins everywhere. The self-sync feature is superb. As for the "hassle" of handling a variable-length character encoding, you soon realize that:
- in most cases, you need the length of your string in bytes (for memory allocation, string copying, ...)
- cases where you need to decode UTF-8 to code points are rare, mostly when displaying those characters; and then you usually display the whole string from first to last byte, so decoding the code points in byte order is not wasteful.
- the typical case where you need to know the characters is parsing, BUT ALL keywords and ALL control characters in ALL computer languages are below code point 128, so you can actually parse UTF-8 as if it were ASCII and never care about the multi-byte encoding outside of string literals (see the sketch after this list).
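A sketch of what that buys you, assuming a language where every delimiter is ASCII (a hypothetical scanner, not Mobipocket's actual code):

    #include <stddef.h>

    /* Return the index just past the closing quote of a string literal.
     * Every byte of a multi-byte UTF-8 sequence has its high bit set, so
     * it can never be mistaken for '"' (0x22) or '\\' (0x5C): we never
     * need to decode anything. */
    size_t skip_string_literal(const unsigned char *src, size_t i, size_t len)
    {
        i++;                              /* skip the opening quote */
        while (i < len && src[i] != '"') {
            if (src[i] == '\\' && i + 1 < len)
                i++;                      /* skip the escaped byte */
            i++;                          /* multi-byte UTF-8 sails through untouched */
        }
        return i < len ? i + 1 : len;     /* past the closing quote */
    }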
So yes, UTF-8 everywhere!
I hate to be a pedant (actually, that's a lie), but it's not strictly true that all control characters are below codepoint 128. There is the ECMA-35/ISO-2022 C1 control set designated at 128-159, mirroring the C0 control set with the high bit set. This is obviously incompatible with UTF-8 though, and so not available when you have UTF-8 designated with "ESC % G".
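To spell out the incompatibility: in UTF-8 the C1 controls U+0080-U+009F come out as two-byte sequences, so the raw single bytes 0x80-0x9F never mean a control character - they can only ever be continuation bytes. A quick check (mine, not ECMA's):

    #include <stdio.h>

    int main(void)
    {
        /* U+0085 NEL, a C1 control, encoded as UTF-8: */
        unsigned char nel[2] = { 0xC2, 0x85 };  /* 110xxxxx 10xxxxxx */
        printf("U+0085 as UTF-8: %02X %02X\n", nel[0], nel[1]);
        return 0;
    }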
I'm not sure about your last point there. Lots of programming languages allow non-ASCII characters in identifiers (for example: http://golang.org/ref/spec#Identifiers), so, assuming you are not going to allow all non-ASCII characters in identifiers (Go allows 'ĉ' but not '£'), your lexer does need to identify characters. Also, you might want character constants to work beyond ASCII.
However, you typically don't need to decode UTF-8 in order to identify the end of a string constant or comment.
Discovered last week that Windows Notepad won't display all pasted UTF-8 characters - but it does preserve the binary values. So saving from Word in "TXT" UTF-8 format with an HTM suffix does appear correctly on a browser page.
Very useful for a hobby task that indexes public Facebook and YouTube postings, which can be written in just about any language. A quick screen scrape of Google Translate then combines a translation with the original.
Not just shot once, but repeatedly.
One of the principles of Unicode is to separate the character from the representation of the character. In other words, ASCII 65 (decimal) is "A". How your system chooses to display "A" is up to the system. The character is transmitted as decimal 65 no matter what the display representation is.
Unicode promptly goes on to rubbish this ideal.
Pre-Unicode Asian character sets had "full-width" representations of ASCII characters, so displays that mixed ASCII and Japanese kept their formatting: the full-width forms had the same width as the Japanese characters, while the usual ASCII characters were narrower and hence broke formatting.
Unfortunately this lives on in Unicode, shattering the idea that the display of a character is independent of its code point, because there are now two different Unicode code points that both print out a Latin-1 "A" (and likewise the rest of the alphabet, the numbers and the punctuation). In reality, the full-width "A" should not be U+FF21; it should be decimal 65, with the renderer deciding whether it should be full width or not.
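For anyone who wants to see the duplication concretely (code point values straight from the Unicode charts):

    #include <stdio.h>

    int main(void)
    {
        /* Two code points, one letter 'A':                              */
        /*   U+0041 LATIN CAPITAL LETTER A           -> UTF-8: 41        */
        /*   U+FF21 FULLWIDTH LATIN CAPITAL LETTER A -> UTF-8: EF BC A1  */
        printf("ASCII:      \x41 (U+0041)\n");
        printf("Full-width: \xEF\xBC\xA1 (U+FF21)\n");
        return 0;
    }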
This has caused me more than one problem in the past with things that sometimes correctly handle the full-width and ASCII mix and sometimes don't.