Down with Unicode! Why 16 bits per character is a right pain in the ASCII

I recently experienced a Damascene conversion and, like many such converts, I am now set on a course of indiscriminate and aggressive proselytising. Ladies and gentlemen, place your ears in the amenable-to-bended position, and stand by to be swept along by the next great one-and-only true movement. The beginning In the …

COMMENTS

This topic is closed for new posts.

Hmmm...

0
12
Bronze badge
Alien

Re: Hmmm...

Yes, that was my reaction too, but then I admit near-total ignorance on the subject, beyond what I've just read.

Simplistically it seems the best solution is to implement a sufficiently large encoding length to accommodate all possible characters, which was supposedly the goal of UTF-16, except Becker naively assumed that "16 bits ought to be enough for anyone" (to paraphrase a well-known fallacy).

Again, simplistically, the answer to these "enough for anyone" fallacies would seem to be dynamic allocation, as in dynamic arrays or linked lists, which is in fact what UTF-8 does, although in its case the dynamic allocation pertains to the encoding length of each member rather than the overall length of the array, if I'm reading the descriptions correctly.

UTF-16 does that too, apparently, thus defeating its original objective, but suffers from ASCII and endian compatibility issues and, probably more than anything else, Microsoft's typically retarded implementation.
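To make the variable-length point concrete, here is a minimal Python sketch that counts how many bytes a single character takes in each encoding. The sample characters and the helper name `encoded_lengths` are picked for illustration only and do not come from the article or the thread.

```python
# Sample characters picked for this sketch (not taken from the thread).
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji


def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" is used so that no byte-order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in SAMPLES:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: {utf8_len} byte(s) in UTF-8, {utf16_len} in UTF-16")
```

UTF-8 runs from one to four bytes per character here, while UTF-16 uses two bytes until it reaches the emoji, which needs a four-byte surrogate pair. So neither encoding is fixed-width.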

So UTF-8 it is, then.

7
0
Silver badge
Go

Re: Hmmm...

My reaction was: AMEN, sister!

0
0

"...my fellow Delphi users should notice that Embarcadero has dropped support for the UTF8String type..."

What else to expect from something as backwards as Delphi? Pining for the days of Turbo Pascal is like pining for the days of Lisp Machines, only without sense and good taste.

12
28

Embarcadero has in the past usually followed Windows policy closely on these kinds of issues. Recently they seem to be orienting themselves more towards Objective-C and the Mac (because of their iOS offerings; only their mobile offerings are LLVM-based), but that is also UTF-16.

My guess is Embarcadero will adapt if their core targets adapt. Ranting against them (with baseless sentiment) is therefore useless.

That is also my problem with this whole rant. The main problem is not UTF-16, but there being two standards, with two opposing camps. Even if you think UTF-8 is superior, if your core platforms are UTF-16 oriented, you will spend your days fighting windmills.

1
2
Happy

Ah, Turbo Pascal...

6
0

@Aaron Miller

If El Reg would allow it, I'd give my whole day's quota of downvotes to your comment. Blaspheme not against the mighty Turbo Pascal, for it was Holy and did give many of us reason to stay in CompSci instead of switching to History or Psych.

Cretin.

28
2
Bronze badge
Childcatcher

Bah, humbug!

Turbo What now...? BASIC is the bee's knees...!

2
2
Anonymous Coward

> Pining for the days of Turbo Pascal is like pining for the days of Lisp Machines

You better leave LISP out of this, understood?

5
0
AOD

What else to expect from something as backwards as Delphi? Pining for the days of Turbo Pascal is like pining for the days of Lisp Machines, only without sense and good taste.

If you're going to have a little rant, please get your facts straight. Delphi evolved from Turbo Pascal but it is a distinct product and a very sophisticated one at that. Please enlighten us as to why you regard Delphi as backward? Do you have direct development experience with it that you can share, or is it just that it's non-MS and therefore can't be any good?

From experience I can tell you that when it was introduced, it brought features that gave the competition a swift kick to the happy sack, including but not limited to:

A WYSIWYG menu editor for designing your forms. The sad equivalent in VB3 was truly pitiful.

Decent object orientated support in a strongly typed language (Object Pascal).

Support for building applications as a single EXE. No more DLLs to fling around the place if you preferred not to.

3
0

Delphi backwards? I have a few choice Unicode characters for you!

Calling Delphi backwards ("backward" it should probably be!) shows a true ignorance of the language.

ASCII, Unicode 16 bit and UTF8, I just wish there was a standard that was universally accepted, and not rooted in the days of 8 bit machines.

0
0

There's UTF-8 and utf8 in Perl

The sainted Larry claims he can keep them separate in his head, but it baffles many a poor soul like me. And it has caused me to produce the odd bit of wombat-do-do in my time.

http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8-vs.-UTF8

I'm a former T.61 expert. Let's not go there.

2
0
Silver badge
Trollface

Re: There's UTF-8 and utf8 in Perl

Of course Larry can keep "UTF-8" and "utf8" apart in his head. For those who don't know, Perl is the language where if your cat walks across the keyboard then the text generated can be successfully run in Perl as a Turing-complete program.

21
0
Bronze badge
Pint

Re: There's UTF-8 and utf8 in Perl

And there it is. My fledgling interest in learning perl. stone. cold. dead. Life is just too short to deal with so much silly.

A beer for you, Mr Sullivan. You've saved me more pain than I knew was lurking in my future.

2
0
Bronze badge

Re: There's UTF-8 and utf8 in Perl

And there it is. My fledgling interest in learning perl. stone. cold. dead. Life is just too short to deal with so much silly.

Don't let it put you off. Unicode in Perl more or less "just works". The only times I've had problems with it have been in trying to correctly convert stuff from other code pages and broken MS document formats. That, and sometimes forgetting to tell my database that the incoming data is UTF-8 rather than ASCII (though sometimes Perl needs a hint, too, to tell it not to do a spurious conversion).

Speaking of MS documents, I find it really incredible to come across HTML on the web that obviously came from MS Word initially and that has completely messed up rendering of some trivial glyphs (like em dash and currency symbols). I find it hard to believe that in this day and age, Word can't even convert to HTML properly. OK, so maybe the problem isn't with Word, but with the options the user selected for the conversion, but still...

7
1
Bronze badge

Re: There's UTF-8 and utf8 in Perl

Don't let it put you off. Unicode in Perl more or less "just works".

Agreed. I wrote a script recently then, at the end, remembered that some bits of the data would be coming in with things like (un)"intelligent quotes". I set about looking at what I'd need to do, only to discover that it was all being handled correctly without me having to do anything special at all.

There is a great number of modules in Perl which do "what you need".

2
1
Bronze badge

Re: There's UTF-8 and utf8 in Perl

"There is a great number of modules in Perl..."

Yeh... ... yeh there is.

1
0
Silver badge
Trollface

Re: There's UTF-8 and utf8 in Perl

@Dan 55 - and it will do something useful in the real world. That is how brilliant Perl is!!

I take your Troll, and raise it.

0
0
Silver badge

Re: There's UTF-8 and utf8 in Perl

@Frumious Bandersnatch, "completely messed up rendering of some trivial glyphs (like em dash and currency symbols)" - my guess would be the HTML was saved in cp1252, and the browser guessed (or was told) it was iso8859-1. They are almost the same, apart from those glyphs.
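The cp1252 versus ISO 8859-1 guess above can be reproduced in a few lines of Python. This is only a sketch of the mechanism, assuming the page really was saved as cp1252 and read as ISO 8859-1. The sample text is made up.

```python
em_dash = "\u2014"

# What the authoring tool writes: in cp1252 the em dash is the single byte 0x97.
raw = em_dash.encode("cp1252")

# What a browser sees if it assumes ISO 8859-1: 0x97 is an invisible C1 control code.
misread = raw.decode("iso-8859-1")

# Ordinary accented letters are encoded identically, so the mistake is easy to miss.
same_for_accents = "caf\u00e9".encode("cp1252") == "caf\u00e9".encode("iso-8859-1")

print(raw, repr(misread), same_for_accents)
```

Only the characters cp1252 places in the 0x80 to 0x9F range (em dash, curly quotes, the euro sign and friends) go wrong, which matches the symptoms described.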

I was going to add a rant, but I couldn't decide whether it was against Microsoft's "embrace, extend, extinguish", incorrectly configured web servers or browsers silently "being helpful" and changing the encoding they're applying so that you can never figure out whether you've configured your web server correctly. Basically, Verity's right.

6
0
Gold badge
Unhappy

" Frumious Bandersnatch Ignore"

"Speaking of MS documents, I find it really incredible to come across HTML on the web that obviously came from MS Word initially and that has completely messed up rendering of some trivial glyphs (like em dash and currency symbols)."

So that's the source of that annoying little f**k up.

Word --> HTML.

Thanks for that. I've always wondered. I thought it was something to do with IE not liking any web server but IIS.

Still f**king annoying.

0
0
Anonymous Coward

Re: There's UTF-8 and utf8 in Perl

And there it is. My fledgling interest in learning perl. stone. cold. dead. Life is just too short to deal with so much silly.

Nah, Perl 5 isn't so bad. I still find it easier to bash out a quick hack in Perl than in, say, Python, or some other slightly less baroque language like Ruby... CPAN is something of a killer app.

Now Perl 6 on the other hand... I don't know if the designers coined the word "Twigil" and the concept it describes, but whoever did surely deserves a terrible, lasting punishment.

0
0
Bronze badge

Re: There's UTF-8 and utf8 in Perl

I'm no fan of Perl, but the utf8 / UTF-8 distinction is probably the best solution to a real problem.

Perl's original UTF-8 implementation ("utf8") was created before the format was standardized. Broadly speaking, it follows Postel's Interoperability Principle, and allows many sequences that were forbidden by the standard when it was finalized. That made it easier for people to start using UTF-8 with Perl.

Those sequences - such as non-minimal encodings - have bad security implications. They make it too easy to slip malicious data past poorly-designed filters (i.e., most filters), for example.

The later UTF-8 implementation follows the spec. It's good to have an implementation that follows the spec, and it's especially good when that implementation is a lot safer than the overly-permissive one it supersedes. But if Perl had simply dropped "utf8", it would have broken at least some old programs; and if it had made "utf8" a synonym for "UTF-8", some old data would have been rejected.
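A small Python sketch of why non-minimal encodings are a security problem. Python's built-in decoder stands in here for a spec-following implementation, and the hand-written `lenient_decode_two_bytes` helper is a hypothetical stand-in for a permissive one. Neither is Perl's actual code.

```python
OVERLONG_SLASH = b"\xc0\xaf"  # two-byte, non-minimal form of "/" (U+002F)


def lenient_decode_two_bytes(data):
    """Decode a two-byte sequence the permissive way, with no minimality check."""
    return chr(((data[0] & 0x1F) << 6) | (data[1] & 0x3F))


def strict_decode(data):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# A naive filter looking for the byte "/" does not see one...
print(b"/" in OVERLONG_SLASH)                    # False
# ...yet a permissive decoder still produces a slash from it.
print(lenient_decode_two_bytes(OVERLONG_SLASH))  # /
# A spec-following decoder refuses the sequence outright.
print(strict_decode(OVERLONG_SLASH))             # None
```

The filter and the permissive decoder disagree about what the bytes mean, and that gap is what an attacker exploits. The strict decoder closes the gap by refusing the input.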

0
0
Bronze badge
Trollface

UTF? WTF!

3
0
Silver badge

I don't have a gripe against utf-8

In fact it strikes me as a pretty good idea.

It's just that standard C doesn't have the tools to talk to it!

6
2
Bronze badge

Re: I don't have a gripe against utf-8

"It's just that standard C doesn't have to tools to talk to it!"

The whole point of utf-8 is that you don't need special tools to talk to it.

1
0
Silver badge

Re: I don't have a gripe against utf-8

Indeed. UTF-8 strings will sort in codepoint order if you give them to strcmp(), which is as good and bad as its behaviour in ASCII. However, anyone who thinks that strcmp() sorts strings in "alphabetical" order is at best living in a dream-world, or at worst, a hopeless xenophobe.
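The sorting claim can be checked with a short Python sketch. Byte-wise comparison of Python `bytes` objects stands in for `strcmp()` here, and the sample strings are picked for illustration. UTF-16 is included only for contrast.

```python
# Sample strings picked for this sketch: "z", e-acute, fullwidth A (U+FF21), an emoji.
words = ["\U0001F600", "\uff21", "z", "\u00e9"]

by_code_point = sorted(words)  # Python compares str values code point by code point
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16_bytes = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(by_utf8_bytes == by_code_point)   # True
print(by_utf16_bytes == by_code_point)  # False: the surrogate pair sorts before U+FF21

# Code point order is still not alphabetical order: "Z" sorts before "a".
print(sorted(["apple", "Zebra", "\u00e9clair"]))
```

Byte order and code point order agree for UTF-8 but not for UTF-16, and neither is what a dictionary would call alphabetical.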

As for the other "problem", that of length: strlen() returns the length of a UTF-8 string in bytes, and outside of font rendering engines, that is all you ever need to know to write proper text-processing code. Any argument to the contrary is based on a misplaced notion that somehow a byte is a character (I blame C).

"Character" is a very slippery definition, is language sensitive ("rijstafel" is 9 letters long if you're English; only eight if you're Dutch), and doesn't always correspond to the number of symbols the user sees anyway: when the five codes 's','o','e','u','r' are rendered as the four glyphs "sœur", how many characters really are in the string? (both answers are equally right and wrong, btw)
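A rough Python sketch of the byte versus "character" distinction. `len()` on the encoded bytes plays the role of `strlen()`, and the helper name `sizes` is made up for this example. It also shows, using Unicode normalization, that even the code point count is not a stable notion of length.

```python
import unicodedata

word = "caf\u00e9"                               # e-acute as one precomposed code point
decomposed = unicodedata.normalize("NFD", word)  # "e" followed by a combining accent


def sizes(text):
    """Return (code points, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


print(sizes(word))         # (4, 5)
print(sizes(decomposed))   # (5, 6)
print(word == decomposed)  # False, although both display as the same four glyphs
print(unicodedata.normalize("NFC", decomposed) == word)  # True
```

The byte count is the one number needed to allocate a buffer or copy the string. The other counts depend on which definition of "character" is meant.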

3
0
Bronze badge

Re: I don't have a gripe against utf-8

Any argument to the contrary is based on a misplaced notion that somehow a byte is a character (I blame C).

It's not C's fault. In C, a byte is a character (ISO 9899-1999 3.7.1). It's the fault of programmers who don't understand that a "character" in C is not the same as a "character" in some arbitrary natural-language writing system. (Generally, these are the same people who don't understand that a "byte" in C is not an octet.)

1
0
Coat

utf-8 ftw

that is all.

5
0
Bronze badge

Linux users ... who regarded GUIs in general as a barely satisfactory system for marshalling their half dozen terminal sessions.

Classic.

38
0

Linux

... their half dozen terminal sessions.

By sheerest coincidence, I have six tabs going in my Konsole shell window at the moment. That's on this machine.

And I code in C and Python, for stdin, stdout, and stderr.

0
0
Silver badge
Joke

Re: ... their half dozen terminal sessions.

HALF dozen? HALF dozen??

I trust you mean half dozen on each desktop!!!

9
0
Bronze badge

Re: ... their half dozen terminal sessions.

I ... rarely had more than three sessions going at a time. I feel so inadequate now.

3
0

Re: ... their half dozen terminal sessions.

I only have three open, but they're covered in dvtm :)

1
0
Anonymous Coward

Re: ... their half dozen terminal sessions.

> I trust you mean half dozen on each desktop!!!

Come on, Wilkinson. This is not the 80's any more. These are the days of retina displays.

That's at least a dozen terminals per desktop. Not counting tabbed sessions.

You've got to embrace progress.

3
0
Bronze badge

Re: ... their half dozen terminal sessions.

screen ftw. Some of my sessions are now firm family friends.

2
0

Hey hold it

I use Gimp now too!!

0
0
Bronze badge

Fair enough

You'll notice that the 4th line of HTML defining this page is

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

like most websites these days

14
0
Silver badge
Happy

@Verity

Thank you for using 'one' as a pronoun. One doesn't get to see it often nowadays, if at all.

8
0
Bronze badge
Headmaster

Re: @Verity

Indeed, one doesn't, and one is tempted to bemoan this lamentable situation. There was a time past when one could gleefully indulge in the subjective (and indeed the reflexive) willy nilly as it were, to one's own inner satisfaction and general merriment of all and sundry. One likes to avail oneself of these forms, if only for the inner satisfaction of reaffirming their very existence. Alas, such linguistic beauties have fallen away in our language, shunned by the masses for whom grammar is a small town in Eastern Prussia.

23
0

Re: @Verity

Surely any small town in Eastern Prussia is now in Western Poland and spelled entirely differently?

6
0

Re: @Verity

True, but you can't spell the new name in UTF-8.

7
0

Silver badge
Coat

@ Phillip Lewis Re: @Verity

Alas, such linguistic beauties have fallen away in our language, shunned by the masses for whom grammar is a small town in Eastern Prussia.

Nah, grammar is the wife of grampar...

13
0
Bronze badge

Re: @Verity

Sadly one cannot edit one's own posts here at The Reg. "subjective" should of course have been spelled "subjunctive". One apologises, humbly.

Which reminds me, where is my silver badge Mr. Moderator?

0
0
Headmaster

Re: @Verity

I believe they are now in north-eastern Poland and north-western Russia (Kaliningrad oblast).

1
0
Silver badge

Re: @Verity

I thought all of those had been slash-and-burned by the advance of Stalin's army (and possibly the retreat of Hitler's, too) so why name them at all?

0
0
Anonymous Coward

Re: @Verity

Former Eastern Prussia is actually North Eastern Poland. Except for the part which is now the Russian Kaliningrad enclave.

0
0
Silver badge

A good article, but I'm rather disappointed that it passed up the chance to mention how endianness can feck things up. Little endian (x86/Windows) being COMPLETELY WRONG of course.
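For anyone wondering how endianness bites, here is a minimal Python sketch. It assumes nothing beyond the standard codecs, and the variable names are illustrative.

```python
import codecs

text = "A"

little = text.encode("utf-16-le")  # b'A\x00'  low byte first
big = text.encode("utf-16-be")     # b'\x00A'  high byte first
utf8 = text.encode("utf-8")        # b'A'      single bytes, so no byte order to get wrong

# Reading little-endian data as big-endian silently yields a different character.
wrong = little.decode("utf-16-be")
print(f"U+{ord(wrong):04X}")       # U+4100

# The byte-order mark exists so that a reader can tell the two apart.
print(codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

UTF-8 is defined as a sequence of bytes, so the same file reads identically on any machine.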

14
7
