UTF-16 anywhere?

by Alexander Gazarov in Everything sucks

Another question about Unicode popped up not so long ago in the "attention magnet" list of questions on the StackExchange network. Interestingly, even though the question itself is about the total number of characters possible in UTF-8, the discussion quickly jumped to the limitations of UTF-16 and how the standard needs to be phased out.

It probably all began with the StackOverflow question Should UTF-16 be Considered Harmful, which has since been migrated to Programmers.SE. The author of the top-voted post didn't stop at his answer, though, and now there's a whole website promoting UTF-8: utf8everywhere.org. I've been really curious about the effort he put into that.

The facts on the website look correct overall; the only strange claim is this one:

UTF-16 is often misused as a fixed-width encoding.

The problem is, there is no such thing as fixed-width UTF-16. If we always have two bytes per character, that's UCS-2, period. It's not UTF-16. That's definitely a problem, but we'll come back to it later.
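The difference is easy to observe; a minimal sketch (Python 3, nothing beyond the standard library): a code point outside the Basic Multilingual Plane takes two 16-bit code units in UTF-16, which is exactly what UCS-2 cannot represent.

```python
# UTF-16 is variable-width: non-BMP code points need a surrogate pair.
def utf16_code_units(s: str) -> int:
    # "utf-16-le" avoids the 2-byte BOM that plain "utf-16" prepends
    return len(s.encode("utf-16-le")) // 2

assert utf16_code_units("A") == 1            # one code unit: within UCS-2's reach
assert utf16_code_units("\U0001F600") == 2   # surrogate pair: beyond UCS-2
```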

The conclusion section is the most interesting one. Here we can take the arguments one by one.

UTF-16 is too wide.

Being too wide is not really much of a problem, since he himself argues earlier that compressed text size doesn't differ between UTF-8 and UTF-16. And for efficient communication it will always be better to use compression, so this "disadvantage" doesn't really hold water.

Where it would matter is in the in-memory representation, where compressing strings back and forth would put a certain tax on performance. But that doesn't necessarily mean we should simply use UTF-8. For example, JEP 254 on compacting strings in Java takes the approach of representing strings in Latin-1 where possible and in UTF-16 in all other cases. This can save memory even better than UTF-8: it is at least as efficient as UTF-8 for Latin-1 text, and for the rest of the BMP UTF-16 uses two bytes where UTF-8 would often use three.
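The size trade-off is easy to check. A quick sketch (Python 3 is used here only to count encoded bytes; JEP 254 itself concerns the JVM's internal string storage):

```python
# ASCII text: the one-byte form (Latin-1, or ASCII in UTF-8) wins over UTF-16.
ascii_text = "hello"
# BMP text beyond Latin-1, e.g. CJK: UTF-16 beats UTF-8.
cjk_text = "\u65e5\u672c\u8a9e"  # "Japanese" written in kanji

assert len(ascii_text.encode("utf-8")) == 5       # 1 byte per character
assert len(ascii_text.encode("utf-16-le")) == 10  # 2 bytes per character
assert len(cjk_text.encode("utf-8")) == 9         # 3 bytes per character
assert len(cjk_text.encode("utf-16-le")) == 6     # 2 bytes per character
```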

It exists only for historical reasons and creates a lot of confusion.

"Historical reasons" doesn't tell the whole story: it exists for compatibility reasons, namely compatibility with UCS-2. And so does UTF-8, to be compatible with ASCII. As for creating lots of confusion, we'll take a look at that later.

Adding wchar_t to the C++ standard was a mistake, and so are the Unicode additions to C++11.

I too believe that wchar_t shouldn't have been in the standard, but the thing is, the standard doesn't really say anything about the relation between wchar_t and Unicode. Isn't that the fault of the vague standard, then?

This is further supported by the website having a whole section on how to work with strings in C++ on Windows. Not just any language or any platform, but this combination in particular. It makes sense, of course, because the author (or at least one of them, since there are three names) mentions being primarily a Windows developer, but it still looks strange to pay so much attention to one specific case on a site dealing with encodings in general. Doesn't so much attention to this special case hint that the problem lies there, and not with UTF-16?

Safety is an important feature of every design, and encodings are no exception.

Here's a fun fact: using UTF-8 doesn't automatically mean that you get proper non-BMP code point support. I had my share of surprise when I discovered some time ago that MySQL has a so-called "4-byte UTF-8 Unicode encoding", utf8mb4. As it turns out, if you specify the regular utf8, it will not handle non-BMP characters, with much the same effect as declaring UTF-16 support while actually using UCS-2. And this is a relatively recent addition: this wonderful encoding appeared in MySQL 5.5.3 in 2010, while the crippled utf8 had been there long before that, successfully confusing people.
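The underlying arithmetic is simple. A small sketch (Python 3) of why a three-bytes-per-character "utf8" cannot store non-BMP text:

```python
# Real UTF-8 needs 4 bytes for code points above U+FFFF,
# which is exactly where MySQL's old "utf8" type caps out.
bmp_char = "\u20ac"          # EURO SIGN, inside the BMP
non_bmp_char = "\U0001F600"  # emoji, outside the BMP

assert len(bmp_char.encode("utf-8")) == 3      # fits the old 3-byte limit
assert len(non_bmp_char.encode("utf-8")) == 4  # requires 4 bytes
```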

So, as we can see, UTF-8 can have problems too. Of course, there are many more such cases with UTF-16. But there's nothing particularly wrong with UTF-16 itself: it is as valid a compatibility tool as UTF-8. What people are really complaining about is the large amount of software that works with UCS-2 while claiming to support UTF-16. This happened because UCS-2 was current for a noticeable period of time and a considerable amount of software embraced it back then. That is where all the confusion really comes from. Of course, when a transition has to happen later, it never goes smoothly, because legacy software and backward compatibility are always a thing. As a result we have to put up with the consequences, and there are serious problems with that indeed. But that's not UTF-16's fault, and it couldn't have happened any other way: without UTF-16, we would be stuck with UCS-2 and no hope of supporting non-BMP characters at all, since completely rewriting existing software is rarely an option.

It surely would be advantageous to use one encoding everywhere. If one day every piece of software started supporting one encoding and nothing else, it would be awesome, be it UTF-8 or UTF-16. However, the chances of this happening in the foreseeable future are nil. Let's take a look at the programming language rankings by RedMonk which I like so much. What is the string object representation in the top ten languages?

  • JavaScript - UTF-16;
  • Java - UTF-16;
  • PHP - no specific encoding;
  • Python - UTF-8;
  • C# - UTF-16;
  • C++ - no specific encoding;
  • Ruby - UTF-8;
  • CSS - N/A;
  • C - no specific encoding;
  • Objective-C - UTF-16.

Four out of nine languages (CSS doesn't count as a programming language) use UTF-16 and only two use UTF-8. One can argue that something like C++ could give the developer a choice of encoding, but that's not entirely the case, because you will have to take the platform and existing libraries into account, which is why there's a long section about Windows on utf8everywhere.org.

But even that's not all: the languages which use UTF-16 represent underlying platforms. Java represents the JVM, browser scripts have to compile to JavaScript (and not only browser scripts, if you remember node.js), and C# represents the .NET platform. We could say Objective-C is tied to a platform too, even though it is limited to Apple software. This means that anything built on these platforms will have to deal with UTF-16.

I'm sorry to break it to the zealous UTF-8 supporters, but UTF-16 isn't going anywhere anytime soon. Understand why it's here, understand how it works, understand that there are legacy functions and methods which deal with UTF-16 code units and not code points, and embrace this encoding. And please fix the legacy software if you are maintaining it.
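As an illustration of the code-unit pitfall, here is a sketch (Python 3, operating on raw UTF-16 bytes, since Python strings themselves index by code point): treating the first 16-bit code unit as the "first character" can cut a surrogate pair in half.

```python
# "\U0001F600" encodes to a surrogate pair: two UTF-16 code units.
data = "\U0001F600!".encode("utf-16-le")

# Naively taking the first "character" as the first code unit
# leaves a lone high surrogate, which is not valid UTF-16.
try:
    data[:2].decode("utf-16-le")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

assert not split_ok  # the pair was split mid-character
```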
