Comparing Traditional and Simplified Chinese

Posted 2012-01-16. Last updated 2012-01-16.

Overview

Presented here is a quantitative comparison of traditional and simplified Chinese characters and of some Chinese character sets. The charts and links below use a live Chinese Character Web API (of my own devising) that serves data provided by the Unicode Consortium.


First, let's get some character sets out on the table.





Big5 is split into two parts, though this does not seem to be readily acknowledged. Each part makes a complete pass through "the" dictionary. The first pass has 5,401 characters; the second pass has 7,660 characters. The first set of characters (which I call Big5a) seems to be the only useful part of Big5, at least for a learner of Chinese. I'm skeptical that even a native speaker has much use for, or knowledge of, the second set of characters (which I call Big5b).
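For the curious, here is a minimal Python sketch of that split. It assumes that Big5a and Big5b correspond to the two hanzi blocks of the Big5 code table (0xA440 to 0xC67E and 0xC940 to 0xF9D5), and it uses Python's built-in big5 codec rather than the Web API behind the charts. The function name big5_part is made up for illustration.

```python
def big5_part(ch):
    """Return 'a', 'b', or None for one character (hypothetical helper).

    Assumption: Big5a and Big5b are the two hanzi blocks of the Big5
    code table, identified here by their two-byte code ranges.
    """
    try:
        encoded = ch.encode("big5")
    except UnicodeEncodeError:
        return None  # the codec cannot represent this character
    if len(encoded) != 2:
        return None  # single-byte result, e.g. ASCII
    code = int.from_bytes(encoded, "big")
    if 0xA440 <= code <= 0xC67E:
        return "a"
    if 0xC940 <= code <= 0xF9D5:
        return "b"
    return None  # symbols and other non-hanzi areas


for ch in ["水", "龍", "A"]:
    print(ch, big5_part(ch))
```

Counts produced this way may differ slightly from the Unihan-based counts on this page, since the codec and the database are separate sources.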

The data are live

The data are live, and the regions in the pie charts have tool tips and can be clicked. The charts are Google Charts, with the Venn diagrams being a fairly old, weak, image-based solution (still Google).

Results in the form of character lists show up on the right side of the browser window (you need a fairly large screen). The character lists are crude but, I hope, self-explanatory. When variants are shown, T means traditional and S means simplified.

I tend to fall into the lingo I used for the Web API, such as using "!" to mean "not". You can read more about the API and the lingo at http://ccdb.hemiola.com/#filter.

If you don't see any charts or links, something went wrong. Perhaps your browser is not modern enough. No attempt has been made to support old browsers. See here for a tiny bit more detail.

Character sets? Encoding? Huh?

When I refer to GB2312, I am referring to the set of characters it encompasses, not to an encoding standard. GB2312 characters can be represented as Unicode code points for manipulation and then encoded as UTF-8 for over-the-wire transmission.


Some reading this page may think, "GB2312 is an old, stale encoding that causes nothing but compatibility issues." That line of thinking is unrelated to how I treat GB2312, which is as a set of characters. I wouldn't advocate using anything other than UTF-8 as an encoding.


If you're typing an e-mail that uses only GB2312 characters, or if you're displaying a web page that uses only Big5 characters, it's highly likely that Unicode and UTF-8 are used to render and display the content.
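To make the set-versus-encoding distinction concrete, here is a small Python sketch. It treats "can be encoded by the codec" as a rough stand-in for "belongs to the character set", which is my assumption and not how the Web API works; in_charset is a hypothetical helper name.

```python
def in_charset(ch, codec):
    """Rough membership test: can this codec represent the character?"""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


ch = "水"
print(hex(ord(ch)))         # the Unicode code point used for manipulation
print(ch.encode("utf-8"))   # the bytes actually sent over the wire

# The same question asked of two character sets, independent of encoding.
for candidate in ["水", "龙", "龍"]:
    print(candidate,
          "GB2312" if in_charset(candidate, "gb2312") else "-",
          "Big5" if in_charset(candidate, "big5") else "-")
```

Note that the codecs also cover punctuation and other symbols, so this test is broader than the hanzi-only sets counted in the charts.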


What do I mean by focusing on, sticking with, and studying one character set over another? I am referring to the tools (dictionaries) you may use when looking up characters. To avoid overwhelming yourself, you want a dictionary that gives you practical results. For example, if you're searching for the character pronounced peng, would you rather see around 20 possibilities or around 70, where 50 of those might be variant duplicates and generally unknown or obsolete characters?



If you're searching for a character with a 水 radical plus 8 strokes, would you rather see around 40 possibilities or around 110?



On the other hand, if you're looking up a character and can't find it, you might pull your hair out trying to determine if you missed it or if your dictionary doesn't cast a wide enough net. Unfortunately, there's no perfect balance. For example, if you're looking up a character in a Cantonese newspaper, you may need to widen your scope a bit; Big5a might not cut it.

Quantity of simplified characters

Let's examine the number of simplified characters in each major character set.


[Pie charts: CJK Breakdown, GB2312 Breakdown]


It's not effective to chart the number of simplified Big5 characters, as there are so few. Here's a summary:


Of those few simplified Big5 characters, GB2312 includes them all except:


GB2312 is a subset of CJK, so the difference in simplified characters is that GB2312 is missing some:


GB2312 includes a handful of simplifiable characters:

Summary of this section

We can see that:

Some obscure cases

Let's look at a few obscure cases. Some of these may represent errors in the Unihan database. I haven't researched them linguistically.


In all of CJK, there is a unique character triple. The database indicates that a traditional character has been simplified, and the character it was simplified to has also been simplified. Does this make two traditionals and one simplified? Or one traditional and two simplifieds? It's mainly a curiosity, as there's just one.


Another obscure case is where multiple traditional characters have been simplified to the same character, or where a single traditional character has multiple simplifications. Again, I don't know if any of these are errors in the database, but they are interesting to ponder:


All the above simplified characters are in GB2312.


A look at CJK

Here are two different looks at the CJK Unified Ideographs:


[Pie charts: CJK Breakdown (two views)]


We expect the percentages to look the same, because we expect the simplified characters to be the simplified counterparts of the same number of simplifiable traditional characters. But what is the subtle difference? First, let's look at a Venn diagram of the larger regions:


[Venn diagram: !Simplified CJK | !Simplifiable CJK]


We expect the 2,620 "not simplified" characters to be the traditional characters that simplify down to the 2,548 "not simplifiable" characters. Notice these numbers do match the small slices in the pie charts except for being off by one. That's because of that one character that has both a traditional variant and a simplified variant, mentioned above.
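The regions of that Venn diagram can be sketched with ordinary Python sets. The data below is a toy example that I made up for illustration, not output from the database, and it assumes that "!" means the complement within the character set being examined.

```python
# Toy data: three traditional -> simplified pairs, plus two characters
# that have neither a simplified nor a traditional counterpart.
pairs = {"龍": "龙", "馬": "马", "門": "门"}
unchanged = {"水", "人"}

universe = set(pairs) | set(pairs.values()) | unchanged
simplifiable = set(pairs)           # traditional forms that have a simplification
simplified = set(pairs.values())    # the simplified forms themselves

not_simplified = universe - simplified        # "!Simplified"
not_simplifiable = universe - simplifiable    # "!Simplifiable"

# The three regions of the Venn diagram.
only_not_simplified = not_simplified - not_simplifiable
only_not_simplifiable = not_simplifiable - not_simplified
overlap = not_simplified & not_simplifiable

print(len(only_not_simplified), len(overlap), len(only_not_simplifiable))
```

With strictly one-to-one pairs, the two outer regions come out the same size, which is the expectation described above.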


Still, there is a difference of 72 between 2,620 and 2,548. We'll assume this is accounted for by the multiple traditional variants (+77) and multiple simplified variants (-6) mentioned above, which net out to 71. Again, that is off by one, probably because of that pesky special character.
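The bookkeeping can be checked in a few lines of Python. The figures are the ones quoted on this page; the variable names are mine, and attributing the leftover 1 to the special character is the same guess as in the text, not something the code proves.

```python
not_simplified_only = 2620     # traditional characters that have a simplification
not_simplifiable_only = 2548   # simplified characters

observed_gap = not_simplified_only - not_simplifiable_only

extra_traditional = 77   # surplus from several traditionals sharing one simplified form
extra_simplified = 6     # surplus from one traditional having several simplified forms
explained_gap = extra_traditional - extra_simplified

unexplained = observed_gap - explained_gap
print(observed_gap, explained_gap, unexplained)
```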

Comparing GB2312 and Big5, part one

Note the comfortingly symmetrical breakdowns of GB2312 and Big5a:

[Pie charts: GB2312 Breakdown, Big5a Breakdown]


In both cases, there's about a 1/3 + 2/3 split.


These two charts show a comforting similarity between GB2312 and Big5a, but if you look closer at how the sets relate, the comforting feeling falls apart somewhat. When you consider the difference in sizes of the two sets, you know that there cannot be a direct mapping between the two. Here's a comparison of the larger 2/3 areas of each:

[Venn diagram: !Simplified GB2312 | !Simplifiable Big5a]


There's a lot of overlap, but the non-overlapping areas are also fairly large. One question you might ask is, "Are the missing traditional GB2312 characters in Big5b?" Most of them are, but there are still about 200 traditional GB2312 characters that aren't in Big5 at all.


Conclusions to this section

GB2312 has 6,763 characters and Big5a has 5,401 characters, which makes Big5a smaller by 1,362. There can't be a one-to-one mapping between the two. The range of 5,000 to 6,000 characters is about how many a native speaker would be exposed to. Big5b, by adding 7,660 infrequently used characters, is overwhelming to anyone, especially the learner.


It appears that GB2312 went a little further than Big5a in including infrequently used characters, but Big5 as a whole went a lot further than GB2312.

Comparing GB2312 and Big5, part two

Here are three simple comparisons of GB2312 and Big5, to visually show the overlap. Considering just Big5a gives the most overlap:

[Venn diagram: GB2312 | Big5a]


Note that a lot of the disparity comes from GB2312 having 2,311 simplified characters and Big5a having 1,772 simplifiable traditional characters. I can't show that visually using the current database capabilities. Maybe later....


Here's the GB2312 and Big5b overlap:

[Venn diagram: GB2312 | Big5b]


And finally, here's the consolidated GB2312 / Big5 overlap:

[Venn diagram: GB2312 | Big5]


Note that if the simplified characters in GB2312 were converted to traditional before computing the overlap, GB2312 would be almost completely contained in Big5. But is it worth dealing with 7,660 additional characters to achieve that? It depends on your perspective, of course.
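Here is a toy Python sketch of what "convert to traditional before computing the overlap" means. The three-character sets and the s2t mapping are made up for illustration and are nowhere near the real data; a real run would need the full variant mapping from the database.

```python
# Toy stand-ins for the real character sets.
gb_toy = {"龙", "马", "水"}          # GB2312-like: simplified forms plus shared characters
big5_toy = {"龍", "馬", "水", "乂"}   # Big5-like: traditional forms plus a rare character

# Hypothetical simplified -> traditional mapping for the toy data.
s2t = {"龙": "龍", "马": "馬"}

overlap_before = gb_toy & big5_toy

# Replace each simplified character with its traditional form where one is known.
converted = {s2t.get(ch, ch) for ch in gb_toy}
overlap_after = converted & big5_toy

print(len(overlap_before), len(overlap_after), converted <= big5_toy)
```

In the toy data the converted set ends up fully contained in the Big5-like set, which mirrors the "almost completely contained" observation above.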

Comments?

You can leave comments or questions on my blog.