Posted 2012-01-16. Last updated 2012-01-16.
Presented here is a quantitative comparison of traditional and
simplified Chinese characters and of some Chinese character sets. The
charts and links below use a live Chinese
Character Web API (of my own devising) that serves data provided by
the Unicode Consortium.
First, let's get some character sets out on the table.
Big5 is split into two parts, though this does not seem to be readily acknowledged. Each part makes a complete pass through "the" dictionary. The first pass has 5,401 characters; the second pass has 7,660 characters. The first set of characters (which I call Big5a) seems to be the only useful part of Big5, at least for a learner of Chinese. I'm skeptical that even a native speaker has much use for, or knowledge of, the second set of characters (which I call Big5b).
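One way to tell the two parts apart in code is by Big5 byte range. The sketch below assumes that Big5a and Big5b line up with the standard's frequently-used block (0xA440 to 0xC67E) and less-frequently-used block (0xC940 to 0xF9D5). That correspondence is my assumption, and counts obtained this way may differ slightly from the ones above. `big5_part` is a hypothetical helper name, and the result depends on Python's built-in `big5` codec.

```python
def big5_part(ch):
    """Return 'a', 'b', or None for a single character.

    Assumption: Big5a = frequently-used block (0xA440-0xC67E),
    Big5b = less-frequently-used block (0xC940-0xF9D5).
    """
    try:
        encoded = ch.encode("big5")
    except UnicodeEncodeError:
        return None  # not in Python's Big5 codec at all
    if len(encoded) != 2:
        return None  # ASCII and other single-byte characters
    code = int.from_bytes(encoded, "big")
    if 0xA440 <= code <= 0xC67E:
        return "a"
    if 0xC940 <= code <= 0xF9D5:
        return "b"
    return None  # symbols, punctuation, vendor extensions


for ch in "一水":
    print(ch, big5_part(ch))
```

Classifying by byte range avoids shipping a character list, at the cost of trusting whichever Big5 mapping the codec implements.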
If you're typing an e-mail that uses only GB2312 characters, or if you're displaying a web page that uses only Big5 characters, it's highly likely that Unicode and UTF-8 are used behind the scenes to encode and display the content.
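To make the distinction between a character set and its encoding concrete, here is a minimal sketch that encodes the same character under three codecs. It uses Python's built-in `gb2312`, `big5`, and `utf-8` codecs; `encodings_of` is a hypothetical helper name, and the sample characters are my own choice.

```python
def encodings_of(ch):
    """Map codec name -> hex bytes, or None if the codec lacks the character."""
    result = {}
    for codec in ("gb2312", "big5", "utf-8"):
        try:
            result[codec] = ch.encode(codec).hex()
        except UnicodeEncodeError:
            result[codec] = None
    return result


# 水 is common to both sets; 们 is a simplified form and 們 its traditional form.
for ch in "水们們":
    print(f"U+{ord(ch):04X}", ch, encodings_of(ch))
```

UTF-8 can encode all three characters, which is why it can carry text that happens to stay within either legacy set.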
What do I mean by focusing on, sticking with, and studying one character set over another? I am referring to the tools (dictionaries) you may use when looking up characters. To avoid overwhelming yourself, you want a dictionary that gives you practical results. For example, if you're searching for the character pronounced peng, would you rather see around 20 possibilities or around 70, where 50 of those might be variant duplicates and generally unknown or obsolete characters?
If you're searching for a character with a 水 radical plus 8 strokes, would you rather see around 40 possibilities or around 110?
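A dictionary tool could narrow its results along these lines. This is only a sketch: the candidate list is a tiny hand-picked sample of characters pronounced peng, not real lookup output, and Python's `gb2312` and `big5` codecs stand in for the character sets (the `big5` codec covers all of Big5, not just Big5a). The function names are hypothetical.

```python
def in_charset(ch, codec):
    """True if the codec can encode the character."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def filter_candidates(candidates, codec):
    """Keep only the candidates that belong to the chosen character set."""
    return [ch for ch in candidates if in_charset(ch, codec)]


# Hypothetical lookup results for "peng" (sample data, not from the API).
PENG_CANDIDATES = list("朋彭棚碰捧鹏鵬")

for codec in ("gb2312", "big5", "utf-8"):
    kept = filter_candidates(PENG_CANDIDATES, codec)
    print(codec, len(kept), "".join(kept))
```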
On the other hand, if you're looking up a character and can't find it, you might pull your hair out trying to determine if you missed it or if your dictionary doesn't cast a wide enough net. Unfortunately, there's no perfect balance. For example, if you're looking up a character in a Cantonese newspaper, you may need to widen your scope a bit; Big5a might not cut it.
Let's examine the number of simplified characters in each major character set.
It's not effective to chart the number of simplified Big5 characters, as there are so few. Here's a summary:
Of those few simplified Big5 characters, GB2312 includes them all except:
GB2312 is a subset of CJK, so the difference in simplified characters is that GB2312 is missing some:
GB2312 includes a handful of simplifiable characters:
We can see that:
Let's look at a few obscure cases. Some of these may represent errors in the Unihan database. I haven't researched them linguistically.
In all of CJK, there is a unique character triple. The database indicates that a traditional character has been simplified, and the character it was simplified to has also been simplified. Does this make two traditionals and one simplified? Or one traditional and two simplifieds? It's mainly a curiosity, as there's just one.
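A triple like this could be detected with a short scan over the variant data. The sketch below assumes the data has already been loaded into a dict that maps each character to the set of its simplified variants. The names `find_chains` and `simplified_of` are hypothetical, and the sample uses placeholder labels rather than the real characters.

```python
def find_chains(simplified_of):
    """Return (a, b, c) triples where a simplifies to b and b simplifies to c.

    simplified_of: dict mapping a character to the set of its simplified
    variants. Self-references are ignored.
    """
    chains = []
    for a, targets in simplified_of.items():
        for b in targets:
            if b == a:
                continue
            for c in simplified_of.get(b, ()):
                if c != b:
                    chains.append((a, b, c))
    return chains


# Placeholder data: "T" simplifies to "M", which itself simplifies to "S".
sample = {"T": {"M"}, "M": {"S"}, "X": {"Y"}}
print(find_chains(sample))
```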
Another obscure case is where multiple traditional characters have been simplified to the same character, or where a single traditional character has multiple simplifications. Again, I don't know if any of these are errors in the database, but they are interesting to ponder:
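Cases like these can be pulled out of the variant data mechanically. The sketch below assumes input lines in the tab-separated style of the Unihan variants file (code point, field name, values) and reads only the `kSimplifiedVariant` field. The sample lines are a hand-typed excerpt for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

# Hand-typed sample in the Unihan variants style (tab-separated fields).
SAMPLE_LINES = """\
U+767C\tkSimplifiedVariant\tU+53D1
U+9AEE\tkSimplifiedVariant\tU+53D1
U+53D1\tkTraditionalVariant\tU+767C U+9AEE
""".splitlines()


def parse_simplified(lines):
    """Build {traditional: {simplified, ...}} from kSimplifiedVariant lines."""
    simplified_of = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, field, values = line.split("\t", 2)
        if field != "kSimplifiedVariant":
            continue
        src = chr(int(source[2:], 16))
        for value in values.split():
            tgt = chr(int(value[2:], 16))
            if tgt != src:  # ignore self-references
                simplified_of[src].add(tgt)
    return simplified_of


def multi_mappings(simplified_of):
    """Return (many_to_one, one_to_many) dicts."""
    one_to_many = {t: s for t, s in simplified_of.items() if len(s) > 1}
    traditional_of = defaultdict(set)
    for trad, simps in simplified_of.items():
        for simp in simps:
            traditional_of[simp].add(trad)
    many_to_one = {s: t for s, t in traditional_of.items() if len(t) > 1}
    return many_to_one, one_to_many


many_to_one, one_to_many = multi_mappings(parse_simplified(SAMPLE_LINES))
print(many_to_one, one_to_many)
```

In the sample, 發 and 髮 both map to 发, so the simplified character shows up in `many_to_one` with two traditional sources.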
All the above simplified characters are in GB2312.
Here are two different looks at the CJK Unified Ideographs:
We expect the percentages to look the same, because we expect the simplified characters to be the simplified counterparts of the same number of simplifiable traditional characters. But what is the subtle difference? First, let's look at a Venn diagram of the larger regions:
We expect the 2,620 "not simplified" characters to be the traditional
characters that simplify down to the 2,548 "not simplifiable"
characters. Notice these numbers do match the small slices in the pie
charts except for being off by one. That's because of that one character
that has both a traditional variant and a simplified variant, mentioned
above.
Still, there is the difference between 2,620 and 2,548, which is 72.
We'll assume this is accounted for by the multiple traditional variants
(+77) and multiple simplified variants (-6) that are present, mentioned
above. Again, that's off by one, probably because of that pesky special character.
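The bookkeeping can be spelled out as a quick check. The numbers are the ones quoted in the text; the variable names are mine.

```python
not_simplified = 2620      # from the Venn diagram
not_simplifiable = 2548    # from the Venn diagram
extra_traditional = 77     # multiple traditional variants
extra_simplified = 6       # multiple simplified variants

observed_gap = not_simplified - not_simplifiable
explained_gap = extra_traditional - extra_simplified
unexplained = observed_gap - explained_gap

print(observed_gap, explained_gap, unexplained)
```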
Note the comfortingly symmetrical breakdowns of GB2312 and Big5a:
In both cases, there's about a 1/3 + 2/3 split.
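The split can be checked against the counts quoted in this post, assuming the smaller share is GB2312's 2,311 simplified characters out of 6,763 and Big5a's 1,772 simplifiable characters out of 5,401. A small sketch, with variable names of my own:

```python
# (set size, size of the simplified or simplifiable subset)
counts = {
    "GB2312 simplified": (6763, 2311),
    "Big5a simplifiable": (5401, 1772),
}

shares = {label: part / total for label, (total, part) in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%} vs. {1 - share:.1%}")
```

Both shares land within about a percentage point of one third.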
There's a lot of overlap, but the non-overlapping areas are also fairly large. One question you might ask is, "Are the missing traditional GB2312 characters in Big5b?" Most of them are, but there are still about 200 traditional GB2312 characters that aren't in Big5 at all.
GB2312 has 6,763 characters and Big5a has 5,401 characters, smaller by 1,362, so there can't be a one-to-one mapping between the two. The range of 5,000 or 6,000 characters is about how many a native speaker would be exposed to. Big5b, by adding 7,660 infrequently-used characters, is overwhelming to anyone, especially the learner.
It appears that GB2312 went a little further than Big5a in including infrequently-used characters, but Big5 as a whole went a lot further than GB2312.
Here are three simple comparisons of GB2312 and Big5, to visually show
the overlap. Considering just Big5a gives the most overlap:
Note that a lot of the disparity comes from GB2312 having 2,311
simplified characters and Big5a having 1,772 simplifiable traditional
characters. I can't show that visually using the current database
capabilities. Maybe later....
Here's the GB2312 and Big5b overlap:
And finally, here's the consolidated GB2312 / Big5 overlap:
Note that if the simplified characters in GB2312 were converted to traditional before computing the overlap, GB2312 would almost be completely contained inside of Big5. But is it worth dealing with 7,660 additional characters to achieve that? It depends on your perspective, of course.
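The conversion described here amounts to mapping each simplified character to its traditional form or forms before intersecting the sets. The sketch below uses tiny hand-picked stand-in sets, not the real character sets, and hypothetical names throughout.

```python
def to_traditional(charset, traditional_of):
    """Replace each character with its traditional variants, if it has any."""
    converted = set()
    for ch in charset:
        converted |= traditional_of.get(ch, {ch})
    return converted


# Toy stand-ins for the real sets (illustrative only).
gb_sample = {"水", "发", "们"}
big5_sample = {"水", "發", "髮", "們"}
traditional_of = {"发": {"發", "髮"}, "们": {"們"}}

overlap_before = gb_sample & big5_sample
converted = to_traditional(gb_sample, traditional_of)
overlap_after = converted & big5_sample

print(len(overlap_before), len(overlap_after), converted <= big5_sample)
```

Note that one simplified character can expand to several traditional ones, so the converted set can be larger than the original.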
You can leave comments or questions on my blog.