Comparing Traditional and Simplified Chinese

Posted 2012-01-16. Last updated 2012-01-16.

Overview

Presented here is a quantitative comparison of traditional and simplified Chinese characters and of some Chinese character sets. The charts and links below use a live Chinese Character Web API (of my own devising) that serves data provided by the Unicode Consortium.

First, let's get some character sets out on the table.

CJK Unified Ideographs. This is the set of 20,940 characters covered in Unicode 1.0. Even though Unicode 6.0 covers almost 94,000 characters, 20,940 is already more than any one person could realistically master. The general consensus is that functional literacy requires knowledge of between 3,000 and 4,000 characters. Yes, it's occasionally interesting to consider the full set of CJK Unified Ideographs, but it's not practical to expect to master it.

GB2312. Devised in 1980 for mainland China, this is still the character set to focus on if you are studying simplified Chinese. It consists of 6,763 characters. The newer versions of this set, namely GBK and GB18030, consist of the full set of CJK Unified Ideographs, which in my view makes them less useful to the learner. Examining GB2312 provides insight into the language that bigger character sets do not. Sticking with GB2312 helps avoid overload when trying to get a handle on the language.

Big5. Devised in 1984 for Taiwan, this is still the character set to focus on if you are studying traditional Chinese.

Big5 is split into two parts, though this does not seem to be readily acknowledged. Each part makes a complete pass through "the" dictionary. The first pass has 5,401 characters; the second pass has 7,660 characters. The first set of characters (which I call Big5a) seems to be the only useful part of Big5, at least for a learner of Chinese. I'm skeptical that even a native speaker has much use for, or knowledge of, the second set of characters (which I call Big5b).

The data are live

The data are live, and the regions in the pie charts have tool tips and can be clicked. The charts are Google Charts, with the Venn diagrams being a fairly old, weak, image-based solution (still Google).

Results in the form of character lists show up on the right side of the browser window (you need a fairly large screen). The character lists are crude but I hope self-explanatory. When variants are shown, T means traditional and S means simplified.

I tend to fall into the lingo I used for the Web API, such as using "!" to mean not. You can read more about the API and the lingo at http://ccdb.hemiola.com/#filter.

If you don't see any charts or links, something went wrong. Perhaps your browser is not modern enough. No attempt has been made to support old browsers. See here for a tiny bit more detail.

Character sets? Encoding? Huh?

When I refer to GB2312, I am referring to the set of characters it encompasses, not to an encoding standard. GB2312 characters can be encoded as Unicode for manipulation and further encoded as UTF-8 for over-the-wire transmission.

Some reading this page may think, "GB2312 is an old, stale encoding that causes nothing but compatibility issues." That line of thinking is unrelated to how I treat GB2312, which is as a set of characters. I wouldn't advocate using anything other than UTF-8 as an encoding.

If you're typing an e-mail that uses only GB2312 characters, or if you're displaying a web page that uses only Big5 characters, it's highly likely that Unicode and UTF-8 are used to render and display the content.
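
To make the distinction between a character set and an encoding concrete, here is a minimal sketch using Python's built-in `gb2312` and `big5` codecs. It is not part of my Web API, and the helper name `in_charset` is made up for illustration. The codecs are used only as a membership test; note that they also cover punctuation and other non-Chinese symbols, and that Python's `big5` codec does not distinguish Big5a from Big5b.

```python
def in_charset(ch: str, codec: str) -> bool:
    """Return True if the character can be encoded by the named codec.

    This treats the codec purely as a character set: we only ask whether
    the character is a member, and throw the encoded bytes away.
    """
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# 水 (water), 龙 (simplified dragon), 龍 (traditional dragon)
for ch in "水龙龍":
    print(
        ch,
        f"U+{ord(ch):04X}",                    # the Unicode code point
        "GB2312" if in_charset(ch, "gb2312") else "-",
        "Big5" if in_charset(ch, "big5") else "-",
        ch.encode("utf-8"),                    # the bytes sent over the wire
    )
```

The same character keeps the same Unicode code point and the same UTF-8 bytes no matter which character sets happen to include it.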

What do I mean by focusing on, sticking with, and studying one character set over another? I am referring to the tools (dictionaries) you may use when looking up characters. To avoid overwhelming yourself, you want a dictionary that gives you practical results. For example, if you're searching for the character pronounced peng, would you rather see around 20 possibilities or around 70, where 50 of those might be variant duplicates and generally unknown or obsolete characters?

Search on peng in GB2312

Search on peng in Big5a

Search on peng in CJK

Search on peng outside of GB2312 and Big5

If you're searching for a character with a 水 radical plus 8 strokes, would you rather see around 40 possibilities or around 110?

Search on 水+8 in GB2312

Search on 水+8 in Big5a

Search on 水+8 in CJK

Search on 水+8 outside of GB2312 and Big5

On the other hand, if you're looking up a character and can't find it, you might pull your hair out trying to determine if you missed it or if your dictionary doesn't cast a wide enough net. Unfortunately, there's no perfect balance. For example, if you're looking up a character in a Cantonese newspaper, you may need to widen your scope a bit; Big5a might not cut it.

Quantity of simplified characters

Let's examine the number of simplified characters in each major character set.

CJK Breakdown

GB2312 Breakdown

It's not effective to chart the number of simplified Big5 characters, as there are so few. Here's a summary:

Simplified Big5a

Simplified Big5b

Of those few simplified Big5 characters, GB2312 includes them all except:

Simplified Big5 missing from GB2312

GB2312 is a subset of CJK, so the difference in simplified characters is that GB2312 is missing some:

Simplified CJK missing from GB2312

GB2312 includes a handful of simplifiable characters:

Simplifiable GB2312

Summary of this section

We can see that:

There are 2,311 simplified characters in GB2312.
Big5 includes a small number of simplified characters (123).
There are 2,549 simplified characters in GBK (i.e., CJK Unified Ideographs).
The size of GB went from 6,763 to 20,902 (an increase of over 14,000) but added only 238 simplified characters. The rest are traditional characters, many of them infrequently used; about 2,500 of those are the traditional characters from which the simplified characters were derived.
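
The arithmetic behind that last point can be checked with a short sketch. The numbers are the ones quoted in the summary; the variable names are only illustrative.

```python
# Figures quoted in the summary above.
gb2312_total = 6_763
gbk_total = 20_902
gb2312_simplified = 2_311
gbk_simplified = 2_549

# How much the set grew, and how much of that growth was simplified characters.
added_total = gbk_total - gb2312_total
added_simplified = gbk_simplified - gb2312_simplified
added_other = added_total - added_simplified

print("characters added:           ", added_total)
print("simplified characters added:", added_simplified)
print("other characters added:     ", added_other)
print(f"simplified share of growth:  {added_simplified / added_total:.1%}")
```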

Some obscure cases

Let's look at a few obscure cases. Some of these may represent errors in the Unihan database. I haven't researched them linguistically.

In all of CJK, there is a unique character triple. The database indicates that a traditional character has been simplified, and the character it was simplified to has also been simplified. Does this make two traditionals and one simplified? Or one traditional and two simplifieds? It's mainly a curiosity, as there's just one.

Simplified and simplifiable

Another obscure case is where multiple traditional characters have been simplified to the same character, or where a single traditional character has multiple simplifications. Again, I don't know if any of these are errors in the database, but they are interesting to ponder:

Multiple traditional characters simplified to the same character

All the above simplified characters are in GB2312.

Multiple simplifications of the same character
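
For readers who want to hunt for these cases themselves, here is a rough sketch of how they can be detected from a list of (traditional, simplified) pairs. It does not reproduce my API or the Unihan data: the 發/髮 pair is a well-known real example, while `T1`, `S1`, `S2`, `X`, `Y`, and `Z` are made-up placeholders standing in for characters.

```python
from collections import defaultdict

# Toy (traditional, simplified) pairs. Only the first two are real;
# the rest are hypothetical placeholders, not Unihan data.
pairs = [
    ("發", "发"), ("髮", "发"),   # two traditionals share one simplified form
    ("T1", "S1"), ("T1", "S2"),  # hypothetical: one traditional, two simplifieds
    ("X", "Y"), ("Y", "Z"),      # hypothetical chain: Y is simplified AND simplifiable
]


def analyze(pairs):
    """Group the pairs both ways and report the three obscure cases."""
    to_simplified = defaultdict(set)   # traditional -> its simplified forms
    to_traditional = defaultdict(set)  # simplified -> its traditional forms
    for trad, simp in pairs:
        to_simplified[trad].add(simp)
        to_traditional[simp].add(trad)

    many_to_one = {s: t for s, t in to_traditional.items() if len(t) > 1}
    one_to_many = {t: s for t, s in to_simplified.items() if len(s) > 1}
    # Characters that appear on both sides form the middle of a "triple".
    both = set(to_simplified) & set(to_traditional)
    return many_to_one, one_to_many, both


many_to_one, one_to_many, both = analyze(pairs)
print("many traditional -> one simplified:", many_to_one)
print("one traditional -> many simplified:", one_to_many)
print("simplified and simplifiable:", both)
```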

A look at CJK

Here are two different looks at the CJK Unified Ideographs:

CJK Breakdown

We expect the percentages to look the same, because we expect the simplified characters to be the simplified counterparts of the same number of simplifiable traditional characters. But what is the subtle difference? First, let's look at a Venn diagram of the larger regions:

!Simplified CJK|!Simplifiable CJK

We expect the 2,620 "not simplified" characters to be the traditional characters that simplify down to the 2,548 "not simplifiable" characters. Notice these numbers do match the small slices in the pie charts except for being off by one. That's because of that one character that has both a traditional variant and a simplified variant, mentioned above.

Still, there is the difference between 2,620 and 2,548, which is 72. We'll assume this is accounted for by the multiple traditional variants (+77) and multiple simplified variants (-6) mentioned above, which net to 71. Again, off by one, probably because of that pesky special character.
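
The reconciliation above can be written out as a small sketch. The numbers come from the charts and lists on this page; attributing the leftover 1 to the special character is my assumption, not something the code proves.

```python
# Sizes of the two non-overlapping Venn regions.
not_simplified_only = 2_620    # simplifiable traditional characters
not_simplifiable_only = 2_548  # their simplified counterparts

observed_gap = not_simplified_only - not_simplifiable_only

# Adjustments from the obscure cases described earlier.
extra_traditional_variants = 77  # several traditionals share one simplified form
extra_simplified_variants = 6    # one traditional has several simplified forms

explained_gap = extra_traditional_variants - extra_simplified_variants
unexplained = observed_gap - explained_gap

print("observed gap: ", observed_gap)
print("explained gap:", explained_gap)
print("unexplained:  ", unexplained)
```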

Comparing GB2312 and Big5, part one

Note the comfortingly symmetrical breakdowns of GB2312 and Big5a:

GB2312 Breakdown

Big5a Breakdown

In both cases, there's about a 1/3 + 2/3 split.

In the GB2312 case, 1/3 of the characters are simplified, and 2/3 of the characters are traditional (i.e., not simplified).
In the Big5a case, 1/3 of the characters are simplifiable, and 2/3 of the characters are not simplifiable.
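
As a quick check of the "about 1/3" claim, here is a sketch using the set sizes given earlier together with the simplified and simplifiable counts from the charts (2,311 for GB2312 and 1,772 for Big5a). The variable names are only illustrative.

```python
gb2312_total, gb2312_simplified = 6_763, 2_311
big5a_total, big5a_simplifiable = 5_401, 1_772

gb_share = gb2312_simplified / gb2312_total
big5a_share = big5a_simplifiable / big5a_total

print(f"GB2312 simplified share:  {gb_share:.1%}")
print(f"Big5a simplifiable share: {big5a_share:.1%}")

# The shares are similar, but the absolute counts are not.
print("difference in counts:", gb2312_simplified - big5a_simplifiable)
```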

These two charts show a comforting similarity between GB2312 and Big5a, but if you look closer at how the sets relate, the comforting feeling falls apart somewhat. When you consider the difference in sizes of the two sets, you know that there cannot be a direct mapping between the two. Here's a comparison of the larger 2/3 areas of each:

!Simplified GB2312|!Simplifiable Big5a

There's a lot of overlap, but the non-overlapping areas are also fairly large. One question you might ask is, "Are the missing traditional GB2312 characters in Big5b?" Most of them are, but there are still about 200 traditional GB2312 characters that aren't in Big5 at all.

Traditional GB2312 covered in Big5b

Traditional GB2312 not in Big5

Conclusions to this section

GB2312 has 6,763 characters and Big5a has 5,401 characters, which is smaller by 1,362. There can't be a one-to-one mapping between the two. The range of 5,000 to 6,000 characters is about how many a native speaker would be exposed to. Big5b, by adding 7,660 infrequently used characters, is overwhelming to anyone, especially the learner.

It appears that GB2312 went a little further than Big5a in including infrequently used characters, but Big5 as a whole went a lot further than GB2312.

Comparing GB2312 and Big5, part two

Here are three simple comparisons of GB2312 and Big5, to visually show the overlap. Considering just Big5a gives the most overlap:

GB2312|Big5a

Note that a lot of the disparity comes from GB2312 having 2,311 simplified characters and Big5a having 1,772 simplifiable traditional characters. I can't show that visually using the current database capabilities. Maybe later....

Here's the GB2312 and Big5b overlap:

GB2312|Big5b

And finally, here's the consolidated GB2312 / Big5 overlap:

GB2312|Big5

Note that if the simplified characters in GB2312 were converted to traditional before computing the overlap, GB2312 would be almost completely contained in Big5. But is it worth dealing with 7,660 additional characters to achieve that? It depends on your perspective, of course.
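
If the live charts are unavailable, the overlaps can be approximated offline. The sketch below uses Python's built-in `gb2312` and `big5` codecs instead of my API, so the counts it prints may differ from the chart figures, which come from Unicode Consortium data. Splitting Big5 by encoded byte value assumes that Big5a corresponds to the first Big5 level (bytes up to 0xC67E); that correspondence is an assumption on my part, and the helper names are made up.

```python
def encodable(codec: str) -> set:
    """Characters in the CJK Unified Ideographs block (U+4E00..U+9FFF)
    that the named codec can encode."""
    chars = set()
    for cp in range(0x4E00, 0xA000):
        ch = chr(cp)
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            continue
        chars.add(ch)
    return chars


gb2312 = encodable("gb2312")
big5 = encodable("big5")

# Assumption: "Big5a" is the first Big5 level, encoded at or below 0xC67E.
big5a = {ch for ch in big5 if ch.encode("big5") <= b"\xc6\x7e"}
big5b = big5 - big5a

for name, other in (("Big5a", big5a), ("Big5b", big5b), ("Big5", big5)):
    print(
        f"GB2312|{name}:",
        "GB2312 only", len(gb2312 - other),
        "| both", len(gb2312 & other),
        f"| {name} only", len(other - gb2312),
    )
```

No conversion from simplified to traditional is attempted here, so the simplified characters in GB2312 land in the "GB2312 only" region, just as in the charts.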

Comments?

You can leave comments or questions on my blog.