How to Create a Glyph Input Method

Statement: This article is not a software development tutorial and will not teach you how to develop input software like "Some Dog Input Method" or "Some Baidu Input Method." This article does not involve program code and serves only as a source of inspiration for ideas. Input method experts are welcome to provide guidance.

If you have never encountered glyph input methods, you may first read these two articles by Beiqiao: After Trying Seven Glyph Input Methods, I Want to Talk About Using Wubi in 2022 and To Type More Comfortably, I Learned a Niche Input Method Pursuing Extreme Performance. The introduction of this article also provides a brief overview; if you still have questions, feel free to leave a message, and I will add more information.

About myself: I use Starry Sky Keyboard for chatting and typing, and I use Sanxu (a modified version of Xu Code for personal use) for formal typing, with a typing speed of 60 characters per minute. I have previously dabbled in Xiaohe Sound Glyph, Cangjie Five Generations, and Tiger Code, but abandoned them due to different needs.

Introduction#

Definition of Chinese Character Input Method#

It's actually quite simple: a Chinese character input method is a function that takes an encoding as input and outputs a set of Chinese characters.

Suppose $Z$ is a set of $n$ Chinese characters, $C$ is a set of $l$ subsets of $Z$ with $c_i$ its $i$-th element, and $M$ is a set of $l$ encodings with $m_i$ its $i$-th element; then $f: M \rightarrow C$ with $f(m_i) = c_i$ is the input method.
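As a minimal sketch of this definition, the function $f$ can be modeled as a lookup table from encodings to candidate lists. The names `CODE_TABLE` and `lookup` are illustrative assumptions, and the two toneless Pinyin entries are only a tiny sample.

```python
# Toy model of f: M -> C. Keys are encodings (elements of M);
# values are the candidate character sets (elements of C).
CODE_TABLE: dict[str, list[str]] = {
    "shi": ["是", "时", "事"],
    "zhong": ["中", "种", "重"],
}


def lookup(code: str) -> list[str]:
    """Return the candidates for an encoding, or an empty list if it is undefined."""
    return CODE_TABLE.get(code, [])


print(lookup("shi"))
```

An encoding that maps to more than one character is exactly what later sections call redundancy.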

Nowadays, most people in simplified character regions use Pinyin input methods, which take Pinyin as input and output Chinese characters; a small number of people use Wubi input methods, which take codes of at most four letters from A to Y.

In traditional Chinese character regions, there is also the Cangjie input method, which takes codes of at most five letters, and the Zhuyin input method, which takes codes composed of most of the keys on the keyboard.

In fact, typing using Unicode encoding can also be considered a type of input method.
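To illustrate, here is a small sketch in which the hexadecimal code point is the encoding and each encoding yields exactly one character. The function name `unicode_input` is a hypothetical helper, not part of any input platform.

```python
def unicode_input(code: str) -> str:
    """Treat a hexadecimal Unicode code point as the encoding of one character."""
    return chr(int(code, 16))


print(unicode_input("4E2D"))  # 中 (U+4E2D)
```

Such a scheme has no redundancy at all, but nobody can be expected to memorize the codes, which is why practical schemes encode pronunciation or shape instead.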

Glyph Input Method vs. Phonetic Input Method#

The Pinyin and Zhuyin mentioned above belong to phonetic input methods (hereinafter referred to as phonetic codes) that use the pronunciation of Chinese characters for encoding; Wubi and Cangjie belong to glyph input methods (hereinafter referred to as glyph codes) that use the shapes of Chinese characters for encoding; Xiaohe Sound Glyph's encoding method first uses phonetics and then uses glyphs, so it belongs to phonetic-glyph codes.

Today, although phonetic codes equipped with large dictionaries can meet the typing needs of daily communication, glyph codes still hold value due to their low redundancy and independence from pronunciation, so enthusiasts continue to develop new glyph codes.

Radicals and Splitting#

Radicals#

Radicals are the basic shape units into which glyph codes split characters. They resemble the dictionary radicals and components we learned in elementary school, but the set of input method radicals is often broader.

For example, the 86 version of the Wubi input method has a total of 234 radicals, among which "王", "土", and "日" are familiar, while the upper part of "炙" and the middle part of "互" may seem unfamiliar.

86 Version Wubi Radicals

In some glyph codes, complex characters may also be set as radicals to reduce the user's burden of splitting characters. For example, in Xu Code, "爾", "鹵", and "黽" are all radicals.

Splitting Characters#

Splitting characters means using the established radicals to decompose Chinese characters. This is actually something we experienced in elementary school—when learning new characters, the language teacher would teach us that they are composed of previously learned characters.

How to split simple characters should be straightforward for everyone: "叶" is naturally split into "口" and "十", and "音" is naturally split into "立" and "日". However, if the structure of the character is slightly more complex, it becomes challenging: should "戊" be split as "戈丿" or as "厂㇂丿丶"? At this point, we need rules that restrict the allowed splits so that no Chinese character has more than one split.

Taking 86 Wubi as an example, its rules in priority order are: consider intuitiveness, follow stroke order, take the larger first, can scatter not connect, and can connect not intersect.

Starting with taking the larger first and following stroke order:

"世" can be split as "一凵𠃊" or "廿𠃊"; because the first radical of the latter is larger than that of the former, we split it as "廿𠃊".
"夷" can be split as "一弓人" or "大弓", but "大弓" does not follow stroke order, and following stroke order takes priority over taking the larger, so we split it as "一弓人".

"Can scatter not connect" and "can connect not intersect" mean that splits whose radicals are separate are preferred over splits whose radicals touch, and splits whose radicals touch are preferred over splits whose radicals intersect. Some input methods add a further rule, "can intersect not break", meaning that splits whose radicals intersect are preferred over splits that break a single stroke apart. These three rules are collectively referred to as scatter, connect, intersect, break.

Considering intuitiveness is harder to pin down; it has become a catch-all rule in Wubi, where the author decides which splits count as intuitive. For example, in the following images of "或" and "戊", the former may be split as "戈口一" without following stroke order, while the latter must be split as "厂㇂丿丶" following stroke order. Such inconsistencies are common in 86 Wubi: why is "里" split as "日土" and not "甲二"? Why is "匹" split as "匚儿" and not "兀𠃊"?
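The priority between following stroke order and taking the larger first, as in the "世" and "夷" examples, can be sketched as a sort key. This is only an illustration: the `Split` record and its fields are hypothetical, the stroke-order flag is assumed to be judged by hand, and the remaining rules are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    radicals: tuple[str, ...]
    follows_stroke_order: bool    # judged by hand (hypothetical field)
    first_radical_strokes: int    # stroke count of the first radical


def choose_split(candidates: list[Split]) -> Split:
    # Stroke order outranks "take the larger first", so it comes first in the key.
    return max(
        candidates,
        key=lambda s: (s.follows_stroke_order, s.first_radical_strokes),
    )


shi = [Split(("一", "凵", "𠃊"), True, 1), Split(("廿", "𠃊"), True, 4)]
yi = [Split(("一", "弓", "人"), True, 1), Split(("大", "弓"), False, 3)]

print(choose_split(shi).radicals)
print(choose_split(yi).radicals)
```

Expressing the rules as one ordered key makes the priority explicit, which is exactly what a vague rule like "consider intuitiveness" cannot offer.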

Such vague rules are absolutely unacceptable; it is now recommended to use clearer and more rational rules. For example, the splitting rule for "匹" can be explained by introducing complete structure (radicals that exist in fully enclosed and semi-enclosed structures like "囗日目勹冂匚コ凵" should not be split apart). Additionally, the rule of original form radicals (if a radical's vertical stroke becomes a slant when used as a component, or if a horizontal stroke becomes a rising stroke when used as a component, it is considered a lower priority variant radical) can also explain some splitting ambiguities.

However, splitting rules should not be overly detailed; for instance, Zhang Code has made the rules extremely complex to reduce redundancy, which complicates things for users; Cangjie’s splitting method is also unique in the coding community. Input method authors should establish the most easily understood rules while ensuring the uniqueness of splits to reduce the cognitive burden on users.

Cangjie Splitting Overview, excerpted from the Cangjie textbook on Wikipedia

After splitting the commonly used 3500 characters (Level 1 characters from the General Standard Chinese Character List), we obtain a usable splitting table.

Adding and Removing Radicals#

Mr. Wang Yongmin, the author of Wubi, once said: "Generally speaking, a glyph scheme using 26 keys should select 150-250 radicals." Currently, most new input methods fall within this range after merging radicals, in accordance with statistical principles.

In four-code schemes, usually only the first three radicals and the last one take part in encoding a character, so the radicals in between contribute nothing. If a radical never takes part in any encoding, it can be deleted outright; if several characters share the same first three and last radicals, such as "赢", "嬴", and "贏", consider adding radicals that tell them apart, so that the redundancy is eliminated at the splitting stage.
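Both checks can be sketched in a few lines. The splitting table below is an illustrative component decomposition and not the split of any particular scheme; the function names are hypothetical.

```python
from collections import defaultdict


def key_radicals(radicals: list[str]) -> tuple[str, ...]:
    """Keep the first three radicals and the last one (all of them if four or fewer)."""
    if len(radicals) <= 4:
        return tuple(radicals)
    return tuple(radicals[:3]) + (radicals[-1],)


def find_collisions(table: dict[str, list[str]]) -> list[list[str]]:
    """Group characters whose encoding-relevant radicals are identical."""
    groups: dict[tuple[str, ...], list[str]] = defaultdict(list)
    for char, radicals in table.items():
        groups[key_radicals(radicals)].append(char)
    return [chars for chars in groups.values() if len(chars) > 1]


def unused_radicals(table: dict[str, list[str]], inventory: set[str]) -> set[str]:
    """Radicals in the inventory that never take part in any encoding."""
    used = {r for radicals in table.values() for r in key_radicals(radicals)}
    return inventory - used


# Illustrative decomposition only.
TABLE = {
    "赢": ["亡", "口", "月", "贝", "凡"],
    "嬴": ["亡", "口", "月", "女", "凡"],
    "贏": ["亡", "口", "月", "貝", "凡"],
    "叶": ["口", "十"],
}

print(find_collisions(TABLE))
```

With this toy table the three characters collide because their only differing radical sits in the ignored fourth position.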

If you want to learn more about the theoretical basis for quantitative analysis during the splitting phase, you can refer to Lan Luoxiao’s article Quantitative Evaluation of Splitting Tables.

Here, I also want to promote Lan Luoxiao: If you are interested in creating glyph codes and still find it difficult to grasp after reading this, please consider using the Chinese Character Automatic Splitting System developed under Lan Luoxiao's supervision (please ignore the registration function; just select an example to enter the main interface), which has no barriers to UI operation.

Chinese Character Automatic Splitting System

Encoding Method#

Radical Encoding#

Encoding radicals is a prerequisite for encoding single characters. Radical encoding generally follows certain rules; for example, Wubi uses the first two strokes to partition on the QWERTY keyboard, Zheng Code uses the first two strokes corresponding to letter sequences, Cangjie uses "日月金木水火土" corresponding to letter sequences, and glyph codes use the shapes of radicals corresponding to letters. This method of systematically encoding radicals based on glyph features is called shape support. Just as there are phonetic codes and glyph codes, radical encoding also has phonetic support and shape support, with Xiaohe Sound Glyph using phonetic support.

Some input methods, for better performance metrics, will use a completely random (hereinafter referred to as disorderly) method to encode radicals, such as the Sapphire input method.

There are various methods for encoding radicals, listed directly below:

Single encoding, represented by Wubi, where each radical is represented by one letter; for example, "十" and "二" are both encoded as F, and "五" and "一" are both encoded as G.
Double encoding, where most radicals are represented by two letters, referred to as big code and small code.
- Small code shape support, represented by Zheng Code, where the small code relates to the shape of the Chinese character; for example, the small code for "耳" is E because its structure contains "十".
- Small code phonetic support, where the small code relates to the pronunciation of the Chinese character, for example, in Xu Code, the small code for "自" is Z because its Pinyin is zi. Using phonetic support can reduce the memory load, making the memory difficulty of double-encoded radicals approach that of single encoding.
Three encodings and beyond... In fact, the length of radical encoding can be arbitrary, but if users cannot even remember the encoding of radicals, it is pointless to discuss subsequent single character encoding.

Single Character Encoding#

First, it is necessary to determine how many radicals participate in the encoding of a character. This article adopts the mainstream glyph code method, which is the aforementioned first three and last one. What if there are fewer than four radicals? We need to use other glyph features of the Chinese character to supplement the encoding; in Wubi, this is the strokes of the radicals or the structure of the Chinese character, while in Zheng Code, it is the small code of the radicals; of course, we can also use the pronunciation of the Chinese character to supplement the encoding, but that would turn it into a phonetic-glyph code.

For example, pairs like "呗" and "员", or "吧" and "邑", consist of identical radicals, so we need structure codes to differentiate them, using B for top-bottom structure and N for left-right structure. However, structure codes give users something new to memorize; this is where double-encoded radicals come into play: a small code can be appended to the encoding of the less frequent character, which separates the two encodings.

Suppose a Chinese character is split into several radicals, each with an encoding. The radicals' encodings are labeled in order with uppercase Latin letters ABCD...WXYZ; specifically, Y and Z denote the second-to-last and last radicals. Strokes of a radical are labeled with lowercase Greek letters αβ...ω; specifically, ω denotes the last stroke. Ω denotes the glyph structure code.

Then the encoding table for Wubi is:

Single-radical characters
- Representing radical: AAAA
- Non-representing radical: Aαβω
Multi-radical characters
- Two radicals: AZΩ
- Three radicals: ABZΩ
- Four or more radicals: ABCZ
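The table above can be read as a small function. The sketch below is a simplified reading of it, not a full Wubi implementation: `encode_character` and its parameters are hypothetical, the caller supplies the key of each radical, and edge cases such as radicals with fewer than three strokes are ignored.

```python
def encode_character(
    radical_keys: list[str],
    key_name: bool = False,
    strokes: str = "",
    structure: str = "",
) -> str:
    """Build a full code from the keys of a character's radicals.

    radical_keys: one key per radical, in split order.
    key_name:     True if a single radical is the representing radical of its key.
    strokes:      keys of the first, second and last stroke (three letters).
    structure:    the one-letter structure code (Ω in the table).
    """
    n = len(radical_keys)
    if n == 0:
        raise ValueError("a character needs at least one radical")
    if n == 1:
        a = radical_keys[0]
        if key_name:
            return a * 4                                   # AAAA
        if len(strokes) != 3:
            raise ValueError("expected keys for first, second and last stroke")
        return a + strokes                                 # Aαβω
    if n <= 3:
        if len(structure) != 1:
            raise ValueError("expected a one-letter structure code")
        return "".join(radical_keys) + structure           # AZΩ or ABZΩ
    return "".join(radical_keys[:3]) + radical_keys[-1]    # ABCZ


print(encode_character(["g"], key_name=True))  # gggg, as for "王" on the G key
```

Apart from the "王" example, any letters passed in are placeholders; a real table would look the keys up from the radical encoding chosen earlier.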

If you want to learn more about input method encoding methods, you can refer to Zhu Yuhao's article Common Glyph Input Scheme Encoding Rules.

Word Encoding (Optional)#

If you do not want to type only single characters, word encoding is essential. Fortunately, there is currently a recognized method for word encoding in the input method community, so there is no need to spend any brainpower thinking of new methods. Just as radicals form single characters, single characters form words, so the method for encoding words is similar to that for single characters:

Each character has an encoding; the first codes of the characters are labeled in order with uppercase Latin letters ABCD...WXYZ, where Y and Z denote the second-to-last and last characters. The second code of each character is labeled with the corresponding lowercase Latin letter abcd...wxyz.

Then the word encoding table for glyph codes is:

- Two-character words: AaBb
- Three-character words: ABCc
- Four-character words and above: ABCZ
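The word rules above translate directly into code. In this sketch, `encode_word` is a hypothetical helper that takes the full codes of the word's characters, and the codes used in the example are made-up placeholders.

```python
def encode_word(char_codes: list[str]) -> str:
    """Combine the full codes of a word's characters into one word code."""
    n = len(char_codes)
    if n < 2:
        raise ValueError("a word needs at least two characters")
    if n == 2:
        a, b = char_codes
        return a[:2] + b[:2]                    # AaBb
    if n == 3:
        a, b, c = char_codes
        return a[0] + b[0] + c[:2]              # ABCc
    first_three = "".join(code[0] for code in char_codes[:3])
    return first_three + char_codes[-1][0]      # ABCZ


# Placeholder full codes, not taken from a real scheme.
print(encode_word(["abcd", "efgh"]))
print(encode_word(["abcd", "efgh", "ijkl"]))
```

Because every word code is built from existing character codes, users do not have to memorize anything new to type words.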

Performance Optimization#

Performance Metrics#

Another widely quoted remark from Mr. Wang Yongmin, the author of Wubi: "A high-level 'glyph code' scheme must possess the three characteristics of compatibility, regularity, and harmony." Compatibility means low redundancy; regularity means easy to learn; harmony means a comfortable typing feel. These three cannot all be achieved at once. Input methods often sacrifice one advantage to gain another; for example, the currently popular disorderly input methods (Tiger Code, Sapphire, Yima) have sacrificed regularity to enhance the other two, while the Xu Code I use has largely sacrificed harmony in pursuit of compatibility with a large character set.

Static redundancy count: traverse all encodings and add up the sizes of the output character sets that contain at least two characters; reflects compatibility.
Dynamic redundancy rate: sort each output character set by character frequency, remove the first element, and sum the frequencies of the remaining elements; reflects compatibility.
Average code length: the sum over all characters of encoding length times character frequency, where a non-preferred character counts one extra keystroke for selecting it; positively correlated with the dynamic redundancy rate. Characters typed per minute = keystrokes per minute / average code length.
Speed equivalent: a measure of how comfortable consecutive key positions are, derived statistically from over two million experimental data points; for details, refer to the paper Research on Speed Equivalents Related to Key Positions. Reflects harmony.
Left-right hand alternating hit rate: the total frequency, within the character set, of characters whose encodings alternate between the left and right hands; reflects harmony.
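As a sketch of the last metric, the alternating hit rate can be computed from a code table and a frequency table. The hand assignment below assumes the usual QWERTY touch-typing split, and the characters, codes and frequencies are placeholders.

```python
LEFT_HAND = set("qwertasdfgzxcvb")   # assumed QWERTY touch-typing split


def alternates(code: str) -> bool:
    """True if every pair of adjacent keys is typed by different hands."""
    return all((x in LEFT_HAND) != (y in LEFT_HAND) for x, y in zip(code, code[1:]))


def alternation_rate(codes: dict[str, str], freq: dict[str, float]) -> float:
    """Total frequency of the characters whose encoding alternates hands."""
    return sum(freq[ch] for ch, code in codes.items() if alternates(code))


# Placeholder data: three characters with made-up codes and frequencies.
codes = {"甲": "fjfj", "乙": "fdsa", "丙": "jf"}
freq = {"甲": 0.5, "乙": 0.25, "丙": 0.25}

print(alternation_rate(codes, freq))  # 0.75
```

Note that a one-letter code counts as alternating here because it has no adjacent pair; whether that is desirable is a design choice.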

Do you remember the mathematical definition of input methods in the introduction? We can derive the mathematical definitions of the above metrics from it.

Let $p: Z \rightarrow [0,1]$ map each Chinese character to its frequency in a given text corpus. Let $I = \{1, \dots, l\}$ index the encodings and $J_i = \{1, \dots, |c_i|\}$ index the characters of $c_i$. Sort each character set in $C$ by character frequency in descending order, so that $c_{ij}$ is the $j$-th Chinese character encoded as $m_i$, where $i \in I$, $j \in J_i$, and $a \leq b$ implies $p(c_{ia}) \geq p(c_{ib})$.

Static redundancy count: $N_{s} = \left| \{ c_{ia}, c_{ib} \mid i \in I,\ a, b \in J_i,\ a \neq b \} \right|.$

Dynamic redundancy rate: $N_{d} = \sum\limits_{i \in I,\ j \in J_i \setminus \{1\}} p(c_{ij}).$
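These definitions can be computed directly from a code table. The sketch below is one reading of them: the static count sums the characters that share an encoding, the helper names are hypothetical, and the characters, codes and frequencies are placeholders.

```python
from collections import defaultdict


def group_by_code(codes: dict[str, str], freq: dict[str, float]) -> dict[str, list[str]]:
    """Map each encoding to its characters, most frequent first."""
    groups: dict[str, list[str]] = defaultdict(list)
    for ch, code in codes.items():
        groups[code].append(ch)
    for chars in groups.values():
        chars.sort(key=lambda ch: freq[ch], reverse=True)
    return dict(groups)


def static_redundancy(groups: dict[str, list[str]]) -> int:
    """Number of characters that share their encoding with another character."""
    return sum(len(chars) for chars in groups.values() if len(chars) >= 2)


def dynamic_redundancy(groups: dict[str, list[str]], freq: dict[str, float]) -> float:
    """Total frequency of all non-preferred characters."""
    return sum(freq[ch] for chars in groups.values() for ch in chars[1:])


def average_code_length(groups: dict[str, list[str]], freq: dict[str, float]) -> float:
    """Frequency-weighted code length; non-preferred characters cost one extra key."""
    total = 0.0
    for code, chars in groups.items():
        for rank, ch in enumerate(chars):
            total += (len(code) + (1 if rank > 0 else 0)) * freq[ch]
    return total


# Placeholder data.
codes = {"甲": "ab", "乙": "ab", "丙": "cd", "丁": "e"}
freq = {"甲": 0.5, "乙": 0.125, "丙": 0.25, "丁": 0.125}
groups = group_by_code(codes, freq)

print(static_redundancy(groups), dynamic_redundancy(groups, freq), average_code_length(groups, freq))
```

In this toy table only "甲" and "乙" collide, so the static count is 2 and the dynamic rate is the frequency of the less common of the two.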

If you do not understand how to calculate these performance metrics, you can also use the online Tiger Code Evaluation Website.

Tiger Code Evaluation Website

Simplified Codes for Full Codes#

The full codes of Wubi's single characters and words are all at least three letters long, because room was left for simplified codes when the encoding rules were designed. Simplified codes are shorter encodings: for example, the Wubi encoding for "的" is rqyy, but in practice you only need to type r and then press the spacebar to get "的". Simplified codes that keep only the first code are called first-level simplified codes (abbreviated as one simplified); similarly, there are two simplified and three simplified codes.

Simplified codes are the simplest way to improve input efficiency. The total frequency of the 26 most common characters is 0.26; if every single character has a four-code full encoding, giving these characters one simplified codes (one letter plus the spacebar, two keystrokes instead of four) reduces the average code length by about 0.5. The total frequency of the characters ranked 27th to 702nd is 0.57, so also giving them two simplified codes (three keystrokes instead of four) brings the total reduction to 1.09. Since typing speed = number of keystrokes / code length, ideally filling all one and two simplified codes improves typing speed by about one-third.
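The arithmetic can be checked in a few lines. This sketch assumes that a four-code full encoding needs no spacebar while every simplified code ends with one; under that assumption the quoted figures are reproduced.

```python
FULL_KEYS = 4            # four-code full encoding, assumed to need no spacebar
ONE_SIMP_KEYS = 1 + 1    # one letter plus the spacebar
TWO_SIMP_KEYS = 2 + 1    # two letters plus the spacebar

freq_top_26 = 0.26       # total frequency of the 26 most common characters
freq_27_to_702 = 0.57    # total frequency of the characters ranked 27 to 702

saving_one = (FULL_KEYS - ONE_SIMP_KEYS) * freq_top_26
saving_two = (FULL_KEYS - TWO_SIMP_KEYS) * freq_27_to_702
new_length = FULL_KEYS - saving_one - saving_two
speedup = FULL_KEYS / new_length - 1

print(round(saving_one, 2), round(saving_one + saving_two, 2), round(speedup, 2))
# 0.52 1.09 0.37
```

The average code length drops from 4 to about 2.91, so at the same keystroke rate the typing speed rises by roughly 37 percent, in line with the "one-third" estimate.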

The benefits of simplified codes go further. In contrast to a simplified code, the original encoding of a single character is called its full code. Once the highest-frequency character at a full code position has a simplified code, the character with the second-highest frequency can take over that position; this practice is called letting the full code, meaning that a character which already has a simplified code gives up its full code position. Existing redundancy can be eliminated this way; even though three simplified codes bring no gain in code length, they can still be used to reduce redundancy.
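A minimal sketch of letting the full code, assuming the full-code table is already grouped by encoding with the most frequent character first. The function name and the placeholder characters and codes are hypothetical, and keeping an uncontested slot is a design choice of this sketch.

```python
def let_full_codes(groups: dict[str, list[str]], simplified: set[str]) -> dict[str, list[str]]:
    """Remove characters that already have a simplified code from contested full codes."""
    result: dict[str, list[str]] = {}
    for code, chars in groups.items():
        remaining = [ch for ch in chars if ch not in simplified]
        # If nobody else wants the slot, the character simply keeps it.
        result[code] = remaining or list(chars)
    return result


# Placeholder data: "甲" and "丙" already have simplified codes.
groups = {"abcd": ["甲", "乙"], "efgh": ["丙"]}
print(let_full_codes(groups, {"甲", "丙"}))
```

After the step, "乙" becomes the only candidate at "abcd", so that position no longer contributes to redundancy.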

With so many benefits, what is the cost? Simplified codes are essentially irrational codes with no rule behind them; the more there are, the heavier the user's memory burden becomes. If you cannot remember them, you will waste time looking at the candidate box, which lowers your keystrokes per minute and is counterproductive. It is recommended to only implement one simplified codes, let users set two simplified codes themselves, or only implement simplified codes without letting full codes, and not to set three simplified codes.

Global Optimization#

Global optimization generally uses simulated annealing; for programs, please refer to Principles and Applications of Simulated Annealing Algorithms, Input Method Optimization, Character Frequency, Word Frequency Statistical Algorithm Source Code Sharing, and Yuhao Input Method Development Technical Documentation. This article will not elaborate further.

Since the principle of the annealing algorithm is based on collision probability, setting some constraints (such as radicals that cannot be on the same key position and those that must be on the same key position) can effectively reduce useless radical arrangements, thereby improving algorithm efficiency. When setting constraints, it is advisable to adopt the wisdom of predecessors. For example, before designing Sanxu, I was always curious why regular double-encoded glyph codes must set two small codes under a large code as fixed irregular main radicals. After I started optimizing, I realized that if two main radicals are not produced, redundancy will inevitably increase.
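To make the idea concrete, here is a toy simulated annealing loop that assigns radicals to keys while honoring pinned radicals. It is a sketch under several assumptions and not the method of any referenced program: the cost counts only colliding characters, the cooling schedule is geometric, and the radical names and splits are placeholders.

```python
import math
import random
from collections import Counter

KEYS = "abcdefghijklmnopqrstuvwxyz"


def redundancy(assignment: dict[str, str], splits: dict[str, list[str]]) -> int:
    """Count characters whose code is shared with at least one other character."""
    counts = Counter("".join(assignment[r] for r in rads) for rads in splits.values())
    return sum(n for n in counts.values() if n > 1)


def anneal(splits, fixed=None, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Search for a radical-to-key assignment with few collisions."""
    rng = random.Random(seed)
    fixed = dict(fixed or {})
    radicals = sorted({r for rads in splits.values() for r in rads})
    movable = [r for r in radicals if r not in fixed]   # constraint: pinned radicals stay
    current = {r: fixed.get(r) or rng.choice(KEYS) for r in radicals}
    cost = redundancy(current, splits)
    best, best_cost = dict(current), cost
    if not movable:
        return best, best_cost
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / steps)   # geometric cooling
        r = rng.choice(movable)
        old = current[r]
        current[r] = rng.choice(KEYS)
        new_cost = redundancy(current, splits)
        delta = new_cost - cost
        # Always accept improvements; accept worse moves with probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(current), cost
        else:
            current[r] = old
    return best, best_cost


# Placeholder splitting table with abstract radical names.
SPLITS = {"c1": ["r1", "r2"], "c2": ["r1", "r3"], "c3": ["r2", "r3"], "c4": ["r3", "r1"]}
best, best_cost = anneal(SPLITS, fixed={"r1": "a"})
print(best_cost)
```

Pinning radicals shrinks the search space, which is the effect of constraints described above; a real cost function would also weigh character frequency and key comfort.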

Conclusion#

When your input method is complete, do not forget to export the code table for enthusiasts to use. The code table format supported by most input platforms puts one entry per line: the character, a tab, then the encoding; a few input platforms use the reverse order. If you are confident in your input method, you can post it to the Wubi forum to promote it. Input methods rely on community support; once you have users, you can refine the input method based on their feedback.
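Exporting in this format takes only a few lines. In the sketch below, `export_code_table` and its `char_first` flag are hypothetical names; check the documentation of your target platform for the exact column order and any header it expects.

```python
import io


def export_code_table(codes: dict[str, str], out, char_first: bool = True) -> None:
    """Write one tab-separated entry per line: character then code, or the reverse."""
    for char, code in codes.items():
        left, right = (char, code) if char_first else (code, char)
        out.write(f"{left}\t{right}\n")


buffer = io.StringIO()
export_code_table({"的": "rqyy"}, buffer)
print(buffer.getvalue(), end="")
```

To write a real file, pass a handle opened with `open(path, "w", encoding="utf-8")` instead of the in-memory buffer.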

Of course, there is an old saying in cryptography: "Do not design your own encryption algorithm," and I believe the same applies to input methods. Before truly designing one, it might be worthwhile to check if there is an existing input method that meets your needs; using something ready-made is always simpler.

Casual Talk: The Philosophy of Choosing an Input Method#

Currently, Unicode has recorded about 100,000 Chinese characters, meaning that electronic devices can display 100,000 Chinese characters once fonts are installed, and this number is expected to continue to grow (over 4,000 new characters have been added to the CJK-J area). Not all of these characters have pronunciations that survive to the present; others do have pronunciations, but you would generally not know them without checking a dictionary. To type such characters, you must use glyph codes. In summary, is a large character set important? Not really. Even if your name contains rare characters, you can simply add that character to the code table; there is no need to look for an input method that can type the entire character set. In my view, splitting a large character set is a pleasure for Chinese character enthusiasts rather than a necessity.

Is typing speed important? Of course, but consider whether you have the perseverance to practice. The barrel effect of typing speed is very evident, and the performance of the input method is actually the longest board in it. To improve typing speed, regardless of which input method you use, only long-term practice can achieve it. Do not think that choosing the best-performing input method will solve everything; speed experts do not become experts because they chose a better input method, but because speed experts invented better input methods to further improve their skills. If you cannot even reach a hundred characters per minute with non-watered-down text, how can you discuss input method performance?

Is simplified and traditional character input important? Of course not; it is a niche demand within a niche demand. The main issue is that OpenCC conversion has some shortcomings, and some variant characters do not need to be converted: for example, I have always believed that the left-right form "群" displays better on electronic devices than "羣".

Ultimately, is glyph code important? Not really. The learning cost of double pinyin is low, and reducing the length of full pinyin codes has immediate effects. Input methods exist to input text, and the knowledge carried by text is the ladder to human understanding.