This document summarizes key information about the Open Chinese Convert (OpenCC) project to help readers quickly become familiar with its code structure, data organization, and accompanying tools.
- OpenCC is an open-source Chinese Simplified-Traditional and regional variant conversion tool, supporting Simplified↔Traditional, Hong Kong/Macau/Taiwan regional differences, Japanese Shinjitai/Kyujitai character forms, and other conversion schemes.
- The project provides a C++ core library, C language interface, command-line tools, as well as Python, Node.js and other language bindings. The dictionary and program are decoupled for easy customization and extension.
- Main dependencies: `rapidjson` for configuration parsing, `marisa-trie` for high-performance dictionaries (`.ocd2`), and optional `Darts` for legacy `.ocd` support.
- Dictionaries are maintained in `data/dictionary/*.txt`, covering phrases, characters, regional differences, Japanese new characters, and other topic files; they are converted to `.ocd2` during the build for acceleration.
- Default configurations are located in `data/config/`, such as `s2t.json`, `t2s.json`, `s2tw.json`, etc., defining segmenter types, dictionaries used, and combination methods.
- `data/scheme` and `data/scripts` provide dictionary compilation scripts and specification validation tools.
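As a rough illustration of what such a configuration defines, the sketch below mirrors the general layout of a file like `s2t.json` as a Python dictionary and walks it to list the dictionary files it references. The key names (`segmentation`, `conversion_chain`, `group`, `dicts`), the `mmseg` segmenter type, and the `STCharacters.ocd2` file name are assumptions recalled from the upstream repository rather than stated in this document, and `collect_dict_files` is a hypothetical helper. Verify against the actual files in `data/config/` before relying on any of them.

```python
import json

# Assumed layout, modelled on s2t.json; verify against data/config/ in the repository.
EXAMPLE_CONFIG = {
    "name": "Simplified Chinese to Traditional Chinese",
    "segmentation": {
        "type": "mmseg",
        "dict": {"type": "ocd2", "file": "STPhrases.ocd2"},
    },
    "conversion_chain": [
        {
            "dict": {
                "type": "group",
                "dicts": [
                    {"type": "ocd2", "file": "STPhrases.ocd2"},
                    {"type": "ocd2", "file": "STCharacters.ocd2"},
                ],
            }
        }
    ],
}


def collect_dict_files(node):
    """Hypothetical helper: return dictionary file names in the order they appear."""
    files = []
    if isinstance(node, dict):
        if "file" in node:
            files.append(node["file"])
        for value in node.values():
            files.extend(collect_dict_files(value))
    elif isinstance(node, list):
        for item in node:
            files.extend(collect_dict_files(item))
    return files


if __name__ == "__main__":
    print(json.dumps(EXAMPLE_CONFIG, indent=2))
    print(collect_dict_files(EXAMPLE_CONFIG))
```

Listing the referenced files this way is a quick check that every `.ocd2` a configuration needs is actually present before loading it.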
- `.ocd` (legacy format) has `OPENCCDARTS1` as the file header, with the main body being serialized Darts double-array trie data, combined with the `BinaryDict` structure to store key-value offsets and concatenation buffers. The loading process is detailed in `src/DartsDict.cpp` and `src/BinaryDict.cpp`. It is commonly used in environments that require `ENABLE_DARTS` for compatibility.
- `.ocd2` (default format) has `OPENCC_MARISA_0.2.5` as the file header, followed by `marisa::Trie` data, and then uses the `SerializedValues` module to store all candidate value lists. See `src/MarisaDict.cpp` and `src/SerializedValues.cpp` for details. This format is smaller and loads faster (e.g., `NEWS.md` records `STPhrases` reduced from 4.3MB to 924KB).
- The command-line tool `opencc_dict` supports `text ↔ ocd2` (and optionally `ocd`) conversion. When adding or adjusting dictionaries, first edit the `.txt` file, then run the tool to generate the target format.
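Because the two binary formats start with different headers, a file's format can be guessed from its first bytes. The following is a minimal sketch, assuming the header strings quoted above appear as plain ASCII at offset 0; `detect_dict_format` and `detect_file_format` are hypothetical helpers, not part of OpenCC.

```python
# Header strings as given in the text above; treated here as raw ASCII prefixes.
OCD_HEADER = b"OPENCCDARTS1"
OCD2_HEADER = b"OPENCC_MARISA_0.2.5"


def detect_dict_format(data: bytes) -> str:
    """Hypothetical helper: guess the dictionary format from the leading bytes."""
    if data.startswith(OCD2_HEADER):
        return "ocd2"
    if data.startswith(OCD_HEADER):
        return "ocd"
    return "text-or-unknown"


def detect_file_format(path: str) -> str:
    """Read just enough bytes from the file to check for a known header."""
    with open(path, "rb") as handle:
        head = handle.read(max(len(OCD_HEADER), len(OCD2_HEADER)))
    return detect_dict_format(head)
```

A check like this only inspects the prefix; it does not validate the trie or value data that follows.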
- The top-level build system supports CMake, Bazel, Node.js `binding.gyp`, and Python `pyproject.toml`, with cross-platform CI integration.
- The `src/*Test.cpp` files and the `test/` directory contain Google Test-style unit tests covering dictionary matching, conversion chains, segmentation, and other key logic.
- The tools `opencc_dict` and `opencc_phrase_extract` (`src/tools/`) help developers convert dictionary formats and extract phrases.
- The Python module is located in `python/`, providing the `OpenCC` class through the C API.
- The Node.js extension is in the `node/` directory, using N-API/Node-API to call the core library.
- The README lists third-party Swift, Java, Go, WebAssembly, and other ports, showcasing the breadth of the ecosystem.
- Edit or add dictionary entries in `data/dictionary/*.txt`.
- Use `opencc_dict` to convert them to `.ocd2`.
- Copy or modify a configuration JSON in `data/config` and specify the new dictionary files.
- Load the custom configuration through `SimpleConverter`, the command-line tools, or a language binding to verify the results.
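The steps above can be sketched in Python as three small helpers. Everything here is illustrative: the tab-separated text format (`key<TAB>value1 value2`), the `opencc_dict` flags (`-i`, `-o`, `-f`, `-t`), the config keys, and the file name `MyPhrases` are assumptions not stated in this document, and the helper names are hypothetical. The command is only built, not executed.

```python
import copy


def format_text_dictionary(entries):
    """Render entries as 'key<TAB>value1 value2 ...' lines (assumed text format)."""
    return "".join(
        f"{key}\t{' '.join(values)}\n" for key, values in sorted(entries.items())
    )


def opencc_dict_command(txt_path, ocd2_path):
    """Build (but do not run) an opencc_dict invocation; the flags are assumed."""
    return ["opencc_dict", "-i", txt_path, "-o", ocd2_path, "-f", "text", "-t", "ocd2"]


def add_dictionary_to_config(config, ocd2_name):
    """Return a copy of config with the new dictionary first in the first group.

    Assumes the first conversion_chain step holds a 'group' dict with a 'dicts' list.
    """
    updated = copy.deepcopy(config)
    group = updated["conversion_chain"][0]["dict"]
    group["dicts"].insert(0, {"type": "ocd2", "file": ocd2_name})
    return updated


if __name__ == "__main__":
    # Toy entry for illustration only.
    print(format_text_dictionary({"鼠标": ["滑鼠"]}), end="")
    print(" ".join(opencc_dict_command("MyPhrases.txt", "MyPhrases.ocd2")))
```

With OpenCC installed, the command list could be passed to `subprocess.run(..., check=True)`, and the updated configuration written out with `json.dump` before loading it through one of the entry points listed above. Placing the custom dictionary first in the group reflects the assumption that earlier dictionaries take priority.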
For a deeper understanding, read the module documentation in `src/README.md`, or refer to the test cases in `test/` to see how conversion chains are combined.
- Missing segmentation and conversion chain order: If the `group` configuration or dictionary priority is not reproduced, compound words may be split apart or overwritten by single characters.
- Missing longest prefix logic: Character-by-character replacement alone will miss idioms and multi-character word results.
- Improper UTF-8 handling: Overlooking multi-byte characters or surrogate pair handling can easily cause offset or truncation issues.
- Incomplete dictionaries/configuration: Missing segmentation dictionaries, regional differences, and other `.ocd2` files will result in missing words in the output.
- Path and loading process differences: If OpenCC's path search and configuration parsing details are not followed, the resources actually loaded will differ from the official ones, naturally leading to different results.
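The longest-prefix pitfall can be made concrete with a toy comparison. This is a minimal sketch of greedy longest-prefix matching against a single in-memory table, not OpenCC's actual implementation: it ignores segmentation, dictionary groups, and multi-step chains, and `TOY_DICT` is a three-entry illustration rather than real dictionary data. Python strings index by code point, so the sketch sidesteps the UTF-8 byte-offset issues a C or C++ port would have to handle.

```python
# Toy table for illustration only; real dictionaries are far larger.
TOY_DICT = {
    "头": "頭",
    "发": "發",
    "头发": "頭髮",
}


def convert_char_by_char(text, table):
    """Naive approach: replace each character independently."""
    return "".join(table.get(ch, ch) for ch in text)


def convert_longest_prefix(text, table):
    """At each position, try the longest key first and fall back to shorter ones."""
    max_len = max(map(len, table), default=1)
    out = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in table:
                out.append(table[piece])
                i += length
                break
        else:
            # No key matches here: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(convert_char_by_char("洗头发", TOY_DICT))    # 洗頭發
    print(convert_longest_prefix("洗头发", TOY_DICT))  # 洗頭髮
```

The character-by-character version picks the single-character mapping for `发`, while the longest-prefix version finds the two-character entry first and produces the phrase-level result.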
- CONTRIBUTING.md - Complete guide on how to contribute dictionary entries to OpenCC, write test cases, and execute testing procedures.
- src/README.md - Detailed technical documentation for core modules.
- README.md - Project overview, installation and usage guide.