Numbered Pinyin issues encountered in CEDICT #29

@bai-yi-bai

Hi, Dragonmapper is an awesome library. I am using it (0.2.6) for many projects, which use CEDICT as a data source for further text processing. I found problems with numbered pinyin, accented pinyin, and zhuyin fuhao transcriptions.

Before I begin, I want to note that I am not a Mandarin expert, so I don't know whether my suggestions are the correct ones. Many of my suggested clean-up edits to CEDICT have been accepted. However, since CEDICT is not in a standard format like .csv, I had to build my own parser, read the data line by line, and .split() it to feed Dragonmapper. I'm not sure whether every issue I've discovered should be solved by Dragonmapper; I will simply present the problems I needed to work around and leave them up for discussion.

Issues

  1. Numbered Pinyin do not convert to Accented Pinyin
  2. Accented pinyin which do not convert to zhuyin fuhao
  3. Already noted in issue 27
  4. Taiwanese pronunciation exceptions

Numbered Pinyin do not convert to Accented Pinyin

More than 2000 entries in CEDICT contain 'u:' combinations, and a combined 5 entries use 'yo1' or 'yo5'. Dragonmapper cannot convert these items from numbered pinyin to accented pinyin. As a workaround, I found it necessary to loop through the following replacements in this order:

  • 'u:4', 'ǜ'
  • 'u:3', 'ǚ'
  • 'u:2', 'ǘ'
  • 'u:1', 'ǖ'
  • 'u:', 'ü'
  • 'yo1', 'yō'
  • 'yo5', 'yo'

These items raise 'ValueError: Not a valid syllable:' exceptions.
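
The replacement loop described above can be sketched roughly as follows. This is only an illustration of the workaround, not part of Dragonmapper: the names REPLACEMENTS and preconvert are hypothetical, and the sketch assumes the remaining numbered syllables are passed to Dragonmapper afterwards. The order matters because the bare 'u:' pattern must run after the toned 'u:1' to 'u:4' patterns.

```python
# Ordered (pattern, replacement) pairs, taken from the list above.
# The toned "u:N" patterns must come before the bare "u:" pattern.
REPLACEMENTS = [
    ("u:4", "ǜ"),
    ("u:3", "ǚ"),
    ("u:2", "ǘ"),
    ("u:1", "ǖ"),
    ("u:", "ü"),
    ("yo1", "yō"),
    ("yo5", "yo"),
]


def preconvert(numbered):
    """Apply the workaround replacements to a CEDICT pinyin string.

    Hypothetical helper: it only rewrites the problem patterns and
    leaves every other syllable untouched for later conversion.
    """
    for old, new in REPLACEMENTS:
        numbered = numbered.replace(old, new)
    return numbered


if __name__ == "__main__":
    for sample in ["lu:4", "nu:3 ren2", "lu:e4", "yo1"]:
        print(sample, "->", preconvert(sample))
```

A syllable such as 'lu:e4' only has its 'u:' rewritten to 'ü', so its tone number is still left for the normal numbered-to-accented conversion.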

Accented pinyin which do not convert to zhuyin fuhao

I also encountered the following items which do not convert correctly:

  • 'ó':'ㄛˊ' # 哦 哦 [o2] /oh (interjection indicating doubt or surprise)/
  • 'ò':'ㄛˋ' # 哦 哦 [o4] /oh (interjection indicating that one has just learned sth)/
  • 'ō':'ㄛ'
  • 'ǒ':'ㄛˇ'
  • 'yō':'ㄧㄛ'
  • 'yo':'ㄧㄛ˙'
  • 'dia3':'ㄉㄧㄚˇ' # diǎ 嗲 嗲 [dia3] /coy/childish/
  • 'm2':'ㄇˊ'
  • 'm4':'ㄇˋ'
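
A lookup table like the one above could be applied before calling the converter, roughly as in this sketch. The names FALLBACK_ZHUYIN and to_zhuyin_with_fallback are hypothetical, and the convert argument stands in for whichever Dragonmapper conversion function is being used; it is passed in so that the sketch makes no assumption about the library's API.

```python
# Workaround table copied from the list above: syllables that did not
# convert correctly, mapped to the zhuyin I expected.
FALLBACK_ZHUYIN = {
    "ó": "ㄛˊ",
    "ò": "ㄛˋ",
    "ō": "ㄛ",
    "ǒ": "ㄛˇ",
    "yō": "ㄧㄛ",
    "yo": "ㄧㄛ˙",
    "dia3": "ㄉㄧㄚˇ",
    "m2": "ㄇˊ",
    "m4": "ㄇˋ",
}


def to_zhuyin_with_fallback(syllable, convert):
    """Return the table entry if one exists, else call convert().

    `convert` is any callable taking a syllable and returning zhuyin
    (an assumption of this sketch, not a specific library function).
    """
    if syllable in FALLBACK_ZHUYIN:
        return FALLBACK_ZHUYIN[syllable]
    return convert(syllable)
```

Checking the table first, rather than catching an exception, also covers cases where the converter returns a wrong result instead of raising.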

Already noted in issue 27

#27

  • 'tēi':'ㄊㄨㄟ' # Workaround for 忒 忒 [tei1] /(dialect) too/very/also pr. [tui1]/
  • 'eng1':'ㄥ' # Work around for ēng 鞥 鞥 [eng1] /reins/

Taiwanese Pronunciation Exceptions

I found it necessary to skip items which contained the Taiwanese pronunciations ['khè', 'goá', 'khàu', 'ô', 'yai2']. I'm not sure anything can be done about this with Dragonmapper. For example:
dragonmapper.hanzi.to_zhuyin('goá')
results in a 'ValueError: Not a valid syllable: o5'.
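
Skipping such items can be sketched as a loop that catches the ValueError and sets the reading aside. The name convert_or_skip is hypothetical, and convert is again a stand-in for the Dragonmapper function in use, assumed here to raise ValueError on readings it cannot parse, as in the example above.

```python
def convert_or_skip(readings, convert):
    """Convert each reading, collecting the ones that raise ValueError.

    Returns (converted, skipped): a dict of successful conversions and
    a list of readings the converter rejected.
    """
    converted = {}
    skipped = []
    for reading in readings:
        try:
            converted[reading] = convert(reading)
        except ValueError:
            # e.g. Taiwanese pronunciations such as 'goá'
            skipped.append(reading)
    return converted, skipped
```

Keeping the skipped readings in a list makes it easy to review them later or report them upstream.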
