Corpus of the People's Daily / 中國人民日報標注語料庫 (PFR)
- Each paragraph ID has the form date (year, month, day)-page number-article number-paragraph number
- Articles are separated by one extra empty line (i.e. 3 empty lines)

I found that the file structure isn't consistent (rule 2)
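
The ID format above can be split mechanically. The following is a minimal sketch, not the corpus's official tooling. It assumes each non-empty line starts with the paragraph ID followed by a `/tag` suffix, and that the remaining tokens are whitespace-separated `word/tag` pairs. The names `ParagraphId`, `parse_paragraph_id`, and `parse_line` are hypothetical, and any bracketed or nested annotation, if present in the files, is not handled.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class ParagraphId(NamedTuple):
    date: str        # year, month, day as written in the file, e.g. "19980101"
    page: int        # page number
    article: int     # article number
    paragraph: int   # paragraph number


# Assumption: the date is 8 digits and the other three fields are plain digits.
ID_PATTERN = re.compile(r"^(\d{8})-(\d+)-(\d+)-(\d+)$")


def parse_paragraph_id(raw: str) -> ParagraphId:
    """Split 'date-page-article-paragraph' into its four fields."""
    match = ID_PATTERN.match(raw)
    if match is None:
        raise ValueError(f"not a paragraph ID: {raw!r}")
    date, page, article, paragraph = match.groups()
    return ParagraphId(date, int(page), int(article), int(paragraph))


def parse_line(line: str) -> Optional[Tuple[ParagraphId, List[Tuple[str, str]]]]:
    """Return (paragraph ID, [(word, tag), ...]), or None for an empty line."""
    tokens = line.split()
    if not tokens:
        # Empty lines only separate paragraphs and articles.
        return None
    # The first token is the ID followed by its own tag, e.g. "<id>/m".
    raw_id, _, _ = tokens[0].rpartition("/")
    pairs = []
    for token in tokens[1:]:
        # rpartition keeps any '/' inside the word and splits on the last one.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return parse_paragraph_id(raw_id), pairs
```

Because the empty-line rule is not applied consistently in the files, grouping paragraphs by the `article` field of the ID is probably more reliable than counting empty lines.
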
| Tag | Meaning |
|---|---|
| a | adjective |
| ad | adjective as adverbial |
| ag | adjective morpheme |
| an | adjective with nominal function |
| b | non-predicate adjective |
| bg | non-predicate adjective morpheme |
| c | conjunction |
| cg | conjunction morpheme |
| d | adverb |
| dg | adverb morpheme |
| e | interjection |
| ew | sentential punctuation |
| f | directional locality |
| fg | locality morpheme |
| g | morpheme |
| h | prefix |
| i | idiom |
| j | abbreviation |
| k | suffix |
| l | fixed expressions |
| m | numeral |
| mg | numeric morpheme |
| n | common noun |
| ng | noun morpheme |
| nr | personal name |
| ns | place name |
| nt | organization name |
| nx | nominal character string |
| nz | other proper noun |
| o | onomatope |
| p | preposition |
| pg | preposition morpheme |
| q | classifier |
| qg | classifier morpheme |
| r | pronoun |
| rg | pronoun morpheme |
| s | space word |
| t | time word |
| tg | time word morpheme |
| u | auxiliary |
| v | verb |
| vd | verb as adverbial |
| vg | verb morpheme |
| vn | verb with nominal function |
| w | symbol and non-sentential punctuation |
| x | unclassified items |
| y | modal particle |
| yg | modal particle morpheme |
| z | descriptive |
| zg | descriptive morpheme |
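
| Code | Name |
|---|---|
| Ag | adjective morpheme (形語素) |
| a | adjective (形容詞) |
| ad | adverbial adjective (副形詞) |
| an | nominal adjective (名形詞) |
| Bg | distinguishing-word morpheme (區別語素) |
| b | distinguishing word (區別詞) |
| c | conjunction (連詞) |
| Dg | adverb morpheme (副語素) |
| d | adverb (副詞) |
| e | interjection (嘆詞) |
| f | locality word (方位詞) |
| g | morpheme (語素) |
| h | prefix (前接成分) |
| i | idiom (成語) |
| j | abbreviation (簡略語) |
| k | suffix (後接成分) |
| l | fixed expression (習用語) |
| Mg | numeral morpheme (數語素) |
| m | numeral (數詞) |
| Ng | noun morpheme (名語素) |
| n | noun (名詞) |
| nr | personal name (人名) |
| ns | place name (地名) |
| nt | organization or group (機構團體) |
| nx | foreign characters (外文字符) |
| nz | other proper noun (其它專名) |
| o | onomatopoeia (擬聲詞) |
| p | preposition (介詞) |
| Qg | classifier morpheme (量語素) |
| q | classifier (量詞) |
| Rg | pronoun morpheme (代語素) |
| r | pronoun (代詞) |
| s | place word (處所詞) |
| Tg | time morpheme (時間語素) |
| t | time word (時間詞) |
| Ug | auxiliary morpheme (助語素) |
| u | auxiliary (助詞) |
| Vg | verb morpheme (動語素) |
| v | verb (動詞) |
| vd | adverbial verb (副動詞) |
| vn | nominal verb (名動詞) |
| w | punctuation (標點符號) |
| x | non-morpheme character (非語素字) |
| Yg | modal particle morpheme (語氣語素) |
| y | modal particle (語氣詞) |
| z | descriptive (狀態詞) |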
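
The two tables spell the morpheme tags differently: the first uses lowercase (`ag`, `ng`), the second a leading capital (`Ag`, `Ng`). A minimal sketch for looking tags up regardless of spelling follows. It assumes the capitalized and lowercase forms denote the same category, which the tables suggest but do not state. `TAG_MEANINGS` is a hypothetical excerpt, not the full tag set, and `normalize_tag` and `describe_tag` are illustrative names.

```python
# Hypothetical excerpt of the first table; extend with the remaining rows as needed.
TAG_MEANINGS = {
    "a": "adjective",
    "ag": "adjective morpheme",
    "n": "common noun",
    "ng": "noun morpheme",
    "v": "verb",
    "vg": "verb morpheme",
}


def normalize_tag(tag: str) -> str:
    """Map spellings such as 'Ag' to the lowercase form 'ag' (assumed equivalent)."""
    return tag.strip().lower()


def describe_tag(tag: str) -> str:
    """Look up a tag's meaning, falling back to a marker for tags not in the excerpt."""
    return TAG_MEANINGS.get(normalize_tag(tag), "unknown tag")
```

Tags that appear in only one of the two tables (for example `Ug` or `ew`) still need an explicit entry, since case folding alone cannot reconcile them.
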
Every diagonal element of the output is 1 (each document's similarity with itself).
In : df.shape
Out: (3443, 3443)
In : df.head()
Out:
0 1 2 3 4 5 6 \
0 1.000000 0.247370 0.031764 0.011568 0.069333 0.000000 0.104921
1 0.247370 1.000000 0.074034 0.093247 0.114389 0.118067 0.099436
2 0.031764 0.074034 1.000000 0.199128 0.240452 0.000000 0.318746
3 0.011568 0.093247 0.199128 1.000000 0.121334 0.024070 0.152113
4 0.069333 0.114389 0.240452 0.121334 1.000000 0.000000 0.148902
In : df.shape
Out: (3443, 3443)
In : df.head()
Out:
(2-dimensional LSI space)
0 1 2 3 4 5 6 \
0 1.000000 0.972027 0.986259 0.965273 0.980496 0.753523 0.988651
1 0.972027 1.000000 0.997472 0.999629 0.999230 0.886853 0.996280
2 0.986259 0.997472 1.000000 0.995168 0.999492 0.851778 0.999885
3 0.965273 0.999629 0.995168 1.000000 0.997791 0.899103 0.993565
4 0.980496 0.999230 0.999492 0.997791 1.000000 0.868036 0.998895
(5-dimensional LSI space)
0 1 2 3 4 5 6 \
0 1.000000 0.983483 0.848290 0.796135 0.629115 0.691262 0.791310
1 0.983483 1.000000 0.894295 0.839724 0.696409 0.727351 0.818970
2 0.848290 0.894295 1.000000 0.907489 0.878193 0.536798 0.933318
3 0.796135 0.839724 0.907489 1.000000 0.910081 0.370954 0.861982
4 0.629115 0.696409 0.878193 0.910081 1.000000 0.406662 0.728450
(100-dimensional LSI space)
0 1 2 3 4 5 6 \
0 1.000000 0.653180 0.215969 0.222961 0.154828 0.149586 0.406351
1 0.653180 1.000000 0.376675 0.380387 0.363116 0.256816 0.463112
2 0.215969 0.376675 1.000000 0.475014 0.592771 -0.006707 0.613748
3 0.222961 0.380387 0.475014 1.000000 0.350812 -0.039246 0.369790
4 0.154828 0.363116 0.592771 0.350812 1.000000 -0.022277 0.382058
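
The matrices above are pairwise document similarities in LSI spaces of different dimensionality. The source does not say which library or similarity measure produced them, so the following is only a sketch of the general idea using NumPy: cosine similarity as an assumed measure, and a truncated SVD of a document-term matrix standing in for LSI. The random `docs` matrix and the function names `cosine_similarity_matrix` and `lsi_project` are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(X: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero; all-zero rows get similarity 0
    unit = X / norms
    return unit @ unit.T


def lsi_project(X: np.ndarray, k: int) -> np.ndarray:
    """Project the rows of X (documents) onto the top-k singular directions."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]


# Hypothetical data: 6 documents described by 20 non-negative term weights.
rng = np.random.default_rng(0)
docs = rng.random((6, 20))

sim_full = cosine_similarity_matrix(docs)                # original term space
sim_2d = cosine_similarity_matrix(lsi_project(docs, 2))  # 2-dimensional LSI space
sim_5d = cosine_similarity_matrix(lsi_project(docs, 5))  # 5-dimensional LSI space
```

With few dimensions every document is squeezed onto the same handful of directions, which fits the outputs above: most similarities in the 2-dimensional space are close to 1, and they spread out as the dimensionality grows. Negative values such as those in the 100-dimensional output are possible because projected coordinates can be negative even when the original term weights are not.
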