Mining pinyin-to-character conversion rules from large-scale corpus : a rough set approach

This paper introduces a rough set technique for solving the problem of mining Pinyin-to-character (PTC) conversion rules. It first presents a text-structuring method by constructing a language information table from a corpus for each pinyin, which it will then apply to a free-form textual corpus. Da...

Bibliographic Details
Published in: IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics : a publication of the IEEE Systems, Man, and Cybernetics Society. - 1997. - 34(2004), no. 2 of 24 Apr., pages 834-44
Main Author: Wang, Xiaolong (Author)
Other Authors: Chen, Qingcai, Yeung, Daniel S
Format: Article
Language: English
Published: 2004
Access to the parent work: IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics : a publication of the IEEE Systems, Man, and Cybernetics Society
Subjects: Journal Article
LEADER 01000caa a22002652 4500
001 NLM150512473
003 DE-627
005 20250205202601.0
007 tu
008 231223s2004 xx ||||| 00| ||eng c
028 5 2 |a pubmed25n0502.xml 
035 |a (DE-627)NLM150512473 
035 |a (NLM)15376833 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Wang, Xiaolong  |e verfasserin  |4 aut 
245 1 0 |a Mining pinyin-to-character conversion rules from large-scale corpus  |b a rough set approach 
264 1 |c 2004 
336 |a Text  |b txt  |2 rdacontent 
337 |a ohne Hilfsmittel zu benutzen  |b n  |2 rdamedia 
338 |a Band  |b nc  |2 rdacarrier 
500 |a Date Completed 15.10.2004 
500 |a Date Revised 08.11.2019 
500 |a published: Print 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a This paper introduces a rough set technique for solving the problem of mining Pinyin-to-character (PTC) conversion rules. It first presents a text-structuring method by constructing a language information table from a corpus for each pinyin, which it will then apply to a free-form textual corpus. Data generalization and rule extraction algorithms can then be used to eliminate redundant information and extract consistent PTC conversion rules. The design of our model also addresses a number of important issues such as the long-distance dependency problem, the storage requirements of the rule base, and the consistency of the extracted rules, while the performance of the extracted rules as well as the effects of different model parameters are evaluated experimentally. These results show that by the smoothing method, high precision conversion (0.947) and recall rates (0.84) can be achieved even for rules represented directly by pinyin rather than words. A comparison with the baseline tri-gram model also shows good complement between our method and the tri-gram language model 
650 4 |a Journal Article 
700 1 |a Chen, Qingcai  |e verfasserin  |4 aut 
700 1 |a Yeung, Daniel S  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics : a publication of the IEEE Systems, Man, and Cybernetics Society  |d 1997  |g 34(2004), 2 vom: 24. Apr., Seite 834-44  |w (DE-627)NLM098252887  |x 1083-4419  |7 nnns 
773 1 8 |g volume:34  |g year:2004  |g number:2  |g day:24  |g month:04  |g pages:834-44 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 34  |j 2004  |e 2  |b 24  |c 04  |h 834-44