Research on CRF-based Mongolian Word Segmentation and POS-tagging

Journal of Inner Mongolia University (Social Sciences Edition) [ISSN: 1000-9035 / CN: 22-1262/O4]

Issue:
2016, No. 02
Pages:
23-28
Column:
Mongolian Studies
Publication date:
2016-04-05

Article Info

Title:
Research on CRF-based Mongolian Word Segmentation and POS-tagging
Author(s):
Narisong (那日松)¹, SHU Qin (淑琴)², Qiliger (齐力格尔)³
1. School of International Education, Hangzhou Normal University, Hangzhou 311121, Zhejiang, China;
2. Inner Mongolia University Library;
3. School of Mongolian Studies, Inner Mongolia University, Hohhot 010021, Inner Mongolia, China
Keywords:
Mongolian word segmentation; Mongolian part-of-speech (POS) tagging; conditional random fields (CRF) model
Classification number:
-
DOI:
-
Document code:
A
Abstract:
This paper studies automatic Mongolian word segmentation and POS tagging on a corpus of 200,000 Mongolian words. The manually segmented and POS-tagged corpus is first counted and analyzed, and its segmentation and tagging errors are corrected in a second pass. A conditional random fields (CRF) model is then applied to three tasks: word segmentation, POS tagging, and a "unified process" that performs segmentation and POS tagging together. In the open test, the precision of word segmentation exceeds 98%; the precision of the "unified process" is 3.55% higher than that of the "two-step" approach (word segmentation first, then POS tagging); and the "unified process" reaches a precision of 93.38% when context and the feature for suffixes written joined to the word (the "agglutinative word-formation suffix") are taken into account. These results go some way toward solving the problems of Mongolian word segmentation and POS tagging.
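
The difference between the "two-step" and "unified" setups can be illustrated as a difference in label design for a sequence labeler such as a CRF. The sketch below is only an illustration under stated assumptions: the abstract does not give the labeling scheme, tag set, or feature templates, so the character-level B/I scheme, the tags `N` and `SUF`, the toy strings, and all function names are hypothetical. In the two-step setup the first model predicts only boundaries; in the unified setup each label carries both the boundary and the POS tag, so a single model predicts both at once.

```python
from typing import Dict, List, Tuple

Unit = Tuple[str, str]      # (segment text, POS tag); toy values, not real corpus data
Labeled = Tuple[str, str]   # (character, label)


def encode_segmentation(units: List[Unit]) -> List[Labeled]:
    """Two-step, stage 1: label each character B (starts a segment) or I (inside one)."""
    labeled = []
    for text, _pos in units:
        for i, ch in enumerate(text):
            labeled.append((ch, "B" if i == 0 else "I"))
    return labeled


def encode_joint(units: List[Unit]) -> List[Labeled]:
    """Unified: fuse boundary and POS into one label per character, e.g. B-N, I-N."""
    labeled = []
    for text, pos in units:
        for i, ch in enumerate(text):
            labeled.append((ch, ("B-" if i == 0 else "I-") + pos))
    return labeled


def decode_joint(labeled: List[Labeled]) -> List[Unit]:
    """Recover (segment, POS) pairs from a sequence of joint labels."""
    units: List[Unit] = []
    for ch, label in labeled:
        boundary, pos = label.split("-", 1)
        if boundary == "B" or not units:
            units.append((ch, pos))          # open a new segment
        else:
            text, first_pos = units[-1]
            units[-1] = (text + ch, first_pos)  # extend the current segment
    return units


def context_features(chars: List[str], i: int) -> Dict[str, str]:
    """Assumed context window: the character at i plus its left and right neighbours."""
    return {
        "c0": chars[i],
        "c-1": chars[i - 1] if i > 0 else "<BOS>",
        "c+1": chars[i + 1] if i + 1 < len(chars) else "<EOS>",
    }


# Placeholder stem + suffix; the strings and tags are invented for illustration.
sample = [("abc", "N"), ("de", "SUF")]
seg_labels = [label for _, label in encode_segmentation(sample)]
joint_labels = [label for _, label in encode_joint(sample)]
roundtrip = decode_joint(encode_joint(sample))
```

The labeled sequences and feature dictionaries produced this way would be passed to whatever CRF toolkit is used for training. The joint label set is larger than the boundary-only set, which is the usual trade-off of a unified scheme: segmentation errors are no longer passed on to a separate tagging stage, at the cost of more labels to learn.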

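Memo

Received: 2015-03-18
Funding: Major Project of the National Social Science Fund of China (Grant No. 11&ZD188)
About the authors: Narisong (那日松), female, Mongolian, from Hinggan League, Inner Mongolia; assistant researcher, School of International Education, Hangzhou Normal University; Ph.D. SHU Qin (淑琴), female, Mongolian, from Jirim League, Inner Mongolia; associate research librarian, Inner Mongolia University Library; Ph.D. Qiliger (齐力格尔), female, Mongolian, from Jirim League, Inner Mongolia; master's student, School of Mongolian Studies, Inner Mongolia University.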
Last Update: