分类:Cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
Shaosheng Cao, Wei Lu, Jun Zhou and Xiaolong Li, cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information, In Proceedings of AAAI 2018
Abstract
In this paper, we propose cw2vec, a novel method for learning Chinese word embeddings. It is based on our observation that exploiting stroke-level information is crucial for improving the learning of Chinese word embeddings. Specifically, we design a minimalist approach to exploit such features, by using stroke n-grams, which capture semantic and morphological level information of Chinese words. Through qualitative analysis, we demonstrate that our model is able to extract semantic information that cannot be captured by existing methods. Empirical results on the word similarity, word analogy, text classification and named entity recognition tasks show that the proposed approach consistently outperforms state-of-the-art approaches such as word-based word2vec and GloVe, character-based CWE, component-based JWE and pixel-based GWE.
总结和评论
Word3vec算法word2vec[1][2],基于一个字上下文经常在一起出现的其他字的频率,可以用来发现词之间的语义联系,并用于解决自然语言处理的问题。在语音语言上,词的内部结构通常没有丰富的意义,基本上就是表示一个读音。但是,汉字是特殊的,其内部结构很多时候表示了含义上的联系。例如,妈、妹、姐、姑、奶都有“女”字旁,而且它们这几个字确实和“女”有联系。这个工作的研究者就注意到了这个联系,从而把汉字打开用更加基本的结构来发现汉字的含以上的联系。
参考文献
- ↑ T Mikolov, K Chen, G Corrado, J Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781.
- ↑ T Mikolov, I Sutskever, K Chen, GS Corrado, J Dean, Distributed representations of words and phrases and their compositionality, Advances in neural information processing systems, 3111-3119.
引用错误:在<references>中以“Cao”名字定义的<ref>标签没有在先前的文字中使用。
本分类目前不含有任何页面或媒体文件。