Category:Distributed representations of words and phrases and their compositionality

T Mikolov, I Sutskever, K Chen, GS Corrado, J Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems, 3111-3119 (2013).

Abstract

The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several improvements that make the Skip-gram model more expressive and enable it to learn higher quality vectors more rapidly. We show that by subsampling frequent words we obtain significant speedup, and also learn higher quality representations as measured by our tasks. We also introduce Negative Sampling, a simplified variant of Noise Contrastive Estimation (NCE) that learns more accurate vectors for frequent words compared to the hierarchical softmax. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple and efficient method for finding phrases, and show that their vector representations can be accurately learned by the Skip-gram model.
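
To make the Negative Sampling idea mentioned in the abstract concrete, here is a minimal numpy sketch (not the paper's implementation; the table names V_in/V_out, the toy sizes, and the uniform noise distribution are illustrative assumptions): instead of normalizing over the whole vocabulary, each observed (target, context) pair is contrasted with a handful of randomly drawn "noise" words.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k = 10000, 100, 5        # toy sizes; k = number of noise samples

# Two illustrative embedding tables: input (target) vectors and output (context) vectors.
V_in = rng.normal(scale=0.1, size=(vocab_size, dim))
V_out = rng.normal(scale=0.1, size=(vocab_size, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(target_id, context_id, noise_dist):
    """Binary objective: score the observed (target, context) pair as 'real'
    and k words drawn from a noise distribution as 'fake'."""
    v_t = V_in[target_id]
    pos = np.log(sigmoid(V_out[context_id] @ v_t))          # observed pair
    neg_ids = rng.choice(vocab_size, size=k, p=noise_dist)  # k noise words
    neg = np.log(sigmoid(-(V_out[neg_ids] @ v_t))).sum()
    return -(pos + neg)  # negative log-likelihood to be minimized

# Uniform noise distribution just for illustration; the paper samples
# frequent words more often (a smoothed unigram distribution).
noise = np.full(vocab_size, 1.0 / vocab_size)
print(negative_sampling_loss(target_id=42, context_id=7, noise_dist=noise))
</syntaxhighlight>

Each update then touches only k+1 output vectors rather than the whole vocabulary, which is where the speedup over the full softmax comes from.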

Summary and comments

This work is actually a follow-up to [1], but it is more readable and also discusses the extension to phrases.

In this work, the authors propose a machine-learning method for obtaining vector representations of words, in which two words with similar meanings also have similar vectors, that is, a relatively large inner product. The basic idea is to train a predictor that, from the few words surrounding a given word, predicts that word, or conversely predicts the surrounding words from the word itself, with reasonably high accuracy (see the figure Word2vec.png). The main mathematical structure of the Skip-gram prediction model (predicting the surrounding words from the target word) is [math]\displaystyle{ P\left(w_{e}|w_{t}\right) = \frac{e^{\nu^{T}_{e}\nu_{t}}}{\sum_{\tau}e^{\nu^{T}_{\tau}\nu_{t}}} }[/math], where t is the target word and e is the environment (context) word. The objective is then that the words [math]\displaystyle{ w_{e} }[/math] predicted by this machine, whose trainable parameters are the vectors [math]\displaystyle{ \nu_{t},\nu_{e} }[/math], match the words that actually appear in the text as closely as possible. One can see that if training succeeds, words that frequently appear together will end up with similar vector representations. With the additional assumption that words appearing together tend to be related in meaning, this achieves the original goal: words with similar meanings have vector representations with a relatively large inner product. Of course, the actual computation is not done with this function directly, since the cost would be far too large. How exactly this function is turned into a more tractable form, and how machine-learning methods are used to implement it, is a separate matter. The main idea is just this simple.
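
For concreteness, the following is a minimal numpy sketch of the full-softmax probability written above (a sketch under assumed toy sizes and illustrative names, not the paper's code); it also shows why the direct computation is so expensive: the denominator sums over the entire vocabulary for every prediction.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10000, 100              # toy sizes for illustration

# nu_t lives in one table (target/input vectors), nu_e and nu_tau in another
# (environment/output vectors), mirroring the two vector sets in the formula.
V_target = rng.normal(scale=0.1, size=(vocab_size, dim))
V_context = rng.normal(scale=0.1, size=(vocab_size, dim))

def skipgram_softmax(target_id):
    """P(w_e | w_t) for every candidate environment word w_e."""
    scores = V_context @ V_target[target_id]  # nu_tau^T nu_t for all tau
    scores -= scores.max()                    # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()      # the sum over tau is the costly part

probs = skipgram_softmax(target_id=42)
print(probs.shape, probs.sum())  # (10000,) 1.0
</syntaxhighlight>

Training then adjusts the two tables so that the probability assigned to the environment words actually observed in the corpus is as large as possible.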

Once this main idea is understood, one notices that it rests on quite a few assumptions, and it is not obvious that the goal can really be achieved, so validating the results is not easy. In this respect, this paper and the follow-up papers by others offer approaches worth borrowing. In addition, the evaluation of word vector representations itself later became a research problem.

References

1. T Mikolov, K Chen, G Corrado, J Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
