Category: High-Precision Extraction of Emerging Concepts from Scientific Literature

From Big Physics
Revision as of 16:01, 8 December 2020 (Tuesday) by Songyk


Daniel King, Doug Downey, Daniel S. Weld. High-Precision Extraction of Emerging Concepts from Scientific Literature. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval

Abstract

Identification of new concepts in scientific literature can help power faceted search, scientific trend analysis, knowledge-base construction, and more, but current methods are lacking. Manual identification can't keep up with the torrent of new publications, while the precision of existing automatic techniques is too low for many applications. We present an unsupervised concept extraction method for scientific literature that achieves much higher precision than previous work. Our approach relies on a simple but novel intuition: each scientific concept is likely to be introduced or popularized by a single paper that is disproportionately cited by subsequent papers mentioning the concept. From a corpus of computer science papers on arXiv, we find that our method achieves a Precision@1000 of 99%, compared to 86% for prior work, and a substantially better precision-yield trade-off across the top 15,000 extractions. To stimulate research in this area, we release our code and data.

Summary and Comments

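This paper proposes a method, named ForeCite, for extracting emerging concepts from scientific literature. In experiments on computer science papers from arXiv it reaches high precision (Precision@1000 = 99%). The authors also release their code and dataset.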

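The problem the paper addresses is how to tell whether a term is a genuine concept or merely something loosely associated with one. The main idea is that a genuine concept is very likely to have been introduced in a paper that is heavily (disproportionately) cited by subsequent papers mentioning the concept, or to have emerged or become popular through that paper.

The paper discusses two earlier methods, LoOR [1] and CNLC [2]. All three methods analyze the term citation graph: the citation network restricted to the papers that contain a given term, which is a subgraph of the citation network formed by the whole corpus. The main idea behind LoOR and CNLC is that the citation network of a concept is "denser" than that of a non-concept. "Density" can be measured with different indicators, such as the connectivity of the subgraph or its number of edges.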
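To make the term citation graph concrete, here is a minimal Python sketch. The corpus, the paper ids, and the helper names `term_citation_graph` and `edge_density` are all made up for illustration. The density measure used here (directed edges divided by the number of possible directed edges) is only one simple choice and is not claimed to be the exact statistic used by LoOR or CNLC.

```python
from typing import Dict, Set

# Hypothetical toy corpus: paper id -> set of paper ids it cites.
CITATIONS: Dict[str, Set[str]] = {
    "A": set(),
    "B": {"A"},
    "C": {"A", "B"},
    "D": {"A"},
    "E": set(),
}


def term_citation_graph(
    citations: Dict[str, Set[str]], papers_with_term: Set[str]
) -> Dict[str, Set[str]]:
    """Restrict the corpus citation network to the papers containing a term."""
    return {
        paper: {c for c in citations.get(paper, set()) if c in papers_with_term}
        for paper in papers_with_term
    }


def edge_density(graph: Dict[str, Set[str]]) -> float:
    """Directed edges divided by the n * (n - 1) possible directed edges.

    Assumes no paper cites itself. This is an illustrative density measure,
    not the specific one from LoOR or CNLC.
    """
    n = len(graph)
    if n < 2:
        return 0.0
    edges = sum(len(cited) for cited in graph.values())
    return edges / (n * (n - 1))


# A term used by papers that cite one another versus a term whose papers
# are unrelated in the citation network.
concept_graph = term_citation_graph(CITATIONS, {"A", "B", "C", "D"})
generic_graph = term_citation_graph(CITATIONS, {"A", "E"})

print(edge_density(concept_graph))
print(edge_density(generic_graph))
```

In this toy corpus the first term's subgraph has four citation edges among four papers, while the second term's subgraph has none, which is the kind of contrast the density-based methods rely on.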

Returning to ForeCite: it assigns each candidate concept term a ranking score, defined as follows:

[math]\displaystyle{ ForeCite(G_t)=\max_{p\in{G_t}}\log(f_t^p+1) \cdot \frac{f_t^p}{f_t} }[/math]

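Here paper p is a node in the citation network of term t, [math]\displaystyle{ f_t^p }[/math] is the number of papers that cite p and contain term t, and [math]\displaystyle{ f_t }[/math] is the total number of papers that contain term t.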
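The score can be sketched in a few lines of Python, following the definitions given in this note rather than the authors' released code. The function name `forecite_score`, the toy citation data, and the use of the natural logarithm are assumptions made for illustration.

```python
import math
from typing import Dict, Set


def forecite_score(
    citations: Dict[str, Set[str]], papers_with_term: Set[str]
) -> float:
    """Compute max over p of log(f_t^p + 1) * f_t^p / f_t for one term.

    citations maps a paper id to the set of paper ids it cites.
    papers_with_term is the set of papers containing the term (the nodes
    of the term citation graph). The natural log is an assumption.
    """
    f_t = len(papers_with_term)
    if f_t == 0:
        return 0.0

    # f_t^p: number of papers that contain the term and cite paper p.
    cited_by_term_papers: Dict[str, int] = {}
    for paper in papers_with_term:
        for cited in citations.get(paper, set()):
            if cited in papers_with_term and cited != paper:
                cited_by_term_papers[cited] = cited_by_term_papers.get(cited, 0) + 1

    # Papers with no in-term citations score log(1) * 0 = 0, hence the default.
    return max(
        (math.log(count + 1) * count / f_t for count in cited_by_term_papers.values()),
        default=0.0,
    )


# Hypothetical toy corpus: paper id -> set of paper ids it cites.
citations = {
    "A": set(),
    "B": {"A"},
    "C": {"A", "B"},
    "D": {"A"},
    "E": set(),
}

# Papers B, C and D all cite A, so A dominates: log(3 + 1) * 3 / 4.
concept_score = forecite_score(citations, {"A", "B", "C", "D"})
# No paper containing this term cites another one containing it.
generic_score = forecite_score(citations, {"A", "E"})

print(concept_score, generic_score)
```

The ratio rewards terms whose papers concentrate their citations on a single paper, and the logarithmic factor keeps a term that occurs in only a handful of papers from ranking highly on the ratio alone.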

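Each term's ForeCite score is computed with the formula above, and the top-N terms are extracted as concepts and verified manually. The paper also compares ForeCite against LoOR and CNLC and reports better results (a Precision@1000 of 99%, compared to 86% for prior work).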
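The ranking and evaluation step can be illustrated as follows. The scores and human labels below are invented placeholders, not values from the paper, and `top_n_terms` and `precision_at_n` are hypothetical helper names.

```python
from typing import Dict, List, Set


def top_n_terms(scores: Dict[str, float], n: int) -> List[str]:
    """Return the n highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda term: (-scores[term], term))[:n]


def precision_at_n(ranked: List[str], true_concepts: Set[str], n: int) -> float:
    """Fraction of the first n ranked terms that annotators accepted."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for term in top if term in true_concepts) / len(top)


# Made-up scores for five candidate terms (illustration only).
scores = {
    "concept_x": 2.1,
    "concept_y": 1.8,
    "concept_z": 1.5,
    "generic_word_1": 0.2,
    "generic_word_2": 0.1,
}
# Made-up manual judgments: which candidates are genuine concepts.
human_labels = {"concept_x", "concept_y", "concept_z"}

ranked = top_n_terms(scores, 5)
p_at_3 = precision_at_n(ranked, human_labels, 3)
p_at_4 = precision_at_n(ranked, human_labels, 4)
print(ranked[:3], p_at_3, p_at_4)
```

Precision@N only looks at the first N extractions, so it measures how clean the top of the ranking is rather than how many concepts are recovered overall.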

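The paper identifies and extracts concepts by analyzing the citation network rather than through deep analysis of the text itself; NLP is used mainly for data preprocessing. Network analysis thus offers a different angle from which to approach this problem.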

References

  1. Yookyung Jo, Carl Lagoze, and C. Lee Giles. 2007. Detecting research topics via the correlation between graphs and texts. In KDD ’07.
  2. Asif ul Haque and Paul Ginsparg. 2011. Phrases as subtopical concepts in scholarly text. In JCDL ’11.

Concept Map
