Category:Measuring academic influence: Not all citations are equal

From Big Physics


Xiaodan Zhu, Peter Turney, Daniel Lemire & André Vellino, Measuring Academic Influence: Not All Citations Are Equal, Journal of the Association for Information Science and Technology, 66(2), 408, http://doi.org/10.1002/asi.23179

Abstract

The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. We want to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. For this purpose, we examine the effectiveness of a variety of features for determining the academic influence of a citation. By asking authors to identify the key references in their own work, we created a data set in which citations were labeled according to their academic influence. Using automatic feature selection with supervised machine learning, we found a model for predicting academic influence that achieves good performance on this data set using only four features. The best features, among those we evaluated, were those based on the number of times a reference is mentioned in the body of a citing paper. The performance of these features inspired us to design an influence-primed h-index (the hip-index). Unlike the conventional h-index, it weights citations by how many times a reference is mentioned. According to our experiments, the hip-index is a better indicator of researcher performance than the conventional h-index.

Summary and comments

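This paper uses machine learning to address the problem of key citations: some references are the real foundation of a work, while others only supply general background or are cited half-heartedly, and the question is how to tell them apart. The paper includes manually labeled data, which can serve as a training set and, for now, as a benchmark for evaluating algorithms; see the wiki category 分类:数据集 (Datasets).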

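Specifically, the paper builds a classifier with an SVM (support vector machine) from several groups of features. The first group is citation-count features: for example, how many times a reference is mentioned in the Introduction, how many times it is mentioned in the full text, and in how many sections it appears. The second group is text-similarity features between the citing and cited papers: for example, the similarity between the title of the citing paper and the title of the cited paper, and the similarity between the passage of the citing paper in which the cited paper is mentioned and the citing or cited paper. Citation-network similarity, that is, bibliographic coupling and co-citation, is not considered. The classifier is fitted to the labels in the training set, which record whether a cited paper had a direct academic influence on the citing paper.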
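The count-based features described above can be sketched in a few lines. This is only an illustration, not the paper's implementation: the input format (a list of `(reference_id, section_name)` pairs, one per in-text mention), the function name `count_features`, and the feature names are all assumptions made for this example.

```python
from collections import Counter

def count_features(mentions, intro_name="Introduction"):
    """Compute simple count-based features for each cited reference.

    mentions: list of (reference_id, section_name) pairs, one pair per
    in-text mention of a reference in the citing paper (hypothetical format).
    """
    total = Counter()      # mentions in the full text
    intro = Counter()      # mentions in the introduction only
    sections = {}          # set of sections in which each reference appears
    for ref, section in mentions:
        total[ref] += 1
        if section == intro_name:
            intro[ref] += 1
        sections.setdefault(ref, set()).add(section)
    return {
        ref: {
            "total_mentions": total[ref],
            "intro_mentions": intro[ref],
            "num_sections": len(sections[ref]),
        }
        for ref in total
    }

# Made-up mentions extracted from one citing paper.
mentions = [
    ("ref1", "Introduction"),
    ("ref1", "Methods"),
    ("ref1", "Methods"),
    ("ref2", "Introduction"),
    ("ref3", "Related Work"),
    ("ref1", "Results"),
]
features = count_features(mentions)
print(features["ref1"])
```

Each resulting feature dictionary would form one row of the input to a classifier such as an SVM, with the author-provided "influential / not influential" label as the target.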

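To build the text-similarity features, the paper uses discrete word vectors, that is, a representation in which each dimension stands for one particular word. This leaves room for further work, for example replacing the discrete representation with distributed vectors trained with word2vec. The citation-network similarity mentioned above could also be added to the existing set of features. It would be worth checking whether these two changes improve the accuracy (or rather the F-measure) of the classifier.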
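To make the limitation of a discrete word representation concrete, here is a minimal sketch of title similarity computed on bag-of-words vectors. It assumes cosine similarity over raw term counts; the paper's exact weighting and preprocessing are not described here, so the tokenizer and the function names are illustrative only.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Discrete representation: one dimension per distinct lowercase word."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

citing_title = "Measuring academic influence"
cited_title = "Measuring scholarly impact"
title_sim = cosine_similarity(bag_of_words(citing_title), bag_of_words(cited_title))

# Words with related meanings but different spellings share no dimension,
# so they contribute nothing to the similarity.
synonym_sim = cosine_similarity(bag_of_words("automobile"), bag_of_words("car"))
print(title_sim, synonym_sim)
```

In this representation only exact word overlap counts: the two titles match on "measuring" alone, and "automobile" and "car" are orthogonal. Dense vectors such as those from word2vec are meant to place related words close together, which is why they are a natural candidate for the extension suggested above.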

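To validate the results, the paper takes the labeled "direct academic influence / no direct academic influence" results as a basis and recomputes the citation count and h-index of every author in its data set. It finds that the recomputed values agree fairly well with the academic standing of these authors (the paper uses membership of a particular society as the indicator). This approach is also a useful reference.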
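The recomputation can be illustrated with a small sketch. The data below are invented, and the two variants shown (counting only citations labeled influential, and weighting each citation by its number of mentions) are one plausible reading of the description and of the abstract's hip-index, not necessarily the paper's exact definitions.

```python
def h_index(scores):
    """Largest h such that at least h papers have a score of at least h."""
    ranked = sorted(scores, reverse=True)
    h = 0
    for rank, score in enumerate(ranked, start=1):
        if score >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical data for one author: for each paper, a list of incoming
# citations, each given as (is_influential, mention_count).
papers = {
    "paper_a": [(True, 4), (False, 1), (True, 2)],
    "paper_b": [(False, 1), (False, 2)],
    "paper_c": [(True, 3), (False, 1), (False, 1)],
}

# Conventional citation counts: every citation counts once.
raw_counts = [len(cites) for cites in papers.values()]

# Only citations labeled as having a direct academic influence.
influential_counts = [
    sum(1 for influential, _ in cites if influential)
    for cites in papers.values()
]

# Each citation weighted by how often the reference is mentioned.
mention_weighted = [
    sum(mentions for _, mentions in cites)
    for cites in papers.values()
]

h_raw = h_index(raw_counts)
h_influential = h_index(influential_counts)
h_weighted = h_index(mention_weighted)
print(h_raw, h_influential, h_weighted)
```

The same `h_index` function is reused for all three variants; only the per-paper score changes, which makes it easy to compare how each scoring choice ranks a group of authors.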
