Category: A new methodology for constructing a publication-level classification system of science

From Big Physics


Waltman, L., & van Eck, N. J. (2012). A new methodology for constructing a publication-level classification system of science. Journal of the American Society for Information Science and Technology, 63(12), 2378–2392

Abstract

Classifying journals or publications into research areas is an essential element of many bibliometric analyses. Classification usually takes place at the level of journals, where the Web of Science subject categories are the most popular classification system. However, journal-level classification systems have two important limitations: They offer only a limited amount of detail, and they have difficulties with multidisciplinary journals. To avoid these limitations, we introduce a new methodology for constructing classification systems at the level of individual publications. In the proposed methodology, publications are clustered into research areas based on citation relations. The methodology is able to deal with very large numbers of publications. We present an application in which a classification system is produced that includes almost 10 million publications. Based on an extensive analysis of this classification system, we discuss the strengths and the limitations of the proposed methodology. Important strengths are the transparency and relative simplicity of the methodology and its fairly modest computing and memory requirements. The main limitation of the methodology is its exclusive reliance on direct citation relations between publications. The accuracy of the methodology can probably be increased by also taking into account other types of relations–for instance, based on bibliographic coupling.


Summary and comments

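This paper studies topic discovery (clustering) of publications based on the citation network among them. The core idea is to adjust the community structure of the network so as to maximize its modularity [1]. Many algorithms exist for carrying out this optimization; which one Waltman and van Eck use still needs a closer look. Their source code is available here. After obtaining the clusters, the paper labels each cluster based on term frequencies (TF-IDF) and then evaluates how well the clustering performs. For that evaluation it picks a few fields and a few journals as examples and discusses strengths and weaknesses. However, the paper does not compare its clustering results with those of other clustering methods, or with existing classifications assigned by authors or editorial offices. Because the algorithm can handle fairly large citation networks, quite a few researchers, Boyack for example, now use this classification method.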
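
As a rough illustration of the modularity objective mentioned above [1], the sketch below computes the Newman-Girvan modularity Q of a given partition. It is not the paper's implementation: it assumes the citation network is treated as undirected and unweighted, and the function name `modularity` and the toy graph are made up for this example. The quality function actually optimized in the paper may differ in its details.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman-Girvan modularity Q of a partition (undirected, unweighted graph).

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its cluster label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    degree_sum = defaultdict(int)  # total degree of the nodes in each cluster
    internal = defaultdict(int)    # number of edges with both ends in the cluster
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # Q = sum over clusters of (fraction of internal edges)
    #     minus (expected fraction under random rewiring with the same degrees)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy example (assumed data): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "all" for node in range(6)}

q_split = modularity(edges, split)    # one cluster per triangle
q_merged = modularity(edges, merged)  # everything in a single cluster
```

On this toy graph the partition into the two triangles scores higher than the single-cluster partition, which is the kind of comparison an optimization algorithm repeats many times while moving publications between clusters.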

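In fact, at the level of the work's own details, the specific optimization method, and even the clustering method, may still have room for improvement, and the label handling does not take semantic relations between words into account. These are all points where further work may be possible.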

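For us, however, what matters is not these small points. First, what matters is which problem the paper solves (it proposes a clustering algorithm based on the publication citation network), and how solving that problem resembles or differs in thinking from the article topic identification we are studying ourselves. Second, there is the paper's writing: since we will eventually have to write up the results of our own method, what can we learn from it, especially from the way it explains the plausibility and the shortcomings of its results?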

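Concretely, the problem is the same: a clustering algorithm for articles. The approach differs, though. First, in terms of data, we consider combining the citation network with text similarity. Second, in terms of the analytical idea, on the citation network and even on text similarity, we want the similarity calculation itself to account for both direct and indirect links, not only direct ones. That is, the similarity in this paper is a normalization applied directly to the adjacency matrix, whereas we want the similarity calculation to also reflect effects such as the square of the adjacency matrix. Third, the analysis method differs: we hope to use word2vec [2][3] to compute vector representations of articles from both the citation network and the text, and, once we have these vectors, to cluster them either with a method such as LDA [4][5] or still with similarity-based clustering.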
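
To make the second point concrete, here is a minimal sketch of a similarity that adds an indirect (two-step) term to the normalized adjacency matrix. It is only one possible reading of the idea: the row normalization, the weight `alpha`, and the helper names are assumptions made for this example, and neither the paper nor our own method is committed to this exact form.

```python
def matmul(a, b):
    """Plain-Python product of two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]


def row_normalize(a):
    """Divide each row by its sum; all-zero rows are left as zeros."""
    normalized = []
    for row in a:
        total = sum(row)
        normalized.append([x / total if total else 0.0 for x in row])
    return normalized


def similarity(adj, alpha=0.5):
    """Direct similarity plus alpha times two-step similarity.

    adj: symmetric adjacency matrix of the (undirected) citation network.
    alpha: hypothetical weight of the indirect term, chosen for illustration.
    """
    direct = row_normalize(adj)         # direct links only
    two_step = matmul(direct, direct)   # paths of length two
    n = len(adj)
    return [[direct[i][j] + alpha * two_step[i][j] for j in range(n)]
            for i in range(n)]


# Toy example (assumed data): a chain 0 - 1 - 2, so 0 and 2 are linked only via 1.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
direct_only = row_normalize(adj)
sim = similarity(adj, alpha=0.5)
```

In the chain example, publications 0 and 2 have zero direct similarity but a positive combined similarity, because both are linked to publication 1. The two-step term also adds mass on the diagonal (each node reaches itself in two steps), which a real implementation might want to remove.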

References

  1. Newman, M. E. J., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2), 026113.
  2. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  3. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 3111-3119.
  4. Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1), 5228-5235.
  5. Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
