Category: Natural language processing techniques for tasks such as article topic identification

From Big Physics
Revision as of 21:13, 27 January 2019 (Sunday) by Shishuangqing (talk | contribs)


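Classification underlies much research. For example, moving from an understanding of each individual article to an understanding of a whole field, or of one of its branches, requires coarse-graining, which in turn requires identifying article topics. As another example, the average number of citations differs from field to field; to compare citation counts across fields, they must first be converted into a common "currency", and that conversion rests on good field identification.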
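As a minimal sketch of the "currency" conversion, assuming field labels are already available from some topic identification step, each paper's citation count can be divided by the mean count of its field. The function name, the tuple layout and the toy data are illustrative assumptions, not something specified on this page.

```python
from collections import defaultdict


def field_normalized_citations(papers):
    """Divide each paper's citation count by the mean count of its field.

    `papers` is a list of (paper_id, field, citations) tuples; the field
    labels are assumed to come from a prior topic identification step.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for _, field, cites in papers:
        totals[field] += cites
        counts[field] += 1
    means = {field: totals[field] / counts[field] for field in totals}
    return {
        pid: (cites / means[field] if means[field] > 0 else 0.0)
        for pid, field, cites in papers
    }


# Toy data, invented for illustration: biology papers are cited far more often.
papers = [
    ("m1", "mathematics", 2),
    ("m2", "mathematics", 4),
    ("b1", "biology", 20),
    ("b2", "biology", 40),
]
scores = field_normalized_citations(papers)
```

In this toy data, `m2` and `b2` receive the same normalized score even though their raw counts differ by a factor of ten, which is the kind of cross-field comparison the conversion is meant to enable. A poor field assignment would change the field means and therefore every score, which is why the normalization depends on good field identification.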

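At the same time, much research in the science of science has to be carried out at the level of fields: what a field's current hot topics, difficulties and possible breakthroughs are, and how a field is supported by other fields and supports them in turn. In other words, the level of individual articles may be too fine; the classification of articles into fields is a more suitable coarse-graining, and the study of many problems requires it.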

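The main idea of this identification work is to obtain a network similarity from the article citation network, combine it with a semantic similarity derived from article titles, abstracts or even full texts, and then use these similarities to perform some form of clustering. As for concrete methods, we are considering combining the natural language processing techniques word2vec and LDA, or alternatively word2vec and pLSA.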
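The pipeline described above can be sketched with deliberately simple stand-ins. This is not the word2vec/LDA combination itself: the network similarity is replaced here by the Jaccard overlap of reference lists (bibliographic coupling), the semantic similarity by a bag-of-words cosine, and the clustering by connected components above a threshold. The weight `alpha`, the `threshold` and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between bag-of-words counts of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def jaccard(refs_a, refs_b):
    """Bibliographic coupling: overlap between two reference sets."""
    union = refs_a | refs_b
    return len(refs_a & refs_b) / len(union) if union else 0.0


def cluster(docs, alpha=0.5, threshold=0.3):
    """Link documents whose combined similarity exceeds `threshold`,
    then return the connected components as clusters."""
    ids = sorted(docs)
    parent = {i: i for i in ids}

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for n, i in enumerate(ids):
        for j in ids[n + 1:]:
            sim = (alpha * jaccard(docs[i]["refs"], docs[j]["refs"])
                   + (1 - alpha) * cosine(docs[i]["text"], docs[j]["text"]))
            if sim > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy documents, invented for illustration.
docs = {
    "a": {"text": "citation network clustering of papers", "refs": {"r1", "r2"}},
    "b": {"text": "clustering papers with citation network", "refs": {"r1", "r2", "r3"}},
    "c": {"text": "protein folding dynamics", "refs": {"r7", "r8"}},
    "d": {"text": "dynamics of protein folding pathways", "refs": {"r8", "r9"}},
}
clusters = cluster(docs)
```

In a real experiment each stand-in would presumably be swapped for the corresponding learned representation, while the overall shape (two similarities, one combination rule, one clustering step) stays the same.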

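More specifically, combining word2vec and LDA means using the text of the articles together with the citation network among them to turn the articles into vectors, and then classifying with these vector representations. That is, looking at word2vec (or, more closely, GloVe) from another angle, it can be regarded as a vector representation of a network: in this network, every word has edges to the other words around it. The article citation network can then naturally be treated in the same way to find a vector representation for each article; the node2vec model (a generalization of word2vec) is a method for obtaining such vector representations of network nodes. Bringing in doc2vec (also a generalization of word2vec) then gives a content-based vector representation of each article. Combining the two representations, for example through a direct sum (which amounts to averaging two probabilities) or a direct product (which amounts to multiplying two probabilities), and then clustering may yield a better classification of articles.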
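One way to read the "averaging" and "multiplying" remark is in terms of cosine similarity rather than probabilities, which is an interpretation and not something stated on this page. If both representations are normalized to unit length, the cosine similarity of two concatenated (direct-sum) vectors is the average of the two separate cosines, and the cosine similarity of two outer-product (direct-product) vectors is their product. The sketch below checks this on hypothetical embeddings; in practice the network vectors might come from node2vec and the text vectors from doc2vec.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def cos(u, v):
    """Cosine similarity of two vectors of equal length."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def direct_sum(net_vec, text_vec):
    """Concatenate the two unit-normalized representations."""
    return normalize(net_vec) + normalize(text_vec)


def direct_product(net_vec, text_vec):
    """Flattened outer (Kronecker) product of the unit-normalized representations."""
    return [x * y for x in normalize(net_vec) for y in normalize(text_vec)]


# Hypothetical embeddings of two papers (values invented for illustration).
net_1, text_1 = [1.0, 0.0], [1.0, 1.0, 1.0]
net_2, text_2 = [1.0, 1.0], [1.0, 0.0, 0.0]

cos_net = cos(net_1, net_2)      # similarity from the citation network
cos_text = cos(text_1, text_2)   # similarity from the content
cos_sum = cos(direct_sum(net_1, text_1), direct_sum(net_2, text_2))
cos_prod = cos(direct_product(net_1, text_1), direct_product(net_2, text_2))
```

The design consequence is that a direct sum lets one strong signal compensate for a weak one, whereas a direct product requires both the citation and the content signal to agree before two papers count as similar. Note also that the direct product has dimension equal to the product of the two input dimensions, which may be costly for long embeddings.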

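Could we rely on the whole community to build a list of disciplinary concepts and to tag the literature with those concepts? Natural language processing techniques could then be used to discover and revise the concept set, the concept tags and the classification labels.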

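Until then, we can only start by experimenting with data from mathematics, physics, economics and the life sciences. The life sciences in particular have full texts available in PMC.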

As for the clustering algorithm, one could also consider removing papers with high cross-community overlap.
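The page does not define how cross-community overlap would be measured. One simple possibility, offered only as an assumption, is the share of a paper's citation links that lead outside its own cluster; papers above a chosen cut-off are then set aside before re-clustering. The function names, the `max_fraction` parameter and the toy graph below are hypothetical.

```python
def cross_cluster_fraction(node, neighbors, labels):
    """Share of a paper's citation links that lead to a different cluster."""
    linked = neighbors.get(node, ())
    if not linked:
        return 0.0
    outside = sum(1 for other in linked if labels[other] != labels[node])
    return outside / len(linked)


def drop_boundary_papers(neighbors, labels, max_fraction=0.5):
    """Keep only papers whose cross-cluster fraction is at most `max_fraction`."""
    return sorted(
        node for node in labels
        if cross_cluster_fraction(node, neighbors, labels) <= max_fraction
    )


# Toy undirected citation graph, invented for illustration.
# Paper "x" is assigned to cluster 0 but mostly links into cluster 1.
neighbors = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "x"],
    "x": ["c", "d", "e"],
    "d": ["e", "f", "x"],
    "e": ["d", "f", "x"],
    "f": ["d", "e"],
}
labels = {"a": 0, "b": 0, "c": 0, "x": 0, "d": 1, "e": 1, "f": 1}
kept = drop_boundary_papers(neighbors, labels)
```

Here only `x` is removed, since two of its three links cross the cluster boundary. Whether dropping such papers helps or instead discards genuinely interdisciplinary work is an open question that would need to be checked on real data.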

Algorithms by Kevin, Nees and others, and their comparison

Compute word2vec representations based on full text, or on full text plus citations, and hand them to Kevin for comparison.
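How the comparison itself would be carried out is not specified here. As one common option, and not necessarily the procedure Kevin would use, two cluster assignments of the same papers can be compared with normalized mutual information (NMI), sketched below with the arithmetic-mean normalization. The function name and the example labels are assumptions.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two cluster assignments of the same papers.

    Uses the arithmetic-mean normalization 2 * I(A; B) / (H(A) + H(B)).
    """
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    mutual = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a, h_b = entropy(count_a), entropy(count_b)
    if h_a + h_b == 0:
        return 1.0  # both assignments put every paper in a single cluster
    return 2 * mutual / (h_a + h_b)


# Same grouping under different label names, and a grouping that is
# statistically independent of the first one.
same = normalized_mutual_information([0, 0, 1, 1], ["x", "x", "y", "y"])
unrelated = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

NMI ignores the label names themselves, so two solutions that group the papers identically score 1 even if their cluster identifiers differ, while independent groupings score 0.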

References

  1. K. Boyack, W. Glänzel, J. Gläser, F. Havemann, A. Scharnhorst, B. Thijs, N. van Eck, T. Velden & L. Waltman, Topic identification challenge, Scientometrics (2017) 111: 1223. doi:10.1007/s11192-017-2307-0
  2. J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—Different results? Towards a comparative approach to the identification of thematic structures in science. Special Issue of Scientometrics. doi:10.1007/s11192-017-2296-z
  3. Boyack, K. W. (2017a). Investigating the effect of global data on topic detection. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2297-y
  4. Boyack, K. W. (2017b). Thesaurus-based methods for mapping contents of publication sets. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2304-3
  5. Boyack, K. W., & Klavans, R. (2010). Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology, 61(12), 2389–2404
  6. Glänzel, W., & Thijs, B. (2017). Using hybrid methods and ‘core documents’ for the representation of clusters and topics: The astronomy dataset. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2301-6
  7. Havemann, F., Gläser, J., & Heinz, M. (2017). Memetic search for overlapping topics based on a local evaluation of link communities. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2302-5
  8. Klavans, R., & Boyack, K. W. (2015). Which type of citation analysis generates the most accurate taxonomy of scientific and technical knowledge? http://arxiv.org/abs/1511.05078.
  9. Koopman, R., & Wang, S. (2017). Mutual information based labelling and comparing clusters. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2305-2.
  10. Koopman, R., Wang, S., & Scharnhorst, A. (2017). Contextualization of topics (extended): Browsing through the universe of bibliographic information. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2303-4
  11. Šubelj, L., van Eck, N. J., & Waltman, L. (2016). Clustering scientific publications based on citation relations: A systematic comparison of different methods. PLoS ONE, 11(4), e0154404
  12. Van Eck, N. J., & Waltman, L. (2017). Citation-based clustering of publications using CitNetExplorer and VOSviewer. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2300-7
  13. Velden, T., Yan, S., & Lagoze, C. (2017b). Mapping the cognitive structure of astrophysics by infomap: Clustering of the citation network and topic affinity analysis. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2299-9.
  14. Waltman, L., & van Eck, N. J. (2012). A new methodology for constructing a publication-level classification system of science. Journal of the American Society for Information Science and Technology, 63(12), 2378–2392
  15. Wang, S., & Koopman, R. (2017). Clustering articles based on semantic similarity. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2298-x
  16. Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84
  17. Thijs, B., Glänzel, W., & Meyer, M. S. (2015). Using noun phrases extraction for the improvement of hybrid clustering with text- and citation-based components. The example of “Information Systems Research”. In Proc. of the Workshop Mining Scientific Papers: Computational Linguistics and Bibliometrics, 15th International Society of Scientometrics and Informetrics Conference (ISSI), Istanbul, Turkey, 29/6/2015
  18. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  19. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
  20. Le, Q., & Mikolov, T. (2014, January). Distributed representations of sentences and documents. In International Conference on Machine Learning (pp. 1188-1196).
  21. Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 855-864). ACM.