Category: Natural language processing techniques for tasks such as article topic identification

From Big Physics
Revision by Jinshanw (talk | contribs) as of 16:46, 30 June 2017 (Friday)


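Classification underlies much research. For example, moving from knowledge of individual articles to an understanding of a whole field or subfield requires coarse-graining, which in turn requires identifying the topic of each article. As another example, the average number of citations differs from field to field, so comparing citation counts across fields requires converting them into a common "currency", and that conversion rests on good field identification.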

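The main idea of this identification work is to derive a network similarity from the article citation network, combine it with a semantic similarity judged from article titles, abstracts, or even full text, and then use these similarities for some form of clustering. As for concrete methods, the plan is to consider a combination of word2vec and LDA.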
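The combination of the two similarities can be sketched as follows. This is only an illustration under stated assumptions, not the method proposed on this page: the network similarity is taken to be the Jaccard overlap of reference lists (bibliographic coupling), the semantic similarity is a plain bag-of-words cosine standing in for word2vec, and the clustering is simply the connected components of the graph of pairs whose combined similarity exceeds a threshold. The function names, the mixing weight `alpha`, and `threshold` are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def coupling_similarity(refs_a, refs_b):
    """Jaccard overlap of two reference lists (bibliographic coupling)."""
    a, b = set(refs_a), set(refs_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def hybrid_clusters(papers, alpha=0.5, threshold=0.3):
    """Return one cluster label per paper.

    papers: list of dicts with keys "text" and "refs" (assumed layout).
    Two papers end up in the same cluster when they are linked by a chain
    of pairs whose hybrid similarity exceeds `threshold`.
    """
    n = len(papers)
    parent = list(range(n))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            network = coupling_similarity(papers[i]["refs"], papers[j]["refs"])
            semantic = cosine_similarity(papers[i]["text"], papers[j]["text"])
            if alpha * network + (1 - alpha) * semantic > threshold:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]


# Toy data, invented for illustration only.
papers = [
    {"text": "galaxy cluster dark matter halo", "refs": ["r1", "r2", "r3"]},
    {"text": "dark matter halo simulation", "refs": ["r2", "r3", "r4"]},
    {"text": "protein folding molecular dynamics", "refs": ["r7", "r8"]},
    {"text": "molecular dynamics of protein structure", "refs": ["r8", "r9"]},
]
labels = hybrid_clusters(papers)
print(labels)  # → [0, 0, 1, 1]
```

A linear mix is the simplest way to combine the two signals; in practice the weight and the clustering algorithm would need to be chosen and validated on real data.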

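LDA (Latent Dirichlet Allocation) rests on several key points: the decomposition [math]\displaystyle{ P\left(w|d\right)=\sum_{t}P\left(w|t\right)P\left(t|d\right) }[/math]; the fact that [math]\displaystyle{ P\left(w|t\right) }[/math] is obtained from the whole corpus; and the specific Dirichlet functional form assumed for the distributions. What I would like to build now is a parameter-free topic discovery method that does not assume any specific functional form. It could be a self-consistent iterative algorithm: for example, start from some assumed [math]\displaystyle{ P\left(w|t\right) }[/math], solve for [math]\displaystyle{ P\left(t|d\right) }[/math], then update [math]\displaystyle{ P\left(w|t\right) }[/math] again, and repeat. It could also be an optimization algorithm under some objective, for example the objective that, for every document [math]\displaystyle{ d }[/math], the residual [math]\displaystyle{ P_{em}\left(w|d\right)-\sum_{t}P\left(w|t\right)P\left(t|d\right) }[/math] vanishes or is made as small as possible, where [math]\displaystyle{ P_{em}\left(w|d\right) }[/math] is read here as the empirical word distribution of document [math]\displaystyle{ d }[/math].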
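One concrete instance of such a self-consistent iteration is the EM scheme known from probabilistic latent semantic analysis (PLSA), which uses no Dirichlet prior. The sketch below is offered only as an illustration, not as the method this page is looking for. It alternates between splitting each observed word count among the topics and re-estimating [math]\displaystyle{ P\left(w|t\right) }[/math] and [math]\displaystyle{ P\left(t|d\right) }[/math] from those splits. Maximizing the likelihood in this way is equivalent to minimizing the document-length-weighted Kullback-Leibler divergence between [math]\displaystyle{ P_{em}\left(w|d\right) }[/math] and [math]\displaystyle{ \sum_{t}P\left(w|t\right)P\left(t|d\right) }[/math]. The function names, the toy counts, and the number of topics are assumptions made for the example; the number of topics still has to be supplied, so the sketch is not fully parameter-free.

```python
import math
import random


def normalize(row):
    """Scale a list of non-negative numbers so that it sums to 1."""
    total = sum(row)
    return [x / total for x in row]


def fit_topics(counts, n_topics, n_iter=200, seed=0):
    """EM iteration for P(w|d) = sum_t P(w|t) P(t|d).

    counts[d][w] is the number of occurrences of word w in document d.
    Returns (p_w_t, p_t_d) with p_w_t[t][w] = P(w|t), p_t_d[d][t] = P(t|d).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random strictly positive starting guesses.
    p_w_t = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_t_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]

    for _ in range(n_iter):
        new_w_t = [[0.0] * n_words for _ in range(n_topics)]
        new_t_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                joint = [p_w_t[t][w] * p_t_d[d][t] for t in range(n_topics)]
                total = sum(joint)
                if counts[d][w] == 0 or total == 0.0:
                    continue
                # Split the observed count among topics (E-step) and
                # accumulate the shares for the new estimates (M-step).
                for t in range(n_topics):
                    share = counts[d][w] * joint[t] / total
                    new_w_t[t][w] += share
                    new_t_d[d][t] += share
        p_w_t = [normalize(row) for row in new_w_t]
        p_t_d = [normalize(row) for row in new_t_d]
    return p_w_t, p_t_d


def log_likelihood(counts, p_w_t, p_t_d):
    """Log-likelihood of the counts under the mixture model."""
    ll = 0.0
    for d, row in enumerate(counts):
        for w, c in enumerate(row):
            if c > 0:
                model = sum(p_w_t[t][w] * p_t_d[d][t] for t in range(len(p_w_t)))
                ll += c * math.log(model)
    return ll


def max_residual(counts, p_w_t, p_t_d):
    """Largest |P_em(w|d) - sum_t P(w|t) P(t|d)| over all d and w."""
    worst = 0.0
    for d, row in enumerate(counts):
        n = sum(row)
        for w, c in enumerate(row):
            model = sum(p_w_t[t][w] * p_t_d[d][t] for t in range(len(p_w_t)))
            worst = max(worst, abs(c / n - model))
    return worst


# Toy word counts, invented for illustration: three documents, four words.
counts = [[4, 4, 0, 0], [0, 0, 3, 3], [2, 2, 2, 2]]
p_w_t, p_t_d = fit_topics(counts, n_topics=2)
print("largest residual:", max_residual(counts, p_w_t, p_t_d))
```

Each EM step cannot decrease the likelihood, but the result depends on the starting guess and may be a local optimum, so several random seeds are usually worth trying.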

