Category: Natural language processing techniques for tasks such as article topic identification
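Classification underlies much research. For example, moving from knowledge of individual articles to an understanding of a whole field or one of its subfields requires coarse-graining, and that in turn requires article topic identification. Likewise, the average number of citations differs from field to field, so comparing citation counts across fields requires converting them into a common "currency", and that conversion rests on good field identification.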
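The main idea of this identification work is to derive a network similarity from the article citation network, combine it with a semantic similarity derived from article titles, abstracts, or even full text, and then use these similarities for some form of clustering. As for the concrete method, we are considering a combination of word2vec and LDA.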
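As a rough illustration of the idea above, the following sketch combines a citation-based similarity with a text-based similarity and groups documents whose combined similarity is high. Everything specific in it is an assumption made for illustration: the toy data, the use of cosine similarity for both parts, the weight `alpha`, the cut-off `threshold`, and the use of connected components as the "clustering" step. Plain word counts stand in for the semantic part here; word2vec or LDA representations could take their place.

```python
import numpy as np

def cosine_similarity(rows):
    """Pairwise cosine similarity between the rows of a matrix."""
    rows = np.asarray(rows, dtype=float)
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows get similarity 0 instead of NaN
    unit = rows / norms
    return unit @ unit.T

def connected_components(adjacent):
    """Label each node with the index of its connected component."""
    n = len(adjacent)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if adjacent[i][j] and labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Toy data, invented for illustration: 4 documents.
# cites[i, j] = 1 if document i cites reference j.
cites = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]])
# counts[i, w] = occurrences of word w in the title/abstract of document i.
counts = np.array([[2, 1, 0, 0],
                   [1, 2, 0, 0],
                   [0, 0, 1, 2],
                   [0, 1, 2, 1]])

alpha = 0.5      # hypothetical weight of the network similarity
threshold = 0.5  # hypothetical cut-off for linking two documents

network_sim = cosine_similarity(cites)    # shared-reference similarity
semantic_sim = cosine_similarity(counts)  # word-usage similarity
combined = alpha * network_sim + (1 - alpha) * semantic_sim
labels = connected_components(combined >= threshold)
print(labels)
```

With this toy data, documents 0 and 1 end up in one group and documents 2 and 3 in another. How to weight the two similarities and which clustering method to use remain open choices.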
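LDA (Latent Dirichlet allocation) has several key ingredients: the mixture formula [math]\displaystyle{ P\left(w|d\right)=\sum_{t}P\left(w|t\right)P\left(t|d\right) }[/math]; the fact that [math]\displaystyle{ P\left(w|t\right) }[/math] comes from the whole corpus; and the specific Dirichlet functional form of the distributions. I would now like to build a parameter-free topic discovery method that does not assume any specific functional form. It could be a self-consistent iterative algorithm: for example, start from some assumed [math]\displaystyle{ P\left(w|t\right) }[/math], solve for [math]\displaystyle{ P\left(t|d\right) }[/math], then update [math]\displaystyle{ P\left(w|t\right) }[/math] in turn. It could also be an optimization algorithm under some objective, for example the objective that for every document [math]\displaystyle{ d }[/math] the difference [math]\displaystyle{ P_{em}\left(w|d\right)-\sum_{t}P\left(w|t\right)P\left(t|d\right) }[/math] is minimized (in some suitable norm over the words [math]\displaystyle{ w }[/math]).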
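To make the mixture formula and the objective concrete, here is a small sketch that writes [math]\displaystyle{ P\left(w|d\right) }[/math] as a matrix product and evaluates the residual per document. It rests on assumptions not stated in the text: that [math]\displaystyle{ P_{em}\left(w|d\right) }[/math] means the empirical word frequencies of document [math]\displaystyle{ d }[/math], that the residual is measured by a sum of squares, and that the toy numbers are invented.

```python
import numpy as np

def mixture(p_w_given_t, p_t_given_d):
    """P(w|d) = sum_t P(w|t) P(t|d), written as a matrix product."""
    return p_w_given_t @ p_t_given_d

def objective(p_em, p_w_given_t, p_t_given_d):
    """Squared residual (an assumed choice of norm), one value per document."""
    residual = p_em - mixture(p_w_given_t, p_t_given_d)
    return (residual ** 2).sum(axis=0)

# Toy model: 3 words, 2 topics, 2 documents. Every column sums to 1.
p_w_given_t = np.array([[0.7, 0.1],   # rows: words, columns: topics
                        [0.2, 0.1],
                        [0.1, 0.8]])
p_t_given_d = np.array([[0.9, 0.2],   # rows: topics, columns: documents
                        [0.1, 0.8]])

# Assumed meaning of P_em: word counts normalized within each document.
counts = np.array([[6, 2],
                   [2, 1],
                   [2, 7]], dtype=float)
p_em = counts / counts.sum(axis=0, keepdims=True)

p_model = mixture(p_w_given_t, p_t_given_d)
loss = objective(p_em, p_w_given_t, p_t_given_d)
print(p_model)
print(loss)
```

Because both factors are column-stochastic, every column of `p_model` is again a probability distribution over words, so the comparison with `p_em` is between distributions of the same kind.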
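However, I have not yet found the update algorithm, and this objective alone does not seem to be enough; more needs to be drawn from the details of LDA. The objective admits trivial solutions: [math]\displaystyle{ P\left(w|t\right)=\delta_{wt} }[/math] (every word is a topic), so that [math]\displaystyle{ P\left(w|d\right)=P\left(t|d\right) }[/math]; or [math]\displaystyle{ P\left(t|d\right)=\delta_{td} }[/math] (every document is a topic), so that [math]\displaystyle{ P\left(w|d\right)=P\left(w|t\right) }[/math]. Relying on this objective alone therefore does not work. Of course, the external parameter, the number of topics, can be treated as a constraint. It may well be that this objective together with that constraint is already enough to give a non-trivial topic classification. There are also other possible constraints, for example the similarity-preserving condition [math]\displaystyle{ \sum_{w}P\left(d_{1}|w\right)P\left(w|d_{2}\right)=\sum_{t}P\left(d_{1}|t\right)P\left(t|d_{2}\right) }[/math]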
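The two trivial solutions, and the effect of fixing the number of topics, can be checked numerically. The sketch below first verifies that both trivial choices reproduce the data exactly, then fixes a smaller number of topics and runs an alternating update. The update used is the standard EM-style multiplicative update known from probabilistic latent semantic analysis, which minimizes a Kullback-Leibler divergence rather than a plain difference; it is shown only as one existing scheme of the alternating type, not as the algorithm sought here. The toy counts, `n_topics`, `n_iter` and `seed` are all hypothetical.

```python
import numpy as np

def normalize_columns(m):
    """Rescale every column so that it sums to 1."""
    return m / m.sum(axis=0, keepdims=True)

def kl_loss(p_em, model):
    """Sum over documents of KL(P_em(.|d) || model(.|d)), skipping zeros."""
    mask = p_em > 0
    return float((p_em[mask] * np.log(p_em[mask] / model[mask])).sum())

def fit_topics(p_em, n_topics, n_iter=100, seed=0):
    """Alternately update P(w|t) and P(t|d) for a fixed number of topics."""
    rng = np.random.default_rng(seed)
    n_words, n_docs = p_em.shape
    p_wt = rng.dirichlet(np.ones(n_words), size=n_topics).T  # words x topics
    p_td = rng.dirichlet(np.ones(n_topics), size=n_docs).T   # topics x docs
    history = [kl_loss(p_em, p_wt @ p_td)]
    for _ in range(n_iter):
        ratio = p_em / (p_wt @ p_td + 1e-12)
        # Both updates use the factors from the previous iteration.
        new_wt = normalize_columns(p_wt * (ratio @ p_td.T))
        new_td = normalize_columns(p_td * (p_wt.T @ ratio))
        p_wt, p_td = new_wt, new_td
        history.append(kl_loss(p_em, p_wt @ p_td))
    return p_wt, p_td, history

# Toy counts, invented for illustration: 6 words (rows) x 4 documents.
counts = np.array([[5, 4, 0, 1],
                   [4, 5, 1, 0],
                   [3, 3, 0, 0],
                   [0, 1, 5, 4],
                   [1, 0, 4, 5],
                   [0, 0, 3, 3]], dtype=float)
p_em = normalize_columns(counts)
n_words, n_docs = p_em.shape

# Trivial solution 1: every word is a topic, P(w|t) = identity, P(t|d) = P_em.
trivial_word_residual = np.abs(p_em - np.eye(n_words) @ p_em).max()
# Trivial solution 2: every document is a topic, P(t|d) = identity.
trivial_doc_residual = np.abs(p_em - p_em @ np.eye(n_docs)).max()

# Constrained case: only 2 topics, fewer than both words and documents.
p_wt, p_td, history = fit_topics(p_em, n_topics=2)
print(trivial_word_residual, trivial_doc_residual)
print(history[0], history[-1])
```

Both trivial residuals are exactly zero, which is why the objective alone cannot select a meaningful answer. With the topic count fixed at 2 the fit is no longer exact, and the factors have to compress the data instead of copying it.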
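In addition, the problem of solving for [math]\displaystyle{ P\left(t|d\right) }[/math] when [math]\displaystyle{ P\left(w|t\right) }[/math] is known also needs a literature check to see whether anyone has already studied it. This problem can in fact be viewed as solving a non-square system of linear equations.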
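A minimal sketch of this view, assuming the number of words exceeds the number of topics so that the system is overdetermined: for one document, the known [math]\displaystyle{ P\left(w|t\right) }[/math] is the coefficient matrix, the observed [math]\displaystyle{ P\left(w|d\right) }[/math] is the right-hand side, and [math]\displaystyle{ P\left(t|d\right) }[/math] is the unknown. The toy numbers are invented, and ordinary least squares via `numpy.linalg.lstsq` is only one possible solver.

```python
import numpy as np

# Known P(w|t): 3 words (rows) x 2 topics (columns), so the system is non-square.
p_w_given_t = np.array([[0.7, 0.1],
                        [0.2, 0.1],
                        [0.1, 0.8]])

# Observed P(w|d) for a single document, generated here from a known mixture
# so that the recovered answer can be checked.
true_p_t_given_d = np.array([0.9, 0.1])
p_w_given_d = p_w_given_t @ true_p_t_given_d

# Least-squares solution of p_w_given_t @ x = p_w_given_d.
solution, residuals, rank, singular_values = np.linalg.lstsq(
    p_w_given_t, p_w_given_d, rcond=None
)
print(solution)
print(rank)
```

Plain least squares does not enforce that the solution is non-negative or sums to 1. Here both hold only because the data were generated exactly from the model; with noisy data a constrained solver such as non-negative least squares (for instance `scipy.optimize.nnls`) followed by normalization may be more appropriate.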
References
- K. Boyack, W. Glänzel, J. Gläser, F. Havemann, A. Scharnhorst, B. Thijs, N. van Eck, T. Velden & L. Waltman, Topic identification challenge, Scientometrics (2017) 111: 1223. doi:10.1007/s11192-017-2307-0
- J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—Different results? Towards a comparative approach to the identification of thematic structures in science. Special Issue of Scientometrics. doi:10.1007/s11192-017-2296-z
- Boyack, K. W. (2017a). Investigating the effect of global data on topic detection. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2297-y
- Boyack, K. W. (2017b). Thesaurus-based methods for mapping contents of publication sets. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2304-3
- Boyack, K. W., & Klavans, R. (2010). Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology, 61(12), 2389–2404
- Glänzel, W., & Thijs, B. (2017). Using hybrid methods and 'core documents' for the representation of clusters and topics: The astronomy dataset. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2301-6
- Havemann, F., Gläser, J., & Heinz, M. (2017). Memetic search for overlapping topics based on a local evaluation of link communities. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2302-5
- Klavans, R., & Boyack, K. W. (2015). Which type of citation analysis generates the most accurate taxonomy of scientific and technical knowledge? http://arxiv.org/abs/1511.05078.
- Koopman, R., & Wang, S. (2017). Mutual information based labelling and comparing clusters. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2305-2.
- Koopman, R., Wang, S., & Scharnhorst, A. (2017). Contextualization of topics (extended): Browsing through the universe of bibliographic information. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2303-4
- Šubelj, L., van Eck, N. J., & Waltman, L. (2016). Clustering scientific publications based on citation relations: A systematic comparison of different methods. PLoS ONE, 11(4), e0154404
- Van Eck, N. J., & Waltman, L. (2017). Citation-based clustering of publications using CitNetExplorer and VOSviewer. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2300-7
- Velden, T., Yan, S., & Lagoze, C. (2017b). Mapping the cognitive structure of astrophysics by infomap: Clustering of the citation network and topic affinity analysis. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2299-9.
- Waltman, L., & van Eck, N. J. (2012). A new methodology for constructing a publication-level classification system of science. Journal of the American Society for Information Science and Technology, 63(12), 2378–2392
- Wang, S., & Koopman, R. (2017). Clustering articles based on semantic similarity. In J. Gläser, A. Scharnhorst & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics. doi:10.1007/s11192-017-2298-x
- Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84
- Thijs, Bart and Glänzel, Wolfgang and Meyer, Martin S. (2015) Using noun phrases extraction for the improvement of hybrid clustering with text- and citation-based components. The example of “Information Systems Research”. In: Proc. of the Workshop Mining Scientific Papers: Computational Linguistics and Bibliometrics, 15th International Society of Scientometrics and Informetrics Conference (ISSI), Istanbul, Turkey, 29/6/2015, Istanbul