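Category: Author Identification

Problem description

The finest level of author granularity treats every article as written by a different author, that is, authors are indexed by article, for example First_Last_DOI. The coarsest can be First Initial_Last, which treats all records with the same first-name initial and the same surname as one author. Strictly speaking, the latter is still not the coarsest possible, because a person who changed their given name or surname is split into two authors; we ignore this at the current level of discussion. The author identification problem is then to find a partition between the finest and the coarsest, such that all articles by the same real author are mapped to the same author.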
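
As an illustration of the two extremes, the following minimal sketch builds both keys for a handful of records. The record layout, the function names, and the DOIs are made up for this example and do not come from any real dataset.

```python
def finest_key(first: str, last: str, doi: str) -> str:
    """One author per (name, paper): the finest possible partition."""
    return f"{first}_{last}_{doi}"


def coarsest_key(first: str, last: str) -> str:
    """One author per (first initial, surname): the coarse baseline."""
    return f"{first[0].upper()}_{last}"


# Hypothetical records: (first name, last name, DOI).
records = [
    ("Wei", "Wang", "10.1000/a1"),
    ("Wen", "Wang", "10.1000/b2"),
    ("Wei", "Wang", "10.1000/c3"),
]

fine = {finest_key(*r) for r in records}
coarse = {coarsest_key(r[0], r[1]) for r in records}

# Three papers give three "authors" at the finest level and one at the coarsest.
print(len(fine), len(coarse))
```

Any useful identification scheme has to land somewhere between these two counts.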

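If we consider only Western and Japanese authors, the full surname plus the first initial of the given name (First Initial_Last) is already fairly accurate^[1]. However, because far too many East Asian authors share the same surname^[2], we need better name identification algorithms. Several recent works use machine learning for name disambiguation^[3]^[4]^[5]^[6]^[7]^[8].

At this stage, however, we care about discovery-style solutions, that is, unsupervised learning. Supervised learning on labeled data is not considered for now.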

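Main ideas

Here we want to see whether a multilayer network framework comprising topics, authors, and papers can identify authors better. The concrete algorithm could be some indirect-connection algorithm on the multilayer network, or a vector representation algorithm for entities (authors, papers, and so on).

For example, once authors are mapped to topics, then at a suitable scale the number of authors sharing a name becomes small. Identifying an author by name plus affiliation plus field should therefore already be fairly strict. If it is too strict, coauthors can be used to merge records. Technically, whether an author belongs to a topic can be decided from the topic labels of their papers, or by a paper topic classification algorithm.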
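
The "strict key first, then merge through coauthors" idea can be sketched as follows. This is only one possible reading of the paragraph above: the mention records, the field names (`name`, `affil`, `field`, `coauthors`), and the rule "merge two same-name mentions if they share at least one coauthor" are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical author mentions: one per (paper, author) occurrence.
mentions = [
    {"id": 0, "name": "W_Wang", "affil": "Univ A", "field": "physics", "coauthors": {"J_Li", "M_Chen"}},
    {"id": 1, "name": "W_Wang", "affil": "Univ B", "field": "physics", "coauthors": {"J_Li"}},
    {"id": 2, "name": "W_Wang", "affil": "Univ A", "field": "biology", "coauthors": {"K_Sato"}},
    {"id": 3, "name": "W_Wang", "affil": "Univ A", "field": "physics", "coauthors": {"Y_Zhao"}},
]

# Union-find over mention ids.
parent = list(range(len(mentions)))


def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i


def union(i, j):
    parent[find(i)] = find(j)


# Step 1: strict key (name, affiliation, field).
by_key = defaultdict(list)
for m in mentions:
    by_key[(m["name"], m["affil"], m["field"])].append(m["id"])
for ids in by_key.values():
    for other in ids[1:]:
        union(ids[0], other)

# Step 2: relax the key by merging same-name mentions that share a coauthor.
by_name = defaultdict(list)
for m in mentions:
    by_name[m["name"]].append(m)
for group in by_name.values():
    for a in group:
        for b in group:
            if a["id"] < b["id"] and a["coauthors"] & b["coauthors"]:
                union(a["id"], b["id"])

clusters = defaultdict(set)
for m in mentions:
    clusters[find(m["id"])].add(m["id"])
result = sorted(sorted(c) for c in clusters.values())
print(result)  # [[0, 1, 3], [2]]
```

In this toy data the strict key joins mentions 0 and 3, the shared coauthor joins 0 and 1 despite the different affiliation, and the biology mention stays separate.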

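A more concrete algorithm could go as follows: train word2vec (which requires handling polysemy) or BERT vector representations on the full text of the articles (or only on article ID, title, author names, author addresses, abstract, and keywords; whether to include the references is a separate question), then cluster the resulting vectors. BERT in particular produces context-dependent vectors, so it may resolve polysemy directly, with the clustering step then bringing the pieces back together. The underlying assumption remains that an author's field helps to separate and merge authors; in other words, vector representations of research content (whether built from article vectors or taken directly in the word vector space) can be used for author identification.

For example, after obtaining vector representations of authors, cluster them directly, and then merge authors whose names are sufficiently similar. Alternatively, cluster all the articles of authors with similar names, so as to group together the articles that belong to one author.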
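
The second variant (cluster all papers signed with similar names) can be sketched with the standard library alone. Note the substitution: raw term counts stand in here for the word2vec or BERT vectors discussed above, and the clustering is a simple single-linkage rule with a hand-picked cosine threshold. The paper texts, the threshold of 0.4, and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical papers, all signed by an author named "W_Wang".
papers = {
    "p1": "quantum spin chain entanglement entropy",
    "p2": "entanglement entropy in quantum spin systems",
    "p3": "gene expression in tumor cells",
    "p4": "tumor cells and gene regulation",
}


def embed(text):
    """Stand-in embedding: term counts (a real system would use word2vec or BERT vectors)."""
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold):
    """Single-linkage clustering: link two papers whose similarity reaches the threshold."""
    ids = list(vectors)
    label = {i: i for i in ids}
    for a in ids:
        for b in ids:
            if a < b and cosine(vectors[a], vectors[b]) >= threshold:
                old, new = label[b], label[a]
                for k in ids:  # relabel the whole cluster of b
                    if label[k] == old:
                        label[k] = new
    groups = {}
    for i in ids:
        groups.setdefault(label[i], []).append(i)
    return sorted(sorted(g) for g in groups.values())


vectors = {pid: embed(text) for pid, text in papers.items()}
groups = cluster(vectors, threshold=0.4)
print(groups)  # [['p1', 'p2'], ['p3', 'p4']]
```

Each resulting group would then be treated as one candidate author; the choice of threshold plays the role of the "suitable scale" and would need tuning on real data.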

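Test data

Test data can be obtained from the article-author records compiled in the project reports of funding agencies. Conversely, this funded-article author data can be treated as a training set for designing a machine learning algorithm, or some diffusion algorithm, that extends this subset to the whole dataset. This is especially useful for Chinese and Korean authors.

ORCID data^[9] and institutional databases (for example, Chinese Academy of Sciences institutional data and Norwegian model data) can be used for testing and training.

There are also small-scale datasets already prepared by other researchers^[10]^[8]. We can build our own test data by matching Dimensions (which includes author ORCID iDs) with APS data, or by processing the ORCID data ourselves (it contains author names, articles, employment history, and so on). For the curated data, see Datasets.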
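
Once such ground truth is available, a predicted author partition can be scored against it. The sketch below uses pairwise precision, recall, and F1, which is one common choice for this task but is an assumption here, not something the sources above prescribe. The paper ids, the placeholder ORCID labels, and the function names are hypothetical.

```python
from itertools import combinations


def same_author_pairs(assignment):
    """Return the set of paper pairs that an assignment places under one author."""
    by_author = {}
    for paper, author in assignment.items():
        by_author.setdefault(author, []).append(paper)
    pairs = set()
    for group in by_author.values():
        pairs.update(frozenset(p) for p in combinations(group, 2))
    return pairs


def pairwise_scores(predicted, truth):
    """Pairwise precision, recall, and F1 of a predicted partition against the truth."""
    pred, gold = same_author_pairs(predicted), same_author_pairs(truth)
    hit = len(pred & gold)
    precision = hit / len(pred) if pred else 1.0
    recall = hit / len(gold) if gold else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical data: paper -> ORCID-based author (truth) and paper -> predicted cluster.
truth = {"p1": "orcid-A", "p2": "orcid-A", "p3": "orcid-A", "p4": "orcid-B"}
predicted = {"p1": "c1", "p2": "c1", "p3": "c2", "p4": "c2"}

precision, recall, f1 = pairwise_scores(predicted, truth)
print(precision, recall, f1)
```

Low precision signals that different people were merged (the typical failure of initials-based keys), while low recall signals that one person's papers were split across several predicted authors.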

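Applications and further research

The method could be implemented on databases such as WoS, Dimensions, and Microsoft Academic. If the accuracy is high, it can then be used to study questions such as scientists' age, half-life, and gender effects.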

References

[1] Staša Milojević, Accuracy of simple, initials-based methods for author name disambiguation, Journal of Informetrics 7, 767-773 (2013). https://doi.org/10.1016/j.joi.2013.06.006
[2] Jinseok Kim and Jana Diesner, Distortive effects of initial-based name disambiguation on measurements of large-scale coauthorship networks, JASIST 67, 1446-1461 (2016). https://doi.org/10.1002/asi.23489
[3] Christian Schulz, Amin Mazloumian, Alexander M Petersen, Orion Penner and Dirk Helbing, Exploiting citation networks for large-scale author name disambiguation, EPJ Data Science 3:11 (2014). https://doi.org/10.1140/epjds/s13688-014-0011-3
[4] Wei-Sheng Chin, Yong Zhuang, Yu-Chin Juan, Felix Wu, Hsiao-Yu Tung, Tong Yu, Jui-Pin Wang, Cheng-Xia Chang, Chun-Pai Yang, Wei-Cheng Chang, Kuan-Hao Huang, Tzu-Ming Kuo, Shan-Wei Lin, Young-San Lin, Yu-Chen Lu, Yu-Chuan Su, Cheng-Kuang Wei, Tu-Chun Yin, Chun-Liang Li, Ting-Wei Lin, Cheng-Hao Tsai, Shou-De Lin, Hsuan-Tien Lin, Chih-Jen Lin, Effective String Processing and Matching for Author Disambiguation. http://jmlr.org/papers/v15/chin14a.html
[5] Ruijie Wang, Yuchen Yan, Chuan Wen, Yunsong Zhou, Jiefeng Gao, Weinan Zhang, Xinbing Wang, Shanghai Jiao Tong University, wjerry5@sjtu.edu.cn. 1997. Author Name Disambiguation on Heterogeneous Information Network with Adversarial Representation Learning. In Proceedings of ACM Woodstock conference (WOODSTOCK’97). ACM, New York, NY, USA, Article 4, 11 pages. https://doi.org/10.475/123_4
[6] Müller MC. (2017) Semantic Author Name Disambiguation with Word Embeddings. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham. https://www.h-its.org/wp-content/uploads/2018/01/semantic_and_tpdl_2017.pdf
[7] Jun Xu, Siqi Shen, Dongsheng Li, Yongquan Fu, A Network-embedding Based Method for Author Disambiguation, CIKM’18, October 22-26, 2018, Torino, Italy. https://dl.acm.org/citation.cfm?id=3269272
[8] Dong, Yuxiao and Chawla, Nitesh V and Swami, Ananthram, metapath2vec: Scalable Representation Learning for Heterogeneous Networks. https://ericdongyx.github.io/metapath2vec/m2v.html
[9] Bohannon J, Doran K (2017) Introducing ORCID. Science 356(6339) 691-692. https://doi.org/10.1126/science.356.6339.691; Bohannon J, Doran K (2017) Data from: Introducing ORCID. Dryad Digital Repository. https://doi.org/10.5061/dryad.48s16
[10] Müller, MC., Reitz, F. & Roy, N., Data sets for author name disambiguation: an empirical analysis and a new resource, Scientometrics (2017) 111: 1467. https://doi.org/10.1007/s11192-017-2363-5
