Big Physics - User contributions [zh-cn]

Category: Work progress of 宋玉鲲 (分类:工作进展之宋玉鲲)

2022-07-10T01:46:21Z

Songyk:

<accesscontrol>Songyk(ro)</accesscontrol>

[[分类:工作进展板]]
[[分类:宋玉鲲]]

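# Research
## Tasks and algorithms for building a science-of-science concept network
### Automatically build a rough concept map from a passage of text using simple natural language processing techniques
### Implement existing concept extraction and relation extraction algorithms and evaluate them on science-of-science paper data
### Combine manual work and algorithms to build a science-of-science concept network, or even a three-layer network
## Chinese character testing algorithm (algorithm and experiments)
## English word learning order and testing algorithm
### English word etymology data (Wiktionary/google/www.etymonline.com) (in progress)
### Algorithm design and experimental study
## Applying the science-of-science three-layer network to science-of-science research such as paper novelty
## Program, simulation, and paper writing for the generational reproduction number of infectious diseases
# Study
## Natural language processing, knowledge extraction and representation
## Probabilistic graphical models
# Team management
## Team software platform administration
### Routine maintenance; automation scripts; maintenance documentation (partially done)
### Migrate bigphysics to a new server (pending)
### Support HTTPS access (done)
### Add an access-control extension to bigphysics, e.g. Extension:AccessControl (done)
## Build the website of the 教育系统科学研究中心 (Research Center for Educational System Science) (done)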

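Literature discussion schedule for concept extraction and concept relation mining (概念抽取和概念关系挖掘文献讨论安排)

2021-12-25T14:28:14Z

Songyk:

[[分类:概念抽取和概念关系挖掘]]

For the full literature list, see [[:Category:概念抽取和概念关系挖掘|Concept extraction and concept relation mining]].
Presenters can pick papers to share from the [https://github.com/thunlp/KRLPapers/blob/master/README.md KRLPapers GitHub project], or find relevant papers on their own.

Presenters are encouraged to create a new page on this site with a summary and review of the paper they present. Please add two categories to the new page: 文献讨论 and 概念抽取和概念关系挖掘. Here is an example: [[:Category:End-to-end_Named_Entity_Recognition_and_Relation_Extraction_using_Pre-trained_Language_Models|End-to-end Named Entity Recognition and Relation Extraction using Pre-trained Language Models]].

The discussion schedule is below (continuously updated):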
{| class="wikitable"
! Paper title
! Presenter
! Date
|-
| Translating Embeddings for Modeling Multi-relational Data
|| 宋玉鲲 || 2021-10-17
|-
| [[:Category:Measuring prerequisite relations among concepts|Measuring prerequisite relations among concepts]]
|| 邓招奇 || 2021-10-24
|-
| Document-Level Definition Detection in Scholarly Documents: Existing Models, Error Analyses, and Future Directions
|| 骆慧颖 || 2021-10-31
|-
| [[:Category:End-to-end_Named_Entity_Recognition_and_Relation_Extraction_using_Pre-trained_Language_Models|End-to-end Named Entity Recognition and Relation Extraction using Pre-trained Language Models]]
|| 焦奕霖 || 2021-11-07
|-
| Span-based Joint Entity and Relation Extraction with Transformer Pre-training
|| 骆慧颖 || 2021-11-28
|-
| [[:DocRED:_A_Large-Scale_Document-Level_Relation_Extraction_Dataset|DocRED: A Large-Scale Document-Level Relation Extraction Dataset]]
|| 宋玉鲲 || 2021-12-26

|}

DocRED: A Large-Scale Document-Level Relation Extraction Dataset

2021-12-25T14:25:09Z

Songyk: Created new page with content “Category:文献讨论分类:概念抽取和概念关系挖掘 Yao, Yuan, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang,…”

[[Category:文献讨论]]
[[分类:概念抽取和概念关系挖掘]]

Yao, Yuan, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. “DocRED: A Large-Scale Document-Level Relation Extraction Dataset.” In ''Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics'', 764–77. Florence, Italy: Association for Computational Linguistics, 2019. https://doi.org/10.18653/v1/P19-1074.

==Abstract==
Multiple entities in a document generally exhibit complex inter-sentence relations, and cannot be well handled by existing relation extraction (RE) methods that typically focus on extracting intra-sentence relations for single entity pairs. In order to accelerate the research on document-level RE, we introduce DocRED, a new dataset constructed from Wikipedia and Wikidata with three features: (1) DocRED annotates both named entities and relations, and is the largest human-annotated dataset for document-level RE from plain text; (2) DocRED requires reading multiple sentences in a document to extract entities and infer their relations by synthesizing all information of the document; (3) along with the human-annotated data, we also offer large-scale distantly supervised data, which enables DocRED to be adopted for both supervised and weakly supervised scenarios. In order to verify the challenges of document-level RE, we implement recent state-of-the-art methods for RE and conduct a thorough evaluation of these methods on DocRED. Empirical results show that DocRED is challenging for existing RE methods, which indicates that document-level RE remains an open problem and requires further efforts. Based on the detailed analysis on the experiments, we discuss multiple promising directions for future research. We make DocRED and the code for our baselines publicly available at https://github.com/thunlp/DocRED.

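==Summary and comments==
The authors release DocRED, a large-scale dataset for document-level relation extraction. The dataset is built from Wikipedia and Wikidata and provides both human-annotated data and distantly supervised data to support different usage scenarios. The authors evaluate several relation extraction methods that were state of the art at the time on DocRED.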

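Literature discussion schedule for concept extraction and concept relation mining (概念抽取和概念关系挖掘文献讨论安排)

2021-12-25T14:09:47Z

Songyk:

[[分类:概念抽取和概念关系挖掘]]

For the full literature list, see [[:Category:概念抽取和概念关系挖掘|Concept extraction and concept relation mining]].
Presenters can pick papers to share from the [https://github.com/thunlp/KRLPapers/blob/master/README.md KRLPapers GitHub project], or find relevant papers on their own.

Presenters are encouraged to create a new page on this site with a summary and review of the paper they present. Please add two categories to the new page: 文献讨论 and 概念抽取和概念关系挖掘. Here is an example: [[:Category:End-to-end_Named_Entity_Recognition_and_Relation_Extraction_using_Pre-trained_Language_Models|End-to-end Named Entity Recognition and Relation Extraction using Pre-trained Language Models]].

The discussion schedule is below (continuously updated):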
{| class="wikitable"
! Paper title
! Presenter
! Date
|-
| Translating Embeddings for Modeling Multi-relational Data
|| 宋玉鲲 || 2021-10-17
|-
| [[:Category:Measuring prerequisite relations among concepts|Measuring prerequisite relations among concepts]]
|| 邓招奇 || 2021-10-24
|-
| Document-Level Definition Detection in Scholarly Documents: Existing Models, Error Analyses, and Future Directions
|| 骆慧颖 || 2021-10-31
|-
| [[:Category:End-to-end_Named_Entity_Recognition_and_Relation_Extraction_using_Pre-trained_Language_Models|End-to-end Named Entity Recognition and Relation Extraction using Pre-trained Language Models]]
|| 焦奕霖 || 2021-11-07
|-
| Span-based Joint Entity and Relation Extraction with Transformer Pre-training
|| 骆慧颖 || 2021-11-28
|-
| DocRED: A Large-Scale Document-Level Relation Extraction Dataset
|| 宋玉鲲 || 2021-12-26

|}

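Literature discussion schedule for concept extraction and concept relation mining (概念抽取和概念关系挖掘文献讨论安排)

2021-10-24T08:57:13Z

Songyk:

[[分类:概念抽取和概念关系挖掘]]

For the full literature list, see [[:Category:概念抽取和概念关系挖掘|Concept extraction and concept relation mining]].
Presenters can pick papers to share from the [https://github.com/thunlp/KRLPapers/blob/master/README.md KRLPapers GitHub project], or find relevant papers on their own.

Presenters are encouraged to create a new page on this site with a summary and review of the paper they present. Please add two categories to the new page: 文献讨论 and 概念抽取和概念关系挖掘. Here is an example: [[:Category:End-to-end_Named_Entity_Recognition_and_Relation_Extraction_using_Pre-trained_Language_Models|End-to-end Named Entity Recognition and Relation Extraction using Pre-trained Language Models]].

The discussion schedule is below (continuously updated):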
{| class="wikitable"
! Paper title
! Presenter
! Date
|-
| Translating Embeddings for Modeling Multi-relational Data
|| 宋玉鲲 || 2021-10-17
|-
| [[:Category:Measuring prerequisite relations among concepts|Measuring prerequisite relations among concepts]]
|| 邓招奇 || 2021-10-24
|-
| Document-Level Definition Detection in Scholarly Documents: Existing Models, Error Analyses, and Future Directions
|| 骆慧颖 || 2021-10-31
|-
| [[:Category:End-to-end_Named_Entity_Recognition_and_Relation_Extraction_using_Pre-trained_Language_Models|End-to-end Named Entity Recognition and Relation Extraction using Pre-trained Language Models]]
|| 焦奕霖 || 2021-11-07
|-
| Span-based Joint Entity and Relation Extraction with Transformer Pre-training
|| 骆慧颖 || TBD
|-
| Linguistic regularities in continuous space word representations
|| TBD || TBD
|-
| Relation extraction with matrix factorization and universal schemas
|| TBD || TBD
|-
| Connecting language and knowledge bases with embedding models for relation extraction
|| TBD || TBD
|-
| A Review of Relational Machine Learning for Knowledge Graphs
|| TBD || TBD
|}

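Literature discussion schedule for concept extraction and concept relation mining (概念抽取和概念关系挖掘文献讨论安排)

2021-10-17T11:08:17Z

Songyk:

[[分类:概念抽取和概念关系挖掘]]

For the full literature list, see [[:Category:概念抽取和概念关系挖掘|Concept extraction and concept relation mining]].
Presenters can pick papers to share from the [https://github.com/thunlp/KRLPapers/blob/master/README.md KRLPapers GitHub project], or find relevant papers on their own.

Presenters are encouraged to create a new page on this site with a summary and review of the paper they present. Please add two categories to the new page: 文献讨论 and 概念抽取和概念关系挖掘. Here is an example: [[:Category:End-to-end_Named_Entity_Recognition_and_Relation_Extraction_using_Pre-trained_Language_Models|End-to-end Named Entity Recognition and Relation Extraction using Pre-trained Language Models]].

The discussion schedule is below (continuously updated):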
{| class="wikitable"
! Paper title
! Presenter
! Date
|-
| Translating Embeddings for Modeling Multi-relational Data
|| 宋玉鲲 || 2021-10-17
|-
| TBD
|| TBD || 2021-10-24
|-
| Document-Level Definition Detection in Scholarly Documents: Existing Models, Error Analyses, and Future Directions
|| 骆慧颖 || 2021-10-31
|-
| Span-based Joint Entity and Relation Extraction with Transformer Pre-training
|| 骆慧颖 || TBD
|-
| End-to-end Named Entity Recognition and Relation Extraction using Pre-trained Language Models
|| TBD || TBD
|-
| Linguistic regularities in continuous space word representations
|| TBD || TBD
|-
| Relation extraction with matrix factorization and universal schemas
|| TBD || TBD
|-
| Connecting language and knowledge bases with embedding models for relation extraction
|| TBD || TBD
|-
| A Review of Relational Machine Learning for Knowledge Graphs
|| TBD || TBD
|}

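Literature discussion schedule for concept extraction and concept relation mining (概念抽取和概念关系挖掘文献讨论安排)

2021-10-17T11:06:56Z

Songyk: Created new page with content “Category:概念抽取和概念关系挖掘文献列表详情见概念抽取和概念关系挖掘。报告…”

[[Category:概念抽取和概念关系挖掘]]

For the full literature list, see [[:Category:概念抽取和概念关系挖掘|Concept extraction and concept relation mining]].
Presenters can pick papers to share from the [https://github.com/thunlp/KRLPapers/blob/master/README.md KRLPapers GitHub project], or find relevant papers on their own.

Presenters are encouraged to create a new page on this site with a summary and review of the paper they present. Please add two categories to the new page: 文献讨论 and 概念抽取和概念关系挖掘. Here is an example: [[:Category:End-to-end_Named_Entity_Recognition_and_Relation_Extraction_using_Pre-trained_Language_Models|End-to-end Named Entity Recognition and Relation Extraction using Pre-trained Language Models]].

The discussion schedule is below (continuously updated):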
{| class="wikitable"
! Paper title
! Presenter
! Date
|-
| Translating Embeddings for Modeling Multi-relational Data
|| 宋玉鲲 || 2021-10-17
|-
| TBD
|| TBD || 2021-10-24
|-
| Document-Level Definition Detection in Scholarly Documents: Existing Models, Error Analyses, and Future Directions
|| 骆慧颖 || 2021-10-31
|-
| Span-based Joint Entity and Relation Extraction with Transformer Pre-training
|| 骆慧颖 || TBD
|-
| End-to-end Named Entity Recognition and Relation Extraction using Pre-trained Language Models
|| TBD || TBD
|-
| Linguistic regularities in continuous space word representations
|| TBD || TBD
|-
| Relation extraction with matrix factorization and universal schemas
|| TBD || TBD
|-
| Connecting language and knowledge bases with embedding models for relation extraction
|| TBD || TBD
|-
| A Review of Relational Machine Learning for Knowledge Graphs
|| TBD || TBD
|}

Category: Work progress of 宋玉鲲 (分类:工作进展之宋玉鲲)

2021-03-25T09:46:08Z

Songyk:

[[分类:工作进展板]]
[[分类:宋玉鲲]]

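# Research
## Tasks and algorithms for building a science-of-science concept network
### Automatically build a rough concept map from a passage of text using simple natural language processing techniques
### Implement existing concept extraction and relation extraction algorithms and evaluate them on science-of-science paper data
### Combine manual work and algorithms to build a science-of-science concept network, or even a three-layer network
## Chinese character testing algorithm (algorithm and experiments)
## English word learning order and testing algorithm
### English word etymology data (Wiktionary/google/www.etymonline.com) (in progress)
### Algorithm design and experimental study
## Applying the science-of-science three-layer network to science-of-science research such as paper novelty
## Program, simulation, and paper writing for the generational reproduction number of infectious diseases
# Study
## Natural language processing, knowledge extraction and representation
## Probabilistic graphical models
# Team management
## Team software platform administration
### Routine maintenance; automation scripts; maintenance documentation (partially done)
### Migrate bigphysics to a new server (pending)
### Support HTTPS access (done)
### Add an access-control extension to bigphysics, e.g. Extension:AccessControl (done)
## Build the website of the 教育系统科学研究中心 (Research Center for Educational System Science) (done)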

Category: Work progress of 宋玉鲲 (分类:工作进展之宋玉鲲)

2021-03-23T11:42:24Z

Songyk:

[[分类:工作进展板]]
[[分类:宋玉鲲]]

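# Research
## Tasks and algorithms for building a science-of-science concept network
### Automatically build a rough concept map from a passage of text using simple natural language processing techniques
### Implement existing concept extraction and relation extraction algorithms and evaluate them on science-of-science paper data
### Combine manual work and algorithms to build a science-of-science concept network, or even a three-layer network
## Chinese character testing algorithm (algorithm and experiments)
## English word learning order and testing algorithm
### English word etymology data (Wiktionary/google/www.etymonline.com) (in progress)
### Algorithm design and experimental study
## Applying the science-of-science three-layer network to science-of-science research such as paper novelty
## Program, simulation, and paper writing for the generational reproduction number of infectious diseases
# Study
## Natural language processing, knowledge extraction and representation
## Probabilistic graphical models
# Team management
## Team software platform administration
### All website platforms: routine maintenance; write maintenance documentation (partially done); support HTTPS access; write and deploy automation scripts
### Add an access-control extension to bigphysics, e.g. Extension:AccessControl (done)
## Build the website of the 教育系统科学研究中心 (Research Center for Educational System Science) (done)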

Category: Work progress of 宋玉鲲 (分类:工作进展之宋玉鲲)

2021-02-25T08:19:49Z

Songyk:

[[分类:工作进展板]]
[[分类:宋玉鲲]]

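# Research
## Tasks and algorithms for building a science-of-science concept network
### Implement existing concept extraction and relation extraction algorithms and evaluate them on science-of-science paper data
### Combine manual work and algorithms to build a science-of-science concept network, or even a three-layer network
## Chinese character testing algorithm (algorithm and experiments)
## English word learning order and testing algorithm
### English word etymology data (Wiktionary/google/www.etymonline.com)
### Algorithm design and experimental study
## Applying the science-of-science three-layer network to science-of-science research such as paper novelty
## Program, simulation, and paper writing for the generational reproduction number of infectious diseases
# Study
## Natural language processing, knowledge extraction and representation
## Probabilistic graphical models
# Team management
## Team software platform administration
### All website platforms: support HTTPS access; write and deploy automated backup scripts; upgrade systems (if necessary); write maintenance documentation
### Add an access-control extension to bigphysics, e.g. Extension:AccessControl
## Build the website of the 教育系统科学研究中心 (Research Center for Educational System Science) (done)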

Category:High-Precision Extraction of Emerging Concepts from Scientific Literature

2020-12-08T08:01:37Z

Songyk: /* Summary and comments */

[[Category:文献讨论]]
[[分类:AllenAI系列科学学文章]]

Daniel King, Doug Downey, Daniel S. Weld. High-Precision Extraction of Emerging Concepts from Scientific Literature. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval

== Abstract ==

Identification of new concepts in scientific literature can help power faceted search, scientific trend analysis, knowledge-base construction, and more, but current methods are lacking. Manual identification can't keep up with the torrent of new publications, while the precision of existing automatic techniques is too low for many applications. We present an unsupervised concept extraction method for scientific literature that achieves much higher precision than previous work. Our approach relies on a simple but novel intuition: each scientific concept is likely to be introduced or popularized by a single paper that is disproportionately cited by subsequent papers mentioning the concept. From a corpus of computer science papers on arXiv, we find that our method achieves a Precision@1000 of 99%, compared to 86% for prior work, and a substantially better precision-yield trade-off across the top 15,000 extractions. To stimulate research in this area, we release our code and data.

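== Summary and comments ==

This paper proposes a method, named ForeCite, for extracting emerging concepts from scientific literature. In experiments on computer science papers from arXiv it achieves high precision (Precision@1000 = 99%). The paper also open-sources its code and dataset.

The problem the paper addresses is how to tell whether a term is a genuine concept or merely something loosely associated with one. The main idea is that a genuine concept was very likely introduced, or first made popular, by a single paper that is heavily (disproportionately) cited by the subsequent papers that mention the concept.
The paper discusses two earlier works, LoOR<ref name="LoOR"/> and CNLC<ref name="CNLC"/>. All three analyze a term citation graph: the citation network of the papers that contain a particular term, which is a subgraph of the citation network formed by the whole corpus. The main idea of LoOR and CNLC is that the citation network of a concept is "denser" than that of a non-concept. "Density" can be described by different measures, such as the connectivity of the subgraph or its number of edges.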
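
One simple way to make "density" concrete is the fraction of possible directed citation links that are actually present among the papers containing a term. The sketch below only illustrates that edge-count idea. It is not the LoOR or CNLC measure, and the function name, argument names, and data layout are hypothetical.

```python
from typing import Dict, Set


def term_subgraph_edge_density(term_papers: Set[str],
                               citations: Dict[str, Set[str]]) -> float:
    """Share of possible directed links present in a term citation graph.

    term_papers: ids of the papers that contain the term (the subgraph nodes).
    citations:   maps a paper id to the set of paper ids it cites
                 (assumed layout, for illustration only).
    """
    n = len(term_papers)
    if n < 2:
        return 0.0
    # Count only citations whose source and target both contain the term.
    edges = sum(
        1
        for src in term_papers
        for dst in citations.get(src, set())
        if dst in term_papers and dst != src
    )
    return edges / (n * (n - 1))


if __name__ == "__main__":
    papers = {"a", "b", "c"}
    cites = {"b": {"a"}, "c": {"a", "x"}}  # "x" lies outside the subgraph
    print(term_subgraph_edge_density(papers, cites))
```

Citations that leave the subgraph are ignored, which matches the description of the term citation graph as a subgraph of the corpus-wide citation network.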

Returning to ForeCite: it defines a ranking score for each candidate concept term, as follows:

<math>ForeCite(G_t)=\max_{p\in{G_t}}\log(f_t^p+1) \cdot \frac{f_t^p}{f_t}</math>

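Here paper p is a node in the citation network of term t, <math>f_t^p</math> is the number of papers that cite p and contain term t, and <math>f_t</math> is the total number of papers containing term t.

The ForeCite score of every term is computed with the formula above, and the top-N terms are extracted as concepts for manual verification. The paper also compares ForeCite with LoOR and CNLC and reports better results.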
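
The score above can be sketched in a few lines. This is a minimal illustration, not the code released with the paper. It assumes the corpus is available as a set of ids of the papers containing term t plus a dictionary mapping each paper id to the ids it cites. The names `forecite_score`, `term_papers`, and `citations` are hypothetical, and the natural logarithm is an assumption since the formula does not state a base.

```python
import math
from typing import Dict, Set


def forecite_score(term_papers: Set[str],
                   citations: Dict[str, Set[str]]) -> float:
    """max over p in G_t of log(f_t^p + 1) * f_t^p / f_t.

    term_papers: ids of the papers containing term t (nodes of G_t).
    citations:   maps a paper id to the set of paper ids it cites
                 (assumed layout, for illustration only).
    """
    f_t = len(term_papers)  # total number of papers containing t
    if f_t == 0:
        return 0.0
    best = 0.0
    for p in term_papers:
        # f_t^p: papers that contain t and cite p
        f_tp = sum(1 for q in term_papers if p in citations.get(q, set()))
        best = max(best, math.log(f_tp + 1) * f_tp / f_t)
    return best


if __name__ == "__main__":
    papers = {"a", "b", "c", "d"}
    cites = {"b": {"a"}, "c": {"a"}, "d": {"a", "z"}}
    print(forecite_score(papers, cites))
```

In the toy data, paper "a" is cited by the three other papers that contain the term, so it determines the maximum. Ranking many terms would mean calling the function once per term and sorting by the result.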

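The paper identifies and extracts concepts by analyzing the citation network rather than through deep analysis of the text itself; NLP is used mainly for data preprocessing. Network analysis also offers a different angle on the problem.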

==References==
<references>
<ref name="LoOR"> Yookyung Jo, Carl Lagoze, and C. Lee Giles. 2007. Detecting research topics via the correlation between graphs and texts. In KDD ’07.</ref>
<ref name="CNLC"> Asif ul Haque and Paul Ginsparg. 2011. Phrases as subtopical concepts in scholarly text. In JCDL ’11.</ref>
</references>

== Concept map ==

Category:ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing

2020-12-07T14:13:21Z

Songyk: /* Summary and comments */

[[Category:文献讨论]]
[[分类:AllenAI系列科学学文章]]
[[分类:概念抽取和概念关系挖掘]]

Mark Neumann, Daniel King, Iz Beltagy, Waleed Ammar. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. BioNLP@ACL 2019

== Abstract ==
Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new Python library and models for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance of two packages of models released in scispaCy and demonstrate their robustness on several tasks and datasets. Models and code are available at https://allenai.github.io/scispacy/.

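== Summary and comments ==

This paper presents scispaCy, a software package built on spaCy that can be used for [[:分类:概念抽取和概念关系挖掘|concept extraction and concept relation mining]] in scientific papers.

The software and the methods behind it could also be used for semi-automatic construction of concept maps.

[https://allenai.github.io/scispacy ScispaCy] is a natural language processing package, a Python library, developed by the authors for biomedical scientific papers. ScispaCy is built on [https://spacy.io spaCy]: by training on biomedical paper corpora it obtains new, specialized "models" that perform better on biomedical NLP tasks. ScispaCy can also be seen as an extension of spaCy. The main contributions of the paper:
* Released a dataset: [https://github.com/allenai/genia-dependency-trees Universal Dependencies v1.0 for the GENIA Treebank]
* Benchmarked POS tagging, dependency parsing, named entity recognition, and other tasks against mainstream tools; the results show that ScispaCy performs quite well
* Provided fast, stable, easy-to-use pipelines for biomedical text processing, namely ScispaCy

spaCy is an industrial-strength natural language processing toolkit for Python. According to its official website, spaCy is:
* Easy to use (easy to install, simple and productive API)
* Very fast (written from the ground up in carefully memory-managed Cython)
* Seamlessly interoperable with downstream tools (TensorFlow, PyTorch, scikit-learn, Gensim, ...)

Among the models that ScispaCy trains on different corpora, en_core_sci_md and en_core_sci_lg ship with trained word vectors. The paper does not describe the training procedure in detail; for those details, consult the spaCy/ScispaCy documentation and source code.

Many tools already exist for biomedical text processing, such as the widely used named entity recognition tools MetaMap and MetaMapLite. Even so, most of these classic NLP tools do not yet use techniques such as word vector representations and neural networks. Another problem is how best to pass the processed NLP output to downstream tasks (usually machine learning). ScispaCy aims to solve these problems.

spaCy/ScispaCy mainly serves as an "upstream" tool in an NLP or machine learning workflow. First, ScispaCy can be used directly to study specific problems, such as building concept maps for the biomedical domain. One can also look further into the biomedical corpora and datasets used and mentioned in the paper. In addition, ScispaCy's approach could be used to train "models" for other disciplines.

Resources: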
* ScispaCy https://allenai.github.io/scispacy, https://github.com/allenai/scispacy
* Universal Dependencies v1.0 for the GENIA Treebank, https://github.com/allenai/genia-dependency-trees
* GENIA Project http://www.geniaproject.org/home
* spaCy https://spacy.io/
* The GENIA 1.0 Treebank https://nlp.stanford.edu/~mcclosky/biomedical.html

== Concept map ==