Category:SciREX: A Challenge Dataset for Document-Level Information Extraction

From Big Physics
Revision as of 15:37, 2 December 2020 (Wednesday) by Jinshanw (created page with "Category:Literature discussion Category:AllenAI science-of-science article series Sarthak Jain, Madeleine van Zuylen, Hannaneh Hajishirzi, Iz Beltagy. SciREX: A Challenge Dataset f...")


Sarthak Jain, Madeleine van Zuylen, Hannaneh Hajishirzi, Iz Beltagy. SciREX: A Challenge Dataset for Document-Level Information Extraction. ACL 2020


Abstract

Extracting information from full documents is an important problem in many domains, but most previous work focuses on identifying relationships within a sentence or a paragraph. It is challenging to create a large-scale information extraction (IE) dataset at the document level, since annotating entities and their document-level relationships, which usually span beyond sentences or even sections, requires an understanding of the whole document. In this paper, we introduce SciREX, a document-level IE dataset that encompasses multiple IE tasks, including salient entity identification and document-level $N$-ary relation identification from scientific articles. We annotate our dataset by integrating automatic and human annotations, leveraging existing scientific knowledge resources. We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE. Analyzing the model performance shows a significant gap between human performance and current baselines, inviting the community to use our dataset as a challenge to develop document-level IE models. Our data and code are publicly available.
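To make the two tasks named in the abstract more concrete, here is a minimal sketch of how salient entities and document-level $N$-ary relations might be represented. It is an illustration only: the class names, field names, role labels, and the placeholder entities are all assumptions made for this example, and they do not reproduce the actual SciREX data format or the authors' model.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple


@dataclass(frozen=True)
class Mention:
    """One text span in the document (hypothetical schema)."""
    section: int  # index of the section that contains the span
    start: int    # token offset of the first token in the span
    end: int      # token offset one past the last token
    text: str


@dataclass
class Entity:
    """A cluster of coreferent mentions sharing one type (hypothetical schema)."""
    name: str
    type: str
    mentions: List[Mention]
    salient: bool = False  # True if the entity takes part in the paper's main results


def candidate_relations(entities: List[Entity], roles: List[str]) -> List[Tuple[str, ...]]:
    """Enumerate candidate N-ary tuples with one salient entity per role.

    This only builds the candidate set; deciding which candidates actually
    hold in the document would be the job of a classifier, not shown here.
    """
    pools = [[e for e in entities if e.salient and e.type == role] for role in roles]
    return [tuple(e.name for e in combo) for combo in product(*pools)]


# Placeholder entities whose mentions sit in different sections, so no
# single sentence contains every argument of the relation.
entities = [
    Entity("dataset-A", "Dataset", [Mention(0, 12, 13, "dataset-A")], salient=True),
    Entity("method-B", "Method", [Mention(2, 40, 41, "method-B")], salient=True),
    Entity("metric-C", "Metric", [Mention(4, 7, 8, "metric-C")], salient=True),
    Entity("method-D", "Method", [Mention(1, 3, 4, "method-D")], salient=False),
]

relations = candidate_relations(entities, ["Dataset", "Method", "Metric"])
print(relations)
```

The non-salient `method-D` is filtered out before tuples are formed, which is why salient entity identification is a natural first step: it keeps the number of candidate tuples manageable when a long document mentions many entities.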

Summary and comments

Concept map
