Category: SciBERT: A Pretrained Language Model for Scientific Text

Iz Beltagy, Kyle Lo, Arman Cohan. SciBERT: A Pretrained Language Model for Scientific Text. EMNLP/IJCNLP 2019

Abstract

Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.

Summary and Comments

To address the difficulty and expense of obtaining high-quality annotated data in the scientific domain, this paper adapts BERT and releases a BERT-based model pretrained on scientific literature (SciBERT).

First, for the vocabulary, the authors use the SentencePiece library (a language-independent subword tokenizer that does not rely on pre-tokenized sentences) to build a scientific vocabulary. They find that the overlap between this scientific vocabulary and the vocabulary of the released BERT model is only 42%, showing that the frequently used words in scientific text differ substantially from those in general-domain text.
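
The sketch below illustrates, under stated assumptions, how such a vocabulary could be trained with SentencePiece and compared against BERT's vocabulary; it is not the authors' script, and the corpus file name, vocabulary size, and use of the transformers library are illustrative choices.

```python
# Minimal sketch (not the authors' code): train a subword vocabulary on a
# scientific corpus with SentencePiece and estimate its overlap with BERT's
# vocabulary. "scientific_corpus.txt" and vocab_size=31000 are assumptions.
import sentencepiece as spm
from transformers import BertTokenizer

# Train a vocabulary on the (hypothetical) scientific corpus file.
spm.SentencePieceTrainer.train(
    input="scientific_corpus.txt",
    model_prefix="scivocab",
    vocab_size=31000,
)

sp = spm.SentencePieceProcessor()
sp.load("scivocab.model")
sci_vocab = {sp.id_to_piece(i) for i in range(sp.get_piece_size())}

# Vocabulary of the released BERT-Base model.
base_vocab = set(BertTokenizer.from_pretrained("bert-base-uncased").get_vocab())

# Rough overlap estimate; SentencePiece marks word starts with "▁" while
# WordPiece marks continuations with "##", so the exact number depends on
# how the two notations are normalized.
overlap = len(sci_vocab & base_vocab) / len(sci_vocab)
print(f"token overlap: {overlap:.0%}")
```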

Second, the pretraining corpus comes from Semantic Scholar and consists of 18% computer science papers and 82% biomedical papers; the full text of each paper is used as model input. The downstream tasks are NER (named entity recognition), PICO extraction, CLS (text classification), REL (relation classification), and DEP (dependency parsing).
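
As an illustration of how one of these downstream tasks can be set up, the sketch below loads a SciBERT checkpoint for token-level tagging (the NER task) with the HuggingFace transformers library; the checkpoint name allenai/scibert_scivocab_uncased and the label count are assumptions, not details taken from the paper.

```python
# Minimal sketch: SciBERT as the encoder for a sequence-tagging (NER) task.
# The checkpoint name and num_labels=5 are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=5)

sentence = "Induction of NF-kappa B requires the phosphorylation of I kappa B."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, seq_len, num_labels)
pred_label_ids = logits.argmax(dim=-1)     # one label id per subword token
```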

The paper evaluates four settings: BERT with frozen pretrained weights, fine-tuned BERT, SciBERT with frozen pretrained weights, and fine-tuned SciBERT. The results show that SciBERT reaches state-of-the-art performance on many of the downstream tasks across several scientific domains.
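
The difference between the frozen and fine-tuned settings can be shown with the toggle below; this is only a sketch, and the encoder attribute name model.bert and the learning rates are hypothetical choices, not values from the paper.

```python
# Minimal sketch of the frozen vs. fine-tuned settings. `model.bert` (the
# pretrained encoder attribute) and the learning rates are hypothetical.
import torch

def build_optimizer(model, finetune: bool):
    # Frozen setting: the pretrained encoder stays fixed and only the
    # task-specific head is trained; fine-tuned setting: all parameters train.
    for param in model.bert.parameters():
        param.requires_grad = finetune
    trainable = [p for p in model.parameters() if p.requires_grad]
    lr = 2e-5 if finetune else 1e-3
    return torch.optim.AdamW(trainable, lr=lr)
```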

    1. What is the difference between SciBERT and BERT?

Concept Map
