Category:SciBERT: A Pretrained Language Model for Scientific Text

From Big Physics
Revision by Luohuiying as of 11:05, Friday, 27 November 2020

Iz Beltagy, Kyle Lo, Arman Cohan. SciBERT: A Pretrained Language Model for Scientific Text. EMNLP/IJCNLP 2019


Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.




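The pretraining corpus used in the paper comes from Semantic Scholar: 18% of the papers are from computer science and 82% from the biomedical domain, and the full text of each paper is used as model input. The downstream tasks are NER (named entity recognition), PICO (PICO extraction), CLS (text classification), REL (relation classification), and DEP (dependency parsing). Results are reported in three groups: biomedical, computer science, and multiple domains. The high-level results show that SciBERT outperforms BERT-Base on scientific text and achieves new SOTA results on many of the downstream tasks.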

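How does SciBERT differ from BERT? SciBERT uses a vocabulary built from scientific text (SCIVOCAB), whereas BERT's vocabulary is based on general-domain text such as news and Wikipedia. SciBERT is also pretrained on a scientific corpus. The architecture and the training procedure are otherwise the same.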
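
To illustrate why the vocabulary difference matters, here is a minimal sketch of greedy longest-match WordPiece tokenization, the subword scheme BERT uses. The two vocabularies below are tiny, hypothetical stand-ins, not the real BERT vocabulary or SCIVOCAB, and the function `wordpiece_tokenize` is written only for this note. The point is that a term missing from a general-domain vocabulary is split into many pieces, while a domain vocabulary can keep it whole.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one lowercase word into WordPiece tokens (greedy, longest match first)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first and shrink until a vocab hit.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies (assumptions for illustration only).
general_vocab = {"im", "##mun", "##og", "##lo", "##bul", "##in", "the", "protein"}
science_vocab = {"immunoglobulin", "the", "protein"}

general_pieces = wordpiece_tokenize("immunoglobulin", general_vocab)
science_pieces = wordpiece_tokenize("immunoglobulin", science_vocab)

print(general_pieces)  # six word pieces
print(science_pieces)  # a single token
```

With these toy vocabularies the same word costs six tokens under the general vocabulary and one under the scientific one. How the real vocabularies split any particular word is not claimed here.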

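How does this paper relate to our review-article identification work? The ideas are indeed similar to some degree: we also want to use natural language processing to address the problem of annotated data, which is why we set out to train a classifier that identifies review articles. However, our classifier for this specific task has not yet been fully optimized, while others have already released a more general-purpose pretrained model.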

