Structured information extraction from scientific text with large language models

Resource type
Journal Article
Authors/contributors
Dagdelen, J.; Dunn, A.; Lee, S.; Walker, N.; Rosen, A. S.; Ceder, G.; Persson, K. A.; Jain, A.
Title
Structured information extraction from scientific text with large language models
Abstract
Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
Publication
Nature Communications
Volume
15
Issue
1
Pages
1418
Date
2024-02-15
Journal Abbr
Nat Commun
Language
en
ISSN
2041-1723
Accessed
23/02/2024, 23:13
Library Catalogue
DOI.org (Crossref)
Citation
Dagdelen, J., Dunn, A., Lee, S., Walker, N., Rosen, A. S., Ceder, G., Persson, K. A., & Jain, A. (2024). Structured information extraction from scientific text with large language models. Nature Communications, 15(1), 1418. https://doi.org/10.1038/s41467-024-45563-x