unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network

Resource type
Conference Paper
Authors/contributors
Saier, Tarek; Krause, Johan; Färber, Michael
Title
unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network
Abstract
Large-scale data sets on scholarly publications are the basis for a variety of bibliometric analyses and natural language processing (NLP) applications. Data sets derived from publications' full-text in particular have recently gained attention. While several such data sets already exist, we see key shortcomings in terms of their domain and time coverage, citation network completeness, and representation of full-text content. To address these points, we propose a new version of the data set unarXive. We base our data processing pipeline and output format on two existing data sets, and improve on each of them. Our resulting data set comprises 1.9M publications spanning multiple disciplines and 32 years. It furthermore has a more complete citation network than its predecessors and retains a richer representation of document structure as well as non-textual publication content such as mathematical notation. In addition to the data set, we provide ready-to-use training/test data for citation recommendation and IMRaD classification. All data and source code are publicly available at https://github.com/IllDepence/unarXive.
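A minimal sketch of iterating over the released full-text records, assuming the data is distributed as JSON Lines (one JSON object per line); the file name and the field names "paper_id" and "body_text" are illustrative assumptions rather than the documented schema, which is specified in the linked repository:

    import json

    # Hypothetical sample file; the actual layout and field names are
    # documented in the unarXive repository linked above.
    with open("unarxive_sample.jsonl", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Field names below are assumptions for illustration only.
            paper_id = record.get("paper_id", "<unknown>")
            paragraphs = record.get("body_text", [])
            print(f"{paper_id}: {len(paragraphs)} body-text paragraphs")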
Publication
2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL)
Pages
66-70
Date
6/2023
Short Title
unarXive 2022
Accessed
10/03/2024, 19:41
Library Catalogue
Semantic Scholar
Extra
Conference Name: 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL)
ISBN: 9798350399318
Place: Santa Fe, NM, USA
Publisher: IEEE
Citation
Saier, T., Krause, J., & Färber, M. (2023). unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network. 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 66–70. https://doi.org/10.1109/JCDL57899.2023.00020