Language Models are Few-Shot Learners
Resource type
Preprint
Authors/contributors
- Brown, Tom B. (Author)
- Mann, Benjamin (Author)
- Ryder, Nick (Author)
- Subbiah, Melanie (Author)
- Kaplan, Jared (Author)
- Dhariwal, Prafulla (Author)
- Neelakantan, Arvind (Author)
- Shyam, Pranav (Author)
- Sastry, Girish (Author)
- Askell, Amanda (Author)
- Agarwal, Sandhini (Author)
- Herbert-Voss, Ariel (Author)
- Krueger, Gretchen (Author)
- Henighan, Tom (Author)
- Child, Rewon (Author)
- Ramesh, Aditya (Author)
- Ziegler, Daniel M. (Author)
- Wu, Jeffrey (Author)
- Winter, Clemens (Author)
- Hesse, Christopher (Author)
- Chen, Mark (Author)
- Sigler, Eric (Author)
- Litwin, Mateusz (Author)
- Gray, Scott (Author)
- Chess, Benjamin (Author)
- Clark, Jack (Author)
- Berner, Christopher (Author)
- McCandlish, Sam (Author)
- Radford, Alec (Author)
- Sutskever, Ilya (Author)
- Amodei, Dario (Author)
Title
Language Models are Few-Shot Learners
Abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
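The mechanism the abstract describes is in-context (few-shot) learning: the task and a handful of demonstrations are supplied purely as text in the prompt, and the model produces an answer with no gradient updates or fine-tuning. The sketch below illustrates that interaction pattern only; GPT-3's weights are not publicly distributed, so it assumes the small open `gpt2` model via the Hugging Face `transformers` pipeline as a stand-in, and the translation-style prompt format is an approximation rather than the exact format used in the paper.

```python
# Minimal sketch of few-shot prompting: demonstrations plus a query as plain
# text, no fine-tuning. Assumes "gpt2" as a stand-in model, since GPT-3 itself
# is not publicly available.
from transformers import pipeline

# Load a small causal language model for text generation.
generator = pipeline("text-generation", model="gpt2")

# Few-shot demonstrations followed by the final query, all specified as text.
prompt = (
    "Translate English to French.\n"
    "English: cheese\nFrench: fromage\n"
    "English: house\nFrench: maison\n"
    "English: book\nFrench:"
)

# Greedy decoding of a short continuation; the model's answer to the final
# query is read off the generated text that follows the prompt.
out = generator(prompt, max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"][len(prompt):].strip())
```

A small model like `gpt2` will answer such prompts unreliably; the paper's finding is that this same prompt-only protocol becomes competitive with fine-tuned baselines as model scale grows to 175 billion parameters.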
Repository
arXiv
Archive ID
arXiv:2005.14165
Date
2020-07-22
Accessed
2024-02-24, 17:40
Extra
arXiv:2005.14165 [cs]
Citation
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners (arXiv:2005.14165). arXiv. https://doi.org/10.48550/arXiv.2005.14165