Imagine it's 2015 and someone had told you that in 2025:
AI would write novels almost indistinguishable from human-written ones
AI would pass the bar exam
AI would write songs indistinguishable from human-written ones
AI would assist humans in writing a considerable amount of code
A single AI model would understand text, images, and audio and respond in all of those modalities
AI would generate images indistinguishable from real photos
You probably would have dismissed such claims.
Artificial Intelligence has gone through transformational changes in the past decade. We moved from models barely able to generate a coherent sentence to those capable of passing the toughest exams, translating between multiple languages, creating realistic images, composing music, coding, and more.
In this post, we will look at the major milestones in the development of transformer-based Large Language Models (LLMs) in the past decade. Although a deep understanding of how LLMs work isn't required, having a basic familiarity with them is recommended to fully follow the post.
2014
Sequence-to-sequence models with an attention mechanism. [1]
2016
Big LSTM in Google's machine translation system. [2]
1B Parameters
240B Training Tokens
1.2 × 10²⁰ FLOPs (32 K40 GPUs for 10 days)
2017
Attention is all you need. [3]
213M Parameters
15B Training Tokens
2.3 × 10¹⁹ FLOPs (8 P100 GPUs for 3.5 days)
OpenAI introduces deep reinforcement learning from human preferences, later known as RLHF (Reinforcement Learning from Human Feedback). In RLHF, goals are defined by non-expert humans, and the model learns to follow them by maximizing a learned reward function. [20,21]
2018
Google's BERT. [4]
340M Parameters
128B Training Tokens
1.0 × 10²⁰ FLOPs (16 TPU v2 for 4 days)
OpenAI's GPT-1 [5]
117M Parameters
100B Training Tokens
2.3 × 10¹⁹ FLOPs (estimated as 2 × params × tokens; information about hardware not available)
(The state-of-the-art GPU available at the time was the V100.)
2019
OpenAI GPT-2 [6]
1.5B Parameters
40B Training Tokens
1.2 × 10²⁰ FLOPs (estimated as 2 × params × tokens; information about hardware not available)
Note that GPT-2 has more than 10x the parameters of GPT-1 but was trained on fewer tokens. GPT-2 was a scaling experiment focused on capacity, not data quantity: the goal was to see how much zero-shot performance improved with increasing model size, even without changing the dataset.
Google introduces Parameter-Efficient Fine-Tuning (PEFT) for adapting LLMs to a specific task. In PEFT, the original model parameters are frozen and only a small number of trainable parameters (3-4% of the original parameter count) are added. [22]
2020
OpenAI GPT-3 [7]
175B Parameters
300B Training Tokens
3 × 10²³ FLOPs
GPT-3 showed that massive scale alone enables zero-shot, one-shot, and few-shot learning, with no parameter updates needed.
Retrieval-Augmented Generation (RAG) was introduced. [8]
Mixture of Experts (MoE) was reintroduced in the context of transformers, where only a subset of model parameters is activated for each input. [17]
EleutherAI started as an open-source alternative to OpenAI. They released fully open (code/data/weights) models such as GPT-J, GPT-Neo, and GPT-NeoX. Their training infrastructure included donated clusters and the TPU Research Cloud.
2021
NVIDIA and Microsoft reveal Megatron-Turing NLG [14]
530B Parameters
270B Training Tokens
~3 × 10²³ FLOPs
Trained on the NVIDIA Selene supercomputer with 560 DGX A100 nodes, each with 8 NVIDIA 80GB A100 GPUs.
The term "prompt engineering" started to gain popularity after the release of GPT-3.
OpenAI introduces DALL-E as a zero-shot text-to-image generation model. This marked a significant milestone in transformer-based multimodal LLMs.[26]
12B Parameters
320B Training Tokens (250M images, each encoded as a 32x32 grid of image tokens, plus up to 256 BPE tokens per caption)
2.3 × 10²¹ FLOPs
Trained using 1024 NVIDIA V100 GPUs (16GB each).
2022
Google DeepMind introduces Chinchilla, a model designed using a compute-optimal scaling approach. [9]
70B Parameters
1.4T Training Tokens
5.7 × 10²³ FLOPs
Key findings: most LLMs are undertrained. For compute-optimal training, model size and training tokens should be scaled in equal proportion (a 1:1 ratio), unlike Kaplan's scaling laws [35], which favored growing model size faster than data.
Google releases PaLM. [10]
540B Parameters
780B Training Tokens
8.4 × 10²³ FLOPs
Trained on 6144 TPU v4 chips across two TPU pods, using the JAX + T5X framework.
Meta AI releases OPT-175B.
175B Parameters
180B Training Tokens
6.3 × 10²² FLOPs
Trained on 992 A100 (80GB) GPUs using PyTorch with FSDP and Megatron-LM.
FlashAttention, an IO-aware exact attention algorithm, was introduced. [15]
OpenAI fine-tunes GPT-3 via reinforcement learning from human feedback, resulting in InstructGPT. [11]
Chain-of-thought prompting is introduced by Google. [12]
ChatGPT is released in November 2022 to the public and reaches 100M users in 2 months.
Anthropic introduces Constitutional AI. The goal is to train an AI assistant through self-improvement, using only a set of rules and principles provided by humans. They also call this Reinforcement Learning from AI Feedback (RLAIF). [19]
2023
Georgi Gerganov, a Bulgarian developer, introduced GGML (Georgi Gerganov Machine Learning), a C/C++ tensor library optimized for CPU inference of LLMs. After GGML's release, it became apparent that LLMs could run on consumer-grade hardware, especially CPUs. [24]
GPT-4 is released. It is estimated to have hundreds of billions of parameters and was trained on trillions of tokens using Microsoft-built GPU clusters.
Anthropic releases Claude and Claude 2, with a 100K-token context window.
Google introduces the Gemini family of models as a direct competitor to OpenAI's GPT-4.
Gemini Ultra is rumored to have on the order of a trillion parameters, along with image generation capabilities.
Meta releases the LLaMA family of models (7B-65B parameters). [31]
Google introduces ReAct, in which LLMs produce reasoning traces and task-specific actions in an interleaved manner, allowing synergy between reasoning and acting. This in turn helped enable agentic flows in LLM applications. [18]
Mistral AI released Mistral 7B, which achieved better results than models 2-5x larger on benchmarks for reasoning, math, and code generation. Key innovations included sliding-window attention to handle long sequences and grouped-query attention. Mistral AI open-sourced the model. [23]
EvoLved Sign Momentum (Lion) was introduced as a more memory-efficient optimizer than Adam, as it only keeps track of the momentum. It converges faster and scales better to larger models. [25]
2024
Google DeepMind introduces Gemini 1.5 with a 1-million-token context window: roughly an hour of video or 700K words in a single prompt. [13]
In May, Google DeepMind introduces Veo, capable of generating high-quality 1080p videos from text, image, and video prompts. Later that year, they introduce Veo 2, with the ability to generate 4K video and an improved understanding of the physical world.
DeepSeek V3, a 671B-parameter model, outperforms GPT-4 on certain coding and math benchmarks. [28]
671B Parameters
14.8T Training Tokens
6.7 × 10²⁴ FLOPs
Trained on a cluster of 2048 NVIDIA H800 GPUs. They introduced multiple techniques to improve training efficiency, such as custom CUDA kernels, FP8 (8-bit) precision where possible, multi-token prediction, optimized communication between GPUs, and mixture of experts with routing.
Only 37 billion parameters are activated per token, due to the use of Mixture of Experts (MoE) layers.
2025
Shortly after DeepSeek V3, they released DeepSeek-R1. While DeepSeek-V3 is a general-purpose foundation model, DeepSeek-R1 specializes in reasoning: it is trained on top of V3 with reinforcement learning to improve math, coding, and step-by-step problem solving. [29]
236B Parameters
3T Training Tokens
5.8 × 10²⁴ FLOPs
2.3 million H100 GPU hours.
Only 12.9 billion parameters are activated per token, due to the use of Mixture of Experts (MoE) layers.
DeepSeek's open-source models took the market by surprise, as people realized that state-of-the-art benchmark performance could be achieved with far less compute than previously thought possible. Not surprisingly, NVIDIA's stock took the biggest hit.
Meta introduced Llama 4, its fourth series of open-source models, with multimodal capabilities. [16]
Google's Gemini 2 emphasizes agentic capabilities: planning and sequencing actions.
Specialized models tailored to a specific task gained popularity, as people realized they don't need an enormous general-purpose model for every job; instead, they can fine-tune a much smaller specialized model.
OpenAI released o3, which is capable of reflective reasoning. It is the successor to the o1 model, released in late 2024. These models are designed for questions that require step-by-step reasoning. [36]
Quality of Models vs. Number of Parameters Over the Years
So far, we've mostly focused on scale: model size, number of tokens, and FLOPs. But the improvements we've seen aren't simply due to increasing scale; we've also made considerable progress in model quality. Let's look at a math reasoning benchmark and see how model performance has improved over time. For this we'll use GSM8K [30], a dataset of grade-school math word problems designed to test mathematical reasoning. Here is one example:
Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

Answer: Natalia sold 48/2 = <<48/2=24>>24 clips in May. Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72
[Figure: Performance of language models on the GSM8K math reasoning benchmark over time.]
Above is a graph showing the performance of different models on this math reasoning benchmark over time. [23,31,32,33,34] Looking at the graph, two things are apparent:
1. Despite a dramatic reduction in model size in recent years, model accuracy was maintained or even improved.
2. In a very short span of time, smaller models have become increasingly capable. All of this happened in just 4 years.
Conclusion
This post is by no means an exhaustive list of the progress in AI over the past decade. We focused primarily on the model side and didn't cover areas such as hardware accelerators or the software side (like agentic frameworks).
The pace of progress has been nothing short of astonishing in the past decade.
We saw a dramatic increase in the size of models that led to an increase in their capabilities. We also witnessed innovations in model architectures that resulted in better model performance and training efficiency.
Despite this rapid progress, there are still critical areas in which LLMs can be improved. One very important area is hallucination: these models are trained as next-token predictors, which can cause them to generate plausible-sounding but incorrect or irrelevant information. Another critical area is continuous model improvement without significant architectural changes to the underlying model. Alignment is another key area that needs to be addressed: how can we ensure that these models are aligned with human goals and values? This in turn raises the ethical and societal implications of LLMs. Who defines the goals and values?
One more thing
Finally, my last point above highlights that we have a fair amount of control over how these models behave. If you look at LLMs at a high level and think of them as systems, then, like many other systems, we can define and design their behavior. We, as a society, created the system of national governments to help us live together. There are bad systems of government and good ones, and which they become is determined to a huge degree by their original design principles. The same is true of large language models: the way we design them will have a huge impact on how they behave and perform in the real world.
Thanks to Sean McGregor and Mohammadreza Heydari for reading drafts of this post.
References:
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: "Neural Machine Translation by Jointly Learning to Align and Translate", 2014; arXiv:1409.0473.
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu: "Exploring the Limits of Language Modeling", 2016; arXiv:1602.02410.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin: "Attention Is All You Need", 2017; arXiv:1706.03762.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2018; arXiv:1810.04805.
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever: "Improving Language Understanding by Generative Pre-Training", 2018; OpenAI.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever: "Language Models are Unsupervised Multitask Learners", 2019; OpenAI.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei: "Language Models are Few-Shot Learners", 2020; arXiv:2005.14165.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", 2020; arXiv:2005.11401.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre: "Training Compute-Optimal Large Language Models", 2022; arXiv:2203.15556.
Google Research Team (sorry, the list was too long to include all the authors): "PaLM: Scaling Language Modeling with Pathways", 2022; arXiv:2204.02311.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe: "Training language models to follow instructions with human feedback", 2022; arXiv:2203.02155.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou: "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", 2022; arXiv:2201.11903.
Gemini Team (sorry, the list was too long to include all the authors): "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context", 2024; arXiv:2403.05530.
Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro: “Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model", 2022; arXiv:2201.11990.
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré: “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness", 2022; arXiv:2205.14135.
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen: "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding" (MoE in LLMs), 2020; arXiv:2006.16668.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao: “ReAct: Synergizing Reasoning and Acting in Language Models", 2022; arXiv:2210.03629.
Anthropic Team (sorry, the list was too long to include all the authors): "Constitutional AI: Harmlessness from AI Feedback", 2022; arXiv:2212.08073.
Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei: “Deep reinforcement learning from human preferences", 2017; arXiv:1706.03741.
Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, Geoffrey Irving: “Fine-Tuning Language Models from Human Preferences", 2019; arXiv:1909.08593.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly: “Parameter-Efficient Transfer Learning for NLP", 2019; arXiv:1902.00751.
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed: “Mistral 7B", 2023; arXiv:2310.06825.
Gerganov, G.: GGML/llama.cpp, a C/C++ tensor library for LLM inference. GitHub, 2023.
Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le: “Symbolic Discovery of Optimization Algorithms", 2023; arXiv:2302.06675.
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever: “Zero-Shot Text-to-Image Generation", 2021; arXiv:2102.12092.
DeepSeek-AI team: "DeepSeek-V3 Technical Report", 2024; arXiv:2412.19437.
DeepSeek-AI team: "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", 2025; arXiv:2501.12948.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman: “Training Verifiers to Solve Math Word Problems”, 2021; arXiv:2110.14168.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample: “LLaMA: Open and Efficient Foundation Language Models”, 2023; arXiv:2302.13971.
Llama Team (the list was too long to include all the authors): "Llama 2: Open Foundation and Fine-Tuned Chat Models", 2023; arXiv:2307.09288.
Llama Team (the list was too long to include all the authors): "The Llama 3 Herd of Models", 2024; arXiv:2407.21783.
Gemma Team at Deepmind: “Gemma 2: Improving Open Language Models at a Practical Size”, 2024; arXiv:2408.00118.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei: “Scaling Laws for Neural Language Models”, 2020; arXiv:2001.08361.