Imagine it's 2015 and someone had told you that in 2025:
AI would write novels almost indistinguishable from human-written ones
AI would pass the bar exam
AI would write songs indistinguishable from human-written ones
AI would assist humans in writing a considerable amount of code
A single AI model would understand text, images, and audio and respond in all of those modalities
AI would generate images indistinguishable from real photos
You probably would have dismissed such claims.
Artificial Intelligence has gone through transformational changes in the past decade. We moved from models barely able to generate a coherent sentence to those capable of passing the toughest exams, translating between multiple languages, creating realistic images, composing music, coding, and more.
In this post, we will look at the major milestones in the development of transformer-based Large Language Models (LLMs) in the past decade. Although a deep understanding of how LLMs work isn't required, having a basic familiarity with them is recommended to fully follow the post.
2014
Sequence-to-sequence models with an attention mechanism. [1]
2016
Big LSTM in Google's machine translation system. [2]
1B Parameters
240B Training Tokens
1.2 × 10²⁰ FLOPs (32 K40 GPUs for 10 days)
2017
Attention is all you need. [3]
213M Parameters
15B Training Tokens
2.3 × 10¹⁹ FLOPs (8 P100 GPUs for 3.5 days)
OpenAI introduces deep reinforcement learning from human preferences, later known as RLHF (Reinforcement Learning from Human Feedback). In RLHF, goals are defined by non-expert humans, and the model learns to follow them by maximizing a learned reward function. [20,21]
2018
Google's BERT. [4]
340M Parameters
128B Training Tokens
1.0 × 10²⁰ FLOPs (16 TPU v2 for 4 days)
OpenAI's GPT-1 [5]
117M Parameters
100B Training Tokens
2.3 × 10¹⁹ FLOPs (estimated as 2 × params × tokens; information about hardware not available)
(The state-of-the-art GPU available at the time was the V100.)
2019
OpenAI GPT-2 [6]
1.5B Parameters
40B Training Tokens
1.2 × 10²⁰ FLOPs (estimated as 2 × params × tokens; information about hardware not available)
Note that GPT-2 has more than 10x the parameters of GPT-1 but was trained on fewer tokens. GPT-2 was a scaling experiment focused on capacity, not data quantity: the goal was to see how much zero-shot performance improved with increasing model size, even without changing the dataset.
Google introduces Parameter-Efficient Fine-Tuning (PEFT) for adapting LLMs to a specific task. In PEFT, the original model parameters are frozen and only a small number of trainable parameters (3-4% of the original parameter count) are added. [22]
2020
OpenAI GPT-3 [7]
175B Parameters
300B Training Tokens
3 × 10²³ FLOPs
GPT-3 showed that massive scale alone enables zero-shot, one-shot, and few-shot learning, with no parameter updates needed.
Retrieval-Augmented Generation (RAG) was introduced. [8]
Mixture of Experts (MoE) was reintroduced in the context of transformers, where only a subset of model parameters is activated for each input. [17]
EleutherAI started as an open-source alternative to OpenAI. They released fully open (code/data/weights) models such as GPT-J, GPT-Neo, and GPT-NeoX. Their training infrastructure included donated clusters and the TPU Research Cloud.
2021
NVIDIA and Microsoft reveal Megatron-Turing NLG [14]
530B Parameters
270B Training Tokens
~3 × 10²³ FLOPs
Trained on the NVIDIA Selene supercomputer with 560 DGX A100 nodes, each with 8 NVIDIA 80GB A100 GPUs.
The term "prompt engineering" started to gain popularity after the release of GPT-3.
OpenAI introduces DALL-E as a zero-shot text-to-image generation model. This marked a significant milestone in transformer-based multimodal LLMs.[26]
12B Parameters
320B Training Tokens (250M images, each encoded as a 32x32 grid of image tokens, plus up to 256 BPE tokens per caption)
2.3 × 10²¹ FLOPs
Trained using 1024 NVIDIA V100 GPUs (16GB each).
2022
Google DeepMind introduces Chinchilla, a model designed using a compute-optimal scaling approach. [9]
70B Parameters
1.4T Training Tokens
5.7 × 10²³ FLOPs
Key findings: most LLMs are undertrained. For compute-optimal training, model size and training tokens should be scaled in equal proportion (a 1:1 ratio), unlike Kaplan's scaling laws [35], which favored growing model size faster than data.
Google releases PaLM. [10]
540B Parameters
780B Training Tokens
8.4 × 10²³ FLOPs
Trained on 6144 TPU v4 chips across two TPU pods, using the JAX + T5X framework.
Meta AI releases OPT-175B.
175B Parameters
180B Training Tokens
6.3 × 10²² FLOPs
Trained on 992 A100 (80GB) GPUs using PyTorch with FSDP and Megatron-LM.
FlashAttention, an IO-aware exact attention algorithm, was introduced. [15]
OpenAI fine-tunes GPT-3 via reinforcement learning from human feedback, resulting in InstructGPT. [11]
Chain-of-thought prompting is introduced by Google. [12]
ChatGPT is released in November 2022 to the public and reaches 100M users in 2 months.
Anthropic introduces Constitutional AI. The goal is to train an AI assistant through self-improvement, using only a set of rules and principles provided by humans. They also call this Reinforcement Learning from AI Feedback (RLAIF). [19]
2023
Georgi Gerganov, a Bulgarian developer, introduced GGML (Georgi Gerganov Machine Learning), a C/C++ tensor library optimized for CPU inference of LLMs. After GGML's release, it became apparent that LLMs could run on consumer-grade hardware, especially CPUs. [24]
GPT-4 is released. It is estimated to have hundreds of billions of parameters and was trained on trillions of tokens using Microsoft-built GPU clusters.
Anthropic releases Claude and Claude 2, with a 100K-token context window.
Google introduces the Gemini family of models as a direct competitor to OpenAI's GPT-4.
Gemini Ultra is rumored to have on the order of a trillion parameters, along with image generation capabilities.
Meta releases the LLaMA family of models (7B-65B parameters). [31]
Google introduces ReAct, in which LLMs produce reasoning traces and task-specific actions in an interleaved manner, allowing synergy between reasoning and acting. This in turn helped enable agentic flows in LLM applications. [18]
Mistral AI released Mistral 7B, which achieved better results than models 2-5x larger on benchmarks for reasoning, math, and code generation. Key innovations included sliding-window attention to handle long sequences and grouped-query attention. Mistral AI open-sourced the model. [23]
EvoLved Sign Momentum (Lion) was introduced as a more memory-efficient optimizer than Adam, as it only keeps track of the momentum. It converges faster and scales better to larger models. [25]
2024
Google DeepMind introduces Gemini 1.5 with a 1-million-token context window: roughly an hour of video or 700K words in a single prompt. [13]
In May, Google DeepMind introduces Veo, capable of generating high-quality 1080p videos from text, image, and video prompts. Later that year, they introduce Veo 2, with the ability to generate 4K video and an improved understanding of the physical world.
DeepSeek V3, a 671B-parameter model, outperforms GPT-4 on certain coding and math benchmarks. [28]
671B Parameters
14.8T Training Tokens
6.7 × 10²⁴ FLOPs
Trained on a cluster of 2048 NVIDIA H800 GPUs. They introduced multiple techniques to improve training efficiency, such as custom CUDA kernels, FP8 (8-bit) precision where possible, multi-token prediction, optimized communication between GPUs, and mixture of experts with routing.
Only 37 billion parameters are activated per token, due to the use of Mixture of Experts (MoE) layers.
2025
Shortly after DeepSeek V3, they released DeepSeek-R1. While DeepSeek-V3 is a general-purpose foundation model, DeepSeek-R1 specializes in reasoning: it is trained on top of V3 with reinforcement learning to improve math, coding, and step-by-step problem solving. [29]
236B Parameters
3T Training Tokens
5.8 × 10²⁴ FLOPs
2.3 million H100 GPU hours.
Only 12.9 billion parameters are activated per token, due to the use of Mixture of Experts (MoE) layers.
DeepSeek's open-source models took the market by surprise, as people realized that state-of-the-art benchmark performance could be achieved with far less compute than previously thought possible. Not surprisingly, NVIDIA's stock took the biggest hit.
Meta introduced Llama 4, its fourth series of open-source models, with multimodal capabilities. [16]
Google's Gemini 2 emphasizes agentic capabilities: planning and sequencing actions.
Specialized models tailored to a specific task gained popularity, as people realized they don't need an enormous general-purpose model for every job; instead, they can fine-tune a much smaller specialized model.
OpenAI released o3, which is capable of reflective reasoning. It is the successor to the o1 model, released in late 2024. These models are designed for questions that require step-by-step reasoning. [36]
Quality of Models vs. Number of Parameters Over the Years
So far, we've mostly focused on scale: model size, number of tokens, and FLOPs. But the improvements we've seen aren't simply due to increasing scale; we've also made considerable progress in model quality. Let's look at a math reasoning benchmark and see how model performance has improved over time. For this we'll use GSM8K [30], a dataset of grade-school math word problems designed to test mathematical reasoning. Here is one example:
Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

Answer: Natalia sold 48/2 = <<48/2=24>>24 clips in May. Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72
[Figure: Performance of language models on the GSM8K math reasoning benchmark over time.]
Above is a graph showing the performance of different models on this math reasoning benchmark over time. [23,31,32,33,34] Looking at the graph, two things are apparent:
1. Despite a dramatic reduction in model size in recent years, model accuracy was maintained or even improved.
2. In a very short span of time, smaller models have become increasingly capable. All of this happened in just 4 years.
Conclusion
This post is by no means an exhaustive list of the progress in AI over the past decade. We focused primarily on the model side and didn't cover areas such as hardware accelerators or the software side (like agentic frameworks).
The pace of progress has been nothing short of astonishing in the past decade.
We saw a dramatic increase in the size of models that led to an increase in their capabilities. We also witnessed innovations in model architectures that resulted in better model performance and training efficiency.
Despite this rapid progress, there are still critical areas in which LLMs can be improved. One very important area is hallucination: these models are trained as next-token predictors, which can cause them to generate plausible-sounding but incorrect or irrelevant information. Another critical area is continuous model improvement without significant architectural changes to the underlying model. Alignment is another key area that needs to be addressed: how can we ensure that these models are aligned with human goals and values? This in turn raises the ethical and societal implications of LLMs. Who defines the goals and values?
One more thing
Finally, my last point above highlights that we have a fair amount of control over how these models behave. If you look at LLMs at a high level and think of them as systems, then, like many other systems, we can define and design their behavior. We, as a society, created the system of national governments to help us live together. There are bad systems of government and good ones, and which they become is determined to a huge degree by their original design principles. The same is true of large language models: the way we design them will have a huge impact on how they behave and perform in the real world.
Thanks to Sean McGregor and Mohammadreza Heydari for reading drafts of this post.
References:
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: "Neural Machine Translation by Jointly Learning to Align and Translate", 2014; arXiv:1409.0473.
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu: "Exploring the Limits of Language Modeling", 2016; arXiv:1602.02410.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin: "Attention Is All You Need", 2017; arXiv:1706.03762.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2018; arXiv:1810.04805.
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever: "Improving Language Understanding by Generative Pre-Training", 2018; OpenAI.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever: "Language Models are Unsupervised Multitask Learners", 2019; OpenAI.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei: "Language Models are Few-Shot Learners", 2020; arXiv:2005.14165.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", 2020; arXiv:2005.11401.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre: "Training Compute-Optimal Large Language Models", 2022; arXiv:2203.15556.
Google Research Team (sorry, the list was too long to include all the authors): "PaLM: Scaling Language Modeling with Pathways", 2022; arXiv:2204.02311.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe: "Training language models to follow instructions with human feedback", 2022; arXiv:2203.02155.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou: "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", 2022; arXiv:2201.11903.
Gemini Team (sorry, the list was too long to include all the authors): "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context", 2024; arXiv:2403.05530.
Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro: “Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model", 2022; arXiv:2201.11990.
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré: “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness", 2022; arXiv:2205.14135.
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen: "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding" (MoE in LLMs), 2020; arXiv:2006.16668.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao: “ReAct: Synergizing Reasoning and Acting in Language Models", 2022; arXiv:2210.03629.
Anthropic Team (sorry, the list was too long to include all the authors): "Constitutional AI: Harmlessness from AI Feedback", 2022; arXiv:2212.08073.
Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei: “Deep reinforcement learning from human preferences", 2017; arXiv:1706.03741.
Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, Geoffrey Irving: “Fine-Tuning Language Models from Human Preferences", 2019; arXiv:1909.08593.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly: “Parameter-Efficient Transfer Learning for NLP", 2019; arXiv:1902.00751.
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed: “Mistral 7B", 2023; arXiv:2310.06825.
Gerganov, G.: GGML/llama.cpp, a C/C++ tensor library for LLM inference. GitHub, 2023.
Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le: “Symbolic Discovery of Optimization Algorithms", 2023; arXiv:2302.06675.
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever: “Zero-Shot Text-to-Image Generation", 2021; arXiv:2102.12092.
DeepSeek-AI team: "DeepSeek-V3 Technical Report", 2024; arXiv:2412.19437.
DeepSeek-AI team: "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", 2025; arXiv:2501.12948.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman: “Training Verifiers to Solve Math Word Problems”, 2021; arXiv:2110.14168.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample: “LLaMA: Open and Efficient Foundation Language Models”, 2023; arXiv:2302.13971.
Llama Team (the list was too long to include all the authors): "Llama 2: Open Foundation and Fine-Tuned Chat Models", 2023; arXiv:2307.09288.
Llama Team (the list was too long to include all the authors): "The Llama 3 Herd of Models", 2024; arXiv:2407.21783.
Gemma Team at Deepmind: “Gemma 2: Improving Open Language Models at a Practical Size”, 2024; arXiv:2408.00118.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei: “Scaling Laws for Neural Language Models”, 2020; arXiv:2001.08361.