By: Thomas Stahura
Is Grok SOTA?
If that phrase comes across as gibberish, allow me to explain.
On Monday, xAI (Elon’s AI company) launched Grok 3, claiming state-of-the-art (SOTA) performance. SOTA has become a sort of catch-all term for crowning AI models. Grok’s benchmarks are impressive: 93 on AIME (math), 85 on GPQA (science), and 79 on LCB (coding). These marks outperform the likes of o3-mini-high, o1, DeepSeek R1, sonnet-3.5, and gemini 2.0 flash. Essentially, Grok 3 outperforms every model except the yet-to-be-released o3. An impressive feat for a 17-month-old company!
I could mention that Grok 3 was trained on 100k+ GPUs, or that xAI built an entire data center in a matter of months. But much has already been written about that. So, given all that's happened this year with open source, distillation, and a number of tiny companies achieving SOTA performance, it’s much more useful to discuss walls, moats, and bottlenecks in the AI industry.
Walls
The question about a “Wall” in AI is really a question about where, when, or if AI researchers will reach a point where model improvements stall. Some say we will run out of viable high-quality data and hit the “data wall”. Others claim returns on additional training compute will flatten into a “training wall”. Regardless of the panic, AI has yet to hit the brakes on improvement. Synthetic data (via reinforcement learning) seems to be working, and more compute, as demonstrated by Grok 3, continues to lead to better performance.
So where is this “Wall”?
The scaling laws in AI suggest that while there isn't a hard "wall" per se, there is a fundamental relationship between compute, model size, and performance that follows a power law. This relationship, often expressed as L ∝ C^(-α), where L is the loss (lower is better) and C is compute, shows that each incremental improvement requires exponentially more resources. For instance, to cut the loss in half, we might need to increase compute by a factor of 10 or more, depending on where we are on the scaling curve. This doesn't mean we hit an absolute wall, but rather face increasingly diminishing returns that create economic and practical limitations. Essentially, there exists a "soft wall" where the cost-benefit ratio becomes prohibitively expensive. So how then have multiple small AI labs reached SOTA so quickly?
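To put rough numbers on that, here’s a quick sketch of the arithmetic. The exponents are illustrative assumptions, not measured values for any particular model:

```python
# Under L ∝ C^(-alpha), how much more compute does halving the loss take?
# The alpha values below are made up for illustration; published estimates
# vary by model family and dataset.

def compute_multiplier_to_halve_loss(alpha: float) -> float:
    """Returns the factor by which compute must grow to cut loss in half."""
    return 2 ** (1 / alpha)

for alpha in (0.3, 0.1, 0.05):  # hypothetical scaling exponents
    print(f"alpha={alpha}: need {compute_multiplier_to_halve_loss(alpha):,.0f}x more compute")

# alpha=0.3  -> ~10x more compute
# alpha=0.1  -> ~1,024x more compute
# alpha=0.05 -> ~1,048,576x more compute
```

The smaller the exponent, the harsher the soft wall: the same halving of loss can swing from a 10x compute bill to a million-x one.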
Moats
When OpenAI debuted ChatGPT in November 2022, the consensus was that it would take competitors years to develop their own models and catch up. Ten months later, Mistral, a previously unknown AI lab out of France, launched Mistral 7B, a first-of-its-kind open-source small language model. Turns out that training a model, while still extremely expensive, costs less than a single Boeing 747.
The power law relationship can also help us understand how smaller AI firms catch up so quickly. The lower you are on the curve, the steeper the improvements are for each unit of compute invested, allowing smaller players to achieve significant gains with relatively modest resources. This "low-hanging fruit" phenomenon means that while industry leaders might need to spend billions to achieve marginal improvements at the frontier, newer entrants can leverage existing research, open-source implementations, and more efficient architectures to rapidly climb the steeper part of the curve. (At Ascend, we define this as AI’s “fast followers”.)
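For a rough sense of why the bottom of the curve is so much more forgiving, here’s a small illustration using the same power law. The constants are made up; only the shape matters:

```python
# Why the curve is "steeper" for small players: the same 2x spend buys a much
# larger absolute loss reduction low on the curve than at the frontier.
# k and alpha are hypothetical constants chosen only for illustration.

k, alpha = 10.0, 0.1

def loss(compute: float) -> float:
    return k * compute ** -alpha

for budget in (1e3, 1e9):  # small lab vs frontier lab, in arbitrary compute units
    gain = loss(budget) - loss(2 * budget)  # improvement from doubling compute
    print(f"budget={budget:.0e}: doubling compute cuts loss by {gain:.4f}")
```

Doubling a small budget buys several times the absolute improvement that doubling a frontier-scale budget does, which is exactly the fast-follower dynamic.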
Costs have only gone down since 2022, thanks to new techniques like model distillation and synthetic data generation, the same techniques DeepSeek used to build R1 for a reported $6 million.
The perceived "moat" of computational resources isn't as defensible as initially thought. It seems the application layer is the most defensible part of the AI stack. But what is holding up mass adoption?
Bottlenecks
Agents, as I mentioned last week, are the main AI application. And agents, in their ultimate form, are autonomous systems tasked with accomplishing a goal in the digital environment. These systems need to be consistently reliable if they are to be of value. Agent reliability is mainly affected by two things: prompting and pointing.
Since an agent runs in a reasoning loop until its given goal is achieved, the prompt used to set up and maintain that loop is crucial. The loop prompt runs on every step and should reintroduce the task, tools, feedback, and response schema to the LLM. Ultimately, these AI systems are probabilistic, so the loop prompt should be worded to maximize the probability of a correct response at each step. Much easier said than done.
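Here’s a minimal sketch of that loop-prompt pattern. The `call_llm` and `execute_tool` callables are hypothetical placeholders you would wire up to a real model API and tool executor:

```python
import json

# Hypothetical tool registry the agent is allowed to use.
TOOLS = {
    "click": "click(x, y): click the given screen coordinates",
    "type": "type(text): type text into the focused field",
    "done": "done(summary): declare the goal accomplished",
}

LOOP_PROMPT = """You are an autonomous agent.
Goal: {goal}
Available tools:
{tools}
Feedback from your last action:
{feedback}
Respond ONLY with JSON: {{"tool": "<name>", "args": {{...}}}}"""


def run_agent(goal, call_llm, execute_tool, max_steps=20):
    feedback = "none yet"
    for _ in range(max_steps):
        # Reintroduce the task, tools, feedback, and response schema on every step.
        prompt = LOOP_PROMPT.format(
            goal=goal,
            tools="\n".join(TOOLS.values()),
            feedback=feedback,
        )
        action = json.loads(call_llm(prompt))
        if action["tool"] == "done":
            return action["args"]
        feedback = execute_tool(action)  # observation fed back into the next step
    return "max steps reached without completing the goal"
```

Every word of that template shifts the odds of the model returning well-formed, correct JSON on step fifty as well as step one, which is why prompting is half the reliability battle.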
Vision is another bottleneck. For example, if an agent decides it needs to open the Firefox browser to get online, it first needs to move the mouse to the Firefox icon, which means it needs to see and understand the user interface (UI).
Thankfully, we have vision language models (VLMs) for this! The thing is, while these VLMs can caption an image, they don’t understand precise icon locations well enough to provide pixel-perfect x and y coordinates. At least not yet to any reliable degree.
To prove this point, I conducted a VLM pointing competition wherein I had gpt-4o, sonnet-3.5, moondream 2, llama 3.3 70b, and molmo 7b (running on replicate) point at various icons on my Linux server.
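The eval itself is simple to sketch. Something like the following, where the `ask_vlm` callable and the icon bounding boxes are placeholders rather than my exact harness:

```python
import re

# Hypothetical ground-truth icon boxes on the test screenshot: (x_min, y_min, x_max, y_max).
ICONS = {
    "firefox": (12, 48, 60, 96),
    "terminal": (12, 112, 60, 160),
}

PROMPT = "Return the pixel coordinates of the {name} icon as 'x,y'. Nothing else."


def score_model(ask_vlm, screenshot_path: str, trials: int = 10) -> float:
    """ask_vlm(image_path, prompt) -> str is a placeholder for whichever model is under test."""
    hits, total = 0, 0
    for name, (x0, y0, x1, y1) in ICONS.items():
        for _ in range(trials):
            reply = ask_vlm(screenshot_path, PROMPT.format(name=name))
            match = re.search(r"(\d+)\s*,\s*(\d+)", reply)
            if match:
                x, y = int(match.group(1)), int(match.group(2))
                hits += (x0 <= x <= x1) and (y0 <= y <= y1)
            total += 1
    return hits / total  # fraction of clicks that land inside the icon
```

A score of 1.0 means every simulated click would have landed on the icon; anything less compounds fast once an agent needs dozens of clicks in a row.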
Our perception of icons and logos is second nature to us humans, especially those of us who grew up in the information age. It boggles the mind that these models, which are now as smart as a graduate student, can’t do this simple task ten times in a row. In my opinion, agents will be viable only when they can reliably make hundreds or even thousands of correct clicks. So maybe in a few months… Or you can tune in next week for Token Talk 8!
P.S. If you have any questions or just want to talk about AI, email me! thomas@ascend.vc