Originally posted: 2025-01-01. Last updated: 2025-01-02. View source code for this page here.
The results from LLM benchmarks contain an apparent paradox: How can models have PhD-level performance but often fail at seemingly straightforward tasks to the extent that many people don't find them useful?
Even as a prolific user, I often find LLMs frustrating. It feels like I'm not using them right, and if I could somehow improve my prompts I'd be able to solve whole problems in one shot, rather than the iterative approach I rely on.
The underlying issue is that LLMs have a very different skill profile from humans. Only by understanding their relative strengths and weaknesses can we use them effectively.
But there's no simple rule or shortcut to developing this understanding. Benchmarks map out only a small and biased fraction of their capabilities, so it is up to the user to uncover the rest.
In this post I set out my mental model of LLMs and the heuristics I use to figure out when and how to use them effectively. I finish with some conclusions about how they may evolve in 2025.
In basic conversations LLMs are able to imitate humans, so it's easy to treat them like a human helper. But this is a mistake because LLMs' skills are not human-like at all.
Instead, when I think about their skills, I imagine a radar diagram a bit like this:
The human is an all-rounder¹, whereas the AI's skills are very spiky: it's really bad at some things, but vastly superhuman at others. The upshot is that, at the moment, LLMs mostly complement human abilities rather than replace them, at least for the kind of data science and data engineering work I do.
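For readers who want to sketch a similar chart for their own domain, here is a minimal matplotlib sketch of the kind of radar ("spider") chart I have in mind. The axes and the scores are purely illustrative placeholders, not measurements and not the data behind the figure above.

```python
# Minimal radar-chart sketch: an all-rounder human profile vs a spiky LLM profile.
# Axis labels and scores are illustrative only.
import numpy as np
import matplotlib.pyplot as plt

skills = ["Recall", "Speed", "Logical leaps", "Using context",
          "Knowing what's wanted", "Competitive programming"]
human = [5, 3, 7, 8, 8, 3]   # fairly even all-rounder profile
llm = [9, 9, 2, 3, 2, 9]     # spiky: superhuman in places, weak in others

angles = np.linspace(0, 2 * np.pi, len(skills), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={"polar": True})
for label, scores in [("Human", human), ("LLM", llm)]:
    values = scores + scores[:1]
    ax.plot(angles, values, label=label)
    ax.fill(angles, values, alpha=0.15)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(skills, fontsize=8)
ax.set_yticklabels([])
ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
plt.show()
```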
To give some examples of the user experience, from my use of frontier models in December 2024:
Overall, the mental model I often come back to is that LLMs interpolate imperfectly over existing knowledge. They have specialist knowledge of all subjects, and if they've seen a similar problem before they'll likely be useful. But they're unlikely to combine information from different fields to generate new insights.
Much of the above is fairly well understood, but I'd like to dig into the implications of this skills profile and why current-generation LLMs fall a long way short of replacing me in my job as a data scientist and FOSS maintainer.
In a nutshell the answer is that, if LLMs' skills are spiky and some skills remain very underdeveloped, these become bottlenecks. Furthermore, progress has tended to be most rapid in areas of existing competence, whereas progress in the 'bottleneck' areas seems relatively slow. So the bottlenecks increasingly dominate the sense of overall progress.
So what are these bottlenecks, and when is it time to close ChatGPT and think for yourself?
Many professional decisions are heavily constrained and dependent on vast amounts of institutional context that LLMs do not have.
For example, whilst a piece of software may be relatively straightforward to implement in an unconstrained environment, in an institutional context it may have to align with architectural principles, conform to a range of existing APIs, and so on.
In addition, an LLM's reasoning ability is likely to be greater on pre-training data than on in-context data. In my experience, whilst recall of in-context data can be good, I've had limited success in getting truly insightful answers, either via automated RAG (like custom GPTs) or by putting the whole knowledge base directly into a long-context model.
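To make the second of those approaches concrete, here is a minimal sketch of what "putting the whole knowledge base into a long-context model" can look like. It assumes the official `openai` Python client and a gpt-4o-class model; the `internal_docs` folder, the system prompt and the question are hypothetical placeholders.

```python
# Sketch only: concatenate an entire (hypothetical) internal knowledge base
# into the context window and ask a question over it.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Gather every markdown doc in the knowledge base and join them into one string.
docs = sorted(Path("internal_docs").glob("**/*.md"))
knowledge_base = "\n\n".join(f"## {doc}\n{doc.read_text()}" for doc in docs)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Answer using only the documents provided.\n\n" + knowledge_base},
        {"role": "user",
         "content": "Which of our existing APIs should the new pipeline reuse, and why?"},
    ],
)
print(response.choices[0].message.content)
```

An automated RAG setup would replace the concatenation step with a retrieval step that selects only the most relevant chunks; in both cases the model is reasoning over in-context data rather than anything it saw in pre-training.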
At the moment, I find that the work of gathering and curating the information relevant to a decision often outweighs the value the LLM can add. For run-of-the-mill corporate applications I don't foresee rapid progress in this area: the information is too disparate, or not written down at all, and human curation is very time consuming. I don't get the sense we're close to systems that can manage this automatically, and then use it intelligently.
Conversely, for very high-value applications I can imagine that, in the short term, humans will be employed to curate context and fine-tune models.
Humans often do not know what they want. This may not initially seem like a flaw in the LLM, but on closer inspection the weaknesses of LLMs amplify the problem.
That said, I think it's often underappreciated just how much the quality of results can change depending on the quality of the prompt. A good example of the amount of effort that can be needed to get good results is here; see also here.
In my experience, LLMs are unable to make significant logical leaps, or to point out when you've made a logical error³.
This constraint may be starting to be relaxed by chain-of-thought models like o1 and o3, although these capabilities still seem fairly nascent and require a lot of compute to make apparently straightforward logical leaps.
Another aspect to this constraint is that humans take an iterative approach to problem solving - building up context by trying things, seeing if they work, and working towards a solution. I haven't yet seen much evidence of this ability developing, perhaps because LLMs make too many mistakes, and usually can't verify when they have got things correct. Functionality similar to ChatGPT's memory function could potentially go some way to addressing this shortcoming, but I've found it very underwhelming.
Putting all of this together, I think you end up in a situation where the typical user treats LLMs too much like a human⁴, and as a result gets unsatisfactory results and undervalues them.
Since the bottlenecks dominate many users' experience, they obscure the almost unbelievable progress at the other end of the spectrum: on things like symbolic maths and competitive programming, which few users are chatting to LLMs about.
Whilst this suggests LLMs may be undervalued, it's useful to illustrate their real-life shortcomings with examples that I think are currently way outside LLMs' capabilities. Two recent examples that have stuck in my mind are:
LLMs are already difficult to work with because they are not human-like, and I think they're going to get even stranger as their performance becomes more spiky. Rapid progress will continue in their areas of strength, but progress will remain slow elsewhere. For most users this means LLMs will not 'seem' that much better, even though their performance will be vastly superhuman in an increasing number of areas.
To give an example, it's plausible that LLMs may start making substantive contributions to scientific breakthroughs, whilst remaining seemingly quite stupid to everyday users.
As a result, I don't see an imminent prospect of LLMs replacing knowledge workers wholesale, though I do think they enable increased productivity and, in turn, smaller teams.
Here are some articles I've found interesting in shaping my thoughts:
It's important to distinguish skills from knowledge. For general knowledge, the LLM is clearly a better all-rounder. ↩
On completely novel problems, its reasoning powers are almost non-existent (see the ARC benchmark). But in real-world usage, you're often asking it to solve problems that are novel to you but that other humans have solved before, so it appears to be good at them. The problem is that it's difficult to know which problems are novel and which are not. ↩
Certainly in the past year, I can think of several 'aha' moments when a colleague has suggested an idea or identified a problem and a previously difficult issue suddenly became clear. I can't think of a real eureka moment I've had from an LLM. ↩
To be clear, I fall into this trap all the time; I'm far from some sort of (probably apocryphal) superuser who's able to consistently get great results. ↩