Originally posted: 2025-01-01. Last updated: 2025-01-02. View source code for this page here.
The results from LLM benchmarks contain an apparent paradox: How can models have PhD-level performance but often fail at seemingly straightforward tasks to the extent that many people don't find them useful?
Even as a prolific user, I often find LLMs frustrating. It feels like I'm not using them right, and if I could somehow improve my prompts I'd be able to solve whole problems in one shot, rather than the iterative approach I rely on.
The underlying issue is that LLMs have a very different skill profile from humans. Only by understanding their relative strengths and weaknesses can we use them effectively.
But there's no simple rule or shortcut to developing this understanding. Benchmarks map out only a small and biased fraction of their capabilities, so it is up to the user to uncover the rest.
In this post I set out my mental model of LLMs and the heuristics I use to figure out when and how to use them effectively. I finish with some conclusions about how they may evolve in 2025.
In basic conversations LLMs are able to imitate humans, so it's easy to treat them like a human helper. But this is a mistake because LLMs' skills are not human-like at all.
Instead, when I think about their skills, I imagine a radar diagram a bit like this:
The human is an all-rounder¹, whereas the AI's skills are very spiky: it's really bad at some things, but vastly superhuman at others. The upshot is that, at the moment, LLMs mostly complement human abilities rather than replace them, at least for the kind of data science and data engineering work I do.
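For readers who want to sketch a similar chart for their own domain, here is a minimal matplotlib sketch of the kind of radar ("spider") chart I have in mind. The axes and the scores are purely illustrative placeholders, not measurements and not the data behind the figure above.

```python
# Minimal radar-chart sketch: an all-rounder human profile vs a spiky LLM profile.
# Axis labels and scores are illustrative only.
import numpy as np
import matplotlib.pyplot as plt

skills = ["Recall", "Speed", "Logical leaps", "Using context",
          "Knowing what's wanted", "Competitive programming"]
human = [5, 3, 7, 8, 8, 3]   # fairly even all-rounder profile
llm = [9, 9, 2, 3, 2, 9]     # spiky: superhuman in places, weak in others

angles = np.linspace(0, 2 * np.pi, len(skills), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={"polar": True})
for label, scores in [("Human", human), ("LLM", llm)]:
    values = scores + scores[:1]
    ax.plot(angles, values, label=label)
    ax.fill(angles, values, alpha=0.15)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(skills, fontsize=8)
ax.set_yticklabels([])
ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
plt.show()
```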
To give some examples of the user experience, from my use of frontier models in December 2024:
Overall, the mental model I often come back to is that LLMs interpolate imperfectly over existing knowledge. They have specialist knowledge of all subjects, and if they've seen a similar problem before they'll likely be useful. But they're unlikely to combine information from different fields to generate new insights.
Much of the above is fairly well understood, but I'd like to dig into the implications of this skills profile and why current-generation LLMs fall a long way short of replacing me in my job as a data scientist and FOSS maintainer.
In a nutshell the answer is that, if LLMs' skills are spiky and some skills remain very underdeveloped, these become bottlenecks. Furthermore, progress has tended to be most rapid in areas of existing competence, whereas progress in the 'bottleneck' areas seems relatively slow. So the bottlenecks increasingly dominate the sense of overall progress.
So what are these bottlenecks, and when is it time to close ChatGPT and think for yourself?
Many professional decisions are heavily constrained and dependent on vast amounts of institutional context that LLMs do not have.
For example, whilst a piece of software may be relatively straightforward to implement in an unconstrained environment, in an institutional context it may have to align with architectural principles, conform to a range of existing APIs, and so on.
In addition, an LLM's reasoning ability is likely to be greater on pre-training data than on in-context data. In my experience, whilst recall of in-context data can be good, I've had limited success in getting truly insightful answers, either via automated RAG (like custom GPTs) or by putting the whole knowledge base directly into a long-context model.
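To make the second of those approaches concrete, here is a minimal sketch of what "putting the whole knowledge base into a long-context model" can look like. It assumes the official `openai` Python client and a gpt-4o-class model; the `internal_docs` folder, the system prompt and the question are hypothetical placeholders.

```python
# Sketch only: concatenate an entire (hypothetical) internal knowledge base
# into the context window and ask a question over it.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Gather every markdown doc in the knowledge base and join them into one string.
docs = sorted(Path("internal_docs").glob("**/*.md"))
knowledge_base = "\n\n".join(f"## {doc}\n{doc.read_text()}" for doc in docs)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Answer using only the documents provided.\n\n" + knowledge_base},
        {"role": "user",
         "content": "Which of our existing APIs should the new pipeline reuse, and why?"},
    ],
)
print(response.choices[0].message.content)
```

An automated RAG setup would replace the concatenation step with a retrieval step that selects only the most relevant chunks; in both cases the model is reasoning over in-context data rather than anything it saw in pre-training.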
At the moment, I find that the work of gathering and curating the information relevant to a decision often outweighs the value the LLM can add. For run-of-the-mill corporate applications I don't foresee rapid progress in this area: the information is too disparate, or not written down at all, and human curation is very time consuming. I don't get the sense we're close to systems that can manage this automatically, and then use it intelligently.
Conversely, for very high-value applications I can imagine that, in the short term, humans will be employed to curate context and fine-tune models.
Humans often do not know what they want. This may not initially seem like a flaw in the LLM, but on closer inspection the weaknesses of LLMs amplify the problem.
That said, I think it's often underappreciated just how much the quality of results can change depending on the quality of the prompt. A good example of the amount of effort that can be needed to get good results is here; see also here.
In my experience, LLMs are unable to make significant logical leaps, or to point out when you've made a logical error³.
This constraint may be starting to be relaxed by chain-of-thought models like o1 and o3, although these capabilities still seem fairly nascent and require a lot of compute to make apparently straightforward logical leaps.
Another aspect to this constraint is that humans take an iterative approach to problem solving - building up context by trying things, seeing if they work, and working towards a solution. I haven't yet seen much evidence of this ability developing, perhaps because LLMs make too many mistakes, and usually can't verify when they have got things correct. Functionality similar to ChatGPT's memory function could potentially go some way to addressing this shortcoming, but I've found it very underwhelming.
Putting all of this together, I think you end up in a situation where the typical user treats LLMs too much like a human⁴, and as a result gets unsatisfactory results and undervalues them.
Since the bottlenecks dominate many users' experience, they obscure the almost unbelievable progress at the other end of the spectrum: on things like symbolic maths and competitive programming, which few users are chatting to LLMs about.
Whilst this suggests LLMs may be undervalued, it's useful to illustrate their real-life shortcomings with examples that I think are currently way outside LLMs' capabilities. Two recent examples that have stuck in my mind are:
LLMs are already difficult to work with because they are not human-like, and I think they're going to get even stranger as their performance becomes more spiky. Rapid progress will continue in their areas of strength, but progress will remain slow elsewhere. For most users this means LLMs will not 'seem' that much better, even though their performance will be vastly superhuman in an increasing number of areas.
To give an example, it's plausible that LLMs may start making substantive contributions to scientific breakthroughs, whilst remaining seemingly quite stupid to everyday users.
As a result, I don't see an imminent prospect of LLMs replacing knowledge workers wholesale, though I do think they enable increased productivity and, in turn, smaller teams.
Here are some articles I've found interesting in shaping my thoughts:
It's important to distinguish skills from knowledge. For general knowledge, the LLM is clearly a better all-rounder. ↩
On completely novel problems, its reasoning powers are almost non-existent (see the ARC benchmark). But in real-world usage, you're often asking it to solve problems that are novel to you but that other humans have solved before, so it appears to be good at them. The problem is that it's difficult to know which problems are novel and which are not. ↩
Certainly in the past year, I can think of several 'aha' moments when a colleague has suggested an idea or identified a problem and a previously difficult issue suddenly became clear. I can't think of a real eureka moment I've had from an LLM. ↩
To be clear, I fall into this trap all the time; I'm far from some sort of (probably apocryphal) superuser who's able to consistently get great results. ↩