Originally posted: 2025-01-16.

Teaching AI models to debug

Weird experience tonight teaching o3-mini to debug. It can code better than me, but it seems clueless at debugging. It only succeeded once I told it: 'Don't try to fix this all at once - create a minimal reprex that outputs debugging information that will help you solve this problem.'

Feels like a combination of

  1. asking for clarification
  2. knowing how to debug
  3. adding code interpreter

would make Cursor far more powerful, without any significant breakthroughs needed in underlying LLM capability. Knowing when to do (1) is perhaps the hardest.
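As a toy illustration (mine, not from the actual session), the kind of minimal reprex I was asking for looks like this: isolate the suspected failure in a few lines and print intermediate state, rather than patching the whole program blind.

```python
# Hypothetical buggy helper, reduced from a larger program.
def normalise(scores):
    total = sum(scores)
    return [s / total for s in scores]

# Minimal reprex: the smallest input that triggers the failure,
# printing the intermediate value the fix will depend on.
scores = [0.0, 0.0]
print("total =", sum(scores))
try:
    print(normalise(scores))
except ZeroDivisionError as exc:
    print("bug reproduced:", exc)
```

The debugging output (the printed `total`) points straight at the cause, which is exactly the information the model needed before attempting a fix.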

19 Feb 2025
ai, debugging

Why don't AI models ask for clarification?

Why don't AI models ask for clarification more often? I've often under-specified a question, and I can remember very few times the model has asked for more info before proceeding. Only later do I realise I didn't provide enough context.

It feels potentially hard to get data for training/RL, but once or twice I've gone into a 'doom loop' of failed code suggestions. Zooming out a bit, the problem was that I'd asked for a specific approach which was never going to work; the model one-shotted it as soon as it realised the intent and that the direction was wrong.

17 Feb 2025
ai, llms

Using uv timestamp for dependency management

From epistasis:

One other key part of this is freezing a timestamp with your dependency list, because Python packages are absolutely terrible at maintaining compatibility a year or three or five later as PyPI populates with newer and newer versions. The special toml incantation is [tool.uv] exclude-newer:

# /// script
# dependencies = [
#   "requests",
# ]
# [tool.uv]
# exclude-newer = "2023-10-16T00:00:00Z"
# ///

https://docs.astral.sh/uv/guides/scripts/#improving-reproducibility

This has also let me easily reconstruct some older environments in less than a minute, when I've been version hunting for 30-60 minutes in the past. The speed of uv environment building helps a ton too.
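Putting it together, a complete runnable script with a frozen timestamp might look like this (the script body is my own illustration, not from the comment):

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "requests",
# ]
# [tool.uv]
# exclude-newer = "2023-10-16T00:00:00Z"
# ///
# uv resolves "requests" as it existed on PyPI at the pinned date, so
# re-running this script years later rebuilds the same environment.
import requests

resp = requests.get("https://example.com", timeout=10)
print(resp.status_code)
```

Run it with `uv run script.py`; uv reads the inline metadata block, ignores any package versions published after the `exclude-newer` timestamp, and builds the environment before executing.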

16 Feb 2025
python

LLMs for government productivity

I think the single most productivity-enhancing use of LLMs in government would be to give all government devs and data scientists access to Cursor (or similar). I am not yet convinced of the widespread value of 'behind the scenes' use cases of LLMs, but very bullish on skilled human-in-the-loop applications, especially coding.

Undoubtedly this situation will change as models improve, but at the moment models usually don't have enough 9s of reliability for fully automated use cases.

13 Feb 2025
LLMs

LLMs getting better at coding

The latest Lex episode, at 02:43:31, has a good section on why chain-of-thought matters over and above just 'it increases the benchmark results'.

It hit me in a way very similar to when I first heard (transcript) the idea that code generation abilities would improve faster than natural language ones. It feels like chain-of-thought is potentially the key to making this work.

These two Karpathy tweets are also very relevant: tweet1 tweet2

Seems quite likely we'll see superhuman coding abilities in the not too distant future.

05 Feb 2025
LLMs

Using DuckDB in ChatGPT Code Interpreter

You can use DuckDB in ChatGPT's code interpreter by providing this specific wheel file:

duckdb-1.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

from here.

If you encounter any issues, you can copy-paste the list of available wheels into ChatGPT and ask it to analyse which one is most likely to work.
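As a rough sketch of that filtering step (the helper below is my own, not part of any tool): Code Interpreter at the time ran CPython 3.11 on x86-64 Linux, so a compatible wheel needs a `cp311` (or `py3`) Python tag and a `manylinux` x86_64 (or `any`) platform tag.

```python
# Sketch: filter candidate wheel filenames by their compatibility tags.
# Wheel naming convention: name-version-pythontag-abitag-platformtag.whl
def pick_wheel(filenames, python_tag="cp311"):
    """Return wheels whose tags suggest Code Interpreter compatibility."""
    matches = []
    for name in filenames:
        if not name.endswith(".whl"):
            continue
        tags = name[:-4].split("-")
        python, platform = tags[-3], tags[-1]
        if python in (python_tag, "py3") and (
            platform == "any" or ("manylinux" in name and "x86_64" in name)
        ):
            matches.append(name)
    return matches

wheels = [
    "duckdb-1.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "duckdb-1.1.3-cp311-cp311-win_amd64.whl",
    "duckdb-1.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "altair-5.5.0-py3-none-any.whl",
]
print(pick_wheel(wheels))
```

This keeps the manylinux cp311 DuckDB wheel and the pure-Python Altair wheel, and rejects the Windows and cp310 builds.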

For Altair, you need: narwhals-1.24.1-py3-none-any.whl then altair-5.5.0-py3-none-any.whl

Note: I could not get PNG generation to work because the vl_convert_python wheel would not install.

28 Jan 2025
duckdb, LLMs

Using a uv shebang line

Rob Allen says:

I create a fair few scripts in my ~/bin/ directory to automate tasks. Since discovering uv and inline script metadata, I've started using Python far more for these.

As ~/bin is on my path, I want to run the script by calling it directly on the command line. To do this, I use this shebang:

#!/usr/bin/env -S uv run --script

Full article here.
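As a hypothetical example (script name and contents are mine, not Rob's), a `~/bin/greet` script using this shebang might look like:

```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
# dependencies = []
# ///
# uv reads the inline metadata above and runs the file with a matching
# Python - no manual venv needed.
import datetime


def greeting() -> str:
    return f"Hello! Today is {datetime.date.today():%A}."


print(greeting())
```

After `chmod +x ~/bin/greet`, calling `greet` from anywhere on your path runs it via uv. The `-S` flag tells env to split `uv run --script` into separate arguments, which is what makes the multi-word shebang work.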

28 Jan 2025
python

ClickHouse Obfuscator - Data Anonymization Tool

The ClickHouse obfuscator is a tool for anonymizing production data while preserving its key statistical properties. It maintains:

  • Value cardinalities and distributions
  • Data compression ratios
  • String lengths and UTF-8 validity
  • Time series continuity

See also here.

16 Jan 2025
data

National Data Library thoughts

Three key priorities for building a National Data Library:

  1. Data sharing and governance must come first - before any technical solutions. Success depends on data owners being convinced the NDL will make their lives easier, not harder.

  2. Access permissions and management systems need to be rock-solid before building analytical capabilities. Get the foundations right.

  3. Design for an AI-first future - by the time the NDL is delivered, most analysis will likely be AI-assisted. The architecture needs to anticipate this.

16 Jan 2025
data

An early email I sent on LLMs

The following is an extract from an email I sent in April 2023 - of interest because it feels like this situation hasn't changed dramatically in the intervening 2 years.

--

No doubt you've seen some of the media coverage around new AI models, especially OpenAI GPT models, and have been thinking about how to use them with your teams.

I've been using them quite extensively, and the more I use them, the more impressed I am - i.e. I'm pretty sure this is not just a fad, but is a serious technological breakthrough that has the potential to revolutionise quite significant parts of Civil Service work.

I thought it may be useful to note a handful of areas where it feels like there is low-hanging fruit, and where the models may be appropriate for serious applications:

  1. Zero-shot labelling. This is the ability to take an input document and categorise it according to any criterion of your choice without training a new AI model (that's the 'zero-shot' bit). For example, taking a sentencing transcript as an input and having the model categorise whether there was 'use of a weapon'. The important technical advance here is that these new models understand semantics, not just keywords, so the transcript could contain 'the victim was stabbed', and the computer would recognise this as use of a weapon.

  2. Semantic search across a large corpus of potentially unstructured documents. There's an example here of using these models to analyse 1000 pages of Tesla's annual reports: Tweet.

Both (1) and (2) have been 'somewhat' possible in the past, but have been lots of work and haven't worked that well. What's new is that these are now much easier and more accurate.

  3. Code completion. As a data scientist, I'm getting ChatGPT 4 to write probably 50%+ of my code. So at the very least, these models are a huge productivity amplifier for data scientists.

The biggest challenge is around data sharing and using these tools with sensitive government data. It feels like getting a head start on understanding these legal issues may be an important first step.

16 Apr 2023
LLMs