Originally posted: 2025-01-16.

Teaching AI models to debug

Weird experience tonight teaching o3-mini to debug. It can code better than me, but it seems clueless at debugging. It only succeeded once I told it: 'Don't try to fix this all at once - create a minimal reprex that outputs debugging information that will help you solve this problem.'

Feels like a combination of

  1. asking for clarification
  2. knowing how to debug
  3. adding code interpreter

would make Cursor far more powerful, without any significant breakthroughs needed in underlying LLM capability. Knowing when to do (1) is perhaps the hardest.
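As a toy illustration (mine, not from the actual session), the kind of minimal reprex I was asking for looks like this: isolate the suspected failure in a few lines and print intermediate state, rather than patching the whole program blind.

```python
# Hypothetical buggy helper, reduced from a larger program.
def normalise(scores):
    total = sum(scores)
    return [s / total for s in scores]

# Minimal reprex: the smallest input that triggers the failure,
# printing the intermediate value the fix will depend on.
scores = [0.0, 0.0]
print("total =", sum(scores))
try:
    print(normalise(scores))
except ZeroDivisionError as exc:
    print("bug reproduced:", exc)
```

The debugging output (the printed `total`) points straight at the cause, which is exactly the information the model needed before attempting a fix.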

19 Feb 2025
ai, debugging

Why don't AI models ask for clarification?

Why don't AI models ask for clarification more often? I've often under-specified a question, and I can remember very few times the model has asked for more info before proceeding. Only later do I realise I didn't provide enough context.

It feels potentially hard to get data for training/RL, but once or twice I've gone into a 'doom loop' of failed code suggestions. Zooming out a bit, the problem was that I'd asked for a specific approach which was never going to work; the model one-shotted it as soon as it realised the intent and that the direction was wrong.

17 Feb 2025
ai, llms

Using uv timestamp for dependency management

From epistasis:

One other key part of this is freezing a timestamp with your dependency list, because Python packages are absolutely terrible at maintaining compatibility a year or three or five later as PyPI populates with newer and newer versions. The special toml incantation is [tool.uv] exclude-newer:

# /// script
# dependencies = [
#   "requests",
# ]
# [tool.uv]
# exclude-newer = "2023-10-16T00:00:00Z"
# ///

https://docs.astral.sh/uv/guides/scripts/#improving-reproducibility

This has also let me easily reconstruct some older environments in less than a minute, when I've been version hunting for 30-60 minutes in the past. The speed of uv environment building helps a ton too.
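Putting it together, a complete runnable script with a frozen timestamp might look like this (the script body is my own illustration, not from the comment):

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "requests",
# ]
# [tool.uv]
# exclude-newer = "2023-10-16T00:00:00Z"
# ///
# uv resolves "requests" as it existed on PyPI at the pinned date, so
# re-running this script years later rebuilds the same environment.
import requests

resp = requests.get("https://example.com", timeout=10)
print(resp.status_code)
```

Run it with `uv run script.py`; uv reads the inline metadata block, ignores any package versions published after the `exclude-newer` timestamp, and builds the environment before executing.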

16 Feb 2025
python

LLMs for government productivity

I think the single most productivity-enhancing use of LLMs in government would be to give all government devs and data scientists access to Cursor (or similar). I am not yet convinced of the widespread value of 'behind the scenes' use cases of LLMs, but very bullish on skilled human-in-the-loop applications, especially coding.

Undoubtedly this situation will change as models improve, but at the moment models usually don't have enough 9s of reliability for fully automated use cases.

13 Feb 2025
LLMs

LLMs getting better at coding

The latest Lex episode, at 02:43:31, has a good section on why chain-of-thought matters over and above just 'it increases the benchmark results'.

It hit me in a way very similar to when I first heard (transcript) the idea that code generation abilities would improve faster than natural language ones. It feels like chain-of-thought is potentially the key to making this work.

These two Karpathy tweets are also very relevant: tweet1 tweet2

Seems quite likely we'll see superhuman coding abilities in the not too distant future.

05 Feb 2025
LLMs

Using DuckDB in ChatGPT Code Interpreter

You can use DuckDB in ChatGPT's code interpreter by providing this specific wheel file:

duckdb-1.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

from here.

If you encounter any issues, you can copy-paste the list of available wheels into ChatGPT and ask it to analyse which one is most likely to work.
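As a rough sketch of that filtering step (the helper below is my own, not part of any tool): Code Interpreter at the time ran CPython 3.11 on x86-64 Linux, so a compatible wheel needs a `cp311` (or `py3`) Python tag and a `manylinux` x86_64 (or `any`) platform tag.

```python
# Sketch: filter candidate wheel filenames by their compatibility tags.
# Wheel naming convention: name-version-pythontag-abitag-platformtag.whl
def pick_wheel(filenames, python_tag="cp311"):
    """Return wheels whose tags suggest Code Interpreter compatibility."""
    matches = []
    for name in filenames:
        if not name.endswith(".whl"):
            continue
        tags = name[:-4].split("-")
        python, platform = tags[-3], tags[-1]
        if python in (python_tag, "py3") and (
            platform == "any" or ("manylinux" in name and "x86_64" in name)
        ):
            matches.append(name)
    return matches

wheels = [
    "duckdb-1.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "duckdb-1.1.3-cp311-cp311-win_amd64.whl",
    "duckdb-1.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "altair-5.5.0-py3-none-any.whl",
]
print(pick_wheel(wheels))
```

This keeps the manylinux cp311 DuckDB wheel and the pure-Python Altair wheel, and rejects the Windows and cp310 builds.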

For Altair, you need: narwhals-1.24.1-py3-none-any.whl then altair-5.5.0-py3-none-any.whl

Note: I could not get PNG generation to work because the vl_convert_python wheel would not install.

28 Jan 2025
duckdb, LLMs

Using a uv shebang line

Rob Allen says:

I create a fair few scripts in my ~/bin/ directory to automate tasks. Since discovering uv and inline script metadata, I've started using Python far more for these.

As ~/bin is on my path, I want to run the script by calling it directly on the command line. To do this, I use this shebang:

#!/usr/bin/env -S uv run --script

Full article here.
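As a hypothetical example (script name and contents are mine, not Rob's), a `~/bin/greet` script using this shebang might look like:

```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
# dependencies = []
# ///
# uv reads the inline metadata above and runs the file with a matching
# Python - no manual venv needed.
import datetime


def greeting() -> str:
    return f"Hello! Today is {datetime.date.today():%A}."


print(greeting())
```

After `chmod +x ~/bin/greet`, calling `greet` from anywhere on your path runs it via uv. The `-S` flag tells env to split `uv run --script` into separate arguments, which is what makes the multi-word shebang work.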

28 Jan 2025
python

ClickHouse Obfuscator - Data Anonymization Tool

The ClickHouse obfuscator is a tool for anonymizing production data while preserving its key statistical properties. It maintains:

  • Value cardinalities and distributions
  • Data compression ratios
  • String lengths and UTF-8 validity
  • Time series continuity

See also here.

16 Jan 2025
data

National Data Library thoughts

Three key priorities for building a National Data Library:

  1. Data sharing and governance must come first - before any technical solutions. Success depends on data owners being convinced the NDL will make their lives easier, not harder.

  2. Access permissions and management systems need to be rock-solid before building analytical capabilities. Get the foundations right.

  3. Design for an AI-first future - by the time the NDL is delivered, most analysis will likely be AI-assisted. The architecture needs to anticipate this.

16 Jan 2025
data

An early email I sent on LLMs

The following is an extract from an email I sent in April 2023 - of interest because it feels like this situation hasn't changed dramatically in the intervening 2 years.

--

No doubt you've seen some of the media coverage around new AI models, especially OpenAI GPT models, and have been thinking about how to use them with your teams.

I've been using them quite extensively, and the more I use them, the more impressed I am - i.e. I'm pretty sure this is not just a fad, but is a serious technological breakthrough that has the potential to revolutionise quite significant parts of Civil Service work.

I thought it may be useful to note a handful of areas where it feels like there is low-hanging fruit, and where the models may be appropriate for serious applications:

  1. Zero-shot labelling. This is the ability to take an input document and categorise it according to any criterion of your choice without training a new AI model (that's the 'zero-shot' bit). For example, taking a sentencing transcript as an input and having the model categorise whether there was 'use of a weapon'. The important technical advance here is that these new models understand semantics, not just keywords, so the transcript could contain 'the victim was stabbed', and the computer would recognise this as use of a weapon.

  2. Semantic search across a large corpus of potentially unstructured documents. There's an example here of using these models to analyse 1000 pages of Tesla's annual reports: Tweet.

Both (1) and (2) have been 'somewhat' possible in the past, but have been lots of work and haven't worked that well. What's new is that these are now much easier and more accurate.

  3. Code completion. As a data scientist, I'm getting ChatGPT 4 to write probably 50%+ of my code. So at the very least, these models are a huge productivity amplifier for data scientists.

The biggest challenge is around data sharing and using these tools with sensitive government data. It feels like getting a head start on understanding these legal issues may be an important first step.

16 Apr 2023
LLMs