Originally posted: 2025-01-16. View source code for this page here.

Why aren't tables first class citizens in programming languages?

If you step back, it's kind of weird that there's no mainstream programming language that has tables as first class citizens. Instead, we're stuck learning multiple APIs (Polars, pandas), which are effectively programming languages for tables.

R is perhaps the closest, because the data.frame is a 'first class citizen', but most people don't seem to use it directly, and use e.g. tibbles from dplyr instead.

The root cause seems to be that we still haven't figured out the best language for manipulating tabular data. It feels like there's been some convergence on common ideas - Polars is kind of similar to dplyr - but there's no standard, except perhaps SQL.

23 Apr 2025
data

Open Sourcing Code in Government in the LLM Era

Open sourcing code in government has become immensely more valuable in the LLM era because it means there is no sensitivity around applying the best AI models to understand or improve the code. I'm finding o4-mini incredible at searching and explaining entire repos, and identifying which files do what, just by pasting the URL and asking questions.

22 Apr 2025
LLMs

AI hypothetical in the Civil Service

AI hypothetical: If you assigned a motivated person with an IQ of 200 to a random civil service position, would they be exceptionally productive if they were given all available recorded context? I suspect the returns to intelligence would be relatively modest at the moment.

This seems quite relevant to the question of how quickly AI can speed up progress in the short term. At the moment, based on this sort of example, I tend to agree with the ideas of Ege Erdil and Tamay Besiroglu e.g. here and Tyler Cowen here.

20 Apr 2025
LLMs"

OpenTimes and a clever use of DuckDB

OpenTimes is using a clever combination of DuckDB and Parquet files to serve huge datasets to a web frontend.

See this blog from Dan Snow:

The entire OpenTimes backend is just static Parquet files on Cloudflare’s R2. There’s no RDBMS or running service, just files and a CDN. The whole thing costs about $10/month to host and costs nothing to serve. In my opinion, this is a great way to serve infrequently updated, large public datasets at low cost (as long as you partition the files correctly).

The query layer uses a single DuckDB database file with views that point to static Parquet files via HTTP. This lets you query a table with hundreds of billions of records after downloading just the ~5MB pointer file.

To dig into this a bit deeper, we can attach their database at https://data.opentimes.org/databases/0.0.1.duckdb and see that the 'tables' are just views referencing a very large number of Parquet files:

import duckdb

# Attach the remote database file (the small 'pointer' file of views)
duckdb.execute("ATTACH 'https://data.opentimes.org/databases/0.0.1.duckdb'")

# Inspect the definition of one of the views
sql = """
SELECT sql
FROM duckdb_views()
WHERE schema_name = 'public'
"""

duckdb.sql(sql).fetchone()[0]

returns:

CREATE VIEW public.points AS
SELECT * FROM
read_parquet(
main.list_value(
'https://data.opentimes.org/points/version=0.0.1/mode=car/year=2020/geography=state/state=01/points-0.0.1-car-2020-state-01-0.parquet',
'https://data.opentimes.org/points/version=0.0.1/mode=car/year=2020/geography=state/state=02/points-0.0.1-car-2020-state-02-0.parquet',
'https://data.opentimes.org/points/version=0.0.1/mode=car/year=2020/geography=state/state=04/points-0.0.1-car-2020-state-04-0.parquet',
'https://data.opentimes.org/points/version=0.0.1/mode=car/year=2020/geography=state/state=05/points-0.0.1-car-2020-state-05-0.parquet',
'https://data.opentimes.org/points/version=0.0.1/mode=car/year=2020/geography=state/state=06/points-0.0.1-car-2020-state-06-0.parquet',
'https://data.opentimes.org/points/version=0.0.1/mode=car/year=2020/geography=state/state=08/points-0.0.1-car-2020-state-08-0.parquet',
...700 of these!
)
)

More info from Simon W here and here.

08 Apr 2025
duckdb

AI Upskilling Over AI Engineering

I see too much focus on trying to find applications of LLMs that help other people 'at scale' with their jobs. At the moment, the output of LLMs is rarely reliable enough for automated business rules or passive consumption. The lower hanging fruit is encouraging and helping people use AI directly, interactively, and however they see fit for their own jobs.

Addendum 2025-04-07: One thing in particular you want to avoid is giving LLM outputs to people who don't use LLMs. I suspect this is why Apple Intelligence has done so badly thus far. If your user is experienced with LLMs and is expecting it, it's fine to give them LLM outputs because they will treat them with the appropriate degree of scepticism. Otherwise, they could be misled.

04 Apr 2025
LLMs

Easily Transcribe Podcasts with Gemini 2.5 Pro

Here's a simple workflow for transcribing podcasts using Gemini 2.5 Pro in Google AI Studio. Initial tests suggest it works very well.

  1. Find the direct MP3 link to your podcast using GetRSSFeed
  2. Drag the MP3 file directly into Gemini 2.5 Pro
  3. Use a simple prompt like "transcribe this" with the following structured output specification (click 'Structured output' then 'edit')
{
  "type": "object",
  "properties": {
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "timestamp": {
            "type": "string",
            "description": "mm:ss"
          },
          "text": {
            "type": "string"
          },
          "speaker_name": {
            "type": "string"
          }
        },
        "required": [
          "timestamp",
          "text",
          "speaker_name"
        ]
      }
    }
  },
  "required": [
    "items"
  ]
}

This gives you a nicely formatted transcript with timestamps and speaker identification - no special tools needed beyond Gemini 2.5 Pro's multimodal capabilities.

02 Apr 2025
llms audio productivity

Combining VS Code Debugger with Jupyter Interactive Window

Today I learned you can combine the VS Code debugger with the Jupyter interactive window enabling debugging with rich output.

My Workflow: I use plain .py files in VS Code, executing "cells" by selecting code and pressing Shift+Enter to send it to the Jupyter (ipykernel) interactive window.

Previously, I found it frustrating having to choose between:

  1. Sending code to the interactive window to get rich output (pandas tables, charts, etc.)
  2. Setting a debug point to explore variables interactively

Today I discovered you can have both! Use the Jupyter: Debug current file in interactive window command.

At first, this doesn't seem to do anything special - you still execute debug commands in the debug console, making it appear no different from standard debugging.

But if you wrap any command in display(), the output also appears in the interactive window with full rich formatting.

This gives you the best of both worlds: interactive debugging with the ability to visualize complex data structures and plots in the same workflow.
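
Here's a minimal sketch of the workflow (the file name and DataFrame are made up):

# debug_demo.py
import pandas as pd

# %% build some data to inspect
df = pd.DataFrame({"fruit": ["apple", "pear", "plum"], "count": [3, 5, 2]})

# Set a breakpoint on the line below, then run the
# "Jupyter: Debug current file in interactive window" command.
summary = df.groupby("fruit").sum()

# While paused at the breakpoint, type display(df) in the debug
# console: the rich pandas table renders in the interactive window.
print(summary)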

02 Apr 2025
python jupyter vscode debugging productivity

DuckDB Dedupe

Want to retain one row per unique value of a certain column or columns? Use this:

SELECT DISTINCT ON (column1) *
FROM your_table
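
Note that without an ORDER BY, which row survives per group is arbitrary; add one to control it. A minimal sketch (the events table and its columns are illustrative):

import duckdb

duckdb.sql("""
CREATE OR REPLACE TABLE events AS
SELECT * FROM (VALUES ('a', 1), ('a', 2), ('b', 3)) AS t(key, version)
""")

# Keep the highest version per key: returns ('a', 2) and ('b', 3)
duckdb.sql("""
SELECT DISTINCT ON (key) *
FROM events
ORDER BY key, version DESC
""").show()
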
01 Apr 2025
duckdb

Quick DuckDB UI Access in Python

If you're using DuckDB in a Python script or Jupyter notebook, you can run con.execute('CALL start_ui()') at any point, and the UI will pop right up in your web browser with the current database automatically available.

(I knew about the UI, but I had missed this trick!)
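
For example (a minimal sketch; the demo table is just to give the UI something to show):

import duckdb

con = duckdb.connect()  # in-memory database
con.execute("CREATE TABLE demo AS SELECT 42 AS answer")

# Opens the DuckDB UI in your browser, with 'demo' visible
con.execute("CALL start_ui()")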

01 Apr 2025
duckdb python jupyter

DuckDB Cube

Want percentages and the total? Use group by cube:

import duckdb

NUM_ROWS = 1000

duckdb.sql(f"""
CREATE OR REPLACE TABLE array_dataset AS
WITH fruits AS (
    SELECT ['apple', 'banana', 'pear', 'plum'] AS fruit_list
),
random_fruits AS (
    SELECT
        list_extract((SELECT fruit_list FROM fruits), 1 + (random() * 3)::INTEGER) AS fruit
    FROM range({NUM_ROWS})
)
SELECT * FROM random_fruits
""")

duckdb.table("array_dataset").show()


result = duckdb.sql("""
WITH fruit_counts AS (
  SELECT
    fruit,
    COUNT(*) AS count
  FROM array_dataset
  GROUP BY CUBE(fruit)
),
total_count AS (
  SELECT count FROM fruit_counts WHERE fruit IS NULL
)
SELECT
  COALESCE(fruit, 'TOTAL') AS fruit,
  count,
  FORMAT('{:.2f}%', 100.0 * count / (SELECT count FROM total_count)) AS percentage
FROM fruit_counts
ORDER BY fruit = 'TOTAL', fruit
""")
result.show()

27 Mar 2025
duckdb

DuckDB Literals

When working with complex nested types such as structs, I often want to try out syntax, but it can be hard to figure out how to write the literal value.

You can do this fairly easily with the duckdb CLI using insert mode (the web shell does not work). See output formats.

duckdb
.mode insert

Now if I write any SQL statement, it will return the output in the format of a SQL INSERT statement - i.e. how to write the literal value.

Example:

SELECT ARRAY[struct_pack(key1 := 'value1', key2 := 42::float),struct_pack(key1 := 'value1', key2 := 43.2::float)] AS s;

returns:

INSERT INTO "table"(s) VALUES('[{''key1'': value1, ''key2'': 42}, {''key1'': value1, ''key2'': 43}]');

Of course, you could also SELECT from the first row of an existing dataframe to see how to create literals for each column.

27 Mar 2025
duckdb

AI coding tips

  • Progress is enormously faster if you can work on micro-libraries that fit easily into context. Split out bits of functionality into separate libraries more often than you would if you weren't using an LLM.
  • When implementing a new feature, first ask the LLM to perform an architectural review of the whole codebase to see whether a refactor is needed that will make the new feature easier to implement
  • Regularly give the LLM the whole codebase and ask it whether there's any unused code or functions. LLMs often leave a mess by accident when refactoring
  • Have a set of tests or e.g. a demo webpage (written by the LLM) that gives you immediate feedback on whether the library is working. The idea is to have a source of information you can easily copy and paste back into the LLM to help it fix bugs
  • Regularly ask the LLM whether the codebase currently seems healthy and whether there's a way to achieve the same thing in a simpler or clearer fashion
11 Mar 2025
LLMs

Teaching AI models to debug

Weird experience tonight teaching o3-mini to debug. It can code better than me, but seems clueless at debugging. 'Don't try and fix this all at once - create a minimal reprex that outputs debugging information that will help you solve this problem'. Only then does it succeed.

Feels like a combination of

  1. asking for clarification
  2. knowing how to debug
  3. adding code interpreter

would make Cursor far more powerful, without any significant breakthroughs needed in underlying LLM capability. Knowing when to do (1) is perhaps hardest.

19 Feb 2025
ai debugging

Why don't AI models ask for clarification?

Why don't AI models ask for clarification more often? I've often under-specified a question and can remember very few times the model has asked for more info before proceeding. Only later do I realise I didn't provide enough context.

Feels potentially hard to get data for training/RL. But once or twice I've gone into a 'doom loop' of failed code suggestions where, zooming out a bit, the problem was that I'd asked for a specific approach which was never going to work - the model one-shotted it as soon as it realised the intent and that the direction was wrong.

17 Feb 2025
ai llms

Using uv timestamp for dependency management

From epistasis:

One other key part of this is freezing a timestamp with your dependency list, because Python packages are absolutely terrible at maintaining compatibility a year or three or five later as PyPI populates with newer and newer versions. The special toml incantation is [tool.uv] exclude-newer:

# /// script
# dependencies = [
#   "requests",
# ]
# [tool.uv]
# exclude-newer = "2023-10-16T00:00:00Z"
# ///

See https://docs.astral.sh/uv/guides/scripts/#improving-reproducibility. This has also let me easily reconstruct some older environments in less than a minute, when I've been version hunting for 30-60 minutes in the past. The speed of uv environment building helps a ton too.

16 Feb 2025
python

LLMs for government productivity

I think the single most productivity-enhancing use of LLMs in government would be to give all government devs and data scientists access to Cursor (or similar). I am not yet convinced of the widespread value of 'behind the scenes' use cases of LLMs, but very bullish on skilled human-in-the-loop applications, especially coding.

Undoubtedly this situation will change as models improve, but at the moment there's usually not enough 9s of reliability to use in fully automated use cases.

13 Feb 2025
LLMs

LLMs getting better at coding

Latest Lex episode at 02:43:31 has a good section on why chain-of-thought matters over and above just 'it increases the benchmark results'.

It hit me in much the same way as when I first heard (transcript) the idea that code generation abilities would improve faster than natural language abilities. It feels like chain-of-thought is potentially the key to making this work.

These two Karpathy tweets are also very relevant: tweet1 tweet2

Seems quite likely we'll see superhuman coding abilities in the not too distant future.

05 Feb 2025
LLMs

Using DuckDB in ChatGPT Code Interpreter

You can use DuckDB in ChatGPT's code interpreter by providing this specific wheel file:

duckdb-1.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

from here.
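
Once uploaded, all the code interpreter has to do is pip-install the wheel from its sandbox. A minimal sketch of the equivalent Python (assuming the usual /mnt/data upload path):

import subprocess

# Install the uploaded wheel from the sandbox filesystem (no network needed)
subprocess.run(
    ["pip", "install",
     "/mnt/data/duckdb-1.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"],
    check=True,
)

import duckdb
print(duckdb.sql("SELECT 42 AS answer").fetchall())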

If you encounter any issues, you can copy and paste the list of available wheels into ChatGPT and ask it to analyse which is most likely to work.

For Altair, you need: narwhals-1.24.1-py3-none-any.whl then altair-5.5.0-py3-none-any.whl

Note: I could not get png generation to work because the vl_convert_python wheel would not install.

28 Jan 2025
duckdb LLMs

Using a uv shebang line

Rob Allen says:

I create a fair few scripts in my ~/bin/ directory to automate tasks. Since discovering uv and inline script metadata, I've started using Python far more for these.

As ~/bin is on my path, I want to run the script by calling it directly on the command line. To do this, I use this shebang:

#!/usr/bin/env -S uv run --script
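
For example, a complete script might look like this (a minimal sketch - the script body and its requests dependency are illustrative):

#!/usr/bin/env -S uv run --script
# /// script
# dependencies = [
#   "requests",
# ]
# ///
import requests

# Print your public IP address
print(requests.get("https://httpbin.org/ip").json())

Make the script executable with chmod +x and it runs directly, with uv resolving the dependencies on first use.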

Full article here.

28 Jan 2025
python

National Data Library thoughts

Three key priorities for building a National Data Library:

  1. Data sharing and governance must come first - before any technical solutions. Success depends on data owners being convinced the NDL will make their lives easier, not harder.

  2. Access permissions and management systems need to be rock-solid before building analytical capabilities. Get the foundations right.

  3. Design for an AI-first future - by the time the NDL is delivered, most analysis will likely be AI-assisted. The architecture needs to anticipate this.

16 Jan 2025
data

Understanding DuckDB Connection Types in Python

DuckDB offers special connection types in Python:

  • :memory:name - Creates/connects to a named in-memory database that can be shared across connections
  • :default: - Uses the default connection stored in the DuckDB module

Example:

import duckdb

# Create table in default connection
duckdb.execute("CREATE TABLE tbl AS SELECT 42 as value")

# Access same table through explicit default connection
con = duckdb.connect(":default:")
con.sql("SELECT * FROM tbl")  # Works!

# Shared named memory connection
con3 = duckdb.connect(":memory:shared_db")
con4 = duckdb.connect(":memory:shared_db")  # Same database as con3

See docs here

16 Jan 2024
python duckdb database

An early email I sent on LLMs

The following is an extract from an email I sent in April 2023 - of interest because it feels like this situation hasn't changed dramatically in the intervening 2 years.

--

No doubt you've seen some of the media coverage around new AI models, especially OpenAI GPT models, and have been thinking about how to use them with your teams.

I've been using them quite extensively, and the more I use them, the more impressed I am - i.e. I'm pretty sure this is not just a fad, but is a serious technological breakthrough that has the potential to revolutionise quite significant parts of Civil Service work.

I thought it may be useful to note a handful of the areas where it feels like there is low hanging fruit where the models may be appropriate for serious applications:

  1. Zero shot labelling. This is an ability to take an input document and categorise it according to any criterion of your choice without training a new AI model (that's the 'zero shot' bit). For example, taking a sentencing transcript as an input and the model categorising whether there was 'use of a weapon'. The important technical advance here is that these new models understand semantics, not just keywords, so the transcript could contain 'the victim was stabbed', and the computer would recognise this as use of a weapon.

  2. Semantic search across a large corpus of potentially unstructured documents. There's an example here of using these models to analyse 1000 pages of Tesla's annual reports: Tweet.

Both (1) and (2) have been 'somewhat' possible in the past, but they required a lot of work and didn't work that well. What's new is that these are now much easier and more accurate.

  3. Code completion. As a data scientist, I'm getting ChatGPT 4 to write probably 50%+ of my code. So at the very least, these models are a huge productivity amplifier for data scientists.

The biggest challenge is around data sharing and using these tools with sensitive government data. It feels like getting a head start on understanding these legal issues may be an important first step.

16 Apr 2023
LLMs

Future Skills in an AI-Tutored World

Seems likely every child will soon have access to an always-on AI personal tutor for every subject (using tech like ChatGPT). Makes me wonder what skills will be most important for (my) kids to have - intrinsic motivation and the self-control to turn off distractions seem critical.

23 Jan 2023
LLMs