DuckDB Cube

Want percentages and the total? Use GROUP BY CUBE:

import duckdb

NUM_ROWS = 1000

duckdb.sql(f"""
CREATE OR REPLACE TABLE array_dataset AS
WITH fruits AS (
    SELECT ['apple', 'banana', 'pear', 'plum'] AS fruit_list
),
random_fruits AS (
    SELECT
        list_extract((SELECT fruit_list FROM fruits), 1 + floor(random() * 4)::INTEGER) AS fruit  -- uniform pick across all 4 fruits
    FROM range({NUM_ROWS})
)
SELECT * FROM random_fruits
""")

duckdb.table("array_dataset").show()


result = duckdb.sql("""
WITH fruit_counts AS (
  SELECT
    fruit,
    COUNT(*) AS count
  FROM array_dataset
  GROUP BY CUBE(fruit)
),
total_count AS (
  SELECT count FROM fruit_counts WHERE fruit IS NULL
)
SELECT
  COALESCE(fruit, 'TOTAL') AS fruit,
  count,
  FORMAT('{:.2f}%', 100.0 * count / (SELECT count FROM total_count)) AS percentage
FROM fruit_counts
ORDER BY fruit = 'TOTAL', fruit
""")
result.show()
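
As an aside: with a single grouping column, GROUP BY CUBE(fruit) is equivalent to GROUP BY GROUPING SETS ((fruit), ()), so if you only want the per-fruit counts plus a grand-total row you can ask for those two grouping sets directly. A minimal sketch, assuming the array_dataset table created above:

duckdb.sql("""
SELECT
  COALESCE(fruit, 'TOTAL') AS fruit,
  COUNT(*) AS count
FROM array_dataset
GROUP BY GROUPING SETS ((fruit), ())  -- one set per fruit, plus the empty set for the total
ORDER BY fruit = 'TOTAL', fruit
""").show()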

27 Mar 2025
duckdb

DuckDB Literals

When working with complex nested types such as structs, I often want to try syntax out, but it can be hard to figure out how to write the literal value.

You can do this fairly easily with the duckdb CLI using insert mode (the web shell does not work). See output formats.

duckdb
.mode insert

Now if I write any SQL statement, it will return the output in the format of a SQL INSERT statement - i.e. how to write the literal value.

Example:

SELECT ARRAY[struct_pack(key1 := 'value1', key2 := 42::float), struct_pack(key1 := 'value1', key2 := 43.2::float)] AS s;

returns:

INSERT INTO "table"(s) VALUES('[{''key1'': value1, ''key2'': 42}, {''key1'': value1, ''key2'': 43}]');

Of course, you could also SELECT from the first row of an existing dataframe to see how to create literals for each column.
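
For example, something along these lines in the CLI, where my_data.parquet is a hypothetical file standing in for your existing data:

duckdb
.mode insert
SELECT * FROM 'my_data.parquet' LIMIT 1;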

27 Mar 2025
duckdb

AI coding tips

  • Progress is enormously faster if you can work on micro-libraries that fit easily into context. Split out bits of functionality into separate libraries more often than you would if you weren't using an LLM.
  • When implementing a new feature, first ask the LLM to perform an architectural review of the whole codebase to see whether a refactor is needed that would make the new feature easier to implement.
  • Regularly give the LLM the whole codebase and ask it whether there's any unused code or functions. LLMs often leave a mess behind by accident when refactoring.
  • Have a set of tests or e.g. a demo webpage (written by the LLM) that gives you immediate feedback on whether the library is working. The idea is to have a source of information you can easily copy and paste back into the LLM to help it fix bugs.
  • Regularly ask the LLM whether the codebase currently seems healthy and whether there's a way to achieve the same thing in a simpler or clearer fashion.
11 Mar 2025
LLMs

Teaching AI models to debug

Weird experience tonight teaching o3-mini to debug. It can code better than me, but seems clueless at debugging. 'Don't try and fix this all at once - create a minimal reprex that outputs debugging information that will help you solve this problem.' Only then does it succeed.

Feels like a combination of

  1. asking for clarification
  2. knowing how to debug
  3. adding code interpreter

would make Cursor far more powerful, without any significant breakthroughs needed in underlying LLM capability. Knowing when to do (1) is perhaps the hardest.

19 Feb 2025
ai debugging

Why don't AI models ask for clarification?

Why don't AI models ask for clarification more often? I've often under-specified a question and can remember very few times the model has asked for more info before proceeding. Only later do I realise I didn't provide enough context.

It feels potentially hard to get data for training/RL. But once or twice I've gone into a 'doom loop' of failed code suggestions where, zooming out a bit, the problem was that I'd asked for a specific approach that was never going to work - and the model one-shotted the task as soon as it realised the intent and that the direction was wrong.

17 Feb 2025
ai llms

Using uv timestamp for dependency management

From epistasis:

One other key part of this is freezing a timestamp with your dependency list, because Python packages are absolutely terrible at maintaining compatibility a year or three or five later as PyPI populates with newer and newer versions. The special toml incantation is [tool.uv] exclude-newer:

# /// script
# dependencies = [
#   "requests",
# ]
# [tool.uv]
# exclude-newer = "2023-10-16T00:00:00Z"
# ///

https://docs.astral.sh/uv/guides/scripts/#improving-reproducibility

This has also let me easily reconstruct some older environments in less than a minute, when I've been version hunting for 30-60 minutes in the past. The speed of uv environment building helps a ton too.
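
Putting it together, a complete self-contained script - a hypothetical example.py - would look something like this:

# /// script
# dependencies = [
#   "requests",
# ]
# [tool.uv]
# exclude-newer = "2023-10-16T00:00:00Z"
# ///
import requests

# uv resolves "requests" as it existed at the timestamp above, then runs the script
print(requests.get("https://example.com").status_code)

Run it with uv run example.py (or uv run --script example.py).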

16 Feb 2025
python

LLMs for government productivity

I think the single most productivity-enhancing use of LLMs in government would be to give all government devs and data scientists access to Cursor (or similar). I am not yet convinced of the widespread value of 'behind the scenes' use cases of LLMs, but very bullish on skilled human-in-the-loop applications, especially coding.

Undoubtedly this situation will change as models improve, but at the moment there are usually not enough 9s of reliability for fully automated applications.

13 Feb 2025
LLMs

LLMs getting better at coding

The latest Lex episode (at 02:43:31) has a good section on why chain-of-thought matters over and above just 'it increases the benchmark results'.

It hit me in a very similar way to when I first heard (transcript) the idea that code generation abilities would improve faster than natural language abilities. It feels like chain-of-thought is potentially the key to making this work.

These two Karpathy tweets are also very relevant: tweet1 tweet2

Seems quite likely we'll see superhuman coding abilities in the not too distant future.

05 Feb 2025
LLMs

Using DuckDB in ChatGPT Code Interpreter

You can use DuckDB in ChatGPT's code interpreter by providing this specific wheel file:

duckdb-1.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

from here.
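
As a rough sketch, once the wheel has been uploaded to the chat (uploaded files normally land under /mnt/data - treat the exact path as an assumption), installing and using it looks something like:

import subprocess

# Install the uploaded wheel into the sandbox, then use DuckDB as normal
subprocess.run(
    ["pip", "install",
     "/mnt/data/duckdb-1.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"],
    check=True,
)

import duckdb
print(duckdb.sql("SELECT 42 AS answer").fetchall())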

If you encounter any issues, you can copy and paste the list of available wheels into ChatGPT and ask it to analyse which is the most likely to work.

For Altair, you need: narwhals-1.24.1-py3-none-any.whl then altair-5.5.0-py3-none-any.whl

Note: I could not get png generation to work because the vl_convert_python wheel would not install.

28 Jan 2025
duckdb LLMs

Using a uv shebang line

Rob Allen says:

I create a fair few scripts in my ~/bin/ directory to automate tasks. Since discovering uv and inline script metadata, I've started using Python far more for these.

As ~/bin is on my path, I want to run the script by calling it directly on the command line. To do this, I use this shebang:

#!/usr/bin/env -S uv run --script

Full article here.
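
Combining this with the inline script metadata from the uv post above, a complete executable script - a hypothetical ~/bin/fetch-status - might look like:

#!/usr/bin/env -S uv run --script
# /// script
# dependencies = [
#   "requests",
# ]
# ///
import requests

# uv installs the dependencies above, then runs the script with them available
print(requests.get("https://example.com").status_code)

Make it executable with chmod +x ~/bin/fetch-status and it can then be called directly.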

28 Jan 2025
python

ClickHouse Obfuscator - Data Anonymization Tool

The ClickHouse obfuscator is a tool for anonymizing production data while preserving its key statistical properties. It maintains:

  • Value cardinalities and distributions
  • Data compression ratios
  • String lengths and UTF-8 validity
  • Time series continuity
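
Usage is a single command over a flat file. A sketch along the lines of the docs' example (the column names are hypothetical; treat the exact flags as assumptions):

clickhouse-obfuscator \
    --seed "$(head -c16 /dev/urandom | base64)" \
    --input-format TSV --output-format TSV \
    --structure 'CustomerID UInt32, URL String, VisitDate Date' \
    < source.tsv > obfuscated.tsv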

See also here.

16 Jan 2025
data

National Data Library thoughts

Three key priorities for building a National Data Library:

  1. Data sharing and governance must come first - before any technical solutions. Success depends on data owners being convinced the NDL will make their lives easier, not harder.

  2. Access permissions and management systems need to be rock-solid before building analytical capabilities. Get the foundations right.

  3. Design for an AI-first future - by the time the NDL is delivered, most analysis will likely be AI-assisted. The architecture needs to anticipate this.

16 Jan 2025
data

Understanding DuckDB Connection Types in Python

DuckDB offers special connection types in Python:

  • :memory:name - Creates/connects to a named in-memory database that can be shared across connections
  • :default: - Uses the default connection stored in the DuckDB module

Example:

import duckdb

# Create table in default connection
duckdb.execute("CREATE TABLE tbl AS SELECT 42 as value")

# Access same table through explicit default connection
con = duckdb.connect(":default:")
con.sql("SELECT * FROM tbl").show()  # Works - the same table is visible

# Shared named memory connection
con3 = duckdb.connect(":memory:shared_db")
con4 = duckdb.connect(":memory:shared_db")  # Same database as con3
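
For contrast - and this is my understanding from the docs - a plain unnamed :memory: connection is private to itself, which is exactly why the named form exists:

con1 = duckdb.connect(":memory:")
con2 = duckdb.connect(":memory:")  # A separate, empty database - nothing is shared with con1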

See docs here

16 Jan 2024
python duckdb database

An early email I sent on LLMs

The following is an extract from an email I sent in April 2023 - of interest because it feels like this situation hasn't changed dramatically in the intervening 2 years.

--

No doubt you've seen some of the media coverage around new AI models, especially OpenAI GPT models, and have been thinking about how to use them with your teams.

I've been using them quite extensively, and the more I use them, the more impressed I am - i.e. I'm pretty sure this is not just a fad, but is a serious technological breakthrough that has the potential to revolutionise quite significant parts of Civil Service work.

I thought it may be useful to note a handful of areas of low-hanging fruit where the models may be appropriate for serious applications:

  1. Zero shot labelling. This is an ability to take an input document and categorise it according to any criterion of your choice without training a new AI model (that's the 'zero shot' bit). For example, taking a sentencing transcript as an input and the model categorising whether there was 'use of a weapon'. The important technical advance here is that these new models understand semantics, not just keywords, so the transcript could contain 'the victim was stabbed', and the computer would recognise this as use of a weapon.

  2. Semantic search across a large corpus of potentially unstructured documents. There's an example here of using these models to analyse 1000 pages of Tesla's annual reports: Tweet.

Both (1) and (2) have been 'somewhat' possible in the past, but have required lots of work and haven't worked that well. What's new is that these are now much easier and more accurate.

  3. Code completion. As a data scientist, I'm getting ChatGPT 4 to write probably 50%+ of my code. So at the very least, these models are a huge productivity amplifier to data scientists.

The biggest challenge is around data sharing and using these tools with sensitive government data. It feels like getting a head start on understanding these legal issues may be an important first step.

16 Apr 2023
LLMs