Originally posted: 2026-05-18. Last updated: 2026-05-20. View source code for this page here.

Optimising DuckDB performance on large EC2 instances

I have been recommending for some time that DuckDB is often a faster and simpler choice than Spark for most data engineering workloads.¹

For big data workloads, this suggests processing data on a single very large EC2 instance with hundreds of vCPUs.

However, there are some performance footguns with this approach: the default configuration can result in surprisingly poor performance, and a few simple changes can make workloads run multiple times faster.

So how can we optimise performance? This post sets out results from a variety of benchmarking experiments (code here).

Key findings

High thread counts and performance

Do not assume more threads always gives better performance. On some workloads performance doubled or more when using fewer threads, but is dependent on workload and machine type, so experiment with different settings.
Do not assume that 100% usage of all cores means they’re being used efficiently: the workload may still run faster on fewer threads.
On queries where fewer threads result in better performance, running multiple thread-constrained queries in parallel does not save time compared with running them serially.
With all that said, DuckDB is capable of fully using all cores on some types of queries, such as the Jaro-Winkler similarity benchmark below.

Overall, you should experiment with different THREADS settings depending on your workload.

Memory bandwidth

Memory bandwidth on large EC2 instances was measured at about 330GB/s (r8id.32xlarge) to 480GB/s (r8id.metal-48xl). Memory bandwidth on a MacBook Pro M4 Max is 410-546GB/s, so for memory-bandwidth-constrained workloads, we may not expect the larger CPU counts on EC2 to give much of a performance boost over a high-end laptop.
However, for the workloads tested, I was not able to find evidence they were bottlenecked by memory bandwidth.

Disk I/O

Throughput of writing to disk can be an important bottleneck. Use instance store (SSDs), or if unavailable, set EBS throughput to maximum.

S3 transfer

With the CRT transfer client enabled, aws s3 sync can be substantially faster than DuckDB’s built-in S3 write capabilities. Therefore DuckDB -> local SSD -> S3 via aws s3 sync can be faster than writing to S3 directly from DuckDB.

Evidence

What follows contains more detailed evidence to substantiate the above findings.

All results used duckdb==1.5.2 and were run on instances in the r8id family, mostly r8id.32xlarge (128 vCPU, 1 TiB RAM, 2x3.8TB NVMe SSD in RAID 0).

High thread counts and performance on a Group By operation

I ran the following simple query against 800m rows of data, first on an r8id.32xlarge and then, for the raw Parquet path, on an r8id.metal-48xl.

 SELECT col_1, col_2, COUNT(*) AS cnt
 FROM benchmark_input
 GROUP BY col_1, col_2
 ORDER BY cnt DESC
 LIMIT 10

The performance varied depending on the DuckDB THREADS pragma setting:

‘DuckDB table’ refers to running the query against a table in a db.duckdb database on the SSD, whereas ‘Parquet files’ refers to running the query directly from the data stored in 100 raw Parquet files on the SSD.

On the r8id.32xlarge, using 128 threads made performance unambiguously worse, particularly when reading from parquet, whereas on the r8id.metal-48xl, more threads are unambiguously better, but with diminishing returns.

I’m currently unable to explain this finding.

Click to expand CPU usage time series charts

The charts below show CPU usage over time for the DuckDB-table runs on r8id.32xlarge, and for the raw-Parquet runs on r8id.metal-48xl. The 128-thread r8id.32xlarge run does hit the ceiling briefly, but not in a way that translates into lower runtime. On the r8id.metal-48xl, CPU usage likewise rises and falls in phases rather than smoothly pinning the whole machine.

High thread counts and performance on a CPU intensive query

The group/order query is not representative of every workload. I also ran a simple query over 800m rows on r8id.metal-48xl that computed sum(jaro_winkler_similarity(col1, col2)), varying the DuckDB THREADS setting.

This query is intentionally bottlenecked by CPU. Results are in line with expectations:

We achieve 100% CPU usage irrespective of the THREADS setting.

Click to expand CPU usage time series chart

Does running multiple queries in parallel help?

If fewer threads can improve performance, a natural follow-up is to test whether that means we can run multiple queries in parallel, each with fewer threads.

To test this, I ran the group/order query against raw Parquet files on the r8id.32xlarge, but launched multiple independent queries in parallel.

Further experiments suggested that running in parallel can also make things worse, e.g. running four 32-thread queries serially was faster than running them in parallel.

Overall, this evidence suggests the machine is bottlenecked when running at lower thread counts, just not by CPU, and I would not recommend trying to run queries in parallel.

Memory bandwidth

I conducted various experiments to see whether these bottlenecks could be explained by memory bandwidth. I could not find evidence that memory bandwidth was the bottleneck.

Click to expand memory bandwidth analysis

First, I measured synthetic memory bandwidth on the r8id.32xlarge using a STREAM-style triad benchmark:

For the r8id.metal-48xl run, I also measured DRAM read and write bandwidth during the group/order query using Linux perf uncore IMC CAS-count events, sampled during each workload phase. The following charts show seconds from query start for the 32-thread and 192-thread runs.

These charts do not suggest that memory bandwidth is saturated at either 32 or 192 threads: the query usually sits well below the roughly 480 GB/s synthetic triad level measured on this machine, with only brief late-query spikes. The parallel-query result points the same way. If memory bandwidth were the main limit, running two or four 32-thread queries in parallel should have made better use of the machine; instead it caused a large performance hit.

Disk I/O

DuckDB can be significantly bottlenecked by disk throughput when writing large files.

My benchmarks showed it was capable of writing at about 8GiB/s to the two SSDs (connected in RAID0) on the r8id.32xlarge, similar to their expected maximum performance.

The benchmarks also showed that slower disks will cause correspondingly slower runtimes as you wait for data to be written to disk. EBS throughput is configurable, but gp3 can default to just 125 MiB/s, so the slowdown can be dramatic.

Since DuckDB will often spill to disk this finding is relevant even for operations that do not explicitly write out large datasets to disk.

Write performance was tested by converting .parquet files into a 4.9GiB mydb.duckdb database file on disk using a CREATE TABLE AS SELECT * statement on a r8id.32xlarge with threads=64.

On EBS, we can see the long ‘tail’ of low CPU activity whilst the machine is waiting on disk writes to complete. We can be more precise about this by running the benchmark with RECOVERY_MODE=no_wal_writes (see here), which splits the workload into a CTAS phase and a CHECKPOINT phase, in which the data is written. This is shown in the lower charts.

For this run, EBS throughput was set to 1000 MiB/s. We can see the CHECKPOINT phase takes around 4.1 seconds, consistent with the final 4.9 GiB database. Further experiments with different throughput showed the length of the CHECKPOINT phase scales proportionately with EBS throughput.

EBS throughput can be increased to a maximum of 2,000 MiB/s, but a better option is to use instance storage (SSDs), which can achieve much higher effective DuckDB write rates.

S3 I/O

Writes

For S3 writes, I benchmarked an 800 million-row table on an r8id.32xlarge with DuckDB threads=64 in two ways:

A direct DuckDB COPY to S3,
DuckDB writing Parquet to local NVMe SSD before uploading with aws s3 sync.

Further research showed aws s3 sync is expected to be slow with its default classic transfer client settings, and a recommended solution is to use the CRT transfer client instead.

The output was 8.3 GiB across 64 Parquet files.

Results are shown below. Hover over a segment to see the stage throughput and the overall throughput for that option.

The direct DuckDB -> S3 route doesn’t come close to saturating S3 bandwidth, at 998 MiB/s. s3 sync with CRT is much faster at 2,572 MiB/s, closer to the quoted EC2 network performance of 50 Gbit/s, or about 5,960 MiB/s.

I experimented with DuckDB’s s3 settings for the DuckDB-to-S3 write, but they did not make a significant difference.

Reads

I used the same aws s3 sync approach to download the benchmark dataset from S3 to local SSD. With default settings, downloading the 3.9 GiB dataset took about 8 seconds, which is about 460.0 MiB/s.

After enabling the CRT transfer client for the worker-side dataset sync, the same download completed in about 1 second, implying a throughput of about 3,680.0 MiB/s. That is roughly an 8x improvement.

As with writes, this still does not quite saturate the quoted EC2 network bandwidth for this instance type of about 5,960 MiB/s, but it is much closer than the default transfer path.

Future work

This blog post leaves a few unanswered questions:

Why does the same query give different performance patterns with different thread counts on the r8id.32xlarge vs the r8id.metal-48xl?
What is the bottleneck on the group by query?
For large files, is it better to use DuckDB’s native S3 reader, or to use aws s3 sync with CRT enabled to download the files to the local SSD first?
In what situations does DuckDB use all cores on EC2?

Annex

For my reference, the main findings are based on these benchmark runs:

Topic	Benchmark run(s)
EBS vs SSD	`i-056f3061f47031643/2026-05-13T133349Z-r8id-32xlarge-report.html`
S3 IO	`2026-05-15T190335Z-r8id-32xlarge-i-0a37266c6a531dedd-report.html`
Threads and performance	`2026-05-16T080410Z-r8id-32xlarge-i-0c9713718622fb751-report.html` `2026-05-19T160719Z-r8id-metal-48xl-i-00c24d85a1f7998d5-report.html`
Running queries in parallel	`2026-05-19T091501Z-r8id-32xlarge-i-0b0cd166828897ad0-report.html`
Memory bandwidth	`2026-05-19T160719Z-r8id-metal-48xl-i-00c24d85a1f7998d5-report.html` `2026-05-19T100058Z-r8id-32xlarge-i-0cbab4d579ba49c52-report.html`

Hat tip to this presentation where I first saw this chart. ↩