Originally posted: 2023-01-30.
SQL should be the first option considered for new data engineering work. It's robust, fast, future-proof and testable. With a bit of care, it's clear and readable. A new SQL engine - DuckDB - is competitive with other high-performance dataframe libraries, making SQL a good candidate for data of all sizes.
My first exposure to analytics programming was using SQL in Microsoft Access in the early 2000s. Later, I became a data scientist, and loved pandas and dplyr for their expressiveness and power. As a data engineer, I dabbled in PySpark. Most recently, I’ve returned to SQL for data manipulation.
These alternative tools were developed to address deficiencies in SQL, and they are undoubtedly better in certain respects. But overall, I’m convinced that SQL is better in most circumstances, especially when working in a team or on long-term projects.
This post will make the case for SQL. I'll then suggest when other tools may be preferable. Finally, I'll mention some future directions and new libraries to keep an eye on.
Code is read more often than it is written. By using SQL, a much wider range of people can read your code, including BI developers, business analysts, data engineers and data scientists.
Data engineering pipelines have a habit of sticking around for many years, and SQL has stood the test of time. It is the language most likely to still be understood 10 or even 20 years from now.
SQL is also declarative - the author describes what results they want rather than how to compute them - which arguably makes it closer to self-documenting than imperative data processing languages.
Despite SQL having been around since the 1970s, SQL-based tools have been one of the most active areas of innovation in data tooling over the past decade. Data pipelines written in SQL are capable of running much faster, and on larger data, than would have been possible a decade ago, with little or no change to the underlying SQL code.
In addition to continuous improvements in more traditional SQL engines, we’ve seen the advent of distributed tools like Spark and Presto that have enabled SQL to run on huge datasets. More recently, DuckDB enables extremely fast parallelised analytics queries on a single machine - competitive with some of the fastest alternatives such as data.table and polars, and able to operate directly on csv and parquet files. If you’re using any non-standard features of a particular flavour of SQL, SQLGlot allows automatic translation.
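To make both of these concrete, here is a minimal sketch - DuckDB querying a parquet file in place, then SQLGlot translating a query between dialects. The file, table and column names are placeholders rather than anything from a real project, and it assumes the duckdb and sqlglot Python packages are installed:

```python
import duckdb
import sqlglot

# DuckDB can query a csv or parquet file in place - no loading step required.
# 'transactions.parquet' is a placeholder path.
result = duckdb.sql(
    "SELECT customer_id, SUM(amount) AS total_spend "
    "FROM 'transactions.parquet' "
    "GROUP BY customer_id"
)
result.show()

# SQLGlot transpiles a query from one SQL dialect to another
spark_sql = sqlglot.transpile(
    "SELECT STRFTIME(order_date, '%Y-%m') AS order_month FROM orders",
    read="duckdb",
    write="spark",
)[0]
print(spark_sql)
```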
Overall, SQL is probably the most future-proof tool for writing data pipelines - which have a habit of sticking around for longer than expected. Whilst there are many competitors, SQL is the most likely to still be in use in 20 years' time.
Experienced data engineers will know that data types (datetimes, strings, etc.) can be a notorious time sink. The main problem is the lack of one-to-one mappings between data types in different programming languages and SQL engines. A secondary problem is that some tools, such as pandas and SQLite, do not enforce data type constraints - meaning a column can contain a mix of data types.
Use of SQL mitigates these problems because SQL engines are usually strongly typed. The ecosystem also provides tools that go further: Apache Arrow addresses this challenge by enabling robust and consistent handling of data types across tools such as R, Python and different databases. By choosing SQL engines that are compatible with Arrow, many of these cross-tool data typing problems disappear.
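As a rough illustration of this contrast (the table and column names are invented, and it assumes the duckdb, pandas and pyarrow packages are installed):

```python
import duckdb
import pandas as pd

# pandas silently accepts a column containing a mix of types
mixed = pd.DataFrame({"amount": [1.5, "not a number", None]})
print(mixed["amount"].dtype)  # object - no type constraint is enforced

# a strongly typed SQL engine declares and enforces the column type
con = duckdb.connect()
con.execute("CREATE TABLE payments (amount DOUBLE)")
con.execute("INSERT INTO payments VALUES (1.5)")
try:
    con.execute("INSERT INTO payments VALUES ('not a number')")
except duckdb.Error as e:
    print(f"Rejected by the engine: {e}")

# results can be handed to other tools as an Apache Arrow table,
# preserving the declared types
print(con.sql("SELECT * FROM payments").arrow().schema)
```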
Dependency management adds a significant maintenance burden to data pipelines and means maintainers need additional skills. Whilst the use of SQL does not eliminate the problem, it simplifies it considerably since SQL syntax changes much less frequently, and requires few dependencies to run.
For example, a pipeline written in R or Python five years ago could require days or even weeks of work to bring up to date. It’s likely to take significant effort to set up an environment to even run the code. The same pipeline written in SQL would need far fewer changes to bring up to date, and a reader seeking to simply understand the code could easily execute the SQL without setting up a new development environment.
SQL can also be executed from almost any programming language, making it easier to migrate pipelines to different tools, or to embed logic in other applications.
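For instance, the same query text can be executed from Python's standard-library sqlite3 driver - the logic lives in ordinary SQL and the host language simply runs it (the orders table here is created in memory purely for the example):

```python
import sqlite3

QUERY = "SELECT status, COUNT(*) AS n FROM orders GROUP BY status"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "shipped"), (2, "shipped"), (3, "pending")],
)
for row in con.execute(QUERY):
    print(row)  # e.g. ('pending', 1) and ('shipped', 2)
```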
Many data engineers will be familiar with the pain of uncovering an incomprehensible thousand-line SQL script that was written a decade ago, but is a critical part of an organisation’s data pipelines. Or the challenges of making small changes to Spark SQL pipelines that run on huge volumes of data. Overall, historically it has been challenging to write SQL that aligns to good engineering practice - such as clear, concise, and tested code that is split out into easily digestible components.
Some of these challenges are now much easier to overcome with the use of three components:
There are even tools such as dbt which take similar ideas and compose them into a framework.
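As a rough sketch of what these practices can look like - a query split into named CTE steps, unit-tested against a tiny in-memory fixture with DuckDB. The table, columns and cleaning rules are invented for illustration and aren't taken from any particular project:

```python
import duckdb

# the transformation logic: each CTE is a small, named, digestible step
TRANSFORM_SQL = """
WITH cleaned AS (
    SELECT customer_id, LOWER(TRIM(email)) AS email
    FROM raw_customers
),
deduplicated AS (
    SELECT DISTINCT customer_id, email FROM cleaned
)
SELECT * FROM deduplicated
"""

def test_transform_deduplicates():
    con = duckdb.connect()
    con.execute(
        "CREATE TABLE raw_customers AS "
        "SELECT * FROM (VALUES (1, ' A@B.COM '), (1, 'a@b.com')) "
        "AS t(customer_id, email)"
    )
    result = con.execute(TRANSFORM_SQL).fetchall()
    assert result == [(1, "a@b.com")]

test_transform_deduplicates()
```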
Modern SQL engines have support for a range of functions that make complex operations much simpler than they used to be, addressing earlier shortcomings:
Other features with more narrow applications include full text search, geospatial functions, PIVOT operations and user-defined functions - though these should be used with care because support varies between engines.
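As one illustrative example of this kind of feature (my choice of example, with invented data), a window function expresses a per-account running total in a few lines - the sort of operation that was awkward to write in older SQL. It's run via DuckDB here only so the snippet is self-contained:

```python
import duckdb

sql = """
SELECT
    account_id,
    txn_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY account_id
        ORDER BY txn_date
    ) AS running_balance
FROM (VALUES
    (1, DATE '2023-01-01', 10.0),
    (1, DATE '2023-01-02', -4.0),
    (2, DATE '2023-01-01', 7.0)
) AS txns(account_id, txn_date, amount)
ORDER BY account_id, txn_date
"""
duckdb.sql(sql).show()
```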
When may there be a strong argument for using other tools? I've argued in this post that SQL will often enable you to write pipelines that are simple, readable and easy to test, and where the totality of code and infrastructure is comparatively simple and maintainable in the long run.
But there are some situations where the opposite is true. For example, to interpolate a time series in pandas, you can use the resample method - a single line of code, where the intention is clear. In many SQL engines, the equivalent SQL is more complex and difficult to read. Similarly, SQL is probably not the best tool to operate on graph-like data structures.
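To illustrate the first point, here is a minimal sketch of the pandas approach (the series and frequencies are invented for illustration):

```python
import pandas as pd

# a small daily series, upsampled to hourly with linear interpolation
daily = pd.Series(
    [10.0, 13.0, 19.0],
    index=pd.date_range("2023-01-01", periods=3, freq="D"),
)
hourly = daily.resample("h").interpolate()
print(hourly.head())
```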
Ultimately SQL should not be the only tool you consider - but I suggest a presumption in its favour, with other tools requiring a strong justification.
Part of the motivation for writing this post is the success I’ve had in using SQL to power Splink, a library for probabilistic record linkage at scale. All the core computations are written in SQL.
This has the following advantages:
This has all been possible despite the need to express some fairly complex iterative algorithms (such as the Expectation Maximisation algorithm) using SQL.
Far from becoming outdated, SQL is going from strength to strength.
One interesting trend is the development of alternative APIs that 'compile down' to SQL for execution. The idea here is that SQL engines are so well optimised that library authors are better off reusing this work than writing their own execution logic.
Some interesting things I have an eye on: