Originally posted: 2023-01-30.
SQL should be the first option considered for new data engineering work. It's robust, fast, future-proof and testable. With a bit of care, it's clear and readable. A new SQL engine - DuckDB - is competitive with other high-performance dataframe libraries, making SQL a good candidate for data of all sizes.
My first exposure to analytics programming was using SQL in Microsoft Access in the early 2000s. Later, I became a data scientist, and loved pandas and dplyr for their expressiveness and power. As a data engineer, I dabbled in PySpark. Most recently, I’ve returned to SQL for data manipulation.
These alternative tools were developed to address deficiencies in SQL, and they are undoubtedly better in certain respects. But overall, I’m convinced that SQL is better in most circumstances, especially when working in a team or on long-term projects.
This post will make the case for SQL. I'll then suggest when other tools may be preferable. Finally, I'll mention some future directions and new libraries to keep an eye on.
Code is read more often than it is written. By using SQL, a much wider range of people can read your code, including BI developers, business analysts, data engineers and data scientists.
Data engineering pipelines have a habit of sticking around for many years, and SQL has stood the test of time. It is the language most likely to still be understood 10 or even 20 years from now.
SQL is also declarative - the author describes what results they want rather than how to compute them - which arguably makes it closer to self-documenting than imperative data processing languages.
Despite SQL having been around since the 1970s, SQL-based tools have been one of the most active areas of innovation in data tooling over the past decade. Data pipelines written in SQL are capable of running much faster, and on larger data, than would have been possible a decade ago, with little or no change to the underlying SQL code.
In addition to continuous improvements in more traditional SQL engines, we’ve seen the advent of distributed tools like Spark and Presto that have enabled SQL to run on huge datasets. More recently, DuckDB enables extremely fast parallelised analytics queries on a single machine - competitive with some of the fastest alternatives such as data.table and polars, and able to operate directly on csv and parquet files. If you’re using any non-standard features of a particular flavour of SQL, SQLGlot allows automatic translation.
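To make both of these concrete, here is a minimal sketch - DuckDB querying a parquet file in place, then SQLGlot translating a query between dialects. The file, table and column names are placeholders rather than anything from a real project, and it assumes the duckdb and sqlglot Python packages are installed:

```python
import duckdb
import sqlglot

# DuckDB can query a csv or parquet file in place - no loading step required.
# 'transactions.parquet' is a placeholder path.
result = duckdb.sql(
    "SELECT customer_id, SUM(amount) AS total_spend "
    "FROM 'transactions.parquet' "
    "GROUP BY customer_id"
)
result.show()

# SQLGlot transpiles a query from one SQL dialect to another
spark_sql = sqlglot.transpile(
    "SELECT STRFTIME(order_date, '%Y-%m') AS order_month FROM orders",
    read="duckdb",
    write="spark",
)[0]
print(spark_sql)
```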
Overall, SQL is probably the most future-proof tool for writing data pipelines - which have a habit of sticking around for longer than expected. Whilst there are many competitors, SQL is the most likely to still be in use in 20 years' time.
Experienced data engineers will know that data types (datetimes, strings, etc.) can be a notorious time sink. The main problem is the lack of one-to-one mappings between data types in different programming languages and SQL engines. A secondary problem is that some tools, such as pandas and SQLite, do not enforce data type constraints - meaning a column can contain a mix of data types.
Use of SQL mitigates these problems because SQL engines are usually strongly typed. The ecosystem also provides tools that go further: Apache Arrow addresses this challenge by enabling robust and consistent handling of data types across tools such as R, Python and different databases. By choosing SQL engines that are compatible with Arrow, many of these cross-tool data typing problems disappear.
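As a rough illustration of this contrast (the table and column names are invented, and it assumes the duckdb, pandas and pyarrow packages are installed):

```python
import duckdb
import pandas as pd

# pandas silently accepts a column containing a mix of types
mixed = pd.DataFrame({"amount": [1.5, "not a number", None]})
print(mixed["amount"].dtype)  # object - no type constraint is enforced

# a strongly typed SQL engine declares and enforces the column type
con = duckdb.connect()
con.execute("CREATE TABLE payments (amount DOUBLE)")
con.execute("INSERT INTO payments VALUES (1.5)")
try:
    con.execute("INSERT INTO payments VALUES ('not a number')")
except duckdb.Error as e:
    print(f"Rejected by the engine: {e}")

# results can be handed to other tools as an Apache Arrow table,
# preserving the declared types
print(con.sql("SELECT * FROM payments").arrow().schema)
```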
Dependency management adds a significant maintenance burden to data pipelines and means maintainers need additional skills. Whilst the use of SQL does not eliminate the problem, it simplifies it considerably since SQL syntax changes much less frequently, and requires few dependencies to run.
For example, a pipeline written in R or Python five years ago could require days or even weeks of work to bring up to date. It’s likely to take significant effort to set up an environment to even run the code. The same pipeline written in SQL would need far fewer changes to bring up to date, and a reader seeking to simply understand the code could easily execute the SQL without setting up a new development environment.
SQL can also be executed from almost any programming language, making it easier to migrate pipelines to different tools, or to embed logic in other applications.
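For instance, the same query text can be executed from Python's standard-library sqlite3 driver - the logic lives in ordinary SQL and the host language simply runs it (the orders table here is created in memory purely for the example):

```python
import sqlite3

QUERY = "SELECT status, COUNT(*) AS n FROM orders GROUP BY status"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "shipped"), (2, "shipped"), (3, "pending")],
)
for row in con.execute(QUERY):
    print(row)  # e.g. ('pending', 1) and ('shipped', 2)
```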
Many data engineers will be familiar with the pain of uncovering an incomprehensible thousand-line SQL script that was written a decade ago, but is a critical part of an organisation’s data pipelines. Or the challenges of making small changes to Spark SQL pipelines that run on huge volumes of data. Overall, historically it has been challenging to write SQL that aligns to good engineering practice - such as clear, concise, and tested code that is split out into easily digestible components.
Some of these challenges are now much easier to overcome with the use of three components:
There are even tools such as dbt which take similar ideas and compose them into a framework.
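As a rough sketch of what these practices can look like - a query split into named CTE steps, unit-tested against a tiny in-memory fixture with DuckDB. The table, columns and cleaning rules are invented for illustration and aren't taken from any particular project:

```python
import duckdb

# the transformation logic: each CTE is a small, named, digestible step
TRANSFORM_SQL = """
WITH cleaned AS (
    SELECT customer_id, LOWER(TRIM(email)) AS email
    FROM raw_customers
),
deduplicated AS (
    SELECT DISTINCT customer_id, email FROM cleaned
)
SELECT * FROM deduplicated
"""

def test_transform_deduplicates():
    con = duckdb.connect()
    con.execute(
        "CREATE TABLE raw_customers AS "
        "SELECT * FROM (VALUES (1, ' A@B.COM '), (1, 'a@b.com')) "
        "AS t(customer_id, email)"
    )
    result = con.execute(TRANSFORM_SQL).fetchall()
    assert result == [(1, "a@b.com")]

test_transform_deduplicates()
```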
Modern SQL engines have support for a range of functions that make complex operations much simpler than they used to be, addressing earlier shortcomings:
Other features with more narrow applications include full text search, geospatial functions, PIVOT operations and user-defined functions - though these should be used with care because support varies between engines.
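As one illustrative example of this kind of feature (my choice of example, with invented data), a window function expresses a per-account running total in a few lines - the sort of operation that was awkward to write in older SQL. It's run via DuckDB here only so the snippet is self-contained:

```python
import duckdb

sql = """
SELECT
    account_id,
    txn_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY account_id
        ORDER BY txn_date
    ) AS running_balance
FROM (VALUES
    (1, DATE '2023-01-01', 10.0),
    (1, DATE '2023-01-02', -4.0),
    (2, DATE '2023-01-01', 7.0)
) AS txns(account_id, txn_date, amount)
ORDER BY account_id, txn_date
"""
duckdb.sql(sql).show()
```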
When may there be a strong argument for using other tools? I've argued in this post that SQL will often enable you to write pipelines that are simple, readable and easy to test, and where the totality of code and infrastructure is comparatively simple and maintainable in the long run.
But there are some situations where the opposite is true. For example, to interpolate a time series in pandas, you can use the resample method - a single line of code, where the intention is clear. In many SQL engines, the equivalent SQL is more complex and difficult to read. Similarly, SQL is probably not the best tool to operate on graph-like data structures.
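To illustrate the first point, here is a minimal sketch of the pandas approach (the series and frequencies are invented for illustration):

```python
import pandas as pd

# a small daily series, upsampled to hourly with linear interpolation
daily = pd.Series(
    [10.0, 13.0, 19.0],
    index=pd.date_range("2023-01-01", periods=3, freq="D"),
)
hourly = daily.resample("h").interpolate()
print(hourly.head())
```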
Ultimately SQL should not be the only tool you consider - but I suggest a presumption in its favour, with other tools requiring a strong justification.
Part of the motivation for writing this post is the success I’ve had in using SQL to power Splink, a library for probabilistic record linkage at scale. All the core computations are written in SQL.
This has the following advantages:
This has all been possible despite the need to express some fairly complex iterative algorithms (such as the Expectation Maximisation algorithm) using SQL.
Far from becoming outdated, SQL is going from strength to strength.
One interesting trend is the development of alternative APIs that 'compile down' to SQL for execution. The idea here is that SQL engines are so well optimised that library authors are better off reusing this work than writing their own execution logic.
Some interesting things I have an eye on: