Originally posted: 2018-08-11. View source code for this page here.
In government and beyond, organisations are aiming to become more data-driven. The widespread adoption of data science approaches throughout analytical teams is key to achieving these aims. However, the value of data science is often understood too narrowly, and a misplaced focus can cause businesses to miss out on the most significant benefits.
This post describes how data science provides a new approach to tackling analytical problems. This approach can be used by all analysts, and results in a sustained increase in efficiency, quality of analysis, impact, and the pace of analytical innovation. I call it post-hype data science.
For me, data science boils down to applying the most powerful tools, techniques and ways of working to solve analytical problems. Most of the value comes from applying new ways of working and better tools to traditional problems, rather than from cutting-edge techniques like machine learning and neural networks, which are sometimes seen as synonymous with data science.
If data science is really just a new way of working, what explains its dramatic rise in recent years? It’s a consequence of a rapid and radical improvement in the analyst’s toolset, which enables this new workflow. Free and open source tools like R and Python are now the most popular tools used by data scientists, and are among the world’s fastest-growing programming languages. ‘Social coding’ tools like GitHub, which greatly facilitate collaboration between analysts who may never have met in person, have become enormously more popular in the last decade.
Across multiple disciplines, analysts are finding that these tools can lead to radically different ways of working. For instance, in academia, the reproducibility crisis is leading academics to question the traditional approach to publishing papers, and a leading economist has recently said that open source data science tools are “well on the way to becoming a standard for exchanging research results¹.”
For an analytical result to be useful, we need to know three things: what the result is, how it was produced, and whether it can be trusted.
Historically, analysts have solved these problems with a customer-facing write-up, a technical write-up, and a quality assurance log: standalone documents written mainly in prose. The analysis itself has been conducted separately, most often in Excel and VBA. This package of documents underpins Analytical Quality Assurance (AQA), and is also essential for corporate knowledge retention.
This approach works, but it is time-consuming and difficult to iterate on. Each element needs to be updated or recreated as things change, and we risk the documents getting out of sync with each other. Since copies of data are often embedded in the same spreadsheet as the main logic, the code cannot be shared widely for data security reasons, resulting in analytical projects being held in silos.
The approach taken by data scientists is different. We believe that the most precise and succinct statement of a method is computer code which, when run, transforms the ‘raw materials’ (data and assumptions) into an analytical result.
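To make this concrete, here is a minimal sketch (in Python, using the pandas library) of what ‘analysis as code’ might look like. The file names, column names and the inflation assumption are hypothetical; the point is that the script itself is the complete, runnable statement of the method:

```python
# A minimal sketch of 'analysis as code'. The file names, column names and
# the inflation assumption below are hypothetical.
import pandas as pd

# Raw materials: data and assumptions are read in, never edited by hand
data = pd.read_csv("input/claims.csv")
assumptions = {"inflation_rate": 0.02}

# The method: every step from inputs to result is stated explicitly in code
data["adjusted_cost"] = data["cost"] * (1 + assumptions["inflation_rate"])
result = data.groupby("region")["adjusted_cost"].sum()

# The result: written out, so re-running the script reproduces the analysis
result.to_csv("output/adjusted_cost_by_region.csv")
```

Anyone with the raw materials can re-run the script and get the same result, which is exactly the property the prose-and-spreadsheet approach struggles to guarantee.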
This new way of working is an improvement: by enabling much faster iteration with greater confidence in quality, analysts can deliver more relevant evidence, faster.
This new workflow also enables analysis to be embedded more widely in organisations, including in situations where speed is critical. For example, operational decision makers often need data to be up to date for analysis to be useful to them. This is only possible when the analytical process is fully automated and reproducible.
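As an illustration (not a prescription), a fully automated analysis can be as simple as a parameterised script that is re-run whenever fresh data arrives. The paths and column names below are hypothetical:

```python
# A sketch of an automated, re-runnable analysis: everything needed to refresh
# the result on the latest data is captured in code. Paths and columns are
# hypothetical.
import argparse
import pandas as pd

def run_analysis(as_of_date: str) -> None:
    # Read the latest extract for the requested date
    df = pd.read_csv(f"extracts/operations_{as_of_date}.csv")
    # Summarise open cases by team
    summary = df.groupby("team")["open_cases"].sum()
    # Write the refreshed output for decision makers
    summary.to_csv(f"reports/open_cases_{as_of_date}.csv")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Refresh the operational summary")
    parser.add_argument("--as-of-date", required=True, help="e.g. 2018-08-11")
    args = parser.parse_args()
    run_analysis(args.as_of_date)
```

Because the whole process is captured in code, it can be scheduled to run automatically rather than waiting for an analyst to repeat a series of manual steps.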
Data scientists’ heavier reliance on high-quality, reproducible code has unlocked an explosion in the sharing and re-use of analytical work. As data scientists, we have a tangible sense of being part of a worldwide community empowered to continuously improve our own tools. For example, there are now over 10,000 R ‘packages’ — reusable chunks of code and analysis, which can be downloaded and used for free².
Re-use is great for saving time, but some of the biggest benefits relate to its ability to help us manage complexity, create a better division of labour, and incrementally increase quality. By creating small, high-quality code libraries which can be re-used in new analytical projects, analysts can quickly assemble these ‘lego blocks’ into powerful new results using surprisingly little code, which in turn makes new work easier to understand. Since these ‘lego blocks’ are re-used in multiple projects, any improvements can result in widespread benefits.
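A small, hypothetical sketch of the ‘lego blocks’ idea: two reusable helper functions, which in practice would live in a shared package, are assembled into a new piece of analysis in just a couple of lines:

```python
# Hypothetical 'lego blocks': small, reusable functions assembled into a new analysis
import pandas as pd

def deduplicate(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Reusable block: keep only the most recently updated record per key."""
    return df.sort_values("updated_at").drop_duplicates(key, keep="last")

def add_financial_year(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Reusable block: derive a UK financial year label from a date column."""
    dates = pd.to_datetime(df[date_col])
    fy_start = dates.dt.year - (dates.dt.month < 4)
    out = df.copy()
    out["financial_year"] = fy_start.astype(str) + "/" + (fy_start + 1).astype(str)
    return out

# A new analysis assembled from the blocks above, using very little new code
cases = pd.read_csv("input/cases.csv")
cases = add_financial_year(deduplicate(cases, "case_id"), "received_date")
print(cases.groupby("financial_year")["case_id"].count())
```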
The result is a sustained increase in analytical innovation. We can continuously improve the quality and power of our models, rather than reinventing the wheel in different organisations or teams at different times³.
At an organisational level, the transition to new skills and ways of working is a huge challenge, and I may write about it separately.
However, as an individual analyst, one of the most exciting things about data science is that the tools and high-quality training materials are available online for free. Building a data science skillset requires time and dedication, but it doesn’t necessarily require the kind of expensive, classroom-based courses that are often needed for proprietary tools.
You can begin your journey by trying out some of the most powerful data science tools in your web browser by clicking this link⁴. You can find the cross-government data science community’s lists of free, high-quality learning resources for Python and R here and here. What are you waiting for?
¹ To get a sense of the power of this software, you can run the analysis behind a recent Nobel Prize-winning physics paper in a web browser with no special software here.
² We see this — for example — in the outstanding tidyxl package, which was written by Duncan Garmonsway at GDS.
³ You can read more about the benefits of coding in the open here.
⁴ May be blocked on some corporate IT systems, but will work at home or on your smartphone!