Originally posted: 2021-10-29.
As data science has become mainstream, organisations face huge challenges in providing the tools and infrastructure needed to harness its value.
At face value, the problem is that analysts are asking for a greater variety of tools than have traditionally been required. But it would be a mistake to conclude that the value of a platform can be unlocked simply by making a few new tools available.
A large part of the value comes from enabling analysts to solve their own problems. Analysts’ frustration is less about access to specific tools, and more about the barriers typically found on corporate systems that prevent them from exploiting the rapidly-changing landscape of tools and technologies. Failure to confront this problem risks the emergence of ‘shadow IT’: staff resorting to building their own parallel analytical infrastructure.
Analysts are creative problem solvers, and they are often best placed to understand the value that data infrastructure can deliver to the organisation. Disempowering them by viewing them as passive consumers of analytics infrastructure undermines this purpose. Instead, as part of multidisciplinary teams, they should take an active role in building and iterating tools and infrastructure that unlock the power of data.
As such, a key requirement is that the platform provides the ‘meta’ ability for a large number of people simultaneously to change and improve the platform itself.
Whilst this is by no means the only challenge in building a platform, it is one of the most important. Without collaborative platform building tools, analysts will be continuously frustrated by bottlenecks in the speed at which the platform team can respond to the wide variety of complex requirements. The platform is also at risk of reaching a ceiling of complexity at which point it becomes prohibitively difficult to add new features.
This capability may only be used by a small subset of the most technical analysts. But it’s the difference between a platform that is transformative, enabling continuous improvement of analytical work, and one that merely provides a one-off improvement to tools.
Furthermore, the kind of distributed innovation that is enabled by this approach can help set the technical direction, and significantly de-risk big-ticket changes in platform architecture.
The remainder of this article delves further into the problem, and into some tools and approaches that can help meet this need.
If analysts want to solve their own problems, why is this capability so difficult to deliver? The answer lies in the tension between individual freedom and the need for the organisation to keep complexity, duplication and security under control.
The advent of cloud computing makes it almost too easy to provision complex analytics infrastructure. Data storage and compute, and even cloud-based software like Jupyter or RStudio, are becoming commodities. The big challenge is managing the permissions to create and use infrastructure for hundreds of users, and the corresponding data access permissions. Without a structured approach to this problem, the organisation can quickly lose track of what exists, why it exists, and who should have access to it.
For example, who can spin up a Spark cluster? What data can that Spark cluster access? Are there limits to the size of the cluster? Who has access to the logs produced by the cluster? Can a workflow manager like Apache Airflow trigger a Spark job? If so, what data access can be conferred on the job?
As the range of tools and infrastructure increases, this is a problem that can get fiendishly difficult to manage. The principle of least privilege suggests that to achieve the most robust security, users should have the minimum set of permissions they need to do their job. Ensuring this is the case for hundreds of users is probably beyond the capacity of a small platform team unless they impose significant barriers to what infrastructure can be used. For analysts with the most complex needs, the only way to prevent a bottleneck is for them to manage some of this task themselves.
The rest of this post explores three tools to help organisations successfully navigate this tradeoff: Infrastructure as code, containerisation and the delegation of governance. These tools help provide a structure for analysts to solve their own problems that is self-documenting and promotes maintainability.
These approaches by no means eliminate the complexity of additional infrastructure and tools. However, they significantly lower costs, so experimentation is no longer prohibitively expensive, providing the fuel for innovation.
‘Infrastructure as code’ (IaC) is the practice of defining all platform infrastructure in computer code. A software tool like Terraform or Pulumi is then used to automatically keep the real-world infrastructure in sync with the code. The types of infrastructure that can be defined in code are wide-ranging, encompassing everything from databases and data access permissions to network security configurations, and even permissions to alter the platform itself.
The code thereby serves as an authoritative record of the infrastructure that exists and who created it. Source control systems like GitHub can also store additional metadata, such as the purpose of the infrastructure and who approved its creation.
IaC therefore provides a solution to the problem of maintaining an auditable inventory of assets. Whilst this is valuable, there are other approaches that could achieve the same outcome.
More importantly, IaC enables collaborative development of the platform by large numbers of contributors. This is much harder to achieve via alternative means. In particular, since everything is defined in code, teams can leverage the mature ecosystem of software development tools that has enabled large teams of developers to collaborate on code for decades.
For simple requests, such as granting a user access to a particular database table, a GUI may be built on top of the IaC tool.
For more complex requests, the IaC tool allows anyone with the skills to define precisely the infrastructure they need and propose the changes to the platform team. This provides an important escape valve that means well-resourced teams are much less likely to be blocked by the bottleneck of the platform team.
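As a concrete (and hypothetical) illustration, the sketch below uses Pulumi’s Python SDK with its AWS provider to propose a new storage bucket together with a least-privilege, read-only policy attached to an existing team role. The resource names and the role are invented for the example; a Terraform definition would serve the same purpose.

```python
import json

import pulumi
import pulumi_aws as aws

# A bucket for a new project's intermediate datasets (name is illustrative).
bucket = aws.s3.Bucket("churn-model-scratch")

# A read-only policy scoped to that bucket alone: least privilege by default.
read_only = aws.iam.Policy(
    "churn-model-scratch-read",
    policy=bucket.arn.apply(lambda arn: json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [arn, f"{arn}/*"],
        }],
    })),
)

# Attach the policy to a pre-existing (hypothetical) analyst team role.
aws.iam.RolePolicyAttachment(
    "churn-team-read-scratch",
    role="analyst-churn-team",
    policy_arn=read_only.arn,
)

pulumi.export("bucket_name", bucket.id)
```

Because the proposal is just code, it can be peer reviewed, versioned and rolled back like any other change, rather than living as a ticket in a platform team’s queue.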
Containerisation tools like Docker allow users to define, in code, a complete computational environment in which to run their analysis. Building a Docker image is equivalent to setting up a new computer from scratch and installing all the tools needed for a task, but doing so in a repeatable way.
This is an important escape valve that gives more technical users the freedom to use their preferred tools whilst avoiding the bottleneck of tools needing to be centrally provisioned.
It also helps to reduce complexity. Rather than attempting to configure a single server capable of completing many tasks (running ETL pipelines, training machine learning models, and so on), you can configure a container for each task, eliminating the potential for conflicts.
Containerisation is especially important for cutting-edge data science work, given the rapidly-changing landscape of tools and techniques. The range of tools is so vast that no individual, or even a centralised team, can be expected to understand them all. With a containerisation tool, the analyst becomes responsible for configuring the tools they need. Since containers are specified in code, they are forced to do this in a reproducible, auditable way, reducing the risk that corporate knowledge is lost.
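As a rough sketch of what this looks like in practice, the example below defines a minimal, single-purpose environment and builds and runs it with the docker-py SDK. The image name, base image and pinned library versions are illustrative assumptions; in a real project the Dockerfile would normally live in the repository alongside the analysis code.

```python
import io

import docker  # the docker-py SDK; assumes a recent version and a local Docker daemon

# A hypothetical environment definition: a pinned Python base image plus the
# handful of libraries one analysis task needs, and nothing else.
dockerfile = b"""
FROM python:3.11-slim
RUN pip install --no-cache-dir pandas==2.1.4 scikit-learn==1.3.2
CMD ["python", "-c", "import pandas, sklearn; print('environment ready')"]
"""

client = docker.from_env()

# Build the image from the in-memory Dockerfile, then run it once as a check.
image, _ = client.images.build(fileobj=io.BytesIO(dockerfile), tag="analysis-env:0.1")
print(client.containers.run("analysis-env:0.1", remove=True).decode())
```

The environment is defined entirely in text, so it can be rebuilt identically months later, reviewed like any other change, and thrown away without affecting anything else.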
Collaborative platform building requires governance mechanisms that establish trust and ensure quality and security amongst a large number of contributors. IaC and containerisation help this process, since they ensure contributions are fully auditable and can be peer reviewed before they are actioned.
However, peer review is only effective if reviewers have the requisite skills and enough time to understand what the user is trying to achieve. This is unlikely to happen if reviewers are exclusively situated in a centralised platform team. Instead, these skills need to be more widespread, and embedded within analytical and data teams. In this way, the central team may maintain the IaC tooling and the other tools used for the technical parts of governance, but the use of these tools can be delegated to other suitably-skilled teams. This way of distributing skills and knowledge has much in common with the thinking behind the data mesh.
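One way to keep delegated contributions within agreed guardrails is to codify the rules themselves. The sketch below assumes Pulumi’s CrossGuard policy SDK and rejects any proposed S3 bucket with a public ACL, so the check runs automatically regardless of which team proposes the change; the policy pack name and the specific rule are illustrative.

```python
from pulumi_policy import (
    EnforcementLevel,
    PolicyPack,
    ReportViolation,
    ResourceValidationArgs,
    ResourceValidationPolicy,
)

def no_public_buckets(args: ResourceValidationArgs, report_violation: ReportViolation):
    # Flag any AWS S3 bucket whose ACL would make its contents publicly readable.
    if args.resource_type == "aws:s3/bucket:Bucket":
        if args.props.get("acl") in ("public-read", "public-read-write"):
            report_violation("S3 buckets on this platform must not be publicly readable.")

PolicyPack(
    name="platform-guardrails",
    enforcement_level=EnforcementLevel.MANDATORY,
    policies=[
        ResourceValidationPolicy(
            name="no-public-buckets",
            description="Prohibits publicly readable S3 buckets.",
            validate=no_public_buckets,
        ),
    ],
)
```

Guardrails like this shift the reviewer’s job from catching every mistake by eye to maintaining a shared set of rules, which scales far better across many contributing teams.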
Modern analytical platforms have the potential to be revolutionary, supercharging innovation and leading to rapid and sustained improvement in organisations’ ability to use data effectively.
The skills and knowledge to unlock this value are widely distributed across the analytical community, in staff who are often begging to be able to innovate, but are blocked by technology. An important purpose of the platform is to unlock this creative energy. But this impetus can quickly turn to frustration or ineffective workarounds if the platform does not offer sufficient flexibility.
The purpose of the platform is therefore undermined by excessive centralisation. Flexibility that enables users to solve many of their own problems should be seen as a critical user need, and platform teams should see themselves primarily as enablers rather than solutioneers. With this in mind, some of Brian Eno’s design principles seem particularly apt:
Think like a gardener, not an architect.
Design beginnings, not endings.
Unfinished=fertile.
Artists are to cities what worms are to soil.