Originally posted: 2019-03-14. View source code for this page here.
Governance is a particularly hard problem for data improvement projects¹ because it is difficult to assess and communicate how well things are going.
I suspect that the difficulty of communicating a clear picture of progress, and of the value being delivered, is the key driver of the high failure rates in this type of work.
This blog post contains a set of questions which help improve communication between data delivery teams and their senior leaders. The aim is to help senior leaders to ask incisive questions — questions which enable them effectively to steer the project and assess progress. These questions should also help delivery teams organise their thoughts in a way which is meaningful to non-specialists. This kind of clear communication helps increase the chance of success and reduce the governance burden² because it builds trust and understanding, ultimately leading to more empowered delivery teams.
Why is it hard to quantify and communicate progress on data improvement projects?
Data is an intermediate product in the organisation’s value chain. It percolates through the business, starting off as a raw material, and ending up being used in some form in almost every part of the organisation, often having passed through algorithms, human analysts, reporting systems and other systems along the way. Rarely does the organisation have a complete picture of this data lineage, let alone of the value of each application of the data, making it extremely hard to build a global view of how much value improvements to data bring to the business.
The quality of improvements to data is difficult to describe and define. What works for some users may be totally useless to others. Readily available measures, such as the volume of data and how quickly it can be processed, may be poor proxies for business value.
The exact mechanisms by which data supports business decisions on the ‘front line’ are often poorly understood by technical specialists.
The technical complexity of data projects can hamper clear communication between teams and senior leaders, obscuring outcomes.
The costs of poor data are usually hidden from management because they mostly fall outside the governance and reporting lines of the data project: they’re borne elsewhere in the business, at different times, over the course of many years.
I think the following three questions can provide the basis of a challenging, insightful and mutually beneficial discussion between senior leaders and their delivery teams.
A huge amount of hype surrounds data — partly a result of the phenomenal success of companies like Google and Amazon.
As a result, many businesses have programmes of work to become a ‘data driven organisation’ or to have a ‘data revolution’. But this can be dangerously vague, or have too wide a scope, and clear thinking is needed to narrow the focus to something tangible and achievable.
More data or faster data will not automatically make an organisation more data driven. It’s therefore important to be as specific as possible about how data is expected to improve things, using a range of examples centred around real business problems. A good example articulates:
The new problem the business can solve with better data, or the new question it can answer
Why it is important to answer this question or solve this problem
Who will be using the new data, and what tangible changes can it make to how the business operates? Is the new data actionable or merely interesting?
How much of a cultural change will be needed for the new data to deliver the expected benefits? Are staff data-literate and actively asking for better data or does data not currently play much of a role in their day job? Is the change that’s envisaged plausible?
This conversation shouldn’t be technical; it should focus on business outcomes. Senior management might not be able to offer much challenge about whether the right algorithm is being applied, but they can offer effective scrutiny of whether the problems the delivery team is trying to solve are important, and whether any required cultural change is achievable.
It is also valuable to discuss the benefits being targeted at a higher level. Does the business think the biggest benefits lie in improving the timeliness of data, its accuracy, the breadth of fields available, or something else³? Again, these are areas where senior leaders are likely to be able to offer effective challenge, so long as it’s clear that this is a question of prioritisation and that it’s not possible to have everything at once.
There’s something about data projects that seems to invite long design phases. With hyped-up language about ‘data revolutions’ it’s easy to assume a radical new enterprise data architecture is needed. A system with every piece of data in its right place intuitively sounds like the best solution, and design is a stimulating intellectual challenge, so it’s easy to get sidetracked into developing highly detailed plans before starting work.
I think a good data architecture is something that emerges from iterative cycles of delivery of real data products and feedback from users. The resultant architecture is likely to be more ‘messy’, but actually serves user needs better at lower cost. Modern cloud-based services and the use of infrastructure-as-code mean that decisions about storage, databases and tools are much easier to change, so this iterative approach is now more viable.
A good discussion should therefore focus on the value the team is delivering to their users over timescales of weeks (not months). This should be evidenced by the feedback the team has received from their users, and how the team is responding. It is useful to refer back to the examples of real-world problems discussed above, to see which ones have progressed.
A discussion should also help make explicit the trade-offs the delivery team is making to protect the pace of delivery. For instance, the team is likely to have had to make compromises on data quality (e.g. speed, accuracy, or usefulness) in order to deliver at pace. Or the team may have focussed on the user needs of only one customer. Discussing these trade-offs is often insightful because it highlights what is *not* being done, which is important for transparency and trust and essential for a real understanding of current priorities.
Ultimately, a clear understanding of the benefits for the business, coupled with a strong user focus, allows the team to prioritise effectively, enabling tightly focused work that delivers products over short timescales.
Work should be focussed around the needs of specific users. But there is a tension: the data products being developed are likely to be used by the business for many years to come, by users or parts of the organisation which may not even exist yet.
Anyone who’s ever used bad data systems knows they can impose huge costs on data users and their organisations in terms of wasted effort, costs which often persist for decades⁴. More importantly, bad systems prevent effective use of the data, hampering evidence-based decision making.
To give a tangible example of the problem, it’s common to encounter datasets which are locked away — only accessible via a custom Excel plugin on a particular network which provides the data in a strange format. These may have been adequate for a specific set of users ten years ago but now they pose a huge problem.
How do we stop this happening? Some points of discussion should be:
Is development work adhering to a set of standards or principles⁵ that ensure compatibility between different efforts to improve data? If these principles are effective, work can be loosely coupled but tightly aligned and different teams can autonomously deliver value⁶.
Does the organisation own the data, and has it tested that it has direct and performant access to the raw data from general-purpose programming languages, even if development work is contracted out to third parties? The presumption should be that secure access is available via the internet (i.e. data shouldn’t be locked behind a specific corporate network). A sketch of what this might look like follows this list.
What mechanisms have been designed to help disparate teams agree on common definitions so that data from multiple sources can be integrated and linked?
Are good software engineering principles being followed? All source code, particularly code that transforms data, should be accessible and version controlled so that future users can understand data lineage and what changes have been made to the raw data.
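As promised above, here is a minimal sketch of what direct, programmatic access to raw data might look like, assuming the data is published as parquet files in cloud object storage. The bucket name and file path are hypothetical placeholders, not a prescription:

```python
# A minimal sketch of direct, programmatic access to raw data.
# Everything here is illustrative: the bucket and file path are
# hypothetical placeholders.
import pandas as pd

RAW_DATA_URL = "s3://example-corp-data-lake/sales/transactions.parquet"

# pandas can read parquet straight from cloud object storage
# (this needs the s3fs and pyarrow packages installed, with
# credentials supplied in the standard way, e.g. environment
# variables). No bespoke Excel plugin or corporate-network-only
# tooling in the way.
df = pd.read_parquet(RAW_DATA_URL)

print(df.head())
```

The point is not the specific technology: it’s that an analyst with nothing more than a standard Python environment and the right credentials can get at the raw data.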
Demos of working new data products should be a key part of updates to senior leaders (in addition to being shown to users on a regular basis). An effective demo will bring the above three questions to life, providing a point of reference for the discussion. The delivery team should consider whether they can:
Show how the data products fit into ways of working by describing in as much detail as possible who is going to be using the product, in what context, and how much of a change this is to their current ways of working.
Present user research that includes unbiased feedback from users about the new data, how it is affecting ways of working and business outcomes, and/or bring a user along to show how they’re using the data product.
Present some metrics about user engagement (e.g. pageviews, numbers of users). Particular attention should be paid to engagement after the ‘honeymoon period’, where curiosity about a new data product may drive an initial peak: is there evidence that the data has been embedded long term in day-to-day decision making? A rough sketch of such a check follows this list.
Clarify how the new data products are different to what was previously available, which might include a comparison of old and new, and whether adherence to data principles has been improved.
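On the engagement metrics point, here is a rough sketch of the kind of ‘honeymoon period’ check a team might run, assuming a simple access log with a timestamp and a user identifier per visit. The file name and column names are hypothetical:

```python
# A rough sketch of checking engagement beyond the launch
# 'honeymoon period'. Assumes an access log with one row per visit;
# the file name and column names are hypothetical.
import pandas as pd

log = pd.read_csv("access_log.csv", parse_dates=["timestamp"])

# Weekly active users: unique users seen in each calendar week.
wau = log.set_index("timestamp").resample("W")["user_id"].nunique()

# Compare the launch period with everything after it. Sustained or
# growing usage after the initial spike is evidence that the data
# product is becoming embedded in day-to-day decision making.
launch = wau.iloc[:4].mean()
after = wau.iloc[4:].mean()
print(f"Average weekly active users in the first month: {launch:.0f}; thereafter: {after:.0f}")
```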
These questions and this approach represent my emerging thinking on how to improve the quality of conversations with senior leaders. I’m sure I’ve got things wrong, and there are other things that can help — if so, please do leave a comment!
¹ This post is mainly about teams working on improving how the business ingests, stores and exposes data to the wide range of users who may find it useful — from front-line staff to data scientists. The post focusses on the ‘middle bit’ — the part between the system that collects the data, and the variety of systems, tools and people which may consume this data.
² A high governance burden is often a rational response to management’s lack of clarity about how things are going.
³ A non-exhaustive list of the possible scope includes improvements in the timeliness of data, its accuracy, the breadth of fields available, the frequency of observations, data linking, accessibility of data, improvements in metadata, discoverability and search.
⁴ The costs of bad data can include months of staff time finding, understanding, accessing and wrangling bad data. The old trope that 80% of data scientists’ time is spent data wrangling is, if anything, an underestimate in my experience. See also here and here.
⁵ Examples are here and here (these may be fleshed out more by internal teams, who would decide on things like metadata management conventions, specific data formats etc.)
⁶ I think it’s critical for data teams to be small enough to feel empowered to deliver value in their own right, not reliant on the success of a much larger programme of work (even if they are part of a wider programme). Principles are important because they allow teams to work in a distributed/federated model, autonomously delivering value, but prevent the worst kinds of siloed working.