Harnessing Dark Data
The pharma industry creates significant amounts of data – so why is most of it hoarded and never made accessible?
Graeme Dennis | Opinion
Data is generally recognized as an organization’s top asset after talent – but it is rarely treated as such. Although the amount of data produced every day is massive, it has been estimated that 60–73 percent of it goes unused for analytics (1) and is instead relegated to storage. This is known as “dark data.”
Local storage, an explosion in the number of users at all levels of computer literacy, and well-intended (and justified) data security have all contributed to the volume of dark data that companies retain whether or not they plan to use it. How can the biopharma and healthcare industries treat this data as a free-standing and central asset by making it accessible in the long term?
Like the unexpressed parts of the human genome, some dark data can be expected to have great meaning and significance – but not all. A key reason this data continues to accumulate is that there is no suitable system to house it. For instance, many in vivo preclinical study results generated by contract research organizations arrive in portable document formats unsuited to analysis, or in email, an unshared environment by design. This choice is dictated by convenience, but it makes the data not only unshareable, but also invisible to the organization. Even if the data were exposed, many scientists would not consider using or reinterpreting it without significant context as to how, when, by whom, and under what precise conditions it was gathered. They may instead opt to rerun the study, depleting time, money, and resources.
The antidote to the dark data morass should go beyond just exposing it. Most companies are guilty of data hoarding, where data is retained regardless of quality or significance. Our goal should be not only to find (and find value in) what is stored, but also to store less dark data. Storage may be cheap, but it’s not always best to keep everything.
Instead, the focus should be on retaining contextually rich data. Legacy methods of recording and collating data are not only prone to errors; they also fail to provide the full context in which the data was captured. Sometimes called data provenance, these conditions frequently dictate the reusability or applicability of data for interpretation. Instrumentation, lab location, conditions, and materials used are just some of the many factors that should be considered throughout the drug development lifecycle. Sample origin, transport conditions, and custody are also essential. Capturing this information relies on advanced informatics infrastructure and significant forethought in system design.
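To make the idea of contextually rich data concrete, here is a minimal Python sketch of a provenance record that travels with a result. The class name, field names, and the `missing_context` helper are all hypothetical, chosen only to mirror the factors listed above; a real informatics platform would define its own schema.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical context stored alongside a single result."""

    instrument_id: str
    lab_location: str
    operator: str
    acquired_at: datetime
    conditions: dict      # e.g. {"temperature_c": 21.5}
    materials: tuple      # reagents, lots, consumables
    sample_origin: Optional[str] = None
    transport_conditions: Optional[str] = None
    custody_chain: tuple = ()  # ordered handlers of the sample

    def missing_context(self) -> list:
        """Return the names of fields that were left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

A check such as `record.missing_context()` could run at the moment of capture, so that gaps in context are flagged while the person who ran the experiment can still fill them in.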
IDBS has a longstanding presence in the data space and I have observed a proliferation of standalone systems across siloed specialties and disciplines. When it comes to realizing the benefits of data, I have three central pieces of advice: recognize data maturity, explore integration approaches, and embrace a “data first” cultural shift.
First, classifying data according to its level of maturity is wise because it enables assignment of resources, effort, and priority to align with broader company goals. For instance, data may be considered fully dark or sequestered, shared (perhaps on a shared drive or SharePoint), structured (stored in a database), and ultimately standardized (both structured and harmonized with internal or industry standards). Stratifying data this way and then functionally – perhaps according to the most active project, candidate, or biological relevance – can make the process manageable.
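The four maturity levels and the two-step stratification can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a prescribed method: the boolean inputs to `classify`, the `project_activity` score, and the choice to surface the least mature data first are all hypothetical.

```python
from enum import IntEnum


class Maturity(IntEnum):
    """Maturity levels, ordered from least to most mature."""

    DARK = 0          # sequestered: local disk, email, PDF
    SHARED = 1        # on a shared drive or SharePoint
    STRUCTURED = 2    # stored in a database
    STANDARDIZED = 3  # structured and harmonized with standards


def classify(in_shared_location: bool, in_database: bool,
             follows_standard: bool) -> Maturity:
    """Assign a maturity level from three simple yes/no observations."""
    if in_database and follows_standard:
        return Maturity.STANDARDIZED
    if in_database:
        return Maturity.STRUCTURED
    if in_shared_location:
        return Maturity.SHARED
    return Maturity.DARK


def prioritize(datasets: list) -> list:
    """Sort by maturity (least mature first), then by project activity
    (most active first). Each dataset is a dict with the assumed keys
    'name', 'maturity', and 'project_activity'."""
    return sorted(datasets,
                  key=lambda d: (d["maturity"], -d["project_activity"]))
```

Sorting on maturity first and project activity second reflects the order described above; an organization with different goals could just as reasonably reverse the two keys.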
Second, it is important to explore integration approaches. The nearer data is captured to the moment of acquisition, the less likely it is to become sequestered. Vendor assessments must raise integration capabilities early and often in the evaluation of candidates. Data pipelining tools, database replication, or scripting can close these gaps, but the best solutions eliminate the gap via integration.
Finally, a “data first” strategy acknowledges the importance of data, socializes it, and provides the tools that enable success. The F.A.I.R. (findable, accessible, interoperable, reusable) data principles provide broadly accepted guidelines for implementing such a program. This type of program will succeed when it is visibly endorsed by leadership and prioritized at the bench. And you should absolutely invite not only champions, but also skeptics to participate – they will provide some of your most valuable input!
Big pharma is increasingly turning to third-party R&D firms to accelerate the early stages of drug development. Today, the most effective of these firms are technology-focused, with a cultural mindset that recognizes the power of data. Method execution and sample management – approaches with origins in manufacturing and QC – have extended into drug development.
Once dark data is revealed, optimal conditions and processes can be replicated, workload reduced, and efficiency increased. Error detection, for example, becomes much easier when one can pinpoint when and where a certain context changed. By rolling back to this point, it is possible to resume development, rather than starting the entire process again. In essence, dark data can shine a light on the best way forward.
1. Forrester, “Hadoop Is Data’s Darling For A Reason” (2016). Available at https://bit.ly/3bRlyJU.