In data analytics and machine learning, data quality and the "right" data architectures are key to success. Data governance enables organizations to establish the elements and processes needed to generate value-added information from data. With the immense growth of data that can be collected in laboratories thanks to developments in analytical technologies, it is becoming increasingly important to put such data to good use.
Considering the large number of articles published on data integrity [1] for the regulated environment, one might assume that all data issues in a laboratory setting have been resolved. This assumption would be true if the focus were on compliance alone. An increasingly important task, especially in data-driven research fields such as life sciences and biotechnology, is to analyze the huge amounts of data and convert them into usable knowledge. Big data applications are a rapidly growing field that poses new challenges, particularly with regard to data quality management.
Data as the most valuable asset
This article takes a closer look at "data as the most valuable asset," with data governance being the most important means of obtaining said asset. From the perspective of corporate IT, data governance is an "old" discipline, as regulatory requirements (recording obligations, financial regulations such as the U.S. Sarbanes-Oxley Act (SOX) of 2002 (see [2]), etc.) have had to be fulfilled for decades. Nowadays, as part of the progressive digitization trend, data are becoming increasingly important in terms of their potential to create value. In addition to meeting compliance requirements, both industry and research institutions are increasingly shifting their focus towards how to use data. However, the concept of data governance is not yet clearly understood by many companies (especially outside of IT departments), although awareness is growing rapidly as data collection increases. The topic has only recently become the focus of increased scientific interest (see [3]).
Laboratories in the life sciences, biotechnology and chemical research typically collect data from a variety of sources and systems, e.g. HPLC analyses, pH values, UV‑Vis spectral analyses, weight values, etc. Innovative high-throughput technologies are delivering vast amounts of complex and unstructured data. Collected in what are known as data lakes, these data could offer new insights and therefore harbor the potential to generate new knowledge and applications in life sciences and biotechnology. For data analytics to succeed, data management must provide full transparency, including all additional metadata (device, serial number, date & time, operator, IDs, etc.). In this context, the quality and completeness of the data are of the highest priority.
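As a minimal illustration of what such a metadata-enriched measurement record could look like, consider the following sketch. The field names are illustrative assumptions, not a standard schema; a real laboratory would align them with its own data architecture.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

# Illustrative sketch of a measurement carrying its metadata with it.
# All field names are assumptions, not an established lab data standard.
@dataclass
class LabMeasurement:
    value: float        # the measured quantity, e.g. a pH reading
    unit: str           # unit of measurement
    device: str         # instrument name
    serial_number: str  # instrument serial number
    operator: str       # person or system that ran the measurement
    sample_id: str      # links the result back to a sample
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

reading = LabMeasurement(
    value=7.02, unit="pH", device="pH-Meter A",
    serial_number="SN-0421", operator="jdoe", sample_id="S-2020-0815",
)
record = asdict(reading)  # full record including metadata, ready for a data lake
```

Because the metadata travels with the value from the moment of acquisition, later analytics can always answer which instrument, operator and sample produced a result.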
Mining data jewels in the data lake
Fig. 1 From data to wisdom: The Data-Information-Knowledge-Wisdom pyramid (own diagram). The graphic is based on the so-called “DIKW” model, a widespread hierarchical model commonly used in information science.
Data analytics very often goes hand in hand with machine learning to develop new growth sectors such as drug development [4], and is the level to aim for. The term "data analytics" is often used in connection with big data in life sciences and artificial intelligence/machine learning (AI/ML). As mentioned above, the rapid growth of data is driven by new analytical technologies such as high-throughput screening (HTS), DNA sequencing (next-generation sequencing, NGS) and the "omics" fields, as explained in [5]. Success can only be achieved if the data architecture reflected in the data lake meets the data analysis requirements. In addition, data analytics is used to make faster decisions, increase productivity or make predictions. All of this can only succeed with defined data architectures and laboratory data that are enriched with metadata. When all of this information comes together, we can speak of value-added data, as shown in Figure 1. The diagram is based on the so-called Data-Information-Knowledge-Wisdom (DIKW) model, a widespread hierarchical model commonly used in knowledge management and information science that is nowadays also critically discussed because of its limitations [6].
Data governance inside the laboratory
In order for data to become a valuable asset in the laboratory, it is important to understand and adopt best-practice procedures within the organization. Involving IT or lab IT experts to help the organization understand possible strategic directions for topics such as data harmonization, metadata requirements or data architectures is a good starting point. A data governance framework supports the development of a standard in the laboratory. Figure 2 illustrates the key areas of data governance [7], several of which overlap with the best practices of data integrity. The areas of "Document & Content Management," "Data Quality Management," "Metadata Management" and "Reference & Master Data Management" (light blue in Figure 2) are particularly close to data integrity.
Let’s take a closer look at three important laboratory data issues (highlighted in green in Figure 2) that should be addressed in addition to the data integrity requirements already associated with data governance. Note: the other two areas (shown in dark blue) are not discussed in the scope of this article, because they are more IT-related.
Fig. 2 Key areas of data governance, adapted from [7]
i) Data architecture management is the process of defining and maintaining specifications that
- provide a standard common business vocabulary,
- express strategic data requirements,
- outline high-level integrated designs to meet these requirements, and
- align with enterprise strategy and related business architecture.
ii) Data security management is the planning, development and execution of security policies and procedures to provide proper authentication, authorization, access and auditing of data and information assets.
iii) Data operation management is the development, maintenance and support of structured data to maximize the value of the data resources for the enterprise.
What is required here is the big picture of the laboratory data, referred to as the data architecture – irrespective of the nature of the data, whether purely digital from connected solutions or manually collected values from offline devices. Within this big picture, data classification (e.g. personal data, test results, measurement data) together with the corresponding access requirements must be defined; this is part of data security management.
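Such a classification with access requirements can be expressed very compactly. The following sketch uses hypothetical classes and roles purely for illustration; a real policy would follow the organization's own security concept.

```python
from enum import Enum

# Hypothetical data classes, mirroring the examples in the text.
class DataClass(Enum):
    PERSONAL = "personal data"        # e.g. patient or operator identities
    TEST_RESULT = "test results"
    MEASUREMENT = "measurement data"

# Illustrative access requirements per data class (roles are assumptions).
ACCESS_POLICY = {
    DataClass.PERSONAL: {"data_protection_officer", "study_lead"},
    DataClass.TEST_RESULT: {"lab_analyst", "study_lead", "qa"},
    DataClass.MEASUREMENT: {"lab_analyst", "data_scientist", "qa"},
}

def may_access(role: str, data_class: DataClass) -> bool:
    """Check whether a role is allowed to read a given class of data."""
    return role in ACCESS_POLICY[data_class]
```

For example, a data scientist would be able to read measurement data for analytics but not the personal data class, which stays restricted to designated roles.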
In order to drive digital transformation in the laboratory and, within data operation management, to maximize the value of data, data breaks that interrupt dataflows should be eliminated or strictly limited. Data breaks are situations such as the manual transcription of values, manual data entries in lab systems or the attachment of documents that cannot be accessed electronically. Manual transcription from an unconnected device to a lab data solution is error-prone, and because the process is time-consuming and hides the metadata from the user, far less valuable metadata becomes available.
The data architecture helps define the dataflow and workflows for acquiring laboratory data. In connection with data operation management, the dataflow between IT systems, analytical equipment and measurement data can be defined and implemented step-by-step in a meaningful way.
Data security and access rights are a must
As explained above, digital transformation is only possible if multiple areas of data governance are implemented simultaneously.
Data security management handles access rights and data classification by limiting the global visibility of valuable data. In combination with a security concept for the data, valuable data assets can be successfully protected. This conflicts with data lakes and big data analytics, where full access to all data is a prerequisite. Anonymizing personal or patient data for data analytics use cases can resolve this conflict and help fulfill regulatory requirements such as the European Union's GDPR (General Data Protection Regulation), which has been in effect since May 25th, 2018 (see [8]).
From manual to fully integrated data collection
Fig. 3 Data flow in a digital lab with automated transfer of data from the devices to the overlying data lake
Manual data transcription offers very limited opportunities for capturing metadata; the next step, direct data acquisition, opens up new possibilities for enriching measurement data with metadata. Automating the data streams from the acquisition software to an overlying data management system is part of the end-to-end dataflow designed to meet the data architecture requirements (see Figure 3).
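Such an automated stream can be sketched as a simple producer-consumer flow. Here an in-memory queue and list stand in for the real middleware and the overlying data management system, both of which are assumptions for illustration only.

```python
import json
from queue import Queue

# Stand-ins for real infrastructure: the queue plays the role of the
# transfer middleware, the list the role of the overlying data lake.
data_lake = []
stream = Queue()

def acquire(measurement: dict) -> None:
    """Called by the acquisition software when a result is ready."""
    stream.put(json.dumps(measurement))  # serialize once, at the source

def forward_all() -> None:
    """Drain the stream into the overlying system - no manual transcription."""
    while not stream.empty():
        data_lake.append(json.loads(stream.get()))

acquire({"sample_id": "S-001", "value": 7.02, "unit": "pH", "device": "pH-Meter A"})
forward_all()
```

The key design point is that no human retypes a value anywhere along the path: every record arrives in the data lake exactly as the instrument produced it, metadata included.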
Summary
Expanding data integrity through data governance principles with a strategy for data lakes and data analytics will strongly support the growth of value-added data. The data governance principles manage access rights, backup and archiving, whilst also having a strong focus on security aspects, which are increasingly becoming a key issue. Collecting useful metadata from the beginning of an experiment through to all of the results in the form of an end-to-end workflow will, in future, grant new insights by expanding the capabilities of data analytics.
________________________________________________________________________________________
Category: Laboratory Management | Data Management
Literature:
[1] U.S. Food and Drug Administration, https://www.fda.gov/files/drugs/published/Data-Integrity-and-Compliance-With-Current-Good-Manufacturing-Practice-Guidance-for-Industry.pdf, 2016 Apr, accessed on 2020 Oct 02
[2] Coates, J. C., IV (2007) The Goals and Promise of the Sarbanes-Oxley Act, Journal of Economic Perspectives, 21 (1): 91-116, DOI: 10.1257/jep.21.1.91
[3] Krotova, A., Eppelsheimer, J. (2019) Was bedeutet Data Governance? Eine Clusteranalyse der wissenschaftlichen Literatur zu Data Governance, Institut der Deutschen Wirtschaft, Köln, https://www.iwkoeln.de/fileadmin/user_upload/Studien/Gutachten/PDF/2019/Gutachten_Data_Governance_DEMAND_Template.pdf, accessed on 2020 Sep 29
[4] Mijuk, G., Drug development gets big data analytics boost, https://www.novartis.com/stories/discovery/drug-development-gets-big-data-analytics-boost, Novartis, 2018 Jul 02, accessed on 2020 Oct 02
[5] Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials; Board on Health Care Services; Board on Health Sciences Policy; Institute of Medicine; Micheel CM, Nass SJ, Omenn GS, editors. Evolution of Translational Omics: Lessons Learned and the Path Forward. Washington (DC): National Academies Press (US); 2012 Mar 23. 2, Omics-Based Clinical Discovery: Science, Technology, and Applications. Available from: https://www.ncbi.nlm.nih.gov/books/NBK202165/
[6] Williams, D. (2014) Models, Metaphors and Symbols for Information and Knowledge Systems, Journal of Entrepreneurship, Management and Innovation 10 (2014), 79-107, DOI: 10.7341/20141013
[7] Dama International, The DAMA Guide to the Data Management Body of Knowledge (DAMA-DMBOK), first edition, Basking Ridge, NJ, USA, Technics Publications, April 2009
[8] Official Journal of the European Union: Regulation (EU) 2016/679 (General Data Protection Regulation) of 27 April 2016, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, accessed on 2020 Oct 02
Header image: iStock.com | BlackJack3D, koto_feja
Date of publication:
21-Oct-2020