Securing data jewels within the data lake

Data governance as a strategy for value-added data and digitalization in the laboratory

Wolfgang Boos (Tecan Trading AG)

In the field of data analytics and machine learning, data quality and the “right” data architectures are the key to success. Data governance enables organizations to establish important elements and processes for generating value-added information from data. With the immense growth of data that can be collected in laboratories thanks to developments in analytical technologies, it is becoming more and more important to utilize such data accordingly.

Considering the large number of articles published on data integrity [1] for the regulated environment, it would seem that all data issues in a laboratory setting have been resolved. This assumption would be true if the focus were on compliance only. An increasingly important aspect, especially in data-driven research fields such as life sciences and biotechnology, is to analyze the huge amounts of data and convert them into useable knowledge. Big data applications are a rapidly growing field that poses new challenges, particularly with regard to data quality management.

Data as the most valuable asset

This article takes a closer look at “data as the most valuable asset,” with data governance as the key discipline for securing that asset. From the perspective of corporate IT, data governance is an “old” discipline, as regulatory requirements (record-keeping obligations, financial regulations such as the U.S. Sarbanes-Oxley Act (SOX) of 2002 (see [2]), etc.) have had to be fulfilled for decades. Nowadays, as part of the progressive digitalization trend, data are becoming increasingly important in terms of their potential to create value. In addition to compliance requirements, both industry and research institutions are increasingly shifting their focus towards how to use data. However, the concept of data governance is not yet clearly understood by many companies (especially outside of IT departments), although awareness is growing rapidly as the amount of data collected increases. The topic has also only recently become the focus of increased scientific interest (see [3]).

Laboratories in the fields of life sciences, biotechnology and chemical research typically collect data from a variety of sources and systems, e.g. from HPLC analyses, pH values, UV‑Vis spectral analyses, weight values, etc. Innovative developments in high-throughput technologies are delivering vast amounts of complex and unstructured data. Collected in what are known as data lakes, these data could offer new insights and therefore harbor the potential to generate new knowledge and applications in life sciences and biotechnology. In order to conduct data analytics, it is crucial that data management provides full transparency, including all additional metadata (device, serial number, date & time, operator, IDs, etc.). In this context, the quality and completeness of the data are of the highest priority.
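To make this concrete, a measurement record enriched with such metadata could be modeled as follows. This is a minimal sketch: the `Measurement` class, its field names and the example values are illustrative assumptions, not a fixed laboratory standard.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class Measurement:
    """One laboratory result together with the metadata needed for later analytics."""
    value: float
    unit: str
    device: str            # instrument name
    serial_number: str     # instrument serial number
    operator: str          # who ran the measurement
    sample_id: str         # links the result to the sample/experiment
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Example: a pH reading captured together with its full metadata context
reading = Measurement(
    value=7.2, unit="pH",
    device="pH-Meter-Lab3", serial_number="PH-001234",
    operator="jdoe", sample_id="S-2020-0815",
)
record = asdict(reading)  # a plain dict, ready to be serialized into the data lake
```

Capturing the metadata at the moment of measurement, rather than reconstructing it later, is what keeps such records complete enough for analytics.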

Mining data jewels in the data lake

Fig. 1 From data to wisdom: The Data-Information-Knowledge-Wisdom pyramid (own diagram). The graphic is based on the so-called “DIKW” model, a widespread hierarchical model commonly used in information science.

Data analytics very often goes hand-in-hand with machine learning to open up new growth sectors such as drug development [4], and is the level to aim for. The term “data analytics” is often used in connection with big data in life sciences and artificial intelligence/machine learning (AI/ML). As mentioned above, the fast growth of data is driven by new analytical technologies such as high-throughput screening (HTS), DNA sequencing (next-generation sequencing, NGS) and the “omics” fields, as explained in [5]. Success can only be achieved if the data architecture reflected in the data lake meets the data analysis requirements. In addition, data analytics is used to make faster decisions, increase productivity or make predictions. All of this can only succeed with defined data architectures and laboratory data that are enriched with metadata. When all of this information comes together, we can speak of value-added data, as shown in Figure 1. The diagram is based on the so-called Data-Information-Knowledge-Wisdom (DIKW) model, a widespread hierarchical model commonly used in knowledge management and information science, which is nowadays also critically discussed because of its limitations [6].

Data governance inside the laboratory

In order for data to become a valuable asset in the laboratory, it is important to understand and adopt best-practice procedures within the organization. Involving the IT or lab IT experts to help the organization understand the possible strategic direction for topics such as data harmonization, metadata requirements or data architectures is a good starting point. The data governance framework supports the development of a standard in the laboratory. Figure 2 illustrates many aspects of data governance [7], several of which overlap with the best practices of data integrity. The areas of “Document & Content Management,” “Data Quality Management,” “Metadata Management” and “Reference & Master Data Management” (light blue in Figure 2) are particularly similar to data integrity.

Let’s take a closer look at three important laboratory data issues (highlighted in green in Figure 2) that should be addressed in addition to the data integrity requirements already associated with data governance. Note: the other two areas (shown in dark blue) are not discussed in the scope of this article, because they are more IT-related.

Fig. 2 Key areas of data governance, adapted from [7]

i) Data architecture management is the process of defining and maintaining specifications that

  • provide a standard common business vocabulary,
  • express strategic data requirements,
  • outline high-level integrated designs to meet these requirements, and
  • align with enterprise strategy and related business architecture.

ii) Data security management is the planning, development and execution of security policies and procedures to provide proper authentication, authorization, access and auditing of data and information assets.

iii) Data operation management is the development, maintenance and support of structured data to maximize the value of the data resources for the enterprise.

What is required here is the big picture of the laboratory data, referred to as data architecture – irrespective of the nature of the data (whether purely digital based on connected solutions or manually collected values from offline devices). Within this big picture, data classification (e.g. personal data, test results, measurement data, etc.) with access requirements, which is a part of data security management, must be defined.
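As an illustration, such a classification with role-based access requirements could be sketched as follows. The data classes follow the examples above; the roles and the access policy are hypothetical, not a prescribed scheme.

```python
from enum import Enum

class DataClass(Enum):
    PERSONAL = "personal data"        # e.g. patient or operator identities
    TEST_RESULT = "test results"
    MEASUREMENT = "measurement data"

# Which roles may read which classification (illustrative policy only)
ACCESS_POLICY = {
    DataClass.PERSONAL: {"data_protection_officer"},
    DataClass.TEST_RESULT: {"lab_analyst", "qa_reviewer"},
    DataClass.MEASUREMENT: {"lab_analyst", "qa_reviewer", "data_scientist"},
}

def may_read(role: str, classification: DataClass) -> bool:
    """Return True if the given role is allowed to read data of this class."""
    return role in ACCESS_POLICY.get(classification, set())

may_read("data_scientist", DataClass.MEASUREMENT)   # True
may_read("data_scientist", DataClass.PERSONAL)      # False
```

The point of such a policy table is that access decisions follow the data classification, not individual datasets, which keeps the rules auditable as the data lake grows.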

In order to drive digital transformation in the laboratory and to maximize the value of data within data operation management, data breaks that interrupt dataflows should be eliminated or strictly limited. Data breaks are situations such as the manual transcription of values, manual data entry into lab systems, or the attachment of a document that cannot be accessed electronically. Manual transcription from an unconnected device into lab data solutions is error-prone, and because the process is time-consuming, it also reduces the amount of valuable metadata that is captured.

The data architecture helps define the dataflow and workflows for acquiring laboratory data. In connection with data operation management, the dataflow between IT systems, analytical equipment and measurement data can be defined and implemented step-by-step in a meaningful way.
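A minimal sketch of one such dataflow step, converting a raw instrument export into metadata-enriched records before ingestion, might look like this. The CSV layout, the field names and the `push_to_data_lake` stand-in are assumptions for illustration; a real pipeline would use the instrument's actual export format and the lake's ingestion API.

```python
import csv
import io
import json

def parse_device_export(csv_text: str, device: str) -> list[dict]:
    """Convert a raw instrument CSV export into metadata-enriched records."""
    rows = csv.DictReader(io.StringIO(csv_text))
    # Tag each row with its origin so the data lake keeps full provenance
    return [{**row, "device": device, "source": "automated"} for row in rows]

def push_to_data_lake(records: list[dict]) -> str:
    """Stand-in for the real ingestion step (e.g. an HTTP upload or message queue)."""
    return json.dumps(records)

# Example: an exported UV-Vis run flows into the lake without manual transcription
raw = "sample_id,absorbance\nS-001,0.412\nS-002,0.388\n"
payload = push_to_data_lake(parse_device_export(raw, device="UV-Vis-Lab2"))
```

Because the transfer is automated, every record arrives with its provenance attached, which is exactly the metadata that manual transcription tends to lose.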

Data security and access rights are a must

As explained above, digital transformation is only possible if multiple areas of data governance are implemented simultaneously.

Data security management handles access rights and data classification by limiting the global visibility of valuable data. In combination with the security concept for data, valuable data assets can be successfully protected. This stands in contradiction to data lakes and big data analytics, where full access to all data is a prerequisite. Anonymizing personal or patient data for data analytics use cases can resolve this contradiction and help fulfill regulatory requirements such as the European Union’s GDPR (General Data Protection Regulation), which has been in effect since May 25, 2018 (see [8]).
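One common technique for this (strictly speaking pseudonymization rather than full anonymization) is to replace identifiers with a keyed hash before the data enter the lake. The sketch below is a simplified illustration: the key name is a placeholder, the key would have to be stored securely outside the data lake, and this alone does not constitute a complete GDPR solution.

```python
import hashlib
import hmac

# Placeholder only -- in practice the key lives in a secrets store,
# separate from the data lake, and is never published.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymize(patient_id: str) -> str:
    """Replace a patient identifier with a keyed SHA-256 hash.

    The same ID always maps to the same token, so records can still be
    joined for analytics, but the original identity cannot be read back
    from the data lake without the key.
    """
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()

token = pseudonymize("PAT-4711")   # deterministic: same ID, same token
```

The deterministic mapping is what preserves analytical value: all measurements belonging to one patient remain linkable, while the identity itself stays protected.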

From manual to fully integrated data collection

Fig. 3 Data flow in a digital lab with automated transfer of data from the devices to the overlaying data lake

Manual data transcription offers very limited opportunities for capturing metadata; the next step, direct data acquisition, opens up new possibilities for enriching measurement data with metadata. Automating the data streams from the acquisition software to an overlaying data management system is part of the end-to-end dataflow designed to meet the data architecture requirements (see Figure 3).


Expanding data integrity through data governance principles with a strategy for data lakes and data analytics will strongly support the growth of value-added data. The data governance principles manage access rights, backup and archiving, whilst also having a strong focus on security aspects, which are increasingly becoming a key issue. Collecting useful metadata from the beginning of an experiment through to all of the results in the form of an end-to-end workflow will, in future, grant new insights by expanding the capabilities of data analytics.



[1] U.S. Food and Drug Administration, https://www.fda.gov/files/drugs/published/Data-Integrity-and-Compliance-With-Current-Good-Manufacturing-Practice-Guidance-for-Industry.pdf, 2016 Apr, accessed on 2020 Oct 02
[2] Coates, J.C. IV (2007) The Goals and Promise of the Sarbanes-Oxley Act, Journal of Economic Perspectives, 21 (1): 91-116, DOI: 10.1257/jep.21.1.91
[3] Krotova, A., Eppelsheimer, J. (2019) Was bedeutet Data Governance? Eine Clusteranalyse der wissenschaftlichen Literatur zu Data Governance, Institut der Deutschen Wirtschaft, Köln, https://www.iwkoeln.de/fileadmin/user_upload/Studien/Gutachten/PDF/2019/Gutachten_Data_Governance_DEMAND_Template.pdf, accessed on 2020 Sep 29
[4] Mijuk, G., Drug development gets big data analytics boost, https://www.novartis.com/stories/discovery/drug-development-gets-big-data-analytics-boost, Novartis, 2018 Jul 02, accessed on 2020 Oct 02
[5] Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials; Board on Health Care Services; Board on Health Sciences Policy; Institute of Medicine; Micheel CM, Nass SJ, Omenn GS, editors. Evolution of Translational Omics: Lessons Learned and the Path Forward. Washington (DC): National Academies Press (US); 2012 Mar 23. 2, Omics-Based Clinical Discovery: Science, Technology, and Applications. Available from: https://www.ncbi.nlm.nih.gov/books/NBK202165/
[6] Williams, D. (2014) Models, Metaphors and Symbols for Information and Knowledge Systems, Journal of Entrepreneurship, Management and Innovation 10 (2014), 79-107, DOI: 10.7341/20141013
[7] Dama International, The DAMA Guide to the Data Management Body of Knowledge (DAMA-DMBOK), first edition, Basking Ridge, NJ, USA, Technics Publications, April 2009
[8] Official Journal of the European Union: Regulation (EU) 2016/679 (General Data Protection Regulation) of 27 April 2016, https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679, accessed on 2020 Oct 02

Date of publication: 21-Oct-2020

