Just as an auditor controls financial processes but does not actually execute financial management, data governance ensures data is properly managed without directly executing data management. Data governance is fundamentally about organizational behavior; it is not a problem that can be solved through technology alone. However, there are tools and processes that support the overall effort, and many data governance tasks can benefit from automation. Machine learning tools can augment, end to end, both the processes involved in data governance and the personnel responsible for it. Data governance automation not only supports the overall governance framework but also provides end-to-end support, ensuring accuracy, consistency, and compliance in data management.
The Enterprise Dictionary
A data dictionary is necessary to support the use of a data warehouse. It defines the structure and contents of data sets, and can be used to manage the names, descriptions, structure, characteristics, storage requirements, default values, relationships, uniqueness, and other attributes of every data element in a model. It can take many shapes, from a paper document to a tool that encodes or automates certain policies.
In addition to technical details, data dictionaries can also include information described in business terminology. This means that instead of just listing technical specifications, they can provide explanations in language that is understandable to business users. This may include details about security restrictions governing access to the data, as well as how the data is utilized within various business processes or functions. The enterprise dictionary is normally owned by either the legal department (whose focus would be compliance) or the data office (whose focus would be standardization of the data elements used).
In the absence of a consolidated data dictionary, multiple systems could implement different date formats, which in turn leads to data mismatches and data loss when data is synchronized between source systems. Once the enterprise dictionary is defined, the various individual info-types within it can be grouped into data classes, and a policy can be defined for each data class.
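As a rough sketch, assuming a simple Python representation (the DataElement fields and example values below are illustrative, not a prescribed schema), a single entry in such a dictionary might look like this:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative sketch of one data-dictionary entry; all field names are assumptions.
@dataclass
class DataElement:
    name: str                        # canonical element name, e.g. "customer_birth_date"
    description: str                 # business-friendly description
    data_type: str                   # storage type, e.g. "DATE"
    format: str                      # agreed format, e.g. "YYYY-MM-DD"
    default_value: Optional[str] = None
    is_unique: bool = False
    related_elements: list[str] = field(default_factory=list)
    info_type: Optional[str] = None  # e.g. "DATE_OF_BIRTH", used later for classification

# A single shared format prevents the date-format mismatch described above.
birth_date = DataElement(
    name="customer_birth_date",
    description="Date of birth supplied by the customer at sign-up",
    data_type="DATE",
    format="YYYY-MM-DD",
    info_type="DATE_OF_BIRTH",
)
```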
Data Classification
A good enterprise dictionary will contain a listing of the classes of data the organization processes. Risk classifications describe the sensitivity of the data and the likelihood that it might be sought after for malicious purposes. Classifications are used to determine who can access the data. For example, an organization will not want to treat “street addresses,” “phone numbers,” “city, state,” and “zip code” differently at a granular level; rather, it must be able to set a policy that all location information for consumers is accessible only to authorized personnel.
Similarly, info-types such as street address, zip code, phone number, and IP address are grouped together under personally identifiable information (PII). These elements are easy to identify automatically, and policies are defined for all PII data. PII falls under the “restricted data” category, meaning further policies are defined for all data grouped under this heading.
The variety and kinds of data classes will vary with the business vertical and its interests. Data classes are usually maintained by a central body within the organization because policies on “types of data classes” usually affect compliance with regulation.
Data Classes and Policies
Once the data the organization handles is defined in an enterprise dictionary, policies that govern the data classes can be applied to data across multiple containers. Confidentiality classification is an important metadata characteristic, guiding how users are granted access privileges. The desired relationship is between a policy (for example, access control or retention) and a data class, rather than between a policy and an individual container.
Data classes and policies specify the types of data used within an organization, helping teams understand the nature of data processing and what actions are permissible. They answer fundamental questions about which types of data are being processed and how, and establish guidelines on what is and is not allowed when using that data.
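A minimal sketch of that relationship, with illustrative info-type, class, and policy names (all assumptions, not a standard taxonomy), might look like this:

```python
# Illustrative sketch: policies hang off data classes, not individual tables or columns.
INFO_TYPE_TO_CLASS = {
    "STREET_ADDRESS": "PII",
    "ZIP_CODE": "PII",
    "PHONE_NUMBER": "PII",
    "IP_ADDRESS": "PII",
    "PURCHASE_AMOUNT": "TRANSACTIONAL",
}

CLASS_POLICIES = {
    "PII": {"category": "restricted", "access": ["privacy-approved-roles"], "retention_days": 90},
    "TRANSACTIONAL": {"category": "internal", "access": ["analyst", "finance"], "retention_days": 2555},
}

def policy_for(info_type: str) -> dict:
    """Resolve the governing policy for a data element via its data class."""
    data_class = INFO_TYPE_TO_CLASS.get(info_type, "UNCLASSIFIED")
    return CLASS_POLICIES.get(data_class, {"category": "unclassified", "access": [], "retention_days": 0})

print(policy_for("ZIP_CODE"))  # inherits the PII / restricted policy, no per-column rule needed
```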
Metadata Management
The kinds of information that can be classified as metadata are wide-ranging. Metadata includes information about technical and business processes, data rules and constraints, and logical and physical data structures. When talking about data, data classification, and data classes, we need to discuss metadata, or the “data about data”: specifically, where it is stored and what governance controls exist on it. Organizations also need to be careful when implementing automation to manage metadata.
Metadata helps an organization understand its data, its systems, and its workflows. It describes the data itself (databases, data elements, data models), the concepts the data represents (business processes, application systems, software code, technology infrastructure) and the connections (relationships) between the data and concepts. For example, consider searching in a metadata catalog for a specific table containing customer names. While you may not have access to the table itself, knowing such a table exists is valuable (you can then request access, you can attempt to review the schema and figure out if this table is relevant, and you can avoid creating another iteration of this information if it already exists).
Data Catalog
Crucial to metadata management is a data catalog, a tool to manage this metadata. While enterprise data warehouses are capable of processing structured data within their own systems, they may not be equipped to manage metadata across various storage systems or formats. Hence the need for a tool that can operate across multiple storage systems and manage metadata from various sources and formats.
This includes where the data is and what technical information is associated with it (table schema, table name, column name, column description), but you should also allow for the attachment of additional “business” metadata, such as who in the organization owns the data, whether the data is locally generated or externally purchased, whether it relates to production use cases or testing, and so on. It is important to integrate data governance information into a data catalog as the data governance strategy expands. SCIKIQ Data Hub uses generative AI and machine learning to understand data and fill the gaps in managing the data catalog, all within the automated data governance tool SCIKIQ Control.
This integration allows for better management and organization of data by attaching specific details such as data class, data quality, sensitivity, etc. Schematizing this information enables efficient searching and filtering.
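As an illustration of such schematized catalog entries, here is a hedged sketch in Python; the field names, table names, and owners are invented for the example and do not reflect any particular catalog product:

```python
# Illustrative catalog entries combining technical and business metadata; the schema is assumed.
catalog = [
    {
        "table": "sales.customers",
        "columns": ["customer_id", "full_name", "street_address"],
        "description": "Customer master data",
        "owner": "crm-team@example.com",
        "source": "internal",          # locally generated vs. externally purchased
        "environment": "production",
        "data_class": "PII",
        "sensitivity": "restricted",
        "quality_score": 0.97,
    },
    {
        "table": "staging.web_events",
        "columns": ["event_id", "ip_address", "page"],
        "description": "Raw clickstream events",
        "owner": "web-analytics@example.com",
        "source": "internal",
        "environment": "test",
        "data_class": "PII",
        "sensitivity": "restricted",
        "quality_score": 0.71,
    },
]

# Schematized metadata makes filtering trivial, e.g. production PII tables only:
production_pii = [e["table"] for e in catalog
                  if e["data_class"] == "PII" and e["environment"] == "production"]
print(production_pii)
```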
Data Profiling
A crucial step in insight generation workflows is to review the data for outliers. Outliers, which are data points that significantly differ from the rest of the dataset, can arise for various reasons such as errors in data entry, unique occurrences, or emerging patterns. The normalization process, which involves either keeping or removing outliers, should be conducted within the context of the specific business purpose for which the data is being used. For instance, if the goal is to identify unusual patterns, outliers may be retained for analysis. Conversely, if the aim is to understand the general trends of the data, outliers might be removed to ensure a more representative dataset for deriving insights.
The reason for normalizing data is to ensure data quality and consistency, and this must be done in the context of the business purpose of the data use case. Data governance automation, aided by generative AI, can help a great deal in establishing that context. Data quality work cannot be performed without a directing business case, because “reviewing transactions” does not mean the same thing to a marketing team (which is looking for big customers) and a fraud analysis team (which is looking for provable indications of fraud). These cleanup processes ensure that using the data in applications, such as training a machine learning model, does not produce results skewed by outliers.
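As a small illustration of profiling for outliers, here is a sketch using the common interquartile-range rule; the threshold and sample values are assumptions chosen for the example:

```python
# Minimal profiling sketch: flag outliers with the interquartile-range (IQR) rule.
# Whether flagged values are kept or removed depends on the business purpose.
import statistics

def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

transactions = [12.0, 15.5, 14.2, 13.8, 950.0, 16.1, 14.9]
print(iqr_outliers(transactions))  # [950.0]: retain for fraud analysis, drop for trend analysis
```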
Data Quality
Data quality is crucial for identifying suitable use cases for a data source and ensuring its reliability for further analysis or integration with other datasets. Understanding the origin of data sources is key to assessing their quality: knowing where data comes from allows for better decision-making regarding data integration and analysis. Implementing data governance automation in this key area is a must; SCIKIQ uses generative AI to ensure data quality.
Various processes for managing data quality, including validation controls, monitoring, incident triage, root cause analysis, and recommending remedies for data issues, are essential for maintaining high-quality data throughout its lifecycle. Instilling a sense of ownership within the business units that generate data is key to improving data quality. By ensuring that data owners are accountable for the quality of their data, organizations can establish a culture of data stewardship and accountability.
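A minimal sketch of such validation controls, with rule names and thresholds chosen purely for illustration, could look like this:

```python
# Illustrative validation controls; the specific rules and thresholds are assumptions.
def run_quality_checks(rows: list[dict]) -> dict:
    total = len(rows)
    missing_email = sum(1 for r in rows if not r.get("email"))
    bad_amount = sum(1 for r in rows if r.get("amount", 0) < 0)
    return {
        "row_count": total,
        "email_completeness": 1 - missing_email / total if total else 0.0,
        "negative_amounts": bad_amount,
        "passed": total > 0 and missing_email / total < 0.05 and bad_amount == 0,
    }

sample = [{"email": "a@example.com", "amount": 10.0}, {"email": "", "amount": -3.0}]
print(run_quality_checks(sample))  # failing checks feed incident triage and root-cause analysis
```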
Lineage Tracking
Data does not live in a vacuum; rather, it is generated from specific sources, undergoes transformations, and is combined with other data to support insights. Data lineage refers to tracking the origin and transformation of data, which is crucial for understanding its context and ensuring its reliability for decision-making.
There are two key reasons why data lineage tracking is important. The first is understanding the quality of resulting dashboards: if a dashboard is created using high-quality data that is later combined with lower-quality data, the dashboard’s insights can be misinterpreted. The second is ensuring the security of sensitive data: by tracking the movement of sensitive data across an organization’s data landscape, lineage tracking helps prevent the accidental exposure of sensitive information in unauthorized containers.
Lineage tracking is also important when thinking about explaining decisions. By identifying the information that feeds a decision-making algorithm, you can later rationalize why some business decisions (for example, loan approvals) were made in a certain way in the past and will be made in a certain way in the future. By making business decisions explainable (past transactions explaining a current decision) and keeping this information transparent to data users, you practice good data governance.
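As a sketch of lineage tracking, the toy example below records which upstream datasets feed each derived asset and walks the graph to find every source; the asset names and graph structure are assumptions for illustration only:

```python
# Minimal lineage sketch: map each derived asset to the upstream datasets it is built from.
lineage = {
    "dashboard.revenue": ["mart.daily_sales"],
    "mart.daily_sales": ["raw.orders", "raw.refunds"],
    "model.loan_approval": ["mart.credit_features"],
    "mart.credit_features": ["raw.transactions", "external.credit_bureau"],
}

def upstream(asset: str) -> set[str]:
    """Walk the lineage graph to find every source feeding an asset."""
    sources = set()
    for parent in lineage.get(asset, []):
        sources.add(parent)
        sources |= upstream(parent)
    return sources

print(upstream("dashboard.revenue"))
# If raw.refunds is low quality or holds sensitive fields, every downstream asset is now visible.
```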
Data Retention and Data Deletion
One of the crucial elements of effective data governance is the ability to control how long data is kept. Retaining personally identifiable information (PII) raises challenges around privacy, consent, and transparency. Setting shorter retention periods for PII, especially in specific contexts such as location data collected during a commute, can simplify compliance and mitigate the risks associated with data storage.
When talking about data retention and data deletion, we are often thinking about them in the context of how to treat sensitive data: whether to retain it, encrypt it, delete it, or handle it in some other way. Proper governance policies not only ensure compliance with regulations but also protect against potential data breaches and lost work. By establishing clear guidelines for data retention and deletion, organizations can mitigate risks and safeguard their data assets effectively.
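A minimal sketch of such a retention policy, with data classes and retention periods chosen only for illustration, might look like this:

```python
# Illustrative retention sketch: retention periods hang off data classes,
# and expired records are flagged for deletion. The periods shown are assumptions.
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = {"PII": 90, "LOCATION": 30, "TRANSACTIONAL": 2555}

def is_expired(data_class: str, created_at: datetime) -> bool:
    limit = RETENTION_DAYS.get(data_class)
    if limit is None:
        return False  # no policy defined: keep, but flag for governance review
    return datetime.now(timezone.utc) - created_at > timedelta(days=limit)

record_created = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(is_expired("LOCATION", record_created))  # True: commute location data ages out quickly
```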
Authentication and Access Management
The workflow management for data acquisition involves several steps facilitated by a robust data governance plan. It begins with an analyst seeking data, accessing the organization’s data catalog, and identifying relevant data sources. Access to the data is then requested and granted through governance controls, ensuring safety and compliance.
Identity and Access Management is crucial in controlling data access. Authentication verifies user identity, typically through a combination of passwords, second-factor authentication, and sometimes biometrics. Additional contextual factors, such as device or time of access, enhance security. User authorization follows authentication, determining access rights based on predefined policies. These policies dictate actions such as reading data, editing metadata, updating content, or performing ETL operations.
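A hedged sketch of the authorization step, using illustrative roles and actions rather than any real IAM system’s API, could look like this:

```python
# Illustrative role-based authorization check; role and action names are assumptions.
ROLE_PERMISSIONS = {
    "analyst":       {"read_data", "read_metadata"},
    "data_steward":  {"read_data", "read_metadata", "edit_metadata"},
    "data_engineer": {"read_data", "read_metadata", "update_content", "run_etl"},
}

def is_authorized(role: str, action: str) -> bool:
    """Authorization runs only after authentication has verified who the user is."""
    return action in ROLE_PERMISSIONS.get(role, set())

print(is_authorized("analyst", "run_etl"))        # False
print(is_authorized("data_engineer", "run_etl"))  # True
```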
Security
Data must be governed at the right levels with the right approaches to provide defense in depth. Organizations across industries should adopt a multi-faceted approach and use data governance automation. This includes diligently tracking data lineage and quality, specifying protection levels, and classifying data based on sensitivity.
When utilizing cloud services, ensuring data isolation and robust security for virtual machines is essential. Physical security should be verified, both in cloud provider infrastructure and on-premises setups. Securing data in transit through encryption and implementing role-based access controls are crucial aspects of identity and access management. Differential privacy can enhance data sharing for research while preserving privacy. Additionally, organizations should establish effective audit logs, employ data loss prevention techniques, and maintain access transparency through continuous monitoring using automated tools.
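As a small illustration of audit logging for access transparency, here is a sketch that records each data access as a structured log entry; the wrapper and log fields are assumptions, not a specific product’s mechanism:

```python
# Minimal audit-logging sketch: every data access is recorded for later review.
import functools
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_logger = logging.getLogger("audit")

def audited(func):
    """Wrap a data-access function so each call emits a structured audit record."""
    @functools.wraps(func)
    def wrapper(user: str, resource: str, *args, **kwargs):
        audit_logger.info(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": func.__name__,
            "resource": resource,
        }))
        return func(user, resource, *args, **kwargs)
    return wrapper

@audited
def read_table(user: str, resource: str) -> str:
    return f"{user} read {resource}"

read_table("analyst@example.com", "sales.customers")
```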
The SCIKIQ data platform is specifically designed to enhance Data governance automation and management by offering actionable insights. It encompasses a comprehensive set of features, including data cataloguing, metadata management, discovery capabilities, change detection mechanisms, data quality, and access control functionalities. The platform has evolved beyond merely fulfilling cost and compliance needs to become a catalyst for business growth and innovation.