Vague feelings of poor data quality can hamper business activity by casting doubt on the data that feeds both operational and analytical processes. Yet without clearly defined measurements that show how business impacts are attributable to flawed data, it is difficult to develop an appropriate business case for improving data quality management. When resources are allocated to correcting “bad data” without evaluating the root causes, there is no way to objectively evaluate the relationship between poor data quality and business performance. This suggests a need for an approach that can quickly identify high-priority data issues whose remediation can be justified.

The objective of Knowledge Integrity’s rapid data quality assessment is to meet this need. Relying on our many years of experience in data quality analysis techniques, Knowledge Integrity analysts can execute this assessment as well as provide training and knowledge transfer, with a rapid schedule that can be adapted to best suit the client’s needs. The resulting report will describe and prioritize clearly identified data quality issues, recommendations for remediation, and suggestions for instituting data quality inspection and control for data quality monitoring. In addition, our analysts will provide templates and train your team members so that they can perform the rapid data quality assessment on subsequent data sets.

For additional details and a pricing quote, contact:
David Loshin
loshin@knowledge-integrity.com
301-754-6350

Within the current economic environment, greater scrutiny is being placed on how poor data quality impacts business operations. Knowledge Integrity has been helping our clients identify critical data quality issues and link those issues to critical business impacts, enabling some degree of transparency regarding the relationship between information and business success. Knowledge Integrity continues to research this area in alignment with different industries, such as financial services, healthcare and social services, insurance, manufacturing, consumer products, and telecommunications. A high-level discussion of our process appears in a white paper titled “The Data Quality Business Case: Projecting Return on Investment.”

David Loshin’s upcoming book on Master Data Management is being released on September 19, 2008.

The book provides insight into the business drivers for MDM, planning, data governance, architectural paradigms, and business guidance for launching an MDM program.

We are also hosting a new web site to accompany the book to keep readers up to date on MDM activities and ideas at www.mdmbook.com.

You can order the book directly from Amazon using this link:

10th Jul, 2008

Similarity and Distance

We have recently been tinkering with an open source tool for data mining, provided by the Weka project, and this has enticed us to do a little reading up on data mining techniques and how they are used.

One area of focus is the undirected approach to grouping stuff together using clustering algorithms. The basic idea is that given a set of data instances, organize them into groups such that all members of a single group are similar to each other, and every group is dissimilar to every other group (could one say, “the most diverse group of similar things?”).

In order to do this clustering, though, the algorithm must be able to determine whether two data instances are similar to each other or not, and that means some quantification of similarity. Of course, this is based on the values within the attributes of the data objects, suggesting the need for a measure of how close any pair of values is to each other. This notion of closeness is reflected in the typical approaches to gauging similarity using a Euclidean distance function (remember high school geometry?).

In essence, there are different classes of value sets: some are interval or ranked (meaning they have magnitude and/or can be ordered according to some criteria), while others are categorical (that is, discrete values with no specific ordering, such as shoe color or printer model number). Each value set must have some distance function that indicates its contribution to the similarity analysis.
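As a rough illustration of how per-attribute distance functions can feed a Euclidean-style measure, here is a minimal Python sketch. The function names, the `schema` layout, the range scaling for interval values, and the 0/1 rule for categorical values are all assumptions made for this example; they are not taken from Weka or any other tool.

```python
import math

def interval_distance(a, b, lo, hi):
    """Distance between two numeric values, scaled to [0, 1] by the attribute's range."""
    if hi == lo:
        return 0.0
    return abs(a - b) / (hi - lo)

def categorical_distance(a, b):
    """Assumed overlap rule: 0 if the categories are equal, 1 otherwise."""
    return 0.0 if a == b else 1.0

def record_distance(r1, r2, schema):
    """Combine per-attribute distances Euclidean-style.

    schema (hypothetical layout) maps an attribute name to either
    ("interval", lo, hi) or ("categorical",).
    """
    total = 0.0
    for name, spec in schema.items():
        if spec[0] == "interval":
            d = interval_distance(r1[name], r2[name], spec[1], spec[2])
        else:
            d = categorical_distance(r1[name], r2[name])
        total += d * d
    return math.sqrt(total)

# Illustrative data only.
schema = {"age": ("interval", 0, 100), "shoe_color": ("categorical",)}
a = {"age": 30, "shoe_color": "brown"}
b = {"age": 40, "shoe_color": "black"}
c = {"age": 30, "shoe_color": "brown"}

print(record_distance(a, b, schema))  # differs in both attributes
print(record_distance(a, c, schema))  # identical records
```

Scaling each interval attribute by its range keeps a wide-ranging attribute from swamping the categorical ones. A real tool might weight attributes differently.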

Some data mining tools require the user to provide the data classifications and enable them to provide distance functions, while others apply statistical analysis and employ some heuristics to assign a distance function. These functions become integral to other data mining activities as well. For example, consider:

- Case-based reasoning: this compares new situations to a model built from existing instances and outcomes. New instances are matched against the model to find the closest matches, and suggestions for actions are made based on outcome probabilities from the model.
- Classification: comparing new data instances against existing group profiles to assign the new instance to an existing group also needs similarity functions.
- Link analysis: seeking to connect individual instances together yet again requires the ability to match records against each other to determine closeness or similarity.
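To make the classification case concrete, the following Python sketch assigns a new instance to whichever group profile is closest. It assumes each group is summarized by a single numeric profile point and uses plain Euclidean distance. The names `classify` and `profiles` and the sample values are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def classify(instance, group_profiles, distance=euclidean):
    """Return the label of the group whose profile is closest to the instance."""
    return min(group_profiles,
               key=lambda label: distance(instance, group_profiles[label]))

# Illustrative profiles: each group is represented by one point.
profiles = {"small": (1.0, 1.0), "large": (9.0, 8.0)}

print(classify((2.0, 1.5), profiles))
print(classify((8.0, 9.0), profiles))
```

Because the distance function is passed in as a parameter, a different similarity measure can be swapped in without changing the classification logic.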

There are more examples that are in wide use within the descriptive and predictive analytics world, so understanding the value of good similarity functions will go a long way in applying data mining techniques!

24th Jun, 2008

Independent Analyst Forum

David Loshin will be attending an Independent Analyst Forum in Phoenix, June 30-July 2, 2008. If you are in the area and want to arrange a meeting, contact David at loshin@knowledge-integrity.com.

David Loshin will be at the Data Warehousing Institute conference in Chicago next week, teaching a course on Data Requirements Analysis.

The purpose of this session is to provide attendees with the following experiences and capabilities:

- Understand and be able to describe the importance of business information
- Understand, be able to describe, and be prepared to execute each of the activities and tasks involved in working with business information consumers to understand their needs
- Examine and use various tools and techniques that assist in the preparation, execution and evaluation of the results of stakeholder interviews
- Understand, explore and create each of the key data discovery artifacts that will help in articulating the business requirements for business information

The process for consolidating name data requires identity resolution – the ability to assess the degree of similarity between two records and determine whether they meet or exceed the threshold for presumption of a match. Usually, prior to the identity resolution step, though, the data analyst can enhance the matching by standardizing the data values used as the matching criteria within all the records into a common format. This standardization process aligns the data values in a way that simplifies the matching algorithms.
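A minimal Python sketch of this two-step idea follows: standardize the values first, then score similarity against a threshold. The standardization rules (lowercasing, stripping punctuation, collapsing whitespace), the use of the standard library's `difflib.SequenceMatcher` as the similarity measure, and the 0.85 threshold are assumptions chosen for illustration. Production identity resolution typically uses more specialized techniques.

```python
import re
from difflib import SequenceMatcher

def standardize(name):
    """Put a name into an assumed common format:
    lowercase, no punctuation, single spaces."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(name.split())

def similarity(a, b):
    """Similarity score in [0, 1] between two standardized names."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()

def is_match(a, b, threshold=0.85):
    """Presume a match when the score meets or exceeds the threshold."""
    return similarity(a, b) >= threshold

# Illustrative names only.
print(is_match("O'Brien,  Pat", "obrien pat"))
print(is_match("O'Brien, Pat", "Mary Smith"))
```

After standardization the first pair becomes identical strings, so even a simple comparison finds the match. This is the simplification that standardization is meant to buy.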

A straightforward example using person data involves a matching algorithm that looks for exact matches in the last name field. Searching for duplicates within the set of records would require every record to be compared to every other record. A standardization that could be applied to this data set would be sorting the records by last name, thereby enforcing an ordering such that only records that are sorted into the same “neighborhood” in the lexicographic ordering need to be compared (that is, no name that starts with an “A” would match a name that begins with an “S”). Sorting is one typical standardization applied to a data set to simplify the data consolidation process.
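The effect of sorting can be sketched in a few lines of Python. The records are sorted on last name, and only records that land in the same run of equal names are compared, which avoids the every-record-against-every-record comparison. The record layout, the function name, and the case-folding of names before comparison are assumptions for this example.

```python
from itertools import groupby

def duplicate_candidates(records, key="last_name"):
    """Yield pairs of records whose (case-folded) key values are equal.

    Sorting first means equal names end up adjacent, so each record is
    compared only with its neighbors in the ordering.
    """
    def fold(record):
        return record[key].casefold()

    ordered = sorted(records, key=fold)
    for _, group in groupby(ordered, key=fold):
        block = list(group)
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                yield block[i], block[j]

# Illustrative records only.
records = [
    {"id": 1, "last_name": "Smith"},
    {"id": 2, "last_name": "Adams"},
    {"id": 3, "last_name": "smith"},
    {"id": 4, "last_name": "Stone"},
]

pairs = [(a["id"], b["id"]) for a, b in duplicate_candidates(records)]
print(pairs)
```

With approximate rather than exact matching, the same idea is usually applied with a sliding window over the sorted records instead of strict equal-key groups.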

Let’s consider one aspect of any decision-making process: the assessment of risk versus reward, and at what point we are willing to dispense with worrying about the risk in search of a greater reward. From another standpoint, we can ask that question in a slightly different way: to what extent should one “over-engineer” a process in deference to ignoring the risk of low probability events having material impact on the result, requiring significant effort to remediate?

This seems like a relatively academic question, yet people are faced with this numerous times each day, from mostly benign situations (“Do I have enough time to use the restroom before the next meeting or will I miss something important?”) to the more impactful (“If I am driving 75 miles per hour, will I get a ticket?”) to downright critical (“Should the bank buy more subprime mortgage-backed securities or shut down a hedge fund and take a $4 billion write-off?” or “Should we make every airline passenger passing through security take off their shoes?”). In each of these cases, the question revolves around risk and reward; and, ultimately, this question is relevant depending on conceptual variables associated with risk. Four examples are risk tolerance, risk prevention, risk mitigation and risk deferral.

In February’s article, we started to look at a high level view of the use of metadata as a control mechanism for managing master data management (MDM), and suggested that there were seven levels associated with a metadata stack supporting the transition to a master data environment. This article focuses on one layer in this metadata abstraction – the metadata associated with services. We positioned “service metadata” as describing the abstract functionality embedded within and used by business applications, and the degree to which those functions can be described as stand-alone services, along with the mapping from service to client applications.

Master data management is largely seen as providing value to client applications by virtue of providing access to a high quality data asset of uniquely identifiable master objects synchronized across the enterprise. However, it turns out that master service consolidation is a strong motivating factor for MDM, even (at times) trumping the value of the consolidated data asset. The process of analyzing the use of master data objects exposes the ways in which different applications create, access, modify, and retire similar objects; and this analysis helps in determining which data sets represent recognized master object types. The byproduct of this analysis is not just knowledge of the master objects, but also knowledge about the functionality applied to those objects.

At a purely technical level, significant coordination is required to oversee and guide the information management aspects of an enterprise initiative such as master data management (MDM). The political and organizational aspects of this coordination are addressed as part of the governance program that must accompany an MDM program. However, all aspects of the determination of need, planning, migration strategy, and future state require a clarified view of the information about the data that is used within the organization – its metadata.

It is easy for us to fall into the trap of referring to metadata by its industry-accepted definition: “data about the data.” This relatively benign description does not provide the depth of understanding that adds value to the MDM deployment. Instead, the metadata associated with an enterprise master data set does more than just describe the size and types of each data element. It is the historically distributed application and data silos that are impacted by the variance in meaning and structure that necessitated MDM in the first place. Therefore, in order to develop a model, framework, and architecture that provide a unified view across these applications, there must be a control mechanism, or perhaps even a “clearinghouse,” for unifying the view when possible, and determining when that unification is not possible.

In fact, the scale of metadata management needed for an enterprise migration differs from the relatively simple data dictionary-style repositories that support individual applications. Sizes and types are just the tip of the iceberg. Integration of records from different data sets can only be done when it is clear that the data elements have the same meaning, that their valid data domains are consistent and that the records represent similar or the same real-world entities. Not only that, there are more complex dependencies as well:

- Do the client applications use the same entity types?
- Do the different applications use different logical names for similar objects?
- How is access for reading and writing data objects controlled?

There are also many other important variable aspects.

There is value in looking at a conceptual view of master metadata that starts with basic building blocks and grows to maintain comprehensive views of the information that is used to support the achievement of business objectives. The metadata stack described in this article is driven by business objectives from the top down and from the bottom up, and is intended to capture as much information as necessary to drive:

- The analysis of enterprise data for the purpose of structural and semantic discovery;
- The correspondence of meanings to data element types;
- The determination of master data element types;
- The models for master data object types;
- The interaction models for applications touching master data;
- The information usage scenarios for master data;
- The data quality directives;
- Access control and management;
- The determination of core master services;
- The determination of application-level master services; and
- Business policy capture and correspondence to information policies.

We can look at seven levels of metadata that are critical to master data management, starting from the bottom up:

- Business Definitions, which looks at the business terms used across the organization and the associated meanings;
- Reference Metadata, which details data domains (both conceptual domains and corresponding value domains) as well as reference data and mappings between codes and values;
- Data Element Metadata, focusing on data element definitions, structures, nomenclature, and determination of existence along a critical path of a processing stream;
- Information Architecture, coagulating the representations of data elements into cohesive entity structures, how those structures reflect real-world objects, and how those objects interact within business processes;
- Data Governance Management, which concentrates on the data rules governing data quality, data use, access control, and the protocols for rule observance (and processes for remediation of rule violations);
- Service Metadata, which looks at the abstract functionality embedded and used by the applications, and the degree to which those functions can be described as stand-alone services, along with the mapping from service to client applications; and
- Business Metadata, at the top of the stack, which captures the business policies driving application design and implementation, the corresponding information policies that drive the implementation decisions inherent in the lower levels of the stack, and the management and execution schemes for the business rules that embody both business and information policies.

Given this high-level description of a metadata stack, the challenge is to look at how these levels interact as part of an overall metadata management strategy. I keep thinking in terms of metadata as a “control panel,” since the cumulative knowledge embedded within the metadata management framework will ultimately contribute to the determination of the most appropriate methods for delivering a master data asset that is optimally suited to the organization.
