David Loshin’s upcoming book on Master Data Management is being released on September 19, 2008.

The book provides insight into the business drivers for MDM, planning, data governance, architectural paradigms, and business guidance for launching an MDM program.

We are also hosting a new web site to accompany the book, www.mdmbook.com, to keep readers up to date on MDM activities and ideas.

You can order the book directly from Amazon using this link:

10th Jul, 2008

Similarity and Distance

We have recently been tinkering with an open source tool for data mining, provided by the Weka project, and this has enticed us to do a little reading up on data mining techniques and how they are used.

One area of focus is the undirected approach to grouping stuff together using clustering algorithms. The basic idea is that given a set of data instances, organize them into groups such that all members of a single group are similar to each other, and every group is dissimilar to every other group (could one say, “the most diverse group of similar things?”).

In order to do this clustering, though, the algorithm must be able to determine whether two data instances are similar to each other or not, and that means some quantification of similarity. Of course, this is based on the values within the attributes of the data objects, suggesting the need for a measure of how close any pair of values is. This notion of closeness is reflected in the typical approaches to gauging similarity using a Euclidean distance function (remember high school geometry?).

In essence, there are different classes of value sets: some are interval or ranked (meaning they have magnitude and/or can be ordered according to some criteria), while others are categorical (that is, discrete values with no specific ordering, such as shoe color or printer model number). Each value set must have some distance function that indicates its contribution to the similarity analysis.
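
To make this concrete, here is a minimal sketch of how per-attribute distance functions might be combined into a single Euclidean-style record distance. The function names, the `schema` layout, and the choice to scale interval values by their range are our own assumptions for illustration; they do not describe how Weka or any other tool implements distance.

```python
import math

def interval_distance(a, b, value_range):
    """Distance between two numeric values, scaled to [0, 1] by the attribute's range."""
    return abs(a - b) / value_range if value_range else 0.0

def categorical_distance(a, b):
    """Categorical values have no ordering: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0

def record_distance(r1, r2, schema):
    """Combine per-attribute distances as the square root of the sum of squares.

    schema (a hypothetical layout) maps attribute name to
    ("interval", range) or ("categorical", None).
    """
    total = 0.0
    for name, (kind, value_range) in schema.items():
        if kind == "interval":
            d = interval_distance(r1[name], r2[name], value_range)
        else:
            d = categorical_distance(r1[name], r2[name])
        total += d * d
    return math.sqrt(total)

# Illustrative data only.
schema = {"age": ("interval", 50.0), "shoe_color": ("categorical", None)}
a = {"age": 30, "shoe_color": "brown"}
b = {"age": 40, "shoe_color": "brown"}
c = {"age": 30, "shoe_color": "black"}
```

With these sample records, `a` is closer to `b` (same shoe color, ages ten years apart) than to `c` (same age, different shoe color), because a categorical mismatch contributes the maximum per-attribute distance.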

Some data mining tools require the user to provide the data classifications and enable the user to provide distance functions, while others apply statistical analysis and employ heuristics to assign a distance function. These functions become integral to other data mining activities as well. For example, consider:

- Case-based reasoning: this compares new situations to a model built from existing instances and outcomes. New instances are matched against the model to find the closest matches, and actions are suggested based on outcome probabilities from the model.
- Classification: comparing new data instances against existing group profiles to assign the new instance to an existing group also needs similarity functions.
- Link analysis: seeking to connect individual instances together yet again requires the ability to match records against each other to determine closeness or similarity.
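
As one illustration of the classification case above, the following sketch assigns a new instance to whichever existing group profile is closest. It assumes each profile is summarized by a single numeric point (a centroid), which is our simplification; the names `classify` and `profiles` are hypothetical.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def classify(instance, profiles):
    """Return the name of the group profile closest to the instance."""
    return min(profiles, key=lambda name: euclidean(instance, profiles[name]))

# Illustrative profiles only.
profiles = {"small": (1.0, 1.0), "large": (10.0, 10.0)}
label = classify((2.0, 3.0), profiles)
```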

There are more examples that are in wide use within the descriptive and predictive analytics world, so understanding the value of good similarity functions will go a long way in applying data mining techniques!

24th Jun, 2008

Independent Analyst Forum

David Loshin will be attending an Independent Analyst Forum in Phoenix, June 30-July 2, 2008. If you are in the area and want to arrange a meeting, contact David at loshin@knowledge-integrity.com.

David Loshin will be at the Data Warehousing Institute conference in Chicago next week, teaching a course on Data Requirements Analysis.

The purpose of this session is to provide attendees with the following experiences and capabilities:

- Understand and be able to describe the importance of business information
- Understand, be able to describe and be prepared to execute each of the activities and tasks to work with business information consumers to understand their needs
- Examine and use various tools and techniques that assist in the preparation, execution and evaluation of the results of stakeholder interviews
- Understand, explore and create each of the key data discovery artifacts that will help in articulating the business requirements for business information

The process for consolidating name data requires identity resolution – the ability to assess the degree of similarity between two records and determine whether they meet or exceed the threshold for presumption of a match. Usually, prior to the identity resolution step, though, the data analyst can enhance the matching by standardizing the data values used as the matching criteria within all the records into a common format. This standardization process aligns the data values in a way that simplifies the matching algorithms.
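
A rough sketch of the two steps, standardization followed by a thresholded similarity test, might look like the following. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the standardization rules and the 0.85 threshold are arbitrary assumptions for illustration, not recommended settings.

```python
import re
from difflib import SequenceMatcher

def standardize(name):
    """Lowercase, drop anything that is not a letter or whitespace, collapse spaces."""
    cleaned = re.sub(r"[^a-z\s]", "", name.lower())
    return " ".join(cleaned.split())

def similarity(a, b):
    """Similarity score in [0, 1] computed on the standardized values."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()

def is_match(a, b, threshold=0.85):
    """Presume a match when the score meets or exceeds the (assumed) threshold."""
    return similarity(a, b) >= threshold
```

Because both values are standardized first, `is_match("Loshin, David", "loshin  david")` succeeds even though the raw strings differ in case, punctuation, and spacing.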

A straightforward example using person data involves a matching algorithm that looks for exact matches in the last name field. Searching for duplicates within the set of records would require every record to be compared to every other record. A standardization that could be applied to this data set would be sorting the records by last name, thereby enforcing an ordering such that only records that are sorted into the same “neighborhood” in the lexicographic ordering need to be compared (that is, no name that starts with an “A” would match a name that begins with an “S”). Sorting is one typical standardization applied to a data set to simplify the data consolidation process.
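
The effect of sorting can be sketched in a few lines. Here the records are ordered by last name and only records sharing the same last name are paired for comparison; the record layout (`id` and `last` keys) is a hypothetical one chosen for the example.

```python
from itertools import groupby

def candidate_pairs(records):
    """Sort by last name, then pair up only records that share a last name."""
    ordered = sorted(records, key=lambda r: r["last"])
    pairs = []
    for _, group in groupby(ordered, key=lambda r: r["last"]):
        block = list(group)
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                pairs.append((block[i]["id"], block[j]["id"]))
    return pairs

# Illustrative data only.
records = [
    {"id": 1, "last": "Smith"},
    {"id": 2, "last": "Adams"},
    {"id": 3, "last": "Smith"},
    {"id": 4, "last": "Adams"},
    {"id": 5, "last": "Jones"},
]
pairs = candidate_pairs(records)
```

For these five records, comparing every record to every other would take ten comparisons, while the sorted approach produces only two candidate pairs.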

Let’s consider one aspect of any decision-making process: the assessment of risk versus reward, and at what point we are willing to dispense with worrying about the risk in search of a greater reward. From another standpoint, we can ask that question in a slightly different way: to what extent should one “over-engineer” a process rather than ignore the risk of low-probability events that would have a material impact on the result and require significant effort to remediate?

This seems like a relatively academic question, yet people are faced with this numerous times each day, from mostly benign situations (“Do I have enough time to use the restroom before the next meeting or will I miss something important?”) to the more impactful (“If I am driving 75 miles per hour, will I get a ticket?”) to downright critical (“Should the bank buy more subprime mortgage-backed securities or shut down a hedge fund and take a $4 billion write-off?” or “Should we make every airline passenger passing through security take off their shoes?”). In each of these cases, the question revolves around risk and reward; and, ultimately, this question is relevant depending on conceptual variables associated with risk. Four examples are risk tolerance, risk prevention, risk mitigation and risk deferral.

In February’s article, we started to look at a high-level view of the use of metadata as a control mechanism for master data management (MDM), and suggested that there were seven levels associated with a metadata stack supporting the transition to a master data environment. This article focuses on one layer in this metadata abstraction – the metadata associated with services. We positioned “service metadata” as describing the abstract functionality embedded within and used by business applications, and the degree to which those functions can be described as stand-alone services, along with the mapping from service to client applications.

Master data management is largely seen as providing value to client applications by virtue of providing access to a high quality data asset of uniquely identifiable master objects synchronized across the enterprise. However, it turns out that master service consolidation is a strong motivating factor for MDM, even (at times) trumping the value of the consolidated data asset. The process of analyzing the use of master data objects exposes the ways in which different applications create, access, modify, and retire similar objects; and this analysis helps in determining which data sets represent recognized master object types. The byproduct of this analysis is not just knowledge of the master objects, but also knowledge about the functionality applied to those objects.
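
One way to picture this byproduct is as a simple inventory: record which operations each application performs on each object type, then look for functionality that more than one application implements. The sketch below does exactly that. The application names, the `usage` structure, and the rule that "shared by two or more applications" flags a candidate master service are all assumptions made for illustration.

```python
from collections import defaultdict

def candidate_master_services(usage):
    """Find (object_type, operation) pairs performed by more than one application.

    usage maps application name to a set of (object_type, operation) pairs.
    Returns a dict from each shared pair to the sorted list of applications using it.
    """
    apps_by_function = defaultdict(set)
    for app, functions in usage.items():
        for fn in functions:
            apps_by_function[fn].add(app)
    return {fn: sorted(apps) for fn, apps in apps_by_function.items() if len(apps) > 1}

# Illustrative inventory only.
usage = {
    "billing": {("customer", "create"), ("customer", "modify")},
    "crm": {("customer", "create"), ("customer", "retire")},
    "shipping": {("customer", "access")},
}
shared = candidate_master_services(usage)
```

In this toy inventory, both billing and CRM create customer records, so "create customer" surfaces as functionality that might be consolidated into a single service.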

At a purely technical level, significant coordination is required to oversee and guide the information management aspects of an enterprise initiative such as master data management (MDM). The political and organizational aspects of this coordination are addressed as part of the governance program that must accompany an MDM program. However, all aspects of the determination of need, planning, migration strategy, and future state require a clarified view of the information about the data that is used within the organization – its metadata.

It is easy for us to fall into the trap of referring to metadata by its industry accepted definition: “data about the data.” This relatively benign description does not provide the depth of understanding that adds value to the MDM deployment. Instead, the metadata associated with an enterprise master data set does more than just describe the size and types of each data element. It is the historically distributed application and data silos that are impacted by the variance in meaning and structure that necessitated MDM in the first place. Therefore, in order to develop a model, framework, and architecture that provide a unified view across these applications, there must be a control mechanism, or perhaps even a “clearinghouse” for unifying the view when possible, and determining when that unification is not possible.

In fact, the scale of metadata management needed for an enterprise migration differs from the relatively simple data dictionary-style repositories that support individual applications. Sizes and types are just the tip of the iceberg. Integration of records from different data sets can only be done when it is clear that the data elements have the same meaning, that their valid data domains are consistent and that the records represent similar or the same real-world entities. Not only that, there are more complex dependencies as well:

- Do the client applications use the same entity types?
- Do the different applications use different logical names for similar objects?
- How is access for reading and writing data objects controlled?

There are also many other important variable aspects.

There is value in looking at a conceptual view of master metadata that starts with basic building blocks and grows to maintain comprehensive views of the information that is used to support the achievement of business objectives. The metadata stack described in this article is driven by business objectives from the top down and from the bottom up, and is intended to capture as much information as necessary to drive:

- The analysis of enterprise data for the purpose of structural and semantic discovery;
- The correspondence of meanings to data element types;
- The determination of master data element types;
- The models for master data object types;
- The interaction models for applications touching master data;
- The information usage scenarios for master data;
- The data quality directives;
- Access control and management;
- The determination of core master services;
- The determination of application-level master services; and
- Business policy capture and correspondence to information policies.

We can look at seven levels of metadata that are critical to master data management, starting from the bottom up:

- Business Definitions, which looks at the business terms used across the organization and the associated meanings;
- Reference Metadata, which details data domains (both conceptual domains and corresponding value domains) as well as reference data and mappings between codes and values;
- Data Element Metadata, focusing on data element definitions, structures, nomenclature, and determination of existence along a critical path of a processing stream;
- Information Architecture, coagulating the representations of data elements into cohesive entity structures, how those structures reflect real-world objects, and how those objects interact within business processes;
- Data Governance Management, which concentrates on the data rules governing data quality, data use, access control, and the protocols for rule observance (and processes for remediation of rule violations);
- Service Metadata, which looks at the abstract functionality embedded and used by the applications, and the degree to which those functions can be described as stand-alone services, along with the mapping from service to client applications; and
- Business Metadata, at the top of the stack, which captures the business policies driving application design and implementation, the corresponding information policies that drive the implementation decisions inherent in the lower levels of the stack, and the management and execution schemes for the business rules that embody both business and information policies.

Given this high-level description of a metadata stack, the challenge is to look at how these levels interact as part of an overall metadata management strategy. I keep thinking in terms of metadata as a “control panel,” since the cumulative knowledge embedded within the metadata management framework will ultimately contribute to the determination of the most appropriate methods for delivering a master data asset that is optimally suited to the organization.

Assembling a master data management program by necessity requires the consolidation of data into a data management framework that enables a consistent view of the uniquely identifiable objects used across the application architecture. However, as there are different approaches to managing the unique representation, we may find that there are characteristics associated with maintaining the synchronization and consistency of master data that impose constraints when implementing master data consolidation and integration services.

For example, in a full master repository in which the data sources are combined into a single copy (such as the transaction hub style), by fiat, for each of the applications, all the data is synchronized since there is only one copy. On the other hand, in a thin registry architecture, records maintaining bits and pieces of master data are sprinkled across a federated data environment, in which case there will be situations in which local copies of master data associated with specific application silos are inconsistent. In fact, under the hood, the actual implementation of any master data management (MDM) architecture along the spectrum between registry and full repository may distribute and replicate master copies, in which case it is also subject to inconsistency at some point during operations.

In order to determine which architectural style is appropriate, it will be necessary for the MDM architect to assess the enterprise applications’ requirements for master data synchronization. Some operational environments may be very tolerant of inconsistency, allowing for batch consolidation on a periodic (e.g., nightly) basis, while others may require a high degree of consistency requiring immediate synchronization. Assessing the application environment suggests reviewing the application requirements for synchronization based on these synchronization dimensions:

- Timeliness – ensuring the timely availability of master data, or specifying the enterprise-wide expectation for newly introduced data fully integrated and available within the master environment;
- Latency – modulating the time it takes to deliver requested master data, as a way of monitoring application performance;
- Currency – ensuring “freshness” of master data;
- Consistency – the degree to which each application’s view is not different from any other application’s view;
- Coherence – maintaining synchronization of the views of master data managed within local copies;
- Determinism/Idempotence – asserting that issuing the same request for data results in same answer each time.

We can assess the degree of synchronization associated with each of the master data management architectural styles. In the registry master data management architecture, a thin master index maintains identifying information along with pointers to the data sources in which records referring to that master data object reside. Newly introduced data may be registered within the index and is available at the operational level as soon as it is persisted into the owning application’s data resource; therefore, timeliness is high, as is data currency and, accordingly, latency is low. However, since each application may have distinct records for each master object, the degree of consistency is low, and the variance between local copies also means that coherence will be low. In addition, for accesses to a conformed master record that is materialized and consolidated on demand, each request may result in slightly different views, so we might rate its determinism as low.
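
A toy version of the registry style helps show why these ratings come out the way they do. In the sketch below, the index holds only pointers, and a consolidated record is built on demand from whatever the source systems contain at that moment. The class name, the data layout, and the "later source overrides earlier source" consolidation rule are our own assumptions for illustration, not a description of any MDM product.

```python
class Registry:
    """Thin master index: master id -> list of (source name, local record key)."""

    def __init__(self, sources):
        self.sources = sources  # source name -> {local key: record dict}
        self.index = {}         # master id -> [(source name, local key)]

    def register(self, master_id, source, key):
        """Record a pointer to a source record; no data is copied."""
        self.index.setdefault(master_id, []).append((source, key))

    def materialize(self, master_id):
        """Build a consolidated view on demand.

        Assumed rule: attributes from later-registered sources override earlier ones.
        """
        view = {}
        for source, key in self.index.get(master_id, []):
            view.update(self.sources[source][key])
        return view

# Illustrative data only.
sources = {
    "crm": {"c-17": {"name": "P. Smith", "city": "Springfield"}},
    "billing": {"b-42": {"name": "Pat Smith"}},
}
registry = Registry(sources)
registry.register("m1", "crm", "c-17")
registry.register("m1", "billing", "b-42")

first_view = registry.materialize("m1")
sources["crm"]["c-17"]["city"] = "Shelbyville"  # the owning application changes its record
second_view = registry.materialize("m1")
```

The second request reflects the source change immediately (high currency), but the same request has now returned two different answers, and the two sources still disagree on the name (low determinism and low consistency).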

For the full repository, there is only one master copy. Newly introduced records are committed directly to the master repository; therefore, timeliness and currency are high, as is the assurance of consistency across applications. In the absence of local copies, there is little concern regarding coherence, and accesses to the same record should almost always return the same result, so determinism is high.

For the hybrid model, in which a registry is augmented to hold some collection of master attributes that may be copied back to application data environments, it is difficult to assess any of these dimensions without a greater understanding of the implementation. A federated model may essentially reflect a cache paradigm, in which applications make copies of what is in the master repository; and as modifications are made to the local copy, they are forwarded back to the master. Actually, the issues associated with the hybrid model’s synchronization (as well as federated implementations of the transaction hub) are interesting enough to warrant deeper exploration, which we will target in an upcoming article.
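
The cache paradigm can be sketched as follows. Each application keeps a local copy of the master records it has read and forwards its modifications back to the master. The class and method names are hypothetical, and the sketch deliberately omits any mechanism for notifying other applications of a change, which is exactly where the coherence question arises.

```python
class MasterRepository:
    """Holds the master copy of each record."""

    def __init__(self):
        self.records = {}

    def read(self, key):
        return dict(self.records[key])  # hand out a copy, not the master itself

    def write(self, key, record):
        self.records[key] = dict(record)

class LocalCopy:
    """Application-side cache that forwards modifications back to the master."""

    def __init__(self, master):
        self.master = master
        self.cache = {}

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.master.read(key)
        return self.cache[key]

    def update(self, key, **changes):
        record = self.get(key)
        record.update(changes)
        self.master.write(key, record)  # forward the modification to the master

# Illustrative data only.
master = MasterRepository()
master.write("m1", {"name": "Pat Smith", "phone": "555-0100"})
app_a = LocalCopy(master)
app_b = LocalCopy(master)

app_b.get("m1")                       # app_b caches the record
app_a.update("m1", phone="555-0199")  # app_a changes it and forwards the change
```

After the update, the master and application A agree on the new phone number, but application B still holds the old value in its local copy until something refreshes it.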

However, the real issue is not the variant synchronization attributes of the MDM architecture styles, but the determination of business application horizontal requirements for the synchronization dimensions. This doesn’t look at consistency of data within one processing stream, but the interaction of applications with respect to the set of the same master objects. Of course, this cannot be done in a vacuum and requires a generalized view of the entire business process architecture prior to making any kind of architecture implementation decision.

We have been advocating a particular approach to understanding the data requirements associated with application development that focuses on three concepts:

- Identifying the information needs for business information,
- Locating and assessing the suitability of candidate source data sets to satisfy the business information needs, and
- Qualifying selected data source candidates and developing a data integration strategy.

Each of these phases warrants its own elaboration, but as a preface to doing any of these tasks, it is necessary to characterize how the business application under development is supporting the organization’s business processes, and this introduces a critical step of describing and documenting those business processes.

Apparently, though, many technical application environments evolved in a way that supports the operational aspect of business needs, with each application (or subsequent enhancement) developed based on functional requirements specific to business operations. These applications, once moved into production, become the focus of scrutiny any time there is a change in the business environment regarding an application update, since there is a desire to reduce or eliminate any risks introduced by any modification needed to support that change, and this has been the driver for system development life cycle management, change control boards, and other IT and system governance activities.

So at the point that one is ready to begin evaluating business information needs for new applications, one of the steps in documenting the business process involves interviewing both the IT staff and the business subject matter experts about the business flow, what tasks and activities exist, how those tasks are ordered, and the data that is shared among them. Interestingly, on a number of occasions the interview process hits a few bumps in the road, and as we talk with both IT staff members and business SMEs, we find that instead of describing the business process, they describe the business application. In other words, instead of providing insight into the way that the business works, we are provided with a description of how the business application works, with details about which programs are used, how they are invoked, which databases are used, the kinds of transactions performed, etc.

This introduces two questions. The first is pragmatic: how do we bring the conversation back to focus on the business process and not its implementation? There is a deeper underlying challenge here because the perceived unification of the business process with its application supersedes consideration of the difference between existing business information needs and future business information needs. One quick example: during an exploration to evaluate needs for building a data warehouse, one of the client’s initial drivers was to optimize the creation of the reports produced by the existing management information system. Yet deeper probing showed that there were not any significant business processes that incorporated those management reports into any relevant activity, which not only brought the migration to a data warehouse into question, but also the existing reporting systems as well.

A better approach would have been to evaluate the organization’s business objectives to seek out ways that data that was currently available could be used in reports that would drive optimization or uncover new opportunities. Therefore, reflecting the business processes and documenting how the applications support them might have better impacts when determining new application development opportunities. In essence, whenever the conversation moves toward describing the underlying technology, halt the discussion and reframe the question to concentrate on the business activities, not how they are performed.

The second question is philosophical: at what point does the business application become the business process? There is some point where, logically, the activities performed by the business application define the process, sometimes to the point where changes to the application define changes to the business process. An example is at our children’s school: they are converting the cafeteria from a pay-by-ticket process to an electronic process with fingerprint biometrics attached to an account for each student from which school lunch costs are deducted. Changing the technology necessitates a change in the business process, requiring parents to fund accounts proactively instead of the current pay-as-you-go approach.

The answer to that philosophical question may drive the way that applications are developed in general. If we look at applications that support business processes, then essentially the application components should be equivalent to business process tasks, such as those that can be encapsulated as services in a services-oriented architecture. However, that does not preclude the need to understand business process flows and ways to capture a representative picture of how the business is run, not how the application runs the business. Therefore, a desirable happy medium would be a business process modeling framework that enables business activity encapsulation within a service set. This might incorporate semantic definitions of business services (“service as an object?”) and might even be suitable to be managed from within an object metadata repository – yet another opportunity to exploit business metadata.
