Despite the relentless urging of the data quality gurus, not all data issues can be prevented or managed proactively. Data volumes are growing exponentially, data is becoming more varied, and ensuring quality grows harder as organizations fuse data from uncontrolled sources. Although data correction is not necessarily the desired choice, many data sets will remain unusable unless data standardization and cleansing methods are applied.
This tutorial focuses on three fundamental algorithmic techniques used for data quality and cleansing. Data standardization is a process for transforming data values into their recognized standard forms. Identity resolution employs both deterministic and probabilistic methods for determining that two records refer to the same entity. Record linkage uses standardization and identity resolution to link sets of records together so that the “desired values” can be selected for updating, cleansing, or correction.
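As a taste of the first technique, here is a minimal sketch of rule-based standardization. The abbreviation table and field choice are hypothetical illustrations, not part of the tutorial material:

```python
import re

# Hypothetical lookup table mapping common variants to a canonical form.
STREET_ABBREVIATIONS = {
    "st": "street", "st.": "street",
    "ave": "avenue", "ave.": "avenue",
    "rd": "road", "rd.": "road",
}

def standardize_address(raw):
    """Lowercase, strip stray punctuation, and expand known abbreviations
    so equivalent values compare equal."""
    tokens = re.sub(r"[^\w\s.]", "", raw.lower()).split()
    return " ".join(STREET_ABBREVIATIONS.get(t, t.rstrip(".")) for t in tokens)

print(standardize_address("123 Main St."))   # "123 main street"
print(standardize_address("123 MAIN Street"))  # "123 main street"
```

Once both records yield the same standardized string, a downstream matcher can treat them as agreeing on that field.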
Attendees will learn about:
- Using data standards
- How standardization works
- Deterministic identity resolution
- Probabilistic identity resolution
- Record linkage
- Data quality and master data management
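To make the deterministic/probabilistic distinction from the list above concrete, the sketch below contrasts the two approaches. The field names, weights, and the use of a simple string-similarity ratio are illustrative assumptions, not the tutorial's prescribed method:

```python
from difflib import SequenceMatcher

def deterministic_match(rec_a, rec_b, keys=("customer_id",)):
    """Deterministic resolution: records match only if they agree
    exactly on every key field (hypothetical key name)."""
    return all(rec_a.get(k) and rec_a.get(k) == rec_b.get(k) for k in keys)

def probabilistic_score(rec_a, rec_b, weights):
    """Probabilistic resolution: weighted average of per-field
    string similarities, yielding a score in [0, 1]."""
    total = sum(weights.values())
    score = sum(
        w * SequenceMatcher(None, rec_a.get(f, ""), rec_b.get(f, "")).ratio()
        for f, w in weights.items()
    )
    return score / total

a = {"name": "John Smith", "city": "Springfield"}
b = {"name": "Jon Smith",  "city": "Springfield"}

# No shared key, so the deterministic rule says "no match"...
print(deterministic_match(a, b))                 # False
# ...but the fuzzy score is high enough to link the records
# against an illustrative threshold of 0.9.
score = probabilistic_score(a, b, {"name": 0.7, "city": 0.3})
print(score > 0.9)                               # True
```

Record linkage then groups records whose scores exceed the threshold, so the "desired values" can be selected from the linked set.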