Friday 6 July 2012

Guide to Master Data Management

Master Data Management


In computing, Master Data Management (MDM) comprises a set of processes and tools that consistently defines and manages the master data (i.e. non-transactional data entities) of an organization (which may include reference data).

MDM Classification


An MDM system can be classified into the following categories:

1.      Candidate Only MDM: This type of MDM system stores the candidate data from the various sources, establishes the relationships between these sources, and calculates the Golden Record at runtime. These systems work on a pull-publish model, so the data in the MDM system is as current as the latest update from any participating source. This model is used when the precedence of data sources for various attributes changes dynamically.

2.      Composite Only Model: This model does not store the candidate information. It maintains only one set of information per attribute; whenever an update arrives, it updates the Golden Record directly. This model is used when a clear precedence of data sources is defined (e.g. every data source has equal precedence).

3.      Hybrid Model: This is a combination of the above two approaches: the candidate records as well as the Golden (composite) record are stored in the data store. This model is used when multiple data sources with different precedence for various attributes are present. A minimal table sketch of such a store follows.
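
As a rough illustration of the hybrid approach, the candidate and composite stores could be modeled as below. The table and column names (a Customer entity with just a name, an address, and a country code) are assumptions for the sketch, not part of any specific product.

  CREATE TABLE CandidateCustomer (
    CandidateId   INT IDENTITY PRIMARY KEY,
    SourceSystem  VARCHAR(20)   NOT NULL,    -- e.g. DNB, USPS
    SourceKey     VARCHAR(50)   NOT NULL,    -- key of the record in the source system
    CustomerName  NVARCHAR(200),
    Address       NVARCHAR(400),
    CountryCode   CHAR(2),
    LastUpdated   DATETIME      NOT NULL
  );

  CREATE TABLE CompositeCustomer (           -- the Golden Record
    CompositeId   INT IDENTITY PRIMARY KEY,
    CustomerName  NVARCHAR(200),
    Address       NVARCHAR(400),
    CountryCode   CHAR(2),
    LastUpdated   DATETIME      NOT NULL
  );

  -- Links each candidate to the composite (Golden) record it contributes to
  CREATE TABLE CandidateCompositeLink (
    CandidateId   INT NOT NULL REFERENCES CandidateCustomer(CandidateId),
    CompositeId   INT NOT NULL REFERENCES CompositeCustomer(CompositeId),
    PRIMARY KEY (CandidateId, CompositeId)
  );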

Data Governance Council


A Data Governance Council (DGC) is needed to maintain a Master Data Management system. The following could be the activities of the Data Governance Council:

1.      Identification of the entities and attributes that qualify to be stored as master data. The DGC will identify each attribute and define it completely and unambiguously. The DGC should ensure that there are no multiple definitions of an attribute across the enterprise.

2.      Each and every attribute that makes an entry into the MDM system needs to be defined and approved by the DGC.

3.      The DGC should also define the relationships between the various entities and attributes. The DGC should appoint a team of data architects to build a canonical model that defines the entity/attribute definitions and their relationships. This logical data model can serve as the base for building the MDM solution across the enterprise.

4.      The DGC should ensure that the data in the MDM system is of good quality. The matching thresholds should be approved by the DGC.

Elements of MDM


1.      Entity / Attribute Identification: The most important part of building an MDM system is identification of the entities and attributes. Entities and attributes are identified by the Data Governance Council, which is responsible for identifying each attribute and providing its unique definition across the enterprise. An entity is defined by identifying the natural primary key for the record.

2.      Attribute Classification: Attributes can be classified as matching vs. non-matching. They can also be classified as standardized vs. non-standardized attributes.

3.      Data Source Identification: The various data sources, and the attributes they contribute to the MDM system, need to be identified (e.g. DNB, EXP, USPS, INF, etc.).

4.      Priority Assignments

Each data source is assigned a priority. This priority could also be country specific (e.g. USPS has higher priority than DNB in the US, whereas DNB has higher priority in Canada).

a.      Priority Rules: The various priority rules need to be defined and stored in the repository. These rules should be customizable and changeable as per the tuning requirements; a sketch of such a rules table follows.
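
One way to keep these country-specific priorities configurable is a small rules table. The names below are illustrative assumptions:

  CREATE TABLE SourcePriorityRule (
    CountryCode   CHAR(2)     NOT NULL,    -- e.g. 'US', 'CA'
    AttributeName VARCHAR(50) NOT NULL,    -- e.g. 'Address'
    SourceSystem  VARCHAR(20) NOT NULL,    -- e.g. 'USPS', 'DNB'
    Priority      INT         NOT NULL,    -- 1 = highest precedence
    PRIMARY KEY (CountryCode, AttributeName, SourceSystem)
  );

  -- USPS wins for US addresses, DNB wins for Canadian addresses
  INSERT INTO SourcePriorityRule VALUES
    ('US', 'Address', 'USPS', 1),
    ('US', 'Address', 'DNB',  2),
    ('CA', 'Address', 'DNB',  1),
    ('CA', 'Address', 'USPS', 2);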

5.      Matching / Searching

Matching and searching is the core component of any MDM system. A matching/searching technology should be chosen carefully to fulfill the MDM requirements. A system can use a combination of matching techniques (e.g. for the US it can use FL-based matching, whereas for Japan it can use FAST, and at the same time it can use Initiate for Japan-based searches). Using multiple searching/matching techniques in this way can be referred to as a Matching Farm. Following are the components of any matching farm; for an ideal MDM system, these values need to be configurable (a configuration sketch follows the list).

a.      Rules: Matching rules  

b.      Criteria: Matching Criteria

c.      Threshold: Threshold values for matching rules 

d.      Attributes: Attributes participating in the matching
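
A minimal sketch of how these four components could be kept as configuration data; the table and column names are assumptions for illustration:

  CREATE TABLE MatchRuleConfig (
    RuleId      INT IDENTITY PRIMARY KEY,
    CountryCode CHAR(2)      NOT NULL,     -- which region the rule applies to
    MatchEngine VARCHAR(30)  NOT NULL,     -- e.g. 'FAST', 'Initiate'
    Criteria    VARCHAR(200) NOT NULL,     -- e.g. 'Name + Address'
    Threshold   DECIMAL(5,2) NOT NULL      -- minimum score to declare a match
  );

  -- Attributes participating in each matching rule, with their weights
  CREATE TABLE MatchRuleAttribute (
    RuleId        INT NOT NULL REFERENCES MatchRuleConfig(RuleId),
    AttributeName VARCHAR(50)  NOT NULL,
    Weight        DECIMAL(5,2) NOT NULL,   -- contribution of the attribute to the match score
    PRIMARY KEY (RuleId, AttributeName)
  );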

6.      Key Identification Attributes / PK: An important question to be answered by the DGC is when a new record is created vs. when an existing record is modified. This decision can be taken based on the attribute type: if an identifying attribute changes, a new record is created; otherwise the existing record is updated.

7.      Managing History

a.      Data Watermarking: Data watermarking is a technique through which timestamps are maintained in the data store so that a valid data record can be reconstructed at any point in time in the future. (More details to follow.)

b.      Data Snapshot: A snapshot of the entire record is taken periodically and stored as a version in the data store. This can give the complete picture of the record at any point in time in the future.

c.      Building a record out of data: The timestamps are managed at the attribute level and a record is reconstructed based on the available timestamps; a point-in-time reconstruction sketch follows.
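
A rough sketch of option (c), with assumed table and column names: each attribute value is stored with a validity timestamp, and the record "as of" a given date is rebuilt by picking the latest value per attribute that is not later than that date.

  CREATE TABLE CustomerAttributeHistory (
    CompositeId    INT           NOT NULL,
    AttributeName  VARCHAR(50)   NOT NULL,
    AttributeValue NVARCHAR(400),
    ValidFrom      DATETIME      NOT NULL,
    PRIMARY KEY (CompositeId, AttributeName, ValidFrom)
  );

  -- Reconstruct composite record 42 as it looked on 1 Jan 2012
  DECLARE @AsOf DATETIME = '20120101';
  SELECT h.AttributeName, h.AttributeValue
  FROM CustomerAttributeHistory h
  WHERE h.CompositeId = 42
    AND h.ValidFrom = (SELECT MAX(h2.ValidFrom)
                       FROM CustomerAttributeHistory h2
                       WHERE h2.CompositeId   = h.CompositeId
                         AND h2.AttributeName = h.AttributeName
                         AND h2.ValidFrom    <= @AsOf);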

8.      Searchable History: The history of the attributes can optionally be made searchable. There should be an option in the MDM system to search against the specified history of any record.

9.      Audit and archive

Auditing and archiving data in an MDM system can be a challenge, specifically in the case of growing data. There should be customizable options to enable/disable auditing at various levels for each interaction (an interaction can be defined as a data access pattern).

a.      Manage Data Access Patterns: First identify the data access patterns and classify them into traceable categories (an interaction is the granular form of auditing data).

b.      Manage Data Create/Modification Patterns: This can be handled at two levels: table-level audit (a row is entered in the audit for each and every modification of any column in the table) and attribute-level audit (a row is entered in the audit system only when the specified attribute is modified); a trigger-based sketch follows the list.

c.      Manage Data Loading Patterns

        i.     Bulk Load: Batch-level metrics can be calculated/measured in the case of bulk loading.

        ii.     Sequential Load: This can be maintained as one of the patterns mentioned above.
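
As an illustration of attribute-level auditing, an UPDATE trigger can write an audit row only when the watched column actually changes. The names are assumptions, and the trigger reuses the CompositeCustomer table sketched earlier:

  CREATE TABLE CustomerAudit (
    AuditId       INT IDENTITY PRIMARY KEY,
    CompositeId   INT          NOT NULL,
    AttributeName VARCHAR(50)  NOT NULL,
    OldValue      NVARCHAR(400),
    NewValue      NVARCHAR(400),
    ModifiedAt    DATETIME     NOT NULL DEFAULT GETDATE()
  );
  GO
  CREATE TRIGGER trg_CompositeCustomer_Audit
  ON CompositeCustomer
  AFTER UPDATE
  AS
  BEGIN
    -- Attribute-level audit: log a row only when Address actually changed
    INSERT INTO CustomerAudit (CompositeId, AttributeName, OldValue, NewValue)
    SELECT i.CompositeId, 'Address', d.Address, i.Address
    FROM inserted i
    JOIN deleted d ON d.CompositeId = i.CompositeId
    WHERE ISNULL(i.Address, N'') <> ISNULL(d.Address, N'');
  END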

10.   Identification of Data Consumers

It is equally important to know how the data will be used by the consumers.

a.      How the data will be used by consumers: Classify whether consumers use the data in a bulk or sequential pattern.

b.      How the data needs to be exposed to consumers: Based on the usage pattern, determine how the data needs to be exposed to the consumers.

11.   Notifications

a.      How updates to the mastered data are communicated to the consumer: Based on the model chosen, notification services need to be designed. Notifications can be classified into immediate and periodic notifications. A throttling service can be built to control notification delivery.

b.      Is synchronization needed between consumers and the mastered data? Decide whether a push or a pull notification is needed between the consumers and the MDM system.

12.   Security

a.      What data needs to be exposed: A self-service interface is needed to define which attributes will be exposed to the various on-boarders.

b.      Data Access Permissions: Permissions need to restrict specific data to specific zones (e.g. Japan data is not visible to US users).

c.      Data Modification Permissions: Data modification permissions are set per data source and per role.

d.      Data Access Roles: Various roles need to be created in the system based on the data usage/access patterns. These roles could be country/subsidiary specific (e.g. readAll, ReadUS, ReadWriteUS, ReadWriteAll, etc.); a zone-restriction sketch follows the list.
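
One simple way to sketch the zone restriction in T-SQL (the role and view names are hypothetical, and the CompositeCustomer table with its CountryCode column is the one assumed earlier) is a country-filtered view granted to a country-specific role:

  -- Hypothetical role that may only read US records
  CREATE ROLE ReadUS;
  GO
  -- View exposing only the US rows of the composite store
  CREATE VIEW dbo.CompositeCustomer_US
  AS
  SELECT CompositeId, CustomerName, Address
  FROM dbo.CompositeCustomer
  WHERE CountryCode = 'US';
  GO
  GRANT SELECT ON dbo.CompositeCustomer_US TO ReadUS;
  DENY  SELECT ON dbo.CompositeCustomer    TO ReadUS;   -- no direct access to the base table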

13.   Extensible model

In order to make the MDM model extensible, the data model and interfaces should be defined such that adding/removing any attribute or data source is just a matter of configuration.

a.      Ability to add/remove attributes as master data: The DGC can add or remove attributes to be included in the mastered data. Based on the complexity of the system, this could be a configuration-driven change or a code change.

b.      Ability to add/remove data sources: For the identified MDM attributes, there should be the ability to add/remove multiple data sources. The number of attributes and their mappings should also be configurable.

14.   Metadata

a.      Complete metadata information about each and every attribute/entity: Extensive metadata about entities, attributes, relationships, and configuration handles is required for building an MDM system.

15.   Relationships

Different types of relationships can exist between candidate and candidate, and between candidate and composite. The system should support multiple relationships per record. Named hierarchies can be used to resolve this scenario. The relationship data needs to be partitioned very carefully. A sketch of a named-hierarchy table follows the list.

a.      Ability to define / maintain relationships between attributes

b.      Ability to define /maintain Hierarchy between attributes

c.      Ability to support multiple relationships / hierarchies

d.      Ability to support relationship between various data sources
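
A minimal sketch of how multiple named hierarchies could be modeled (the names are illustrative assumptions): each hierarchy is a separate parent-child tree over the composite records, so the same record can have different parents in different hierarchies.

  CREATE TABLE NamedHierarchy (
    HierarchyId   INT IDENTITY PRIMARY KEY,
    HierarchyName VARCHAR(50) NOT NULL       -- e.g. 'Legal', 'Marketing'
  );

  -- A composite record can appear in many hierarchies, each time with its own parent
  CREATE TABLE HierarchyMember (
    HierarchyId       INT NOT NULL REFERENCES NamedHierarchy(HierarchyId),
    CompositeId       INT NOT NULL,
    ParentCompositeId INT NULL,               -- NULL for the root of that hierarchy
    PRIMARY KEY (HierarchyId, CompositeId)
  );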

16.   Data Façade(Interfaces)

While defining the interfaces, we need to consider extensibility, scalability, backward compatibility, and interoperability. WCF services can be considered for singular data transfer, and separate bulk interfaces can be exposed for bulk data transfer.

a.      Define interfaces through which data will be exposed.

b.      Define various methods through which data is exposed.

c.      Define Data Input Channels

d.      Define Data output channels.

17.   Manage Slowly Changing Dimensions (SCD)

a.      Type 1 (overwrite in place), Type 2 (add a new version row), Type 3 (keep the previous value in a separate column); a Type 2 sketch follows.
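
A minimal Type 2 sketch with assumed table and column names: instead of overwriting, the current row is closed and a new version row is opened.

  -- Composite customer with Type 2 versioning columns (illustrative)
  CREATE TABLE CompositeCustomerScd (
    CompositeId INT           NOT NULL,
    Address     NVARCHAR(400),
    ValidFrom   DATETIME      NOT NULL,
    ValidTo     DATETIME      NULL,           -- NULL = current version
    IsCurrent   BIT           NOT NULL
  );

  -- Apply an address change for composite record 42 as a Type 2 update
  DECLARE @Now DATETIME = GETDATE();

  UPDATE CompositeCustomerScd
  SET ValidTo = @Now, IsCurrent = 0
  WHERE CompositeId = 42 AND IsCurrent = 1;

  INSERT INTO CompositeCustomerScd (CompositeId, Address, ValidFrom, ValidTo, IsCurrent)
  VALUES (42, N'1 New Street, Redmond, WA', @Now, NULL, 1);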

Architecture


1.      Master Data Store

2.      Data Sources

3.      Loading Engine

4.      Matching Farm

5.      Processing Queues

6.      Rules Processing Engine

7.      Interfaces

8.      Hub and Spoke Model

a.      Pull all data into a central hub and then flow the Mastered/Golden/Composite record information back out to the spokes.

9.      Taxonomy Management System

Implementation


1.      SOA Implementation

a.      Loosely coupled

b.      Extensible / Scalable

c.      Data flow through Queues

d.      Staged Approach

e.      Real Time Status

2.      Handling Updates

a.      Approach 1: Contributor Updates

        i.     Store the contributing elements.

        ii.     Update the contributing elements.

        iii.     Propagate the changes to the Golden/Mastered Record.

b.      Approach 2: Update the Mastered Record.

        i.     Update the Golden/Mastered Record directly.

        ii.     Maintain the version of the Golden Record as per the defined SCD rules.

        iii.     Define rules for maintaining the versions per attribute and associate the SCD type with it.

3.      Data Model

a.      Generic Data model: for Scalability

b.      Specific data model: for performance

4.      Stages in SOA

a.      Get data from the Source System

b.      Validate Data / Basic Validations

c.      Load Data into staging

d.      Business validations

e.      Match / Search against Golden / Mastered / Composite Record

f.       Prepare association between candidate and composite record.

g.      Based on business rules, prepare composite data.

h.      Maintain versions if needed in case of SCD treatment.

i.       Store data in the Composite Store

j.       Expose data to search/match farm

k.      Prepare notification data

l.       Send notifications

5.      Master Data Identification

a.      Dimensions can be considered as mastered data. E.g. Customers, Products, Organization

b.      Common Data that is used across different applications in the enterprise can be considered for Mastering.

6.      Defining Identifying attributes

7.      Handling Mergers and Acquisitions

a.      E.g. Word and Excel merged into Office, a person changed his name, Oracle acquired PeopleSoft, etc.

Managing Data Quality


1.      Data Standardization: Input data needs to be standardized before being stored in the repository. Standardization ensures that two different representations of the same entity are stored in a common standard form, for example St / Street; WA / Washington State / WS; etc. There are various tools available for data standardization: the Bing address services return standardized addresses, Trillium provides a standardization engine, and so on. A lookup-based sketch follows.
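
A very simple lookup-based illustration of standardization (the table and column names are assumptions; real tools such as Trillium do far more than a static lookup):

  CREATE TABLE StandardizationRule (
    RawValue      NVARCHAR(100) NOT NULL PRIMARY KEY,
    StandardValue NVARCHAR(100) NOT NULL
  );

  INSERT INTO StandardizationRule VALUES
    (N'St', N'Street'),
    (N'WA', N'Washington State'),
    (N'WS', N'Washington State');

  -- Standardize an incoming value, falling back to the raw value when no rule exists
  DECLARE @Raw NVARCHAR(100) = N'St';
  SELECT COALESCE(r.StandardValue, input.RawValue) AS StandardizedValue
  FROM (SELECT @Raw AS RawValue) AS input
  LEFT JOIN StandardizationRule r ON r.RawValue = input.RawValue;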

2.      Data Quality Checks: Certain data quality checks need to be in place to make sure that the entered data is of good quality. A list of bad/noise words can be maintained per region; any data entry containing such words should be either discarded or re-validated. Taxonomy values for fields like language, currency, ISO country codes, etc. can be maintained in the system. Data also needs to be checked for dummy values like abc@xyz.com, aaaaaaa, etc. A sketch of such a check follows.
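
A sketch of a region-specific noise-word check; the NoiseWord table and the StagingCustomer table (with StagingId, Region, CustomerName, and Email columns) are hypothetical names used only for illustration:

  CREATE TABLE NoiseWord (
    Region VARCHAR(10)   NOT NULL,            -- e.g. 'US'
    Word   NVARCHAR(100) NOT NULL,
    PRIMARY KEY (Region, Word)
  );

  INSERT INTO NoiseWord VALUES
    ('US', N'abc@xyz.com'),
    ('US', N'aaaaaaa'),
    ('US', N'test');

  -- Flag staging rows that contain a noise word and need to be re-validated
  SELECT s.StagingId, s.CustomerName, s.Email
  FROM StagingCustomer s
  JOIN NoiseWord n
    ON n.Region = s.Region
   AND (s.CustomerName LIKE '%' + n.Word + '%' OR s.Email = n.Word);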

3.      Trusted Data Priority Rules: Data can be trusted based on the authenticity of the input sources. For example, demographic data from DNB can be trusted for the US, whereas address data can be trusted from USPS. A matrix needs to be prepared showing which source has precedence over the others for specific attributes. A survivorship sketch driven by such a matrix follows.
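
A sketch of how such a precedence matrix could drive survivorship, reusing the SourcePriorityRule, CandidateCustomer, and CandidateCompositeLink tables assumed earlier: for each composite record, take the address from the highest-priority source that supplied one.

  -- Pick the US address from the highest-priority contributing source per composite record
  SELECT l.CompositeId, c.Address
  FROM CandidateCompositeLink l
  JOIN CandidateCustomer c  ON c.CandidateId   = l.CandidateId
  JOIN SourcePriorityRule p ON p.SourceSystem  = c.SourceSystem
                           AND p.AttributeName = 'Address'
                           AND p.CountryCode   = 'US'
  WHERE c.Address IS NOT NULL
    AND p.Priority = (SELECT MIN(p2.Priority)
                      FROM CandidateCompositeLink l2
                      JOIN CandidateCustomer c2  ON c2.CandidateId   = l2.CandidateId
                      JOIN SourcePriorityRule p2 ON p2.SourceSystem  = c2.SourceSystem
                                                AND p2.AttributeName = 'Address'
                                                AND p2.CountryCode   = 'US'
                      WHERE l2.CompositeId = l.CompositeId
                        AND c2.Address IS NOT NULL);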

4.      Weight Control System: The data can be classified into categories based on its matching score against the configured thresholds. These matching threshold values need to be configurable and can be tweaked until the matching algorithm is tuned to produce the desired results.

5.      Double-Byte Characters: Special testing is needed for the algorithms when handling double-byte characters (e.g. Japanese, Chinese, and Korean characters). In certain scenarios it is observed that different algorithms are needed to standardize and match double-byte character data. A small storage/comparison sketch follows.
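
At the storage level, one thing that is easy to get wrong in SQL Server is keeping double-byte data in NVARCHAR (with the N prefix on literals) and comparing with an appropriate collation. A small illustration; the collation choice here is only an example:

  DECLARE @NameJp NVARCHAR(100) = N'東京商事';   -- N prefix keeps the Unicode characters intact

  -- Compare using a Japanese collation so case/width sensitivity is handled as configured
  SELECT CASE
           WHEN @NameJp COLLATE Japanese_CI_AS = N'東京商事' COLLATE Japanese_CI_AS
           THEN 'match' ELSE 'no match'
         END AS Result;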


I will be writing more on this in my next blog.
-TheGr8DB
TRY CATCH Block Explained

Every TRY block should have its corresponding CATCH block. No statements are allowed between END TRY and BEGIN CATCH.

•  Valid Example: The following is a valid usage of TRY CATCH and BEGIN END blocks.

BEGIN
  BEGIN TRY
    BEGIN
      SELECT 'block a'
    END
    BEGIN
      SELECT 'block b'
      RAISERROR('error raised in block b', 16, 1)   -- transfers control to the CATCH block
    END
  END TRY
  BEGIN CATCH
    SELECT 'catch statement'
  END CATCH
END



•  Invalid Example: The following example is invalid because of the BEGIN placed between END TRY and BEGIN CATCH (and its matching END after END CATCH). You cannot have any statements between the END TRY and BEGIN CATCH statements.

BEGIN
  BEGIN TRY
    BEGIN
      SELECT 'block a'
    END
    BEGIN
      SELECT 'block b'
      RAISERROR('error raised in block b', 16, 1)
    END
  END TRY
  BEGIN        -- invalid: no statements are allowed between END TRY and BEGIN CATCH
  BEGIN CATCH
    SELECT 'catch statement'
  END CATCH
  END          -- invalid: the matching END of the stray BEGIN
END



•  BEGIN TRY … END TRY is a single block of code. You can have nested BEGIN … END blocks inside it, but a BEGIN … END block cannot span across the TRY … CATCH boundary. Think of a Venn diagram: the blocks either nest completely or stay apart; they can never intersect each other. A sketch of an intersecting (and therefore invalid) arrangement follows.
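
As a sketch of the intersecting case, the following will not compile because the inner BEGIN opens inside the TRY block but its END falls inside the CATCH block:

  BEGIN TRY
    BEGIN                    -- opens inside the TRY block
      SELECT 'block a'
  END TRY                    -- invalid: the BEGIN above is still open
  BEGIN CATCH
    END                      -- invalid: this END belongs to a block that started inside the TRY
    SELECT 'catch statement'
  END CATCH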





How Do I: Modifying Measures, Attributes and Hierarchies?

Got this interesting video from TechNet.
SSAS Cube



This video demonstrates how to modify the measures, attributes and hierarchies in a cube.