Master Data Management
In computing, Master Data Management (MDM) comprises a set of processes and tools that consistently defines and manages the master data (i.e. non-transactional data entities) of an organization (which may include reference data).
MDM Classification
A MDM System can be
classified into following categories:
1. Candidate
Only MDM: This type of MDM System stores the data from various sources. It
establishes the relationship between these sources and then at runtime
calculates the Golden Record information. These systems works based on a
Pull-Publish Model. The data in the MDM System is as current as the latest
update of any participating source. This model is used in case of dynamically
changing precedence of data sources for various attributes.
2. Composite
Only Model: This type of model does not store the Candidate information. It
maintains only one set of information about any attribute, whenever any update
comes, it updates the Golden Record directly. This type of model is used when
there is clear precedence of data is defined over various data sources. (eg. Every
data sources has equal precedence).
3. Hybrid
Model: This is a combination of above two approaches. In this approach data of
the candidate records are stored as well as the Golden Composite record is also
stored in the data store. This type of data model is used when there are
multiple data sources of different precedence on various attributes is
present.
Data Governance Council
A Data Governance
Council is needed for maintaining a Master Data Management System. Following
could be the activities of Data Governance Council:
1. Identification
of Entities and Attributes that qualified to be stored as master data. The DGC
will identity each attribute and define it completely and a non-ambiguous way.
The DGC should ensure that there are no multiple definitions of an attribute
across the enterprise.
2. Each
and every Attribute that makes an entry into MDM system needs to be defined and
approved by DGC.
3. DGC
should also define the relations between various entities and attributes. DGC
should appoint a team of Data Architects to build a canonical model which
defines Entity/Attribute definitions and their relationships. This Logical Data
Model can be served as the base for building the MDM solution across the
enterprise.
4. The
DGC should ensure that the data in the MDM System is of Good Quality. The
Matching Thresholds should be approved by the DGC.
Elements of MDM
1.
Entity / Attribute Identification: The most
important part of building an MDM System is identification of the Entities and
Attributes. Entities and attributes are identified by a Data Governance Council. The DGC is
responsible for identification of each attribute and providing its unique
definition across the enterprise. Entity is defined by identifying the natural Primary
Key for the record.
2.
Attribute Classification: Attributes can be
classified on Matching Vs Non Matching. It can also be classified based on
Standardized Vs Non Standardized Attributes.
3.
Data source Identification: Various data sources
and their corresponding attributes, which will contribute to the MDM System
needs to be identified(eg DNB, EXP, USPS, INF etc...).
4.
Priority Assignments
Each data source is assigned a priority.
This priority could be country specific as well(eg. USPS has High priority over
DNB in US whereas DNB has high priority in Canada).
a.
Priority Rules: Various priority rules needs to
be defined and should be stored in the repository. These rules should be
customizable and could be changed as per the tuning requirements.
5.
Matching / Searching
Matching and searching is the core
component of any MDM system. A matching / searching technology should be chosen
carefully, to fulfill the MDM requirements. A system can use a combination of
matching techniques (eg. For US it can use FL based matching, whereas for Japan
it can use FAST and at the same time it can use Initiate for Japan based
searches). Hence using multiple searching/matching technique can be referred as
Matching Farms. Following are the Components of any matching farm. For an ideal
MDM system, these values need to be configurable.
a.
Rules: Matching rules
b.
Criteria: Matching Criteria
c.
Threshold: Threshold values for matching
rules
d.
Attributes: Attributes participating in the
matching
6.
Key Identification Attributes / PK: This is an
important question to be answered by DGC. When a new record is created Vs when
a record is modified. Based on the attribute type, this decision could be
taken. If its identifying attribute is changed, then a new record is created.
7.
Managing History
a.
Data Watermarking: Data Watermarking a technique
through which timestamps are maintained in the data store and a valid data
record can be constructed at any point in time in future. (More details to
follow)
b.
Data Snapshot: A snapshot of the entire record
is taken and stored as a version in the data store periodically. This can give
the complete picture of the record at any point in time in future.
c.
Building a record out of data: The timestamps
are managed at the attribute level and a record is reconstructed based on the
available timestamps.
8.
Searchable history: History of the attributes
can be optionally made searchable. There should be an option in the MDM System
to search against the specified history of any record.
9.
Audit and archive
Auditing and Archiving Data in an MDM
System can be a challenge specifically in case of growing data. There should be
customizable options to enable / disable auditing at various levels for each
interaction (An Interaction can be defined as a data access pattern).
a.
Manage Data Access Patterns: First identify the
data access patterns and classify them into traceable categories (interaction
- Granular form of auditing data)
b.
Manage Data Create/Modification patterns: This
can be handled at two levels, table level audit (A row is entered in audit for
each and every modification in any column in the table) attribute level (a row
is entered in the audit system when the specified attribute is modified).
c.
Manage Data Loading Patterns
i. Bulk
Load: Batch level matrices can be calculated / measured in case of bulk
loading.
ii. Sequential
Load: It can be maintained as one of the pattern mentioned above.
10.
Identification of Consumers (data)
It is equally important to know how the
data will be used by the consumers.
a.
How Data will be used by consumers: Classify how
consumers use the data in bulk / sequential pattern.
b.
How data needs to be exposed to Consumers: Based
on the usage pattern determine how data needs to be exposed to the consumers.
11.
Notifications
a.
How Updates to the Mastered Data are
communicated to the consumer: Based on the model chosen, notification services
needs to be designed. Notifications can be classified into Immediate and
Periodic notifications. A throttling service can be built to control
notification delivery.
b.
Is Synch is needed Consumers and mastered data?
Weather a push or pull notification is needed between Consumers and MDM system.
12.
Security
a.
What data needs to be exposed: A self-service
interface is needed to define what attributes will be exposed to various
on-boarders.
b.
Data Access permissions: Permissions needs to be
restricted to expose specific data to specific zone (eg Japan data is not
visible to US users)
c.
Data Modification permissions: Data modification
permissions are set as per the data source and the roles.
d.
Data Access Roles: Various roles needs to be
created in the system based on the data usage/access patterns. These roles
could be country/Subsidiary specific.(e.g. readAll, ReadUS, ReadWriteUS,
ReadWriteAll, etc…)
13.
Extensible model
In order to have MDM Model extensible, the
data model and interfaces should be defined such that adding / removing any
attribute or data source is just a matter of configuration.
a.
Ability to Add / remove attributes as master
data: The DGC can remove/add extra attributes to be included into the Mastered
data. Based on the complexity of the system, this could be a configuration
driven or a code change.
b.
Ability to Add / remove data sources: For
identified MDM attributes, there should be ability to add / remove multiple
data sources. Also number of attributes and mappings should be configurable.
14.
Metadata
a.
Complete metadata information about each and
every attribute / entity: Extensive metadata information about entity,
attribute, relationship, configuration handles is required for building a MDM
System.
15.
Relationships
Different types of relationships can exist between
candidate and candidate, Candidate and composite. The system should support
multiple relationship support. Named hierarchies can be used to resolve this
scenario. The relationship data needs to be partitioned very carefully.
a.
Ability to define / maintain relationships
between attributes
b.
Ability to define /maintain Hierarchy between
attributes
c.
Ability to support multiple relationships /
hierarchies
d.
Ability to support relationship between various
data sources
16.
Data Façade(Interfaces)
While defining the interfaces we need to
consider the extensibility, scalability, backwards compatibility and
inter-portability. WCF services can be considered for singular data transfer
and some bulk interfaces can be exposed for bulk data transfer.
a.
Define interfaces through which data will be
exposed.
b.
Define various methods through which data is
exposed.
c.
Define Data Input Channels
d.
Define Data output channels.
17.
Manage Slowly Changing dimensions(SCD)
a.
Type 1, Type 2, Type3.
Architecture
1.
Master Data Store
2.
Data Sources
3.
Loading Engine
4.
Matching Farm
5.
Processing Queues
6.
Rules Processing Engine
7.
Interfaces
8.
Hub spoke Model
a.
Pull all data to a central hub and then flow out
the Mastered / Golden Record / Composite information.
9.
Taxonomy Management System
Implementation
1.
SOA Implementation
a.
Loosely coupled
b.
Extensible / Scalable
c.
Data flow through Queues
d.
Staged Approach
e.
Real Time Status
2.
Handling Updates
a.
Approach 1: Contributor Updates
i. Store
the contributing elements.
ii. Upgrade
the Contributing elements.
iii. Propagate
the changes to the Golden/Mastered Record.
b.
Approach 2: Update the Mastered record.
i. Update
the Golden / Mastered Record directly
ii. Maintain
the version of the Golden record as per defined SCD rules.
iii. Define
rules for maintaining the versions per attribute and associate the SCD type
with it.
3.
Data Model
a.
Generic Data model: for Scalability
b.
Specific data model: for performance
4.
Stages in SOA
a.
Get data from the Source System
b.
Validate Data / Basic Validations
c.
Load Data into staging
d.
Business validations
e.
Match / Search against Golden / Mastered /
Composite Record
f.
Prepare association between candidate and
composite record.
g.
Based on business rules, prepare composite data.
h.
Maintain versions if needed in case of SCD
treatment.
i.
Store data in the Composite Store
j.
Expose data to search/match farm
k.
Prepare notification data
l.
Send notifications
5.
Master Data Identification
a.
Dimensions can be considered as mastered data.
E.g. Customers, Products, Organization
b.
Common Data that is used across different
applications in the enterprise can be considered for Mastering.
6.
Defining Identifying attributes
7.
Handling Mergers and Acquisitions
a.
E.g. Word and Excel merged into Office, Person
changed his name, Oracle Acquired PeopleSoft etc.
Managing Data Quality
1.
Data Standardization: Input data needs to be
standardized and stored into the repository. Standardization ensures that the
two different versions of the same entity are stored in a common standard way.
For example St / Street ; WA / Washington
State / WS ; etc.. There are various tools available for data standardization.
Bing Address services returns the standardized addresses, Trillium uses
standardization engine etc..
2.
Data Quality Cheeks: Certain Data Quality Checks
needs to be in place to make sure that the entered data is of good quality. A
list of Bad/Noise words can be maintained per region. Any data entry with such
words should be either discarded or needs to be re-validated. Usage of Taxonomy
values for fields like Language, currency, ISO Country codes, etc… can be
maintained in the system. Data needs to checked for the dummy data like abc@xyz.com ; aaaaaaa, etc..
3.
Trusted Data Priority Rules: Data can be trusted
based on the authenticity of the input sources. For example demographic data
from the DNB can be trusted for US whereas address data can be trusted from
USPS. A matrix needs to be prepared about which source has precedence over
other for specific attributes.
4.
Weight Control System: The data can be
classified into categories based on its matching threshold. These matching
threshold values needs to be configurable and can be tweaked till the matching
algorithm is trained to match for the desired results.
5.
Double Byte Characters: Special testing is
needed for testing the algorithms for the DBC (eg Japanese, Chinese, Korean
characters). In certain scenarios it is observed that different algorithms are
needed to standardize and match DBC.
I will be writing more on this in my next blog.
-TheGr8DB
I will be writing more on this in my next blog.
-TheGr8DB