Thursday, 2 June 2011

METADATA IN THE DATA WAREHOUSE


METADATA IN THE DATA WAREHOUSE

The first image most people have of the data warehouse is a large collection of
historical, integrated data. While that image is correct in many regards, there is another
very important element of the data warehouse that is vital - metadata.
Metadata is data about data. Metadata has been around as long as there have been
programs and data that the programs operate on. Figure 1 shows metadata in a simple
form.







While metadata is not new, the role of metadata and its importance in the face of the
data warehouse certainly is new. For years the information technology professional has
worked in the same environment as metadata, but in many ways has paid little attention
to metadata. The information professional has spent a life dedicated to process and
functional analysis, user requirements, maintenance, architectures, and the like. The role
of metadata has been passive at best in this milieu.
But metadata plays a very different role in data warehouse. Relegating metadata to a
backwater, passive role in the data warehouse environment is to defeat the purpose of
data warehouse. Metadata plays a very active and important part in the data warehouse
environment.
The reason why metadata plays such an important and active role in the data warehouse
environment is apparent when contrasting the operational environment to the data
warehouse environment insofar as the user community is concerned.
The information technology professional is the primary community involved in the usage
of operational development and maintenance facilities. It is expected that the
information technology community is computer literate, and able to find his/her way
around systems. The community served by the data warehouse is a very different
community. The data warehouse serves the DSS analysis community. It is anticipated
that the DSS analysis community is not computer literate. Instead the expectation is
that the DSS analysis community is a businessperson community first, and a technology
community second.
Simply from the standpoint of who needs help the most in terms of finding one's way
around data and systems, it is assumed the DSS analysis community requires a much
more formal and intensive level of support than the information technology community.
For this reason alone, the formal establishment of and ongoing support of metadata
becomes important in the data warehouse environment.
But there is a secondary, yet important, reason why metadata plays an important role in
the data warehouse environment. In the data warehouse environment, the first thing the
DSS analyst needs to know in order to do his/her job is what data is available and where
it is in the data warehouse. In other words, when the DSS analyst receives an
assignment, the first thing the DSS analyst needs to know is what data there is that
might be useful in fulfilling the assignment. To this end the metadata for the warehouse
is vital to the preparatory work done by the DSS analyst.
Contrast the importance of the metadata to the DSS analyst to the importance of
metadata to the information technology professional. The information technology
professional has been doing his/her job for many years while treating metadata
passively.
The community served and the importance to that community are the primary reasons
why metadata is so important in the data warehouse environment. But there are other
powerful reasons as well.


MAPPING
A basic part of the data warehouse environment is that of mapping from the operational
environment into the data warehouse. The mapping includes a wide variety of facets,
including, but not limited to:
• mapping from one attribute to another,
• conversions,
• changes in naming conventions,
• changes in physical characteristics of data,
• filtering of data, etc.
It may not be obvious why mapping information is so important in the data warehouse
environment. Consider the vice president of marketing who has just asked for a new
report. The DSS analyst turns to the data warehouse for the data for the report. Upon
inspection, the vice president proclaims the report to be fiction. The credibility of the
DSS analyst goes down until the DSS analyst can prove the data in the report to be
valid. The DSS analyst first looks to the validity of the data in the warehouse. If the data
warehouse data has not been reported properly, then the reports are adjusted.
However, if the reports have been made properly from the data warehouse, the DSS
analyst is in the position of having to go back to the operational source to salvage
credibility. At this point, if the mapping data has been carefully stored, then the DSS
analyst can quickly and gracefully go to the operational source. However, if the mapping
has not been stored or has not been stored properly, then the DSS analyst has a difficult
time defending his/her conclusions to management.
The metadata store for the data warehouse then is a natural place for the storing of
mapping information.

MANAGING DATA OVER TIME
Another extremely important function of metadata in the data warehouse environment is
that of management of data over time. The time horizon of data in the data warehouse
is MUCH longer than that normally found in the operational environment. In the data
warehouse a time horizon of five to ten years is absolutely normal. A time horizon of a
few weeks to ninety days is normal for many operational environments. Nowhere does
the difference in managing data over different time horizons show up more poignantly
than in the metadata sitting over those environments.
Given that data in the data warehouse environment is taken over a broad spectrum of
time, it is absolutely normal that the structure of data - the keys, the attributes, the
layouts, etc. - should change. Said another way, over a long period of time, the same
data structure will have multiple forms. The record of those structures and those
changes becomes the purview of the metadata for the data warehouse.
The storage of multiple data structures for the data warehouse is contrasted with the
storage of a single data structure as found in the operational environment. One of the
fundamental concepts of data management in the operational environment is that there
is one and only one correct definition of data. This assumption is 180 degrees the
opposite of the data found in the data warehouse. Managing data over a long spectrum
of time then is another reason why metadata in the data warehouse is so important.

VERSIONING OF DATA
Because data must be managed over a long spectrum of time in the data warehouse
environment (and correspondingly, the associated metadata must be likewise managed),
metadata must be "versioned". Versioned data is data that allows changes to be
continuously tracked over a long period of time.One of the characteristics of versioning is that the data trail be continuous and nonoverlapping.
In other words it is important that for any moment in the past there be one
and only one value or status of metadata.
Versioning can be accomplished by placing the effective from data and the effective to
date on the metadata component, or by placing the effective from date only on the
metadata component. When only the effective from date is placed on the metadata
component, the effective to date must be derived by looking at the next occurrence of
the metadata component.
The reason why versioning is so important is that when the DSS analyst wants to
interrogate a calculation or report made in the past that the versioning allows the DSS
analyst to understand what data was and where it came from as it entered the
warehouse. Without versioning, the only meaningful data a data warehouse environment
has is data written for and managed under the most current definition and structure of
data.

METADATA COMPONENTS - BASIC COMPONENTS
The basic components of the data warehouse metadata store include the tables that are
contained in the warehouse, the keys of those tables, and the attributes.
The contents of the metadata for the data warehouse do not differ significantly from the
metadata contents for the operational environment for these basic components, other
than the data warehouse components are versioned, and the same components for the
operational environment are not normally versioned.


METADATA COMPONENTS - MAPPING
As previously discussed, one of the most important components of the data warehouse
metadata store is that of mapping.
The typical contents of mapping metadata that are stored in the data warehouse
metadata store are:
• identification of source field(s),
• simple attribute to attribute mapping,
• attributes conversions,
• physical characteristic conversions,
• encoding/reference table conversions,
• naming changes,
• key changes,
• defaults,
• logic to choose from multiple sources,
• algorithmic changes, and so forth.
Like all other data warehouse metadata, these components should be versioned.

METADATA COMPONENTS - EXTRACT HISTORY
The actual history of extracts and transformations of data coming from the operational
environment and heading for the data warehouse environment is another component
that belongs in the data warehouse metadata store. Figure 8 shows this part of the data
warehouse metadata store.
The extract history simply tells the DSS analyst when data entered the data warehouse. The DSS
analyst has many uses for this type of information. One occasion is when the DSS analyst wants to
know when the last time data in the warehouse was refreshed. Another occasion is when the DSS
analyst wants to do what if processing and the assertions of analysis have changed. The DSS
analyst needs to know whether the results obtained for one analysis are different from results
obtained by an earlier analysis because of a change in the assertions or a change in the data. There
are MANY cases where the DSS analyst needs to use the precise history of when insertions have
been done to the data warehouse.

METADATA COMPONENTS - MISCELLANEOUS
One important component of the metadata environment is that of alias information.Alias information is attribute and key information that allows for alternative names.
Alternative names often make a data warehouse environment much more "user
friendly". In some cases technicians have influenced naming conventions that cause data
warehouse names to be incomprehensible. In other cases one departments names for
data have been entered into the warehouse, and another department would like to have
their names for the data imposed. Alias' are a good way to resolve these issues.
Another useful data warehouse metadata component is that of status. In some cases a
data warehouse table is undergoing design. In other cases the table is inactive or may
contain misleading data. The existence of a status field is a good way to resolve these
differences.
Volumetrics are measurements about data in the warehouse. Typical volumetric
information might include:
• the number of rows currently in the table,
• the growth rate of the table,
• the statistical profile of the table,
• the usage characteristics of the table,
• the indexing for the table and its structure,
• the byte specifications for the table, etc.
Volumetric information is useful for the DSS analyst planning an efficient usage of the
data warehouse. It is much more effective to consult volumetrics before submitting a
query that will use unknown resources than it is to simply submit the query and hope for
the best.
Aging/purge criteria is also an important component of data warehouse metadata.
Looking into the metadata store for a definition of the life cycle of data warehouse data
is much more efficient than trying to divine the life cycle by examining the data inside
the warehouse.

METADATA COMPONENTS - SUMMARIZATION ALGORITHMS
The data warehouse is made up of multiple levels of data, usually divided along the lines
of detail and summary. Storing the algorithms used to summarize or otherwise calculate
data can save the DSS analyst untold time in determining what the appropriate source of
data is. Figure 10 shows the summarization algorithm connecting two levels of data in
the warehouse.

METADATA COMPONENTS - RELATIONSHIP ARTIFACTS
Data relationships are as important in the data warehouse environment as they are in
any other data base environment. However, the implementation of data relationships in
the data warehouse environment is quite different from the implementation of data
relationships in other environments. Storing information about the data relationship
artifacts found in the data warehouse is a useful component of the data warehouse
metadata store. Storing artifact information in the data warehouse metadata makes the DSS analysts job
much easier in trying to determine what data is appropriate and most useful for analysis.


METADATA COMPONENTS - RELATIONSHIP HISTORY
Data artifacts are the actual data that results from a relationship of data occurring. But
there is another dimension to that relationship and that is the actual history of the
relationship itself. Storing the history of the relationship is also a very important aspect
of data warehouse metadata. The history of a relationship (as opposed to the actual data resulting from the
relationship) can be quite useful in interpreting data warehouse artifacts.
Typically the relationship history contains the following information:
• the tables participating in the relationship,
• the effective dates of the relationship,
• the constraints in effect,
• cardinality,
• a description of the relationship,
• the data administrator in charge, etc.

METADATA COMPONENTS - OWNERSHIP/STEWARDSHIP
Because data is updated in the operational environment there is a strong sense of
ownership. The department that can update operational data effectively "owns" the data.
There is a corresponding notion of stewardship of data in the data warehouse
environment. Data is not updated in the data warehouse. But it is moved into the data
warehouse and managed there. There is a responsibility for data in the data warehouse
(although the notion of responsibility for data in the operational environment is quite
different.)
Storing ownership (operational) and stewardship (data warehouse) information in the
data warehouse metadata store can be of great service to the DSS analyst trying to
rectify or understand data.

METADATA COMPONENTS - ACCESS PATTERNS
The pattern of access of data - either predicted or actual - can be quite useful. This
information can be stored in the data warehouse. Access pattern data is useful in determining when and what data to migrate to other levels of storage. If a parallel environment is chosen, access patterns are very useful in
deciding how to physically spread the data.
Unfortunately, determining access patterns is not an easy task in many technologies. In
some cases log records can be used for this purpose. In other environments the dbms
masks off the actual pattern of access.

METADATA COMPONENTS - REFERENCE TABLES/ENCODED DATA
Encoded data or reference data is data that is stored in an external table and contains
commonly used translations of values. Usually the data in the reference table is lengthy
or stored awkwardly and the data in the primary data table is stored concisely. As data
changes over time, it is necessary to store the contents of the reference tables as well.
If the reference tables are not stored, the DSS analyst may well arrive at the proposition
of having to try to decipher primary data into an unknown format. This can be very
disconcerting for the DSS analyst. Inaccuracy and frustration are usually the result.
In order to manage data over the spectrum of time, it is necessary to store reference
tables/encoded tables as well, with the effective from and effective to dates for the
tables in the data warehouse metadata store.

METADATA COMPONENTS - DATA MODEL - DESIGN REFERENCE
The heart of the design of the data warehouse is the data model. Building a data
warehouse without a data model is a very trying experience. The DSS analyst
occasionally needs to understand the relationship between the physical design of the
data warehouse and the data model. Figure 16 shows the data model and the
relationship between the data model and the physical design of the system residing in
the data warehouse metadata store.

RELATIONSHIP TO OTHER METADATA STORES
While the metadata for the data warehouse is certainly an important feature of the
architected environment, it is hardly the only metadata there is. Metadata exists (in a
variety of forms) for the operational environment and the departmental environment.
The metadata for all the environments needs to be able to communicate when needed
for the other environments. There needs to be a two-way interchange of information.
The interfaces need to understand what data resides where and the format and use of
the data in other environments.

No comments:

Post a Comment