[Other] The 38 Subsystems of ETL (reposted)
Posted by 音乐昆虫 on 2007/9/12 13:49:28

To create a successful data warehouse, rely on best practices, not intuition.
http://www.intelligententerprise.com/showArticle.jhtml?articleID=54200319
By Ralph Kimball
The extract-transform-load (ETL) system, or more informally, the
"back room," is often estimated to consume 70 percent of the time and
effort of building a data warehouse. But there hasn't been enough
careful thinking about just why the ETL system is so complex and
resource intensive. Everyone understands the three letters: You get the
data out of its original source location (E), you do something to it
(T), and then you load it (L) into a final set of tables for the users
to query.
When asked about breaking down the three big steps, many designers
say, "Well, that depends." It depends on the source, it depends on
funny data idiosyncrasies, it depends on the scripting languages and
ETL tools available, it depends on the skills of the in-house staff,
and it depends on the query and reporting tools the end users have.
The "it depends" response is dangerous because it becomes an excuse
to roll your own ETL system, which in the worst-case scenario results
in an undifferentiated spaghetti-mess of tables, modules, processes,
scripts, triggers, alerts, and job schedules. Maybe this kind of
creative design approach was appropriate a few years ago when everyone
was struggling to understand the ETL task, but with the benefit of
thousands of successful data warehouses, a set of best practices is
ready to emerge.
I have spent the last 18 months intensively studying ETL practices
and ETL products. I have identified a list of 38 subsystems that are
needed in almost every data warehouse back room. That's the bad news.
No wonder the ETL system takes such a large fraction of the data
warehouse resources. But the good news is that if you study the list,
you'll recognize almost all of them, and you'll be on the way to
leveraging your experience in each of these subsystems as you build
successive data warehouses.
The 38 Subsystems
1. Extract system. Source data adapters, push/pull/dribble job schedulers, filtering and sorting at the source, proprietary data format conversions, and data staging after transfer to the ETL environment.
2. Change data capture system. Source log file readers, source date and sequence number filters, and CRC-based record comparison in the ETL system (see the sketch after this list).
3. Data profiling system. Column property analysis including discovery of inferred domains, and structure analysis including candidate foreign key to primary key relationships, data rule analysis, and value rule analysis.
4. Data cleansing system. Typically a dictionary-driven system for complete parsing of names and addresses of individuals and organizations, possibly also products or locations. "De-duplication" including identification and removal, usually of individuals and organizations, possibly products or locations; often uses fuzzy logic. "Surviving" using specialized data merge logic that preserves specified fields from certain sources to be the final saved versions. Maintains back references (such as natural keys) to all participating original sources.
5. Data conformer. Identification and enforcement of special conformed dimension attributes and conformed fact table measures as the basis for data integration across multiple data sources.
6. Audit dimension assembler. Assembly of metadata context surrounding each fact table load in such a way that the metadata context can be attached to the fact table as a normal dimension (see the sketch after this list).
7. Quality screen handler. In-line ETL tests applied systematically to all data flows, checking for data quality issues. One of the feeds to the error event handler (see subsystem 8, and the sketch after this list).
8. Error event handler. Comprehensive system for reporting and responding to all ETL error events. Includes branching logic to handle various classes of errors, and includes real-time monitoring of ETL data quality.
9. Surrogate key creation system. Robust mechanism for producing a stream of surrogate keys, independently for every dimension. Independent of database instance, able to serve distributed clients.
10. Slowly Changing Dimension (SCD) processor. Transformation logic for handling the three types of time variance possible for a dimension attribute: Type 1 (overwrite), Type 2 (create new record), and Type 3 (create new field). A sketch of Type 1 and Type 2 processing appears after this list.
11. Late arriving dimension handler. Insertion and update logic for dimension changes that have been delayed in arriving at the data warehouse.
12. Fixed hierarchy dimension builder. Data validity checking and maintenance system for all forms of many-to-one hierarchies in a dimension.
13. Variable hierarchy dimension builder. Data validity checking and maintenance system for all forms of ragged hierarchies of indeterminate depth, such as organization charts and parts explosions.
14. Multivalued dimension bridge table builder. Creation and maintenance of associative (bridge) tables used to describe a many-to-many relationship between dimensions. May include weighting factors used for allocations and situational role descriptions.
15. Junk dimension builder. Creation and maintenance of dimensions consisting of miscellaneous low-cardinality flags and indicators found in most production data sources.
16. Transaction grain fact table loader. System for updating transaction grain fact tables, including manipulation of indexes and partitions. Normally append mode for most recent data. Uses the surrogate key pipeline (see subsystem 19).
17. Periodic snapshot grain fact table loader. System for updating periodic snapshot grain fact tables, including manipulation of indexes and partitions. Includes a frequent overwrite strategy for incremental update of current period facts. Uses the surrogate key pipeline (see subsystem 19).
18. Accumulating snapshot grain fact table loader. System for updating accumulating snapshot grain fact tables, including manipulation of indexes and partitions, and updates to both dimension foreign keys and accumulating measures. Uses the surrogate key pipeline (see subsystem 19).
19. Surrogate key pipeline. Pipelined, multithreaded process for replacing the natural keys of incoming data with data warehouse surrogate keys (see the sketch after this list).
20. Late arriving fact handler. Insertion and update logic for fact records that have been delayed in arriving at the data warehouse.
21. Aggregate builder. Creation and maintenance of physical database structures, known as aggregates, that are used in conjunction with a query-rewrite facility to improve query performance. Includes stand-alone aggregate tables and materialized views.
22. Multidimensional cube builder. Creation and maintenance of the star schema foundation for loading multidimensional (OLAP) cubes, including special preparation of dimension hierarchies as dictated by the specific cube technology.
23. Real-time partition builder. Special logic for each of the three fact table types (see subsystems 16, 17, and 18) that maintains a "hot partition" in memory containing only the data that has arrived since the last update of the static data warehouse tables.
24. Dimension manager system. Administration system for the "dimension manager" who replicates conformed dimensions from a centralized location to fact table providers. Paired with subsystem 25.
25. Fact table provider system. Administration system for the "fact table provider" who receives conformed dimensions sent by the dimension manager. Includes local key substitution, dimension version checking, and aggregate table change management.
26. Job scheduler. System for scheduling and launching all ETL jobs. Able to wait for a wide variety of system conditions, including dependencies of prior jobs completing successfully. Able to post alerts.
27. Workflow monitor. Dashboard and reporting system for all job runs initiated by the job scheduler. Includes number of records processed, summaries of errors, and actions taken.
28. Recovery and restart system. Common system for resuming a job that has halted, or for backing out a whole job and restarting. Significant dependency on the backup system (see subsystem 36).
29. Parallelizing/pipelining system. Common system for taking advantage of multiple processors or grid computing resources, and common system for implementing streaming data flows. Highly desirable (eventually necessary) that parallelizing and pipelining be invoked automatically for any ETL process that meets certain conditions, such as not writing to the disk or waiting on a condition in the middle of the process.
30. Problem escalation system. Automatic plus manual system for raising an error condition to the appropriate level for resolution and tracking. Includes simple error log entries, operator notification, supervisor notification, and system developer notification.
31. Version control system. Consistent "snapshotting" capability for archiving and recovering all the metadata in the ETL pipeline. Check-out and check-in of all ETL modules and jobs. Source comparison capability to reveal differences between different versions.
32. Version migration system. Move a complete ETL pipeline implementation out of development, into test, and then into production. Interface to the version control system to back out a migration. Single interface for setting connection information for an entire version. Independence from database location for surrogate key generation.
33. Lineage and dependency analyzer. Display the ultimate physical sources and all subsequent transformations of any selected data element, chosen either from the middle of the ETL pipeline or on a final delivered report (lineage). Display all downstream data elements and final report fields affected by a potential change in any selected data element, chosen either in the middle of the ETL pipeline or in an original source (dependency).
34. Compliance reporter. Comply with regulatory statutes to prove the lineage of key reported operating results. Prove that the data and the transformations haven't been changed. Show who has accessed or changed any such data.
35. Security system. Administer role-based security on all data and metadata in the ETL pipeline. Prove that a version of a module hasn't been changed. Show who has made changes.
36. Backup system. Back up data and metadata for recovery, restart, security, and compliance requirements.
37. Metadata repository manager. Comprehensive system for capturing and maintaining all ETL metadata, including all transformation logic. Includes process metadata, technical metadata, and business metadata.
38. Project management system. Comprehensive system for keeping track of all ETL development.
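To make subsystem 2 concrete, here is a minimal Python sketch of CRC-based record comparison. The table, column, and key names are hypothetical, and a real system would persist the checksums between loads; the point is only to show how a checksum per record lets the ETL system classify incoming rows as new, changed, or unchanged without rereading the whole source history.

```python
import zlib

def record_crc(record, columns):
    """Compute a CRC over the relevant columns of one source record."""
    payload = "|".join(str(record.get(col, "")) for col in columns)
    return zlib.crc32(payload.encode("utf-8"))

def detect_changes(incoming, previous_crcs, key_col, columns):
    """Classify incoming records against the CRCs captured on the prior load.

    previous_crcs maps natural key -> CRC from the previous extract.
    """
    new, changed, unchanged = [], [], []
    for rec in incoming:
        key = rec[key_col]
        crc = record_crc(rec, columns)
        if key not in previous_crcs:
            new.append(rec)
        elif previous_crcs[key] != crc:
            changed.append(rec)
        else:
            unchanged.append(rec)
    return new, changed, unchanged

# Hypothetical usage: customer records keyed by customer_id.
previous = {101: zlib.crc32(b"Alice|Austin")}
incoming = [
    {"customer_id": 101, "name": "Alice", "city": "Dallas"},   # changed
    {"customer_id": 102, "name": "Bob",   "city": "Houston"},  # new
]
new, changed, unchanged = detect_changes(
    incoming, previous, "customer_id", ["name", "city"])
print(len(new), len(changed), len(unchanged))   # 1 1 0
```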
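Subsystem 6, the audit dimension assembler, packages facts about the load itself as an ordinary dimension row so that every fact row can point back at the circumstances of its own load. A minimal sketch follows; the field names (source_system, batch_id, quality_screens_failed) are illustrative assumptions, not a prescribed schema.

```python
from datetime import datetime, timezone
from itertools import count

_audit_keys = count(1)
audit_dimension = []   # stands in for the audit dimension table

def assemble_audit_row(source_system, batch_id, quality_screens_failed):
    """Create one audit dimension row describing a single fact table load."""
    row = {
        "audit_key": next(_audit_keys),
        "load_timestamp": datetime.now(timezone.utc).isoformat(),
        "source_system": source_system,
        "batch_id": batch_id,
        "quality_screens_failed": quality_screens_failed,
    }
    audit_dimension.append(row)
    return row["audit_key"]

def attach_audit_key(fact_rows, audit_key):
    """Stamp every fact row from this load with the same audit foreign key."""
    return [dict(r, audit_key=audit_key) for r in fact_rows]

# Hypothetical usage for one load.
key = assemble_audit_row("orders_erp", "load_2007_09_12", quality_screens_failed=2)
facts = attach_audit_key([{"order_id": "A1", "amount": 9.99}], key)
print(facts[0]["audit_key"], audit_dimension[0]["source_system"])
```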
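Subsystems 7 and 8 work as a pair: quality screens run in line with the data flow and write their failures to an error event structure that drives reporting and branching. A minimal sketch, assuming two invented screens (a not-null check and a range check) and a simple severity/reject policy:

```python
from datetime import datetime, timezone

# Each screen returns None on success or an error description on failure.
def screen_not_null(row, column):
    if row.get(column) in (None, ""):
        return f"{column} is null or empty"

def screen_in_range(row, column, lo, hi):
    value = row.get(column)
    if value is None or not (lo <= value <= hi):
        return f"{column}={value} outside [{lo}, {hi}]"

# Hypothetical screen registry: (screen callable, severity, reject the row?)
SCREENS = [
    (lambda r: screen_not_null(r, "order_id"),          "fatal",   True),
    (lambda r: screen_in_range(r, "quantity", 1, 1000), "warning", False),
]

def run_screens(rows, batch_id):
    """Apply every screen to every row; return (clean rows, error events)."""
    clean, error_events = [], []
    for row in rows:
        rejected = False
        for screen, severity, reject in SCREENS:
            message = screen(row)
            if message:
                error_events.append({
                    "batch_id": batch_id,
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                    "severity": severity,
                    "message": message,
                    "row": row,
                })
                rejected = rejected or reject
        if not rejected:
            clean.append(row)
    return clean, error_events

clean, errors = run_screens(
    [{"order_id": "A1", "quantity": 5}, {"order_id": "", "quantity": 2000}],
    batch_id="load_2007_09_12")
print(len(clean), len(errors))   # 1 2
```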
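For subsystem 10, the essential decision is whether an attribute change overwrites the current dimension row (Type 1) or closes it out and opens a new row under a new surrogate key (Type 2). Below is a minimal in-memory sketch; the column names (surrogate_key, start_date, end_date) and the single-table representation are assumptions for illustration, and Type 3 (creating a new field) is omitted.

```python
from datetime import date

FAR_FUTURE = date(9999, 12, 31)   # open-ended end date marks the current row

def apply_scd(dim_rows, incoming, natural_key, type1_cols, type2_cols,
              next_key, as_of):
    """Apply Type 1 (overwrite) and Type 2 (new row) changes to one dimension.

    dim_rows: the dimension table; rows with end_date == FAR_FUTURE are current.
    next_key: callable returning the next surrogate key (subsystem 9's job).
    """
    current = {r[natural_key]: r for r in dim_rows if r["end_date"] == FAR_FUTURE}
    for rec in incoming:
        row = current.get(rec[natural_key])
        if row is None:
            # Brand-new member: insert its first, current row.
            row = dict(rec, surrogate_key=next_key(),
                       start_date=as_of, end_date=FAR_FUTURE)
            dim_rows.append(row)
            current[rec[natural_key]] = row
            continue
        # Type 1: overwrite in place; no history is kept for these columns.
        for col in type1_cols:
            row[col] = rec[col]
        # Type 2: if any tracked column changed, close the old row, open a new one.
        if any(row[col] != rec[col] for col in type2_cols):
            row["end_date"] = as_of
            new_row = dict(rec, surrogate_key=next_key(),
                           start_date=as_of, end_date=FAR_FUTURE)
            dim_rows.append(new_row)
            current[rec[natural_key]] = new_row
    return dim_rows

# Hypothetical usage: phone is Type 1, city is Type 2.
keys = iter(range(1, 1000))
dim = []
apply_scd(dim, [{"customer_id": 101, "phone": "555-0001", "city": "Austin"}],
          "customer_id", ["phone"], ["city"], lambda: next(keys), date(2007, 1, 1))
apply_scd(dim, [{"customer_id": 101, "phone": "555-0002", "city": "Dallas"}],
          "customer_id", ["phone"], ["city"], lambda: next(keys), date(2007, 6, 1))
print(len(dim))   # 2: the closed Austin row plus the current Dallas row
```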
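Subsystem 19 is at heart a lookup step: each natural key on an incoming fact row is exchanged for the surrogate key of the corresponding current dimension row before loading. The sketch below is single-threaded and in-memory (a real pipeline would be multithreaded, as the article notes, and would hand unresolved keys to the late-arriving handlers, subsystems 11 and 20); the dimension and column names are hypothetical.

```python
def build_key_map(dim_rows, natural_key):
    """Natural key -> surrogate key for the current rows of one dimension."""
    return {r[natural_key]: r["surrogate_key"] for r in dim_rows}

def surrogate_key_pipeline(fact_rows, lookups):
    """Replace natural keys on fact rows with data warehouse surrogate keys.

    lookups maps a fact column holding a natural key to a pair of
    (key map for that dimension, name of the surrogate key column to emit).
    Rows with unresolved keys are set aside rather than loaded.
    """
    resolved, unresolved = [], []
    for row in fact_rows:
        out, ok = dict(row), True
        for fact_col, (key_map, surrogate_col) in lookups.items():
            surrogate = key_map.get(row[fact_col])
            if surrogate is None:
                ok = False
                break
            out[surrogate_col] = surrogate
            del out[fact_col]
        if ok:
            resolved.append(out)
        else:
            unresolved.append(row)
    return resolved, unresolved

# Hypothetical usage with a tiny customer dimension.
customer_dim = [{"customer_id": 101, "surrogate_key": 1}]
facts = [{"customer_id": 101, "amount": 9.99},
         {"customer_id": 999, "amount": 5.00}]   # 999 not yet in the dimension
resolved, unresolved = surrogate_key_pipeline(
    facts,
    {"customer_id": (build_key_map(customer_dim, "customer_id"), "customer_key")})
print(resolved)     # [{'amount': 9.99, 'customer_key': 1}]
print(unresolved)   # [{'customer_id': 999, 'amount': 5.0}]
```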
If you've survived to the end of this list, congratulations! Here are the important observations I'd like you to carry away: It's really difficult to argue that any of these subsystems is unnecessary, and the list makes it clear that without dividing up the task (perhaps 38 ways), the descent into chaos is inevitable. The industry is ready to define best-practices goals and implementation standards for each of these 38 subsystems, and it would be a tremendous contribution for the ETL tool vendors to provide wizards or serious templates for each of them. We have a lot to talk about. Maybe 38 more columns!
Ralph Kimball, founder of the Kimball Group,
teaches dimensional data warehouse design through Kimball University
and critically reviews large data warehouse projects. He has four
best-selling data warehousing books in print, including the newly
released The Data Warehouse ETL Toolkit (Wiley, 2004).