Data Mashups Defined and the Differences from Traditional Data Integration Approaches
Byron Igoe
Product Manager
InetSoft Technology
for the Minnesota Chapter of The Data Management Association
2
Presentation Outline
I. Traditional Data Integration
a. ETL & EII
b. Spreadmarts
II. Meaning and Origins of Data Mashup
a. In-Memory Data Federation
b. Combining Formal and Informal Data Sources
c. Differences from Traditional Techniques
III. Data Management and Data Mashup
a. Data Warehousing
b. Meta Data
c. Data Governance
d. Enterprise Content Management
e. Data Modeling
3
Traditional Data Integration: ETL
Extract, Transform and Load
a well-understood convention for preparing data for analysis
reasons for being:
reorganization
conversion
cleansing
mapping
pre-calculations of business metrics
transformations
aggregations
save processing resources during analyses
ensure data quality
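As an illustration of the steps listed above, a minimal Python sketch of one extract-transform-load pass might look like the following. The record layout, the `region_map` lookup and the in-memory `warehouse` dict are hypothetical stand-ins, not any particular product's API.

```python
from collections import defaultdict

# Hypothetical raw rows as they might arrive from a source system.
raw_rows = [
    {"region": "mn ", "amount": "100.50"},
    {"region": "MN", "amount": "49.50"},
    {"region": "wi", "amount": "n/a"},   # dirty value, dropped during cleansing
    {"region": "WI", "amount": "25"},
]

# Hypothetical mapping from source codes to warehouse names.
region_map = {"MN": "Minnesota", "WI": "Wisconsin"}


def extract(rows):
    """Extract: read rows from the source (here, just a list)."""
    return list(rows)


def transform(rows):
    """Cleanse, convert, map and pre-aggregate sales by region."""
    totals = defaultdict(float)
    for row in rows:
        code = row["region"].strip().upper()   # cleansing
        try:
            amount = float(row["amount"])      # conversion
        except ValueError:
            continue                           # skip unparseable amounts
        totals[region_map[code]] += amount     # mapping + aggregation
    return dict(totals)


def load(totals, warehouse):
    """Load: store the pre-calculated metric in the target."""
    warehouse["sales_by_region"] = totals


warehouse = {}
load(transform(extract(raw_rows)), warehouse)
```

Reports then read `warehouse["sales_by_region"]` instead of re-scanning the raw rows, which is where the processing savings come from.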
4
ETL (continued)
Data warehousing trends
growth in number of data sources
range of 3 to 30 “official” data sources currently
users desire to use data sources discovered via the Web
using reports or feeds from vendors & partners
growth in data¹
Annual global data production: 5 exabytes
5,000,000,000,000,000,000 – 18 zeroes
Equivalent of 37K US Libraries of Congress
Almost 1 GB per person on earth
Growing at 30% per year
1 zettabyte by 2010 – 21 zeroes
what are the data sizes and growth rates at your enterprise?
¹Source: UC Berkeley study, 2003
5
ETL (continued)
Limitations and challenges of traditional ETL & data warehousing
cumbersome to add data sources
bottleneck for ever increasing user demands
overkill for some data sources, especially transient ones
rigidity of business metric definitions
inflexibility to process changes
lag in data availability
6
Traditional Data Integration: EII
Enterprise Information Integration
same principle as ETL, creating a single data source from many
arose from data warehouse’s limitation of data timeliness
difference from data warehousing: a virtual data warehouse
benefits:
data is "real-time"
more adaptable to changes in definitions/processes
limitations:
bottlenecks and slow turnaround time to incorporate changes to definitions and processes
still relies on IT efforts to respond to demands
7
Spreadmarts
The “bane” of the business intelligence specialist!
the use of spreadsheets to store copies of enterprise data
arose from users’ frustrations with
lack of any business intelligence front-end application, or
too-hard-to-use versions of early (and some current) applications
graphical charting limitations of a BI app
tedious change request form processes
slow turnaround times to change requests
not having a way to bring in external data
8
Spreadmarts (continued)
The current position in business intelligence
now BI vendors and enterprises are learning to accept the spreadsheet as a very user-friendly tool
but still aim to rein in the use of spreadmarts per se because they:
are error prone
institutionalize labor inefficiency
can become corrupted
have data size limitations
are not ideal for sharing
keep knowledge “locked up”
don’t have governance controls
violate Sarbanes-Oxley requirements
in search of the “right” solution
9
Meaning and Origins of Data Mashup
A mashup is “the creation of a new work from two sources that were not initially designed to be combined”
first used in music in the early 2000s, especially rap music
next used in Web 2.0 environment, especially Web portals, like My Yahoo
next entered enterprise application space, limited to “screen scraping”
now we define “data mashup” as “data transformation and integration that can be done by users with minimal skills”
examples:
joining two datasets that weren’t previously combined
creating a new business metric on the fly
importing external or user-created data
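The three examples above can be sketched in a few lines of Python. This is only an illustration under assumed data: the `sales` and `ad_spend` lists, the `mash_up` helper and the `revenue_per_spend` metric are all hypothetical names.

```python
# Formal source: rows as they might come from an enterprise sales table.
sales = [
    {"product": "A", "revenue": 1200.0},
    {"product": "B", "revenue": 800.0},
]

# Informal source: a user-maintained list, e.g. typed into a spreadsheet.
ad_spend = [
    {"product": "A", "spend": 300.0},
    {"product": "B", "spend": 400.0},
]


def mash_up(left, right, key):
    """Join two lists of dicts on a shared key (inner join)."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


# Joining two datasets that were not previously combined.
combined = mash_up(sales, ad_spend, "product")

# New business metric created on the fly: revenue per dollar of ad spend.
for row in combined:
    row["revenue_per_spend"] = row["revenue"] / row["spend"]
```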
10
The Differences from Traditional Techniques
it’s the middle ground between “IT controlled” and “user defined”
“collaboration” is born
in the traditional models, IT defines how multiple sources are connected
painstaking process; especially for mergers, process changes, etc.
with data mashup, the connections are created on the fly
11
The Business Case Benefits of Mashups
Higher ROI on BI investment
higher success rate of deployment due to higher:
end-user satisfaction
usage rates
adoption rates
greater number of actionable learnings leading to:
more sales and/or
greater efficiency
increased speed of:
decisions
competitive responses
reactions to customer feedback
12
The Business Case Benefits of Mashups
Lower TCO
reduced personnel needed to support a BI solution
end-user self-service
save on change request processes
save on manpower to code requests
reduce report request backlog
reduced number of highly-skilled analysts or DBAs needed to satisfy business demands
end-users meet their own needs more often
13
The Advent of In-Memory Data Federation
Moore’s law: increasing power and falling costs of CPU & memory allow in-memory transformation, pre-aggregation and caching
this enables data mashup as well
14
The Trade-offs of these Techniques
Technique        Development Time  Development Skill  Latency  Performance  Adaptability
ETL              high              high               high     high         low
Data Federation  high              high               low      medium       low
Spreadsheet      low               low                high     low          high
Data Mashup      low               low                low      medium       high
15
Combining Formal and Informal Data Sources
how a data mashup works
similar to what a user is doing in Excel
creating new formulas
bringing in external data
doing what-if scenarios
live connections to the enterprise sources are maintained
data mashup "refreshes" automatically on each use
can save it to a shared folder for re-use and collaboration
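One way to picture the "live connection" idea is a mashup saved as a recipe rather than as copied data. The sketch below assumes a hypothetical `Mashup` class, a list standing in for an enterprise table, and a user-supplied what-if parameter; it is not how any specific tool is implemented.

```python
class Mashup:
    """A saved recipe: source fetchers plus a combine step, no copied data."""

    def __init__(self, fetchers, combine):
        self.fetchers = fetchers   # name -> zero-argument callable
        self.combine = combine     # function over the fetched results

    def run(self):
        # Re-query every source on each use, so results reflect current data.
        fetched = {name: fetch() for name, fetch in self.fetchers.items()}
        return self.combine(**fetched)


# Hypothetical "live" source: a list standing in for an enterprise table.
orders_table = [100, 250]

# Hypothetical user-supplied what-if scenario: 25% growth.
what_if = {"growth": 0.25}

projected = Mashup(
    fetchers={
        "orders": lambda: list(orders_table),
        "scenario": lambda: dict(what_if),
    },
    combine=lambda orders, scenario: sum(orders) * (1 + scenario["growth"]),
)

first = projected.run()
orders_table.append(150)    # the source changes after the mashup was saved
second = projected.run()    # the next use picks up the new row
```

Because only the recipe is stored, sharing the `projected` object shares the logic, not a stale copy of the numbers.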
16
Data Management and Data Mashup
Relative to Data Warehousing
data mashups can be seen as an expedient alternative to data warehousing in some cases
data mashup can be a precursor to data warehousing
allows quick and inexpensive experimentation
when satisfied, codify the mashup into a data warehouse for performance benefits
17
Data Management and Data Mashup
Impact on Pre-Aggregation
pre-aggregation improves downstream processing
with many traditional techniques:
pre-aggregations are designed before reports and dashboards
usage of pre-aggregated data is explicit
in the data mashup model, pre-aggregation can be built into the engine
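A rough sketch of pre-aggregation living inside the engine: the caller asks for totals and never names a pre-aggregated table. The `Engine` class, its `scans` counter and the sample rows are hypothetical, and the cache here is never invalidated, which a real engine would have to handle.

```python
from collections import defaultdict


class Engine:
    """Toy query engine that caches aggregates instead of exposing them."""

    def __init__(self, rows):
        self.rows = rows
        self._cache = {}   # (group_field, value_field) -> totals
        self.scans = 0     # counts full passes over the detail rows

    def total_by(self, group_field, value_field):
        key = (group_field, value_field)
        if key not in self._cache:
            # First request: scan the detail rows and remember the result.
            self.scans += 1
            totals = defaultdict(int)
            for row in self.rows:
                totals[row[group_field]] += row[value_field]
            self._cache[key] = dict(totals)
        return self._cache[key]


engine = Engine([
    {"region": "MN", "units": 3},
    {"region": "WI", "units": 2},
    {"region": "MN", "units": 4},
])

report_totals = engine.total_by("region", "units")      # scans the rows
dashboard_totals = engine.total_by("region", "units")   # served from cache
```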
18
Data Management and Data Mashup
Importance of Meta Data
creation of mashups depends on meta data: data type compatibility
transformation options, like grouping and aggregation, differ based on the field type
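To make the role of meta data concrete, here is a small sketch of how a tool might use field types to decide which joins and aggregations to offer. The type names, the `AGGREGATIONS` table and both helper functions are assumptions made for illustration.

```python
# Hypothetical field types, as a mashup tool might read them from its sources.
NUMERIC = {"integer", "decimal"}

# Hypothetical menu of aggregations offered per field type.
AGGREGATIONS = {
    "integer": ["count", "sum", "average", "min", "max"],
    "decimal": ["count", "sum", "average", "min", "max"],
    "date": ["count", "min", "max"],
    "string": ["count"],
}


def can_join(left_type, right_type):
    """Two fields are joinable if their types match or both are numeric."""
    return left_type == right_type or {left_type, right_type} <= NUMERIC


def aggregation_options(field_type):
    """Return the aggregations to offer the user for this field type."""
    return AGGREGATIONS.get(field_type, ["count"])
```

With these rules, a user dragging a string field onto a date field would be stopped before the join runs, and a string field would only be offered `count`.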
19
Data Management and Data Mashup
Relative to Data Governance
data mashups are a major improvement over spreadmarts
data quality is enhanced
live data is used
no copying & pasting
changes to master data mappings take effect immediately
data security is enhanced
security defined at source system level
all derived mashups automatically secured
overcome limitations of Excel’s security
concern: is it giving too much power to users?
no different from what users will inevitably do in Excel
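The idea that derived mashups are secured automatically can be sketched as a check against source-level rules. The `SOURCE_ACL` table, the role names and the `can_view` helper below are hypothetical.

```python
# Hypothetical source-level access rules: source name -> roles allowed to read.
SOURCE_ACL = {
    "sales_db": {"analyst", "manager"},
    "payroll_db": {"manager"},
}


def can_view(mashup_sources, role):
    """A mashup is visible only if the role may read every source it uses."""
    return all(role in SOURCE_ACL.get(src, set()) for src in mashup_sources)


# A mashup derived from both sources inherits the stricter rule.
labor_cost_per_sale = ["sales_db", "payroll_db"]
```

No separate permission is stored on the mashup itself, so tightening a rule at the source immediately applies to everything built on top of it.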
20
Data Management and Data Mashup
Relative to Enterprise Content Management
data mashups are re-usable & shareable
data integrity is always maintained
more easily embedded in other applications, portals
21
Data Management and Data Mashup
Relative to Data Modeling
data mashups are situated on top of various data sources
data mashups can use:
physical tables
pre-defined SQL, or
logical models
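The three kinds of input listed above might look like this against an in-memory SQLite database. The `orders` table, the `TOTALS_SQL` query and the `LOGICAL_MODEL` mapping are made-up examples, and SQLite merely stands in for whatever sources an enterprise actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (cust_id INTEGER, amt REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# 1. Physical table: read the table directly.
physical = conn.execute("SELECT cust_id, amt FROM orders").fetchall()

# 2. Pre-defined SQL: a query written ahead of time (name is hypothetical).
TOTALS_SQL = (
    "SELECT cust_id, SUM(amt) FROM orders GROUP BY cust_id ORDER BY cust_id"
)
predefined = conn.execute(TOTALS_SQL).fetchall()

# 3. Logical model: business-friendly names mapped onto physical columns.
LOGICAL_MODEL = {"Customer": "cust_id", "Order Amount": "amt"}


def query_logical(fields):
    """Translate business field names to columns and run the query."""
    columns = ", ".join(LOGICAL_MODEL[f] for f in fields)
    return conn.execute(f"SELECT {columns} FROM orders").fetchall()


logical = query_logical(["Order Amount"])
```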
22
Questions and Discussion