Historical Perspective The Relational Model revolutionized transaction processing systems DBMS gave...

  • date post

    21-Dec-2015


Historical Perspective

The Relational Model revolutionized transaction processing systems
• DBMS gave access to the data stored
• OLTP systems are good at putting data into databases
The data explosion
• Increase in use of electronic data gathering devices, e.g. point-of-sale, remote sensing devices etc.
• Data storage became easier and cheaper with increasing computing power

Problems

DBMS gave access to the data stored but no analysis of data
Analysis is required to unearth the hidden relationships within the data, i.e. for decision support
Size of databases has increased, e.g. VLDBs; automated techniques for analysis are needed as databases have grown beyond manual extraction
Obstacles
• the typical scientific user knew nothing of commercial business applications
• the business database programmers knew nothing of massively parallel principles
• the solution was for database software producers to create easy-to-use tools and form strategic relationships with hardware manufacturers

What is data mining?

The non-trivial extraction of implicit, previously unknown, and potentially useful information from data

William J. Frawley, Gregory Piatetsky-Shapiro and Christopher J. Matheus

Data mining is the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. The computer is responsible for finding the patterns by identifying the underlying rules and features in the data. It is possible to 'strike gold' in unexpected places, as the data mining software extracts patterns that were not previously discernible, or that were so obvious that no-one had noticed them before.

Mining analogy: large volumes of data are sifted in an attempt to find something worthwhile, just as in a mining operation large amounts of low-grade material are sifted through in order to find something of value.

Books:
• Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001, ISBN 1-55860-489-8.
• Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 1999, ISBN 1-55860-552-5.


Data Mining vs. DBMS

DBMS - queries based on the data held, e.g.
• last month's sales for each product
• sales grouped by customer age etc.
• list of customers who lapsed their policy

Data Mining - infer knowledge from the data held to answer queries, e.g.
• what characteristics do customers who lapsed their policies share, and how do they differ from those who renewed their policies?
• why is the Cleveland division so profitable?

Characteristics of a data mining system

Large quantities of data
• volume of data so great it has to be analyzed by automated techniques, e.g. POS, satellite information, credit card transactions etc.

Noisy, incomplete data
• imprecise data is characteristic of all data collection
• databases are usually contaminated by errors, so one cannot assume that the data they contain is entirely correct, e.g. some attributes rely on subjective or measurement judgments

Complex data structure - conventional statistical analysis not possible

Heterogeneous data stored in legacy systems

Who needs data mining?

Who(ever) has information fastest and uses it wins

Don McKeough, former president of Coca-Cola


Data Mining Applications

Medicine - drug side effects, hospital cost analysis, genetic sequence analysis, prediction etc.
Finance - stock market prediction, credit assessment, fraud detection etc.
Marketing/sales - product analysis, buying patterns, sales prediction, target mailing, identifying 'unusual behavior' etc.
Knowledge Acquisition
• expert systems are models of real world processes
• much of the information is available straight from the process, e.g. in production systems, data is collected for monitoring the system
• knowledge can be extracted using data mining tools
• experts can verify the knowledge
Engineering - automotive diagnostic expert systems, fault detection etc.

Data Mining Goals

Classification
• DM system learns from examples or the data how to partition or classify the data, i.e. it formulates classification rules
• Example - customer database in a bank
• Question - is a new customer applying for a loan a good investment or not?
• Typical rule formulated: if STATUS = married and INCOME > 10000 and HOUSE_OWNER = yes then INVESTMENT_TYPE = good
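
The typical rule above can be written out as a small function. This is only a sketch of applying one already-formulated rule to a new record, not of how a DM system would learn it. The function name, the dictionary layout, and the "unknown" label for records that fail the rule are assumptions added for illustration.

```python
def classify_investment(customer):
    """Apply the example rule from the slide to one customer record."""
    if (customer["STATUS"] == "married"
            and customer["INCOME"] > 10000
            and customer["HOUSE_OWNER"] == "yes"):
        return "good"
    # Assumption: the slide states no rule for other customers,
    # so they are left unclassified here.
    return "unknown"


applicant = {"STATUS": "married", "INCOME": 25000, "HOUSE_OWNER": "yes"}
print(classify_investment(applicant))
```

In a real system the rule would be one of many produced from training examples, and each would be checked against held-out records before being trusted.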

Association
• Rules that associate one attribute of a relation to another
• Set-oriented approaches are the most efficient means of discovering such rules
• Example - supermarket database
• 72% of all the records that contain items A and B also contain item C
• the specific percentage of occurrences, 72, is the confidence factor of the rule
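
The confidence factor can be computed directly from its definition: of the records that contain the antecedent items, the fraction that also contain the consequent. The sketch below assumes each record is a set of item names. The function name and the six sample baskets are made up for illustration and give 75% rather than the 72% quoted above.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of records containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0  # rule never applies, so report zero confidence
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    return len(with_both) / len(with_antecedent)


# Hypothetical supermarket baskets.
baskets = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B"},
]
print(confidence(baskets, {"A", "B"}, {"C"}))
```

Four baskets contain both A and B, and three of those also contain C, so the rule "A and B implies C" has a confidence of 3/4 on this sample.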

Sequence/Temporal
• Sequential pattern functions analyze collections of related records and detect frequently occurring patterns over a period of time
• The difference between sequence rules and other rules is the temporal factor
• Example - retailer's database
• Can be used to discover the set of purchases that frequently precedes the purchase of a microwave oven

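
A much simplified version of the microwave oven example: for every customer who bought the target item, count which other items appear earlier in that customer's purchase history. This counts single items rather than sets of purchases, and the function name, customer ids and histories are hypothetical.

```python
from collections import Counter


def items_preceding(histories, target):
    """Count, per item, how many customers bought it before first buying `target`."""
    counts = Counter()
    for purchases in histories.values():
        if target not in purchases:
            continue
        before = purchases[:purchases.index(target)]
        counts.update(set(before))  # count each item once per customer
    return counts


# Hypothetical time-ordered purchase histories, keyed by customer id.
histories = {
    "c1": ["kettle", "toaster", "microwave oven"],
    "c2": ["toaster", "microwave oven", "kettle"],
    "c3": ["kettle", "toaster"],
    "c4": ["fridge", "toaster", "microwave oven"],
}
print(items_preceding(histories, "microwave oven").most_common(1))
```

The order of the purchases is what distinguishes this from the association example: the kettle bought by c2 after the microwave oven is not counted.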

Data Mining and Machine Learning

Data Mining (DM) or Knowledge Discovery in Databases (KDD) is about finding understandable knowledge
Machine Learning (ML) is concerned with improving performance of an agent
• training a neural network to balance a pole is part of ML, but not of KDD
Efficiency and scalability of the algorithm are more important in DM or KDD
• DM is concerned with very large, real-world databases
• ML typically looks at smaller data sets
ML has laboratory type examples for the training set
DM deals with 'real world' data. Real world data tend to have problems such as:
• missing values
• dynamic data
• noise

Statistical Data Analysis
• Ill-suited for nominal and structured data types
• Completely data driven - incorporation of domain knowledge not possible
• Interpretation of results is difficult and daunting
• Requires expert user guidance


Stages of the Data Mining Process

Data pre-processing
• heterogeneity resolution
• data cleansing
• data warehousing

Applying Data Mining Tools: extraction of patterns from the pre-processed data

Interpretation and evaluation: the user bias can direct DM tools to areas of interest
• attributes of interest in databases
• goal of discovery
• domain knowledge
• prior knowledge or belief about the domain

Techniques

Machine Learning methods

Statistics: can be used in several data mining stages
• data cleansing, i.e. the removal of erroneous or irrelevant data
• EDA (exploratory data analysis), e.g. frequency counts, histograms etc.
• data selection - sampling facilities reduce the scale of computation
• attribute re-definition
• data analysis - measures of association and relationships between attributes, interestingness of rules, classification etc.
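
Two of the statistical stages listed above, data cleansing and EDA frequency counts, can be sketched in a few lines. The records, attribute names and helper functions are hypothetical, and treating a missing value as "erroneous" is an assumption made only for this example.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
records = [
    {"age_band": "18-30", "policy": "renewed"},
    {"age_band": "31-50", "policy": "lapsed"},
    {"age_band": None, "policy": "renewed"},
    {"age_band": "18-30", "policy": "lapsed"},
    {"age_band": "18-30", "policy": "renewed"},
]


def cleanse(rows, required):
    """Data cleansing: drop rows that lack a value for any required attribute."""
    return [r for r in rows if all(r.get(k) is not None for k in required)]


def frequency(rows, attribute):
    """EDA: frequency count of the values taken by one attribute."""
    return Counter(r[attribute] for r in rows)


clean = cleanse(records, ["age_band", "policy"])
print(frequency(clean, "age_band"))
```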

Visualization: enhances EDA, makes patterns more visible

Clustering (Cluster Analysis)
• Clustering and segmentation is basically partitioning the database so that each partition or group is similar according to some criteria or metric
• Clustering according to similarity is a concept which appears in many disciplines, e.g. in chemistry the clustering of molecules
• Data mining applications make use of clustering according to similarity, e.g. to segment a client/customer base
• It provides sub-groups of a population for further analysis or action - very important when dealing with very large databases

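
As one concrete instance of partitioning by a similarity metric, the sketch below runs a basic k-means loop on a single numeric attribute. The slides do not name a clustering algorithm, so k-means is a choice made here for illustration. The function name, the starting centers and the spend figures are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Basic k-means on one numeric attribute, starting from the given centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the group of its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (a center with an empty group stays where it is).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical annual spend per customer.
annual_spend = [120, 150, 130, 900, 950, 1000]
centers, groups = kmeans_1d(annual_spend, centers=[100, 1000])
print(groups)
```

Here the metric is simply the absolute difference in spend. Segmenting a real customer base would normally use several attributes and a distance defined over all of them.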

Knowledge Representation Methods

Neural Networks
• a trained neural network can be thought of as an "expert" in the category of information it has been given to analyze
• provides projections given new situations of interest and answers "what if" questions
• problems include:
  • the resulting network is viewed as a black box
  • no explanation of the results is given, i.e. difficult for the user to interpret the results
  • difficult to incorporate user intervention
  • slow to train due to their iterative nature

Decision trees
• used to represent knowledge
• built using a training set of data and can then be used to classify new objects
• problems are:
  • opaque structure - difficult to understand
  • missing data can cause performance problems
  • they become cumbersome for large data sets

Rules
• probably the most common form of representation
• tend to be simple and intuitive
• unstructured and less rigid
• problems are:
  • difficult to maintain
  • inadequate to represent many types of knowledge
• Example format: if X then Y


Related Technologies: Data Warehousing

Definition
• A data warehouse can be defined as any centralized data repository which can be queried for business benefit
• Warehousing makes it possible to:
  • extract archived operational data
  • overcome inconsistencies between different legacy data formats
  • integrate data throughout an enterprise, regardless of location, format, or communication requirements
  • incorporate additional or expert information

Characteristics of a data warehouse
• subject-oriented - data organized by subject instead of application, e.g.
  • an insurance company would organize its data by customer, premium, and claim, instead of by different products (auto, life, etc.)
  • contains only the information necessary for decision support processing
• integrated - encoding of data is often inconsistent, e.g. gender might be coded as "m" and "f" or 0 and 1, but when data are moved from the operational environment into the data warehouse they assume a consistent coding convention
• time-variant - the data warehouse is a place for storing data that are five to 10 years old, or older
  • this data is used for comparisons, trends, and forecasting
  • these data are not updated
• non-volatile
  • data are not updated or changed in any way once they enter the data warehouse
  • data are only loaded and accessed
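
The "integrated" characteristic can be illustrated with the gender coding example: each operational source keeps its own codes, and rows are translated to one convention as they are loaded. The source system names, the choice of "M"/"F" as the warehouse convention, and the assumption that 0 means male and 1 means female are all hypothetical.

```python
# Hypothetical per-source code tables mapping to one warehouse convention.
SOURCE_GENDER_MAPS = {
    "policy_system": {"m": "M", "f": "F"},
    "claims_system": {0: "M", 1: "F"},  # assumption: 0 = male, 1 = female
}


def to_warehouse_row(source, row):
    """Return a copy of `row` with gender recoded to the warehouse convention."""
    cleaned = dict(row)  # leave the operational record untouched
    cleaned["gender"] = SOURCE_GENDER_MAPS[source][row["gender"]]
    return cleaned


warehouse = [
    to_warehouse_row("policy_system", {"customer": 1, "gender": "f"}),
    to_warehouse_row("claims_system", {"customer": 2, "gender": 1}),
]
print(warehouse)
```

A code that is missing from the table raises a `KeyError`, which is one simple way to surface dirty source data instead of loading it silently.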


Data warehousing Processes

• insulate data - i.e. the current operational information
  • preserves the security and integrity of mission-critical OLTP applications
  • gives access to the broadest possible base of data
• retrieve data - from a variety of heterogeneous operational databases
  • data is transformed and delivered to the data warehouse/store based on a selected model (or mapping definition)
  • metadata - information describing the model and definition of the source data elements
• data cleansing - removal of certain aspects of operational data, such as low-level transaction information, which slow down the query times
• transfer - processed data transferred to the data warehouse, a large database on a high performance box


Criteria for a data warehouse

Load Performance
• require incremental loading of new data on a periodic basis
• must not artificially constrain the volume of data
Load Processing
• data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update
Data Quality Management
• ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size
Query Performance
• must not be slowed or inhibited by the performance of the data warehouse RDBMS
Terabyte Scalability
• data warehouse sizes are growing at astonishing rates, so the RDBMS must not have any architectural limitations
• it must support modular and parallel management
Mass User Scalability
• access to warehouse data must not be limited to the elite few
• has to support hundreds, even thousands, of concurrent users while maintaining acceptable query performance
Networked Data Warehouse
• data warehouses rarely exist in isolation; users must be able to look at and work with multiple warehouses from a single client workstation
Warehouse Administration
• large scale and time-cyclic nature of the data warehouse demands administrative ease and flexibility
The RDBMS must Integrate Dimensional Analysis
• dimensional support must be inherent in the warehouse RDBMS to provide the highest performance for relational OLAP tools
Advanced Query Functionality
• end users require advanced analytic calculations, sequential and comparative analysis, and consistent access to detailed and summarized data


Data warehousing vs. OLTP

OLTP systems are designed to maximize transaction capacity but they:
• cannot be repositories of facts and historical data for business analysis
• cannot quickly answer ad hoc queries
  • rapid retrieval is almost impossible
• data is inconsistent and changing, duplicate entries exist, entries can be missing
• OLTP offers large amounts of raw data which is not easily understood
Typical OLTP query is a simple aggregation, e.g. what is the current account balance for this customer?
Data warehouses are interested in query processing as opposed to transaction processing
Typical business analysis query, e.g. which product line sells best in middle-America and how does this correlate to demographic data?

OLAP (On-line Analytical processing)

• Problem is how to process larger and larger databases
• OLAP involves many data items (many thousands or even millions) which are involved in complex relationships
• Fast response is crucial in OLAP
Difference between OLAP and OLTP
• OLTP servers handle mission-critical production data accessed through simple queries
• OLAP servers handle management-critical data accessed through an iterative analytical investigation

OLAP operations

Consolidation - involves the aggregation of data, i.e. simple roll-ups or complex expressions involving inter-related data
• e.g. sales offices can be rolled-up to districts and districts rolled-up to regions
Drill-Down - can go in the reverse direction, i.e. automatically display the detail data which makes up the consolidated data
"Slicing and Dicing" - ability to look at the database from different viewpoints
• e.g. one slice of the sales database might show all sales of product type within regions; another slice might show all sales by sales channel within each product type
• often performed along a time axis in order to analyze trends and find patterns
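
The three operations can be sketched over a small in-memory sales table. This shows only the logic, not how an OLAP server stores or indexes its data. The rows, the column names and the three helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: office -> district -> region hierarchy plus product type.
sales = [
    {"office": "A", "district": "D1", "region": "North", "product": "auto", "amount": 100},
    {"office": "B", "district": "D1", "region": "North", "product": "life", "amount": 50},
    {"office": "C", "district": "D2", "region": "North", "product": "auto", "amount": 70},
    {"office": "D", "district": "D3", "region": "South", "product": "life", "amount": 30},
]


def roll_up(rows, level):
    """Consolidation: total the amounts at one level of the hierarchy."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[level]] += r["amount"]
    return dict(totals)


def drill_down(rows, level, value, detail_level):
    """Drill-down: show the detail totals that make up one consolidated figure."""
    return roll_up([r for r in rows if r[level] == value], detail_level)


def slice_by(rows, **fixed):
    """Slice: keep only the rows matching fixed values on the given dimensions."""
    return [r for r in rows if all(r[k] == v for k, v in fixed.items())]


print(roll_up(sales, "region"))
print(drill_down(sales, "region", "North", "district"))
print(roll_up(slice_by(sales, product="auto"), "region"))
```

Rolling up by region gives one total per region, drilling down into North splits that total by district, and slicing on product "auto" before rolling up gives the same regional view for one product type only.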
