Data Mining: Concepts and Techniques, UNIT 1: Introduction

  • UNIT 1: Introduction

    Data Mining: Concepts and Techniques

  • Chapter 1. Introduction

    - Motivation: Why data mining?
    - What is data mining?
    - Data mining: On what kind of data?
    - Data mining functionality
    - Classification of data mining systems
    - Top-10 most popular data mining algorithms
    - Major issues in data mining
    - Overview of the course

  • Why Data Mining?

    - The explosive growth of data: from terabytes to petabytes
        - Data collection and data availability
            - Automated data collection tools, database systems, Web, computerized society
        - Major sources of abundant data
            - Business: Web, e-commerce, transactions, stocks
            - Science: remote sensing, bioinformatics, scientific simulation
            - Society and everyone: news, digital cameras, YouTube
    - We are drowning in data, but starving for knowledge!
    - "Necessity is the mother of invention": data mining, the automated analysis of massive data sets

  • Evolution of Sciences

    - Before 1600: empirical science
    - 1600-1950s: theoretical science
        - Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
    - 1950s-1990s: computational science
        - Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics).
        - Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
    - 1990-now: data science
        - The flood of data from new scientific instruments and simulations
        - The ability to economically store and manage petabytes of data online
        - The Internet and computing Grid that make all these archives universally accessible
        - Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
    - Source: Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002

  • Evolution of Database Technology

    - 1960s: Data collection, database creation, IMS and network DBMS
    - 1970s: Relational data model, relational DBMS implementation
    - 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)
    - 1990s: Data mining, data warehousing, multimedia databases, and Web databases
    - 2000s:
        - Stream data management and mining
        - Data mining and its applications
        - Web technology (XML, data integration) and global information systems

  • What Is Data Mining?

    - Data mining (knowledge discovery from data)
        - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
        - Data mining: a misnomer?
    - Alternative names
        - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
    - Watch out: Is everything "data mining"? The following are not:
        - Simple search and query processing
        - (Deductive) expert systems

  • Knowledge Discovery (KDD) Process

    - Data mining: the core of the knowledge discovery process
    - [Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
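    The KDD steps named on this slide (cleaning, integration, selection, mining, pattern evaluation) can be sketched as a chain of small functions. This is only an illustration: the in-memory lists standing in for databases, the record fields, the helper names, and the use of a plain average as the "mining" step are all assumptions, not part of the slides.

```python
# Hypothetical records from two sources; field names are illustrative only.
store_a = [
    {"customer": "c1", "item": "milk", "amount": 2.5},
    {"customer": "c2", "item": "bread", "amount": None},  # incomplete record
]
store_b = [
    {"customer": "c1", "item": "bread", "amount": 1.8},
    {"customer": "c3", "item": "milk", "amount": 2.4},
]


def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: merge several sources into one collection."""
    return [r for source in sources for r in source]


def select(records, item):
    """Selection: keep only the task-relevant data."""
    return [r for r in records if r["item"] == item]


def mine(records):
    """Data mining: a plain average, standing in for a real algorithm."""
    return sum(r["amount"] for r in records) / len(records)


def evaluate(pattern, threshold):
    """Pattern evaluation: keep the pattern only if it passes a simple test."""
    return pattern if pattern >= threshold else None


warehouse = integrate(clean(store_a), clean(store_b))
relevant = select(warehouse, "milk")
result = evaluate(mine(relevant), threshold=2.0)
print(result)
```

    Each stage only consumes the output of the previous one, which mirrors the left-to-right flow of the figure.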

  • Data Mining and Business Intelligence

    [Figure: pyramid of layers; potential to support business decisions increases toward the top]

    - Decision Making
    - Data Presentation: Visualization Techniques
    - Data Mining: Information Discovery
    - Data Exploration: Statistical Summary, Querying, and Reporting
    - Data Preprocessing/Integration, Data Warehouses
    - Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

    Roles shown alongside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA

  • Data Mining: Confluence of Multiple Disciplines

  • Why Not Traditional Data Analysis?

    - Tremendous amount of data
        - Algorithms must be highly scalable to handle volumes such as terabytes of data
    - High dimensionality of data
        - Micro-array data may have tens of thousands of dimensions
    - High complexity of data
        - Data streams and sensor data
        - Time-series data, temporal data, sequence data
        - Structured data, graphs, social networks and multi-linked data
        - Heterogeneous databases and legacy databases
        - Spatial, spatiotemporal, multimedia, text and Web data
        - Software programs, scientific simulations
    - New and sophisticated applications

  • Multi-Dimensional View of Data Mining

    - Data to be mined
        - Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
    - Knowledge to be mined
        - Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
        - Multiple/integrated functions and mining at multiple levels
    - Techniques utilized
        - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
    - Applications adapted
        - Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

  • Data Mining: Classification Schemes

    - General functionality (data mining tasks)
        - Descriptive data mining
        - Predictive data mining
    - Different views lead to different classifications
        - Data view: kinds of data to be mined
        - Knowledge view: kinds of knowledge to be discovered
        - Method view: kinds of techniques utilized
        - Application view: kinds of applications adapted

  • Predictive Tasks

    Objective: Predict the value of a specific attribute (the target, or dependent variable) based on the values of other attributes (the explanatory variables).

    Example: Judging whether a patient has a specific disease based on his or her medical test results.
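    A predictive task of the kind described here can be illustrated with a 1-nearest-neighbour classifier. This is a minimal sketch, not a method named in the slides: the two "test result" features, the labels, and the training values are invented for illustration.

```python
import math

# Hypothetical training data: (test_a, test_b) -> diagnosis label.
training = [
    ((1.0, 1.2), "healthy"),
    ((0.9, 1.0), "healthy"),
    ((3.1, 2.9), "disease"),
    ((2.8, 3.3), "disease"),
]


def predict(tests):
    """Return the label of the closest training example (1-nearest-neighbour)."""
    _, label = min(training, key=lambda pair: math.dist(pair[0], tests))
    return label


print(predict((1.1, 0.9)))
print(predict((3.0, 3.0)))
```

    The explanatory attributes are the two test values; the target attribute is the label copied from the nearest known patient.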

  • Descriptive Tasks

    Objective: Derive patterns (correlations, trends) that summarize the underlying relationships in the data.

    Example: Identifying web pages that are accessed together (a human-interpretable pattern).
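    The "pages accessed together" example can be sketched by counting how often each pair of pages appears in the same session. The session data and page paths below are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions: the set of pages each visitor opened.
sessions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
    {"/products", "/cart"},
]

pair_counts = Counter()
for pages in sessions:
    # Sort so that each pair is counted under one canonical ordering.
    for pair in combinations(sorted(pages), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

    The most frequent pairs are the descriptive pattern; nothing is predicted, the data is only summarized.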

  • Data Mining Models and Tasks

  • Data Mining Techniques

  • Basic Data Mining Tasks

    - Classification maps data into predefined groups or classes.
        - Supervised learning
        - Pattern recognition
        - Prediction
    - Regression is used to map a data item to a real-valued prediction variable.
    - Clustering groups similar data together into clusters.
        - Unsupervised learning
        - Segmentation
        - Partitioning
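    Of the three tasks above, regression is the easiest to show in a few lines. The sketch below fits a straight line by ordinary least squares; the choice of a linear model and the sample points are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(slope * 5.0 + intercept)  # real-valued prediction for an unseen x
```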

  • Basic Data Mining Tasks (contd)

    - Summarization maps data into subsets with associated simple descriptions.
        - Characterization
        - Generalization
    - Link analysis uncovers relationships among data.
        - Affinity analysis
        - Association rules
        - Sequential analysis determines sequential patterns.
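    Association rules are usually scored by support and confidence. The slide does not define these measures, so the sketch below uses their standard definitions; the market-basket transactions are hypothetical.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"bread", "milk"}))
print(confidence({"bread"}, {"milk"}))
```

    A rule such as bread -> milk would typically be reported only if both numbers clear user-chosen thresholds.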

  • Example: Time Series Analysis

    - Example: stock market
    - Predict future values
    - Determine similar patterns over time
    - Classify behavior
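    "Predict future values" can be illustrated with the simplest possible forecaster, a moving average over the most recent observations. This is one assumed technique among many, and the closing prices are made up.

```python
def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    recent = series[-window:]
    return sum(recent) / window


# Hypothetical daily closing prices.
closes = [10.0, 11.0, 12.0, 13.0, 14.0]
print(moving_average_forecast(closes, window=3))
```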

  • Data Mining: On What Kinds of Data?

    - Database-oriented data sets and applications
        - Relational database, data warehouse, transactional database
    - Advanced data sets and advanced applications
        - Data streams and sensor data
        - Time-series data, temporal data, sequence data (incl. bio-sequences)
        - Structured data, graphs, social networks and multi-linked data
        - Object-relational databases
        - Heterogeneous databases and legacy databases
        - Spatial data and spatiotemporal data
        - Multimedia databases
        - Text databases
        - The World Wide Web

  • Major Issues in Data Mining

    - Mining methodology
        - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
        - Performance: efficiency, effectiveness, and scalability
        - Pattern evaluation: the interestingness problem
        - Incorporation of background knowledge
        - Handling noise and incomplete data
        - Parallel, distributed and incremental mining methods
        - Integration of the discovered knowledge with existing knowledge: knowledge fusion
    - User interaction
        - Data mining query languages and ad-hoc mining
        - Expression and visualization of data mining results
        - Interactive mining of knowledge at multiple levels of abstraction
    - Applications and social impacts
        - Domain-specific data mining and invisible data mining
        - Protection of data security, integrity, and privacy

  • Why Data Mining? Potential Applications

    - Data analysis and decision support
        - Market analysis and management
            - Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
        - Risk analysis and management
            - Forecasting, customer retention, improved underwriting, quality control, competitive analysis
        - Fraud detection and detection of unusual patterns (outliers)
    - Other applications
        - Text mining (news groups, email, documents) and Web mining
        - Stream data mining
        - Bioinformatics and bio-data analysis

  • Architecture: Typical Data Mining System

  • Understanding the Term Data Warehousing

    Data warehouse: The term was coined by Bill Inmon in 1990, who defined it as follows: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". He defined the terms in the sentence as follows:

    - Subject-oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
    - Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
    - Time-variant: All data in the data warehouse is identified with a particular time period.
    - Non-volatile: Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
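    Two of the properties in this definition, time-variant and non-volatile, can be shown with a toy append-only store. The class name, method names, and sample figures are hypothetical; a real warehouse would of course sit on a DBMS.

```python
class Warehouse:
    """Toy append-only store: rows are stamped with a period and never removed."""

    def __init__(self):
        self._rows = []

    def load(self, period, subject, value):
        # Non-volatile: new data is only ever appended.
        self._rows.append({"period": period, "subject": subject, "value": value})

    def as_of(self, subject, period):
        # Time-variant: every answer is tied to a particular time period.
        return [
            r["value"]
            for r in self._rows
            if r["subject"] == subject and r["period"] == period
        ]


dw = Warehouse()
dw.load("2024-01", "sales", 120)
dw.load("2024-02", "sales", 135)
print(dw.as_of("sales", "2024-01"))
```

    Because nothing is overwritten, a query for an earlier period keeps returning the same answer after later loads.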

  • What Is Data Warehousing?

    A process of transforming data into information and making it available to users in a timely enough manner to make a difference

    [Forrester Research, April 1996]

  • Data Warehouse Architecture

  • Data Warehouse Architecture

  • Data Mining Works with Warehouse Data

    Data warehousing provides the enterprise with a memory

    Data mining provides the enterprise with intelligence

  • Rules of a Data Warehouse

    Data warehouse and operational environments are separated

    Data is integrated

    Contains historical data over a long period of time

    Data is snapshot data, captured at a given point in time

    Data is subject-oriented

  • Rules of Data Warehouse

    Mainly read-only with periodic batch updates

    Development life cycle has a data-driven approach versus the traditional process-driven approach

    Data contains several levels of detail: current, old, lightly summarized, highly summarized
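    The "levels of detail" rule can be pictured as successive roll-ups of the same facts. In this sketch the detailed rows, the monthly level (lightly summarized), and the yearly level (highly summarized) are illustrative assumptions about what each level might hold.

```python
from collections import defaultdict

# Hypothetical detailed facts: (date, amount).
detail = [
    ("2024-01-03", 10.0),
    ("2024-01-17", 20.0),
    ("2024-02-05", 5.0),
]

# Lightly summarized: totals per month.
by_month = defaultdict(float)
for date, amount in detail:
    by_month[date[:7]] += amount

# Highly summarized: totals per year, built from the monthly level.
by_year = defaultdict(float)
for month, total in by_month.items():
    by_year[month[:4]] += total

print(dict(by_month))
print(dict(by_year))
```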

  • Rules of Data Warehouse

    Environment is characterized by read-only transactions to very large data sets

    System that traces data sources, transformations, and storage

    Metadata is a critical component: source, transformation, integration, storage, relationships, history, etc.

    Contains a chargeback mechanism for resource usage that enforces optimal use of data by end users

  • Warehousing on the Web (Web Warehousing): Webhouse

  • Data Webhouse

    - Defined by Ralph Kimball
    - Two distinct focuses
        - Bringing the web to the warehouse
            - Clickstream data as a source of information
        - Bringing existing data warehouses to the web
            - Fully distributed environment

  • Required Capabilities

    - Capture clickstream logs and convert them to tables for analysis
    - Merge customer demographic and account information with the clickstream tables
    - Interpret customer paths through the website
    - Identify abandoned sessions
    - Use the DW to drive customer responses appearing on your website
    - DW querying and reporting available through web browsers
    - Attach multimedia to the DW
    - DW security
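    The first and fourth capabilities (logs to tables, abandoned sessions) can be sketched together. The log format below is a simplified invention, not one of the real web server formats, and "abandoned" is assumed to mean "reached the cart but never completed checkout".

```python
# Hypothetical, simplified log lines: "<timestamp> <session id> <path>".
raw_log = """\
2024-01-01T10:00:00 s1 /home
2024-01-01T10:01:00 s1 /cart
2024-01-01T10:02:00 s1 /checkout/complete
2024-01-01T10:05:00 s2 /home
2024-01-01T10:06:00 s2 /cart
"""

# Convert each log line into a row (a dict keyed by column name).
columns = ("timestamp", "session", "path")
rows = [dict(zip(columns, line.split())) for line in raw_log.splitlines()]

# Group the visited paths by session.
paths_by_session = {}
for row in rows:
    paths_by_session.setdefault(row["session"], []).append(row["path"])

# Assumption: abandoned = reached the cart but never completed checkout.
abandoned = [
    session
    for session, paths in paths_by_session.items()
    if "/cart" in paths and "/checkout/complete" not in paths
]
print(abandoned)
```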

  • Architecture: Web to Warehouse

    - Beyond a comprehensive snapshot of the business on a real-time basis, we also want knowledge of customer behavior
    - Extended design factors
        - Timeliness: real-time
        - Data volume: no upper limit
        - Response time: less than 10 seconds


  • Who Are Our Users?

    - Traditional
        - Power users: need database connectivity
        - Analysts: want to manipulate existing data
        - Report viewers: view standardized reports
    - Web
        - Our customers
        - Our business partners
        - Our employees


  • Clickstreams

    Clickstream, as defined by the Internet Advertising Bureau (IAB):

    The electronic path a user takes while navigating from site to site, and from page to page within a site. It is a comprehensive body of data describing the sequence of activity between a user's browser and any other Internet resource, such as a Web site or third-party ad server.

  • Clickstreams

    - Clickstream is not just another data source
        - Its distributed nature leads to multiple data sources, which require synchronization
        - Multiple parties
        - More than a dozen log file formats for capturing clickstream data
        - Search specification

    The basic form of clickstream data is stateless: the log shows isolated page retrieval events

    Clickstream data is anonymous

  • Clickstreams

    The clickstream post-processor receives raw log data from the web server and normalizes it into a format that can be combined with application-derived data for insertion into the DW
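    One thing a post-processor typically has to do with stateless page events is stitch them back into sessions. The sketch below does this with an inactivity timeout. The slides do not prescribe a method, so the 30-minute gap, the use of an IP address as the visitor key, and the event tuples are all assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical isolated page-retrieval events: (visitor key, timestamp, path).
events = [
    ("10.0.0.1", "2024-01-01T10:00:00", "/home"),
    ("10.0.0.1", "2024-01-01T10:05:00", "/products"),
    ("10.0.0.1", "2024-01-01T11:30:00", "/home"),
    ("10.0.0.2", "2024-01-01T10:02:00", "/home"),
]

TIMEOUT = timedelta(minutes=30)  # assumed inactivity gap that ends a session


def sessionize(events):
    """Group events per visitor and start a new session after a long gap."""
    sessions = []
    last_seen = {}  # visitor -> timestamp of that visitor's previous event
    current = {}    # visitor -> index of that visitor's open session
    for visitor, stamp, path in sorted(events, key=lambda e: (e[0], e[1])):
        when = datetime.fromisoformat(stamp)
        is_new = visitor not in last_seen or when - last_seen[visitor] > TIMEOUT
        if is_new:
            sessions.append({"visitor": visitor, "paths": []})
            current[visitor] = len(sessions) - 1
        sessions[current[visitor]]["paths"].append(path)
        last_seen[visitor] = when
    return sessions


for session in sessionize(events):
    print(session)
```

    Since clickstream data is anonymous, any visitor key of this kind is only an approximation of a real person.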

  • Why Bring the DW to the Web?

    The primary function of the DW is to publish information, and the web is a good partner for that

    A distributed DW is needed, and the web provides universal connectivity

    The web browser is a universal front end

  • Web Pushes Data Warehouse

    - User interface effectiveness is measurable
    - Queries and updates are mixed
    - Speed is expected: the 10-second rule
    - Global 24 x 7 availability is expected
        - International characters, dates, addresses
    - Expanded multimedia
        - Animation, zoomable images, maps, video clips
        - Need material in digital form
    - Enterprise information portal will require items to be searchable

  • Web Pushes Data Warehouse

    - Mass customization
        - Dynamically created web pages, XML
    - Fully distributed
        - Linking together all the data marts
    - Security and privacy
        - Publish only to those who need to know
        - User profiles and access profiles defined in one place
        - Full-time expert security person

  • Second Generation User Interface Guidelines

    - Near-instantaneous performance
    - Website design
        - Design for the lowest common denominator
        - Measure page performance on a continuous basis
        - Paint navigation buttons immediately
        - Disclose content progressively
        - Implement page caching
        - Cache data, reports
        - Improve web server bandwidth
        - Improve server throughput
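    The "cache data, reports" guideline can be illustrated with an in-process cache. This sketch uses Python's functools.lru_cache; the render_report function and its report identifiers are hypothetical, and a production site would more likely cache at the web server or a dedicated cache layer.

```python
from functools import lru_cache

render_count = 0  # counts how often the expensive path actually runs


@lru_cache(maxsize=128)
def render_report(report_id):
    """Hypothetical expensive report rendering; results are cached by report_id."""
    global render_count
    render_count += 1
    return f"<html>report {report_id}</html>"


render_report("sales-2024")
render_report("sales-2024")  # served from the cache, no second render
render_report("inventory")
print(render_count)
print(render_report.cache_info().hits)
```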

  • Second Generation User Interface Guidelines

    - Data Webhouse design
        - Adapt all web design responses
        - Select appropriate DBMS software: dimensional models, OLAP
        - Use indexes, aggregations
        - Partition files
        - Increase RAM
        - Use parallel processing

  • Meet User Expectations

    - Website design
        - Site navigation choices
        - Help choices
        - Communication with various groups: response must be assured
        - Headlines are serious and define content
        - Indicate off-screen material
        - Survey customer needs and wants

  • Meet User Expectations

    - Data Webhouse design
        - Report library
            - Folder of previous queries, reports
        - Dimension browser: viewing dimensions can assist report creation
        - Business metadata interface: understand the organization's data assets

  • Streamline Process

    - Business processes designed from the ground up to work seamlessly on the web
    - Website design
        - Reengineer to streamline the process and make navigation easier; uniform interfaces
        - Remove barriers to reaching a page
        - Minimize clicks and new windows
        - Allow interruption and return

  • Streamline Process

    - Data Webhouse design
        - Build an explicit value chain for reporting and analysis around the application suite using conformed dimensions and facts
        - Drill across functions
        - Single user interface for reporting against all parts of the business
        - Master report library and FAQs
        - Single login and single console access to the webhouse

  • Allow Problem Resolution

    - Website design
        - Allow backtracking, rollback, play forward
        - Keep old transactions
        - Easy error reporting
        - Acknowledge, track and follow up all user inputs; show wait time
        - Assist searching
    - Data Webhouse design
        - Provide adequate end-user support
        - Show aggregates in use and available
        - Show system load and percent completed

  • Build Trust

    - Clearly state and observe the website's policies for using the customer's identity
    - Website design
        - Do not abuse privacy
        - Link to the privacy statement
        - Use friendly pictures of people
        - Distinguish between ad content and editorial content

  • Build Trust

    - Data Webhouse design
        - Two-factor security
            - What you know: password
            - What you possess: token
        - Track changes in employee and contractor status
        - Create and enforce roles for employees, contractors and customers
        - Manage webhouse security directly

  • Provide Communication Hooks

    - Website design
        - Provide useful links to others, internal and external
        - Remove links that invalidate the back button
        - Use copyable URLs
        - Use the URL as a medium of distribution
