Introduction to Database

53
CENG 547 – Database Systems principles (Data Mining) June 15, 2022 Data Mining: Concepts and Techniques 1 TEXTBOOK: Data Mining: concepts and techniques, 3 rd edition

description

Presentation used for introduction to database

Transcript of Introduction to Database

  • CENG 547 Database Systems principles (Data Mining)*Data Mining: Concepts and Techniques*TEXTBOOK:Data Mining: concepts and techniques, 3rd edition

    Data Mining: Concepts and Techniques

  • Course structure*Data Mining: Concepts and Techniques*Lectures (mixed PPT, textbook)Labs ( WEKA tool, download it )

    5 quizzes per semester, grade on UMS (15%)Quiz target grade is 80 (you should talk to chair if you get below that)

    Assignments (10%) respect due datesMidterm (35%)Final (40%)

    Data Mining: Concepts and Techniques

  • **Data Mining: Concepts and Techniques (3rd ed.)

    Chapter 1 Lebanese International UniversityCENG 547

  • *Chapter 1. IntroductionWhy Data Mining?What Is Data Mining?A Multi-Dimensional View of Data MiningWhat Kinds of Data Can Be Mined?What Kinds of Patterns Can Be Mined?What Kinds of Technologies Are Used?What Kinds of Applications Are Targeted? Major Issues in Data MiningA Brief History of Data Mining and Data Mining SocietySummary

  • *Why Data Mining? The Explosive Growth of Data: from terabytes to petabytesData collection and data availabilityAutomated data collection tools, database systems, Web, computerized societyMajor sources of abundant data:Business: Web, e-commerce, transactions, stocks, Science: Remote sensing, bioinformatics, scientific simulation, Society and everyone: news, digital cameras, YouTube We are drowning in data, but starving for knowledge! Necessity is the mother of inventionData miningAutomated analysis of massive data sets

  • *Evolution of Sciences: New Data Science EraBefore 1600: Empirical science1600-1950s: Theoretical scienceEach discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding. 1950s-1990s: Computational scienceOver the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models. 1990-now: Data scienceThe flood of data from new scientific instruments and simulationsThe ability to economically store and manage petabytes of data onlineThe Internet and computing Grid that makes all these archives universally accessible Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumesData mining is a major new challenge!Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002

  • *Chapter 1. IntroductionWhy Data Mining?What Is Data Mining?A Multi-Dimensional View of Data MiningWhat Kinds of Data Can Be Mined?What Kinds of Patterns Can Be Mined?What Kinds of Technologies Are Used?What Kinds of Applications Are Targeted? Major Issues in Data MiningA Brief History of Data Mining and Data Mining SocietySummary

  • *What Is Data Mining?Data mining (knowledge discovery from data KDD) Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of dataData mining: a misnomer?Alternative namesKnowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.Watch out: Is everything data mining? Simple search and query processing (Deductive) expert systems

  • *Knowledge Discovery (KDD) ProcessThis is a view from typical database systems and data warehousing communities

    Data mining plays an essential role in the knowledge discovery processData CleaningData IntegrationDatabasesData WarehouseTask-relevant DataSelectionData MiningPattern EvaluationKnowledge

  • *Example: A Web Mining FrameworkWeb mining usually involvesData cleaningData integration from multiple sourcesWarehousing the dataData cube constructionData selection for data miningData miningPresentation of the mining resultsPatterns and knowledge to be used or stored into knowledge-base

  • *Data Mining in Business Intelligence Increasing potentialto supportbusiness decisionsEnd UserBusiness Analyst DataAnalystDBADecision MakingData PresentationVisualization TechniquesData MiningInformation DiscoveryData ExplorationStatistical Summary, Querying, and ReportingData Preprocessing/Integration, Data WarehousesData SourcesPaper, Files, Web documents, Scientific experiments, Database Systems

  • *KDD Process: A Typical View from ML and StatisticsInput DataData MiningData Pre-ProcessingPost-ProcessingThis is a view from typical machine learning and statistics communitiesPattern discoveryAssociation & correlationClassificationClusteringOutlier analysis Pattern information knowledge

  • *Chapter 1. IntroductionWhy Data Mining?What Is Data Mining?A Multi-Dimensional View of Data MiningWhat Kinds of Data Can Be Mined?What Kinds of Patterns Can Be Mined?What Kinds of Technologies Are Used?What Kinds of Applications Are Targeted? Major Issues in Data MiningA Brief History of Data Mining and Data Mining SocietySummary

  • *Multi-Dimensional View of Data MiningData to be minedDatabase data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networksKnowledge to be mined (or: Data mining functions)Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.Descriptive vs. predictive data mining Multiple/integrated functions and mining at multiple levelsTechniques utilizedData-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.Applications adaptedRetail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

  • *Chapter 1. IntroductionWhy Data Mining?What Is Data Mining?A Multi-Dimensional View of Data MiningWhat Kinds of Data Can Be Mined?What Kinds of Patterns Can Be Mined?What Kinds of Technologies Are Used?What Kinds of Applications Are Targeted? Major Issues in Data MiningA Brief History of Data Mining and Data Mining SocietySummary

  • *Data Mining: On What Kinds of Data?1 Database-oriented data2 Data Warehouse 3 Transactional Data

  • *Data Mining: On What Kinds of Data?1 Database-oriented data sets and applicationsRelational database, data warehouse, transactional database1.a Advanced data sets and advanced applications Data streams and sensor dataTime-series data, temporal data, sequence data (incl. bio-sequences) Structure data, graphs, social networks and multi-linked dataObject-relational databasesHeterogeneous databases and legacy databasesSpatial data and spatiotemporal dataMultimedia databaseText databasesThe World-Wide Web

  • 2- Mining in Data warehouses*Data Mining: Concepts and Techniques*

    Data Mining: Concepts and Techniques

  • Using Data cubes: drill-down & rollup*Data Mining: Concepts and Techniques*

    Data Mining: Concepts and Techniques

  • 3- Transactional Data*Data Mining: Concepts and Techniques*Which items sell well together?

    A laptop and a printer?

    A laptop and Play station 3?

    A laptop and car speakers?

    Data Mining: Concepts and Techniques

  • *Chapter 1. IntroductionWhy Data Mining?What Is Data Mining?A Multi-Dimensional View of Data MiningWhat Kinds of Data Can Be Mined?What Kinds of Patterns Can Be Mined?What Kinds of Technologies Are Used?What Kinds of Applications Are Targeted? Major Issues in Data MiningA Brief History of Data Mining and Data Mining SocietySummary

  • Data mining functionalities: Descriptive and Predictive1 Class/concept descriptionCharacterization and discrimination

    2 Frequent patterns, association and correlation

    3 Classification and regression for prediction

    4 Cluster analysis

    5 Outlier analysis*Data Mining: Concepts and Techniques*

    Data Mining: Concepts and Techniques

  • 1 class/concept descriptionClass: mainly refers to objects.For example, a laptop in a store, a digital cam,

    Concept: abstract levelFor example: rich customers, bargain customers,

    They can be derived using:Data characterizationData Discrimination (comparison)

    *Data Mining: Concepts and Techniques*

    Data Mining: Concepts and Techniques

  • 1.a Data characterizationCharacteristics or features of a class/concept

    Example: how many customers have spent more than 5000$ last year? (can be done using Database queries)

    *Data Mining: Concepts and Techniques*

    Data Mining: Concepts and Techniques

  • 1.b Data DiscriminationComparing features of a certain class/concept to other contrasting class.

    Example: compare products of a store which sale increased last year to those whose sale decreased

    *Data Mining: Concepts and Techniques*

    Data Mining: Concepts and Techniques

  • 2 Frequent patterns, association and correlationFrequent itemset are items usually bought together like bread and cheese.

    Frequent sequential set: digital cam then SD card then optical lenses.

    Association analysis (single dimensional rules)

    Support: 1% of transactions containing a computer contained also software

    Confidence: there is 50% chance that a software is bought with laptop

    *Data Mining: Concepts and Techniques*

    Data Mining: Concepts and Techniques

  • 2 Frequent patterns, association and correlation (cont.)

    Association analysis (multi dimensional rules)

    Association rules can be discarded if they dont meet the minimum support threshold or minimum confidence threshold.

    *Data Mining: Concepts and Techniques*

    Data Mining: Concepts and Techniques

  • 3 classification and regression for predictive modelingClassification: finding a model that describes and differentiates a class/concept

    *Data Mining: Concepts and Techniques*

    Data Mining: Concepts and Techniques

  • 3 classification and regression for predictive modeling (cont.)Regression: Rather than forming discrete prediction, regression predicts unavailable numerical data . (both use training data sets)

    Example on classification/regression: Amazon wants to find based on a decision tree what makes customers buy a certain class of products.

    results: price for example.

    *Data Mining: Concepts and Techniques*

    Data Mining: Concepts and Techniques

  • *4 Cluster AnalysisUnsupervised learning (i.e., Class label is unknown)Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patternsPrinciple: Maximizing intra-class similarity & minimizing interclass similarityMany methods and applications

  • *5 Outlier AnalysisOutlier analysisOutlier: A data object that does not comply with the general behavior of the dataNoise or exception? One persons garbage could be another persons treasureMethods: by product of clustering or regression analysis, Useful in fraud detection, rare events analysis

  • *Evaluation of KnowledgeAre all mined knowledge interesting?One can mine tremendous amount of patterns and knowledgeSome may fit only certain dimension space (time, location, )Some may not be representative, may be transient, Objective measures for interestingness:

  • *Chapter 1. IntroductionWhy Data Mining?What Is Data Mining?A Multi-Dimensional View of Data MiningWhat Kinds of Data Can Be Mined?What Kinds of Patterns Can Be Mined?What Kinds of Technologies Are Used?What Kinds of Applications Are Targeted? Major Issues in Data MiningA Brief History of Data Mining and Data Mining SocietySummary

  • *Data Mining: Confluence of Multiple Disciplines

  • *Why Confluence of Multiple Disciplines?Tremendous amount of dataAlgorithms must be highly scalable to handle such as tera-bytes of dataHigh-dimensionality of data Micro-array may have tens of thousands of dimensionsHigh complexity of dataData streams and sensor dataTime-series data, temporal data, sequence data Structure data, graphs, social networks and multi-linked dataHeterogeneous databases and legacy databasesSpatial, spatiotemporal, multimedia, text and Web dataSoftware programs, scientific simulationsNew and sophisticated applications

  • *Chapter 1. IntroductionWhy Data Mining?What Is Data Mining?A Multi-Dimensional View of Data MiningWhat Kinds of Data Can Be Mined?What Kinds of Patterns Can Be Mined?What Kinds of Technologies Are Used?What Kinds of Applications Are Targeted? Major Issues in Data MiningA Brief History of Data Mining and Data Mining SocietySummary

  • *Applications of Data MiningWeb page analysis: from web page classification, clustering to PageRank & HITS algorithmsCollaborative analysis & recommender systemsBasket data analysis to targeted marketingBiological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysisData mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)

  • *Chapter 1. IntroductionWhy Data Mining?What Is Data Mining?A Multi-Dimensional View of Data MiningWhat Kinds of Data Can Be Mined?What Kinds of Patterns Can Be Mined?What Kinds of Technologies Are Used?What Kinds of Applications Are Targeted? Major Issues in Data MiningA Brief History of Data Mining and Data Mining SocietySummary

  • *Major Issues in Data Mining (1)Mining MethodologyMining various and new kinds of knowledgeMining knowledge in multi-dimensional spaceData mining: An interdisciplinary effortBoosting the power of discovery in a networked environmentHandling noise, uncertainty, and incompleteness of dataPattern evaluation and pattern- or constraint-guided miningUser InteractionInteractive miningIncorporation of background knowledgePresentation and visualization of data mining results

  • *Major Issues in Data Mining (2)Efficiency and ScalabilityEfficiency and scalability of data mining algorithmsParallel, distributed, stream, and incremental mining methodsDiversity of data typesHandling complex types of dataMining dynamic, networked, and global data repositoriesData mining and societySocial impacts of data miningPrivacy-preserving data miningInvisible data mining

  • *Chapter 1. IntroductionWhy Data Mining?What Is Data Mining?A Multi-Dimensional View of Data MiningWhat Kinds of Data Can Be Mined?What Kinds of Patterns Can Be Mined?What Kinds of Technologies Are Used?What Kinds of Applications Are Targeted? Major Issues in Data MiningA Brief History of Data Mining and Data Mining SocietySummary

  • *A Brief History of Data Mining Society1989 IJCAI Workshop on Knowledge Discovery in Databases Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)1991-1994 Workshops on Knowledge Discovery in DatabasesAdvances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD95-98)Journal of Data Mining and Knowledge Discovery (1997)ACM SIGKDD conferences since 1998 and SIGKDD ExplorationsMore conferences on data miningPAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), WSDM (2008), etc.ACM Transactions on KDD (2007)

  • *Conferences and Journals on Data MiningKDD ConferencesACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)SIAM Data Mining Conf. (SDM)(IEEE) Int. Conf. on Data Mining (ICDM)European Conf. on Machine Learning and Principles and practices of Knowledge Discovery and Data Mining (ECML-PKDD)Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)Int. Conf. on Web Search and Data Mining (WSDM)Other related conferencesDB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, Web and IR conferences: WWW, SIGIR, WSDMML conferences: ICML, NIPSPR conferences: CVPR, Journals Data Mining and Knowledge Discovery (DAMI or DMKD)IEEE Trans. On Knowledge and Data Eng. (TKDE)KDD ExplorationsACM Trans. on KDD

  • *Where to Find References? DBLP, CiteSeer, GoogleData mining and KDD (SIGKDD: CDROM)Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDDDatabase systems (SIGMOD: ACM SIGMOD AnthologyCD ROM)Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAAJournals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.AI & Machine LearningConferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.Web and IR Conferences: SIGIR, WWW, CIKM, etc.Journals: WWW: Internet and Web Information Systems, StatisticsConferences: Joint Stat. Meeting, etc.Journals: Annals of statistics, etc.VisualizationConference proceedings: CHI, ACM-SIGGraph, etc.Journals: IEEE Trans. visualization and computer graphics, etc.

  • *Recommended Reference BooksE. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011 S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan Kaufmann, 2002R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed. , 2011T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, 2009B. Liu, Web Data Mining, Springer 2006T. M. Mitchell, Machine Learning, McGraw Hill, 1997Y. Sun and J. Han, Mining Heterogeneous Information Networks, Morgan & Claypool, 2012P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005

  • *Chapter 1. IntroductionWhy Data Mining?What Is Data Mining?A Multi-Dimensional View of Data MiningWhat Kinds of Data Can Be Mined?What Kinds of Patterns Can Be Mined?What Kinds of Technologies Are Used?What Kinds of Applications Are Targeted? Major Issues in Data MiningA Brief History of Data Mining and Data Mining SocietySummary

  • *SummaryData mining: Discovering interesting patterns and knowledge from massive amount of dataA natural evolution of science and information technology, in great demand, with wide applicationsA KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentationMining can be performed in a variety of dataData mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc.Data mining technologies and applicationsMajor issues in data mining

  • extras*Data Mining: Concepts and Techniques*

    Data Mining: Concepts and Techniques

  • *Data Mining Function: (1) GeneralizationInformation integration and data warehouse constructionData cleaning, transformation, integration, and multidimensional data modelData cube technologyScalable methods for computing (i.e., materializing) multidimensional aggregatesOLAP (online analytical processing)Multidimensional concept description: Characterization and discriminationGeneralize, summarize, and contrast data characteristics, e.g., dry vs. wet region

  • *Data Mining Function: (2) Association and Correlation AnalysisFrequent patterns (or frequent itemsets)What items are frequently purchased together in your Walmart?Association, correlation vs. causalityA typical association ruleDiaper Beer [0.5%, 75%] (support, confidence)Are strongly associated items also strongly correlated?How to mine such patterns and rules efficiently in large datasets?How to use such patterns for classification, clustering, and other applications?

  • *Data Mining Function: (3) ClassificationClassification and label prediction Construct models (functions) based on some training examplesDescribe and distinguish classes or concepts for future predictionE.g., classify countries based on (climate), or classify cars based on (gas mileage)Predict some unknown class labelsTypical methodsDecision trees, nave Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, Typical applications:Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages,

  • *Time and Ordering: Sequential Pattern, Trend and Evolution AnalysisSequence, trend and evolution analysisTrend, time-series, and deviation analysis: e.g., regression and value predictionSequential pattern mininge.g., first buy digital camera, then buy large SD memory cardsPeriodicity analysisMotifs and biological sequence analysisApproximate and consecutive motifsSimilarity-based analysisMining data streamsOrdered, time-varying, potentially infinite, data streams

  • *Structure and Network AnalysisGraph miningFinding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)Information network analysisSocial networks: actors (objects, nodes) and relationships (edges)e.g., author networks in CS, terrorist networksMultiple heterogeneous networksA person could be multiple information networks: friends, family, classmates, Links carry a lot of semantic information: Link miningWeb miningWeb is a big information network: from PageRank to GoogleAnalysis of Web information networksWeb community discovery, opinion mining, usage mining,

    ***********************Add a definition/description of traditional data analysis.

    *****************