A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey...

11
A RESEARCH SUPPORT SYSTE A RESEARCH SUPPORT SYSTE M FRAMEWORK FOR WEB DATA M FRAMEWORK FOR WEB DATA MINING MINING Jin Xu, Yingping Huang, Gregory Madey Department of Computer Science and Engineering University of Notre Dame Notre Dame, IN 46556 WSS’03: WI/IAT 2003 Workshop on Applications, Products of Web-based Support Systems WSS’03: WI/IAT 2003 Workshop on Applications, Products of Web-based Support Systems October 13, 2003, Halifax October 13, 2003, Halifax This research was partially supported by NSF, CISE/IIS-Digital Society and Technology

description

INTRODUCTION World Wide Web World Wide Web Abundant information Abundant information Important resource for research Important resource for research Web Data Features Web Data Features Semi-structured Semi-structured Heterogeneous Heterogeneous Dynamic Dynamic A Research Support System for Web Data Mining A Research Support System for Web Data Mining

Transcript of A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey...

Page 1: A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey Department of Computer Science and Engineering University.

A RESEARCH SUPPORT SYSTEA RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA M FRAMEWORK FOR WEB DATA MININGMININGJin Xu, Yingping Huang, Gregory Madey

Department of Computer Science and EngineeringUniversity of Notre Dame

Notre Dame, IN 46556

WSS’03: WI/IAT 2003 Workshop on Applications, Products of Web-based Support WSS’03: WI/IAT 2003 Workshop on Applications, Products of Web-based Support SystemsSystems

October 13, 2003, HalifaxOctober 13, 2003, Halifax

This research was partially supported by NSF, CISE/IIS-Digital Society and Technology

Page 2: A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey Department of Computer Science and Engineering University.

OUTLINEOUTLINE INTRODUCTIONINTRODUCTION FRAMEWORK OVERVIEWFRAMEWORK OVERVIEW INFORMATION RETRIEVALINFORMATION RETRIEVAL DATA MINING TECHNIQUES DATA MINING TECHNIQUES CASECASE CONCLUSIONS & FUTURE WORKCONCLUSIONS & FUTURE WORK

Page 3: A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey Department of Computer Science and Engineering University.

INTRODUCTIONINTRODUCTION World Wide WebWorld Wide Web

Abundant informationAbundant information Important resource for researchImportant resource for research

Web Data FeaturesWeb Data Features Semi-structuredSemi-structured HeterogeneousHeterogeneous DynamicDynamic

A Research Support System for Web A Research Support System for Web Data MiningData Mining

Page 4: A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey Department of Computer Science and Engineering University.

FRAMEWORKFRAMEWORK

Web

SourceIdentification

ContentSelection

InformationRetrieval

DataMining

Discovery

Page 5: A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey Department of Computer Science and Engineering University.

INFORMATION RETRIEVALINFORMATION RETRIEVAL Searching ToolsSearching Tools

DirectoryDirectory Search engineSearch engine

Web CrawlerWeb Crawler URL access methodURL access method Web page parserWeb page parser

Table extractorTable extractor Link extractor – absolute links/relative linksLink extractor – absolute links/relative links Word extractorWord extractor

Page 6: A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey Department of Computer Science and Engineering University.

DATA MINING FUNCTIONSDATA MINING FUNCTIONS Association RulesAssociation Rules

Find interesting association or correlation Find interesting association or correlation relationship among data itemsrelationship among data items

ClassificationClassification Predict classesPredict classes Two steps – build model, apply modelTwo steps – build model, apply model

ClusteringClustering Find natural groups of dataFind natural groups of data

Page 7: A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey Department of Computer Science and Engineering University.

OPEN SOURCE SOFTWAREOPEN SOURCE SOFTWARE Open Source Software (OSS)Open Source Software (OSS)

Apache, Perl, LinuxApache, Perl, Linux Developed by part time contributorsDeveloped by part time contributors

SourceForge Developer SiteSourceForge Developer Site Sponsored by VA SoftwareSponsored by VA Software Largest OSS development siteLargest OSS development site

70,000 projects70,000 projects 90,000 developers90,000 developers 700,000 users700,000 users

Page 8: A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey Department of Computer Science and Engineering University.

DATA COLLECTONDATA COLLECTON Data sourcesData sources

Statistics, forumsStatistics, forums Project statisticsProject statistics

9 fields – project ID, lifespan, rank, page 9 fields – project ID, lifespan, rank, page views, downloads, bugs, support, views, downloads, bugs, support, patches and CVSpatches and CVS

Developer statisticsDeveloper statistics Project ID and developer IDProject ID and developer ID

Page 9: A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey Department of Computer Science and Engineering University.

DATA COLLECTON (Cont.)DATA COLLECTON (Cont.) Web CrawlerWeb Crawler

Perl and CPANPerl and CPAN LWP – fetch pagesLWP – fetch pages HTML parser – parse pagesHTML parser – parse pages HTML::TableExtract – extract informationHTML::TableExtract – extract information Link extractor – extract linksLink extractor – extract links

Page 10: A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey Department of Computer Science and Engineering University.

DATA MININGDATA MINING Association RulesAssociation Rules

““all tracks”, “downloads” and “CVS” are associaall tracks”, “downloads” and “CVS” are associatedted ClassificationClassification

Predict “downloads”Predict “downloads” Naïve Bayes – Build Time 30 sec, accuracy Naïve Bayes – Build Time 30 sec, accuracy 9%9% Adaptive Bayes Network - Build Time 20 min, accuracy Adaptive Bayes Network - Build Time 20 min, accuracy 63%63%

ClusteringClustering K-means: User specified number of clustersK-means: User specified number of clusters O-cluster: Automatically detect the number of clustersO-cluster: Automatically detect the number of clusters

Page 11: A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey Department of Computer Science and Engineering University.

CONCLUSIONSCONCLUSIONS ConclusionsConclusions

Build a framework Build a framework Describe proceduresDescribe procedures Discuss techniquesDiscuss techniques Provide a case studyProvide a case study

Future WorkFuture Work Exploratory studyExploratory study Implement all stages Implement all stages