A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING
Jin Xu, Yingping Huang, Gregory Madey
Department of Computer Science and Engineering
University of Notre Dame
Notre Dame, IN 46556
WSS’03: WI/IAT 2003 Workshop on Applications, Products of Web-based Support Systems
October 13, 2003, Halifax
This research was partially supported by NSF, CISE/IIS-Digital Society and Technology
OUTLINE
  INTRODUCTION
  FRAMEWORK OVERVIEW
  INFORMATION RETRIEVAL
  DATA MINING TECHNIQUES
  CASE
  CONCLUSIONS & FUTURE WORK
INTRODUCTION
  World Wide Web
    Abundant information
    Important resource for research
  Web Data Features
    Semi-structured
    Heterogeneous
    Dynamic
  A Research Support System for Web Data Mining
FRAMEWORK
  Web
  Source Identification
  Content Selection
  Information Retrieval
  Data Mining
  Discovery
INFORMATION RETRIEVAL
  Searching Tools
    Directory
    Search engine
  Web Crawler
    URL access method
    Web page parser
      Table extractor
      Link extractor – absolute links/relative links
      Word extractor
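The link extractor and word extractor named on this slide can be illustrated with a minimal sketch. It uses Python's standard library rather than any tool named in the talk, and the class name `LinkAndWordExtractor` and the helper `extract` are hypothetical names chosen for illustration. It shows one way to handle the absolute/relative distinction: every `href` is resolved against the URL of the page it came from, which leaves absolute links unchanged.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndWordExtractor(HTMLParser):
    """Collects absolute link URLs and lowercase words from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL the page was fetched from
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # urljoin keeps absolute links as they are and resolves
                # relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Simplification: text inside <script> or <style> is not filtered out.
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def extract(html, base_url):
    """Return (links, words) for one page of HTML text."""
    parser = LinkAndWordExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, parser.words
```

For example, `extract('<a href="/projects/x">X</a>', "http://example.com/dir/page.html")` resolves the relative link to `http://example.com/projects/x`; the URLs here are placeholders.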
DATA MINING FUNCTIONS
  Association Rules
    Find interesting association or correlation relationships among data items
  Classification
    Predict classes
    Two steps – build model, apply model
  Clustering
    Find natural groups of data
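As an illustration of the association-rule function above, the sketch below computes the two standard measures for a rule A → B: support (the fraction of records containing both A and B) and confidence (support of A and B together divided by support of A). The slides do not say which measures or thresholds were used, so this is an assumption, and the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent); 0.0 if the antecedent never occurs."""
    ante = support(transactions, antecedent)
    if ante == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / ante
```

A rule is usually kept only when both values exceed user-chosen minimums; choosing those minimums is outside the scope of this sketch.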
OPEN SOURCE SOFTWARE
  Open Source Software (OSS)
    Apache, Perl, Linux
    Developed by part-time contributors
  SourceForge Developer Site
    Sponsored by VA Software
    Largest OSS development site
      70,000 projects
      90,000 developers
      700,000 users
DATA COLLECTION
  Data sources
    Statistics, forums
  Project statistics
    9 fields – project ID, lifespan, rank, page views, downloads, bugs, support, patches and CVS
  Developer statistics
    Project ID and developer ID
DATA COLLECTION (Cont.)
  Web Crawler
    Perl and CPAN
    LWP – fetch pages
    HTML parser – parse pages
    HTML::TableExtract – extract information
    Link extractor – extract links
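The crawler described here was built from Perl modules. As a language-neutral illustration of the table-extraction step only, the sketch below uses Python's standard library as a stand-in; it is not the HTML::TableExtract API, and the class name `TableExtractor` and helper `extract_tables` are hypothetical. It turns each table in a page into a list of rows and does not handle nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects each <table> as a list of rows; each row is a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def extract_tables(html):
    """Return all tables found in the given HTML text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Applied to a statistics page, the first row of a table would typically carry field names such as "Project ID" or "Downloads", and each following row one record.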
DATA MINING
  Association Rules
    “all tracks”, “downloads” and “CVS” are associated
  Classification
    Predict “downloads”
    Naïve Bayes – Build Time 30 sec, accuracy 9%
    Adaptive Bayes Network – Build Time 20 min, accuracy 63%
  Clustering
    K-means: User specified number of clusters
    O-cluster: Automatically detect the number of clusters
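To make the K-means entry concrete, here is a minimal sketch of the standard Lloyd iteration in which the caller supplies the number of clusters `k`. It is not the implementation used in the case study. Initializing the centroids from the first `k` points is a simplification chosen to keep the example deterministic, and the function name is hypothetical.

```python
def kmeans(points, k, iterations=100):
    """Cluster tuples of numbers into k groups; returns (centroids, assignment)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]  # naive init: first k points
    assignment = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment
```

The contrast on the slide is that K-means needs `k` as an input, whereas O-cluster determines the number of clusters itself; O-cluster is not sketched here.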
CONCLUSIONS
  Conclusions
    Build a framework
    Describe procedures
    Discuss techniques
    Provide a case study
  Future Work
    Exploratory study
    Implement all stages