Centipede: Analyzing Web Crawl Data for the Context of a Location
Vikas Bansal, Primal Pappachan, Abhishek Sethi
Introduction
Description
A web service that presents the context associated with a location
Context of a location
1. Weather
2. Healthcare
3. Crime
4. Employment
5. ……
Customers
1. People moving or travelling to a new place
2. Policy makers
3. Journalists
4. Researchers
Scenario
Related Services
● Yelp
● Google News
● http://bestplaces.net/
● http://www.nycgo.com/events/
● http://www.stubhub.com/
Technical Description of Service
● Analyze the web crawl data
● Create a list of locations
● Filter the top 100 words from the files that mention a location from the list
● Build an index that maps each location to the list of words corresponding to that location
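The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual Pig/Hadoop job: it assumes "top 100 words" means the most frequent words in documents that mention a location, and the names `build_location_index` and `STOPWORDS`, the tiny stopword list, and the whole-word matching rule are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stopword list; a real job would use a fuller one.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to", "is", "for", "on"}


def build_location_index(documents, locations, top_n=100):
    """Map each location to the top_n most frequent words in documents mentioning it."""
    counts = defaultdict(Counter)
    for text in documents:
        lowered = text.lower()
        words = [w for w in re.findall(r"[a-z]+", lowered) if w not in STOPWORDS]
        for location in locations:
            name = location.lower()
            # Assumption: a document "mentions" a location if the name appears as whole words.
            if re.search(r"\b" + re.escape(name) + r"\b", lowered):
                own_words = set(name.split())
                # Do not count the location's own name as part of its context.
                counts[location].update(w for w in words if w not in own_words)
    return {
        location: [word for word, _ in counter.most_common(top_n)]
        for location, counter in counts.items()
    }


if __name__ == "__main__":
    docs = [
        "Baltimore crime rates fell; crime reports in Baltimore dropped.",
        "Weather in Seattle: rain, rain and more rain.",
    ]
    print(build_location_index(docs, ["Baltimore", "Seattle"], top_n=3))
```

In a cluster setting the per-document counting would be the map step and the per-location merge and top-N cut would be the reduce step; the in-memory dictionary here stands in for both.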
System Architecture
Data Sources
•Common Crawl data from Amazon S3
–Contains information on billions of web pages
–Search through the contents
–Use ARC and text files
Technologies and Resources
● Hadoop cluster on the Bluegrit system
● Apache Pig
○ Python for UDFs
● Java/PHP for front-end development
○ Use a JBoss container for Java, XAMPP for PHP
● Elasticsearch
● MapReduce
● SQL/NoSQL database
● REST
● WSDL 2.0
● AWS: RDS, Route 53, EC2
MapReduce Job
Splitter
● Sentence
● Paragraph
● Article
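The slide lists only the three splitter granularities, so the following is a minimal sketch of one way such a splitter could feed a map step. It assumes the splitter decides how much text around a location mention counts as that location's context; `split_text`, `map_locations`, and the regex-based sentence and paragraph boundaries are hypothetical choices, not the project's implementation.

```python
import re


def split_text(text, level="sentence"):
    """Split text into units at the chosen granularity: sentence, paragraph, or article."""
    if level == "article":
        units = [text]
    elif level == "paragraph":
        # Assumption: paragraphs are separated by one or more blank lines.
        units = re.split(r"\n\s*\n", text)
    elif level == "sentence":
        # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
        units = re.split(r"(?<=[.!?])\s+", text)
    else:
        raise ValueError("level must be 'sentence', 'paragraph', or 'article'")
    return [unit.strip() for unit in units if unit.strip()]


def map_locations(text, locations, level="sentence"):
    """Map step: emit (location, unit) for every unit that mentions the location."""
    pairs = []
    for unit in split_text(text, level):
        lowered = unit.lower()
        for location in locations:
            if re.search(r"\b" + re.escape(location.lower()) + r"\b", lowered):
                pairs.append((location, unit))
    return pairs


if __name__ == "__main__":
    page = "Baltimore is rainy. Crime fell.\n\nSeattle has jobs."
    for level in ("sentence", "paragraph", "article"):
        print(level, map_locations(page, ["Baltimore", "Seattle"], level))
```

The trade-off is precision against recall: sentence-level units keep only words very close to the mention, while article-level units attribute every word on the page to every location mentioned on it.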
Elasticsearch
● Distributed RESTful search and analytics.
● Has near real-time search.
● Resilient clusters - detect and remove failed nodes.
Challenges and Limitations
•Amount of HDD space available.
•Learning new technologies such as Apache Pig, WSDL, etc.
•Creating special UDFs in Python.
Timeline