From Data to Information
![Page 1: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/1.jpg)
From Data to Information
Apache Mahout
Speaker: Isabel Drost
![Page 2: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/2.jpg)
Isabel Drost
Nighttime: Co-Founder Apache Mahout.
Organizer of Berlin Hadoop Get Together.
Daytime: Software developer @ Berlin
![Page 3: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/3.jpg)
Hello ApacheCon visitors!
![Page 4: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/4.jpg)
Agenda
● Motivation.
● HowTo: A path from data to information.
● Introduction to Mahout.
![Page 5: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/5.jpg)
January 3, 2006 by Matt Callow, http://www.flickr.com/photos/blackcustard/81680010
![Page 6: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/6.jpg)
News aggregation
Today: Read newspapers, blogs, Twitter, RSS feeds.
Wish: Aggregate sources and track emerging topics.
September 10, 2008 by Alex Barth, http://www.flickr.com/photos/a-barth/2846621384
![Page 7: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/7.jpg)
![Page 8: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/8.jpg)
Go to cinema
Today: IMDB, zitty, movie review pages, Twitter, blogs, ask friends.
Wish: Reviews, sentiment detection, recommendations.
March 22, 2008 by Crystian Cruz, http://www.flickr.com/photos/crystiancruz/2353895708
![Page 9: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/9.jpg)
HowTo: From data to information.
![Page 10: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/10.jpg)
From data to information.
● Start collecting and storing data.
● Analyse and understand data.
● Answer more complex questions.
![Page 11: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/11.jpg)
Collecting and storing data.
![Page 12: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/12.jpg)
By Lab2112, http://www.flickr.com/photos/lab2112/462388595/
![Page 13: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/13.jpg)
Data storage options
● Structured, relational.
– Customer data.
– Bug database.
![Page 14: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/14.jpg)
By bareform, http://www.flickr.com/photos/bareform/2483573213/
![Page 15: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/15.jpg)
Data storage options
● Continuous files.
– Log data.
– Document stream.
![Page 16: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/16.jpg)
January 8, 2008 by Pink Sherbet Photography, http://www.flickr.com/photos/pinksherbet/2177961471/
![Page 17: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/17.jpg)
Data storage options
● Semi-structured data:
– Documents.
– Independent rows.
![Page 18: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/18.jpg)
From data to information.
● Start collecting and storing your data.
● Analyse and understand your data.
● Answer more complex questions.
![Page 19: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/19.jpg)
Understanding your data
● Data profiling.
● Goals:
– Identify usual behaviour.
– Find exceptional cases.
● Exact questions depend on domain.
![Page 20: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/20.jpg)
Example: Shopping sessions
● Average amount of money spent.
● Number of customers per state.
● Min/max age of customers.
● Number of shopping sessions.
● Words associated with product.
![Page 21: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/21.jpg)
Example: Access Logs
● Average session length.
● Entry/exit pages.
● Average number of hits per day.
● Clean data.
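The access-log figures above can be computed with very little code once the log is parsed. The following is a minimal sketch, assuming the log has already been reduced to `(session_id, timestamp, page)` tuples; the `HITS` sample and the `profile` function are made up for illustration and are not part of the talk.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical, pre-parsed access log: (session_id, timestamp, page).
HITS = [
    ("s1", "2009-11-02 10:00:00", "/home"),
    ("s1", "2009-11-02 10:05:00", "/docs"),
    ("s1", "2009-11-02 10:09:00", "/download"),
    ("s2", "2009-11-03 11:00:00", "/docs"),
    ("s2", "2009-11-03 11:02:00", "/home"),
]


def profile(hits):
    """Compute simple profile statistics for a non-empty list of hits."""
    sessions = defaultdict(list)
    for session_id, stamp, page in hits:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        sessions[session_id].append((when, page))

    lengths, entries, exits, per_day = [], Counter(), Counter(), Counter()
    for visits in sessions.values():
        visits.sort()  # order the hits of one session by time
        lengths.append((visits[-1][0] - visits[0][0]).total_seconds())
        entries[visits[0][1]] += 1   # first page of the session
        exits[visits[-1][1]] += 1    # last page of the session
        for when, _ in visits:
            per_day[when.date().isoformat()] += 1

    return {
        "avg_session_seconds": sum(lengths) / len(lengths),
        "entry_pages": dict(entries),
        "exit_pages": dict(exits),
        "avg_hits_per_day": sum(per_day.values()) / len(per_day),
    }


if __name__ == "__main__":
    print(profile(HITS))
```

Unusually long sessions or pages that never appear as entry pages are the kind of exceptional cases such a profile is meant to surface.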
![Page 22: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/22.jpg)
Example: Textual documents
● Average length of documents.
● Distribution of document topics.
● Distribution of authors.
![Page 23: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/23.jpg)
Understanding your data
● Analysing data in HDFS/HBase/CouchDB:
– Write analysis code as Map/Reduce jobs.
– Use a higher-level language.
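To make the Map/Reduce option concrete, here is a minimal single-process sketch of how a profiling question such as "number of customers per state" splits into a map step and a reduce step. It does not use Hadoop; `run_job` is a hypothetical stand-in for the grouping a real framework performs between the two steps, and the `CUSTOMERS` records are invented.

```python
from collections import defaultdict

# Hypothetical customer records.
CUSTOMERS = [
    {"name": "a", "state": "Berlin"},
    {"name": "b", "state": "Bayern"},
    {"name": "c", "state": "Berlin"},
]


def map_customer(record):
    """Map step: emit one (state, 1) pair per customer record."""
    yield record["state"], 1


def reduce_counts(state, counts):
    """Reduce step: sum all counts emitted for one state."""
    return state, sum(counts)


def run_job(records, mapper, reducer):
    """Toy stand-in for a framework: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(CUSTOMERS, map_customer, reduce_counts))
```

Because each record is mapped independently and each key is reduced independently, both steps can be spread over many machines, which is the property a real Map/Reduce job relies on.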
![Page 24: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/24.jpg)
From data to information.
● Start collecting and storing your data.
● Analyse and understand your data.
● Answer more complex questions.
![Page 25: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/25.jpg)
Analyse shopping lists
By tanakawho, http://www.flickr.com/photos/28481088@N00/349049527/
![Page 26: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/26.jpg)
Interactive web search
![Page 27: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/27.jpg)
Show most relevant ads
![Page 28: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/28.jpg)
Show most relevant ads
![Page 29: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/29.jpg)
Show most relevant ads
![Page 30: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/30.jpg)
Find emerging news topics
![Page 31: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/31.jpg)
Machine learning – what's that?
![Page 32: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/32.jpg)
Image by John Leech, from: The Comic History of Rome by Gilbert Abbott A Beckett. Bradbury, Evans & Co, London, 1850s
Archimedes taking a Warm Bath
![Page 33: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/33.jpg)
Archimedes' model of nature
![Page 34: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/34.jpg)
June 25, 2008 by chase-me, http://www.flickr.com/photos/sasy/2609508999
![Page 35: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/35.jpg)
![Page 36: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/36.jpg)
An SVM's model of nature
![Page 37: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/37.jpg)
The challenge
![Page 38: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/38.jpg)
● Large amounts of data.
● Structured and unstructured data.
● Diverse tasks.
![Page 39: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/39.jpg)
Mission
Provide scalable data mining algorithms.
![Page 40: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/40.jpg)
● Commercially friendly license.
● Scalable to large amounts of data.
● Well documented.
● Healthy community.
● Targeted to developers.
![Page 41: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/41.jpg)
What does Mahout have to offer?
![Page 42: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/42.jpg)
Discover groups of items
● Group items by similarity.
● Examples:
– Group news articles by topic.
– Find developers with similar interests.
![Page 43: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/43.jpg)
![Page 44: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/44.jpg)
![Page 45: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/45.jpg)
Discover groups of similar items
● Canopy.
● k-Means.
● Fuzzy k-Means.
● Dirichlet based.
● Others upcoming.
![Page 46: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/46.jpg)
Discover groups of similar items
![Page 47: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/47.jpg)
Identify dominant topics
● Given a dataset of texts, identify main topics.
● Examples:
– Dominant topics in a set of mails.
– Identify news message categories.
Algorithms: Parallel LDA
![Page 48: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/48.jpg)
Assign items to defined categories.
● Given pre-defined categories, assign items to them.
● Examples:
– Spam mail classification.
– Discovery of images depicting humans.
![Page 49: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/49.jpg)
By freezelight, http://www.flickr.com/photos/63056612@N00/155554663/
![Page 50: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/50.jpg)
![Page 51: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/51.jpg)
![Page 52: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/52.jpg)
Assign items to defined categories.
● Naïve Bayes.
● Complementary Naïve Bayes.
● Random forests.
● Others upcoming.
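As an illustration of the first algorithm in the list, the following is a small multinomial Naïve Bayes text classifier with add-one smoothing. It is a single-machine sketch for intuition only, not Mahout's implementation; the function names and the four training mails are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: (category, text).
TRAINING = [
    ("spam", "cheap pills buy now"),
    ("spam", "buy cheap watches now"),
    ("ham", "meeting agenda for monday"),
    ("ham", "hadoop meeting in berlin"),
]


def train(docs):
    """Count documents per label and words per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing avoids log(0) for unseen (label, word) pairs.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train(TRAINING)
    print(classify(model, "buy cheap pills"))
    print(classify(model, "hadoop meeting berlin"))
```

Training only needs counts, which is why this family of classifiers fits a counting-oriented framework such as Map/Reduce well.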
![Page 53: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/53.jpg)
Assign items to defined categories
● Examples based on “standard” datasets:
● 20 Newsgroups: http://cwiki.apache.org/MAHOUT/twentynewsgroups.html
● Wikipedia: http://cwiki.apache.org/MAHOUT/wikipediabayesexample.html
![Page 54: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/54.jpg)
Recommendation mining.
● Recommend items to users.
● Examples:
– Find books related to the book I am buying.
– Find movies I might like.
![Page 55: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/55.jpg)
Recommending places
![Page 56: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/56.jpg)
Recommending people
![Page 57: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/57.jpg)
Recommendation mining.
● Integrated Taste.
● Mature Java library.
● Java-based, web service / HTTP bindings.
● Batch mode based on EC2 and Hadoop.
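The idea behind recommendation mining can be shown with a tiny item-based recommender that scores unseen items by how often they co-occur with items a user already likes. This is a sketch under simple assumptions (boolean preferences, co-occurrence counts as similarity); it does not use the Taste API, and the `PREFERENCES` data and function names are invented.

```python
from collections import Counter
from itertools import permutations

# Hypothetical boolean preferences: user -> set of liked movies.
PREFERENCES = {
    "alice": {"matrix", "inception", "memento"},
    "bob": {"matrix", "inception"},
    "carol": {"matrix", "amelie"},
    "dave": {"inception", "memento"},
}


def cooccurrence(preferences):
    """Count, for every ordered item pair, how many users like both."""
    counts = Counter()
    for items in preferences.values():
        for a, b in permutations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts


def recommend(user, preferences, top_n=2):
    """Rank unseen items by co-occurrence with the user's liked items."""
    counts = cooccurrence(preferences)
    seen = preferences[user]
    scores = Counter()
    for (a, b), n in counts.items():
        if a in seen and b not in seen:
            scores[b] += n
    # Sort by score descending, then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


if __name__ == "__main__":
    print(recommend("bob", PREFERENCES))
```

Counting co-occurrences is an embarrassingly parallel job, which is one reason a batch mode on Hadoop is a natural fit for this kind of recommender.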
![Page 58: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/58.jpg)
Frequent pattern mining
● Given groups of items, find commonly co-occurring items.
● Examples:
– In shopping carts, find items bought together.
– In query logs, find queries issued in one session.
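The shopping-cart example can be sketched as brute-force pair counting with a minimum support threshold. This is only meant to show what "commonly co-occurring" means; it is not an optimised pattern mining algorithm and not Mahout code, and the `CARTS` data and the `min_support` parameter are assumptions for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping carts.
CARTS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
]


def frequent_pairs(carts, min_support=2):
    """Return item pairs that appear together in at least min_support carts."""
    counts = Counter()
    for cart in carts:
        # Sorting makes ("bread", "butter") and ("butter", "bread") one key.
        for pair in combinations(sorted(cart), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    print(frequent_pairs(CARTS))
```

Enumerating all pairs grows quadratically with cart size, so this approach only works for small inputs; scalable implementations avoid materialising every candidate pair.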
![Page 59: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/59.jpg)
By crypto, http://www.flickr.com/photos/crypto/3201254932/sizes/l/
![Page 60: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/60.jpg)
By crypto, http://www.flickr.com/photos/crypto/3201254932/sizes/l/
By libraryman, http://www.flickr.com/photos/libraryman/78337046/sizes/l/
![Page 61: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/61.jpg)
By crypto, http://www.flickr.com/photos/crypto/3201254932/sizes/l/
By libraryman, http://www.flickr.com/photos/libraryman/78337046/sizes/l/
By quinnanya, http://www.flickr.com/photos/quinnanya/2806883231/
![Page 62: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/62.jpg)
Upcoming
● More algorithms.
● Optimization of existing implementations.
● More examples.
● Release 0.2.
![Page 63: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/63.jpg)
Upcoming
● “TU Winter of Code”
– Crawl and store blog postings.
– Group posts and identify emerging topics.
– Index results with Solr.
Database Systems and Information Management
Prof. Dr. Volker Markl
![Page 64: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/64.jpg)
TU Winter of Code
● 6 students, 5 months.
● http://github.org/MaineC/Playground
![Page 65: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/65.jpg)
Why go for Apache Mahout?
![Page 66: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/66.jpg)
Jumpstart your project with proven code.
January 8, 2008 by dreizehn28, http://www.flickr.com/photos/1328/2176949559
![Page 67: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/67.jpg)
Discuss ideas and problems online.
November 16, 2005 by [phil h], http://www.flickr.com/photos/hi-phi/64055296
![Page 68: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/68.jpg)
Become part of the community.
![Page 69: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/69.jpg)
Interest in solving hard problems.
Being part of lively community.
Engineering best practices.
Bug reports, patches, features.
Documentation, code, examples.
July 9, 2006 by trackrecord, http://www.flickr.com/photos/trackrecord/185514449
![Page 70: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/70.jpg)
Dec., 16th 2009: Hadoop* Get Together in Berlin
– Richard Hutton (nugg.ad): “Moving from five days to one hour.”
– Jörg Möllenkamp (Sun): "Hadoop on Sun."
– Nikolaus Pohle (nurago): "M/R for MR - Online Market Research powered by Apache Hadoop. Enable consultants to analyze online behavior for audience segmentation, advertising effects and usage patterns."
http://upcoming.yahoo.com/event/4842528/
* UIMA, HBase, Lucene, Solr, katta, Mahout, CouchDB, Pig, Hive, Cassandra, Cascading, JAQL, ... talks welcome as well.
![Page 71: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/71.jpg)
March 2009: Hadoop* Get Together in Berlin
– Dragan Milosevic ( ): TBA
– YOU!
newthinking store Berlin
Tucholskystr. 48
* UIMA, HBase, Lucene, Solr, katta, Mahout, CouchDB, Pig, Hive, Cassandra, Cascading, JAQL, ... talks welcome as well.
![Page 72: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/72.jpg)
Mahout Meetup this evening
![Page 73: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/73.jpg)
Interest in solving hard problems.
Being part of lively community.
Engineering best practices.
Bug reports, patches, features.
Documentation, code, examples.
July 9, 2006 by trackrecord, http://www.flickr.com/photos/trackrecord/185514449
![Page 74: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/74.jpg)
![Page 75: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/75.jpg)
![Page 76: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/76.jpg)
Going parallel: k-Means
![Page 77: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/77.jpg)
![Page 78: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/78.jpg)
![Page 79: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/79.jpg)
![Page 80: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/80.jpg)
![Page 81: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/81.jpg)
Until stable.
![Page 82: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/82.jpg)
Until stable.
Data intensive.
Output: Cluster assignment.
Pre-compute centers.
Done in Map.
![Page 83: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/83.jpg)
Until stable.
Data intensive.
Output: Cluster assignment.
Pre-compute centers.
Done in Map.
Done in Reduce.
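The slide labels suggest the usual split of k-Means into a map step that assigns points to clusters and a reduce step that recomputes the centers, repeated until the centers are stable. The sketch below runs that loop in a single process for intuition; it is not Mahout's implementation, it skips any pre-computation of partial centers, and all function names and the `max_iterations` parameter are assumptions.

```python
from collections import defaultdict


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers
    ]
    return distances.index(min(distances))


def map_assign(point, centers):
    """Map: the data-intensive step; emits (cluster index, point)."""
    return nearest(point, centers), point


def reduce_center(points):
    """Reduce: the new center is the mean of the points of one cluster."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(points, centers, max_iterations=20):
    """Alternate map and reduce steps until the centers stop changing."""
    centers = list(centers)
    for _ in range(max_iterations):
        assigned = defaultdict(list)
        for point in points:
            index, value = map_assign(point, centers)
            assigned[index].append(value)
        # A center that attracted no points keeps its previous position.
        updated = [
            reduce_center(assigned[i]) if assigned[i] else centers[i]
            for i in range(len(centers))
        ]
        if updated == centers:  # "until stable"
            break
        centers = updated
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)]))
```

Only the small list of centers has to be shared between iterations, while the large set of points is processed independently per point, which is what makes the assignment step a good fit for the map phase.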
![Page 84: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f478595aec0af589735a94d/html5/thumbnails/84.jpg)
Make searching the web easier