![Page 1: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/1.jpg)
Copyright 2010 Sematext Int'l. All rights reserved.
ProjectHub
Crawling, Indexing, and Searching Software Project Data with Droids, Tika, Solr & friends
Otis Gospodnetić ◦◦ [email protected] ◦◦ @otisg
Sematext Int'l ◦◦ www.sematext.com ◦◦ @sematext
![Page 2: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/2.jpg)
What I Will Cover
• Who I am
• What, Why, Where
• Architecture
• Info Gathering & Indexing
• Search & Extra Search Dog Food
• Performance & Analytics
• Ops & Stats
![Page 3: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/3.jpg)
About Otis Gospodnetić
• Lucene/Solr/Nutch/Mahout/... committer
• Lucene in Action 1 & 2 co-author
• Lucene Consulting since 2005
• Sematext International since 2007
![Page 4: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/4.jpg)
About Sematext
Search (Lucene, Solr, Elastic Search...)
Web Crawling (Nutch)
Machine Learning (Mahout)
Big Data (Hadoop, HBase, Voldemort...)
![Page 5: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/5.jpg)
What
• Search everything about a Software Project
• Lucene & Hadoop
  – All sub-projects
  – All content
    • Mailing list archives
    • JIRA issues
    • Web site & Wiki pages
    • Source code (local syntax highlighting), trunk
    • Javadoc, trunk
![Page 6: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/6.jpg)
![Page 7: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/7.jpg)
Why
• We need it
• Other Hadoop, Lucene, Solr... users need it
• Our own playground
• Live product demos
• Yummy dog food
![Page 8: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/8.jpg)
Where
• search-lucene.com
• search-hadoop.com
• Other suggestions / needs?
• In your Enterprise?
![Page 9: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/9.jpg)
Architecture
![Page 10: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/10.jpg)
Tool Matrix
| Data Source | Fetch | Parse |
|---|---|---|
| JIRA | URLConnection (feed) | Digester (feed), DOM (item) |
| ML (mailing lists) | FileInputStream (fs), URLConnection (feed), Droids (works, unused) | Digester (feed), MIME4J (mbox) |
| Web site | Droids | Tika via Droids |
| Wiki | Droids | Tika via Droids |
| Source code | svn co | QDox |
| Javadoc | svn co | QDox |
![Page 11: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/11.jpg)
Information Gathering
• Multiple independent JVM processes (cron)
• Different polling frequencies
• Different data sources / formats:
  – RSS (JIRA, Mailing Lists)
  – Mbox (Mailing Lists)
  – HTTP/HTML (Web site, Wiki)
  – Subversion (source code, Javadoc)
• Nutch is a beast. Droids is light & simple.
• ML thread detection is tricky
• Finding deleted docs (Wiki, Web, Javadoc...)
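The slide does not say how deleted documents are found. One possible approach, sketched here purely as an assumption, is to record every document ID a crawl actually saw and treat indexed IDs that were not seen as deletion candidates. The function name and the sample IDs are hypothetical.

```python
def find_deleted(indexed_ids, seen_ids):
    """Return IDs present in the index but absent from the latest crawl."""
    # Set difference: everything indexed minus everything the crawl saw.
    return sorted(set(indexed_ids) - set(seen_ids))


# Hypothetical IDs, for illustration only.
indexed = ["wiki/FrontPage", "wiki/OldPage", "site/index.html"]
seen = ["wiki/FrontPage", "site/index.html", "wiki/NewPage"]
print(find_deleted(indexed, seen))
```

In practice a crawl that fails halfway would make every unvisited page look deleted, so a check like this probably only makes sense after a crawl that completed cleanly.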
![Page 12: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/12.jpg)
Thread Detection
• Email clients are kaput
• SMTP headers are unreliable
• Heuristics are needed
  – Try headers
  – Fall back to subjects (get subject skeleton, calculate hash)
  – Factor in time (4 weeks)
  – Use index for thread info retrieval
Q: Are there any libraries for this?
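The heuristics above could be sketched roughly as follows. This is not ProjectHub's code: the prefix pattern, the MD5 hash, the message dictionary layout, and the two in-memory dictionaries (standing in for index lookups) are all assumptions made for illustration.

```python
import hashlib
import re
from datetime import datetime, timedelta

# Assumed reply/forward prefixes and [list-tag] markers to strip from subjects.
_PREFIX = re.compile(r"^\s*(?:(?:re|fwd?|aw)\s*:|\[[^\]]*\])\s*", re.IGNORECASE)
WINDOW = timedelta(weeks=4)  # the 4-week time factor from the slide


def subject_skeleton(subject):
    """Strip prefixes repeatedly, lowercase, and collapse whitespace."""
    prev, s = None, subject
    while s != prev:
        prev, s = s, _PREFIX.sub("", s)
    return " ".join(s.lower().split())


def subject_hash(subject):
    return hashlib.md5(subject_skeleton(subject).encode("utf-8")).hexdigest()


def assign_thread(msg, by_message_id, by_subject):
    """Return a thread ID for msg and record it in both lookup tables.

    by_message_id maps Message-ID -> thread ID.
    by_subject maps subject hash -> (thread ID, date of latest message).
    """
    parents = [msg.get("in_reply_to")] + list(reversed(msg.get("references") or []))
    for parent in parents:
        # 1. Try headers: reuse the thread of any known parent.
        if parent and parent in by_message_id:
            thread_id = by_message_id[parent]
            break
    else:
        # 2. Fall back to the subject hash, but only within the time window.
        hit = by_subject.get(subject_hash(msg["subject"]))
        if hit and msg["date"] - hit[1] <= WINDOW:
            thread_id = hit[0]
        else:
            thread_id = msg["message_id"]  # start a new thread
    by_message_id[msg["message_id"]] = thread_id
    by_subject[subject_hash(msg["subject"])] = (thread_id, msg["date"])
    return thread_id


ids, subjects = {}, {}
first = {"message_id": "<1@x>", "subject": "Solr faceting question",
         "date": datetime(2010, 1, 1)}
reply = {"message_id": "<2@x>", "subject": "RE: [solr-user] Solr faceting question",
         "date": datetime(2010, 1, 3)}
print(assign_thread(first, ids, subjects), assign_thread(reply, ids, subjects))
```

The second message carries no usable headers, so it joins the first thread only because its subject skeleton matches and it arrived within four weeks.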
![Page 13: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/13.jpg)
Indexing
• Use StreamingUpdateSolrServer
• AutoCommit use-case
• Solr index abuse: track seen/unseen
• &qsrc=indexer
• &warmUp=true
• Separate processes – easier reindexing (esp. with frequent project infra changes)
• Treating quoted portions of ML messages
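The slide does not say how quoted portions are treated. A minimal sketch of one option, assuming quoted lines are those starting with `>`, separates new text from quoted text so the two can be indexed differently (for example in separate fields, or with the quoted part dropped). The function name is hypothetical.

```python
def split_quoted(body):
    """Split a plain-text email body into (new_text, quoted_text)."""
    new_lines, quoted_lines = [], []
    for line in body.splitlines():
        # Assumption: a line is quoted if it starts with '>' after indentation.
        if line.lstrip().startswith(">"):
            quoted_lines.append(line)
        else:
            new_lines.append(line)
    return "\n".join(new_lines).strip(), "\n".join(quoted_lines).strip()


body = "I agree.\n\n> What about faceting?\n> It is slow.\n\nTry a filter cache."
new_text, quoted_text = split_quoted(body)
print(new_text)
```

This ignores attribution lines ("On Monday, X wrote:") and clients that quote without `>` markers, both of which would need extra heuristics.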
![Page 14: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/14.jpg)
Search
• Facets (multi-select)
  – Project
  – Data source/type
  – Author (based on names only)
• Boosting more recent documents vs. pure relevance vs. newest/oldest first
  – Docs with an empty updateDate field (e.g. Javadoc) are treated as if they were 0.5 years old:
recip(map(ms(NOW,updateDate),6.32e11,3.16e12,1.58e10),3.16e-11,4,1)^4
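To see what the function query above computes, here is a small re-implementation of Solr's `map(x,min,max,target)` and `recip(x,m,a,b) = a/(m*x+b)` in plain Python. The constants are the ones on the slide, read as milliseconds: 3.16e10 is about one year, so 6.32e11 is about 20 years, 3.16e12 about 100 years, and 1.58e10 about half a year. The helper names are hypothetical, and the assumption that a missing date evaluates to a decades-old timestamp is an inference from those constants, not something the slide states.

```python
MS_PER_YEAR = 3.16e10  # approximate milliseconds per year, as used on the slide


def solr_map(x, lo, hi, target):
    """Mimics Solr's map(): values within [lo, hi] become target."""
    return target if lo <= x <= hi else x


def solr_recip(x, m, a, b):
    """Mimics Solr's recip(): a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms):
    # Ages between ~20 and ~100 years (i.e. a missing date) count as ~0.5 year.
    mapped = solr_map(age_ms, 6.32e11, 3.16e12, 1.58e10)
    # With m = 1/MS_PER_YEAR this is roughly 4 / (age_in_years + 1).
    return solr_recip(mapped, 3.16e-11, 4, 1)


for label, age in [("new", 0.0), ("1 year", MS_PER_YEAR), ("no date", 1.28e12)]:
    print(label, round(recency_boost(age), 2))
```

A brand-new document scores about 4, a one-year-old document about 2, and a document without a date about 2.67. The trailing `^4` is left out of the sketch; in Solr query syntax it reads as a boost weight on the function rather than an exponent.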
![Page 15: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/15.jpg)
Search cont'd
• Query Spellchecker
• Sematext components:
  – ReSearcher & Relaxer
  – AutoComplete
  – Key Phrase Extractor (2 approaches)
• Threaded vs. flat view
• In-document search term highlighting
• Short URLs
![Page 16: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/16.jpg)
Search cont'd
![Page 17: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/17.jpg)
Dog food #1: Auto-Complete
• Source: subjects and titles, refreshed nightly
• Approach: go directly to selection
• sematext.com/products/autocomplete/
![Page 18: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/18.jpg)
Dog food #2: ReSearcher & Relaxer
• Avoid “sorry, no/poor matches”
• Multiple algos trigger re-searching
• Different forms of relaxing
• sematext.com/products/dym-researcher/
![Page 19: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/19.jpg)
Dog food #3: Key Phrases
• Help narrow search results, like facets
• 2 types:
  – Stored in index vs. calculated from top N hits
• sematext.com/products/key-phrase-extractor/
![Page 20: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/20.jpg)
Basic Search Analytics
• Top queries, top terms...
• Daily, weekly, monthly
• MRR: http://en.wikipedia.org/wiki/Mean_reciprocal_rank
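For reference, mean reciprocal rank averages 1/rank of the first relevant result over all queries, counting queries with no relevant result as 0. The sketch below assumes the rank comes from something like the first clicked result per query; how ProjectHub derives ranks is not stated on the slide.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the first relevant result per query,
    or None when the query returned nothing relevant."""
    if not ranks:
        return 0.0
    # Queries with no relevant result contribute 0 to the sum.
    return sum(1.0 / r for r in ranks if r) / len(ranks)


# Four queries: hits at rank 1, rank 3, no hit, and rank 2.
print(round(mean_reciprocal_rank([1, 3, None, 2]), 3))
```

A value close to 1 means the first relevant result is usually at the top; lower values suggest users have to scan further down the list.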
![Page 21: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/21.jpg)
Very Basic Search Analytics
![Page 22: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/22.jpg)
Real Search Analytics
![Page 23: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/23.jpg)
Performance & Monitoring: RPM
![Page 24: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/24.jpg)
Availability: Site24x7.com
![Page 25: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/25.jpg)
Operations
• Small EC2 instance: 1.7 GB RAM
• EBS for data - got burnt once
• Local disk for index
• Solr 1.4.1 multi-core
• Performance monitoring via RPM
• Availability & performance via site24x7.com
![Page 26: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/26.jpg)
Statistics
• search-hadoop.com:
  – 110K+ documents
  – ~700 MB optimized
• search-lucene.com:
  – 170K+ documents
  – ~900 MB optimized
![Page 27: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/27.jpg)
Future
• Field collapsing (threads)
• Bot detection (load) DONE
• Solr duplicate detection (release notes)
• Relevance tuning (MRR)
• Open sourcing?
![Page 28: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/28.jpg)
World-wide!
Search & Data Analytics
Machine Learning & NLP
Big Data
WE ARE HIRING
![Page 29: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/29.jpg)
Questions?
![Page 30: ProjectHub](https://reader037.fdocuments.us/reader037/viewer/2022103015/54c69d504a7959d9668b4571/html5/thumbnails/30.jpg)
Contact
• sematext.com
• blog.sematext.com
• @sematext
• @otisg