Google Confidential and Proprietary
Statistical Research in the Tech Industry: Google Suggest & Instant
Donal McMahon ([email protected])
Alternatively: Generating search query suggestions - natural language processing using Markov chains
Big questions
Data-driven Design
● Google has access to lots of data
  ○ search queries, emails, maps, social network data...
● How would you use it to improve its products?
● How would you know people liked these new products?
  ○ Best way to set up experiments?
  ○ What methods to evaluate performance?
  ○ What is the best way to balance privacy and usefulness?
Forty-one shades of blue!
Original "Suggest"
● Just on search boxes - when you type in "h" it autocompletes to "hotmail", "hulu", "home depot", etc...
● How would YOU generate reasonable suggestions?
Why is it useful?
3 players:
- Google
- Advertisers
- Users
How would you generate suggestions?
● If you saw only one letter
  ○ "t"
  ○ What's the most likely next letter?
● When a second letter is entered
  ○ "th"
  ○ What would you guess then?
● How about after a couple of words?
● How would you build up a dictionary?
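One possible way to approach these questions, as a minimal sketch and not a description of how Suggest actually works: build the dictionary by counting queries in a log, then answer a typed prefix with the most frequent matching queries. The toy query log and the function names `build_dictionary`, `suggest` and `next_letter_distribution` are hypothetical.

```python
from collections import Counter

def build_dictionary(query_log):
    """Count how often each query occurs (lower-cased, whitespace-trimmed)."""
    return Counter(q.strip().lower() for q in query_log if q.strip())

def suggest(dictionary, prefix, k=3):
    """Return up to k of the most frequent queries that start with prefix."""
    prefix = prefix.lower()
    matches = [(q, c) for q, c in dictionary.items() if q.startswith(prefix)]
    # Highest count first; ties are broken alphabetically for a stable order.
    matches.sort(key=lambda qc: (-qc[1], qc[0]))
    return [q for q, _ in matches[:k]]

def next_letter_distribution(dictionary, prefix):
    """Estimate P(next letter | prefix) from the query counts."""
    prefix = prefix.lower()
    letters = Counter()
    for query, count in dictionary.items():
        if query.startswith(prefix) and len(query) > len(prefix):
            letters[query[len(prefix)]] += count
    total = sum(letters.values())
    return {letter: c / total for letter, c in letters.items()} if total else {}

# Hypothetical toy log; a real log would be far larger and need spam filtering.
toy_log = (["hotmail"] * 5 + ["hulu"] * 3 + ["home depot"] * 2
           + ["the weather"] * 4 + ["target"] * 2 + ["twitter"])
dictionary = build_dictionary(toy_log)

print(suggest(dictionary, "h"))
print(next_letter_distribution(dictionary, "t"))
```

A linear scan over the dictionary is fine for a toy example; at scale a prefix tree (trie) or a sorted index would be the more natural data structure.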
How would you generate suggestions?
● What data would you use?
● Is there supplemental information you might have on the user?
● What about the location of the query?
● What about spam? How might it arrive and how to remove it?
Changes by location
Natural Language Processing
● Intersection between computer science, linguistics and statistics.
● "The goal of NLP is to design and build software that will analyze, understand, and generate languages that humans use naturally, so that eventually you will be able to address your computer as though you were addressing another person."
● Some examples:
  ○ Automatic summarisation
  ○ Sentiment analysis
  ○ Speech recognition
  ○ Machine translation
N-grams
● An n-gram is a contiguous sequence of n items from a given sequence of text or speech.
  ○ bigram
  ○ trigram
  ○ ...
  ○ n-gram
● Build a predictive model for $X_i$ based on $X_{i-(n-1)}, X_{i-(n-2)}, \ldots, X_{i-1}$
● $P(X_i \mid X_{i-(n-1)}, X_{i-(n-2)}, \ldots, X_{i-1})$
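The conditional probability above can be estimated from counts, which is the Markov-chain view of text. The following is a minimal sketch using a plain maximum-likelihood estimate with no smoothing; the toy corpus and the names `train_ngram` and `prob` are hypothetical, and the `<S>` and `</S>` tokens are used here as sentence-boundary markers.

```python
from collections import Counter, defaultdict

def train_ngram(sentences, n=2):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad so the first real word has a full context and the end is modelled.
        words = ["<S>"] * (n - 1) + sentence.lower().split() + ["</S>"]
        for i in range(n - 1, len(words)):
            context = tuple(words[i - (n - 1):i])
            counts[context][words[i]] += 1
    return counts

def prob(counts, context, word):
    """Maximum-likelihood estimate of P(word | context); 0.0 if context unseen."""
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[word] / sum(seen.values())

# Hypothetical toy corpus of queries.
corpus = ["google maps", "google mail", "google maps directions", "yahoo mail"]
bigrams = train_ngram(corpus, n=2)

print(prob(bigrams, ["google"], "maps"))
print(bigrams[("google",)].most_common(2))
```

Unsmoothed estimates assign probability zero to anything unseen, so a practical model would typically add smoothing or back off to shorter contexts.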
Google N-grams
● Not allowed to give the most recent data - but you can guess!!
● As of 2006:
  ○ Processed 1,024,908,267,229 words of running text
  ○ Published the counts for all 1,176,470,663 five-word sequences that appear at least 40 times.
  ○ Data available at: http://www.ldc.upenn.edu/

File sizes: approx. 24 GB compressed (gzip'ed) text files
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
Google N-gram
Example of 3-gram data in corpus:

ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection , 144
ceramics collection . 247
ceramics collection </S> 120
ceramics collection | 59
ceramics collections , 66
ceramics collections . 60
ceramics combined with 46

Example of 4-gram data in corpus:

serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607
serve as the info 42
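As a rough illustration of how such counts could feed a suggestion, the sketch below parses the 4-gram rows shown on this slide and ranks the possible last words after "serve as the" for a partially typed word. The names `parse_line` and `rank_completions` are hypothetical, and the relative frequencies are computed only over the rows listed here, not over the full corpus.

```python
# The 4-gram example rows from the slide: n-gram followed by its count.
FOURGRAM_LINES = """\
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607
serve as the info 42
""".splitlines()

def parse_line(line):
    """Split a row into (tuple of words, integer count)."""
    *words, count = line.split()
    return tuple(words), int(count)

def rank_completions(lines, context, partial=""):
    """Rank last words that follow context and start with partial.

    Returns (word, relative frequency) pairs, most frequent first. The
    frequencies are relative to the matching rows only.
    """
    rows = [parse_line(line) for line in lines if line.strip()]
    candidates = [(ngram[-1], count) for ngram, count in rows
                  if ngram[:-1] == tuple(context) and ngram[-1].startswith(partial)]
    total = sum(count for _, count in candidates)
    ranked = sorted(candidates, key=lambda wc: -wc[1])
    return [(word, count / total) for word, count in ranked] if total else []

for word, share in rank_completions(FOURGRAM_LINES, ("serve", "as", "the"), "ind")[:3]:
    print(f"{word}: {share:.3f}")
```

Note that the misspelling "indispensible" appears in the data with its own count, which hints at why spelling and spam need separate handling.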
N-gram viewer
Extending to Instant Search
Google Instant
● What's the actual effect for users?
● For the general economy?
● What about for the backend?
Areas here are suspicious!
Other Statistical Projects
● Run experiments to see if users, advertisers and we like potential feature launches
  ○ How best to assign to experiments?
  ○ How would you evaluate performance?
● The basic BackRub/PageRank model
● The advertising auction
● How often to index the web?
● Predicting flu trends
● Self-driving cars
● ...
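For the two experiment questions in the first bullet, one common textbook approach (offered here as a sketch under assumptions, not as the method used at Google) is to assign users to arms by hashing their ID together with the experiment name, and to compare a success rate between arms with a two-proportion z-test. The function names, the experiment name and the click counts below are all hypothetical.

```python
import hashlib
import math

def assign_arm(user_id, experiment, arms=("control", "treatment")):
    """Deterministically map a user to an arm by hashing (experiment, user_id).

    Including the experiment name keeps assignments in different
    experiments effectively independent of one another.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return arms[int(digest, 16) % len(arms)]

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference in proportions; returns (z, p_value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # All successes or all failures in both arms: no detectable difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    return z, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical outcome: 100 of 1000 control users clicked, 130 of 1000 treated.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(assign_arm("user-42", "instant-search"))
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

Hash-based assignment needs no stored state and gives each user the same arm on every visit; the z-test assumes independent users and reasonably large samples.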
Flu Trends
Thank you
Questions?
Jobs: http://www.google.com/about/jobs/
Contact: [email protected]
Appendix
Why is this useful?
Users
● Speed for users
● Spelling mistakes avoided - somewhat
● Better search experience - how would we measure this?
Advertisers
● Don't have to think about keyword targeting
● Could use this same methodology in finding similar searches (kind of) - how would you extend it?
Google
● Fewer spurious searches
● Can store good results and serve them quicker
N-gram viewer
Self-driving Cars