Words and More Words: Challenges of Big Data by Prof. Edie Rasmussen
Words and More Words: Challenges of Big (Text) Data
Edie Rasmussen
Visiting Professor, Nanyang Technological University
Professor, University of British Columbia
WKWSCI SYMPOSIUM 2014
Big Data, Big Ideas for Smarter Communities
Outline
• The Rise of Big Text Data
• Challenges for Text Data
• Research Opportunities
– Counting and Culturomics
– Extracting Meaning from Text
The Rise of Big Text Data
• Before there was Big Data, there were large bibliographic databases:
– Dialog: ~180 scholarly databases
– Lexis/Nexis: 5 billion documents (business/law/news)
– Citation Indexes: > 40 million records
• IR techniques designed for rapid access to very large (text) databases
• Swanson: “Undiscovered public knowledge” (1987)
Current Text Sources
• Digitized Legacy Materials
– Google Books, HathiTrust (11 million volumes, 500 TB)
• The Web
• Search Logs (over 2 million queries per minute)
• Wikipedia (~4.5 million English articles)
• Blogs (The Blogosphere)
• Twitter (The Twitterverse)
• Test Collections
– Smaller
– Experimentally more robust
Challenges of Text
• Legacy Text/Digitization Costs
• Quality (OCR Errors; Metadata Errors)
• Availability (Access, Copyright, Privacy)
• Reliability
– Algorithmic dependencies
– Creator trustworthiness
• Authorship Issues (Identification, Authority)
• Lack of Structure
• Lack of Context
• Ambiguity of human language
• Breadth vs. Depth
Processing Text
• Tokenizing, stopping, stemming
• Statistics of text: term values (tf*idf)
• “Bag of Words” approach
• Other evidence: network structures
• Similarity calculations
• Creating ranked lists
• Note: Probabilistic rather than Deterministic
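The steps on this slide can be sketched end to end in a few lines. What follows is a minimal illustration and not the method of any particular IR system: the stop list, the crude suffix stripper (standing in for a real stemmer such as Porter's), the idf variant log(N/df) + 1, and the three sample documents are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); real systems use longer ones.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def stem(token):
    """Crude suffix stripper, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Build one bag-of-words {term: tf*idf} vector per document."""
    bags = [Counter(preprocess(d)) for d in docs]
    n = len(docs)
    df = Counter(term for bag in bags for term in bag)  # document frequency
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    vectors = [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q_bag = Counter(preprocess(query))
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in q_bag.items()}
    scored = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda s: (-s[0], s[1]))


# Hypothetical sample documents.
DOCS = [
    "Indexing large text databases",
    "Stemming and stopping of tokens",
    "Ranking documents in text retrieval",
]

if __name__ == "__main__":
    for score, index in rank("ranking text", DOCS):
        print(f"{score:.3f}  {DOCS[index]}")
```

The output is a ranked list of scores and not a yes/no answer, which reflects the slide's note that retrieval works on graded evidence of relevance.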
Counting and the Rise of Culturomics
• “Culturomics is the application of high-throughput data collection and analysis to the study of human culture”
• Database of >5 million digitized books (~4% of all books ever published)
• Michel et al. (Science, 2011): “Quantitative analysis of culture using millions of digitized books”
• Google’s N-Gram Viewer
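The counting behind an n-gram viewer can be illustrated with a short sketch: for each year, count how often a phrase occurs and divide by the number of n-grams of the same length published that year. This is a simplified illustration only; the function names and the three-entry toy corpus are invented for the example and are not drawn from the Google Books data.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Yield successive n-grams (as tuples) from a list of tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def yearly_frequency(corpus, phrase):
    """Relative frequency of `phrase` among same-length n-grams, per year."""
    target = tuple(phrase.lower().split())
    n = len(target)
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        tokens = re.findall(r"[a-z']+", text.lower())
        for gram in ngrams(tokens, n):
            totals[year] += 1
            if gram == target:
                hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Hypothetical toy corpus of (publication year, text) pairs. Not real data.
TOY_CORPUS = [
    (1850, "cholera spread through the city and cholera was feared"),
    (1950, "the clinic treated gout and one case of cholera"),
    (1990, "research on hiv expanded as hiv testing improved"),
]

if __name__ == "__main__":
    for term in ("cholera", "hiv"):
        print(term, yearly_frequency(TOY_CORPUS, term))
```

Normalizing by the yearly total matters because the number of books published grows over time; raw counts alone would make almost every word appear to rise.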
Using the N-Gram Viewer
[Figure: N-Gram Viewer chart of the terms typhoid, gout, HIV, and cholera; x-axis 1800 to 2000]
How Far Will Counting Take Us?
• Many limitations (e.g. incomplete data set)
• Some surprisingly sophisticated analyses:
– Size of English lexicon
– Change in word usage (irregular verbs) over time
– Cultural turnover (inventions)
– The nature (duration) of fame
– Patterns of censorship (“suppression index”)
Critiques of Culturomics
• “The death of theory”
• “…second-rate scholars will use the Google Books corpus to churn out gigabytes of uninformative graphs and insignificant conclusions.” (Nunberg, 2011)
• Books as a representation of human history
• A “time sink”
Social Media as Big Data
• ‘Internet Minute’
– 320+ new Twitter accounts
– 100,000 new Tweets
– 2+ million search queries
– 6 new Wikipedia articles
– 30 hours of video uploaded
(Source: Intel, http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html)
TM: Topic Detection and Tracking
• Tracking a story line over time
• News wire input, identify new story, find subsequent instances
• Story segmentation, First story detection, Clustering of like stories
• Interesting to news, business, security analysts
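First story detection and clustering of like stories can be sketched as a single pass over the incoming stream: each story is compared with everything seen so far, and if nothing is similar enough it is flagged as the start of a new topic. The sketch below is one simple way to do this, not a description of any evaluated TDT system; the similarity threshold of 0.3 and the four sample headlines are assumptions chosen for the example, and story segmentation is taken as already done.

```python
import math
import re
from collections import Counter


def bag(text):
    """Bag-of-words term counts for one story."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two Counters of term counts."""
    dot = sum(c * v[t] for t, c in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def track(stories, threshold=0.3):
    """Process stories in arrival order.

    Returns (topic_ids, first_stories): a topic id for every story, and the
    indices of the stories that opened a new topic.
    """
    seen = []           # (bag, topic_id) for each earlier story
    topic_ids = []
    first_stories = []
    for i, text in enumerate(stories):
        vec = bag(text)
        best_sim, best_topic = 0.0, None
        for old_vec, topic in seen:
            sim = cosine(vec, old_vec)
            if sim > best_sim:
                best_sim, best_topic = sim, topic
        if best_topic is None or best_sim < threshold:
            # Nothing similar enough: treat this as a first story.
            best_topic = len(first_stories)
            first_stories.append(i)
        topic_ids.append(best_topic)
        seen.append((vec, best_topic))
    return topic_ids, first_stories


# Hypothetical news wire headlines, in arrival order.
WIRE = [
    "earthquake strikes coastal city",
    "markets fall on bank fears",
    "rescue teams reach earthquake city",
    "bank fears deepen as markets fall again",
]

if __name__ == "__main__":
    print(track(WIRE))
```

The threshold controls the trade-off between missing genuinely new stories and splitting one story line into several topics.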
TM: Sentiment Analysis/Opinion Mining
• Rich data from Blogs and Tweets
• Basically a classification problem (SVM, Naïve Bayes, etc.) → positive, negative, neutral
• Involves Entity Extraction, NLP, sentiment vocabularies
• Of interest to government and businesses
• See Stanford SA of movie reviews: http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
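To make the classification framing concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written from scratch. It is unrelated to the Stanford demo linked above, which uses a different model; the six training sentences and their labels are invented for the example, and a real system would need far more data plus the entity extraction and sentiment vocabularies the slide mentions.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, examples):
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for text, label in examples:
            self.class_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical labelled examples, invented for illustration.
TRAIN = [
    ("a wonderful moving film", "positive"),
    ("great acting and a great story", "positive"),
    ("dull plot and terrible acting", "negative"),
    ("a boring terrible mess", "negative"),
    ("the film opens in london", "neutral"),
    ("the story runs two hours", "neutral"),
]

if __name__ == "__main__":
    model = NaiveBayes().fit(TRAIN)
    for review in ("a great film", "terrible boring plot"):
        print(review, "->", model.predict(review))
```

Because the model treats text as a bag of words, it cannot handle negation ("not great") or sarcasm, which is one reason NLP and sentiment vocabularies are layered on top in practice.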
TM: Trends and Predictions
• Can Tweets and Search Logs be used to predict the future?
• Google Flu Trends, Google Dengue Trends – Correlated with Search Terms
• Network analysis on Tweets on Arab Spring
• Assessing tone of global news data to predict national stability, location of terrorists, etc. (Leetaru)
• Predicting opinions (recommender systems)
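The idea of correlating search terms with an outcome can be sketched with a Pearson correlation between weekly query counts and reported cases. This illustrates the general idea only and is not how Google Flu Trends or Google Dengue Trends were actually built; every number and query string below is invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no signal
    return cov / math.sqrt(var_x * var_y)


def rank_queries(query_counts, cases):
    """Return (correlation, query) pairs, most strongly correlated first."""
    scored = [(pearson(counts, cases), q) for q, counts in query_counts.items()]
    return sorted(scored, reverse=True)


# Hypothetical weekly figures, invented for illustration only.
REPORTED_CASES = [10, 20, 30, 40, 50]
QUERY_COUNTS = {
    "fever remedy": [100, 200, 300, 400, 500],
    "football scores": [300, 100, 400, 100, 300],
}

if __name__ == "__main__":
    for r, query in rank_queries(QUERY_COUNTS, REPORTED_CASES):
        print(f"{r:+.2f}  {query}")
```

A high correlation on past data does not by itself guarantee predictive power, since search behaviour can shift for reasons unrelated to the outcome being tracked.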
TM: Question Answering
• Combines multiple sources of evidence:
– Question type identification
– Information retrieval of candidate text
– Natural language processing
– Entity extraction
– Hypothesis generation and scoring (confidence)
– Ranking hypotheses
[Images: Hans Peter Luhn, 1952; Watson, 2011]
Structuring Research: “Digging Into Data” Program
• Addresses: “how "big data" changes the research landscape for the humanities and social sciences”
• 3 rounds of international research funding
• Canada, US, UK, plus Netherlands
• Team approach: scholars, scientists, information professionals
• Requires international teams; funding from at least two countries
• Wide range of datasets made available
• http://www.diggingintodata.org/
Thank you!