Information surprise or how to find interesting data
Uploaded by: oleksandr-pryymak
Transcript of Information surprise or how to find interesting data
![Page 2: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/2.jpg)
What is a ‘surprise’?
![Page 3: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/3.jpg)
Define Surprise!
surprise
[countable] an event, a piece of news, etc. that is unexpected or that happens suddenly. SYNONYMS: shock, … , eye-opener
[uncountable, countable] a feeling caused by something happening suddenly or unexpectedly. SYNONYMS: astonishment, ...
(Oxford Advanced Learner's Dictionary)
![Page 4: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/4.jpg)
Cat explores
![Page 5: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/5.jpg)
Cat explores
meh
![Page 6: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/6.jpg)
Cat meets unexpected
![Page 7: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/7.jpg)
Cat meets unexpected
wow
![Page 8: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/8.jpg)
Quantify Surprise!
measured in wows?
![Page 9: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/9.jpg)
Quantify
Complexity can measure any content type.
Note: complex is not random!
Measures of complexity:
1. Subjective rating
2. Number of distinct elements
3. Number of dimensions
4. Number of control parameters
5. Minimal description
6. Information content
7. Minimal generator
8. Minimum energy
Abdallah, S., & Plumbley, M. (2009). Information dynamics: patterns of expectation and surprise in the perception of music. Connection Science, 21(2-3), 89-117.
<vs>
![Page 10: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/10.jpg)
Surprise Quants in academia
Neuro/Cognitive Science: How do we perceive information?
Machine Learning: How to measure differences?
![Page 11: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/11.jpg)
... machine that constantly tells you what you already know is just irritating. So software alerts users only to surprises...
Horvitz, E., Apacible, J., Sarin, R., & Liao, L. Prediction, Expectation, and Surprise: Methods, Designs, and Study of a Deployed Traffic Forecasting Service.
Friston, K. (2010). The free-energy principle: a unified brain theory?. Nature Reviews Neuroscience, 11(2), 127-138.
Surprise Quants in academia
Neuro/Cognitive Science: How do we perceive information?
Machine Learning: How to measure differences?
![Page 12: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/12.jpg)
Surprise Quants in academia
Machine Learning
Neuro/Cognitive Science
Itti, L., & Baldi, P. F. (2005). Bayesian surprise attracts human attention. In Advances in neural information processing systems (pp. 547-554).
![Page 13: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/13.jpg)
Surprise Quants in academia
Itti, L., & Baldi, P. F. (2005). Bayesian surprise attracts human attention. In Advances in neural information processing systems (pp. 547-554).
meh
wow
meh
![Page 14: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/14.jpg)
Typical ML applications
Unsupervised Learning
1. Decision trees (information gain)
2. MaxEnt principle
3. ...
Specifically after ‘surprise’:
4. One-class classification
5. Anomaly detection
6. Novelty measure
Pimentel, M. A., Clifton, D. A., Clifton, L., & Tarassenko, L. (2014). A review of novelty detection. Signal Processing, 99, 215-249.
![Page 15: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/15.jpg)
Model of a cat
Diagram components:
- Data (stream)
- Element (attention window)
- Data Model (expectations)
- Surprising? (interesting, new)
- wow (act)
- meh (ignore)
- Update
![Page 16: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/16.jpg)
Model of a cat’s surprise
Surprising? (interesting, new)
![Page 17: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/17.jpg)
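Quantify surprisal /self-information/
The surprise /information/ $S(p)$ in observing the occurrence of an event having probability $p$.
Axioms:
- if $p \le q$ then $S(p) \ge S(q)$
- $S(p \cdot q) = S(p) + S(q)$
Derive:
- $S(p^2) = S(p \cdot p) = 2 \cdot S(p)$
- $S(p^n) = n \cdot S(p)$
Surprisal /self-information/: $S(p) = -\log_2 p$
Flipping a fair coin provides 1 bit of new information: $S(1/2) = -\log_2 \tfrac{1}{2} = 1$.
Units: bits (or wows).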
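A minimal sketch of the surprisal formula, assuming probabilities are given as plain floats. The function name `surprisal` is an illustrative choice, not something from the slides.

```python
import math

def surprisal(p: float) -> float:
    """Self-information, in bits, of an event with probability p (0 < p <= 1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

print(surprisal(0.5))    # fair coin flip: 1.0 bit
print(surprisal(1 / 6))  # one face of a fair die: more surprising than a coin

# Additivity for independent events: S(p * q) equals S(p) + S(q)
print(math.isclose(surprisal(0.5 * 0.25), surprisal(0.5) + surprisal(0.25)))
```

Rare events give large values and certain events give zero, which matches the "wow" versus "meh" intuition.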
![Page 18: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/18.jpg)
Surprisal applications
Selecting information source:
Oleksandr Pryymak. Achieving Accurate Opinion Consensus in Large Multi-Agent Systems. University of Southampton, Doctoral Thesis, 170 pp., 2013.
![Page 19: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/19.jpg)
Model of a cat
Diagram components:
- Data (stream)
- Element (attention window)
- Data Model (expectations)
- Surprising? (interesting, new)
- wow (act)
- meh (ignore)
- Update
![Page 20: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/20.jpg)
Model of a cat’s knowledge
Data Model (expectations)
![Page 21: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/21.jpg)
Quantify ‘knowledge’ /entropy/
The Shannon entropy is the expected value of the self-information:
$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$
Notes:
1. The maximum entropy distribution is the least informative. For $n$ outcomes the maximum is $\log_2(n)$, reached by the uniform distribution.
2. Entropy in statistical mechanics and information entropy are, in principle, the same concept.
[Figure] Entropy of a Bernoulli trial, $X \in \{0, 1\}$
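The entropy definition can be sketched in a few lines, assuming a distribution is passed as a sequence of probabilities that sum to 1. The names `entropy` and `bernoulli_entropy` are illustrative.

```python
import math
from typing import Sequence

def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def bernoulli_entropy(p: float) -> float:
    """Entropy of a single trial with success probability p."""
    return entropy([p, 1.0 - p])

# The Bernoulli curve peaks at p = 0.5 and falls to zero at p = 0 and p = 1.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(bernoulli_entropy(p), 3))

# A uniform distribution over n outcomes reaches the maximum, log2(n).
n = 8
print(entropy([1 / n] * n), math.log2(n))
```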
![Page 22: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/22.jpg)
Entropy applications
Analysis of the binary file of a GeoIP ISP database:
Analyzing unknown binary files using information entropy: http://yurichev.com/blog/entropy/
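A block-wise entropy scan of the kind used to analyze unknown binary files might look like the sketch below. The window size of 256 bytes and the synthetic `sample` data are assumptions made for illustration and are not taken from the linked post.

```python
import math
import os
from collections import Counter
from typing import List

def byte_entropy(block: bytes) -> float:
    """Shannon entropy of a block in bits per byte (between 0 and 8)."""
    if not block:
        return 0.0
    n = len(block)
    return -sum(c / n * math.log2(c / n) for c in Counter(block).values())

def entropy_profile(data: bytes, window: int = 256) -> List[float]:
    """Entropy of consecutive, non-overlapping windows of the data."""
    return [byte_entropy(data[i:i + window]) for i in range(0, len(data), window)]

# Synthetic "file": zero padding, repetitive text, then random bytes
# (random bytes stand in for compressed or encrypted regions).
sample = b"\x00" * 512 + b"hello world, " * 40 + os.urandom(512)
for offset, h in zip(range(0, len(sample), 256), entropy_profile(sample)):
    print(f"{offset:6d}  {h:.2f}")
```

Padding scores near 0, text lands in the middle, and random-looking regions approach 8 bits per byte.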
![Page 23: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/23.jpg)
Entropy applications
Visualizing the OSX ksh binary (see binvis.io)
Visualizing entropy in binary files: http://corte.si/posts/visualisation/entropy/index.html
1, 2: Cryptic signature
![Page 24: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/24.jpg)
Model of a cat’s discovery
Diagram components:
- Element (attention window)
- Data Model (expectations)
- Surprising? (interesting, new)
- wow (act)
- meh (ignore)
What has changed?
![Page 25: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/25.jpg)
Quantify ‘discovery’ /information gain/
The Kullback–Leibler divergence /relative entropy, information gain/ is a measure of the information lost when Q is used to approximate P. It measures the expected number of extra bits required to code samples from P with a code optimized for Q:
$D_{KL}(P \| Q) = \sum_i P(i) \log_2 \frac{P(i)}{Q(i)}$
[Figure] "KL-Gauss-Example", T. Nathan Mundhenk
Not a true metric: it is asymmetric, $D_{KL}(P \| Q) \ne D_{KL}(Q \| P)$
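A small sketch of the divergence for discrete distributions, assuming both are given as probability lists over the same outcomes. The function name `kl_divergence` and the example distributions are illustrative.

```python
import math
from typing import Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(P || Q) in bits for two discrete distributions over the same outcomes."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # a term with P(i) = 0 contributes nothing
        if qi == 0:
            return math.inf   # P puts mass where Q has none
        total += pi * math.log2(pi / qi)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]
# The two directions give different values, which is the asymmetry noted above.
print(kl_divergence(p, q), kl_divergence(q, p))
```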
![Page 26: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/26.jpg)
Quantify ‘discovery surprise’
Symmetric KL distances (several variants exist; all result in the same performance):
Pinto, D., Benedí, J. M., & Rosso, P. (2007). Clustering narrow-domain short texts by using the Kullback-Leibler distance. In Computational Linguistics and Intelligent Text Processing.
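The formulas from this slide did not survive in the transcript, so the sketch below shows three commonly used ways to symmetrize KLD (sum of both directions, maximum of both directions, and the Jensen-Shannon form). These are offered as assumptions about what "symmetric KL distance" can mean, not as the exact variants on the slide.

```python
import math
from typing import Sequence

def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(P || Q) in bits; infinite if Q is zero where P is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total

def sum_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of both directions."""
    return kl(p, q) + kl(q, p)

def max_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Larger of the two directions."""
    return max(kl(p, q), kl(q, p))

def jensen_shannon(p: Sequence[float], q: Sequence[float]) -> float:
    """Average divergence of P and Q from their midpoint; always finite."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [0.5, 0.5], [0.9, 0.1]
for dist in (sum_kl, max_kl, jensen_shannon):
    print(dist.__name__, dist(p, q), dist(q, p))
```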
![Page 27: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/27.jpg)
Calculating KLD
Data sparseness problem: KLD is often ∞ (whenever Q assigns zero probability to a term that occurs in P)
Solutions:
- drop components from calculations
- smoothing
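The smoothing formula is missing from the transcript. As one possible illustration, the sketch below uses additive (Laplace) smoothing over the joint vocabulary; the parameter `alpha` and both function names are assumptions, and the slide may have used a different scheme.

```python
import math
from collections import Counter
from typing import Dict, Mapping, Set

def smoothed_distribution(counts: Mapping[str, int], vocabulary: Set[str],
                          alpha: float = 1.0) -> Dict[str, float]:
    """Add alpha to every count so that no term has zero probability."""
    total = sum(counts.get(t, 0) for t in vocabulary) + alpha * len(vocabulary)
    return {t: (counts.get(t, 0) + alpha) / total for t in vocabulary}

def smoothed_kld(p_counts: Mapping[str, int], q_counts: Mapping[str, int],
                 alpha: float = 1.0) -> float:
    """KLD in bits between two term-count tables; always finite for alpha > 0."""
    vocabulary = set(p_counts) | set(q_counts)
    p = smoothed_distribution(p_counts, vocabulary, alpha)
    q = smoothed_distribution(q_counts, vocabulary, alpha)
    return sum(p[t] * math.log2(p[t] / q[t]) for t in vocabulary)

hour = Counter({"maidan": 40, "kyiv": 25})
background = Counter({"weather": 300, "kyiv": 20})
# "maidan" never occurs in the background, yet the result stays finite.
print(smoothed_kld(hour, background))
```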
![Page 28: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/28.jpg)
Surprise in Tweets
KLD application
![Page 29: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/29.jpg)
Surprise in Tweets
KLD application
![Page 30: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/30.jpg)
Explore data: search engines
Elasticsearch + Kibana = faceted data exploration
![Page 31: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/31.jpg)
Whole dataset
I still have hopes to find where I left this partition
![Page 32: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/32.jpg)
Whole dataset
![Page 33: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/33.jpg)
Whole dataset
![Page 34: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/34.jpg)
Whole dataset
- MH17: July 17, 2014
- Annexation of Crimea: Feb 20 ... March 20, 2014
- Presidential elections: May 25, 2014
- Experiments SET: Feb 1 - 28, 2014
![Page 35: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/35.jpg)
Experiment dataset: Feb 2014
tweets: 5.64 M
![Page 36: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/36.jpg)
Experiment dataset: English
![Page 37: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/37.jpg)
Pipeline
Diagram components:
- Stream (tweets)
- Timeslot (attention window)
- Last 8 timeslots (data model)
- KLD (interesting, new)
- new event (act)
- meh (ignore)
- Update
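The pipeline components can be sketched as a generator that scores each timeslot against a model built from the previous 8 slots and then updates the model. This is a minimal sketch under assumptions: each slot is a `Counter` of term counts, additive smoothing with `alpha` keeps the KLD finite, and the `threshold` used to label a slot is a made-up value.

```python
import math
from collections import Counter, deque
from typing import Deque, Iterable, Iterator

def kld_bits(slot: Counter, model: Counter, alpha: float = 1.0) -> float:
    """Smoothed KLD (bits) of a slot's term distribution against the model's."""
    vocab = set(slot) | set(model)
    if not vocab:
        return 0.0
    p_total = sum(slot.values()) + alpha * len(vocab)
    q_total = sum(model.values()) + alpha * len(vocab)
    return sum(
        (slot[t] + alpha) / p_total
        * math.log2(((slot[t] + alpha) / p_total) / ((model[t] + alpha) / q_total))
        for t in vocab
    )

def rolling_surprise(slots: Iterable[Counter], history: int = 8) -> Iterator[float]:
    """Score each slot against the last `history` slots, then update the model."""
    window: Deque[Counter] = deque(maxlen=history)
    for slot in slots:
        model = sum(window, Counter())  # empty model gives a uniform prior
        yield kld_bits(slot, model)
        window.append(slot)             # the "Update" step

slots = [Counter(a=5, b=5), Counter(a=6, b=4), Counter(a=5, b=5), Counter(c=10)]
threshold = 1.0  # hypothetical cut-off between "new event" and "meh"
for i, score in enumerate(rolling_surprise(slots)):
    print(i, round(score, 3), "new event" if score > threshold else "meh")
```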
![Page 38: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/38.jpg)
Simplistic topic modeling
- tweets are super short
+ important events are widely discussed
+ events change vocabulary
- timeslot aggregation favors the predominant event
Document is a timeslot.
Model:
- bag of words
- freq. threshold > 200 tweets
- term frequency (naive)
- tokenizer: https://github.com/jaredks/tweetokenize + a few touches
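A rough sketch of the bag-of-words model per timeslot follows. Several details are assumptions: a simple regular expression stands in for the tweetokenize tokenizer, timeslots are taken to be hours, and the frequency threshold is read as "keep terms that occur in more than `min_tweets` tweets of the dataset".

```python
import re
from collections import Counter, defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

TOKEN_RE = re.compile(r"[#@]?\w+")  # stand-in tokenizer, keeps hashtags and mentions

def tokenize(text: str) -> List[str]:
    return TOKEN_RE.findall(text.lower())

def slot_term_frequencies(tweets: Iterable[Tuple[datetime, str]],
                          min_tweets: int = 200) -> Dict[datetime, Counter]:
    """Term frequencies per hourly slot, restricted to sufficiently common terms."""
    slots: Dict[datetime, Counter] = defaultdict(Counter)
    tweet_freq: Counter = Counter()      # number of tweets containing each term
    for created_at, text in tweets:
        tokens = tokenize(text)
        hour = created_at.replace(minute=0, second=0, microsecond=0)
        slots[hour].update(tokens)       # naive term frequency
        tweet_freq.update(set(tokens))
    keep = {t for t, n in tweet_freq.items() if n > min_tweets}
    return {hour: Counter({t: c for t, c in tf.items() if t in keep})
            for hour, tf in slots.items()}

tweets = [
    (datetime(2014, 2, 20, 10, 5), "Kyiv #maidan now"),
    (datetime(2014, 2, 20, 10, 40), "maidan Kyiv"),
    (datetime(2014, 2, 20, 11, 1), "kyiv"),
]
print(slot_term_frequencies(tweets, min_tweets=1))
```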
![Page 39: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/39.jpg)
Simplistic topic modeling
Document is a timeslot.
Model:
- bag of words
- freq. threshold > 200 tweets
- term frequency (naive)
- tokenizer: https://github.com/jaredks/tweetokenize + a few touches
![Page 40: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/40.jpg)
Vocabulary diversity
Follows daily cycles
(gap in the data: ran out of disk space)
![Page 41: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/41.jpg)
Test a domain-specific hack
Vocabulary: catastrophe
…
![Page 42: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/42.jpg)
Vocabulary slots: KLD
How surprising is the vocabulary of each hour against the whole dataset?
Beware: on this scale individual hours are small, but events are plentiful
Higher KLD on sparse data
Lower KLD on dense data
![Page 43: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/43.jpg)
Vocabulary slots: KLD smoothed
Smoothing did not change peaks
new minimum
![Page 44: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/44.jpg)
Vocabulary slots: rolling KLD
How surprising is the vocabulary of each hour against the last 24h?
Less variation on dense data
![Page 45: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/45.jpg)
Vocabulary slots: rolling KLD
How surprising is the vocabulary of each hour against the last 8h?
![Page 46: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/46.jpg)
Vocabulary slots: rolling KLD
How surprising is the vocabulary of each hour against the last 4h?
![Page 47: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/47.jpg)
Event Detection Problem
Outlier detection:
- rate of change of the ‘surprise’
Compare against:
![Page 48: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/48.jpg)
Rolling KLD outliers
Events: detected rate change
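The slides do not say which outlier test was applied to the rate of change. As one simple possibility, the sketch below flags slots where the KLD rose by more than `k` standard deviations above the mean change; both the rule and the value of `k` are assumptions.

```python
from statistics import mean, stdev
from typing import List, Sequence

def rate_change_outliers(kld: Sequence[float], k: float = 2.0) -> List[int]:
    """Indices of slots whose KLD rose unusually fast relative to the previous slot."""
    deltas = [b - a for a, b in zip(kld, kld[1:])]
    if len(deltas) < 2:
        return []
    mu, sigma = mean(deltas), stdev(deltas)
    # deltas[i] is the change from slot i to slot i + 1
    return [i + 1 for i, d in enumerate(deltas) if d > mu + k * sigma]

series = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.2, 0.1, 0.1]
print(rate_change_outliers(series))  # flags the jump into slot 8
```

Working on the first difference rather than the raw KLD means a slot is flagged when surprise rises sharply, not merely when it stays high.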
![Page 49: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/49.jpg)
Rolling KLD outliers: tokens
Annotate events with the most surprising tokens
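One way to pick the annotating tokens is to rank terms by their individual contribution to the KLD sum, $p \log_2 (p/q)$. The sketch below assumes term counts are held in `Counter` objects and uses additive smoothing; the function name and the sample counts are illustrative.

```python
import math
from collections import Counter
from typing import List, Tuple

def surprising_tokens(slot: Counter, model: Counter, top: int = 5,
                      alpha: float = 1.0) -> List[Tuple[str, float]]:
    """Tokens ranked by their contribution p * log2(p / q) to the smoothed KLD."""
    vocab = set(slot) | set(model)
    p_total = sum(slot.values()) + alpha * len(vocab)
    q_total = sum(model.values()) + alpha * len(vocab)
    contrib = {}
    for t in vocab:
        p = (slot[t] + alpha) / p_total   # share of the token in this slot
        q = (model[t] + alpha) / q_total  # share expected from the model
        contrib[t] = p * math.log2(p / q)
    return sorted(contrib.items(), key=lambda kv: kv[1], reverse=True)[:top]

slot = Counter({"maidan": 50, "kyiv": 30, "the": 100})
model = Counter({"the": 1000, "kyiv": 300, "weather": 200})
print(surprising_tokens(slot, model, top=3))
```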
![Page 50: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/50.jpg)
Further dataset limitation
prime events
![Page 51: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/51.jpg)
Rolling KLD outliers: Feb 19-28
![Page 52: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/52.jpg)
Find representative tweets
Diagram components:
- Timeslot (attention window)
- Last 8 timeslots (data model)
- KLD (surprising)
- -KLD (least surprising)
- Update
- surprising tweets
Steps:
1. Detect distinct features
2. Find elements representing distinct features
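The two steps could be sketched as follows: score every token by its contribution to the slot's KLD (the distinct features), then rank tweets by the mean score of their tokens. The averaging rule and all names are assumptions, since the slides do not spell out how tweets were scored.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

def token_scores(slot: Counter, model: Counter, alpha: float = 1.0) -> Dict[str, float]:
    """Step 1: per-token contribution to the smoothed KLD of slot against model."""
    vocab = set(slot) | set(model)
    p_total = sum(slot.values()) + alpha * len(vocab)
    q_total = sum(model.values()) + alpha * len(vocab)
    scores = {}
    for t in vocab:
        p = (slot[t] + alpha) / p_total
        q = (model[t] + alpha) / q_total
        scores[t] = p * math.log2(p / q)
    return scores

def rank_tweets(tweets: Sequence[List[str]], model: Counter,
                alpha: float = 1.0) -> List[Tuple[float, List[str]]]:
    """Step 2: tokenized tweets sorted from most to least surprising."""
    slot = Counter(tok for tweet in tweets for tok in tweet)
    scores = token_scores(slot, model, alpha)
    ranked = [(sum(scores[t] for t in tweet) / len(tweet), tweet)
              for tweet in tweets if tweet]
    return sorted(ranked, key=lambda item: item[0], reverse=True)

model = Counter({"the": 1000, "weather": 200, "kyiv": 100})
tweets = [["the", "weather"], ["maidan", "kyiv", "now"], ["the", "the"]]
for score, tweet in rank_tweets(tweets, model):
    print(round(score, 3), tweet)
```

The top of the ranking gives candidate representative tweets; the bottom gives the least surprising ones.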
![Page 53: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/53.jpg)
Surprising tweets
Only from users with 500+ followers
![Page 54: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/54.jpg)
The only spam/bot tweet selected comes from the first time slot, when the prior is uniform. Notice: the dataset is not filtered!
![Page 55: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/55.jpg)
![Page 56: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/56.jpg)
![Page 57: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/57.jpg)
![Page 58: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/58.jpg)
![Page 59: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/59.jpg)
![Page 60: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/60.jpg)
majdannezalezhnosti.blogspot.com
![Page 61: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/61.jpg)
To improve in Tweets app
1. Benchmark: ‘hot’ events from media
2. Fight bots
   a. spam (repetitions, bots)
   b. ‘forced’ opinions
   c. filter low quality
3. Topic model
   a. not just Term Frequency
   b. split topics (!)
![Page 62: Information surprise or how to find interesting data](https://reader034.fdocuments.us/reader034/viewer/2022042608/55cf29acbb61ebbd668b4647/html5/thumbnails/62.jpg)
Questions?
art by www.facebook.com/Marysya.Rudska