Transcript of CS109/Stat121/AC209/E-109 Data Science: Bayesian Methods Continued, Text Data

Page 1:

CS109/Stat121/AC209/E-109 Data Science

Bayesian Methods Continued, Text Data

Hanspeter Pfister, Joe Blitzstein, Verena Kaynig

[Background figure on the title slide: Figure 1 from Blei (2011), the LDA schematic of topics, documents, and topic proportions and assignments. The figure and its accompanying text are reproduced on the Topic Modeling slides (Pages 25-26) below.]

Blei, https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf

Page 2:

This Week

• Project team info is due tonight at 11:59 pm via the Google form: http://goo.gl/forms/CzVRluCZk6

• HW4 is due this Thursday (Nov 5) at 11:59 pm

• Before this Thursday's lecture on interactive visualizations:

  • Download/install Tableau Public at https://public.tableau.com/

  • Download the data file (.zip) from http://bit.ly/cs109data

Page 3:

MCMC as mountain exploration

http://healthyalgorithms.com/2010/03/12/a-useful-metaphor-for-explaining-mcmc/

Page 4:

Bayesian Hierarchical Models: Radon Example

Example from Gelman http://www.eecs.berkeley.edu/~russell/classes/cs294/f05/papers/gelman-2005.pdf

Python-based exposition at http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/

Page 5:

Complete Pooling vs. No pooling

complete pooling: $\text{radon}_{i,c} = \alpha + \beta \cdot \text{floor}_{i,c} + \epsilon$

no pooling: $\text{radon}_{i,c} = \alpha_c + \beta_c \cdot \text{floor}_{i,c} + \epsilon_c$

Page 6:

Partial Pooling

no pooling:

partial pooling / hierarchical model:

Page 7:

Partial Pooling

$\text{radon}_{i,c} = \alpha_c + \beta_c \cdot \text{floor}_{i,c} + \epsilon_c$

$\alpha_c \sim \mathcal{N}(\mu_\alpha, \sigma^2_\alpha)$

$\beta_c \sim \mathcal{N}(\mu_\beta, \sigma^2_\beta)$
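Below is a rough PyMC3-style sketch of this partial-pooling model, in the spirit of the blog post linked on the next slide. It assumes arrays log_radon, floor (a 0/1 indicator), and county_idx (an integer county index per house) are already loaded; the variable names and the specific hyperpriors are illustrative assumptions, not taken from the slides.

    # Sketch of the hierarchical radon model (assumed data: log_radon, floor, county_idx)
    import pymc3 as pm

    n_counties = county_idx.max() + 1

    with pm.Model() as hierarchical_model:
        # Hyperpriors for the county-level intercepts and slopes
        mu_alpha = pm.Normal('mu_alpha', mu=0.0, sd=10.0)
        sigma_alpha = pm.HalfCauchy('sigma_alpha', beta=5.0)
        mu_beta = pm.Normal('mu_beta', mu=0.0, sd=10.0)
        sigma_beta = pm.HalfCauchy('sigma_beta', beta=5.0)

        # County-specific coefficients, partially pooled toward the common means
        alpha = pm.Normal('alpha', mu=mu_alpha, sd=sigma_alpha, shape=n_counties)
        beta = pm.Normal('beta', mu=mu_beta, sd=sigma_beta, shape=n_counties)

        # Measurement noise and likelihood
        eps = pm.HalfCauchy('eps', beta=5.0)
        radon_est = alpha[county_idx] + beta[county_idx] * floor
        pm.Normal('radon_like', mu=radon_est, sd=eps, observed=log_radon)

        trace = pm.sample(2000)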

Page 8:

http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/

Page 9:

Hierarchical Models Provide:

• a compromise between no pooling and complete pooling
• regularization and shrinkage
• sensible estimates even for small groups
• parameters organized in an interpretable way
• incorporation of information at different levels of the hierarchy (e.g., individual level, county level, state level)
• predictions at various levels of the hierarchy (e.g., for a new house or for a new county)

Page 10:

Gibbs Sampler

Explore the space by updating one coordinate at a time:

Draw a new $\theta_1$ from the conditional distribution of $\theta_1 \mid \theta_2$.

Draw a new $\theta_2$ from the conditional distribution of $\theta_2 \mid \theta_1$.

Repeat.

2D parameter space version:

http://zoonek.free.fr/blosxom//R/2006-06-22_useR2006_rbiNormGiggs.png
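As a concrete illustration, here is a minimal Gibbs sampler for a standard bivariate normal with correlation rho, the kind of 2D example pictured above. The target distribution and the value of rho are assumptions of this sketch; each coordinate is drawn from its exact conditional distribution given the other.

    # Minimal Gibbs sampler for a bivariate normal with correlation rho (illustrative)
    import numpy as np

    def gibbs_bivariate_normal(n_samples=5000, rho=0.8, seed=0):
        rng = np.random.RandomState(seed)
        theta1, theta2 = 0.0, 0.0
        cond_sd = np.sqrt(1.0 - rho ** 2)        # sd of theta_i given theta_j
        samples = np.empty((n_samples, 2))
        for t in range(n_samples):
            # Draw new theta1 from theta1 | theta2 ~ N(rho * theta2, 1 - rho^2)
            theta1 = rng.normal(rho * theta2, cond_sd)
            # Draw new theta2 from theta2 | theta1 ~ N(rho * theta1, 1 - rho^2)
            theta2 = rng.normal(rho * theta1, cond_sd)
            samples[t] = theta1, theta2
        return samples

    draws = gibbs_bivariate_normal()
    print(draws.mean(axis=0), np.corrcoef(draws.T)[0, 1])  # near (0, 0) and rho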

Page 11:

Gibbs sampler animation

http://twiecki.github.io/blog/2014/01/02/visualizing-mcmc/

Page 12:

Metropolis-Hastings Algorithm

Modify a Markov chain on a state space of interest to obtain a new chain with any desired stationary distribution!

1. If $X_n = i$, propose a new state $j$ using the transition probabilities $p_{ij}$ of the original Markov chain.

2. Compute an acceptance probability,
$$a_{ij} = \min\left(\frac{s_j p_{ji}}{s_i p_{ij}},\, 1\right).$$

3. Flip a coin that lands Heads with probability $a_{ij}$, independently of the Markov chain.

4. If the coin lands Heads, accept the proposal and set $X_{n+1} = j$. Otherwise, stay in state $i$; set $X_{n+1} = i$.

In other words, the modified Markov chain uses the original transition probabilities $p_{ij}$ to propose where to go next, then accepts the proposal with probability $a_{ij}$, staying in its current state in the event of a rejection. The claim is that the modified Markov chain (let's call its transition matrix $Q$) has stationary distribution $s$.

To prove this claim, we just need to check the reversibility condition $s_i q_{ij} = s_j q_{ji}$ for all $i$ and $j$. Let's consider the case where $i$ and $j$ satisfy $s_j p_{ji} \le s_i p_{ij}$. If both sides are 0, then $s_i q_{ij} = s_j q_{ji} = 0$ trivially; otherwise, we have $a_{ij} = \frac{s_j p_{ji}}{s_i p_{ij}}$ and $a_{ji} = 1$, so

$$s_i q_{ij} = s_i p_{ij} a_{ij} = s_i p_{ij} \cdot \frac{s_j p_{ji}}{s_i p_{ij}} = s_j p_{ji} = s_j p_{ji} a_{ji} = s_j q_{ji}.$$

Symmetrically, if $s_j p_{ji} > s_i p_{ij}$, we also have $s_i q_{ij} = s_j q_{ji}$, by switching the roles of $i$ and $j$ in the calculation above. Since the reversibility condition is satisfied, $s$ is the stationary distribution of the modified Markov chain with transition matrix $Q$.

In other words, we can supply any old Markov chain as the input to Metropolis-Hastings, and as long as that chain is able to reach all states in the state space, the algorithm will return a spiffy new Markov chain whose stationary distribution is $s$. Here is an example of how Metropolis-Hastings can work on a very large state space.

Example 1.1.2 (Code-breaking with Metropolis-Hastings). Markov chains have recently been applied to code-breaking; this example will consider one way in which this can be done. A substitution cipher is a permutation $g$ of the letters from a to z, where a message is enciphered by replacing each letter $\alpha$ by $g(\alpha)$. For example, if $g$ is the permutation given by
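A minimal code sketch of the Metropolis-Hastings update described above, for a small discrete state space. The target distribution s and the proposal chain P below are made-up inputs for illustration; the acceptance probability follows $a_{ij} = \min(s_j p_{ji} / (s_i p_{ij}), 1)$.

    # Metropolis-Hastings on a discrete state space (illustrative target and proposal)
    import numpy as np

    def metropolis_hastings(s, P, n_steps=10000, x0=0, seed=0):
        rng = np.random.RandomState(seed)
        x = x0
        chain = np.empty(n_steps, dtype=int)
        for t in range(n_steps):
            i = x
            j = rng.choice(len(s), p=P[i])                        # propose j using p_ij
            a_ij = min(s[j] * P[j, i] / (s[i] * P[i, j]), 1.0)    # acceptance probability
            if rng.rand() < a_ij:                                 # flip the coin
                x = j                                             # accept: X_{n+1} = j
            # otherwise stay in state i: X_{n+1} = i
            chain[t] = x
        return chain

    s = np.array([0.1, 0.2, 0.3, 0.4])                 # desired stationary distribution
    P = np.full((4, 4), 0.25)                          # original (proposal) Markov chain
    chain = metropolis_hastings(s, P)
    print(np.bincount(chain, minlength=4) / len(chain))  # long-run frequencies approximate s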

Page 13:

https://www.siam.org/pdf/news/637.pdf

Page 14:

Metropolis-Hastings animation

http://twiecki.github.io/blog/2014/01/02/visualizing-mcmc/

Page 15:

MCMC in Python

• Stan: http://mc-stan.org

• PyMC: https://pymc-devs.github.io/pymc/

Page 16:

Mosteller-Wallace, Federalist Papers Authorship

Page 17:

Mosteller-Wallace, Federalist Papers Authorship

https://www.stat.cmu.edu/Exams/mosteller.pdf

Page 18:

Use of “upon” by Hamilton vs. Madison

…Hamilton identified as the author of forty-one, and James Madison being credited with writing fourteen of the others. Of the fifteen papers whose authorship was uncertain, three were determined to have been written by a third party, and each of a group of twelve was variously attributed to Hamilton or to Madison, with scholars of high repute stacking up on both sides in a vigorous game of academic tug of war. Enter Mosteller and Wallace (1964), two statisticians who were convinced that the appropriate classification could be made through a careful analysis of word usage. In essence, it could be determined from writings whose authorship was certain that there were words that Hamilton used a lot more than Madison, and vice versa. For example, Hamilton used the words "on" and "upon" interchangeably, while Madison used "on" almost exclusively. The table below, drawn from the Mosteller and Wallace study, can be viewed as a specification of a stochastic model for the frequency of occurrence of the word "upon" in arbitrary essays written by each of these two authors. The collection of written works examined consisted of forty-eight works known to have been authored by Hamilton, fifty works known to have been authored by Madison, and twelve Federalist Papers whose authorship was in dispute. By itself, this analysis presents a strong argument for classifying at least eleven of the twelve disputed papers as Madisonian. In combination with a similar treatment of other "non-contextual" words in these writings, this approach provided strong evidence that Madison was the author of all twelve of the disputed papers, essentially settling the authorship debate.

Rate/1000 Words    Authored by Hamilton    Authored by Madison    12 Disputed Papers
Exactly 0                   0                      41                     11
(0.0, 0.4)                  0                       2                      0
[0.4, 0.8)                  0                       4                      0
[0.8, 1.2)                  2                       1                      1
[1.2, 1.6)                  3                       2                      0
[1.6, 2.0)                  6                       0                      0
[2.0, 3.0)                 11                       0                      0
[3.0, 4.0)                 11                       0                      0
[4.0, 5.0)                 10                       0                      0
[5.0, 6.0)                  3                       0                      0
[6.0, 7.0)                  1                       0                      0
[7.0, 8.0)                  1                       0                      0
Totals:                    48                      50                     12

Table 1.2.1. Frequency distribution of the word “upon” in 110 essays.

Exercises 1.2

1. Specify the sample space for the experiment consisting of three consecutive tosses of a fair coin, and specify a stochastic model for this experiment. Using that model, compute the probability that you (a) obtain exactly one head, (b) obtain more heads than tails, (c) obtain the same outcome each time.

2. Suppose a pair of balanced dice is rolled and the number of dots facing upwards is noted. Compute the probability that (a) both numbers are odd, (b) the sum is odd, (c) one number is twice as large as the other number, (d) the larger number exceeds the smaller number by 1, and (e) the outcome is in the "field" (a term used in the game of Craps), that is, the sum is among the numbers 5, 6, 7, and 8.

But what is the probability that Madison authored a particular disputed document, and how confident should we be about our answer?

Table from Samaniego, Stochastic Modeling and Mathematical Statistics

Page 19:

Poisson Model

$f(y \mid \lambda) = \dfrac{e^{-\lambda}\lambda^y}{y!}$

$y$ is the number of occurrences of a specific word

$\lambda$ is the rate parameter

Gamma prior is conjugate: $p(\lambda) \propto \lambda^{a-1} e^{-b\lambda}$
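As a quick illustration of the conjugacy: a Gamma(a, b) prior combined with Poisson counts $y_1, \ldots, y_n$ gives a Gamma$(a + \sum y_i,\, b + n)$ posterior. The sketch below uses made-up prior values and counts, and assumes scipy is available.

    # Conjugate Gamma-Poisson updating (illustrative prior and counts)
    import numpy as np
    from scipy import stats

    a, b = 1.0, 1.0                      # illustrative Gamma prior (shape, rate)
    y = np.array([2, 0, 1, 3, 1])        # made-up word counts in 5 blocks of text

    a_post = a + y.sum()                 # posterior shape
    b_post = b + len(y)                  # posterior rate

    posterior = stats.gamma(a=a_post, scale=1.0 / b_post)    # scipy uses scale = 1/rate
    print(posterior.mean(), posterior.interval(0.90))        # posterior mean, 90% interval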

Page 20:

[Fig. 12.2: Posterior and the likelihood function for the rate of using the word "from" in Madison's writings; the posterior density plotted is dgamma(x, shape = 331.6, rate = 270.3).]

We find the posterior median using the gamma quantile function qgamma with inputs 0.5 (the probability) and the shape and rate parameters.

> qgamma(0.5, shape=331.6, rate=270.3)
[1] 1.225552

Our opinion after observing the data is that Madison's rate is equally likely to be smaller or larger than 1.226.

A Bayesian interval estimate for the rate parameter is an interval $(\lambda_{LO}, \lambda_{HI})$ that contains $\lambda$ with a specific probability $\gamma$:

$P(\lambda_{LO} < \lambda < \lambda_{HI}) = \gamma.$

Suppose we want a 90% Bayesian interval, where $\gamma = 0.90$. An "equal-tails" interval is found by computing the 5th and 95th percentiles of the gamma posterior density using a second application of the qgamma function:

> qgamma(c(0.05, 0.95), shape=331.6, rate=270.3)
[1] 1.118115 1.339661

The posterior probability that Madison's rate parameter $\lambda$ is in the interval (1.118, 1.340) is 0.90.

Likelihood and Posterior for Madison’s use of “from”
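For reference, the same two quantile computations can be done in Python. This sketch assumes scipy is available and reuses the shape and rate from the excerpt above; note that scipy parameterizes the gamma distribution by scale = 1/rate.

    # Python counterpart of the R qgamma calls above
    from scipy import stats

    posterior = stats.gamma(a=331.6, scale=1.0 / 270.3)

    print(posterior.ppf(0.5))            # posterior median, about 1.226
    print(posterior.ppf([0.05, 0.95]))   # 90% equal-tails interval, about (1.118, 1.340)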

Page 21:

n-grams

Example sentence: "Data science is fun."

Unigrams: look at individual words. "data", "science", "is", "fun"

Bigrams: look at word pairs. "data science", "science is", "is fun"

Trigrams: look at word triplets. "data science is", "science is fun"
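A tiny sketch of how these n-grams can be pulled out of the example sentence with zip; tokenization here is just a lowercase whitespace split.

    # Unigrams, bigrams, and trigrams from the example sentence
    tokens = "data science is fun".split()

    unigrams = tokens
    bigrams = list(zip(tokens, tokens[1:]))
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))

    print(bigrams)    # [('data', 'science'), ('science', 'is'), ('is', 'fun')]
    print(trigrams)   # [('data', 'science', 'is'), ('science', 'is', 'fun')]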

Page 22:

n-grams: Randomized Hobbit

Karl Broman, Randomized Hobbit http://www.r-bloggers.com/randomized-hobbit/

into trees, and then bore to the Mountain to go through?” groaned the hobbit. “Well, are you doing, And where are you doing, And where are you?” it squeaked, as it was no answer. They were surly and angry and puzzled at finding them here in their holes

Page 23:

n-grams: Hobbit/Cat in the Hat Mixture

Karl Broman, Randomized Hobbit http://www.r-bloggers.com/randomized-hobbit/

“I am Gandalf,” said the fish. This is no way at all!already off his horse and among the goblin and the dragon, who had remained behind to guard the door. “Something is outside!” Bilbo’s heart jumped into his boat on to sharp rocks below; but there was a good game, Said our fish No! No! Those Things should not fly.

Page 24:

n-grams

    while True:
        next_word_candidates = transitions[current]    # bigrams (current, _)
        current = random.choice(next_word_candidates)  # choose one at random
        result.append(current)                         # append it to results
        if current == ".": return " ".join(result)     # if "." we're done

The sentences it produces are gibberish, but they're the kind of gibberish you might put on your website if you were trying to sound data-sciencey. For example:

If you may know which are you want to data sort the data feeds web friend someone on trending topics as the data in Hadoop is the data science requires a book demonstrates why visualizations are but we do massive correlations across many commercial disk drives in Python language and creates more tractable form making connections then use and uses it to solve a data.

—Bigram Model

We can make the sentences less gibberishy by looking at trigrams, triplets of consecutive words. (More generally, you might look at n-grams consisting of n consecutive words, but three will be plenty for us.) Now the transitions will depend on the previous two words:

    from collections import defaultdict

    trigrams = zip(document, document[1:], document[2:])
    trigram_transitions = defaultdict(list)
    starts = []

    for prev, current, next in trigrams:

        if prev == ".":              # if the previous "word" was a period
            starts.append(current)   # then this is a start word

        trigram_transitions[(prev, current)].append(next)

Notice that now we have to track the starting words separately. We can generate sentences in pretty much the same way:

    def generate_using_trigrams():
        current = random.choice(starts)   # choose a random starting word
        prev = "."                        # and precede it with a '.'
        result = [current]
        while True:
            next_word_candidates = trigram_transitions[(prev, current)]
            next_word = random.choice(next_word_candidates)

            prev, current = current, next_word
            result.append(current)

            if current == ".":
                return " ".join(result)

This produces better sentences like:


In hindsight MapReduce seems like an epidemic and if so does that give us new insights into how economies work That's not a question we could even have asked a few years there has been instrumented.

—Trigram Model

Of course, they sound better because at each step the generation process has fewer choices, and at many steps only a single choice. This means that you frequently generate sentences (or at least long phrases) that were seen verbatim in the original data. Having more data would help; it would also work better if you collected n-grams from multiple essays about data science.

Grammars

A different approach to modeling language is with grammars, rules for generating acceptable sentences. In elementary school, you probably learned about parts of speech and how to combine them. For example, if you had a really bad English teacher, you might say that a sentence necessarily consists of a noun followed by a verb. If you then have a list of nouns and verbs, you can generate sentences according to the rule.

We'll define a slightly more complicated grammar:

    grammar = {
        "_S"  : ["_NP _VP"],
        "_NP" : ["_N",
                 "_A _NP _P _A _N"],
        "_VP" : ["_V",
                 "_V _NP"],
        "_N"  : ["data science", "Python", "regression"],
        "_A"  : ["big", "linear", "logistic"],
        "_P"  : ["about", "near"],
        "_V"  : ["learns", "trains", "tests", "is"]
    }

I made up the convention that names starting with underscores refer to rules that need further expanding, and that other names are terminals that don't need further processing.

So, for example, "_S" is the "sentence" rule, which produces a "_NP" ("noun phrase") rule followed by a "_VP" ("verb phrase") rule.

The verb phrase rule can produce either the "_V" ("verb") rule, or the verb rule followed by the noun phrase rule.

Notice that the "_NP" rule contains itself in one of its productions. Grammars can be recursive, which allows even finite grammars like this to generate infinitely many different sentences.
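Here is a minimal sketch (not Grus's exact code) of how such a grammar can be expanded: starting from the "_S" token, repeatedly replace the first token that names a rule with one of its randomly chosen productions, until only terminals remain.

    # Expand the grammar defined above into a random sentence (illustrative sketch)
    import random

    def is_terminal(token):
        return token[0] != "_"

    def generate_sentence(grammar, tokens=None):
        tokens = tokens or ["_S"]
        for i, token in enumerate(tokens):
            if is_terminal(token):
                continue
            replacement = random.choice(grammar[token])           # pick a production
            tokens = tokens[:i] + replacement.split() + tokens[i+1:]
            return generate_sentence(grammar, tokens)             # keep expanding
        return " ".join(tokens)                                   # all terminals: done

    print(generate_sentence(grammar))   # e.g. "data science tests logistic regression"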


Joel Grus, Data Science from Scratch

Page 25:

Topic Modeling

[Figure 1 (Blei 2011), left to right: Topics, Documents, Topic proportions and assignments. The four example topics are word distributions such as:
  gene 0.04, dna 0.02, genetic 0.01, ...
  life 0.02, evolve 0.01, organism 0.01, ...
  brain 0.04, neuron 0.02, nerve 0.01, ...
  data 0.02, number 0.02, computer 0.01, ...]

Figure 1: The intuitions behind latent Dirichlet allocation. We assume that some number of "topics," which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. The topics and topic assignments in this figure are illustrative—they are not fit from real data. See Figure 2 for topics fit from data.

model assumes the documents arose. (The interpretation of LDA as a probabilistic model is fleshed out below in Section 2.1.)

We formally define a topic to be a distribution over a fixed vocabulary. For example, the genetics topic has words about genetics with high probability and the evolutionary biology topic has words about evolutionary biology with high probability. We assume that these topics are specified before any data has been generated.¹ Now for each document in the collection, we generate the words in a two-stage process.

1. Randomly choose a distribution over topics.

2. For each word in the document

(a) Randomly choose a topic from the distribution over topics in step #1.

(b) Randomly choose a word from the corresponding distribution over the vocabulary.

This statistical model reflects the intuition that documents exhibit multiple topics. Each document exhibits the topics with different proportion (step #1); each word in each document

¹ Technically, the model assumes that the topics are generated first, before the documents.


Blei, https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf

Page 26:

Topic Modeling

(Same Figure 1 from Blei 2011 as on the previous slide, now with the four example topics labeled:)

genetics

evolution

brain

computing

Page 27:

17,000 articles from Science, 100 topics

Blei, https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf

“Genetics” “Evolution” “Disease” “Computers”

human evolution disease computergenome evolutionary host models

dna species bacteria informationgenetic organisms diseases datagenes life resistance computers

sequence origin bacterial systemgene biology new network

molecular groups strains systemssequencing phylogenetic control model

map living infectious parallelinformation diversity malaria methods

genetics group parasite networksmapping new parasites softwareproject two united new

sequences common tuberculosis simulations

[Bar chart: inferred topic proportions (Probability, 0.0 to 0.4) for the example article, across Topics 1 through 100.]

Figure 2: Real inference with LDA. We fit a 100-topic LDA model to 17,000 articles from the journal Science. At left are the inferred topic proportions for the example article in Figure 1. At right are the top 15 most frequent words from the most frequent topics found in this article.

is drawn from one of the topics (step #2b), where the selected topic is chosen from the per-document distribution over topics (step #2a).²

In the example article, the distribution over topics would place probability on genetics, data analysis and evolutionary biology, and each word is drawn from one of those three topics. Notice that the next article in the collection might be about data analysis and neuroscience; its distribution over topics would place probability on those two topics. This is the distinguishing characteristic of latent Dirichlet allocation—all the documents in the collection share the same set of topics, but each document exhibits those topics with different proportion.

As we described in the introduction, the goal of topic modeling is to automatically discover the topics from a collection of documents. The documents themselves are observed, while the topic structure—the topics, per-document topic distributions, and the per-document per-word topic assignments—are hidden structure. The central computational problem for topic modeling is to use the observed documents to infer the hidden topic structure. This can be thought of as "reversing" the generative process—what is the hidden structure that likely generated the observed collection?

Figure 2 illustrates example inference using the same example document from Figure 1. Here, we took 17,000 articles from Science magazine and used a topic modeling algorithm to infer the hidden topic structure. (The algorithm assumed that there were 100 topics.) We

² We should explain the mysterious name, "latent Dirichlet allocation." The distribution that is used to draw the per-document topic distributions in step #1 (the cartoon histogram in Figure 1) is called a Dirichlet distribution. In the generative process for LDA, the result of the Dirichlet is used to allocate the words of the document to different topics. Why latent? Keep reading.


Page 28:

Latent Dirichlet Allocation (LDA): Generation and Estimation

http://mcburton.net/blog/joy-of-tm/

Page 29:

Dirichlet Distribution

http://blog.bogatron.net/blog/2014/02/02/visualizing-dirichlet-distributions/
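A quick sketch of what Dirichlet draws look like, using numpy (an assumption of this sketch): each draw is a vector of non-negative topic proportions summing to 1, and smaller concentration parameters give sparser, more peaked proportions.

    # Draw topic-proportion vectors from Dirichlet distributions with different alphas
    import numpy as np

    rng = np.random.RandomState(0)
    for alpha in [0.1, 1.0, 10.0]:
        draw = rng.dirichlet(alpha * np.ones(5))       # one draw over 5 "topics"
        print(alpha, np.round(draw, 3), draw.sum())    # proportions always sum to 1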

Page 30:

Latent Dirichlet Allocation (LDA): Generative Model

https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Page 31:

Latent Dirichlet Allocation (LDA): Generative Model Example

http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

• Pick 5 to be the number of words in D.
• Decide that D will be 1/2 about food and 1/2 about cute animals.
• Pick the first word to come from the food topic, which then gives you the word "broccoli".
• Pick the second word to come from the cute animals topic, which gives you "panda".
• Pick the third word to come from the cute animals topic, giving you "adorable".
• Pick the fourth word to come from the food topic, giving you "cherries".
• Pick the fifth word to come from the food topic, giving you "eating".
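A toy Python sketch of this generative story. The two topics, their word probabilities, and the Dirichlet parameters below are made up for illustration (following the food / cute-animals example above), not fit to any data.

    # Toy LDA-style generation of a 5-word document (illustrative topics and probabilities)
    import numpy as np

    topics = {
        "food":         (["broccoli", "cherries", "eating"], [0.4, 0.3, 0.3]),
        "cute animals": (["panda", "adorable", "kitten"],    [0.5, 0.3, 0.2]),
    }
    topic_names = list(topics)

    rng = np.random.RandomState(1)

    # Step 1: choose the document's distribution over topics (a Dirichlet draw)
    topic_props = rng.dirichlet([1.0, 1.0])

    document = []
    for _ in range(5):                                    # 5 words in document D
        z = rng.choice(len(topic_names), p=topic_props)   # step 2a: pick a topic assignment
        words, probs = topics[topic_names[z]]
        w = rng.choice(words, p=probs)                    # step 2b: pick a word from that topic
        document.append(str(w))

    print(topic_props, document)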

Page 32:

Latent Dirichlet Allocation (LDA): Generative Model

http://mcburton.net/blog/joy-of-tm/

Page 33:

Recommendation Systems and LDA in the NY Times

http://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine/?_r=2

Page 34:

Recommendation Systems and LDA in the NY Times

http://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine/?_r=2

Page 35:

LDA Visualization

http://cpsievert.github.io/LDAvis/reviews/reviews.html

pyLDAvis: https://pypi.python.org/pypi/pyLDAvis