A Story - MIT - Massachusetts Institute of Technology (web.mit.edu/arkhipov/www/21W.784 Final...)


Alex Arkhipov
11/28/2006
21W.784 Final Project Report
Beth Coleman

Automated Calculation of Word Correlations from Text

A Story (Significance of Word Correlation)

Imagine that someone has given me an enormous tome with text written in an unfamiliar language. I have no idea what the text is meant to be – a story, a recipe, or maybe an encyclopedia – or who wrote it, when, and why. The language has no recognizable punctuation or capitalization. I can see that the language has words made up of individual letters and can find words made of similar sequences of letters, but I can't tell whether these are different forms of a word ("compute" and "computer") or completely unrelated words ("desert" and "desert"). Moreover, I don't have any idea of how tenses, plurals, suffixes, and prefixes work in the language, so decomposing each word into parts is fruitless. So, I treat words as the basic unit, as if the language were made of hieroglyphics rather than letters, and the best plan of attack is to look for the same word appearing multiple times.

Figure 0: Repeated words in an unknown Chinese text

Looking through the text, I notice that not only do two words (call them A and B) appear reasonably often, but where word A is found, word B tends to appear nearby. Looking at where A and B appear, I tend to find only three other words, C, D, and E, that consistently appear near A and B but rarely elsewhere – all five words cluster together within the text, and it follows that they form a conceptual cluster.

As I look at the clusters of words in the text, I find that every instance of the words appearing together is marked by word B occurring, but not necessarily any of the other words. I conclude that B is the central hub of the cluster. For example, word B may be "baseball", and A, C, D, and E are "bat", "strike", "pitcher", and "run". Or perhaps the text is a story in which B is a character, with A, C, D, and E relatives of that character.

I continue through the text the same way. Some words are spread randomly and haphazardly within the text; these may be articles or prepositions that carry no inherent meaning. Other words group into clusters by proximity, some very distinct and some vague. I assume each of the clusters represents a set of closely-associated concepts. I draw each word as a point on an enormous sheet of paper and connect related words with a line, so that concept clusters form a tight net of lines with a visible central hub. Doing so for more and more words, I find many concept clusters. Some clusters overlap, while other clusters bridge pairs of distant concepts. Moreover, the concept clusters are themselves grouped into clusters, due to higher-than-average connections between the words within them, and these clusters form even higher-level clusters, and so on.


The nesting is rich and detailed, and using the large amount of text at my disposal, I am able to draw a complex and intricate web that shows all these relationships with great accuracy and detail. Working for many years, I complete Network I. Network I is a web so large that I have to zoom out until the points that indicate individual words are too small to be seen; tightly-bound concept clusters look like pinpricks from this view, and clusters of concepts look like tiny globs.

The network gives me a higher-level understanding of the language, even though I don't know what any word or concept means. Then, working for many more years, I use my prior knowledge to draw a similar association network of the entire English language, called Network II. Looking at the two networks, I realize that they look alike and have a similar "texture" – the number of links between clusters, the tightness of the average cluster, and so on, are about the same. Perhaps both networks are one and the same – representations of the human conceptual network, something that is independent of the language in which it is expressed.

Next, I study Network I and Network II in detail and realize that they have similar structures. The shapes of the highest-level central hubs match up, and I can see which clusters map onto each other. Actual words, however, fail to line up in the networks, but this does not matter, because individual words are barely visible. I now take the original text and give each word a vague translation based on which cluster its English counterpart is in. Although I don't have an exact idea of the words or sentences, I can see what the text is about and the path through which it visits various ideas. Or perhaps I have stumbled on a parallel text, produced by my mismatching Networks I and II, that is plausible but not what is actually written?

Background Philosophy

While the story above may seem far-fetched (and it is), I believe that it is an effective allegory for what a language is. The ideas are largely influenced by the philosophy on minds and networks presented in the book Gödel, Escher, Bach. The story also parallels the Chinese room argument, in which a man with no knowledge of Chinese can converse in Chinese by following precise coded instructions. This allegorical thought experiment is meant to illustrate that a machine could pass the Turing Test without actually understanding language.

The actual counterpart to my role in the story is an algorithm to process language. For such a program, all language is foreign and devoid of context and meaning, and the only clues are where words appear in relation to each other. I am interested in what information can be algorithmically extracted from a large corpus of text with no human intervention or prior knowledge.

Mining linguistic information and searching for patterns in a large, crystallized body of text is the core of corpus linguistics, a branch of computational linguistics. Such an approach may take a collection of text written in the early nineteenth century and check for trends in the use of dependent clauses (to give a hypothetical example). This requires human intervention in tagging words with parts of speech, and pre-programmed heuristics to distinguish dependent and independent clauses. Although computational linguistics was once seen as promising in codifying language through machine-understandable rules, it fell into disfavor as people realized that rules and algorithms can only go so far in understanding language.

When I started the project, I was perhaps also naïve in hoping to automate the creation of a semantic network for a set of words. I planned to use only correlation data, which can be mined from text in a manner similar to that described in the story. The main problem is that correlation only gives vague relation data (A is associated with B), while semantic data is more precise about the relationship (A is part of B, A is a type of B, A is the same as B, A is the opposite of B, and their converses). For example, WordNet, a dictionary that organizes words as a semantic network, primarily uses the relationships A is part of B (called meronymy) and A is a type of B (called hyponymy). However, WordNet was created using human understanding of relationships between words, rather than computer analysis.

The main difficulty is that there is no apparent way to see how two words that appear together are related, unless one searches for linking phrases such as “is a” or “is like a” or “does not” and uses prior knowledge of these phrases to judge the relationship.

Project Outline

Corpus linguistics typically uses human intervention in preprocessing text by identifying root words or annotating which part of speech a word is. However, I wanted to see what I could do using only computer algorithms and no prior knowledge. Only then could my methods have broad applicability, such as to the hypothetical language in the story, in which neither the meanings nor the parts of speech of the words are known. Because I realized that semantic relationships cannot be directly extracted algorithmically, I limited the scope of my project to mining correlation data rather than semantic data.

My main premise, apparent in the story, is that words appear together if and only if there is some association between them, even if the semantic connection cannot be determined. Given any two words, an algorithm can estimate their correlation, a numerical measure of the degree of association between the words. To do so, I compare how often both words appear together to how often they would appear together just due to random chance, since even unrelated words occasionally appear in close proximity. Then, I isolate clusters of words in a set by finding the correlation of each pair of words and looking for subsets of words that correlate highly with each other. I am limited by computing power to detecting small subsets that have a clear relationship, such as distinguishing a set of fruits from a set of tools. However, I believe that my methods can be extended to searching for clusters of words linked by more subtle, conceptual relationships from a pool of randomly chosen words, as opposed to a set pre-made for the experiment.

Search Engine Approach

I originally used Google, the most comprehensive and popular Internet search engine. When one searches for a keyword on Google, in addition to returning relevant web pages, Google lists the number of pages on the Internet that contain this keyword, or the hit count. It is this number that I concentrate on in my calculations. Google allows searches using Boolean operators; searching for two keywords connected by "AND", for example, gives the set of pages containing both keywords, as well as the number of such web pages. Because Google indexes pages that contain human-readable information, I expect the pages indexed on Google to be representative of the English language (I only searched English-language pages), so words that denote related concepts tend to appear together on web pages.

The main advantage of using Google to mine the web corpus is the sheer size of the corpus, which let me gather large amounts of varied text data and drastically lower the randomness inherent in using text as a sample of the English language. Other advantages are Google's pre-built web page database, massively parallel computing grid, and streamlined search algorithms.

The main drawback was that Google (and search engines in general) gives only limited and skewed data. I will discuss these difficulties in more detail in the results section.

Search Engine Algorithm

(Note: Capital letters represent words. Also, when giving Google search queries, I enclose them in quotes, which are not part of the query.) For any two terms, X and Y, I search for each of X and Y, take the number of results, and divide by the total number of pages indexed by Google, a value that can be obtained using a search query that returns every page, such as "Z OR NOT Z" for any term Z (Figure 1). This division gives p(X) and p(Y), the fraction of web pages containing each of the terms, respectively, or equivalently the probability that a randomly chosen English-language page contains each term. If the two probabilities are independent, then the probability of a given web page containing both terms equals the product p(X) * p(Y). This is the expected fraction of pages on which both words appear.

Search Query           Index Size
pickle OR -pickle      2.527E+10
the OR -the            2.527E+10
aardvark OR -aardvark  2.527E+10
inurl:http             2.527E+10

Figure 1: Determining the index size (done prior to index size change). The minus signs represent “NOT” operators.

Then, I compute the actual fraction of pages containing both X and Y by dividing the number of hits for "X AND Y" by the size of Google's index. The actual fraction may differ from the expected fraction because the two probabilities are not independent. If the words X and Y are correlated, then the appearance of X makes the appearance of Y more likely, so p(X) and p(Y) are not independent, but positively related. Then, p(X AND Y) is greater than p(X) * p(Y). So, the ratio p(X AND Y) / (p(X) * p(Y)) quantifies the correlation of words X and Y. Note that the correlation between X and Y is (or at least should be) scale-free, meaning that it is not biased towards either frequently-occurring or rarely-occurring words.
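As a concrete check, the hit counts from Figure 2 can be plugged into this ratio. The following is a minimal Python sketch of the calculation (the report's actual numbers came straight from Google searches; this only reproduces the arithmetic):

```python
def correlation(hits_x, hits_y, hits_xy, index_size):
    """Ratio of the observed co-occurrence probability to the
    probability expected if the two words were independent."""
    p_x = hits_x / index_size      # fraction of pages containing X
    p_y = hits_y / index_size      # fraction of pages containing Y
    p_xy = hits_xy / index_size    # fraction of pages containing both
    return p_xy / (p_x * p_y)      # 1.0 means no association

# Hit counts for "pickle" and "cucumber" from Figure 2,
# with the index size of 2.527e10 from Figure 1:
print(correlation(1.060e7, 1.050e7, 4.720e5, 2.527e10))  # ≈ 107.17
```

A value near one means the pair co-occurs only as often as chance predicts; values well above one indicate association.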

Search Engine Results

A few basic tests of the correlation property gave promising results (Figure 2).

Words                  Hits       Probability (= Hits / Index Size)  Correlation
pickle                 1.060E+07  4.195E-04
cucumber               1.050E+07  4.155E-04
cucumber AND pickle    4.720E+05  1.868E-05                          107.165
katamari               3.240E+06  1.282E-04
damacy                 1.220E+06  4.828E-05
katamari AND damacy    1.200E+06  4.749E-05                          7671.524
aardvark               7.640E+06  3.023E-04
raincoat               2.580E+06  1.021E-04
aardvark AND raincoat  8.210E+02  3.249E-08                          1.053
rock                   5.780E+07  2.287E-03
align                  8.510E+07  3.368E-03
rock AND align         5.090E+06  2.014E-04                          26.150
capacitor              1.410E+07  5.580E-04
granary                1.680E+06  6.648E-05
capacitor AND granary  7.470E+02  2.956E-08                          0.797
apple                  5.640E+08  2.232E-02
banana                 6.010E+07  2.378E-03
apple AND banana       6.530E+06  2.584E-04                          4.868

Figure 2: Selected pairs of words with computed correlation

The clearly-related words "cucumber" and "pickle" gave a correlation of 107, which means that the words appeared together about 100 times more often than expected by chance. In contrast, the random terms "aardvark" and "raincoat" have a correlation of 1.05, or nearly one, meaning that they appear together only with chance probability. The terms "katamari" and "damacy", Japanese words that form the name of a popular video game, give an enormous correlation of 7671, which is expected since the words rarely appear outside the game name on English-language pages. In contrast, the highly dissimilar terms "capacitor" and "granary" have a correlation of 0.797, meaning that the appearance of one word actually makes the other about 20% less likely to appear. Surprisingly, the correlation between the words "apple" and "computer" is significantly greater than that between "apple" and "banana". However, further testing found unreasonably high correlations between unrelated words: 14.31 between "chartreuse" and "rudder" and 26.150 between "rock" and "align".

[Chart: Tools vs Fruits Correlation Table (Original Formula) – pairwise log(correlation) values for cherry, apple, banana, kiwi, screwdriver, wrench, drill, hammer]

Figure 3: The tools show a high degree of correlation within themselves, while the fruits show less of a trend. (The diagonal of this chart was removed because correlations of items with themselves are not relevant).

I tested whether Google could be used to locate conceptual clusters using correlation data. I computed the pairwise correlations between the eight words {cherry, apple, banana, kiwi, screwdriver, wrench, drill, hammer} to see whether I could separate the fruits from the tools. The resulting graph (with the diagonal removed) is shown in Figure 3. A slightly different graph (Figure 4) can be obtained by using another scaled correlation formula that has the desirable property that the correlation of any word with itself is one. This model also has the advantage of not depending on the size of the Google index.
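The report does not write out the modified formula. Purely to illustrate the two stated properties – self-correlation of one, and no dependence on the index size – here is one hypothetical scaling (an assumption of this sketch, not necessarily the formula actually used):

```python
def scaled_correlation(hits_x, hits_y, hits_xy):
    # Hypothetical scaling, chosen only because it has the two stated
    # properties: dividing by the smaller single-word count means a
    # word paired with itself scores hits_x / hits_x = 1, and the
    # Google index size never enters the calculation.
    return hits_xy / min(hits_x, hits_y)

scaled_correlation(1.060e7, 1.060e7, 1.060e7)  # a word with itself -> 1.0
```

Since the co-occurrence count can never exceed either single-word count, any formula of this shape stays at or below one, which is at least consistent with every log(correlation) value in Figure 4 being negative.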

[Chart: Tools vs Fruits Correlation Table (Modified Formula) – pairwise log(correlation) values for the same eight words]

Figure 4: In this improved model, both the fruits and tools have higher (less negative) correlations within their groups. On this scale, perfect correlation corresponds to 0 on the graph, and higher correlations are less negative.

The removed diagonal consisted of all zeroes, since any word correlates perfectly with itself.

The graphs, especially the second one, show that the regions where fruits match up with fruits and tools with tools have higher values than the regions where fruits match up with tools. This suggests that someone who didn't know English and was given the eight words randomly shuffled could divide them into the two categories using only Google hit data. Notably, the person could do this with no knowledge of what the words mean or what the categories are.

The eight words in the preceding example are divided into two clusters in a very concrete way, as members of either the fruits category or the tools category. I wanted to see if I could locate a cluster of words related in more subtle, conceptual ways. To do this, I repeated my experiment using the words {love, valentine, chocolate, heart, cube, breakfast, boredom, golf}, the first four of which share a close semantic connection. The hit counts (Figure 5) and correlation values (Figures 6 and 7) show no visible increase in the correlation between pairs of love-related words as compared to that of other pairs, whether arbitrary with arbitrary or love-related with arbitrary. Many correlation results are highly unexpected; for example, "valentine" correlates more highly with "boredom" than with "love" in both formulas.


[Chart: Correlation Table (Pure Hit Count) – pairwise log(hits) values for love, valentine, chocolate, heart, cube, breakfast, boredom, golf]

Figure 5: The hit count for "A AND B" is not significantly different when both A and B are love-related words {heart, chocolate, valentine, love} than when they are arbitrary words {golf, boredom, breakfast, cube}

[Chart: Correlation Table (Original Formula) – pairwise log(correlation) values for the same eight words]

Figure 6: The correlation between two love-related words isn’t significantly higher than that with arbitrary words


[Chart: Correlation Table (Modified Formula) – pairwise log(correlation) values for the same eight words]

Figure 7: The modified correlation formula also cannot distinguish the love-related cluster from the arbitrary words

Figure 8: Query Google, an applet programmed by Tim Ma, Dept. of Computer Science, UCLA, greatly helped me get hit counts for large numbers of queries

Search Engine Difficulties

I noticed that sometimes, when I tried to replicate calculations I had previously done,

I obtained different results. The number of hits given by Google for both simple and Boolean searches changed, usually only slightly, but for some words, by as much as a factor of ten. Moreover, I was getting different numerical results using an automated program to query Google, even though this program had previously given the exact same data as searching Google by hand. Furthermore, reassessing the size of the Google index with searches of the type "Z OR NOT Z", I found that the index size had actually decreased from about 25 billion to 9 billion. This change would affect my correlation calculations.

I suspect that the change in results was caused by either Google’s near-monthly re-indexing of its database or by a change in the formula Google uses to estimate the number of hits, or both. Google’s re-indexing serves to update Google’s database by crawling the Internet and replacing Google’s stored version of each page with the up-to-date version. However, this change typically causes Google’s index size to increase, as the Internet increases in size.

Other inconsistencies and inaccuracies in Google’s hit count led me to believe that the hit count is an estimate rather than an exact count. The hit count varies slightly when I search at different times; in one case, two identical searches conducted within one minute gave significantly different hit counts (Figure 9). This suggests that the hit count is extrapolated from a random sample of pages, thus causing slight variation, although this cannot account for the more extreme inconsistencies.

Figure 9: Two identical searches, taken less than a minute apart, have hit counts that vary by a factor of 15

The hit count inaccuracy can also be seen when conducting a search that returns few enough results so that I can see all the web pages returned. When I scroll through all the results, I find that there are fewer results than the hit count claimed, even when I opt to display the very similar pages omitted by Google.


Figure 10: Although the hit count claims that 933,000 pages contain the string "hgt", only 836 such pages can be seen. In this case, opting "to repeat the search with omitted results included" gives me 1,000 results, the maximum that can be displayed. In other cases, repeating the search with omitted results included still gave me many fewer pages than the hit count claimed

Other results contradicted the laws of probability for Boolean clauses. Restricting the search from "X" to "X AND Y" sometimes returns more hits, whereas broadening the search from "X" to "X OR Y" occasionally returns fewer hits (Figure 11).

Figure 11: Apparently, although 390 million pages contain “love”, 480 million contain “love”, but not “cubism”!

Also, the searches “X Y”, “X AND Y”, “Y X”, and “Y AND X” can give hit counts that vary slightly (Figure 12), even though order shouldn’t matter and Google claims that the lack of an operator between two words is identical to AND. Similar inconsistencies can be found when using the OR operator. Interestingly, the variability caused by changing term order decreased drastically following the re-indexing.

Query                Hits
pickle               1.060E+07
cucumber             1.050E+07
cucumber AND pickle  4.720E+05
pickle AND cucumber  4.810E+05
cucumber pickle      4.820E+05
pickle cucumber      9.310E+05
pickle OR cucumber   1.850E+07
cucumber OR pickle   1.880E+07

Figure 12: Supposedly identical queries give slightly varying hit counts

Searching online, I found these problems and similar ones being discussed on Language

Log, a linguistics blog (http://itre.cis.upenn.edu/~myl/languagelog/). Apparently, linguists have previously run into similar problems when trying to use Google to mine data from the web corpus. An explanation for some of the strange results can be found in a blog post containing the response of a Google employee on the issue (/archives/001837.html). The post explained that the hit count is an unreliable, poorly updated estimate, made by extrapolating from the 1000 results that the viewer actually sees. These 1000 results are not a representative sample of the entire Internet, since they are influenced by PageRank and other factors in addition to the relevance of the page to the word searched. Also, a related post contended that the varying results may be due to the fact that re-indexing does not occur at the same time in all of Google's data centers. However, from the discussion on the blog, it seems that nobody has a satisfying explanation of how Google actually estimates hit counts for Boolean searches.

The discrepancies of Google search gave me inaccurate results that did not reflect the actual correlation of the words on the Internet. However, even if Google could give me exact hit counts, I would still have limited access to the information inherent in the web corpus. In computing correlations, I could only tell that a page contained both of the searched-for words, with no knowledge of how often each word appears or how close together they appear. For example, for the more unlikely combinations of words, many of the hits were blogs in which each word appeared in an unrelated post (in addition to word lists and gibberish spam), a case in which co-appearance on a page does not give evidence of association. For better accuracy in gauging how related two words are in a given piece of text, I would need to examine the text directly, rather than being limited to Google search as an interface.

Program Approach

Seeing the limitations of the Google approach, I decided to write my own program (using C++) to compute the correlation of two words based on text of my choice. With this method, I could have full control over the data I was analyzing – for example, I could factor in where two terms appear relative to each other. The cost of such flexibility was that my program had to be built from scratch and run on my laptop, so it could not compete with Google's streamlined efficiency and enormous computational power. Time and memory constraints also limited the size of my corpus, resulting in greater random variation.

The size limits on the corpus precluded the use of data representative of the entire English language. I chose to use literary texts (Pride and Prejudice, A Tale of Two Cities, and War and Peace) since they are relatively long and readily available. I wound up mainly using War and Peace, the longest text, since length was the major constraint. Actually, due to memory limitations, I only used the first 80% of the book as my text. I preprocessed each text to remove punctuation and change all the words to lowercase. Then, I wrote a program to segment the text into individual words and store them in an array that preserves the order in which they appear.
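The preprocessing step can be sketched as follows. This is a minimal Python illustration (the author's actual program was written in C++, and its exact handling of apostrophes and hyphens is not specified in the report):

```python
import re

def tokenize(text):
    """Lowercase the text, strip punctuation, and return the words in
    their order of appearance. One simple choice, assumed here: a word
    is a maximal run of letters, so apostrophes and hyphens split words."""
    return re.findall(r"[a-z]+", text.lower())

tokenize("It is a truth universally acknowledged...")
# -> ['it', 'is', 'a', 'truth', 'universally', 'acknowledged']
```

The resulting list plays the role of the report's ordered array of words; each word's position in the list is its position in the text.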

The goal of my algorithm is to compute the correlation between two arbitrary words based on their distribution within the text. The primary assumption is that words that share an association will tend to appear more often in the same place within the text, and that the closer the words appear, the more association there is between them.

Linguistic analysis in this vein has typically concentrated on checking how frequently each word appears immediately before or after the other word. I believe that while this method localizes a word within its immediate context, it fails to consider larger-scale patterns. It would miss relationships more subtle than adjacency, such as two related words tending to appear in the same sentences or the same paragraphs. Nevertheless, it is still expected that one word appearing right after another gives a stronger indication of a connection than the words merely appearing in the same paragraph.

Program Algorithm

At the heart of the algorithm is the idea of a weight function, which quantifies how much evidence of a connection is presented by the appearance of two words a given distance apart in the text (i.e., how many words are between them). The weight function is necessarily decreasing: the smaller the distance, the larger the weight. My original function, which I used as a basis in considering multiple variants, had the weight vary inversely with distance. So, adjacent words give a weight of one, two words one word apart give a weight of one half, and so on. This way, two words that are at opposite ends of the book contribute a weight that is almost negligible, and instances of the desired words in close proximity dominate the weight function.
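The inverse-distance weight described above can be written directly (a sketch; the function name is my own):

```python
def weight(distance):
    """Inverse-distance weight: adjacent words (distance 1) score 1,
    words one word apart (distance 2) score 1/2, and so on."""
    return 1.0 / distance
```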

The weight function is used to compute a correlation score, a measure of how closely together the words appear. To do this, the program cycles through every pairing of an occurrence of one word with an occurrence of the other. Note that if each word appears a thousand times, the program will cycle through the million possible pairs, which can cause considerable slowdown. For each pairing, the weight is computed as a function of the distance between the two occurrences. Taking the average weight across all possible pairings gives the correlation score. By the nature of the calculation, the more similar the distributions of the two words, the higher the correlation score.
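A minimal sketch of this all-pairs loop (function and parameter names are my own; since two different words never occupy the same position, the distance is always at least one):

```python
def correlation_score(positions_a, positions_b, weight):
    """Average weight over every pairing of an occurrence of word A
    with an occurrence of word B. Positions are word indices in the
    text, so the inner loop runs len(a) * len(b) times."""
    total = sum(weight(abs(i - j))
                for i in positions_a for j in positions_b)
    return total / (len(positions_a) * len(positions_b))
```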

The final step is to normalize the correlation score so as to cancel the confounding factor of the length of the text corpus, so that correlations computed from different texts have a comparable basis. To do this, the program simulates pairs of words distributed randomly within the text, and calculates the average weight, or expected correlation score, for a random text. To ensure consistency, I set the number of trials to be on the order of a million.
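The simulation might be sketched as follows, under the assumption that each trial places one random pair of positions; the report does not give the exact simulation details.

```python
import random

def expected_score(text_length, weight, trials=1_000_000):
    """Monte Carlo estimate of the correlation score for two words
    placed uniformly at random in a text of the given length."""
    total = 0.0
    for _ in range(trials):
        i, j = random.randrange(text_length), random.randrange(text_length)
        if i != j:  # skip coinciding positions, which have no distance
            total += weight(abs(i - j))
    return total / trials
```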

Dividing the correlation score by the expected correlation score gives the correlation. Correlation is thus a ratio: when it is one, the computed correlation score equals the expected one. So, a correlation of one means no association between the words in question, since the words cluster around each other only as much as is expected by chance. Note that this does not mean each word is distributed randomly; each word, taken alone, may tend to appear in groups. Rather, the ways the two words cluster into groups are independent of each other.

Similarly, if the correlation is greater than one, then the two words tend to appear closer together than is expected by chance, and if the correlation is less than one, the two words appear farther away from each other than expected by chance.


Figure 13: A run of my program

Results

The idea of correlation is very subjective, so there is no exact standard against which to compare the results of my program. Instead, I check the results the program gives against what is expected based on the intuitive notion of correlation.

For my experiment, I used the longest text, War and Peace, with relatively common words, in order to overcome the random statistical error that comes from my use of a single book, since the placement of words in the book depends, to some extent, on chance. These errors would disappear as the corpus size increased, leaving only possible error from any lack of representativeness of my corpus.

I calculated the correlation between words in an arbitrarily chosen set (Figure 14). Most correlations are slightly above one (chance correlation), with a few slightly below. I could find no obvious pattern in the data. A few definite peaks are visible, indicating a strong correlation between certain pairs of words, although the significance of this is not clear: are “love” and “life” the most connected words in War and Peace, or in reality?


[Figure 14 chart: “Correlations in War and Peace” – bar graph of pairwise correlations among the words “the”, “she”, “did”, “man”, “life”, “love”, “never”, and “war”; vertical axis: Correlation, 0.5 to 2.5]

Figure 14: Black-topped bars are correlations of less than one, words that correlate worse than random chance.

[Figure 15 chart: “Correlations in War and Peace (asymmetric)” – bar graph of ordered pairwise correlations among the same eight words; vertical axis: Correlation, 0.5 to 2.5]

Figure 15: Unlike Figure 14, this graph is not symmetric about the main diagonal

Looking at the data, I wondered whether words that can reasonably follow each other in either order had higher calculated correlations due to the sensitivity of the formula to consecutive words. To investigate this, I modified my program to only consider pairs in which the first word comes before the second word. This means that if word B tends to follow word A, the correlation between A and B will be high, but the correlation between B and A will not be as high. The results are shown in Figure 15. Again, the pattern is not clear. The words “she” and “did” have a higher correlation than “did” and “she”, perhaps because the phrase “she did” is more common than “did she”, and “the” and “man” beats “man” and “the” for the same reason. However, adjacent appearance cannot explain the other results.

I wondered whether adjacent words had too much weight in the correlation, as opposed to more subtle relationships in which the two words often appear in proximity but rarely directly next to each other. Although the weight function gives a lower score to distant words than to nearby ones, it should be such that many reasonably closely spaced matches carry as much weight as one instance of both words in very close proximity. To this end, I tried tweaking my weight function either to emphasize nearby appearances (sharp) or longer-range instances (dull). The effects of these changes on the correlations of three word pairs are shown in Figure 16.
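The report does not give the exact family of sharpened and dulled weight functions; one plausible parameterization (my own assumption) raises the distance to a sharpness exponent:

```python
def weight_with_sharpness(distance, sharpness):
    """Hypothetical sharpness family: 1 / distance**sharpness.
    Large sharpness rewards only adjacent words; sharpness near
    zero weighs all distances almost equally (dull)."""
    return 1.0 / distance ** sharpness
```

At sharpness 1 this recovers the original inverse-distance weight.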

[Figure 16 chart: “Correlation with Varying Sharpness” – correlation (0 to 6) versus sharpness (0 to 7) for the word pairs “love”/“marry”, “to”/“be”, and “man”/“dance”]

Figure 16: Farther right on the scale is sharper (more emphasis on very close pairings). A correlation of one is neutral.

The sharpness profiles for these word pairings are very revealing, much more so than the single data points presented in Figures 14 and 15. The words “to” and “be” derive much of their correlation from the phrase “to be”, so the correlation grows astronomically with increasing sharpness. However, when the sharpness is decreased to nearly zero to look for longer-range patterns, the correlation recedes to one. This is because the words “to” and “be” are distributed almost randomly within the text, and also because they have no inherent meaning and are correlated by rules of syntax rather than a conceptual link.

The words “man” and “dance”, on the other hand, are conceptually related, and thus have higher correlations at lower sharpnesses. Interestingly, the words are correlated only within an interval of sharpness around one; outside that interval, their correlation is less than one, making them discorrelated. The words appear in close proximity, and in very large-scale groupings, less often than expected by chance, yet at just the right scope there is a correlation.

Finally, “love” and “marry” have their maximum correlation at a sharpness around 2, and have significant correlation at a wide range of sharpnesses. The words “love” and “marry” are often found both very close to each other, and also in looser, larger groups.

The sharpness profiles give useful and relevant information in understanding the nature of the correlation between two words. They emphasize that a correlation is not a simple number, but rather varies with the scale at which we look for correlation.

Lastly, I bring up an experiment using my algorithm that is interesting, although it does not directly relate to my project’s objectives. Self-correlation is the correlation of a word with itself; it measures the extent to which a word appears in clusters. It is expected that words with actual meaning will cluster, whereas articles, prepositions, and other words that serve a purely grammatical purpose will be distributed evenly and randomly within the text. Words with meaning are usually specific adjectives, nouns, and verbs (as opposed to more generic words such as “way” and “not”), and they appear less frequently than grammatical filler. So, we would expect self-correlation to decrease with increasing frequency, a trend confirmed in Figure 17, derived from the set of words (from lowest to highest frequency) {peace, road, mother, word, make, war, never, love, life, man, which, did, she, a, and, the}. A future experiment could look at how the sharpness profiles of words’ self-correlations vary with their frequency.
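Self-correlation can be sketched with the same machinery, skipping the pairing of an occurrence with itself; the report's exact convention for this case is assumed here.

```python
def self_correlation_score(positions, weight):
    """Average weight over all ordered pairs of distinct occurrences
    of a single word; a high score means the word appears in clusters."""
    pairs = [(i, j) for i in positions for j in positions if i != j]
    return sum(weight(abs(i - j)) for i, j in pairs) / len(pairs)
```

As with ordinary correlation, this raw score would then be divided by the expected score for a randomly placed word.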

[Figure 17 chart: “Frequency vs Self-Correlation” – self-correlation (0 to 13) versus log frequency (-4 to -1) for War and Peace and Pride and Prejudice]

Figure 17: Words with higher frequencies display less clustering in both texts

Potential Future Improvements


The algorithm would be more accurate if it could recognize different forms of each of the words, such as plurals, tenses, and suffixed and prefixed forms. For example, in computing the correlation between the words “love” and “marry”, a more sophisticated program could consider instances of “loves”, “loved”, “lovely”, and “married”, “marriage”, and “remarry” in addition to exact appearances of the root forms “love” and “marry”. Of course, checking root forms could be done with a pre-made dictionary, but this would defeat the point of running an algorithm with no human help. In theory, prefixed and suffixed forms could be recognized automatically by checking each word against the root form and a list of common affixes. However, due to the irregularity of English, this would be a difficult task, and would introduce another layer of inaccuracy.
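A naive version of the affix check might look like the following sketch. The affix lists and helper name are hypothetical, and the example illustrates exactly the inaccuracy the paragraph warns about: it accepts “loves”, “lovely”, and “remarry”, but misses “loved” and “married”, which drop or change letters of the root.

```python
# Hypothetical affix lists; a serious attempt would need many more.
PREFIXES = ["re", "un"]
SUFFIXES = ["s", "ed", "ing", "ly", "age"]

def shares_root(word, root):
    """Naive check of whether `word` is `root` with a common prefix
    and/or suffix attached, with no spelling changes allowed."""
    for prefix in [""] + PREFIXES:
        for suffix in [""] + SUFFIXES:
            if word == prefix + root + suffix:
                return True
    return False
```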

Another potential improvement would be to fine-tune the computation of the distance between instances of the two desired words. Rather than only count how many words are in between, the program could also check how many sentences and paragraphs apart the words are. The most elegant approach, I believe, is to treat punctuation marks such as commas, periods, and paragraph breaks as pseudo-words that contribute to the distance between two words. For example, if one were to judge that a comma provides as much separation as one intervening word, then two words with a comma between them would be considered two words apart rather than one. I believe I would be able to implement this improvement with some difficulty, and that it would slightly improve the accuracy of my results.
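The pseudo-word idea could be implemented by keeping punctuation during tokenization; this is a sketch of the idea, not the report's actual code, and it handles only commas and periods.

```python
import re

def tokenize_with_punctuation(raw_text):
    """Keep commas and periods as pseudo-words so that each adds one
    unit to the distance between the real words around it."""
    return re.findall(r"[a-z]+|[,.]", raw_text.lower())

# "hello, world" becomes ["hello", ",", "world"], so the two words
# are now two positions apart instead of one.
```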

While both the search engine method and the program had their relative merits, a combination of the two would be ideal. This could be achieved by combining a tool that can search for every single web page on which the words appear, such as a search engine that does not use hit estimates, with an algorithm to compute the correlation between two words within the web page, such as my program. Weighting each instance of a page with both terms by the correlation on that page could greatly improve the accuracy of results obtained with the Google method.

Also, weighting web pages by their popularity or Google PageRank would be beneficial, since the set of web pages actually seen by users is probably a more representative sample of English than the entire Internet, which includes lists of words and machine-readable text. However, this may have the unintended side effect of biasing the corpus toward text written to entertain rather than to inform. Perhaps the English-language Wikipedia, an online encyclopedia, could serve as a corpus, but (being an encyclopedia) it would be biased toward covering obscure topics. Deciding what type of corpus would be representative of English is a difficult challenge.

On a broader level, the story presented at the beginning suggests a way to find the broader significance of the correlation data I collected in this project. The first step, drawing a correlation network, requires a significantly greater amount of data to be meaningful. Once data collection is resolved, the main issue is algorithmically identifying clusters: tightly knit groups of words that correlate strongly with each other. Data clustering is a burgeoning branch of computer science and statistics that uses advanced and sophisticated algorithms, so I cannot speak in any detail to how this could be accomplished. I believe that with the proper analytical tools, vast quantitative information about how words relate to each other can be automatically mined from text corpuses using correlation-based approaches.