
The Small World of Human Language

Ramon Ferrer i Cancho

& Richard V. Solé

presented by Emre Erdem

Introduction

Zipf’s Law (Zipf 1972)

Zipf’s Law: the frequency of a word decays as a power function of its rank.

In spite of its relevance and universality, such a law can be obtained by various mechanisms and does not provide deep insight into the organization of the language.

A complete theory of language requires a theoretical understanding of its implicit statistical regularities. Zipf’s Law is the best known of them.
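For reference, Zipf’s Law is usually stated as a rank-frequency power law (standard formulation with an illustrative worked example; the numbers below are not from the paper):

```latex
% Zipf's law: the frequency f of the word of rank r decays as a power law
f(r) \propto r^{-\alpha}, \qquad \alpha \approx 1
% Worked example (illustrative numbers): if the rank-1 word occurs
% 60000 times, the rank-2 word is expected near 60000/2 = 30000
% occurrences, and the rank-10 word near 60000/10 = 6000.
```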

Introduction

Lexicons

lexicon: 1. a dictionary; 2. the list of vocabulary belonging to a specific field

kernel lexicon: a common lexicon for successful basic communication

Human brains store lexicons that are usually formed by thousands of words.

Introduction

Co-occurrence of words in sentences relies on the network structure of the lexicon.

Human language can be described in terms of a graph of word interactions. This graph has some unexpected properties that might underlie its diversity and flexibility, and create new questions about its origins and organization

Graph Properties of Human Language

Words co-occur in sentences because of syntactic relationships and because of stereotyped expressions or collocations (New York, take it easy).

Graph Properties of Human Language

Links

Links: significant co-occurrences between words in the same sentence.

The most correlated words in a sentence are the closest, so a decision must be made about the maximum distance considered for forming links.

If the distance is too long, the risk of capturing spurious co-occurrences increases.

If the distance is too short, certain strong co-occurrences are systematically not taken into account.

Graph Properties of Human Language

Links

A toy network constructed with four sentences: John is tall. John drinks water. Mary is blonde. Mary drinks wine.

The graph is constructed by linking words at a distance of one or two in the same sentence, as in the sketch below.
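A minimal Python sketch of this construction, assuming simple whitespace tokenization (illustrative code, not the authors' implementation):

```python
from itertools import combinations

# The four example sentences from the slide.
sentences = [
    "John is tall",
    "John drinks water",
    "Mary is blonde",
    "Mary drinks wine",
]

# Link words at a distance of one or two within the same sentence.
edges = set()
for sentence in sentences:
    words = sentence.lower().split()
    for i, j in combinations(range(len(words)), 2):
        if j - i <= 2:  # maximum co-occurrence distance of two
            edges.add(frozenset((words[i], words[j])))

for edge in sorted(tuple(sorted(e)) for e in edges):
    print(edge)  # e.g. ('drinks', 'john'), ('is', 'tall'), ...
```

Shared words such as "is" and "drinks" become hubs that connect the John part and the Mary part of the toy graph.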

Graph Properties of Human Language

Links

The maximum distance is decided according to the minimum distance at which most of the co-occurrences are likely to happen.

Many co-occurrences take place at a distance of one: red flowers (adjective-noun), stay here (verb-adverb), can see (modal-verb), getting dark (verb-adjective), the/this house (article/determiner-noun).

Many co-occurrences take place at a distance of two: hit the ball (verb-object), Mary usually cries (subject-verb), table of wood (noun-noun through a prepositional phrase), live in Boston (verb-noun).

Graph Properties of Human Language

Links

The search is therefore stopped at a distance of two.

There is no automatic technique for capturing the exact syntactic relationships. The method fails to capture the exact relationships, but it does capture almost every possible type of link. We are not interested in all the relationships; our goal is to capture as many links as possible through an automatic procedure. A long-distance syntactic link implies the existence of lower-distance syntactic links; by contrast, a short-distance link does not imply a long-distance link.

Graph Properties of Human Language

Improving the technique

Choose only pairs of consecutive words whose mutual co-occurrence is larger than expected by chance:

p_ij > p_i p_j

p_ij: the observed probability that w_i and w_j co-occur (presence of correlations in the real case)

p_i p_j: the co-occurrence probability expected from a random ordering of the words (theoretical probability of co-occurrence)

Using this condition when building the graph yields the restricted graph.
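A hedged sketch of this filter, assuming p_ij is estimated from consecutive-pair counts and p_i, p_j from unigram counts (one plausible estimator, not the authors' exact procedure):

```python
from collections import Counter

def restricted_edges(sentences):
    """Keep only consecutive word pairs whose observed co-occurrence
    probability exceeds the value expected from random ordering."""
    unigrams, bigrams = Counter(), Counter()
    total_words = 0
    for sentence in sentences:
        words = sentence.lower().split()
        total_words += len(words)
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))  # consecutive pairs only
    total_bigrams = sum(bigrams.values())
    edges = set()
    for (a, b), n_ab in bigrams.items():
        p_ab = n_ab / total_bigrams        # observed co-occurrence probability
        p_a = unigrams[a] / total_words    # unigram probabilities
        p_b = unigrams[b] / total_words
        if p_ab > p_a * p_b:               # keep correlated pairs only
            edges.add((a, b))
    return edges
```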

Graph Properties of Human Language

The Graph

G = (W, E): the graph of human language

W = {w_1, ..., w_N}: the set of words

E: the set of edges or connections between words

Graph Properties of Human Language

The Graph

Possible pattern of wiring in G. Black nodes are common words and white nodes are rare words. Two words are linked if they co-occur significantly.

Graph Properties of Human Language

The Small World Properties

The small-world pattern can be detected from the analysis of two basic statistical properties:

C: the clustering coefficient

d: the average path length
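For reference, the standard Watts-Strogatz criterion compares both quantities with a random graph of the same number of vertices N and average degree <k> (textbook background, not taken from the slides):

```latex
% Small-world pattern: high clustering, yet short paths
C \gg C_{\mathrm{rand}} \approx \frac{\langle k \rangle}{N},
\qquad
d \approx d_{\mathrm{rand}} \approx \frac{\ln N}{\ln \langle k \rangle}
```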

Graph Properties of Human Language

The Small World Properties

a_ij = 1 if there is a link between words w_i and w_j

a_ij = 0 otherwise

E: the set of links

<k>: the average number of links per word

Γ_i: the set of nearest neighbours of a word w_i

The clustering coefficient for this word, C_i, is defined as the number of connections between the words in Γ_i divided by the possible number of such connections.

Graph Properties of Human Language

Clustering coefficient

C_i = 2 E_i / (k_i (k_i - 1))

E_i: the total number of edges that exist among the nearest neighbours of w_i

k_i (k_i - 1) / 2: the possible number of edges among the k_i nearest neighbours

C: the average of C_i over all words
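A minimal sketch of this definition in Python, assuming the graph is stored as a dict mapping each word to its set of neighbours (illustrative code):

```python
def clustering_coefficient(neighbors, word):
    """C_i = E_i / (k_i * (k_i - 1) / 2): existing edges among the
    neighbours of `word` over the possible number of edges."""
    nbrs = neighbors[word]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Count the edges that actually exist among the neighbours.
    existing = sum(1 for a in nbrs for b in nbrs
                   if a < b and b in neighbors[a])
    return existing / (k * (k - 1) / 2)

# Tiny example: a triangle (a, b, c) plus a pendant vertex d.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(clustering_coefficient(graph, "a"))  # 1 existing / 3 possible = 0.333...
```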

Graph Properties of Human Language

Average path length

d_min(w_i, w_j): the minimum path length between two words

d(w_i): the average path length of a word, i.e. the average of d_min(w_i, w_j) over all other words w_j

d: the average of d(w_i) over all words
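A matching sketch for d, computed as the mean of breadth-first shortest-path lengths over all reachable ordered pairs (illustrative; a real word network would need a more efficient approach):

```python
from collections import deque

def average_path_length(graph):
    """Mean of d_min(w_i, w_j) over all ordered pairs of distinct,
    connected vertices; `graph` maps each vertex to its neighbour set."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from source
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())       # source itself contributes 0
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0
```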

Scaling and Small-World Patterns

UWN (Unrestricted Word Network): the network that results from the basic method

RWN (Restricted Word Network): the network that results from the improved method

[Table: number of edges, number of nodes and average connectivity for the UWN and RWN]

Scaling and Small-World Patterns

Distribution of degrees for both the UWN and RWN, obtained after processing three quarters of the words.

The exponent in the second regime is similar to that of the so-called Barabási-Albert model (whose exponent is -3).

The BA model leads to scale-free distributions using the rule of preferential attachment, sketched below.
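A compact sketch of that rule (the standard BA growth algorithm; sizes below are illustrative): each new vertex attaches to m existing vertices chosen with probability proportional to their current degree.

```python
import random

def barabasi_albert(n, m):
    """Grow a graph by preferential attachment: each new node links to
    m distinct existing nodes sampled proportionally to their degree."""
    edges = set()
    endpoints = []  # list of edge endpoints; sampling it is degree-proportional
    # Seed: a complete graph on m + 1 nodes.
    for i in range(m + 1):
        for j in range(i):
            edges.add((j, i))
            endpoints += [i, j]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(random.choice(endpoints))
        for t in targets:
            edges.add((t, new))
            endpoints += [new, t]
    return edges

g = barabasi_albert(10000, 2)  # degree distribution tail ~ k^(-3)
```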

Scaling and Small-World Patterns

The more frequent a word, the more available it is for production and comprehension. This is known as the frequency or recency effect, and it explains why preferential attachment shapes the scale-free distribution in our case.

For the most frequent words, k ∝ f, where k is the degree and f is the frequency.

The higher the degree of a word, the higher its availability.

[Figure: the complete relationship between k and f in the RWN]
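One hedged way to probe this relation on any word network, assuming `degree` and `freq` dicts have been built from the corpus beforehand (hypothetical inputs):

```python
import math

def loglog_slope(degree, freq, top=100):
    """Least-squares slope of log k versus log f for the `top` most
    frequent words; a slope near 1 supports k proportional to f."""
    words = sorted(freq, key=freq.get, reverse=True)[:top]
    xs = [math.log(freq[w]) for w in words]     # log frequency
    ys = [math.log(degree[w]) for w in words]   # log degree
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var
```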

Scaling and Small-World Patterns

Kernel Words

The network formed exclusively by interactions among kernel words, hereafter called the Kernel Word Network (KWN), agrees better with the predictions that can be made when preferential attachment is at play.

Scaling and Small-World Patterns

Kernel Words

Connectivity distribution for the kernel word network

The connectivity distribution for the kernel word network, which is formed by the 5000 most connected vertices in the RWN.

The average connectivity in the kernel is higher than in the whole RWN, since the kernel keeps only the most connected vertices.

The distribution has a power-law tail for large k; the exponent of the tail is close to the Barabási-Albert value of -3, indicating that preferential attachment is happening.


Discussion

If the SW features derive from optimal navigation needs, two predictions follow:

Words whose main purpose is to speed up navigation must exist.

Brain disorders characterized by navigation deficits in which such words are involved must exist.

Discussion

First Prediction

The 10 most connected words: and, the, of, in, a, to, 's, with, by, is

These words are characterized by a very low or zero semantic content (meaning).

Although they are supposed to contribute to the sentence structure, they are generally not crucial for sentence understanding.

Discussion

Second Prediction

Agrammatism: a kind of aphasia in which speech is non-fluent, laboured, halting and lacking in function words

aphasia: total or partial loss of the ability to use or understand spoken or written language. It is a symptom of brain disease or injury

Agrammatism is the only syndrome in which function words are particularly omitted.

Function words are the most connected ones.

Such halts and lack of fluency are due to the fragility associated with the removal of highly connected words.

It is known that omission of function words is often accompanied by substitution of such words. Patients in whom substitutions predominate and whose speech is fluent are said to have paragrammatism. Paragrammatism recovers fluency (i.e. a low average word-word distance) by inadequately using the remaining highly connected vertices, thus often producing substitutions of words during discourse.

Thank you…