Practice Architecture 1
8/3/2019 Practice Architecture 1
http://slidepdf.com/reader/full/practice-architecture-1 1/10
This Is The Architecture Of Our Practice
An overview of how Rock Creek Analytics provides opinion research designed and used for the Internet, from sampling to analysis.
Rock Creek Analyticsʼ pioneering tools use Internet text to
analyze public opinion. We provide thorough profiles of
people and issues. We find, follow, and assess developing
trends. Our work is quick, accurate and thorough.
We use Internet content because everyone – people,
institutions, even issues – leaves a record there: what we
say and what is said about us. Even more important,
opinion in Net content is there because people want it
there: people go to blogs, Facebook, and Twitter to express
their opinions and leave them for others to read.
We download that content and analyze it for the
characteristic and distinctive words and phrases that mark
off opinion about a person, an issue, or any other factor in
developing policy. We use a variety of statistical
perspectives to create profiles that characterize whatʼs being said online by and about someone (or something).
Or whatʼs being said about something new: an emerging
trend.
Here is what we do:
• Create profiles – from as many perspectives as you need.
What makes Glenn Beck and his message jump out for his
audience? We tell you the how and the why of the rise of
Glenn Beck into prominence, and not just the how much.
• Identify and assess emerging trends. Where did the
nativist outcry during the financial crisis come from? What
made the argument effective and how did it trail off?
• Measure the prominence – newsworthiness, notoriety – of
an agitator or a cause. When and by how much did
someone become noticed, and how quickly did they fade
into the crowd?
We do this by benchmarking: comparing text and opinion
about a person with a context, comparing one political
position with another, finding celebrity against shifting background: Al Gore and “global warming debate”; Sarah
Palin and the 2008 “Presidential campaign”.
Our benchmarking uses a series of statistical tools: most
importantly, evaluating significance. We find the differences
that matter, and put them together.
What follows is how we do it.
Search and sampling

Most opinion research uses random sampling. Ours does not.
In random sampling, each item has an equal chance of
being selected and each selection is made independently.
Randomness is modeled by normal distribution. Even in a
non-random environment, randomness is the basis of the
standard polling process.
The Economist/YouGov Internet Presidential poll:
Krosnick, J. A. (2006), compared with Blumenthal,
M., No Such Thing As A Perfect Sample (2009),
appears to be a typical debate, focusing on
randomness as an unquestioned assumption rather
than considering whether it applies to Net
content (including whatʼs said by poll subjects
found online).
Internet content is not randomly ordered. The Net material
we use is not amenable to random sampling, and is not
described with the mathematical models of randomness,
such as normal distribution. Rather than randomness, we
base our analysis on the makeup of Internet architecture:
power laws, and scale-free and small-world distributions.
The Structure of the Web; L. Breslau, P. Cao, L.
Fan, G. Phillips, and S. Shenker, Web Caching and
Zipf-like Distributions: Evidence and Implications;
Menczer, F., Lexical and Semantic Clustering by
Web Links.
Power laws, scale freedom, and small world distribution
therefore apply to samples of Net content, including those
we use in our work. Random sampling is unlikely to produce
workable and representative material in a non-random
environment.
“[C]ohesive collections of Web pages (for
instance, pages on a site or pages about a topic)
mirror the structure of the Web at large.” Dill, S.,
et al., Self-Similarity In the Web; Menczer, F.,
Lexical and Semantic Clustering by Web Links;
D. Gibson, J. Kleinberg, Inferring Web
Communities from Link Topology.
This difficulty obtains for any grouped text, either
downloaded, as contemplated above, or gathered off-line.
“The inherent non-randomness of corpus data
renders statistical estimates unreliable, since the
random variation on which they are based will be
smaller than the true variation of the observed
frequencies.” S. Evert, How random is a corpus?
The library metaphor
Consider analyzing opinion leadership during the efforts to
stem the financial crisis during September and October
2008. Ultimately the Net content of interest was in the topics of finance and politics. Under the assumption of self-
similarity, those topics were the source from which Net
content was taken.
See Chakrabarti, et al., The Structure of Broad
Topics on the Web; David M. Pennock, Gary W.
Flake, Steve Lawrence, Eric J. Glover, and C. Lee
Giles, Winners donʼt take all: Characterizing the
competition for links on the web.
The concurrent and cross-sectional analysis
conducted for profiles is similar.
The units of sampling in this case were documents/texts,
from 100 to 2000 words in length, in two groups of about
5000 Internet files total, taken from on-line websites of
newspapers (the New York Times), topical websites,
political and other weblogs, and assorted newsgroups.
Different Net upload-store-download technologies
present different issues for, variously, search and
retrieval (and analysis). Of the four here, only web
pages are usually undated; by-lines are used in on-
line newspaper articles in contrast with
pseudonyms found elsewhere; collective versus
individual authorship; and so on. See below.
Originally the choices were made to see if
opinion leadership existed in one technology, e.g.
re-purposed newspapers, or another.
Blogs and Web 2.0 social media are sometimes used when
changing opinion is being analyzed.
In some cases, we use social media for time-sensitive
searching, weighted to reflect recency and
time-sensitive topicality.
Blogs present separate issues. Because a blog
post with a comment string may be a kind of
conversation (therefore a single functional text), but
may also be broken up into several different chunks
of code, we may sample this as a set of strings
while analyzing the string as a single unit.
The unit of analysis we use is a ʻtextʼ. A text may refer to a
single item (web page) or to an aggregate (10,000 Net files
using the words “financial” and “crisis” which were uploaded
or posted September 1, 2008 – September 28, 2008).
The term ʻtextʼ as used here refers to each of two different
functions:
• The formal definition required for sampling and retrieval:
ʻone or more sentences demarcated by typographical
conventions (white space, binding) or technical definitions
and use (<body> text </body> in HTML)ʼ.
• The functional definition required for analysis: ʻa semantic
unit of language in use, containing one or more
sentences, containing chains of repeated and related
words, and both familiar and novel information.ʼ
Net content is embedded in different kinds of code.
Typically we analyze content in HTML files (also,
when needed, blog comments in database
languages). (Newsgroup code, UUE, etc., presents
separate issues out of scope here.)
The working convention for retrieval is, therefore,
“file” = “HTML document” = “web page” = “text”.
Depending on context, however, a blog thread – post
and comments – may count as a single text, or
the post and each comment may count as
separate individual texts. (This is a working
distinction; the details are out of scope.)
Texts are also, as the formal definition implies, collections
of words with more or less well-formed boundaries. Words
are the units of measurement at this level of granularity,
and thereby serve two critical functions: they are units of
analysis for frequency, dispersion and collocation, and they
are semantic anchors for contextual and topical analysis, in
words, phrases, sentences, and texts. The dialectic
between the formal and semantic/topical perspectives on
words is the keystone of our work.
The formal extensional definition of words –
ʻcharacter strings bounded with white space,
grammatical markingʼ and so on – is omitted here.
Text files are obtained from the Internet by using several
kinds of search engine, each with a different kind of
ranking algorithm: Google (as an example of PageRank),
backlinks (Yahoo, among others), HITS/authority, and
unique visitors/popularity. Results are retrieved (with date
limitations, as needed), and downloaded by using returns
from each search engine separately (ranked by weight and
then recursed and results retrieved from a new search) and
by aggregating results.
The search engines used and the weighting
algorithms vary from case to case. The collection
process is designed around self-similarity, small
world, and scale free assumptions.
Analysis: Frequencies

Trend recognition is one of the most common forms of
frequency analysis, and so we will focus on it here.
Discovering a trend – either retrospectively, or more or less
concurrently - is a before-and-after analysis of content.
Working with the blocs of text files and the statistics of word
frequency change, we compare sequential blocks of
comparable topical materials.
The before-and-after analysis used for trends begins with
compiling word frequencies in the ʻbeforeʼ material – which
also functions as a benchmark. In this case the words are
those in the September and October text, and frequencies
are enumerated for each.
The compiled lists give what are sometimes called
the “observed absolute frequencies” for the listed
words.
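As a rough sketch of this compilation step (not Rock Creek's production tooling; the tokenization rule here is a simplifying assumption for illustration), observed absolute frequencies can be tallied with a tokenizer and a counter:

```python
from collections import Counter
import re

def word_frequencies(text: str) -> Counter:
    """Compile observed absolute frequencies for each word in a text.

    Tokenization is deliberately crude: lowercase runs of letters
    (apostrophes allowed), loosely matching the formal definition of a
    word as a character string bounded by white space or punctuation.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

freqs = word_frequencies("the bailout failed and the bailout debate grew")
print(freqs.most_common(2))  # [('the', 2), ('bailout', 2)]
```

In practice each downloaded file would be read and its counts merged into one Counter per time period, giving the "before" and "after" frequency lists the text describes.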
We use several metrics. One set is drawn from changes in
network graph rankings and related search results.
Google (as an example of PageRank), backlinks,
HITS/authority, and unique visitors/popularity
For a comparable approach, see Gabrilovich,
Dumais, Horvitz Newsjunkie: Providing
Personalized Newsfeeds via Analysis of
Information Novelty, and, to similar effect, Jon
Kleinberg, Temporal Dynamics of On-Line
Information Streams .
To analyze incipient trends and numbers of words, a
second set of metrics uses measurements of word
association, the significance of changes in word frequency,
and changes in the dispersion of the most important words
to measure effects.
Some trend metrics use an interval scale – those used for
word frequencies, for example. To an extent we can
measure the trend and its effect using interval data and
derivatives. However, we are constrained by the need to
use ordinal results (web page ranking systems), and non-
parametric dispersion analysis.
For example: relative frequencies are critical for
identifying the under-the-radar onset of trends;
popularity and some Google functions for following
trends; and word and link dispersion for quantifying
effects.
To compare before and after opinion on political issues, we
looked at the two one-month periods before and after
September 26th, the date the first bailout legislation failed.
We took each to mark an appropriate sampling unit for Net
opinion on the political dimensions of the financial crisis.
Assuming self-similarity, we used “financial” and “crisis” to
define a set of texts dealing with that topic and also
representative of the larger domains of opinion on the issue
on the Net. These definitions also served as search engine
queries (in the first instance, as discussed above). After
weighted ranking, we also introduced date limitations (of
convenience, further simplified for this discussion) and
downloaded files in two sets of about 250,000 words each,
for the two time periods.
There is also the dynamic case, not used here, with
feedback, such that results (from one or another
level) about opinion from earlier periods are
introduced (at one or another level) for another
period. This ranges from media feedback to
explicitly and overtly gaming a popularity-based
search engine such as Technorati.
The listsʼ word frequencies – here for the pre- and post-
September 26 text collections – are then compared. This
contrast is the next step in showing whether a trend – a
discrete and identifiable chain of opinion – emerged from
one dated set of texts compared to its predecessor. What
turns up when raw counts are compared?
There was very little difference between the periods at this
level in this case: terms like “bailout” and “financial”, which
led the substantive discussion, decrease, but only very
slightly.
Observed absolute frequencies of critical terms
before and after 9/26/08. (Functional words are
much more common than the content words we
analyze. The function word “the” has been added to
the table for comparison; its use declines as well,
though also only slightly. This suggests that decline in
observed frequencies standing alone is unlikely to
be informative.)
              pre-9/26        post-9/26
“the”         14,171 (1st)    13,300 (1st)
bailout       899 (32nd)      582 (59th)
government    661 (45th)      437 (79th)
financial     601 (50th)      553 (63rd)

(Numbers in parentheses show frequency rank)
(Data from a Rock Creek study; the pre- and post-
9/26 text sets were about 250,000 words each.)
However, going beyond this case, as a general matter, if
the sample sets are of different sizes, comparing word
frequencies between sets is not by itself even valid, much
less useful, at least not until the counts are normalized.
“The more frequent occurrence of a word in one text
collection does not by itself show that the observed
word is actually more frequent, because the
observed frequencies are dependent on the sizes
of the texts that are being compared.” Gries, S. Th.,
Useful statistics for corpus linguistics.
One method for benchmarking comparisons normalizes each
wordʼs raw count as a ratio to the word count for the entire
text. This can be expressed as “[word X] per thousand”, or
as a percentage.
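The normalization itself is a one-line ratio. A minimal sketch (the function name is ours; the figures echo the “bailout” counts and the 250,000-word set sizes mentioned elsewhere in this document):

```python
def per_thousand(count: int, total_words: int) -> float:
    """Normalize an observed absolute frequency to occurrences per
    thousand words, so different-sized text sets can be compared."""
    return 1000.0 * count / total_words

# "bailout": 899 of ~250,000 words before 9/26, 582 of ~250,000 after.
pre = per_thousand(899, 250_000)   # ~3.6 per thousand (~0.36%)
post = per_thousand(582, 250_000)  # ~2.3 per thousand (~0.23%)
```

Multiplying by 100 instead of 1000 gives the percentage form used in the table below.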
Differences after normalizing continue to be slight in this
case. “[F]inancial” and “bailout” dominated the content
words of the substantive debate, but show, for example,
only about a 0.15-point decrease in the use of “bailout” for
the post-September 26 period.
Important terms (as percentages) of
observed total word counts

              pre-9/26   post-9/26
“the”         5.6        4.6
bailout       .35        .20
government    .26        .15
financial     .24        .19
As the tables suggest, comparing word frequencies in
almost any pair of texts may not show much difference
even when normalized. The critical question is whether the differences in use (i.e., frequencies) for important terms
matter. Our analysis relies on a statistical test for comparing
frequencies.
The results of relative frequency testing serve three different
but related inquiries:
• First, are the two texts being compared (non-trivially)
distinct from one another? Is there word use in the sets of
text that shows meaningful differences in opinion for
September and October 2008?
• Second, how are the texts distinct? What word use
distinguishes one from the other? In this case, did
patterns of word use – ultimately trends in opinion – emerge and develop?
• Third, if there are conspicuous differences, do the sharp-
edged differences in word patterns suggest more or less
topically and thematically related set(s) of words?
The process begins with the frequency lists just described.
For each word in the two frequency lists we derive the
significance statistic to obtain a value with which to
distinguish the September and October texts and to analyze
the distinction.
There are more than two dozen tests now being discussed
in the scientific disciplines concerned with evaluating the
significance of frequency differences in word use when
paired texts are compared. The most commonly used are a log likelihood test and the chi-squared test.
The log likelihood test we use does not assume
randomness or normally distributed data in making
comparisons. It is therefore better suited to the
Netʼs non-random word and content distribution.
See Dunning, T., Accurate methods for the statistics of
surprise and coincidence (cited more than 1300
times); Rayson, P., Garside, R., Comparing Corpora
using Frequency Profiling.
By contrast, the commonly used chi-squared test
derives probabilities for the frequencies by
comparing them with random ordering. Our
samples are not random. The test depends on the
assumption – not applicable for Net content, or
words in general – that words are independent and
identically distributed.
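A sketch of the two-corpus log likelihood comparison, using the common contingency formulation found in Rayson and Garside. This is a generic implementation, not necessarily the exact variant Rock Creek uses, and the sign convention for direction is our addition:

```python
import math

def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Dunning-style log likelihood (G2) for one word across two
    corpora: a and b are the word's observed frequencies in corpora
    of total sizes c and d. Signed variant: positive means the word
    is overused in the first corpus, negative means underused."""
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    ll = 2 * ((a * math.log(a / e1) if a else 0.0) +
              (b * math.log(b / e2) if b else 0.0))
    return ll if a / c >= b / d else -ll

# Illustrative: "bailout" fell from 899 to 582 occurrences between
# two sets of about 250,000 words each; the negative sign marks the
# decline, analogous to the negative keyness values tabled below.
g2 = log_likelihood(582, 899, 250_000, 250_000)  # about -68
```

The magnitude, not the raw count difference, is what the text calls salience: how much the frequency change matters given the sizes of the two text sets.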
We also use log likelihood analysis to derive the sets of
words that can be used to distinguish one set of texts from
another, or to characterize one of them. Sometimes these
sharply defined words are referred to as “key words”: those
that occur unusually more (or less) often in one text (or set of
texts) than another. Keyness measures relative
distinctiveness: how far a term departs from its
comparative benchmark. These keywords are the words
that characterize individual texts - Romeo and Juliet - or
groups of text (post- as opposed to pre-September 26
discussions of the financial crisis).
Scott, M. & Tribble, C., TEXTUAL PATTERNS. Note that
key/keyness are derivative terms: the log likelihood
test (discussed above) measures the salience
(more or less the same as statistical significance) of
relative frequency between texts. That is, the metric
looks not at the arithmetical difference in word use,
but at how much that difference matters.
When we applied “keyness” to the September-October Net
discussion, some words stood out. While used rarely in raw
numbers, these words made the later October discussion
distinctive: “minorities”, “hate”, and “alien” became visible
in this relative frequency analysis.

These emerging keywords are evidence of a change in the
terms of the debate.
The log likelihood test also uses a logarithmically
based ratio scale that facilitates comparisons of
individual word (and some other) usage across
sets of texts. This in turn allows cross-sectional
and longitudinal comparisons.
New terms in the crisis debate
(Left column is absolute percentage and rank; right
column is departure from expected frequencies;
negative values – declines – in red)

              Percentage       KEYNESS
bailout       .20 (64th)       -117
government    .15 (85th)       -82
financial     .19 (68th)       -14
minorities    .02 (525th)      34
hate          .01 (284th)      21
alien         -- (3141st)      10
Keywords are the hallmarks of frequency change – they bring out the contrast between profile and background, or
between the blocks of opinion recorded in Net text. If people
talk differently about Toyota than they do about General
Motors, what stands out – by the log likelihood statistical
metric – in the comparison?
Keywords are also markers for shifts in word use as the
Net discussion moves forward. How are keywords situated
within text? Where do they fall? This kind of ʻlocationʼ is
measured with analysis of “dispersion”: the even or uneven
distribution of an item through the text being studied.
Dispersion – placing keywords in the Internet text
environment – is the basis for the audience-effect side of our
opinion research.
We can place keywords in different audience segments
represented by the different groups of Net text, and then
compare the text groups for the distribution of keywords. If
“minority” and “alien” are found significantly more often
during October on conservative blogs than elsewhere, this is
evidence of an echo chamber effect for that issue.
Dispersion, then, shows where messages have taken hold:
where and by how much the message has had an effect.
That is, we are developing ways to measure where, how,
and how much a message has affected different blocs of
Internet opinion. Where are the key terms found most often
in the audience – and in which parts of the audience?
Ordinarily dispersion for ratio-scaled data is
measured by standard deviation or variance.
However, where the data may be non-parametric,
those metrics are not available. Also, pair-wise data
is not available, and sample sizes are large. For
continuing surveys of the problem see S. Gries,
Dispersions and adjusted frequencies in corpora;
and Dispersions and adjusted frequencies in
corpora, further explorations.
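As one concrete non-parametric option – an assumption on our part, since the text cites Gries' surveys without naming a specific metric – Gries' “deviation of proportions” (DP) can be sketched as:

```python
def gries_dp(occurrences, part_sizes):
    """Gries' deviation of proportions (DP), a non-parametric
    dispersion measure: 0 means a word is spread evenly across the
    parts (here, groups of Net texts, e.g. blog segments); values
    approaching 1 mean it is concentrated in a few parts."""
    total_occ = sum(occurrences)
    total_size = sum(part_sizes)
    return 0.5 * sum(abs(s / total_size - o / total_occ)
                     for s, o in zip(part_sizes, occurrences))

# Hypothetical: a keyword concentrated in one of three equal-sized
# blog groups shows high DP (uneven dispersion, echo-chamber-like).
dp = gries_dp([18, 1, 1], [1000, 1000, 1000])  # roughly 0.57
```

A word spread evenly across the audience segments scores near zero; a word echoing inside one segment scores high, which is the pattern described for “minorities” below.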
In this case, the keyword term with visibly uneven dispersion
is “minorities”, and the effect is concentrated in the posts of
three bloggers, two visibly conservative-leaning. Malkin
seems ultimately to have been preaching to the converted.
Nativist rhetoric was found on conservative blogs –
at least 40,000 of them (by different sampling than
that used above). However, this was less than 4%
of blogs discussing the financial crisis, and only
traces of the message could be found in
mainstream media websites. Please note
that these results are crude, and come from the
application of a form of head-counting, using the
results of the frequency analysis.
Collocation
Collocation is the degree to which words occur together
unusually often, by some measure of significance.
"Collocation" is a formal term for this intuition - that some
words tend to occur near each other: "night" and "day", "kick"
and "bucket", "global" and "warming".
Evert, S., Corpora and collocations
If one or more collocations - a set of key phrases and other
patterns - can be found in a text, then we can build up to
quantitatively derived core features of the text. Moreover, when collocates can be found and aggregated for the
distinctively frequent vocabulary (keywords) in a text, the
process marks off the message of a text, whether this is the
intended message or the message picked up by the Internet
audience reading the text and its message.
It follows that study of collocations and their uses
extends and applies statistics in virtually every
language discipline, from machine translation to
literary analysis to email forensics.
As a rule of thumb, the higher the statistical score for a word
pair's collocation, the more the association tells us about the
pairʼs role in the text. This measurement and analysis can be
done by hand (as by inspecting a text for every instance of a
word in order to identify which words recur near each other)
or using statistics.
This is a considerable over-simplification: there are
well more than 25 measures for collocation that
have been proposed; seldom, if ever, will all point in
the same direction. Evert, Corpora and collocations;
Oakes, M., STATISTICS FOR CORPUS LINGUISTICS 188-
195. In practice, we use one or more of the mutual
information, t-test, and log likelihood tests for a
project.
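Two of the named measures, pointwise mutual information and the t-score, can be sketched from co-occurrence counts. Window handling is simplified here and the counts are hypothetical, so this is an illustration of the formulas rather than the firm's actual scoring:

```python
import math

def collocation_scores(o: int, f1: int, f2: int, n: int):
    """Two common association measures for a word pair:
    o = observed co-occurrences (within some window),
    f1, f2 = the two words' individual frequencies,
    n = corpus size in tokens. Window-size corrections omitted."""
    expected = f1 * f2 / n              # co-occurrences expected by chance
    pmi = math.log2(o / expected)       # pointwise mutual information
    t_score = (o - expected) / math.sqrt(o)
    return pmi, t_score

# Hypothetical counts for "illegal" + "alien" in a 10,000-token sample.
pmi, t = collocation_scores(o=25, f1=60, f2=40, n=10_000)
```

Both scores rise when a pair co-occurs far more often than chance predicts, which is the rule of thumb stated below: higher score, stronger association.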
Collocates add color to literal meaning; repeated and
prominent usage may enhance the coloring of surrounding words. “Cause” is an example; when used as a verb its usual
collocates are negative.
ʻCauseʼ collocates with (among other things):
damage, problems, pain, disease, distress, trouble,
blood, concern, degradation, harm, pollution,
suffering, anxiety, death, fear, stress, surprise,
symptoms.
Collocation can compound the effect of a distinctive and vivid
vocabulary. For example, at the end of 2008 we analyzed the
impact of an online essay Michelle Malkin wrote in late
September of that year, arguing that illegal immigrants were
to blame for the banking collapse.
Illegal immigration and the mortgage mess
There were interlocking word patterns in that essay that, as
received and passed on in Net discussion, could be captured
and measured with statistical collocation analysis.
These words also interlocked: “illegal” collocated markedly
with both “alien” and “Hispanic” and so on, as shown in the
figure below.
The full recursive collocation analysis is beyond
scope here. Details on request.
Figure 2 – how the core words in
Malkinʼs essay interlocked

From the Rock Creek Analytics collocation analysis
of the Malkin essay. The links and nodes are not to
scale, except that the node and link sizes are
scaled as shown relative to each other, and these
are among the most common words in the text. The
graphic was created using Voisine network
visualization software.
Each of these keywords collocated significantly often with
each of the others. The width of the lines represents the
result of applying the collocation metrics, a kind of tensile
strength. Moreover, two of these keywords were “key”, used
unusually often (by Malkin in comparison with other
September Net opinion).
See above for her keywords.
The result was a tightly bound bundle of blame. The word
pattern, by itself, or in noteworthy part, was picked up by
about 40,000 weblogs in October 2008.
However, given the dispersion review noted above,
this number may not reflect a significant impact on
the overall discussion.
One of the discussion threads picking up and echoing her
phrasing also used the negative coloration of “cause”
described above: “Giving home loans to minorities caused
financial crisis”.
There are many other rhetorical devices in Net text
that also served to convey Malkinʼs message (and
that can be measured but were not analyzed in this
case). They include synonyms, homonyms, and
other rhetorical figures, like part-for-whole
synecdoche (“alien”).
Repeating phrasing and other distinctive vocabulary in this
way reflects its influence – we tend to quote or reword the
phrasing for ideas we agree with – and by tracking
phraseology in this way we can track influence.
For an example of tracking phraseology in this way,
see J. Leskovec, L. Backstrom, and J. Kleinberg,
Meme-tracking and the Dynamics of the News
Cycle. (However, to be clear, the conception of a
meme used in that article is very different from the
one we use, as shown, for example, in Figure 2.)
Concordancing
A concordance is a list of a word (or sometimes a brief
phrase), along with immediate context, from a corpus or text
collection. The process produces a list of occurrences of the
search term, each centered in the GUI window of
specialized concordance software, with the words that come
before and after in the text displayed to the left and right of
the term.
See S. Hunston, CORPORA IN APPLIED LINGUISTICS
38-66
Although not by itself a complex tool, concordancing serves
several functions: when counting from the display, it can be used
to discover latent word patterns. It can analyze a text using
the context for collocated terms. And it can be used to identify
and supply words for relative frequency analysis.
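A minimal KWIC (key word in context) routine can be sketched in a few lines; the display formatting is simplified relative to dedicated concordance software, and the sample sentence echoes the nativist phrasing quoted later in this document:

```python
def concordance(tokens, term, width=4):
    """Minimal KWIC sketch: for every occurrence of the search term,
    return a line with the term centered between the `width` words
    to its left and right, as concordance software displays it."""
    term = term.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == term:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>30}  {tok}  {right}")
    return lines

sample = ("the massive illegal alien mortgage racket grew "
          "as illegal lending spread").split()
for line in concordance(sample, "illegal"):
    print(line)
```

Each printed line centers one occurrence of “illegal”, making recurring neighbors (here, “alien”) easy to spot by eye.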
Here is an example of a concordance list from the Malkin
essay discussed above, using “illegal” as the search term.
Figure 3: screenshot of concordance software, showing “illegal” as used in context in Malkinʼs essay
Material taken from the Malkin essay, using AntConc software
Beyond investigation and research, a concordance serves as
a check for the results of other functions.
Do the collocations appear to be significant when examined
in context? What do key word results show when their usage
is examined in context? “Illegal” here shows as a linchpin of
nativist rhetoric.
Putting the data together
This technical review has shown how we extract opinion from Internet text and put it to work.
Here is a brief summary of the example.
Which are the words that matter – that make a
message stand out, that make a text distinctive?
How do words and phrases compare with
competing messages?
Frequency and relative frequency analysis.
Keywords: “minorities” and “alien” as core
conservative rhetoric.

How do critical words – especially keywords –
hold together?
Collocation: “illegal”, “Hispanics”, and “alien”.

Where and how much do messages have an
impact?
Dispersion: nativism had conservative resonance,
but slight if any effect elsewhere.

How were critical words used in context?
Concordancing: here, for example, “illegal”
in context: “the massive illegal alien mortgage
racket”.
Conclusion
The Internet is nothing more than a vast collection of
computer files, billions of them. Many are machine–readable
text that can be displayed in English. These text files are documents describing, referring to, and corresponding to
people, institutions, and issues. Many of these reflect and
express opinion. Neglecting this analysis of Net text means
missing out on a critical resource for opinion research.
These are the most critical and the most valuable tools
available at Rock Creek Analytics.
They work.
Contact Donald Weightman, principal
(cell 202 997-3290)
[email protected] or [email protected]