![Page 1: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/1.jpg)
Information Extraction, Social Network Analysis
Structured Topic Models & Influence Mapping
Andrew McCallum
mccallum@cs.umass.edu
Information Extraction & Synthesis Laboratory
Department of Computer Science
University of Massachusetts
Joint work with Aron Culotta, Charles Sutton, Wei Li, Chris Pal, Pallika Kanani, Gideon
Mann, Natasha Mohanty, Xuerui Wang.
![Page 2: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/2.jpg)
Goals
• Quickly understand and analyze contents of large volume of text + other data
  – browse topics
  – navigate connections
  – discover & see patterns
• Assess data source to determine relevance
• Browse data newly acquired from the field
• Navigate your own data
• Discover structure and patterns
• Assess impact and influence
Collaborative
opportunity
assessment
Let analysts drive discovery process
Inducing organizational structure
unfamiliar,
inter-agency
^
Rapid ingest
Map flow of ideas
![Page 3: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/3.jpg)
Clustering words into topics with Latent Dirichlet Allocation
[Blei, Ng, Jordan 2003]
Generative Process:
For each document:
  Sample a distribution over topics, θ
  For each word in doc:
    Sample a topic, z
    Sample a word from the topic, w
Example:
  Distribution over topics: 70% Iraq war, 30% US election
  Topic: Iraq war
  Word: “bombing”
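The generative process on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two toy topics, their word probabilities, the Dirichlet parameters and the helper names (`sample_dirichlet`, `generate_document`, `TOY_TOPICS`) are hypothetical and do not come from the talk.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """Generate (topic, word) pairs for one document."""
    # Per-document distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    doc = []
    for _ in range(n_words):
        # Sample a topic index z for this word position.
        z = rng.choices(range(len(theta)), weights=theta, k=1)[0]
        # Sample a word w from the chosen topic's distribution over words.
        vocab = list(topics[z])
        w = rng.choices(vocab, weights=[topics[z][v] for v in vocab], k=1)[0]
        doc.append((z, w))
    return theta, doc


# Hypothetical toy topics (word -> probability), loosely echoing the slide's example.
TOY_TOPICS = [
    {"bombing": 0.5, "troops": 0.3, "baghdad": 0.2},  # "Iraq war"
    {"ballot": 0.4, "campaign": 0.4, "senate": 0.2},  # "US election"
]

rng = random.Random(0)
theta, doc = generate_document(TOY_TOPICS, alpha=[0.5, 0.5], n_words=10, rng=rng)
```

Here `theta` plays the role of the "70% Iraq war, 30% US election" mixture, and each pair in `doc` records which topic produced which word.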
![Page 4: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/4.jpg)
Example topics induced from a large collection of text [Tenenbaum et al]
• STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
• MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
• WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
• DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
• FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
• SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
• BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
• JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
![Page 6: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/6.jpg)
Social Network in an Email Dataset
![Page 7: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/7.jpg)
Author-Recipient-Topic SNA model
Topic choice depends on:
- author
- recipient
[McCallum, Corrada, Wang, 2005]
![Page 8: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/8.jpg)
Enron Email Corpus
• 250k email messages
• 23k people
Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: Enron/TransAltaContract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
DP
Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas [email protected]
![Page 9: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/9.jpg)
Topics, and prominent senders / receivers discovered by ART
Topic names, by hand [McCallum et al 2005]
![Page 10: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/10.jpg)
Topics, and prominent senders / receivers discovered by ART
Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”
![Page 11: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/11.jpg)
Comparing Role Discovery
connection strength (A,B) =
• Traditional SNA: distribution over recipients
• Author-Topic: distribution over authored topics
• ART: distribution over authored topics
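One way to turn such per-person distributions into a role comparison is to measure how far apart two distributions are. The slide does not say which measure is used, so the sketch below assumes Jensen-Shannon divergence purely for illustration; the function names and the toy vectors are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical distributions for three people over four topics (or recipients).
person_a = [0.70, 0.20, 0.05, 0.05]
person_b = [0.65, 0.25, 0.05, 0.05]
person_c = [0.05, 0.05, 0.20, 0.70]

close = js_divergence(person_a, person_b)  # small: similar profiles
far = js_divergence(person_a, person_c)    # larger: different profiles
```

A small divergence suggests similar roles under whichever representation (recipients or topics) the vectors encode.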
![Page 12: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/12.jpg)
Comparing Role Discovery: Tracy Geaconne and Dan McCarty
• Traditional SNA: Similar roles
• Author-Topic: Different roles
• ART: Different roles
Geaconne = “Secretary”
McCarty = “Vice President”
![Page 13: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/13.jpg)
Comparing Role Discovery: Lynn Blair and Kimberly Watson
• Traditional SNA: Different roles
• Author-Topic: Very different
• ART: Very similar
Blair = “Gas pipeline logistics”
Watson = “Pipeline facilities planning”
![Page 14: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/14.jpg)
ART: Roles but not Groups
Enron TransWestern Division
• Traditional SNA: Block structured
• Author-Topic: Not
• ART: Not
![Page 15: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/15.jpg)
Two Relations with Different Attributes
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration
Acad(A, B) Acad(C, B)
Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E)
Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A)
Acad(E, C) Acad(F, C)
Groups: G1 = {A, C}, G2 = {B, D}, G3 = {E, F}
Social Admiration
Soci(A, B) Soci(A, D) Soci(A, F)
Soci(B, A) Soci(B, C) Soci(B, E)
Soci(C, B) Soci(C, D) Soci(C, F)
Soci(D, A) Soci(D, C) Soci(D, E)
Soci(E, B) Soci(E, D) Soci(E, F)
Soci(F, A) Soci(F, C) Soci(F, E)
Groups: G1 = {A, C, E}, G2 = {B, D, F}
![Page 16: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/16.jpg)
The Group-Topic Model: Discovering Groups and Topics Simultaneously
[Graphical model. Symbols shown: w, t, b, v, g, φ, η, β, γ, α; plate sizes N, B, T, S, S², G². Distributions labelled: Uniform, Dirichlet, Multinomial, Beta, Binomial.]
![Page 17: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/17.jpg)
Inference and Estimation
Gibbs Sampling:
- Many r.v.s can be integrated out
- Easy to implement
- Reasonably fast
We assume the relationship is symmetric.
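To make "many r.v.s can be integrated out" concrete, here is a minimal collapsed Gibbs sampler. It is written for plain LDA rather than the Group-Topic model, as the simplest setting in which the same idea appears: the per-document and per-topic multinomials are integrated out and only topic assignments are resampled from count tables. The function name, hyperparameter values and toy corpus are assumptions made for the example.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                                # total tokens per topic
    z = []
    # Random initial assignment of a topic to every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalised conditional for each topic given all other assignments.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                # Add the token back under its new topic.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return z, n_dk, n_kw


# Hypothetical toy corpus over a 5-word vocabulary.
toy_docs = [[0, 1, 2, 0], [3, 4, 3], [0, 4]]
z, n_dk, n_kw = gibbs_lda(toy_docs, n_topics=2, vocab_size=5)
```

Each update touches only a handful of integer counters, which is one reason this style of sampler is easy to implement.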
![Page 18: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/18.jpg)
Dataset #1: U.S. Senate
• 16 years of voting records in the US Senate (1989 – 2005)
• a Senator may respond Yea or Nay to a resolution
• 3423 resolutions with text attributes (index terms)
• 191 Senators in total across 16 years
S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991) Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking, Accounting, Administrative fees, Cost control, Credit, Deposit insurance, Depressed areas, and other 110 terms
Adams (D-WA), Nay Akaka (D-HI), Yea Bentsen (D-TX), Yea Biden (D-DE), Yea Bond (R-MO), Yea Bradley (D-NJ), Nay Conrad (D-ND), Nay ……
![Page 19: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/19.jpg)
Topics Discovered (U.S. Senate)

Mixture of Unigrams

| Education | Energy | Military Misc. | Economic |
|---|---|---|---|
| education | energy | government | federal |
| school | power | military | labor |
| aid | water | foreign | insurance |
| children | nuclear | tax | aid |
| drug | gas | congress | tax |
| students | petrol | aid | business |
| elementary | research | law | employee |
| prevention | pollution | policy | care |

Group-Topic Model

| Education + Domestic | Foreign | Economic | Social Security + Medicare |
|---|---|---|---|
| education | foreign | labor | social |
| school | trade | insurance | security |
| federal | chemicals | tax | insurance |
| aid | tariff | congress | medical |
| government | congress | income | care |
| tax | drugs | minimum | medicare |
| energy | communicable | wage | disability |
| research | diseases | business | assistance |
![Page 20: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/20.jpg)
Groups Discovered (US Senate)
Groups from topic Education + Domestic
![Page 21: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/21.jpg)
Senators Who Change Coalition the most Dependent on Topic
e.g. Senator Shelby (D-AL) votes
– with the Republicans on Economic
– with the Democrats on Education + Domestic
– with a small group of maverick Republicans on Social Security + Medicaid
![Page 22: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/22.jpg)
Dataset #2: The UN General Assembly
• Voting records of the UN General Assembly (1990 - 2003)
• A country may choose to vote Yes, No or Abstain
• 931 resolutions with text attributes (titles)
• 192 countries in total
• Also experiments later with resolutions from 1960-2003
Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting
The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions:
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries. Against: Israel, Marshall Islands, United States. Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.
![Page 23: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/23.jpg)
Topics Discovered (UN)

Mixture of Unigrams

| Everything Nuclear | Human Rights | Security in Middle East |
|---|---|---|
| nuclear | rights | occupied |
| weapons | human | israel |
| use | palestine | syria |
| implementation | situation | security |
| countries | israel | calls |

Group-Topic Model

| Nuclear Non-proliferation | Nuclear Arms Race | Human Rights |
|---|---|---|
| nuclear | nuclear | rights |
| states | arms | human |
| united | prevention | palestine |
| weapons | race | occupied |
| nations | space | israel |
![Page 24: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/24.jpg)
Groups Discovered (UN)
The country list for each group is ordered by 2005 GDP (PPP), and only 5 countries are shown for groups that have more than 5 members.
![Page 25: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/25.jpg)
Groups and Topics, Trends over Time (UN)
![Page 26: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/26.jpg)
Structured Topic Models
Models that combine text analysis with other structured data:
people, senders, receivers, organizations, votes,
time, locations, materials, ...
I call these...
![Page 27: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/27.jpg)
Improve Basic Infrastructure of Topic Models
• Incorporate time
• Finer-grained, more interpretable topics by representing topic correlations
• Discover relevant phrases
• Map influence and impact
![Page 28: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/28.jpg)
Groups and Topics, Trends over Time (UN)
![Page 29: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/29.jpg)
Want to Model Trends over Time
• Pattern appears only briefly
  – Capture its statistics in focused way
  – Don’t confuse it with patterns elsewhere in time
• Is prevalence of topic growing or waning?
• How do roles, groups, influence shift over time?
![Page 30: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/30.jpg)
Topics over Time (TOT)
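[Graphical model: for each of D documents and each of its Nd tokens, a topic index z is drawn from the document's multinomial over topics (Dirichlet prior α); z generates both the word w, from a per-topic multinomial over words (Dirichlet prior β), and the time stamp t, from a per-topic Beta over time (uniform prior γ). There are T topics.]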
[Wang, McCallum, KDD 2006]
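The distinguishing feature of Topics over Time, as labelled in the diagram, is that each topic owns both a multinomial over words and a Beta over time. The sketch below generates one token with its time stamp under that reading. It assumes time stamps are rescaled to the interval [0, 1], and all parameter values and names (`WORD_DISTS`, `BETA_PARAMS`, `generate_token`) are hypothetical.

```python
import random

# Hypothetical per-topic parameters: word distribution and Beta(a, b) over normalised time.
WORD_DISTS = [
    {"tariff": 0.6, "treasury": 0.4},  # a topic concentrated early in the collection
    {"internet": 0.7, "energy": 0.3},  # a topic concentrated late in the collection
]
BETA_PARAMS = [(2.0, 8.0), (9.0, 1.5)]


def generate_token(theta, rng):
    """Draw a topic index, then a word and a time stamp that both depend on that topic."""
    z = rng.choices(range(len(theta)), weights=theta, k=1)[0]
    vocab = list(WORD_DISTS[z])
    w = rng.choices(vocab, weights=[WORD_DISTS[z][v] for v in vocab], k=1)[0]
    a, b = BETA_PARAMS[z]
    t = rng.betavariate(a, b)  # time stamp in [0, 1]
    return z, w, t


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution: where the topic is centred in time."""
    return a / (a + b)


rng = random.Random(0)
tokens = [generate_token([0.5, 0.5], rng) for _ in range(20)]
```

Because a Beta can be skewed or sharply peaked, a topic that appears only briefly can be localised in time instead of being smeared across the whole collection.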
![Page 31: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/31.jpg)
State of the Union Address
208 Addresses delivered between January 8, 1790 and January 29, 2002.
To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stop-word removal was applied.
• 17156 ‘documents’
• 21534 words
• 669,425 tokens
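The preprocessing described on this slide (split into paragraphs, drop one-line paragraphs, remove stop words) might look roughly like the following. The blank-line paragraph delimiter, the tokenisation rule, the tiny stop-word list and the function name are all assumptions for illustration; the slides do not specify them.

```python
import re

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = frozenset({"a", "an", "and", "by", "in", "is", "of", "the", "to", "upon"})


def address_to_documents(text, stop_words=STOP_WORDS):
    """Turn one address into a list of token lists, one per multi-line paragraph."""
    documents = []
    # Assumption: paragraphs are separated by blank lines.
    for paragraph in re.split(r"\n\s*\n", text):
        lines = [line for line in paragraph.splitlines() if line.strip()]
        if len(lines) <= 1:
            continue  # one-line paragraphs are excluded
        tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", paragraph.lower())
        tokens = [tok for tok in tokens if tok not in stop_words]
        if tokens:
            documents.append(tokens)
    return documents


sample = "Title line\n\nOur scheme of taxation\nconsists of a tariff.\n\nShort.\n"
docs = address_to_documents(sample)
```

On the toy input only the middle paragraph survives, since the other two occupy a single line each.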
Our scheme of taxation, by means of which this needless surplus is taken from the people and put into the public Treasury, consists of a tariff or duty levied upon importations from abroad and internal-revenue taxes levied upon the consumption of tobacco and spirituous and malt liquors. It must be conceded that none of the things subjected to internal-revenue taxation are, strictly speaking, necessaries. There appears to be no just complaint of this taxation by the consumers of these articles, and there seems to be nothing so well able to bear the burden without hardship to any portion of the people.
1910
![Page 32: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/32.jpg)
Comparing TOT against LDA
![Page 33: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/33.jpg)
TOT versus LDA on my email
![Page 34: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/34.jpg)
Topic Distributions Conditioned on Time
[Figure: horizontal axis: time; vertical axis: topic mass (in vertical height) in NIPS conference papers]
![Page 35: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/35.jpg)
Discovering Group Structure Trends over Time
[Two graphical models: Group Model without Time, and Group Model with Time. Labels: group id; observed relation; per group-pair binomial over relation absent / present; multinomial distribution over groups; time stamp; per group beta over time; G.]
![Page 36: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/36.jpg)
Improve Basic Infrastructure of Topic Models
• Incorporate time
• Finer-grained, more interpretable topics by representing topic correlations
• Discover relevant phrases
• Map influence and impact
![Page 37: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/37.jpg)
Latent Dirichlet Allocation
[Blei, Ng, Jordan, 2003]
[Plate diagram: α, θ (topic distribution), z (topic), w (word), φ (per-topic multinomial over words), β; plates N, n, T.]
LDA 100: motion detection field optical flow sensitive moving functional detect contrast light dimensional intensity computer mt measures occlusion temporal edge real
  “motion” (+ some generic)
LDA 20: visual model motion field object image images objects fields receptive eye position spatial direction target vision multiple figure orientation location
  “images, motion, eyes”
![Page 38: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/38.jpg)
Pachinko Machine
![Page 39: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/39.jpg)
Pachinko Allocation Model (PAM)
[Li, McCallum, 2006]
[Figure: model structure, not the graphical model. Nodes α11; α21, α22; α31, α32, α33; α41, α42, α43, α44, α45; leaves word1 ... word8.]
Model structure: directed acyclic graph (DAG); at each interior node: a Dirichlet over its children, and words at leaves.
For each document: Sample a multinomial from each Dirichlet.
For each word in this document: Starting from the root, sample a child from successive nodes, down to a leaf. Generate the word at the leaf.
Like a Polya tree, but DAG shaped, with arbitrary number of children.
Thanks to Michael Jordan for suggesting the name.
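The per-document and per-word steps described on this slide can be sketched as a walk down a small DAG. The node names, the graph shape, the Dirichlet parameters and the vocabulary below are all hypothetical; the sketch follows the general description on the slide (a Dirichlet at every interior node, words at the leaves) rather than any particular trained model.

```python
import random

# Hypothetical DAG: interior node -> (children, Dirichlet parameters over those children).
# Any name that is not a key of DAG is a leaf, i.e. a word.
DAG = {
    "root": (["super1", "super2"], [1.0, 1.0]),
    "super1": (["sub1", "sub2"], [0.5, 0.5]),
    "super2": (["sub2", "sub3"], [0.5, 0.5]),  # sub2 has two parents, so this is not a tree
    "sub1": (["motion", "video"], [0.2, 0.2]),
    "sub2": (["image", "pixel"], [0.2, 0.2]),
    "sub3": (["eye", "head"], [0.2, 0.2]),
}


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(dag, root, n_words, rng):
    """Generate words for one document, together with the path taken for each word."""
    # For each document: sample a multinomial from each Dirichlet.
    multinomials = {node: sample_dirichlet(alpha, rng) for node, (_, alpha) in dag.items()}
    words, paths = [], []
    for _ in range(n_words):
        node, path = root, [root]
        # Starting from the root, sample a child from successive nodes, down to a leaf.
        while node in dag:
            children, _ = dag[node]
            node = rng.choices(children, weights=multinomials[node], k=1)[0]
            path.append(node)
        words.append(node)  # the leaf reached is the generated word
        paths.append(path)
    return words, paths


rng = random.Random(0)
words, paths = generate_document(DAG, "root", n_words=12, rng=rng)
```

Because "sub2" is reachable from both "super1" and "super2", the same word-level topic can be shared by different higher-level mixtures, which is how the structure expresses topic correlations.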
![Page 40: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/40.jpg)
Pachinko Allocation Model
[Li, McCallum, 2006]
[Figure: model structure, not the graphical model. Nodes α11; α21, α22; α31, α32, α33; α41, α42, α43, α44, α45; leaves word1 ... word8.]
Distributions over words (like “LDA topics”)
Distributions over topics; mixtures, representing topic correlations
Distributions over distributions over topics...
Some interior nodes could contain one multinomial, used for all documents (i.e. a very peaked Dirichlet).
![Page 41: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/41.jpg)
Pachinko Allocation Model
[Li, McCallum, 2006]
[Figure: model structure, not the graphical model. Nodes α11; α21, α22; α31, α32, α33; α41, α42, α43, α44, α45; leaves word1 ... word8.]
Estimate all these Dirichlets from data.
Estimate model structure from data (number of nodes, and connectivity).
![Page 42: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/42.jpg)
Pachinko Allocation Special Cases
Latent Dirichlet Allocation
[Figure: root α11; second level α21, α22, α23, α24, α25; leaves word1 ... word8.]
![Page 43: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/43.jpg)
Inference – Gibbs Sampling
Dirichlet parameters α are estimated with moment matching
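[Graphical model; labels: N, n, w, T’, α2, θ2, z2, z3, T, φ, β, α3, θ3. z2 and z3 are jointly sampled.]

$$
P(z_{w2}=t_k,\; z_{w3}=t_p \mid D,\; \mathbf{z}_{-w},\; \alpha,\; \beta)
\;\propto\;
\underbrace{\frac{n^{(d)}_{1k}+\alpha_{1k}}{n^{(d)}_{1}+\sum_{k'}\alpha_{1k'}}}_{P(t_k)}
\times
\underbrace{\frac{n^{(d)}_{kp}+\alpha_{kp}}{n^{(d)}_{k}+\sum_{p'}\alpha_{kp'}}}_{P(t_p \mid t_k)}
\times
\underbrace{\frac{n_{pw}+\beta_{w}}{n_{p}+\sum_{m}\beta_{m}}}_{P(w \mid t_p)}
$$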
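The slide states that the Dirichlet parameters α are estimated with moment matching. As a rough illustration of that idea, the sketch below fits a Dirichlet to a set of observed proportion vectors by matching their means and variances. It is a generic moment-matching estimator, not necessarily the exact procedure used by the authors; the function names and the synthetic data are assumptions.

```python
import random
import statistics


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_moment_match(samples):
    """Estimate Dirichlet parameters from proportion vectors by matching moments."""
    k = len(samples[0])
    means = [statistics.fmean(s[j] for s in samples) for j in range(k)]
    variances = [statistics.variance([s[j] for s in samples]) for j in range(k)]
    # For a Dirichlet with precision s = sum(alpha): Var[p_j] = m_j (1 - m_j) / (s + 1),
    # so every component with non-zero variance yields its own estimate of s.
    estimates = [
        means[j] * (1.0 - means[j]) / variances[j] - 1.0
        for j in range(k)
        if variances[j] > 0
    ]
    if not estimates:
        raise ValueError("all components have zero variance")
    precision = statistics.median(estimates)  # combine the per-component estimates
    return [m * precision for m in means]


# Synthetic check: recover known parameters from simulated proportion vectors.
rng = random.Random(0)
true_alpha = [2.0, 3.0, 5.0]
samples = [sample_dirichlet(true_alpha, rng) for _ in range(5000)]
alpha_hat = dirichlet_moment_match(samples)
```

Moment matching is attractive inside a sampler because it needs only running means and variances, at the cost of being less statistically efficient than maximum likelihood.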
![Page 44: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/44.jpg)
Example Topics
LDA 100: motion detection field optical flow sensitive moving functional detect contrast light dimensional intensity computer mt measures occlusion temporal edge real
  “motion” (some generic)
LDA 20: visual model motion field object image images objects fields receptive eye position spatial direction target vision multiple figure orientation location
  “images, motion, eyes”
PAM 100: motion video surface surfaces figure scene camera noisy sequence activation generated analytical pixels measurements assigne advance lated shown closed perceptual
  “motion”
PAM 100: eye head vor vestibulo oculomotor vestibular vary reflex vi pan rapid semicircular canals responds streams cholinergic rotation topographically detectors ning
  “eyes”
PAM 100: image digit faces pixel surface interpolation scene people viewing neighboring sensors patches manifold dataset magnitude transparency rich dynamical amounts tor
  “images”
![Page 45: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/45.jpg)
Blind Topic Evaluation
• Randomly select 25 similar pairs of topics generated from PAM and LDA
• 5 people
• Each asked to “select the topic in each pair that you find more semantically coherent.”

Topic counts

| | LDA | PAM |
|---|---|---|
| 5 votes | 0 | 5 |
| >= 4 votes | 3 | 8 |
| >= 3 votes | 9 | 16 |
![Page 46: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/46.jpg)
Examples
PAM (5 votes): control, systems, robot, adaptive, environment, goal, state, controller
LDA (0 votes): control, systems, based, adaptive, direct, con, controller, change
![Page 47: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/47.jpg)
Examples
PAM (4 votes): motion, image, detection, images, scene, vision, texture, segmentation
LDA (1 vote): image, motion, images, multiple, local, generated, noisy, optical
![Page 48: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/48.jpg)
Examples
PAM (4 votes): algorithm, learning, algorithms, gradient, convergence, function, stochastic, weight
LDA (1 vote): algorithm, algorithms, gradient, convergence, stochastic, line, descent, converge

PAM (1 vote): signals, source, separation, eeg, sources, blind, single, event
LDA (4 votes): signal, signals, single, time, low, source, temporal, processing
![Page 49: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/49.jpg)
Topic Correlations
![Page 50: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/50.jpg)
Likelihood Comparison
• Varying number of topics
• PAM supports ~5x more topics than LDA
![Page 51: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/51.jpg)
Improve Basic Infrastructure of Topic Models
• Incorporate time
• Finer-grained, more interpretable topics by representing topic correlations
• Discover relevant phrases
• Map influence and impact
![Page 52: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/52.jpg)
Topic Interpretability
LDA: algorithms, algorithm, genetic, problems, efficient
Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function
![Page 53: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/53.jpg)
Topical N-gram Model
[Graphical model: at each position i (shown for i = 1 ... 4) a topic z_i, a uni- / bi-gram status y_i and a word w_i. Other labels in the figure include α, β, γ1, γ2 and plates D, T, W.]
[Wang, McCallum 2005]
See also: [Steyvers, Griffiths, Newman, Smyth 2005]
![Page 54: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/54.jpg)
Features of Topical N-Grams model
• Easily trained by Gibbs sampling
  – Can run efficiently on millions of words
• Topic-specific phrase discovery
  – “white house” has special meaning as a phrase in the politics topic,
  – ... but not in the real estate topic.
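The “white house” example can be illustrated with a heavily simplified generative sketch in which every position carries a topic, a uni- / bi-gram status and a word. The parameter tables, their values and the fallback rule for missing bigram contexts are all hypothetical simplifications made for this example; they are not taken from the talk.

```python
import random

# Hypothetical per-topic unigram distributions.
UNIGRAMS = [
    {"white": 0.3, "house": 0.3, "senate": 0.4},  # topic 0: "politics"
    {"white": 0.2, "house": 0.5, "garden": 0.3},  # topic 1: "real estate"
]
# P(status = bigram | previous topic, previous word); unlisted contexts default to 0.
BIGRAM_STATUS = {(0, "white"): 0.9}
# P(word | current topic, previous word), used only when the status is "bigram".
BIGRAMS = {(0, "white"): {"house": 1.0}}


def generate_tokens(theta, n_words, rng):
    """Generate (topic, status, word) triples; status 1 means the word continues a phrase."""
    tokens = []
    prev_z, prev_w = None, None
    for _ in range(n_words):
        # The bigram status depends on the previous topic and previous word.
        y = int(rng.random() < BIGRAM_STATUS.get((prev_z, prev_w), 0.0))
        z = rng.choices(range(len(theta)), weights=theta, k=1)[0]
        if y == 1 and (z, prev_w) not in BIGRAMS:
            y = 0  # toy simplification: fall back to a unigram when no bigram table exists
        dist = BIGRAMS[(z, prev_w)] if y == 1 else UNIGRAMS[z]
        vocab = list(dist)
        w = rng.choices(vocab, weights=[dist[v] for v in vocab], k=1)[0]
        tokens.append((z, y, w))
        prev_z, prev_w = z, w
    return tokens


rng = random.Random(1)
tokens = generate_tokens(theta=[0.5, 0.5], n_words=30, rng=rng)
```

In this toy setup “white” followed by “house” is marked as a phrase only when both words fall in the politics topic, mirroring the contrast drawn on the slide.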
![Page 55: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/55.jpg)
Topic Comparison
LDA: learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning
Topical N-grams (2): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning rl, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods
Topical N-grams (1): policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies
![Page 56: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/56.jpg)
Topic Comparison
LDA: word, system, recognition, hmm, speech, training, performance, phoneme, words, context, systems, frame, trained, speaker, sequence, speakers, mlp, frames, segmentation, models
Topical N-grams (2): speech recognition, training data, neural network, error rates, neural net, hidden markov model, feature vectors, continuous speech, training procedure, continuous speech recognition, gamma filter, hidden control, speech production, neural nets, input representation, output layers, training algorithm, test set, speech frames, speaker dependent
Topical N-grams (1): speech, word, training, system, recognition, hmm, speaker, performance, phoneme, acoustic, words, context, systems, frame, trained, sequence, phonetic, speakers, mlp, hybrid
![Page 57: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/57.jpg)
Improve Basic Infrastructure of Topic Models
• Incorporate time
• Finer-grained, more interpretable topics by representing topic correlations
• Discover relevant phrases
• Map influence and impact
![Page 58: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/58.jpg)
![Page 59: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/59.jpg)
Previous Systems
[Diagram: Research Paper entities linked by Cites relations.]
![Page 60: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/60.jpg)
More Entities and Relations
[Diagram: Research Paper, Cites, Person, University, Venue, Grant, Groups, Expertise.]
![Page 61: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/61.jpg)
![Page 62: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/62.jpg)
![Page 63: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/63.jpg)
![Page 64: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/64.jpg)
![Page 65: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/65.jpg)
![Page 66: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/66.jpg)
![Page 67: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/67.jpg)
Topical Transfer
Citation counts from one topic to another.
Map “producers and consumers”
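A minimal sketch of how such transfer counts could be tallied, assuming each paper has already been assigned one dominant topic. The `paper_topic` and `citations` structures, the toy data, and the function name `topical_transfer` are illustrative assumptions, not taken from the talk.

```python
from collections import Counter

def topical_transfer(paper_topic, citations):
    """Count citations that flow from one topic to another.

    paper_topic: dict mapping paper id -> topic label (assumed given).
    citations: iterable of (citing_paper, cited_paper) pairs.
    Returns a Counter keyed by (cited_topic, citing_topic); citations
    that stay inside a single topic are skipped.
    """
    counts = Counter()
    for citing, cited in citations:
        producer = paper_topic[cited]    # topic that produced the idea
        consumer = paper_topic[citing]   # topic that consumes it
        if producer != consumer:
            counts[(producer, consumer)] += 1
    return counts

# Toy data (hypothetical papers and topic assignments).
paper_topic = {
    "p1": "Digital Libraries",
    "p2": "Web Pages",
    "p3": "Web Pages",
    "p4": "Digital Libraries",
}
citations = [("p2", "p1"), ("p3", "p1"), ("p4", "p1")]
transfer = topical_transfer(paper_topic, citations)
```

Topics with large outgoing counts act as "producers" and topics with large incoming counts as "consumers" in this toy reading.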
![Page 68: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/68.jpg)
Topical Bibliometric Impact Measures
• Topical Citation Counts
• Topical Impact Factors
• Topical Longevity
• Topical Precedence
• Topical Diversity
• Topical Transfer
[Mann, Mimno, McCallum, 2006]
![Page 69: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/69.jpg)
Topical Transfer
Transfer from Digital Libraries to other topics

| Other topic | Cit’s | Paper Title |
| --- | --- | --- |
| Web Pages | 31 | Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,... 1999. |
| Computer Vision | 14 | On being ‘Undigital’ with digital cameras: extending the dynamic... |
| Video | 12 | Lessons learned from the creation and deployment of a terabyte digital video libr.. |
| Graphs | 12 | Trawling the Web for Emerging Cyber-Communities |
| Web Pages | 11 | WebBase: a repository of Web pages |
![Page 70: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/70.jpg)
Topical Diversity
Papers that had the most influence across many other fields...
![Page 71: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/71.jpg)
Topical Diversity
Entropy of the topic distribution among papers that cite this paper (this topic).
[Figure: examples of High Diversity and Low Diversity.]
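The entropy measure can be sketched in a few lines. This assumes each citing paper carries one hard topic label, which is a simplification; the function name `topical_diversity` and the toy topic labels are hypothetical.

```python
import math
from collections import Counter

def topical_diversity(citing_topics):
    """Shannon entropy (in bits) of the topics of the citing papers."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A paper cited evenly from four topics vs. one cited from a single topic.
high = topical_diversity(["vision", "nlp", "ir", "theory"])
low = topical_diversity(["nlp", "nlp", "nlp", "nlp"])
```

Higher entropy means the citations are spread over more topics; a paper cited from only one topic has zero entropy.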
![Page 72: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/72.jpg)
Topical Bibliometric Impact Measures
• Topical Citation Counts
• Topical Impact Factors
• Topical Longevity
• Topical Precedence
• Topical Diversity
• Topical Transfer
[Mann, Mimno, McCallum, 2006]
![Page 73: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/73.jpg)
Topical Precedence
Within a topic, what are the earliest papers that received more than n citations?
“Early-ness”
Speech Recognition:
Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953)
Spectrographic study of vowel reduction, B. Lindblom (1963)
Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)
Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)
Automatic Recognition of Speakers from Their Voices, B. Atal (1976)
![Page 74: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/74.jpg)
Topical Precedence
Within a topic, what are the earliest papers that received more than n citations?
“Early-ness”
Information Retrieval:
On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)
Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)
Relevance feedback in information retrieval, Rocchio (1971)
Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)
New experiments in relevance feedback, Ide (1971)
Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
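The precedence query is essentially a filter followed by a sort. The sketch below assumes papers are available as records with a topic label, a year, and a citation count; the record layout, the toy entries, and the name `topical_precedence` are assumptions for illustration.

```python
def topical_precedence(papers, topic, n, k=5):
    """Return the k earliest papers in `topic` with more than n citations."""
    eligible = [p for p in papers
                if p["topic"] == topic and p["citations"] > n]
    return sorted(eligible, key=lambda p: p["year"])[:k]

# Toy records (hypothetical titles and counts).
papers = [
    {"title": "A", "year": 1971, "topic": "IR", "citations": 40},
    {"title": "B", "year": 1960, "topic": "IR", "citations": 25},
    {"title": "C", "year": 1955, "topic": "IR", "citations": 3},
    {"title": "D", "year": 1953, "topic": "Speech", "citations": 90},
]
earliest = topical_precedence(papers, "IR", n=10)
```

Paper "C" is older than the others in its topic but falls below the citation threshold, so it is excluded.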
![Page 75: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/75.jpg)
Topical Transfer Through Time
• Can we predict which research topics will be “hot” at the ICML conference next year?
• ...based on
– the hot topics in “neighboring” venues last year
– learned “neighborhood” distances for venue pairs
![Page 76: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/76.jpg)
How do Ideas Progress Through Social Networks?
[Figure: hypothetical example of the idea “AdaBoost” moving among venues: COLT, ICML, ACL (NLP), ICCV (Vision), SIGIR (Info. Retrieval).]
![Page 79: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/79.jpg)
Topic Prediction Models
Static Model
Transfer Model
Linear Regression and Ridge Regression used for coefficient training.
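The model equations did not survive in this transcript, so the following is only a sketch of ridge-regression coefficient training under an assumed setup: each row of `X` holds one topic's share at each neighboring venue last year, and `y` holds its share at the target venue the following year. The data, the helper names, and this feature layout are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    d = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(row[i] * target for row, target in zip(X, y)) for i in range(d)]
    return solve(xtx, xty)

def predict(w, x):
    """Predicted topic share at the target venue."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical topic shares at two neighboring venues (columns).
X = [[0.10, 0.02], [0.12, 0.03], [0.15, 0.05], [0.20, 0.06]]
y = [0.05, 0.06, 0.08, 0.10]  # share at the target venue one year later
w_linear = ridge_fit(X, y, lam=0.0)   # plain linear regression
w_ridge = ridge_fit(X, y, lam=0.1)    # ridge: coefficients shrink toward 0
```

Setting `lam=0.0` recovers ordinary linear regression; a positive `lam` penalizes large coefficients, which tends to help when there are many venues and few years of data.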
![Page 80: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/80.jpg)
Preliminary Results
Transfer Model with Ridge Regression is a good Predictor
[Figure: mean squared prediction error (smaller is better) vs. # venues used for prediction; the Transfer Model curve is labeled.]
![Page 81: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/81.jpg)
Toward More Detailed, Structured Data
Leveraging Text in Social Network Analysis
[Diagram: Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support).]
Extract structured data about entities, relations, events
Structured Topic Models
![Page 82: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/82.jpg)
Joint Inference
[Diagram: Spider → Filter → Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support). Uncertainty Info and Emerging Patterns are exchanged between IE and Data Mining.]
![Page 83: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/83.jpg)
Unified Model
[Diagram: Spider → Filter → Document collection → IE (Segment, Classify, Associate, Cluster) and Data Mining (Discover patterns: entity types, links / relations, events) around a shared Probabilistic Model → Actionable knowledge (Prediction, Outlier detection, Decision support).]
Solution:
• Conditional Random Fields [Lafferty, McCallum, Pereira]
• Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]
Discriminatively-trained undirected graphical models
Complex Inference and Learning
Just what we researchers like to sink our teeth into!
![Page 84: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/84.jpg)
(Linear Chain) Conditional Random Fields
Undirected graphical model, trained to maximize conditional probability of output sequence given input sequence
[Figure: finite state model and graphical model views of a linear chain. FSM states (output seq): y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}, . . . Observations (input seq): x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}, . . .]
input seq: said Jones a Microsoft VP …
output seq: OTHER PERSON OTHER ORG TITLE …
Wide-spread interest, positive experimental results in many applications.
• Noun phrase, Named entity [HLT’03], [CoNLL’03]
• Protein structure prediction [ICML’04]
• IE from Bioinformatics text [Bioinformatics ‘04],…
• Asian word segmentation [COLING’04], [ACL’04]
• IE from Research papers [HLT’04]
• Object classification in images [CVPR ‘04]
[Lafferty, McCallum, Pereira 2001]
$$p(y \mid x) = \frac{1}{Z_x} \prod_t \Phi(y_t, y_{t-1}, x, t) \quad \text{where} \quad \Phi(y_t, y_{t-1}, x, t) = \exp\left( \sum_k \lambda_k f_k(y_t, y_{t-1}, x, t) \right)$$
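As a worked illustration of the linear-chain formula, the sketch below scores the slide's example sentence, computes the normalizer Z_x with the forward recursion, and checks it by brute-force enumeration. The feature templates, the weights λ_k, and the "START" symbol used as y_{t-1} at the first position are all made up for this example; they are not the features or parameters of any trained model.

```python
import itertools
import math

LABELS = ["OTHER", "PERSON", "ORG", "TITLE"]

def active_features(y_t, y_prev, x, t):
    """Names of the binary features f_k(y_t, y_{t-1}, x, t) that fire (hypothetical templates)."""
    word = x[t]
    return [
        f"word={word.lower()}|y={y_t}",
        f"cap={word[0].isupper()}|y={y_t}",
        f"prev={y_prev}|y={y_t}",
    ]

def log_phi(weights, y_t, y_prev, x, t):
    """log Phi(y_t, y_{t-1}, x, t) = sum_k lambda_k f_k(y_t, y_{t-1}, x, t)."""
    return sum(weights.get(name, 0.0) for name in active_features(y_t, y_prev, x, t))

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_z(weights, x):
    """log Z_x via the forward recursion over label positions."""
    alpha = {y: log_phi(weights, y, "START", x, 0) for y in LABELS}
    for t in range(1, len(x)):
        alpha = {y: logsumexp([alpha[yp] + log_phi(weights, y, yp, x, t)
                               for yp in LABELS])
                 for y in LABELS}
    return logsumexp(list(alpha.values()))

def sequence_prob(weights, y, x):
    """p(y | x) = (1 / Z_x) * prod_t Phi(y_t, y_{t-1}, x, t)."""
    score, y_prev = 0.0, "START"
    for t, y_t in enumerate(y):
        score += log_phi(weights, y_t, y_prev, x, t)
        y_prev = y_t
    return math.exp(score - log_z(weights, x))

# Hand-set illustrative weights (not learned).
WEIGHTS = {
    "word=jones|y=PERSON": 2.0,
    "word=microsoft|y=ORG": 2.0,
    "word=vp|y=TITLE": 2.0,
    "cap=False|y=OTHER": 1.0,
    "prev=ORG|y=TITLE": 0.5,
}
x = ["said", "Jones", "a", "Microsoft", "VP"]
gold = ("OTHER", "PERSON", "OTHER", "ORG", "TITLE")
p_gold = sequence_prob(WEIGHTS, gold, x)

# Brute-force check: probabilities of all 4^5 label sequences sum to 1.
total = sum(sequence_prob(WEIGHTS, y, x)
            for y in itertools.product(LABELS, repeat=len(x)))
```

The forward recursion costs time linear in the sequence length, whereas the brute-force sum grows exponentially; the latter is included only as a sanity check on a five-word input.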
![Page 85: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/85.jpg)
Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
      :            :         Production of Milk and Milkfat 2/
      :   Number   :-------------------------------------------------------
 Year :     of     :   Per Milk Cow    :  Percentage   :       Total
      :Milk Cows 1/:-------------------: of Fat in All :------------------
      :            :  Milk   : Milkfat : Milk Produced :   Milk   : Milkfat
--------------------------------------------------------------------------------
      : 1,000 Head     --- Pounds ---       Percent        Million Pounds
      :
 1993 :    9,589      15,704     575         3.66        150,582    5,514.4
 1994 :    9,500      16,175     592         3.66        153,664    5,623.7
 1995 :    9,461      16,451     602         3.66        155,644    5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
![Page 86: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/86.jpg)
Table Extraction from Government Reports
CRF Labels:
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ... (12 in all)
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]
Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
100+ documents from www.fedstats.gov
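The per-line features listed above are simple to compute. The sketch below is one possible reading of them, not the feature extractor from the cited paper: in particular, "aligns with prev." is approximated here by checking whether any column gap (two or more spaces) ends at the same position in both lines, and the feature names are invented.

```python
import re

def gap_columns(line):
    """Columns where a token starts right after a run of 2+ spaces (assumed alignment cue)."""
    return {m.end() for m in re.finditer(r"\S {2,}(?=\S)", line)}

def line_features(line, prev_line=""):
    """Layout features for one line of a plain-text report."""
    n = max(len(line), 1)
    return {
        "pct_digit": 100.0 * sum(c.isdigit() for c in line) / n,
        "pct_alpha": 100.0 * sum(c.isalpha() for c in line) / n,
        "indented": line.startswith((" ", "\t")),
        "five_plus_spaces": re.search(r" {5,}", line) is not None,
        "aligns_with_prev": bool(gap_columns(line) & gap_columns(prev_line)),
    }

prose = "Marketings include whole milk sold to plants and dealers"
row_1993 = " 1993 :    9,589      15,704     575"
row_1994 = " 1994 :    9,500      16,175     592"

prose_feats = line_features(prose)
row_feats = line_features(row_1994, prev_line=row_1993)
```

A sequence model such as a CRF would then label each line using these features, together with conjunctions of features taken from neighboring lines.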
![Page 87: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/87.jpg)
Table Extraction Experimental Results

| Model | Line labels, percent correct |
| --- | --- |
| HMM | 65 % |
| Stateless MaxEnt | 85 % |
| CRF | 95 % |

[Pinto, McCallum, Wei, Croft, 2003 SIGIR]
![Page 88: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/88.jpg)
IE from Research Papers
[McCallum et al ‘99]
![Page 89: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/89.jpg)
IE from Research Papers
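
| Method | Field-level F1 |
| --- | --- |
| Hidden Markov Models (HMMs) [Seymore, McCallum, Rosenfeld, 1999] | 75.6 |
| Support Vector Machines (SVMs) [Han, Giles, et al, 2003] | 89.7 |
| Conditional Random Fields (CRFs) [Peng, McCallum, 2004] | 93.9 |

Error reduction of about 40% from SVMs to CRFs (10.3 to 6.1 points of F1 error).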
![Page 90: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/90.jpg)
Named Entity Recognition
CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

| Labels | Examples |
| --- | --- |
| PER | Yayuk Basuki, Innocent Butare |
| ORG | 3M, KDP, Cleveland |
| LOC | Cleveland, Nirmal Hriday, The Oval |
| MISC | Java, Basque, 1,000 Lakes Rally |

![Page 91: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/91.jpg)
Named Entity Extraction Results

| Method | F1 |
| --- | --- |
| HMMs (BBN's IdentiFinder) | 73% |
| CRFs | 90% |

[McCallum & Li, 2003, CoNLL]
![Page 92: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/92.jpg)
MALLET
Machine Learning for LanguagE Toolkit
• ~80k lines of Java
• Based on experience with previous toolkits
– Rainbow, WhizBang, GATE, Weka.
• Document classification, information extraction, clustering, co-reference, cross-document co-reference, POS tagging, shallow parsing, relational classification, sequence alignment, structured topic models, social network analysis with text.
• Infrastructure for pipelining feature extraction and processing steps.
• Many ML basics in common, convenient framework:
– naïve Bayes, MaxEnt, Boosting, SVMs; Dirichlets, Conjugate Gradient
• Advanced ML algorithms:
– Conditional Random Fields, BFGS, Expectation Propagation, …
• Unlike other general toolkits (e.g. Weka), MALLET scales to millions of features, millions of training examples, as needed for NLP.
• Now being used in many universities & companies all over the world:
– MIT, CMU, UPenn, Berkeley, UTexas, Purdue, Oregon State, UWash, UMass, Google, Yahoo, BAE.
– Also in UK, Germany, France.
![Page 93: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/93.jpg)
Semi-Supervised Learning
• Labeled data is expensive
– Especially for sequence modeling tasks
– POS tagging, word segmentation, NER
• Unlabeled data is abundant
– The Web
– Newswire
– Other internal reports
– etc.
![Page 94: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/94.jpg)
HMM-LDA Model
• Distinguish between semantic words and syntactic words
[Griffiths, et al. 2004]
![Page 95: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/95.jpg)
Experiments
• Dataset
– Wall Street Journal (WSJ) collection labeled with part-of-speech tags. The corpus contains 2312 documents, 38665 unique words, and 1.2M word tokens.
• 50 topics and 40 syntactic classes
• Gibbs sampling
– 40 samples with a lag of 100 iterations between them and an initial burn-in period of 4000 iterations.
![Page 96: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/96.jpg)
Sample Syntactic Clusters
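
| Word | Prob. | Word | Prob. | Word | Prob. | Word | Prob. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| make | 0.0279 | of | 0.7448 | way | 0.0172 | last | 0.0767 |
| sell | 0.0210 | in | 0.0828 | agreement | 0.0140 | first | 0.0740 |
| buy | 0.0174 | for | 0.0355 | price | 0.0136 | next | 0.0479 |
| take | 0.0164 | from | 0.0239 | time | 0.0121 | york | 0.0433 |
| get | 0.0157 | and | 0.0238 | bid | 0.0103 | third | 0.0424 |
| do | 0.0155 | to | 0.0185 | effort | 0.0100 | past | 0.0368 |
| pay | 0.0152 | ; | 0.0096 | position | 0.0098 | this | 0.0361 |
| go | 0.0113 | with | 0.0073 | meeting | 0.0098 | dow | 0.0295 |
| give | 0.0104 | that | 0.0055 | offer | 0.0093 | federal | 0.0288 |
| provide | 0.0086 | or | 0.0039 | day | 0.0092 | fiscal | 0.0262 |

Table 1: Sample syntactic word clusters. Each pair of columns displays the top 10 words in one cluster and their probabilities.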
![Page 97: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/97.jpg)
Sample Semantic Clusters

| Word | Prob. | Word | Prob. | Word | Prob. | Word | Prob. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bank | 0.0918 | computer | 0.0610 | jaguar | 0.0824 | ad | 0.0314 |
| loans | 0.0327 | computers | 0.0301 | ford | 0.0641 | advertising | 0.0298 |
| banks | 0.0291 | ibm | 0.0280 | gm | 0.0353 | agency | 0.0268 |
| loan | 0.0289 | data | 0.0200 | shares | 0.0249 | brand | 0.0181 |
| thrift | 0.0264 | machines | 0.0191 | auto | 0.0172 | ads | 0.0177 |
| assets | 0.0235 | technology | 0.0182 | express | 0.0144 | saatchi | 0.0162 |
| savings | 0.0220 | software | 0.0176 | maker | 0.0136 | brands | 0.0142 |
| federal | 0.0179 | digital | 0.0173 | car | 0.0134 | account | 0.0120 |
| regulators | 0.0146 | systems | 0.0169 | share | 0.0128 | industry | 0.0106 |
| debt | 0.0142 | business | 0.0151 | saab | 0.0116 | clients | 0.0105 |

Table 2: Sample semantic word clusters. Each pair of columns displays the top 10 words in one cluster and their probabilities.
![Page 98: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/98.jpg)
POS Tagging
• Features
– Word unigrams and bigrams
– Spelling features
– Word suffixes
– Cluster features
• HMM-LDA: the most likely class assignment for each word over all the samples
• HC (hierarchical clustering): bit string prefixes of lengths 8, 12, 16 and 20
• CRFs
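A sketch of how the two kinds of cluster features might be generated for a word, assuming a lookup from words to hierarchical-clustering bit strings and a lookup from words to their most likely HMM-LDA class. Both lookups, the feature-name format, and the example entries are hypothetical.

```python
PREFIX_LENGTHS = (8, 12, 16, 20)

def cluster_features(word, bit_strings, hmm_lda_class):
    """Return cluster feature names for `word`.

    bit_strings: dict word -> bit string from a hierarchical clustering (assumed given).
    hmm_lda_class: dict word -> most likely HMM-LDA class id (assumed given).
    """
    feats = []
    bits = bit_strings.get(word)
    if bits is not None:
        # A prefix of the bit string names an ancestor cluster in the hierarchy;
        # strings shorter than a prefix length are used whole.
        for n in PREFIX_LENGTHS:
            feats.append(f"hc{n}={bits[:n]}")
    cls = hmm_lda_class.get(word)
    if cls is not None:
        feats.append(f"hmmlda={cls}")
    return feats

# Hypothetical lookups for one word.
bit_strings = {"bank": "1011001110100101101011"}
hmm_lda_class = {"bank": 17}
feats = cluster_features("bank", bit_strings, hmm_lda_class)
```

Because these lookups can be built from unlabeled text, words never seen in the labeled training data can still receive informative features.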
![Page 99: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/99.jpg)
Evaluation Results
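Values in parentheses are the relative error reduction (%) over No Clusters.

(a) 10k labeled words, OOV rate = 24.46%

| Error (%) | No Clusters | Hierarchical | HMM-LDA |
| --- | --- | --- | --- |
| Overall | 10.04 | 9.46 (5.78) | 8.56 (14.74) |
| OOV | 22.32 | 21.56 (3.40) | 18.49 (17.16) |

(b) 30k labeled words, OOV rate = 15.31%

| Error (%) | No Clusters | Hierarchical | HMM-LDA |
| --- | --- | --- | --- |
| Overall | 6.08 | 5.85 (3.78) | 5.40 (11.18) |
| OOV | 17.34 | 17.35 (-0.00) | 15.01 (13.44) |

(c) 50k labeled words, OOV rate = 12.49%

| Error (%) | No Clusters | Hierarchical | HMM-LDA |
| --- | --- | --- | --- |
| Overall | 5.34 | 5.12 (4.12) | 4.78 (10.30) |
| OOV | 16.36 | 16.21 (0.92) | 14.45 (11.67) |

18% reduction in error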
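The parenthesized figures in the results can be reproduced as relative error reductions against the No Clusters baseline. A small check, using only numbers from panel (a); the function name is illustrative.

```python
def relative_reduction(baseline_error, new_error):
    """Relative error reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Panel (a), 10k labeled words.
overall_hc = relative_reduction(10.04, 9.46)
overall_hmm_lda = relative_reduction(10.04, 8.56)
oov_hmm_lda = relative_reduction(22.32, 18.49)
```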
![Page 100: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/100.jpg)
Desired Future Work
• Add more “structured data types” to topic models.
• Leverage Pachinko Allocation to learn topic hierarchies and topic correlations in time.
• New type of topic model
– fast enough to work on streaming data
– more naturally combines many data modalities (add more “structured data types” together)
– topics defined by both positive and negative features
• Use structured topic models to help predict influence and impact.
• Extremely low-supervision training of information extractors. Discover interesting entity/relation classes.
![Page 101: Information Extraction, Social Network Analysis Structured Topic Models & Influence Mapping Andrew McCallum mccallum@cs.umass.edu Information Extraction.](https://reader036.fdocuments.us/reader036/viewer/2022062801/56649e4b5503460f94b3fe77/html5/thumbnails/101.jpg)
End of Talk