Pharos Summer School Fundamentals of Social Applications
June 2009, Avaré Stewart
http://www.l3s.uni-hannover.de/~stewart/pharos/
Roadmap
• Part I: Overview of Social Applications – current shortcomings, solutions
• Part II: Information Extraction (IE) – tasks, techniques, tools
• Part III: Evaluation
• Part IV: IE & IR Applications in Context
Overview of Social Applications
The Social Applications Phenomenon
The social application phenomenon today is driven by social media
Social Media:
• information content of the “citizen journalist”: user-generated content
• a popular way for people to connect in the online world, for personal & business relationships
20. April 2023, Avaré Stewart
What's the Social Media Hype?
• Coverage:
– Reach small or large audiences
– Breaks publication barriers
• Business / Advertisement
– Repeated visiting: with the best links, readers will come back
• Information Gathering / Sharing:
– Cut the time you spend looking
– Link economy is real… Give some, get some
– Dynamic content: not the endpoint of a conversation, but the beginning…
• Social Intervention / Detection
– Rumors, fads, infectious disease
Capitalize on Social Processes: Diffusion / Cascade
The core concepts of social media (Espoo, April 2007)
The Many Faces of Social Applications
Domain:
• Music, politics, cycling, medicine
Media Type:
• Video: YouTube, Daily Motion
Facebook
Services:
• meeting people
• expressing a point of view
• serendipitous discovery
What Are Some Limitations with Social Applications?
Social sites intentionally seek distinction
Problem: their sheer number leads to redundancy and overlap in:
• type of media, resources
• topics
Overlaps exist, but remain untapped for the benefit of those who actually constitute the social networking ecosystem
Social Networking Divide
Where's the “Social” Web?
The so-called Social Web is, ironically, divided
Open Social Networking (OSN)
Aspects of an Open Social Network
• Unified Data Spaces
• Personal Identity Unification
• Unified Applications
Unified Data Spaces: Linking Open Data Cloud
http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
Personal Identity Unification
• OpenID: a single digital identity
• Retaggr: social media profile card
• Geek Chart: graphical profile (pie chart)
• DandyID: collect online profiles in one place
• FriendFeed: real-time aggregator, consolidates the updates from sites
Unified Applications
• Multi-Site APIs: common API for social applications across multiple websites
– OpenSocial
– Data Portability Project
• Single-Site APIs: partner / interact programmatically
– YouTube Data API: videos
– Spinn3r: indexing the blogosphere
– etc.
Bloggers Who Don’t Tag
Taggers Who Don’t Blog
???
Social Network Divide
Pharos Scenario
Missing Link: Cross-Tagging
Avaré Bonaparte Stewart
Exploit the tag assertions made by users of one social site to personalize the experience for users of another, comparable site
Overview: Cross Tagging
Better Recommendations
Cross-Tagging for Personalized Open Social Networking, Stewart, Diaz, Balby Marinho 2008
Better Browsing
Better Search
What More Can We Do with Social Applications?
Social Media Communities & Content
Espoo, April 2007
Social media: examined primarily for its popularity in connecting people
In Pharos: examine blogs for improved, personalized information access
Complex Information Needs & Social Media Search
• Polarity, opinion
• Memes and themes
• Related, multi-lingual resources
• Entities: people, organizations, etc.
• Relationships between entities
• Event: who, what, where, when, how
Events? ... Momentum is Shifting
• Industry:
– Complex Event Processing (CEP)
– Event correlation: Event Filtering, Event Aggregation, Event Masking, Root Cause Analysis
• Research:
– Event detection
– Associations
– De-duplication
Humans think in terms of events and entities
Events: a natural abstraction of the real world
Information Retrieval, Meet Information Extraction ... from Blogs
• Information Extraction (IE):
– a subarea of Natural Language Processing (NLP)
– needed to solve complex (event-driven) information needs
– hard, because natural language is complex, vague and ambiguous, i.e. unstructured
– potentially harder for blogs & informal sources
Figure labels: IE, IR, Social Media
Anatomy of a Blog
Tag, Content, Permalink, Timestamp, Title, Feed, Blogroll, Comment, Trackback, Archive, Author
Rich Source for Personalized Information
Part II: Information Extraction
Tasks, Techniques and Tools
What is Information Extraction ?
Unstructured Data
• Encoded in a way that makes it difficult for computers to immediately interpret
• Multiple languages, across multiple documents
Why Information Extraction?
• Large amount of unstructured or semi-structured information
– Web pages, email, news articles, call-center text records, business reports, annotations, spreadsheets, research papers, blogs, tags, instant messages (IM), …
• High-impact applications
– Business intelligence, personal information management, Web communities, Web search and advertising, scientific data management, e-government, medical records management, …
• Open ended and growing rapidly
• Information Extraction:
– Superimpose formal meaning on unstructured information
– Elicit facts and relationships
– Feed database / knowledge base
Why? ... Information is Locked Away...
Inaccessible data ... growing; sophisticated needs ... growing
Events, Facts, Relationships
What is Information Extraction (IE) ?
• ... isolates relevant text fragments, extracts relevant information from the fragments, and pieces together the targeted information in a coherent framework
• ... builds systems that find and link relevant information while ignoring extraneous and irrelevant information
(Cowie and Lehnert, 1996, p. 81)
IE is used to get specific information out of unstructured data
Information Extraction: e.g. Disaster
Unstructured Text → Information Extraction (IE) System → Structured Text
Information Extraction: Major Tasks
• Segmentation
– Tokenization, Sentence Splitting
• Classification
– POS Tagging, Lemmatization, Disambiguation, …
– Entity Detection
• Association
– Noun Phrase Chunking
– Parsing
– Relationship Detection
• Normalization & Deduplication
– Anaphora Resolution
– Normalization of Formats, Schema
– Record Linkage, Record Deduplication
– Mention Tracking
What are the Components and Tasks of an Information Extraction System?
General View of IE System (Moens 06; Information Extraction, Moens)
Figure labels:
• Phases: Training Phase, Deployment Phase
• INPUT: Training corpus; INPUT: Source Text
• Preprocessing (in both phases)
• Extraction, Acquisition, Learning
• Extraction Grammar
• External Knowledge: Thesaurus, Ontology, Knowledge Base
• Feedback
• OUTPUT: Structured Information
Common IE Tasks: Preprocessing & Recognition
Pre-Processing Tasks
Normalization
Sentence Splitting
Tokenization
POS Tagging
Chunking
Parsing
Sense Disambiguation
Recognition Tasks
Named Entity (NE)
Co-reference Resolution (CO)
Template Element Construction (TE)
Template Relation Construction (TR)
Scenario Template (ST)
Semantic Role
Timex Line Recognition
Ex: Text Normalization
AVIAN INFLUENZA, HUMAN (101): EGYPT, 79TH, 80TH CASES
*****************************************************
A ProMED-mail post
<http://www.promedmail.org>
ProMED-mail is a program of the
International Society for Infectious Diseases
http://www.isid.org
Date: Mon 8 Jun 2009
Source: Egyptian Chronicles [edited]
<http://egyptianchronicles.blogspot.com/2009/06/h5n1-follow-up-no80.html>
Clean junk formatting
• Transformed to make it consistent
• Performed before text is processed
Sentence Splitting
• Segments text into sentences
• Required for the tagger
• Domain- and application-independent
He called Mr. White at 4p.m. in Washington, D.C. Mr. Green responded.
The computer must tell which of the dots denote an actual sentence boundary
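A minimal sketch of rule-based sentence splitting on the example sentence. It assumes a tiny hand-made list of titles (`NON_TERMINAL`, a hypothetical name) that never end a sentence, and treats a dot as a boundary only when whitespace and an uppercase letter follow. Real splitters rely on much larger abbreviation lists or trained models.

```python
import re

# Assumption: a small illustrative list of titles that never end a sentence.
NON_TERMINAL = {"Mr", "Mrs", "Ms", "Dr", "Prof"}

def split_sentences(text):
    """Split text at ., ! or ? when followed by whitespace and an uppercase letter."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        words = text[start:match.start()].split()
        last_word = words[-1] if words else ""
        if last_word in NON_TERMINAL:
            continue  # "Mr." and similar titles do not close a sentence
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

example = "He called Mr. White at 4p.m. in Washington, D.C. Mr. Green responded."
print(split_sentences(example))
```

The dot in "4p.m." is skipped because a lowercase word follows it, while "D.C." closes the first sentence only because the next word is capitalized; this heuristic would fail on "D.C. Mayor", which is why the task is harder than it looks.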
Tokenization
• Tokenization / Word Segmentation:
– Numbers, punctuation, symbols
– Is a word a string of contiguous alphanumeric characters with space on either side?
Words are not always surrounded by whitespace:
Abbreviations such as etc. and Calif.
A text-based medium.
Whitespace not indicating a word break:
San Francisco
Ditto: in spite of
Phone: 0171 378 0647
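The cases above can be sketched with a small tokenizer. The abbreviation set and multi-word list below are hypothetical, hand-made stand-ins for the lexical resources a real tokenizer would consult; the regular expression keeps hyphenated words together and splits off punctuation.

```python
import re

# Assumptions: tiny illustrative lexicons, not a real resource.
ABBREVIATIONS = {"etc.", "Calif."}
MULTIWORD = [("San", "Francisco"), ("in", "spite", "of")]

# A word (optionally hyphenated) or a single punctuation mark / symbol.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\w\s]")

def tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)  # keep the trailing dot with the abbreviation
        else:
            tokens.extend(WORD_RE.findall(chunk))
    # Second pass: merge known multi-word units into single tokens.
    merged, i = [], 0
    while i < len(tokens):
        for unit in MULTIWORD:
            if tuple(tokens[i:i + len(unit)]) == unit:
                merged.append(" ".join(unit))
                i += len(unit)
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(tokenize("A text-based medium."))
print(tokenize("He lives in San Francisco, Calif."))
```

A phone number such as "0171 378 0647" would still be split into three tokens here; handling it needs an additional pattern or a later normalization step.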
Parts of Speech (POS)
• POS: category / class
• Words in the same class have similar syntactic behavior
• Ex: nouns: person, place, thing, animal
• Ex: verbs express action
Ex: Penn Treebank POS Tagset
Tag    Description                                Example
CC     Coordinating conjunction                   and, but, or
CD     Cardinal number                            one, two
DT     Determiner                                 a, the
EX     Existential there                          there
FW     Foreign word                               mea culpa
IN     Preposition / subordinating conjunction    of, in, by
JJ     Adjective                                  yellow
JJR    Adjective, comparative                     bigger
JJS    Adjective, superlative                     wildest
LS     List item marker                           1, 2, One
MD     Modal                                      can, should
NN     Noun, singular                             dog
NNS    Noun, plural                               dogs
NNP    Proper noun, singular                      IBM
NNPS   Proper noun, plural                        West Indies
PDT    Predeterminer                              all, both
POS    Possessive ending                          's
PRP    Personal pronoun                           I, you, he
RB     Adverb                                     quickly, never
RBR    Adverb, comparative                        faster
RBS    Adverb, superlative                        fastest
RP     Particle                                   up, off
SYM    Symbol                                     +, %, &
TO     To                                         to
UH     Interjection                               ah, oops
VB     Verb, base form                            eat
VBD    Verb, past tense                           ate
VBG    Verb, gerund                               eating
VBN    Verb, past participle                      eaten
VBP    Verb, non-3rd person singular present      eat
VBZ    Verb, 3rd person singular present          eats
WDT    Wh-determiner                              which, that
WP     Wh-pronoun                                 what, who
WP$    Possessive wh-pronoun                      whose
WRB    Wh-adverb                                  how, where
Punctuation and symbol tags: $ # ( )
Chunking
• Words are organized into groups
• Phrases: word groupings, clumped as a unit
Parsing
• Labeled syntactic tree corresponding to the interpretation of the sentence
• Resolution of syntactic ambiguities
Fruit flies like a banana
Time flies like an arrow
Sense Disambiguation
What are Some Basic Recognition Tasks?
IE Recognition Tasks
MUC Recognition Tasks
Named Entity (NE)
Co-reference Resolution (CO)
Template Element Construction (TE)
Template Relation Construction (TR)
Scenario Template (ST)
ACE Recognition Tasks
Entity detection and tracking (EDT)
Relation detection and characterization (RDC)
Event detection and characterization (EDC)
Temporal expression detection (TERN)
Timeline (Year: Event):
1987: MUC-1, 1989: MUC-2, 1991: MUC-3, 1992: MUC-4, 1993: MUC-5, 1995: MUC-6, 1998: MUC-7, 1999: ACE Pilot, 2002: ACE, . . ., 2009: ACE + Text Analysis Conference (TAC)
Named Entity Recognition (NE)
• recognition of entity names:
– people, organizations
– place names
– temporal expressions & numerical expressions
Co-reference Resolution (CO)
• Identify chains of noun phrases that refer to the same object
• Scope:
– Within document
– Across documents
John saw Mary. The girl was very beautiful; she wore a new red dress.
• Types:
– Pronominal: ’they’, ’it’, ’he’, ’hers’, ’themselves’, etc. resolve to proper nouns, common nouns, or other pronouns
Proper Noun Coreference
• Names of people, places, products and companies are referred to in many different variations.
– Minnesota Mining and Manufacturing, 3M Corp., 3M
– New York, New York City, NYC, N.Y.C
Ref: Coreference as a Foundation for Link Analysis over Free Text
Other Coreference Types
• Apposition: two noun phrases side by side, where one defines or modifies the other
John Smith, chairman of General Electric, resigned yesterday.
• Predicate Nominal: a noun phrase is the main predicate of a sentence; subject and predicate nominal are connected by a linking verb (copula)
John is the finest juggler in the world.
Template Element Construction (TE)
• Specified classes and attributes of entities:
– person: name (name variants)
– title, nationality
– description in the text
– subtype
Template Relation Construction (TR)
• Two-slot template representing a binary relation:
– e.g., employee_of, product_of, location_of
– pointers to template elements
Fei-Yu Xu 08
Scenario Template Production (ST)
• information involving several relations or events:
– Joint venture
– Partners
– Products
– Profits
Fei-Yu Xu 08
Can We Extract Temporal Expressions?
Temporal expression detection (TERN)
• Time Expression Recognition and Normalization
– recognize and normalize expressions that refer to date and time
– Timestamp of events
– Meaning of temporal expressions
– Conditions associating time with a relation / event
• TIMEX2 Standard
• XML tags + time
• second-generation TIMEX
Some Examples: TIMEX2 Time
Reference date: Thursday, July 15, 1999
Precise Time:
I was sick <TIMEX2 VAL="1999-07-14"> yesterday </TIMEX2>.
The contractor submitted a proposal on <TIMEX2 VAL="1999-07-13"> Tuesday </TIMEX2>.
Duration:
I will be on vacation for <TIMEX2 VAL="P3W" ANCHOR_DIR="AFTER" ANCHOR_VAL="1999-07-15"> three weeks </TIMEX2>.
Pronouns:
<TIMEX2 VAL="1999-07-14"> The day after <TIMEX2 VAL="1999-07-13"> that </TIMEX2> </TIMEX2>, the contract was awarded.
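A rough sketch of the normalization step behind the precise-time examples, assuming the reference date Thursday, July 15, 1999. The function names and the rule that a bare weekday name means the most recent past one are assumptions for illustration; they are not part of the TIMEX2 standard, and only the `VAL` attribute is produced.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
RELATIVE_DAYS = {"yesterday": -1, "today": 0, "tomorrow": 1}

def normalize(word, reference):
    """Return the date a simple temporal word refers to, or None if unknown."""
    key = word.lower()
    if key in RELATIVE_DAYS:
        return reference + timedelta(days=RELATIVE_DAYS[key])
    if key in WEEKDAYS:
        # Assumption: a bare weekday name refers to the most recent past one.
        days_back = (reference.weekday() - WEEKDAYS.index(key)) % 7 or 7
        return reference - timedelta(days=days_back)
    return None

def tag(word, reference):
    """Wrap a recognized temporal word in a TIMEX2 tag with a normalized VAL."""
    value = normalize(word, reference)
    if value is None:
        return word
    return f'<TIMEX2 VAL="{value.isoformat()}">{word}</TIMEX2>'

reference = date(1999, 7, 15)  # Thursday, July 15, 1999
print(tag("yesterday", reference))  # <TIMEX2 VAL="1999-07-14">yesterday</TIMEX2>
print(tag("Tuesday", reference))    # <TIMEX2 VAL="1999-07-13">Tuesday</TIMEX2>
```

Durations ("three weeks") and anaphoric expressions ("the day after that") need the anchoring attributes and discourse context, which this sketch does not attempt.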
State of the Art Performance
• Named entity recognition
– Person, Location, Organization, …
– F1 in high 80’s or low- to mid-90’s
• Binary relation extraction
– Contained-in (Location1, Location2), Member-of (Person1, Organization1)
– F1 in 60’s or 70’s or 80’s
• N-ary relation extraction, event detection
– Much lower -> errors accumulate!
How Can Information Extraction Be Performed?
Common IE Techniques
• Knowledge Engineering
• Corpus Based / Machine Learning
Classification for IE
• Many problems in IE can be re-formulated as classification problems
• Features: object description, context
• Class: the category to which the object belongs
• Input: training data
• Classifier: learning algorithm
• Output: a hypothesis that fits the data
Classification Scheme
• The class / semantic distinction that we want to assign to an information unit:
– Named Entity: protein, drug, disease
– Semantic Role: e.g. for a verb: agent
– Grammatical Role: object, subject
– Domain Independent: person, organization
– Sentence boundary: {!, ., -}
Ex: Features for Semantic Role Recognition
Feature               Value
Phrase type           Noun / verb phrase, determined by the POS tag of the syntactic head
Syntactic head        Word that composes the syntactic head of the phrase that represents i
Voice                 Active or passive
Named entity class    Class (e.g. person, organization) of the syntactic head
Moens06
The actual set of features used is determined by a feature selection strategy specific to the problem at hand
Ex: Features for Coreference Resolution (CO)
Feature               Value
Number agreement      True if i and j agree in number
Gender agreement      True if i and j agree in gender
Alias                 True if i is an alias of j, or vice versa
Pronoun i (j)         True if i (j) is a pronoun
Appositive            True if j is an appositive of i
Definiteness          True if j is preceded by "the" or a demonstrative pronoun
Grammatical role      True if the grammatical roles of i and j match, e.g. subject, direct / indirect object
Proper name           True if both are proper names
Named entity class    True if both have the same semantic class
Discourse distance    Number of sentences or words that i and j are apart
Moens06
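To make the classification view concrete, here is a small sketch that turns a pair of mentions (i, j) into a feature dictionary covering a few of the coreference features listed above. The `Mention` class and its fields are hypothetical; in practice the attribute values would come from the preprocessing steps (POS tagging, NER), and the resulting vectors would be fed to a learning algorithm.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    """Hypothetical mention record; values would come from preprocessing."""
    text: str
    number: str       # "sing" or "plur"
    gender: str       # "m", "f" or "n"
    is_pronoun: bool
    is_proper: bool
    sentence: int     # index of the sentence containing the mention

def pair_features(i, j):
    """Build the feature dictionary for the candidate coreferent pair (i, j)."""
    return {
        "number_agreement": i.number == j.number,
        "gender_agreement": i.gender == j.gender,
        "pronoun_i": i.is_pronoun,
        "pronoun_j": j.is_pronoun,
        "proper_name": i.is_proper and j.is_proper,
        "discourse_distance": abs(j.sentence - i.sentence),
    }

mary = Mention("Mary", "sing", "f", False, True, 0)
she = Mention("she", "sing", "f", True, False, 1)
print(pair_features(mary, she))
```

Features such as alias, appositive and grammatical role are left out because they need a name lexicon or a parse, which this sketch does not assume.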
Do It Yourself: IE Task
• A sample of text from the Wall Street Journal is given, together with a template
• The task is to fill the template with information about succession events extracted from the text
• There are six events in total, although complete information is not available for all of them
Text:
New York Times Co. named Russell T. Lewis,
45, president and general manager of its
flagship New York Times newspaper,
responsible for all business-side
activities.
He was executive vice president and deputy
general manager. He succeeds Lance R.
Primis, who in September was named president
and chief operating officer of the parent.
Template:
<ORGANIZATION-1>
NAME : "New York Times Co."
<ORGANIZATION-2>
NAME : "New York Times"
<PERSON-1>
NAME : "Russell T. Lewis"
<PERSON-2>
NAME : "Lance R. Primis"
http://gate.ac.uk/ie/ie_example.html
Some Techniques : At a Glance
What Tools Can I Use to Perform Information Extraction?
An IE Toolkit: Lexical Resources
Machine-readable corpora, dictionaries, etc., and tools for processing them
• Categories: Ontology, Treebank, Dictionary, Tools
• Resources and tools named: Brown, Penn Treebank, WordNet, BCO, Parser, NER Tagger, UMLS, GENIA, VerbNet, Comlex, Linguistic Data Consortium (LDC), GATE, UIMA, Open Biomedical Ontology
Part III: Evaluation in Information Extraction
Evaluation
• We evaluate our systems to:
– See how they behave w.r.t. a gold standard
– Compare them with other systems
• Types of Evaluations:
– Intrinsic: specific to the extraction task
– Extrinsic: a task that relies on the extraction, e.g. an Information Retrieval task
Evaluation: Precision / Recall
              Expert Yes    Expert No
System Yes    TP            FP
System No     FN            TN
Recall = TP / (TP + FN): fraction of correct/relevant answers which are predicted
Precision = TP / (TP + FP): fraction of predictions which are correct/relevant
Fall Out = FP / (FP + TN): proportion of incorrect class members that the system predicts, relative to all incorrect class members (i.e., Expert No)
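The three measures follow directly from the four cells of the contingency table. The sketch below uses made-up counts purely for illustration, and does not guard against empty denominators.

```python
def recall(tp, fn):
    """Fraction of the expert's positives that the system found."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Fraction of the system's positives that the expert confirms."""
    return tp / (tp + fp)

def fall_out(fp, tn):
    """Fraction of the expert's negatives that the system wrongly accepted."""
    return fp / (fp + tn)

# Hypothetical counts, not taken from any real evaluation.
tp, fp, fn, tn = 8, 2, 4, 86
print(recall(tp, fn), precision(tp, fp), fall_out(fp, tn))
```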
F Measure
Combined measure of Precision and Recall
P = precision
R = recall
B = a factor that indicates the relative importance of recall and precision
When B = 1, recall and precision are of equal importance => harmonic mean (F1-measure)
$F = \frac{(B^2 + 1)\,P\,R}{B^2 P + R}$
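As a small worked sketch of the F-measure, F = (B² + 1)·P·R / (B²·P + R), the function below takes P, R and the weighting factor B (called `beta` here). The sample values of P and R are made up for illustration.

```python
def f_measure(p, r, beta=1.0):
    """Weighted combination of precision p and recall r; beta=1 gives F1."""
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)

# Hypothetical precision and recall values.
p, r = 0.8, 0.5
print(f_measure(p, r))            # F1: harmonic mean of p and r
print(f_measure(p, r, beta=2.0))  # beta > 1 gives recall more weight
```

With these values recall is the weaker of the two scores, so giving it more weight (beta = 2) lowers the combined score relative to F1.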
What Other Types of Metrics Exist Besides Precision and Recall?
John saw Mary. He thought she was a very beautiful girl and she wore a new red dress.
Vilain Metric : Pron. Coreference
• Equivalence Class evaluation
– Groups built by system compared against gold standard (Key)
– Compare equivalence classes defined by links in key and computed values (Response)
A Model-Theoretic Coreference Scoring Scheme
Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, Lynette Hirschman
Coreference Chains
• Mary, girl, she
• John, he
Vilain Recall: Concepts
Key Links: { A-B, B-C }
Response Links: { A-C }
S: equivalence class relative to Key
– S = {A, B, C}, where |S| = 3
p(S): partition of S induced by the Response
– intersection of S and Response
– elements in Key, not in Response
– p(S) = { (A-C), (B) }, |p(S)| = 2
c(S): minimal number of "correct" links needed to generate S
– c(S) = (|S| - 1) = 2
m(S): number of "missing" Response links
– m(S) = (|p(S)| - 1) = 1
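The quantities above combine into the recall of the Vilain et al. scheme: the sum over Key classes of c(S) - m(S) = |S| - |p(S)|, divided by the sum of c(S) = |S| - 1. Precision is obtained by swapping the roles of Key and Response. The sketch below represents equivalence classes as Python sets; the function names are hypothetical, and it assumes the classes on each side are disjoint and that at least one class has more than one member.

```python
def partition(s, other_classes):
    """Partition class s by the classes of the other side; leftovers become singletons."""
    parts, covered = [], set()
    for cls in other_classes:
        overlap = s & cls
        if overlap:
            parts.append(overlap)
            covered |= overlap
    parts.extend({mention} for mention in s - covered)
    return parts

def vilain_recall(key_classes, response_classes):
    """Sum of (|S| - |p(S)|) over Key classes, divided by the sum of (|S| - 1)."""
    found = sum(len(s) - len(partition(s, response_classes)) for s in key_classes)
    needed = sum(len(s) - 1 for s in key_classes)
    return found / needed

def vilain_precision(key_classes, response_classes):
    """Same computation with the roles of Key and Response swapped."""
    return vilain_recall(response_classes, key_classes)

key = [{"A", "B", "C"}]   # Key links A-B, B-C
response = [{"A", "C"}]   # Response link A-C
print(vilain_recall(key, response))     # (3 - 2) / (3 - 1) = 0.5
print(vilain_precision(key, response))  # (2 - 1) / (2 - 1) = 1.0
```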
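Vilain: Recall / Precision
Recall: computed over the Key equivalence classes; counts the links that would have to be added to the Response
Precision: computed over the Response equivalence classes; counts the links that would have to be added to the Key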
Do It Yourself: Vilain Metric
Part IV: Exploiting Information Extraction with IR in Social Applications
IE in Context
Figure labels: Spider; Filter by relevance; Document collection; IE; Segment, Classify, Associate, Cluster; Create ontology; Label training data; Train extraction models; Load DB; Database; Query, Search; Data mine
What Does an Entity Extraction Scenario Look Like?
Scenario I: OKKAM tackling the Flood of Identifiers
http://en.wikipedia.org/wiki/Barack_obama
http://dbpedia.org/resource/Barack_Obama
http://www.linkedin.com/in/barackobama
http://farm4.static.flickr.com/3193/2437394249_824e76ed76.jpg?v=0
http://current.com/index.php/items/89822170/obama_to_sign_stimulus_bill_today_in_denver.htm
http://www.facebook.com/home.php#/barackobama?ref=s
http://www.reuters.com/news/globalcoverage/barackobama
http://www.OPENCALAIS.com/watch?v=z4W2_raF_iw
??
OKKAM & Information Extraction 79
Information Extraction & OKKAMization
• NER: detect named entity, decide about type (e.g. Person)
• Send ID request to OKKAM ENS (based on entity name, type + context information)
• OKKAM ENS returns OKKAM ID (or list of candidates)
• Attach ID to entity reference in text
http://www.okkam.org/ens/idb3016709-b9e1-42c0-ac5f-6383d2e5b235
=> prepare for information integration, entity-centric search, semantic infusion (attachment of information about entity)
http://www.okkam.org/
What Does an Event Extraction Scenario Look Like?
Scenario II: Epidemic Intelligence
Goal: early identification of potential health threats:
• verification, assessment, investigation
State of the Art: Event-Based
• web data
• NLP, Data Mining, Machine Learning techniques
• extract epidemic events from the unstructured text
• News, domain-specific reports, blogs
online news
Event Mining for Early Detection, Rapid Response ...
How Can Events Be Used in Pharos Audio-Visual Search?
Scenario III: Facets in Pharos
• Event-Centric Search / Browsing
– Document representation no longer Bag-of-Words
– Events => N-ary relations between entities or classes
Scenario III: Extraction from Informal Text
• Transcribed Speech
– Discourse structure of "Speech Text" differs from written text
– Transcription errors
– Missing orthographic features
– Sentence boundaries difficult to detect
– Automatic Speech Recognition (ASR) vocabulary problem
• Blogs
– Affective, opinionated
– Topic fluctuating, prose
– Many authors, different styles
– Inconsistent capitalization patterns
– Malformed sentences & phrases, slang, ...
Part V: Wrap Up & Conclusion
What Considerations Do I Need to Make for My Information Extraction System?
Considerations for an IE System
Description                                         Dimension
document structure of the input text                free text; semi-structured
richness of the natural language processing (NLP)   shallow NLP; deep NLP
complexity of the pattern rules                     single slot; multiple slots
data size                                           training data; application data
degree of automation                                supervised; semi-supervised; unsupervised
type of evaluation                                  gold standard corpus? evaluation measures used? evaluation of machine learning
What Are Some Important Directions in Information Extraction?
Research Trends in IE
Concept                                        Description
[1] Semi- / Unsupervised, Self-Learning        Supervised methods assume: annotated documents; broad coverage; sufficient data redundancy
[2] Open Information Extraction                Target relations not known in advance
[3] Web-Scale Systems                          Number of relations is large
Research Trends in IE
• Self-supervised Information Extraction at Web scale
– KnowItAll: Extracting a closed set of relations [Etzioni 2005]
– TextRunner: Extracting an open set of relations [Banko 2007]
– Open IE: The Tradeoffs Between Open and Traditional Relation Extraction [Banko 2008]
– SRES [Feldman 2006], LEILA [Suchanek 2006]: Extracting a closed relation set with more elaborate linguistic preprocessing
Scalability:
• Large set of seed relations (e.g. entire IMDB)
• Open-ended corpora
Noise: incorrect seed interpretations
In Summary ....
Information is No Longer Locked Away...
Events, Facts, Relationships, Opinions
Social Application Integration
IR and IE Tradeoffs
• IE needs more CPU power; a suitable tradeoff must be found between data size, analysis depth, complexity, time, etc.
• Deeper analysis and complex template structures consume more time than shallow analysis and simple named entity recognition or binary relation extraction
• Ease of use needs improvement
… Lighting the Way …
IE is acknowledged as an urgently needed information technology in a constantly growing digitized world
Who are the winners in a globalized information society?
… Those who outstrip competitors through comprehensive, integrated and precise access to digital information for decision-making processes!
Thank You
Useful Tools
• ANNIE: Information Extraction System
– http://gate.ac.uk/ie/annie.html
• Stanford Parser
– http://nlp.stanford.edu:8080/parser/
• What's Wrong With My NLP?
– http://code.google.com/p/whatswrong/
• LingPipe
– http://alias-i.com/lingpipe/html
– http://www-nlp.stanford.edu/downloads/
Useful Links
• Software Tools for NLP
– http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/nlp_tools.html
• Statistical NLP / corpus-based computational linguistics resources
– http://nlp.stanford.edu/links/statnlp.html
• Stanford NLP Group
– http://www-nlp.stanford.edu/downloads/
• Linguist List: Language and Resources
– http://www.linguistlist.org/langres/index.html
Selected References
• Foundations of Statistical Natural Language Processing, Manning and Schütze
• Information Extraction, Moens
• Text Mining Handbook, Feldman, Sanger
• Maximum Entropy Model for NLP, Ratnaparkhi