Pharos Summer School Fundamentals of Social Applications

Pharos Summer School Fundamentals of Social Applications June 2009 Avaré Stewart [email protected] http://www.l3s.uni-hannover.de/~stewart/ pharos/


Transcript of Pharos Summer School Fundamentals of Social Applications

Page 1: Pharos Summer School  Fundamentals  of  Social Applications

Pharos Summer School Fundamentals

of Social Applications

June 2009

Avaré Stewart

[email protected]

http://www.l3s.uni-hannover.de/~stewart/pharos/

Page 2: Pharos Summer School  Fundamentals  of  Social Applications

Roadmap

• Part I: Overview Social Applications– current shortcomings, solutions

• Part II : Information Extraction (IE)– tasks, techniques, tools

• Part III: Evaluation

• Part IV: IE & IR Applications in Context

Page 3: Pharos Summer School  Fundamentals  of  Social Applications

Overview of Social Applications

Page 4: Pharos Summer School  Fundamentals  of  Social Applications

The Social Applications Phenomenon

The Social Application Phenomenon today is driven by Social Media

Social Media:

• information content of the “citizen journalist”, user generated content

• popular way people connect in the online world, for personal & business relationships


Page 5: Pharos Summer School  Fundamentals  of  Social Applications

What's the Social Media Hype?

• Coverage:
– Reach small or large audiences
– Breaks publication barriers

• Business / Advertisement
– Repeated Visiting: best links, readers will come back

• Information Gathering / Sharing:
– Cut time you spend looking
– Link economy is real… Give some, get some
– Dynamic Content: not endpoint of conversation, but the beginning…

• Social Intervention / Detection
– Rumors, fads, infectious disease

Capitalize on Social Processes: Diffusion / Cascade

The core concepts of social media, Espoo, April 2007

Page 6: Pharos Summer School  Fundamentals  of  Social Applications

The Many Faces of Social Applications

Domain:
• Music, politics, cycling, medicine

Media Type:
• Video: YouTube, Daily Motion

Facebook

Services:
• meeting people
• expressing point of view
• serendipitous discovery

Page 7: Pharos Summer School  Fundamentals  of  Social Applications

What Are Some Limitations with Social Applications?

Page 8: Pharos Summer School  Fundamentals  of  Social Applications


Social Sites intentionally seek distinction

Problem: sheer number, redundancy, overlap in:

• type of media, resources
• topics

Overlaps exist, but remain untapped for the benefit of those who actually constitute the social networking ecosystem

Social Networking Divide

Where's the “Social” Web ?

The so-called Social Web is, ironically, divided

Page 9: Pharos Summer School  Fundamentals  of  Social Applications

Open Social Networking (OSN)

Aspects of an Open Social Network

• Unified Data Spaces
• Personal Identity Unification
• Unified Applications

Page 10: Pharos Summer School  Fundamentals  of  Social Applications

http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

Unified Data Spaces Linking Open Data Cloud

Page 11: Pharos Summer School  Fundamentals  of  Social Applications

Personal Identity Unification

• OpenID : a single digital
• Retaggr : social media profile card
• Geek Chart : graphical profile - pie chart
• DandyID : collect online profiles in one place
• FriendFeed : real-time aggregator, consolidates the updates from sites

Page 12: Pharos Summer School  Fundamentals  of  Social Applications

Unified Applications

Multi-Site APIs: common API for social applications across multiple websites
– OpenSocial
– Data Portability Project

Single-Site APIs: partner / interact programmatically
– YouTube Data API: videos
– Spinn3r: indexing blogosphere
– etc.

Page 13: Pharos Summer School  Fundamentals  of  Social Applications

Bloggers Who Don’t Tag

Taggers Who Don’t Blog

???

Social Network Divide

Pharos Scenario

Page 14: Pharos Summer School  Fundamentals  of  Social Applications

Missing Link: Cross-Tagging

Avaré Bonaparte Stewart

Exploit the tag assertions made by users of one social site to personalize the experience for users in another, comparable site

Page 15: Pharos Summer School  Fundamentals  of  Social Applications

Overview: Cross Tagging


Better Recommendations

Cross-Tagging for Personalized Open Social Networking, Stewart, Diaz, Balby Marinho 2008

Better Browsing

Better Search

Page 16: Pharos Summer School  Fundamentals  of  Social Applications

What More Can We Do with Social Applications?

Page 17: Pharos Summer School  Fundamentals  of  Social Applications

Social Media Communities & Content

Espoo, April 2007

Social media: examined primarily for its popularity in connecting people

In Pharos: examine blogs for improved, personalized information access

Page 18: Pharos Summer School  Fundamentals  of  Social Applications

Complex Information Needs & Social Media Search

• Polarity, opinion
• Meme and themes
• Related, multi-lingual resources
• Entities: people, organizations, etc.
• Relationships between entities
• Event: who, what, where, when, how

Page 19: Pharos Summer School  Fundamentals  of  Social Applications

Events ? ... Momentum is Shifting

• Industry:
– Complex Event Processing (CEP)
– Event correlation:
  • Event Filtering, Event Aggregation
  • Event Masking, Root Cause Analysis

• Research:
– Event detection
– Associations
– De-duplicate

Humans think in terms of events and entities

Events - natural abstraction of real world

Page 20: Pharos Summer School  Fundamentals  of  Social Applications

Information Retrieval, Meet Information Extraction ... from Blogs

• Information Extraction (IE):
– a subarea of Natural Language Processing (NLP)
– Needed to solve complex (event-driven) information needs
– hard, because natural language is complex, vague and ambiguous, i.e. unstructured
  • potentially harder for blogs & informal sources

[Figure: IE, IR, Social Media]

Page 21: Pharos Summer School  Fundamentals  of  Social Applications

Anatomy of a Blog

[Figure: annotated blog page. Labels: Tag, Content, Permalink, Timestamp, Title, Feed, Blogroll, Comment, Trackback, Archive, Author]

Rich Source for Personalized Information

Page 22: Pharos Summer School  Fundamentals  of  Social Applications

Part II: Information Extraction

Tasks, Techniques and Tools

Page 23: Pharos Summer School  Fundamentals  of  Social Applications

What is Information Extraction ?

Page 24: Pharos Summer School  Fundamentals  of  Social Applications

Unstructured Data

• Encoded in a way that makes it difficult for computers to immediately interpret

• Multiple languages, across multiple documents

Page 25: Pharos Summer School  Fundamentals  of  Social Applications


Why Information Extraction?

• Large amount of unstructured or semistructured information
– Web pages, email, news articles, call-center text records, business reports, annotations, spreadsheets, research papers, blogs, tags, instant messages (IM), …

• High impact applications
– Business intelligence, personal information management, Web communities, Web search and advertising, scientific data management, e-government, medical records management, …

• Open ended and growing rapidly

• Information Extraction:
– Superimpose formal meaning on unstructured information
– Elicit facts and relationships
– Feed database/knowledgebase

Page 26: Pharos Summer School  Fundamentals  of  Social Applications

Why? ... Information is Locked Away...

Inaccessible data ... growing and sophisticated needs ... growing

Events, Facts, Relationships

Page 27: Pharos Summer School  Fundamentals  of  Social Applications

What is Information Extraction (IE) ?

• ...isolates relevant text fragments, extracts relevant information from the fragments, and pieces together the targeted information in a coherent framework

• ... build systems that find and link relevant information while ignoring extraneous and irrelevant information

Cowie and Lehnert, 1996, p. 81

IE is used to get some information out of unstructured data

Page 28: Pharos Summer School  Fundamentals  of  Social Applications

Information Extraction: e.g. Disaster

Unstructured Text -> Information Extraction (IE) System -> Structured Text

Page 29: Pharos Summer School  Fundamentals  of  Social Applications


Information Extraction: Major Tasks

• Segmentation
– Tokenization, Sentence Splitting

• Classification
– POS Tagging, Lemmatization, Disambiguation, …
– Entity Detection

• Association
– Noun Phrase Chunking
– Parsing
– Relationship Detection

• Normalization & Deduplication
– Anaphora Resolution
– Normalization of Formats, Schema
– Record Linkage, Record Deduplication
– Mention Tracking

Page 30: Pharos Summer School  Fundamentals  of  Social Applications

What are the Components and Tasks of an Information Extraction System?

Page 31: Pharos Summer School  Fundamentals  of  Social Applications

General View of IE System

[Figure: general view of an IE system (Moens 06). Phases: Training Phase, Deployment Phase. Inputs: Training corpus, Source Text. Components: Preprocessing, Extraction, Acquisition / Learning, Extraction Grammar, Feedback. External Knowledge: Thesaurus, Ontology, Knowledge Base. Output: Structured Information.]

Information Extraction, Moens

Page 32: Pharos Summer School  Fundamentals  of  Social Applications

Common IE Tasks: Preprocessing & Recognition

Pre-Processing Tasks

Normalization

Sentence Splitting

Tokenization

POS Tagging

Chunking

Parsing

Sense Disambiguation

Recognition Tasks

Named Entity (NE)

Co-reference Resolution (CO)

Template Element Construction (TE)

Template Relation Construction (TR)

Scenario Template (ST)

Semantic Role

Timex Line Recognition

Page 33: Pharos Summer School  Fundamentals  of  Social Applications

Ex: Text Normalization

AVIAN INFLUENZA, HUMAN (101): EGYPT, 79TH, 80TH CASES
*****************************************************
A ProMED-mail post
<http://www.promedmail.org>
ProMED-mail is a program of the
International Society for Infectious Diseases
http://www.isid.org
Date: Mon 8 Jun 2009
Source: Egyptian Chronicles [edited]
<http://egyptianchronicles.blogspot.com/2009/06/h5n1-follow-up-no80.html>

Clean junk formatting

• Transformed to make it consistent
• Performed before text is processed

Page 34: Pharos Summer School  Fundamentals  of  Social Applications

Sentence Splitting

• Segments text into sentences

• Required for the tagger

• Domain- and application-independent

He called Mr. White at 4p.m. in Washington, D.C. Mr. Green responded.

The computer must tell which of the dots denote an actual sentence
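A minimal sketch of a rule-based splitter for the example sentence above. The abbreviation list and the function name `split_sentences` are assumptions made for illustration; the slides do not prescribe a specific method, and real splitters use much larger lists or a trained classifier.

```python
import re

# Hypothetical, deliberately small list of abbreviations that rarely end a sentence.
NON_TERMINAL_ABBREVIATIONS = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof."}


def split_sentences(text):
    """Split text at '.', '!' or '?' when followed by whitespace and a capital letter."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.end()].strip()
        last_token = candidate.split()[-1]
        # A dot after a title such as "Mr." is not a sentence boundary.
        if last_token in NON_TERMINAL_ABBREVIATIONS:
            continue
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


example = "He called Mr. White at 4p.m. in Washington, D.C. Mr. Green responded."
for sentence in split_sentences(example):
    print(sentence)
```

The dot in "4p.m." is skipped because a lowercase word follows, and "Mr." is skipped because of the list; only the dot after "D.C." is kept as a boundary.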

Page 35: Pharos Summer School  Fundamentals  of  Social Applications

Tokenization

• Tokenization / Word Segmentation:

– Numbers, punctuation, symbols

– string of contiguous alphanumeric characters with space on either side?

Words are not always surrounded by whitespace:

Abbreviations: etc. and Calif.

A text-based medium.

White space not indicating a word break:

San Francisco

Ditto: in spite of

Phone: 0171 378 0647
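A small regular-expression tokenizer that covers the cases on this slide. The two mini-lexicons and the name `tokenize` are hypothetical; phone numbers are not handled and would still be split at whitespace.

```python
import re

# Hypothetical mini-lexicons; a real system would load much larger lists.
MULTIWORD_UNITS = ["San Francisco", "in spite of"]
ABBREVIATIONS = ["etc.", "Calif."]

TOKEN_PATTERN = re.compile(
    # multiword units that contain whitespace but act as one token
    "(?:" + "|".join(re.escape(m) for m in MULTIWORD_UNITS) + r")(?!\w)"
    # abbreviations keep their trailing dot
    + "|" + "|".join(re.escape(a) for a in ABBREVIATIONS)
    # ordinary words and numbers, including hyphenated words
    + r"|\w+(?:-\w+)*"
    # any other single non-space symbol (punctuation)
    + r"|[^\w\s]"
)


def tokenize(text):
    """Return the list of tokens found in text, left to right."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("A text-based medium."))
print(tokenize("San Francisco, in spite of fog"))
```

The alternatives are tried in order, so lexicon entries win over the generic word rule.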

Page 36: Pharos Summer School  Fundamentals  of  Social Applications

Parts of Speech (POS)

• POS: category / class
• Words in same class have similar syntactic behavior
• Ex: Noun: person, place, thing, animal
• Ex: verbs express action

Page 37: Pharos Summer School  Fundamentals  of  Social Applications

Ex: Penn Treebank POS Tagset

Tag    Description                       Example
CC     Coord conjunction                 and, but, or
CD     Cardinal number                   one, two
DT     Determiner                        a, the
EX     Existential there                 There
FW     Foreign word                      Mea culpa
IN     Prep / subordinate conjunction    of, in, by
JJ     Adjective                         Yellow
JJR    Adjective, comparative            Bigger
JJS    Adjective, superlative            Wildest
LS     List item marker                  1, 2, One
MD     Modal                             Can, should
NN     Noun, sing                        Dog
NNS    Noun, plural                      dogs
NNP    Proper noun, sing                 IBM
NNPS   Proper noun, plural               West Indies
PDT    Predeterminer                     All, both
POS    Possessive ending                 's
PRP    Personal pronoun                  I, you, he
RB     Adverb                            Quickly, never
RBR    Adverb, comparative               faster
RBS    Adverb, superlative               fastest
RP     Particle                          Up, off
SYM    Symbol                            +, %, &
TO     To                                to
UH     Interjection                      Ah, oops
VB     Verb, base form                   eat
VBD    Verb, past tense                  ate
VBG    Verb, gerund                      eating
VBN    Verb, past participle             Eaten
VBP    Verb, non-3prs                    eat
VBZ    Verb, 3prs                        eats
WDT    Wh-determiner                     Which, that
WP     Wh-pronoun                        What, who
WP$    Possessive wh                     whose
WRB    Wh-adverb                         How, where

Punctuation and symbol tags: $ # ( )
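To make the tagset concrete, here is a toy lookup tagger. The lexicon is an assumption built from the example column of the tagset; it is not a trained tagger, and it ignores ambiguity (for instance "eat" can be VB or VBP depending on context).

```python
# Toy lexicon (illustrative assumption), mapping lowercase words to Penn Treebank tags.
LEXICON = {
    "the": "DT", "a": "DT", "and": "CC", "of": "IN", "in": "IN",
    "dog": "NN", "dogs": "NNS", "ibm": "NNP",
    "eat": "VB", "ate": "VBD", "eating": "VBG", "eaten": "VBN", "eats": "VBZ",
    "quickly": "RB", "yellow": "JJ", "bigger": "JJR",
}


def tag(tokens, default="NN"):
    """Tag each token by dictionary lookup; unknown words get the default tag."""
    return [(token, LEXICON.get(token.lower(), default)) for token in tokens]


print(tag(["The", "dog", "eats", "quickly"]))
```

Falling back to NN for unknown words is a common baseline choice, stated here as an assumption rather than something the slides specify.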

Page 38: Pharos Summer School  Fundamentals  of  Social Applications

Chunking

• Words are organized into groups
• Phrases: word groupings, clumped as a unit
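A sketch of noun phrase chunking over POS-tagged tokens. The grouping rule (determiner, adjectives, nouns) and the function name `chunk_noun_phrases` are simplifying assumptions, not the method of any particular chunker.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
NP_TAGS = NOUN_TAGS | {"DT", "JJ"}


def chunk_noun_phrases(tagged_tokens):
    """Group runs of determiner/adjective/noun tokens into noun phrase strings."""
    chunks, current = [], []

    def flush():
        # Keep the group only if it contains at least one noun.
        if any(tag in NOUN_TAGS for _, tag in current):
            chunks.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged_tokens:
        if tag not in NP_TAGS:
            flush()
            continue
        if tag == "DT" and current:
            flush()  # a determiner starts a new phrase
        current.append((word, tag))
    flush()
    return chunks


tagged = [("The", "DT"), ("yellow", "JJ"), ("dog", "NN"),
          ("ate", "VBD"), ("a", "DT"), ("banana", "NN")]
print(chunk_noun_phrases(tagged))
```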

Page 39: Pharos Summer School  Fundamentals  of  Social Applications

Parsing

• Labeled syntactic tree corresponding to the interpretation of the sentence

• Resolution of syntactic ambiguities

Page 40: Pharos Summer School  Fundamentals  of  Social Applications

Fruit flies like a banana

Time flies like an arrow

Sense Disambiguation

Page 41: Pharos Summer School  Fundamentals  of  Social Applications

What are Some Basic Recognition Tasks?

Page 42: Pharos Summer School  Fundamentals  of  Social Applications

IE Recognition Tasks

MUC Recognition Tasks

Named Entity (NE)

Co-reference Resolution (CO)

Template Element Construction (TE)

Template Relation Construction (TR)

Scenario Template (ST)

ACE Recognition Tasks

Entity detection and tracking (EDT)

Relation detection and characterization (RDC)

Event detection and characterization (EDC)

Temporal expression detection (TERN)

[Figure: timeline of evaluation events by year. MUC-1 (1987), MUC-2 (1989), MUC-3 (1991), MUC-4 (1992), MUC-5 (1993), MUC-6 (1995), MUC-7 (1998), ACE Pilot (1999), ACE (2002 ...), ACE + Text Analysis Conference (TAC) (2009)]

Page 43: Pharos Summer School  Fundamentals  of  Social Applications

Named Entity Recognition (NE)

• recognition of entity names:
– people, organizations
– place names
– temporal expressions & numerical expressions

Page 44: Pharos Summer School  Fundamentals  of  Social Applications

Co-reference Resolution (CO)

• Identify chains of noun phrases that refer to the same object

• Scope:
– Within document
– Across document

John saw Mary. The girl was very beautiful; she wore a new red dress.

• Types: Pronominal: ’they’, ’it’, ’he’, ’hers’, ’themselves’, etc. resolve to: proper nouns, common nouns, other pronouns

Page 45: Pharos Summer School  Fundamentals  of  Social Applications

Proper Noun Coreference

• Names of people, places, products and companies referred to in many different variations.

Minnesota Mining and Manufacturing

3M Corp.

New York

New York City

NYC

N.Y.C

3M

Ref: Coreference as a Foundation for Link Analysis over Free Text

Page 46: Pharos Summer School  Fundamentals  of  Social Applications

Other Coreference Types

• Apposition: two noun phrases side by side, where one defines or modifies the other

John Smith, chairman of General Electric, resigned yesterday.

• Predicate Nominal: a noun phrase is the main predicate of a sentence; the subject and the predicate nominal are connected by a linking verb (copula)

John is the finest juggler in the world.

Page 47: Pharos Summer School  Fundamentals  of  Social Applications

Template Element Construction (TE)

• Specified classes and attributes of entities:

– person: name (name variants)
– title, nationality
– description in the text
– subtype

Page 48: Pharos Summer School  Fundamentals  of  Social Applications

Template Relation Construction (TR)

• Two-slot template representing a binary relation:

– e.g., employee_of, product_of, location_of

– pointers to template elements

Fei-Yu Xu 08
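One way to picture a two-slot relation template with pointers to template elements. The class names, field names and identifiers below are hypothetical; only the relation name `employee_of` comes from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateElement:
    element_id: str   # identifier that relation slots point to
    entity_type: str  # e.g. "PERSON" or "ORGANIZATION"
    name: str


@dataclass(frozen=True)
class TemplateRelation:
    relation: str  # e.g. "employee_of"
    arg1: str      # pointer (element_id) to the first template element
    arg2: str      # pointer (element_id) to the second template element


def describe(rel, elements):
    """Resolve both pointers and render the relation as text."""
    return f"{rel.relation}({elements[rel.arg1].name}, {elements[rel.arg2].name})"


elements = {
    "PERSON-1": TemplateElement("PERSON-1", "PERSON", "John Smith"),
    "ORGANIZATION-1": TemplateElement("ORGANIZATION-1", "ORGANIZATION", "General Electric"),
}
relation = TemplateRelation("employee_of", "PERSON-1", "ORGANIZATION-1")
print(describe(relation, elements))
```

Storing pointers instead of copies keeps each entity's attributes in one place, which is the point of building TR on top of TE.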

Page 49: Pharos Summer School  Fundamentals  of  Social Applications

Scenario Template Production (ST)

• information involving several relations or events:

– Joint venture

– Partners

– Products

– Profits

Fei-Yu Xu 08

Page 50: Pharos Summer School  Fundamentals  of  Social Applications

Can We Extract Temporal Expressions?

Page 51: Pharos Summer School  Fundamentals  of  Social Applications

Temporal expression detection (TERN)

• Time Expression Recognition and Normalization
– recognize and normalize expressions that refer to date and time
– Timestamp of events
– Meaning of temporal expressions
– Conditions associating time with a relation / event

• TIMEX2 Standard
• XML tags + time
• second generation TIMEX

Page 52: Pharos Summer School  Fundamentals  of  Social Applications

Some Examples: TIMEX2 Time

Reference date: Thursday, July 15, 1999

Precise Time:

I was sick <TIMEX2 VAL="1999-07-14"> yesterday </TIMEX2>.

The contractor submitted a proposal on <TIMEX2 VAL="1999-07-13"> Tuesday </TIMEX2>.

Duration:

I will be on vacation for <TIMEX2 VAL="P3W" ANCHOR_DIR="AFTER" ANCHOR_VAL="1999-07-15"> three weeks </TIMEX2>.

Pronouns:

<TIMEX2 VAL="1999-07-14"> The day after <TIMEX2 VAL="1999-07-13"> that </TIMEX2> </TIMEX2>, the contract was awarded.
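A sketch of how the VAL attributes in these examples can be computed, assuming Thursday, July 15, 1999 is the reference date. The helper names are hypothetical, and resolving a bare weekday name to the most recent past occurrence is an assumption that fits the past-tense example; a real normalizer would use tense and context.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def normalize_day(expression, reference):
    """Map a simple relative day expression to an ISO date string."""
    expr = expression.strip().lower()
    if expr == "today":
        return reference.isoformat()
    if expr == "yesterday":
        return (reference - timedelta(days=1)).isoformat()
    if expr in WEEKDAYS:
        # Assumption: a bare weekday means its most recent past occurrence.
        delta = (reference.weekday() - WEEKDAYS.index(expr)) % 7 or 7
        return (reference - timedelta(days=delta)).isoformat()
    raise ValueError(f"unsupported expression: {expression!r}")


def tag_timex2(expression, reference):
    """Wrap the expression in a TIMEX2 tag carrying the normalized value."""
    return f'<TIMEX2 VAL="{normalize_day(expression, reference)}">{expression}</TIMEX2>'


reference_date = date(1999, 7, 15)
print(tag_timex2("yesterday", reference_date))
print(tag_timex2("Tuesday", reference_date))
```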

Page 53: Pharos Summer School  Fundamentals  of  Social Applications

State of the Art Performance

• Named entity recognition
– Person, Location, Organization, …
– F1 in high 80’s or low- to mid-90’s

• Binary relation extraction
– Contained-in (Location1, Location2)
– Member-of (Person1, Organization1)
– F1 in 60’s or 70’s or 80’s

• N-ary relation extraction, event detection
– Much lower -> errors accumulate!

Page 54: Pharos Summer School  Fundamentals  of  Social Applications

How Can Information Extraction Be Performed?

Page 55: Pharos Summer School  Fundamentals  of  Social Applications

Common IE Techniques

• Knowledge Engineering

• Corpus Based / Machine Learning

Page 56: Pharos Summer School  Fundamentals  of  Social Applications

Classification for IE

• Many problems needed for IE can be re-formulated as a classification problem

• Features: object description, context

• Class: the class to which the object belongs

• Input: Training Data
• Classifier: Learning Algorithm
• Output: Hypothesis that fits the data
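The Input / Learning Algorithm / Hypothesis loop can be sketched with a deliberately trivial learner that memorizes the majority class per feature tuple. This learner, the feature design and the labels are all illustrative assumptions; the slides do not name a specific algorithm.

```python
from collections import Counter, defaultdict


def train(examples):
    """Learn a hypothesis from (features, label) pairs by majority vote per feature tuple."""
    counts = defaultdict(Counter)
    overall = Counter()
    for features, label in examples:
        counts[features][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]  # fallback for unseen feature tuples

    def hypothesis(features):
        if features in counts:
            return counts[features].most_common(1)[0][0]
        return default

    return hypothesis


# Features for a candidate dot: (previous token is a title, next token is capitalized)
training_data = [
    ((False, True), "boundary"),
    ((False, True), "boundary"),
    ((True, True), "no-boundary"),
    ((True, True), "no-boundary"),
    ((False, False), "no-boundary"),
]
classify = train(training_data)
print(classify((True, True)))
print(classify((False, True)))
```

Any real learner (decision tree, maximum entropy, SVM) fits the same shape: training data in, a function from features to class out.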

Page 57: Pharos Summer School  Fundamentals  of  Social Applications

Classification Scheme

• The class / semantic distinction that we want to assign to an information unit:

– Named Entity: protein, drug, disease
– Semantic Role: i.e. verb: agent
– Grammatical Role: object, subject
– Domain Independent: person, organization
– Sentence boundary: {!,.,-}

Page 58: Pharos Summer School  Fundamentals  of  Social Applications

Ex: Features for Semantic Role Recognition

Feature              Value
Phrase type          Noun / Verb phrase, determined by the POS tag of the syntactic head
Syntactic head       Word that composes the syntactic head of the phrase that represents i
Voice                Active or passive
Named Entity Class   Class (e.g. person, organization) of the syntactic head

Moens06

The actual set of features used is determined by a feature selection strategy, specific to the problem at hand

Page 59: Pharos Summer School  Fundamentals  of  Social Applications

Ex: Features for Coreference Resolution (CO)

Feature              Value
Number Agreement     True if i and j agree in number
Gender Agreement     True if i and j agree in gender
Alias                True if i is an alias of j, or vice versa
Pronoun i (j)        True if i (j) is a pronoun
Appositive           True if j is an appositive of i
Definiteness         True if j is preceded by „the“ or a demonstrative pronoun
Grammatical Role     True if the grammatical roles of i and j match, i.e. subject, direct / indirect object
Proper name          True if both are proper names
Named entity class   True if both have the same semantic class
Discourse distance   Number of sentences or words that i and j are apart

Moens06
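A sketch of how some of these pairwise features could be computed for two mentions i and j. The `Mention` fields are hypothetical annotations assumed to come from earlier pipeline stages; alias, appositive, definiteness and grammatical role are left out because they need a parser or name matcher.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    sentence: int      # index of the sentence containing the mention
    number: str        # "sg" or "pl"
    gender: str        # "m", "f" or "n"
    is_pronoun: bool
    is_proper: bool
    entity_class: str  # e.g. "person"


def pair_features(i, j):
    """Return a feature dictionary for the candidate coreference pair (i, j)."""
    return {
        "number_agreement": i.number == j.number,
        "gender_agreement": i.gender == j.gender,
        "pronoun_i": i.is_pronoun,
        "pronoun_j": j.is_pronoun,
        "proper_name": i.is_proper and j.is_proper,
        "same_entity_class": i.entity_class == j.entity_class,
        "discourse_distance": abs(j.sentence - i.sentence),
    }


mary = Mention("Mary", 0, "sg", "f", False, True, "person")
she = Mention("she", 1, "sg", "f", True, False, "person")
print(pair_features(mary, she))
```

A classifier would then decide, from such a feature vector, whether i and j belong to the same coreference chain.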

Page 60: Pharos Summer School  Fundamentals  of  Social Applications

Do It Yourself: IE Task

• A sample of text from the Wall Street Journal is given, together with a template

• The task is to fill the template with information about succession events extracted from the text

• There are six events in total, although complete information is not available for all of them

Text:

New York Times Co. named Russell T. Lewis, 45, president and general manager of its flagship New York Times newspaper, responsible for all business-side activities.

He was executive vice president and deputy general manager. He succeeds Lance R. Primis, who in September was named president and chief operating officer of the parent.

Template:

<ORGANIZATION-1>
NAME : "New York Times Co."

<ORGANIZATION-2>
NAME : "New York Times"

<PERSON-1>
NAME : "Russell T. Lewis"

<PERSON-2>
NAME : "Lance R. Primis"

http://gate.ac.uk/ie/ie_example.html

Page 61: Pharos Summer School  Fundamentals  of  Social Applications

Some Techniques : At a Glance

Page 62: Pharos Summer School  Fundamentals  of  Social Applications

What Tools Can I Use to Perform Information Extraction?

Page 63: Pharos Summer School  Fundamentals  of  Social Applications

An IE Toolkit: Lexical Resources

Ontology

Treebank

Dictionary

Brown

Penn Treebank

WordNet

Machine-readable corpora, dictionaries, etc., and tools for processing them

BCO

Tools

Parser

NER Tagger

UMLS

GENIA

VerbNet Comlex

Linguistic Data Consortium (LDC)

GATE

UIMA

Open Biomedical Ontology

Page 64: Pharos Summer School  Fundamentals  of  Social Applications

Part III: Evaluation in Information Extraction

Page 65: Pharos Summer School  Fundamentals  of  Social Applications

Evaluation

• We evaluate our systems to:
– See how they are behaving w.r.t. gold standard
– Compare them with other systems

• Types of Evaluations:
– Intrinsic: specific to extraction task
– Extrinsic: measured on a task that relies on the extraction, e.g. an Information Retrieval task

Page 66: Pharos Summer School  Fundamentals  of  Social Applications

Evaluation Precision / Recall

              Expert Yes   Expert No
System Yes    TP           FP
System No     FN           TN

Recall = TP / (TP + FN): fraction of correct/relevant answers which are predicted

Precision = TP / (TP + FP): fraction of predictions which are correct/relevant

Fall Out = FP / (FP + TN): proportion of incorrect class members (i.e., Expert No) that the system wrongly predicts
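The three formulas can be checked on a small example. The sets below are made-up item identifiers, and the function names are chosen for illustration only.

```python
def confusion_counts(system_yes, expert_yes, universe):
    """Return (TP, FP, FN, TN) for sets of item ids."""
    tp = len(system_yes & expert_yes)
    fp = len(system_yes - expert_yes)
    fn = len(expert_yes - system_yes)
    tn = len(universe - system_yes - expert_yes)
    return tp, fp, fn, tn


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def fall_out(fp, tn):
    return fp / (fp + tn) if fp + tn else 0.0


universe = set(range(10))     # all candidate items
system_yes = {0, 1, 2, 3}     # items the system extracted
expert_yes = {0, 1, 2, 4, 5}  # items the expert marked as correct

tp, fp, fn, tn = confusion_counts(system_yes, expert_yes, universe)
print(tp, fp, fn, tn)
print(recall(tp, fn), precision(tp, fp), fall_out(fp, tn))
```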

Page 67: Pharos Summer School  Fundamentals  of  Social Applications

F Measure

Combined measure for Precision and Recall

$$F = \frac{(B^2 + 1)\,P\,R}{B^2 P + R}$$

P = precision
R = recall
B = a factor that indicates the relative importance of recall and precision

When B = 1, recall and precision are of equal importance => harmonic mean (F1-measure)
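A direct transcription of the F measure, F = (B² + 1)PR / (B²P + R), with B passed as `beta`. Returning 0.0 when the denominator is zero is a convention assumed here, not something the slide states.

```python
def f_measure(precision, recall, beta=1.0):
    """F measure with weighting factor beta (the slide's B)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # assumed convention when both precision and recall are 0
    return (beta ** 2 + 1) * precision * recall / denominator


print(f_measure(0.5, 1.0))          # B = 1: harmonic mean of P and R
print(f_measure(0.5, 1.0, beta=2))  # B = 2: recall weighted more heavily
```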

Page 68: Pharos Summer School  Fundamentals  of  Social Applications

What Other Types of Metrics Exist Besides Precision and Recall?

Page 69: Pharos Summer School  Fundamentals  of  Social Applications

John saw Mary. He thought she was a very beautiful girl and she wore a new red dress.

Vilain Metric : Pron. Coreference

• Equivalence Class evaluation

– Groups built by system compared against gold standard (Key)

– Compare equivalence classes defined by links in key and computed values (Response)

A Model-Theoretic Coreference Scoring Scheme, Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, Lynette Hirschman

Coreference Chains

{Mary, girl, she}

{John, he}

Page 70: Pharos Summer School  Fundamentals  of  Social Applications

Vilain Recall: Concepts

Key Links: <A-B, B-C>
Response Links: { (A-C) }

S: equivalence class relative to Key
S = {A,B,C}, where |S| = 3

p(S): Response partition on S (from Key)
• intersection of S and Response
• elements in Key, not Response

p(S) = { (A-C), (B) }
|p(S)| = 2

c(S): minimal number of "correct links" to generate S
c(S) = (|S| - 1) = 2

m(S): no. of "missing" Response links
m(S) = (|p(S)| - 1) = 1
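Putting these quantities together, recall in the Vilain et al. scheme is the sum of (c(S) - m(S)) over all key classes divided by the sum of c(S), i.e. Σ(|S| - |p(S)|) / Σ(|S| - 1); precision swaps the roles of Key and Response. A sketch under that definition, with function names chosen for illustration:

```python
def partition_size(entity, other_classes):
    """|p(S)|: number of parts that S is split into by the other side's classes."""
    touched = set()
    singletons = 0
    for mention in entity:
        for index, cls in enumerate(other_classes):
            if mention in cls:
                touched.add(index)
                break
        else:
            singletons += 1  # mention is not linked to anything on the other side
    return len(touched) + singletons


def vilain_recall(key, response):
    numerator = sum(len(s) - partition_size(s, response) for s in key)
    denominator = sum(len(s) - 1 for s in key)
    return numerator / denominator if denominator else 0.0


def vilain_precision(key, response):
    # Same computation with Key and Response swapped.
    return vilain_recall(response, key)


key = [{"A", "B", "C"}]   # Key links A-B, B-C
response = [{"A", "C"}]   # Response link A-C
print(vilain_recall(key, response), vilain_precision(key, response))
```

For the slide's example this gives recall (3 - 2) / (3 - 1) = 0.5 and precision (2 - 1) / (2 - 1) = 1.0.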

Page 71: Pharos Summer School  Fundamentals  of  Social Applications

Vilain: Recall / Precision

Recall: links added to Response

Precision: links added to Key

[Figure: Key equivalence classes compared with Response equivalence classes]

Page 72: Pharos Summer School  Fundamentals  of  Social Applications

Do It Yourself: Vilain Metric

Page 73: Pharos Summer School  Fundamentals  of  Social Applications

Part IV: Exploiting Information Extraction with IR in Social Applications

Page 74: Pharos Summer School  Fundamentals  of  Social Applications

IE in Context

[Figure: IE in context. Labels: Spider, Document collection, Filter by relevance, IE (Segment, Classify, Associate, Cluster), Load DB, Database, Query / Search, Data mine, Create ontology, Label training data, Train extraction models]

Page 75: Pharos Summer School  Fundamentals  of  Social Applications

What Does an Entity Extraction Scenario Look Like?

Page 76: Pharos Summer School  Fundamentals  of  Social Applications

Scenario I: OKKAM tackling the Flood of Identifiers

http://en.wikipedia.org/wiki/Barack_obama

http://dbpedia.org/resource/Barack_Obama

http://www.linkedin.com/in/barackobama http://farm4.static.flickr.com/3193/2437394249_824e76ed76.jpg?v=0

http://current.com/index.php/items/89822170/obama_to_sign_stimulus_bill_today_in_denver.htm

http://www.facebook.com/home.php#/barackobama?ref=s

http://www.reuters.com/news/globalcoverage/barackobama

http://www.OPENCALAIS.com/watch?v=z4W2_raF_iw

??


Page 77: Pharos Summer School  Fundamentals  of  Social Applications

Information Extraction & OKKAMization

NER: detect named entity, decide about type (e.g. Person)

send ID Request (based on entity name, type + context information) -> OKKAM ENS

OKKAM ENS -> return OKKAM ID (or list of candidates)

attach ID to entity reference in text

http://www.okkam.org/ens/idb3016709-b9e1-42c0-ac5f-6383d2e5b235

=> prepare for information integration, entity centric search, semantic infusion (attachment of information about entity)

http://www.okkam.org/

Page 78: Pharos Summer School  Fundamentals  of  Social Applications

What Does an Event Extraction Scenario Look Like?

Page 79: Pharos Summer School  Fundamentals  of  Social Applications

Scenario II: Epidemic Intelligence

Goal: early identification of potential health threats:
• verification, assessment, investigation

State of Art: Event-Based
• web data
• NLP, Data Mining, Machine Learning techniques
• extract epidemic events from the unstructured text
• News, domain-specific reports, blogs

online news

Page 80: Pharos Summer School  Fundamentals  of  Social Applications

Event Mining for Early Detection, Rapid Response ...

Page 81: Pharos Summer School  Fundamentals  of  Social Applications

How Can Events Be Used in Pharos Audio-Visual Search?

Page 82: Pharos Summer School  Fundamentals  of  Social Applications

Scenario III: Facets in Pharos

• Event-Centric Search / Browsing
– Document representation no longer Bag-of-Words
– Events => N-ary relations between entities or classes
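A sketch of what an event-centric document representation might look like, as opposed to a bag of words. The `Event` class, the role names and the index function are hypothetical; the example values are taken from the ProMED-mail post used earlier in the deck.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_type: str
    arguments: tuple  # N-ary relation: (role, entity) pairs


def build_entity_index(events):
    """Index events by every entity that takes part in them."""
    index = defaultdict(list)
    for event in events:
        for _role, entity in event.arguments:
            index[entity].append(event)
    return index


events = [
    Event("disease_outbreak", (("disease", "avian influenza"),
                               ("location", "Egypt"),
                               ("date", "2009-06-08"))),
]
index = build_entity_index(events)
print(index["Egypt"][0].event_type)
```

Searching or browsing by entity then returns whole events, with their other participants, rather than documents that merely contain the word.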

Page 83: Pharos Summer School  Fundamentals  of  Social Applications

Scenario III: Extraction from Informal Text

• Transcribed Speech
– Discourse structure of „Speech Text“ differs from written text
– Transcription errors
– Missing orthographic features
  • Sentence Boundaries difficult to detect
  • Automatic Speech Recognition (ASR) Vocabulary Problem

• Blogs
– Affective, opinionated
– Topic fluctuating, prose
– Many authors, different style
– Inconsistent capitalization patterns
– Malformed sentences & phrases, slang, ...

Page 84: Pharos Summer School  Fundamentals  of  Social Applications

• Part V: Wrap Up & Conclusion

Page 85: Pharos Summer School  Fundamentals  of  Social Applications

What Considerations Do I Need to Make for My Information Extraction System?

Page 86: Pharos Summer School  Fundamentals  of  Social Applications

Considerations for an IE System

Description                                          Dimension
document structure of the input text                 free text; semi-structured
richness of the natural language processing (NLP)    shallow NLP; deep NLP
complexity of the pattern rules                      single slot; multiple slots
data size                                            training data; application data
degree of automation                                 supervised; semi-supervised; unsupervised
type of evaluation                                   gold standard corpus? evaluation measures used? evaluation of machine learning

Page 87: Pharos Summer School  Fundamentals  of  Social Applications

What Are Some Important Directions in Information Extraction?

Page 88: Pharos Summer School  Fundamentals  of  Social Applications

Research Trends in IE

Concept                                        Description
[1] Semi- / Un-Supervised, Self-Learning       Supervised methods assume: annotated documents; broad coverage; sufficient data redundancy
[2] Open Information Extraction                Target relations not known in advance
[3] Web Scale Systems                          Number of relations is large

Page 89: Pharos Summer School  Fundamentals  of  Social Applications

Research trends in IE

• Self-supervised Information Extraction at Web Scale
– KnowItAll: Extracting closed set of relations [Etzioni 2005]
– TextRunner: Extracting open set of relations [Banko 2007]
– Open IE: The Tradeoffs Between Open and Traditional Relation Extraction [Banko 2008]
– SRES [Feldman 2006], LEILA [Suchanek 2006]: Extracting closed relation set with more elaborate linguistic preprocessing

Scalability:
• Large set of seed relations (e.g. entire IMDB)
• Open ended corpora

Noise: Incorrect seed interpretations

Page 90: Pharos Summer School  Fundamentals  of  Social Applications

In Summary ....

Page 91: Pharos Summer School  Fundamentals  of  Social Applications

Information is No Longer Locked Away...

Events, Facts, Relationships, Opinions

Social Application Integration

Page 92: Pharos Summer School  Fundamentals  of  Social Applications

IR and IE Tradeoffs

• IE needs more CPU power; a suitable tradeoff is needed between data size, analysis depth, complexity, time, etc.

• Deeper analysis and complex template structures consume more time than shallow analysis and simple named entity recognition or binary relation extraction

• Ease of use needs improvement

Page 93: Pharos Summer School  Fundamentals  of  Social Applications

… Lighting the Way …

IE is acknowledged as an urgently needed information technology in a constantly growing digitized world

Who are the winners in a globalized information society?

… Those who outstrip competitors through comprehensive, integrated and precise access to digital information for decision making processes!

Page 94: Pharos Summer School  Fundamentals  of  Social Applications

Thank You

Page 95: Pharos Summer School  Fundamentals  of  Social Applications

Useful Tools

• ANNIE: Information Extraction System
– http://gate.ac.uk/ie/annie.html

• Stanford Parser
– http://nlp.stanford.edu:8080/parser/

• What's Wrong With My NLP?
– http://code.google.com/p/whatswrong/

• LingPipe
– http://alias-i.com/lingpipe/html
– http://www-nlp.stanford.edu/downloads/

Page 96: Pharos Summer School  Fundamentals  of  Social Applications

Useful Links

• Software Tools for NLP
– http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/nlp_tools.html

• Statistical NLP / corpus-based computational linguistics resources
– http://nlp.stanford.edu/links/statnlp.html

• Stanford NLP Group
– http://www-nlp.stanford.edu/downloads/

• Linguist List - Language and Resources
– http://www.linguistlist.org/langres/index.html

Page 97: Pharos Summer School  Fundamentals  of  Social Applications

Selected References

• Foundations of Statistical Natural Language Processing, Manning and Schütze
• Information Extraction, Moens
• Text Mining Handbook, Feldman, Sanger
• Maximum Entropy Model for NLP, Ratnaparkhi