BeeSpace Informatics Research
![Page 1: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/1.jpg)
BeeSpace Informatics Research
ChengXiang (“Cheng”) Zhai
Department of Computer Science
Institute for Genomic Biology
Statistics
Graduate School of Library & Information Science
University of Illinois at Urbana-Champaign
BeeSpace Workshop, May 21, 2007
![Page 2: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/2.jpg)
Overview of BeeSpace Technology
Literature Text
Search Engine
Words/Phrases Entities
Natural Language Understanding
Users
Function Annotator
Space/Region Manager, Navigation Support
Gene Summarizer …
Relational Database
Text Miner
Meta Data
Task Support
Space Navigation
Content Analysis
![Page 3: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/3.jpg)
Part 1: Content Analysis
![Page 4: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/4.jpg)
Natural Language Understanding
…We have cloned and sequenced
a cDNA encoding Apis mellifera ultraspiracle (AMUSP)
and examined its responses to …
[Figure: the sentence above annotated with noun-phrase (NP) and verb-phrase (VP) chunks and two Gene entity tags]
![Page 5: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/5.jpg)
Sample Technique 1: Automatic Gene Recognition
• Syntactic clues:
– Capitalization (especially acronyms)
– Numbers (gene families)
– Punctuation: -, /, :, etc.
• Contextual clues:
– Local: surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc.
– Global: same noun phrase occurs several times in the same article
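The clues listed above can be turned into binary features for a candidate phrase. The following is a minimal sketch, not the BeeSpace implementation: the function name `clue_features`, the context window size, and the toy phrase counts are all assumptions made for illustration.

```python
import re
from collections import Counter

# Local context words named on the slide as hints that a gene is nearby.
LOCAL_CUES = {"gene", "encoding", "regulation", "expressed"}

def clue_features(tokens, start, end, phrase_counts, window=5):
    """Binary clue features for the candidate phrase tokens[start:end].

    phrase_counts maps lower-cased phrases to their frequency in the
    article (the "global" clue). The window size is an arbitrary choice.
    """
    phrase = " ".join(tokens[start:end])
    context = tokens[max(0, start - window):start] + tokens[end:end + window]
    return {
        "has_capital": any(c.isupper() for c in phrase),
        "is_acronym": bool(re.fullmatch(r"[A-Z0-9]{2,}", phrase)),
        "has_digit": any(c.isdigit() for c in phrase),
        "has_punct": bool(re.search(r"[-/:]", phrase)),
        "local_cue": any(t.lower() in LOCAL_CUES for t in context),
        "repeated": phrase_counts[phrase.lower()] > 1,
    }

tokens = "a cDNA encoding Apis mellifera ultraspiracle ( AMUSP )".split()
counts = Counter({"amusp": 3})  # hypothetical article-level count
features = clue_features(tokens, 7, 8, counts)
```

Each feature on its own is weak evidence; a learned model such as the maximum entropy tagger on the next slide is what weighs them against each other.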
![Page 6: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/6.jpg)
Maximum Entropy Model for Gene Tagging
• Given an observation (a token or a noun phrase), together with its context, denoted as x
• Predict y ∈ {gene, non-gene}
• Maximum entropy model:
$$P(y \mid x) = K \exp\Big(\sum_i \lambda_i f_i(x, y)\Big)$$
• Typical f:
– y = gene & candidate phrase starts with a capital letter
– y = gene & candidate phrase contains digits
• Estimate $\lambda_i$ with training data
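As a small worked example of the model above, the sketch below evaluates P(y|x) for the two typical features named on the slide. The weights are made-up values rather than estimates from training data, and the normalizer K is computed by summing over both labels.

```python
import math

LABELS = ["gene", "non-gene"]

def f_capital(x, y):
    # Fires when y = gene and the candidate phrase starts with a capital letter.
    return 1.0 if y == "gene" and x["phrase"][:1].isupper() else 0.0

def f_digit(x, y):
    # Fires when y = gene and the candidate phrase contains digits.
    return 1.0 if y == "gene" and any(c.isdigit() for c in x["phrase"]) else 0.0

FEATURES = [f_capital, f_digit]

def maxent_prob(x, weights):
    """Return P(y|x) for every label under a log-linear (maximum entropy) model."""
    scores = {
        y: math.exp(sum(w * f(x, y) for w, f in zip(weights, FEATURES)))
        for y in LABELS
    }
    z = sum(scores.values())  # 1/K, the normalizing constant
    return {y: s / z for y, s in scores.items()}

weights = [1.0, 0.5]  # illustrative weights, not learned
probs = maxent_prob({"phrase": "CD38"}, weights)
neutral = maxent_prob({"phrase": "the"}, weights)
```

For "CD38" both features fire, so the gene label receives most of the probability mass; for "the" no feature fires and the two labels are equally likely.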
![Page 7: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/7.jpg)
Domain overfitting problem
• When a learning based gene tagger is applied to a domain different from the training domain(s), the performance tends to decrease significantly.
• The same problem occurs in other types of text, e.g., named entities in news articles.

| Training domain | Test domain | F1 |
|---|---|---|
| mouse | mouse | 0.541 |
| fly | mouse | 0.281 |
| Reuters | Reuters | 0.908 |
| Reuters | WSJ | 0.643 |
![Page 8: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/8.jpg)
Observation I
• Overemphasis on domain-specific features in the trained model
– Fly gene names: wingless, daughterless, eyeless, apexless, …
– “suffix –less” weighted high in the model trained from fly data
![Page 9: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/9.jpg)
Observation II
• Generalizable features: generalize well in all domains
– …decapentaplegic and wingless are expressed in analogous patterns in each primordium of… (fly)
– …that CD38 is expressed by both neurons and glial cells…that PABPC5 is expressed in fetal brain and in a range of adult tissues. (mouse)
“w_{i+2} = expressed” is generalizable
![Page 11: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/11.jpg)
Generalizability-based feature ranking
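[Figure: each training domain (fly, mouse, D3, …, Dm) ranks its features separately, with ranks 1 to 8 shown. In the fly column “-less” ranks above “expressed”; in the other columns “expressed” ranks above “-less”. The per-domain rankings are merged into a single generalizability-based ranking; the example scores shown are 0.125 and 0.167.]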
![Page 12: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/12.jpg)
Adapting Biological Named Entity Recognizer
[Figure: pipeline. Training data T1 … Tm → individual domain feature ranking (O1 … Om) → feature re-ranking (O’), separating generalizable features from domain-specific features → feature selection for D0, D1, …, Dm (top d0 features for D0, top d1 features for D1, …, top dm features for Dm) → learning the entity recognizer E on d features → testing on test data.]
d = λ0d0 + (1 – λ0)(λ1d1 + … + λmdm)
Weights: λ0, λ1, … , λm
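The feature-count formula on this slide is plain arithmetic, which the sketch below evaluates. The function name `feature_budget` and all numeric values are hypothetical; the slide does not say how the weights are chosen.

```python
def feature_budget(lambda0, d0, lambdas, ds):
    """Evaluate d = λ0·d0 + (1 − λ0)·(λ1·d1 + … + λm·dm)."""
    if len(lambdas) != len(ds):
        raise ValueError("need one weight per domain-specific feature count")
    domain_part = sum(l * d for l, d in zip(lambdas, ds))
    return lambda0 * d0 + (1 - lambda0) * domain_part

# Made-up example: 100 generalizable features, two domains with 40 and 60.
d = feature_budget(lambda0=0.5, d0=100, lambdas=[0.5, 0.5], ds=[40, 60])
```

With these made-up inputs the generalizable part contributes 50 and the two domains together contribute 25.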
![Page 13: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/13.jpg)
Effectiveness of Domain Adaptation

| Exp | Method | Precision | Recall | F1 |
|---|---|---|---|---|
| F+M→Y | Baseline | 0.557 | 0.466 | 0.508 |
| | Domain | 0.575 | 0.516 | 0.544 |
| | % Imprv. | +3.2% | +10.7% | +7.1% |
| F+Y→M | Baseline | 0.571 | 0.335 | 0.422 |
| | Domain | 0.582 | 0.381 | 0.461 |
| | % Imprv. | +1.9% | +13.7% | +9.2% |
| M+Y→F | Baseline | 0.583 | 0.097 | 0.166 |
| | Domain | 0.591 | 0.139 | 0.225 |
| | % Imprv. | +1.4% | +43.3% | +35.5% |

• Text data from BioCreAtIvE (Medline)
• 3 organisms (Fly, Mouse, Yeast)
![Page 14: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/14.jpg)
Gene Recognition in V3
• A variation of the basic maximum entropy
– Classes: {Begin, Inside, Outside}
– Features: syntactical features, POS tags, class labels of previous two tokens
– Post-processing to exploit global features
• Leverage existing toolkit: BMR
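With the {Begin, Inside, Outside} classes, gene mentions are recovered by grouping each Begin token with the Inside tokens that follow it. The sketch below shows only that decoding step; the helper name `bio_to_spans` and the example tag sequence are illustrative assumptions, not output of the V3 tagger.

```python
def bio_to_spans(tokens, tags):
    """Group Begin/Inside/Outside token tags into entity strings.

    An Inside tag that does not follow a Begin is ignored.
    """
    spans, start = [], None
    # A trailing Outside sentinel closes any entity still open at the end.
    for i, tag in enumerate(list(tags) + ["Outside"]):
        if start is not None and tag != "Inside":
            spans.append(" ".join(tokens[start:i]))
            start = None
        if tag == "Begin":
            start = i
    return spans

tokens = ["cDNA", "encoding", "Apis", "mellifera", "ultraspiracle", "(", "AMUSP", ")"]
# Hypothetical tagger output for the sentence fragment above.
tags = ["Outside", "Outside", "Begin", "Inside", "Inside", "Outside", "Begin", "Outside"]
genes = bio_to_spans(tokens, tags)
```

Predicting a class per token, with the previous two labels as features, lets the model handle multi-word names without enumerating candidate phrases in advance.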
![Page 15: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/15.jpg)
Part 2: Navigation Support
![Page 16: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/16.jpg)
Space-Region Navigation
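[Figure: Literature Spaces (Bee, Fly, Bird, Behavior, …) and Topic Regions (Bee Forager, Bird Singing, Fly Rover, …) linked by the operators MAP, EXTRACT and SWITCHING. Spaces can be combined with Intersection, Union, …, and so can regions. Users keep their own collections: My Spaces and My Regions/Topics.]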
![Page 17: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/17.jpg)
MAP: Topic/Region → Space
• MAP: Use the topic/region description as a query to search a given space
• Retrieval algorithm:
– Query word distribution: $p(w \mid \theta_Q)$
– Document word distribution: $p(w \mid \theta_D)$
– Score a document based on similarity of Q and D
$$\mathrm{score}(Q, D) = -D(\theta_Q \,\|\, \theta_D) = -\sum_{w \in \text{Vocabulary}} p(w \mid \theta_Q) \log \frac{p(w \mid \theta_Q)}{p(w \mid \theta_D)}$$
• Leverage existing retrieval toolkits: Lemur/Indri
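The scoring rule can be sketched in a few lines: estimate a word distribution for the query and for each document, then rank documents by negative KL divergence. This is a toy illustration rather than how Lemur/Indri implement it. The interpolation with a collection-wide background model (weight `mu`) is an assumption added here so that unseen words do not get zero probability, and the two documents are made up.

```python
import math
from collections import Counter

def unigram(text):
    """Maximum likelihood unigram distribution over whitespace tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_score(query_lm, doc_lm, background_lm, mu=0.5):
    """Negative KL divergence between the query and smoothed document models.

    Only words with p(w|Q) > 0 contribute to the sum.
    """
    score = 0.0
    for w, pq in query_lm.items():
        pd = (1 - mu) * doc_lm.get(w, 0.0) + mu * background_lm.get(w, 0.0)
        pd = max(pd, 1e-12)  # guard for words absent from the whole collection
        score -= pq * math.log(pq / pd)
    return score

docs = {
    "d1": "honey bee foraging behavior and nectar collection",
    "d2": "flight muscle actin filaments",
}
background = unigram(" ".join(docs.values()))
query = unigram("foraging behavior")
ranking = sorted(
    docs,
    key=lambda d: kl_score(query, unigram(docs[d]), background),
    reverse=True,
)
```

Because the query model is fixed across documents, ranking by negative KL divergence is equivalent to ranking by the cross entropy term, which is why only the query's words need to be visited.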
![Page 18: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/18.jpg)
EXTRACT: Space → Topic/Region
• Assume k topics, each being represented by a word distribution
• Use a k-component mixture model to fit the documents in a given space (EM algorithm)
• The estimated k component word distributions are taken as k topic regions

Likelihood:
$$\log p(C \mid \Lambda) = \sum_{D \in C} \sum_{i=1}^{|D|} \log\Big[\lambda\, p(D_i \mid \theta_B) + (1 - \lambda) \sum_{j=1}^{k} \pi_j\, p(D_i \mid \theta_j)\Big]$$

Maximum likelihood estimator: $\Lambda^* = \arg\max_{\Lambda} p(C \mid \Lambda)$

Bayesian estimator: $\Lambda^* = \arg\max_{\Lambda} p(\Lambda \mid C) = \arg\max_{\Lambda} p(C \mid \Lambda)\, p(\Lambda)$
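A rough sketch of fitting such a mixture with EM follows. It rests on several assumptions that the slide does not confirm: the background distribution is the collection's word frequencies, the background weight `lam` is fixed rather than estimated, and the topic mixing weights are kept per document (as in PLSA-style models). The function name `extract_topics` and the toy documents are hypothetical.

```python
import random
from collections import Counter

def extract_topics(docs, k, lam=0.5, iters=30, seed=0):
    """Fit k topic word distributions plus a fixed background model with EM."""
    rng = random.Random(seed)
    counts = [Counter(d) for d in docs]
    vocab = sorted({w for c in counts for w in c})
    total = sum(sum(c.values()) for c in counts)
    background = {w: sum(c[w] for c in counts) / total for w in vocab}

    # Random topic initialisation; uniform per-document mixing weights.
    topics = []
    for _ in range(k):
        raw = {w: rng.random() + 0.1 for w in vocab}
        z = sum(raw.values())
        topics.append({w: v / z for w, v in raw.items()})
    pi = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        topic_counts = [dict.fromkeys(vocab, 0.0) for _ in range(k)]
        for d, c in enumerate(counts):
            doc_counts = [0.0] * k
            for w, n in c.items():
                mix = [pi[d][j] * topics[j][w] for j in range(k)]
                mix_sum = sum(mix)
                topical = (1 - lam) * mix_sum
                # E-step: probability the word was NOT generated by the background.
                p_topical = topical / (lam * background[w] + topical)
                for j in range(k):
                    expected = n * p_topical * mix[j] / mix_sum
                    topic_counts[j][w] += expected
                    doc_counts[j] += expected
            # M-step for this document's mixing weights.
            z = sum(doc_counts)
            pi[d] = [x / z for x in doc_counts]
        # M-step for the topic word distributions.
        for j in range(k):
            z = sum(topic_counts[j].values())
            topics[j] = {w: v / z for w, v in topic_counts[j].items()}
    return topics, pi

docs = [
    ["foraging", "nectar", "food", "foraging"],
    ["nectar", "food", "foraging"],
    ["muscle", "actin", "filaments", "muscle"],
    ["actin", "filaments", "muscle"],
]
topics, pi = extract_topics(docs, k=2)
```

The background component absorbs words that are common across the whole space, so the k topic distributions can concentrate on discriminative words. EM only finds a local maximum, so results depend on the random initialisation.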
![Page 19: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/19.jpg)
A Sample Topic & Corresponding Space
Word Distribution (language model):
filaments 0.0410238; muscle 0.0327107; actin 0.0287701; z 0.0221623; filament 0.0169888; myosin 0.0153909; thick 0.00968766; thin 0.00926895; sections 0.00924286; er 0.00890264; band 0.00802833; muscles 0.00789018; antibodies 0.00736094; myofibrils 0.00688588; flight 0.00670859; images 0.00649626

Meaningful labels:
actin filaments; flight muscle; flight muscles

Example documents:
• actin filaments in honeybee-flight muscle move collectively
• arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections
• identification of a connecting filament protein in insect fibrillar flight muscle
• the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles
• structure of thick filaments from insect flight muscle
![Page 20: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/20.jpg)
Incorporating Topic Priors
• Either topic extraction or clustering:
– User exploration: the user usually has a preference.
– E.g., wanting one topic/cluster to be about foraging behavior
• Use a prior to guide topic extraction
– Prior as a simple language model
– E.g. forage 0.2; foraging 0.3; food 0.05; etc.
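One common way to let a prior language model steer a topic is to add it as pseudo-counts when re-estimating that topic's word distribution. The sketch below shows that idea only. It is an assumption about the general approach, not a reconstruction of the update used in BeeSpace; the prior strength `mu`, the function name, and the expected counts are made up, while the prior values come from the example above.

```python
def m_step_with_prior(expected_counts, prior, mu):
    """Re-estimate a topic's word distribution with prior pseudo-counts.

    p(w) = (c(w) + mu * prior(w)) / (sum of c + mu * sum of prior)
    With mu = 0 this reduces to the plain maximum likelihood estimate.
    """
    vocab = set(expected_counts) | set(prior)
    total = sum(expected_counts.values()) + mu * sum(prior.values())
    return {
        w: (expected_counts.get(w, 0.0) + mu * prior.get(w, 0.0)) / total
        for w in vocab
    }

prior = {"forage": 0.2, "foraging": 0.3, "food": 0.05}  # from the slide's example
counts = {"foraging": 2.0, "nectar": 3.0, "age": 5.0}   # made-up expected counts
with_prior = m_step_with_prior(counts, prior, mu=10.0)
without_prior = m_step_with_prior(counts, prior, mu=0.0)
```

A larger `mu` pulls the topic more strongly toward the user's preferred words, while a small `mu` lets the data dominate.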
![Page 21: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/21.jpg)
Incorporating a Topic Prior
Original EM:
EM with Prior:
Prior
Prior
![Page 22: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/22.jpg)
Incorporating Topic Priors: Sample Topic 1
age 0.0672687; division 0.0551497; labor 0.052136; colony 0.038305; foraging 0.0357817; foragers 0.0236658; workers 0.0191248; task 0.0190672; behavioral 0.0189017; behavior 0.0168805; older 0.0143466; tasks 0.013823; old 0.011839; individual 0.0114329; ages 0.0102134; young 0.00985875; genotypic 0.00963096; social 0.00883439

Prior: labor 0.2; division 0.2
![Page 23: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/23.jpg)
Incorporating Topic Priors: Sample Topic 2
behavioral 0.110674; age 0.0789419; maturation 0.057956; task 0.0318285; division 0.0312101; labor 0.0293371; workers 0.0222682; colony 0.0199028; social 0.0188699; behavior 0.0171008; performance 0.0117176; foragers 0.0110682; genotypic 0.0106029; differences 0.0103761; polyethism 0.00904816; older 0.00808171; plasticity 0.00804363; changes 0.00794045

Prior: behavioral 0.2; maturation 0.2
![Page 24: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/24.jpg)
Exploit Prior for Concept Switching

First word distribution:
foraging 0.290076; nectar 0.114508; food 0.106655; forage 0.0734919; colony 0.0660329; pollen 0.0427706; flower 0.0400582; sucrose 0.0334728; source 0.0319787; behavior 0.0283774; individual 0.028029; rate 0.0242806; recruitment 0.0200597; time 0.0197362; reward 0.0196271; task 0.0182461; sitter 0.00604067; rover 0.00582791; rovers 0.00306051

Second word distribution:
foraging 0.142473; foragers 0.0582921; forage 0.0557498; food 0.0393453; nectar 0.03217; colony 0.019416; source 0.0153349; hive 0.0151726; dance 0.013336; forager 0.0127668; information 0.0117961; feeder 0.010944; rate 0.0104752; recruitment 0.00870751; individual 0.0086414; reward 0.00810706; flower 0.00800705; dancing 0.00794827; behavior 0.00789228
![Page 25: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/25.jpg)
Part 3: Task Support
![Page 26: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/26.jpg)
Gene Summarization
• Task: Automatically generate a text summary for a given gene
• Challenge: Need to summarize different aspects of a gene
• Standard summarization methods would generate an unstructured summary
• Solution: A new method for generating semi-structured summaries
![Page 27: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/27.jpg)
An Ideal Gene Summary
• http://flybase.bio.indiana.edu/.bin/fbidq.html?FBgn0000017
GP
EL
SI
GI
MP
WFPI
![Page 28: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/28.jpg)
Semi-structured Text Summarization
![Page 29: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/29.jpg)
Summary example (Abl)
![Page 30: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/30.jpg)
A General Entity Summarizer
• Task: Given any entity and k aspects to summarize, generate a semi-structured summary
• Assumption: Training sentences available for each aspect
• Method:
– Train a recognizer for each aspect
– Given an entity, retrieve sentences relevant to the entity
– Classify each sentence into one of the k aspects
– Choose the best sentences in each category
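The four steps of the method can be sketched end to end as below. Everything concrete here is a stand-in: the word-count "recognizer", the substring match used for retrieval, the aspect names, and the toy sentences are assumptions for illustration, not the classifiers or data used in BeeSpace.

```python
from collections import Counter

def train_recognizers(training):
    """Step 1: one bag-of-words model per aspect (a stand-in for a real classifier)."""
    return {
        aspect: Counter(w for s in sents for w in s.lower().split())
        for aspect, sents in training.items()
    }

def summarize(entity, sentences, recognizers, per_aspect=1):
    """Steps 2 to 4: retrieve, classify by aspect, keep the best sentences."""
    # Step 2: retrieve sentences that mention the entity.
    relevant = [s for s in sentences if entity.lower() in s.lower()]
    buckets = {aspect: [] for aspect in recognizers}
    for s in relevant:
        words = s.lower().split()
        # Step 3: score the sentence against each aspect and pick the best.
        scores = {
            aspect: sum(model[w] for w in words) / len(words)
            for aspect, model in recognizers.items()
        }
        best = max(scores, key=scores.get)
        buckets[best].append((scores[best], s))
    # Step 4: keep the top-scoring sentences in each aspect.
    return {
        aspect: [s for _, s in sorted(scored, reverse=True)[:per_aspect]]
        for aspect, scored in buckets.items()
    }

# Hypothetical aspects and training sentences.
training = {
    "expression": ["the gene is expressed in the brain", "expressed in fetal tissues"],
    "phenotype": ["mutants show defective wings", "mutant flies are sterile"],
}
sentences = [
    "Abl is expressed in the embryonic brain",
    "Abl mutants show defective axons",
    "Unrelated sentence about bees",
]
summary = summarize("Abl", sentences, train_recognizers(training))
```

Because the aspects are supplied as training data rather than hard-coded, the same pipeline applies to any entity type for which labelled sentences are available, which is the generality the slide claims.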
![Page 31: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/31.jpg)
Summary
• All the methods we developed are
– General
– Scalable
• The problems are hard, but good progress has been made in all the directions
– The V3 system has only incorporated the basic research results
– More advanced technologies are available for immediate implementation
  • Better tokenization for retrieval
  • Domain adaptation techniques
  • Automatic topic labeling
  • General entity summarizer
• More research to be done in
– Entity & relation extraction
– Graph mining/question answering
– Domain adaptation
– Active learning
![Page 32: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/32.jpg)
Looking Ahead: X-Space…
Literature Text
Search Engine
Words/Phrases Entities
Natural Language Understanding
Users
Function Annotator
Space/Region Manager, Navigation Support
Gene Summarizer …
Relational Database
Text Miner
Meta Data
Task Support
Space Navigation
Content Analysis
![Page 33: BeeSpace Informatics Research](https://reader035.fdocuments.us/reader035/viewer/2022081604/56812a67550346895d8dea65/html5/thumbnails/33.jpg)
Thank You!
Questions?