1TIM
Ta Nha Linh
13 March 2009
Harvesting useful information on researchers' home pages
Ta Nha Linh
Supervisor: Asst. Prof. Min-Yen Kan
Outline
• Motivation
• Challenges
• Researchers Information COllector (RICO)
• Contributions
• Future Works
Motivation
• Databases dedicated to scientific publications: CiteSeer, Google Scholar, ACM Portal, SpringerLink
• How about the authors of those publications?
• Publication-centric.
Motivation
• Researcher-centric database?
– Singapore Researchers Database: researchers sign up and enter their own data, restricted conditions, in Singapore only
– Resilience Alliance Researchers Database: manual submission by researchers, in ecological and social sciences
– Some other similar databases: manual update, specific to a certain organization
• Goal: Automated system to build researchers database, for multiple disciplines
• Input: Researchers’ home pages.
– Basic information
– Contact information
– Educational history
– Publications
Challenges
• Different layouts
– Templates
– Personal pages
• Different content
– Pages introducing researchers
– CV-like
– Personal pages
• Different content structures
– Tables / lists
– Natural language text
Challenges
• Different data presentations
– hangli at microsoft dot com
– cs.duke.edu, junyang
– [email protected]
– erafalin(at)cs.tufts.edu
– <Image src=’email.jpg’/>
– Natalio.Krasnogor -replace all this by at symbol- nottingham.ac.uk
– wmt then the at-sign then uci dot edu
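To make the "different data presentations" challenge concrete, here is a minimal Python sketch that normalizes two of the simpler obfuscation styles shown above. It is not part of RICO as described in the slides: the function name `normalize_email`, the regular expressions, and the decision to return `None` for anything unrecognized are all assumptions made for illustration. Image-based and free-prose variants are deliberately left unhandled.

```python
import re
from typing import Optional

# Hypothetical patterns: "(at)", "[at]" or a free-standing " at " become "@",
# and likewise for "dot". These cover only the simplest obfuscations.
AT_RE = re.compile(r"\s*[(\[]\s*at\s*[)\]]\s*|\s+at\s+", re.IGNORECASE)
DOT_RE = re.compile(r"\s*[(\[]\s*dot\s*[)\]]\s*|\s+dot\s+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize_email(text: str) -> Optional[str]:
    """Return a plain address if the text can be de-obfuscated, else None."""
    candidate = AT_RE.sub("@", text.strip())
    candidate = DOT_RE.sub(".", candidate)
    # Reject anything that still does not look like a single address.
    return candidate if EMAIL_RE.fullmatch(candidate) else None


if __name__ == "__main__":
    for sample in ("hangli at microsoft dot com",
                   "erafalin(at)cs.tufts.edu",
                   "wmt then the at-sign then uci dot edu"):
        print(sample, "->", normalize_email(sample))
```

The third sample is returned as `None`, which illustrates why hand-written rules alone do not scale to the variety seen on real home pages.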
Researchers Information COllector (RICO)
• Field Identification
• Home page Identification
• Post Processing
RICO - Architecture
Home page Identification
Field Identification
Post-Processing
Field Identification - Purpose
• To map data in the page contents to the corresponding fields in a predefined set of desired information.
• Current set includes:
– Name, Position, Affiliation
– Address, Phone, Fax, Email
– BS year, BS major, BS university
– MS year, MS major, MS university
– PhD year, PhD major, PhD university
– Research Interest, Publications
Field Identification - Related works
• Tang et al. (2007), (2008) – ArnetMiner
– Preprocessing: tokenize text into 5 categories
– Tagging of tokens using a Conditional Random Field (CRF)
– F1 = 83.37% (~1,000 researchers)
– Set of features used:
+ Content features (word, morphological, image features)
+ Pattern features (positive word, special token, researcher name features)
+ Term features (term, dictionary features)
Field Identification - Related works
• Tang et al. (2007), (2008) – ArnetMiner
– Takes the researcher’s name as input. This is an important piece of information that can be used when parsing other fields. This differs from TIM.
– Based only on the text of the page. Stylistic information could also be of use.
Field Identification - Methodology
• Input: a researcher home page
• CRF is the learning model
• Features used
– Global features
– Lexicon features
– Context features
– Dictionary features
– Stylistic features
Field Identification - Methodology
• Global features: apply to the current token
– Morphological features
– Initials
– Number
– Punctuation
• Lexicon features: apply to the current token
– Positive words for certain annotation fields: Position, Affiliation, Address, Phone, Fax, Email
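The token-level features above could be computed along the following lines. This is only a sketch: the slides name the feature groups but not their exact definitions, so the individual tests (`is_initial`, `has_digit`, and so on) and the contents of the hypothetical `POSITIVE_WORDS` lexicon are assumptions chosen for illustration.

```python
import string

# Hypothetical positive-word lexicon; the real word lists are not given in the slides.
POSITIVE_WORDS = {
    "position": {"professor", "lecturer", "researcher"},
    "affiliation": {"university", "department", "institute"},
    "address": {"street", "road", "avenue"},
    "phone": {"phone", "tel"},
    "fax": {"fax"},
    "email": {"email", "e-mail"},
}


def token_features(token: str) -> dict:
    """Global (morphological) and lexicon features for a single token."""
    bare = token.lower().strip(string.punctuation)
    feats = {
        # Morphological features
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        # Initials such as "J."
        "is_initial": len(token) == 2 and token[0].isupper() and token[1] == ".",
        # Number features
        "is_number": token.isdigit(),
        "has_digit": any(c.isdigit() for c in token),
        # Punctuation feature
        "is_punct": bool(token) and all(c in string.punctuation for c in token),
    }
    # Lexicon features: one boolean per annotation field
    for field, words in POSITIVE_WORDS.items():
        feats["pos_" + field] = bare in words
    return feats


if __name__ == "__main__":
    print(token_features("Professor"))
```

Each feature is a boolean, which maps directly onto one column per feature in a CRF training file.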
Field Identification - Methodology
• Context features: apply to the whole line
– Name context
– Address context
– Phone context: 'phone', 'tel', 'mobile'
– Fax context: 'fax', 'facsimile'
– Email context: 'email', 'e-mail'
– Bachelor (BS) context: appearance of 'B.S' or 'BS' or 'Bachelor'
– Master (MS) context: appearance of 'M.S' or 'MS' or 'Master'
– Ph.D (PhD) context: appearance of 'Ph.D' or 'Doctorate' or 'Doctor(ate) of Philosophy'
– Research-interest context: multiple line property
– Publication context: multiple line property
– Degree: helps to correctly differentiate BS/MS/PhD info when they are presented in prose style / on the same line.
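A line-level version of the keyword contexts listed above might look like the sketch below. The keywords are taken from the slide; everything else is an assumption: the function name `context_features`, the case-insensitive whole-word matching for contact keywords, and the case-sensitive prefix matching for degree keywords. The name, address, research-interest and publication contexts are omitted because the slide does not say how they are detected.

```python
import re

# Contact keywords as listed on the slide; matched case-insensitively as whole words.
CONTACT_KEYWORDS = {
    "phone": ("phone", "tel", "mobile"),
    "fax": ("fax", "facsimile"),
    "email": ("email", "e-mail"),
}

# Degree keywords as listed on the slide; matched case-sensitively (an assumption)
# so that ordinary lowercase words such as "ms" in a URL are not picked up.
DEGREE_PATTERNS = {
    "bs": re.compile(r"\b(?:B\.S|BS|Bachelor)"),
    "ms": re.compile(r"\b(?:M\.S|MS|Master)"),
    "phd": re.compile(r"\b(?:Ph\.D|Doctorate|Doctor of Philosophy)"),
}


def context_features(line: str) -> dict:
    """Boolean context features computed once per line of page text."""
    lower = line.lower()
    feats = {}
    for name, keywords in CONTACT_KEYWORDS.items():
        feats[name + "_context"] = any(
            re.search(r"\b" + re.escape(k) + r"\b", lower) for k in keywords
        )
    for name, pattern in DEGREE_PATTERNS.items():
        feats[name + "_context"] = pattern.search(line) is not None
    return feats


if __name__ == "__main__":
    print(context_features("Tel: +65 1234 / Fax: +65 5678"))
```

Because these features are computed per line, every token on the line would carry the same values, which is how a contact label such as "Tel:" can influence the tag of the digits that follow it.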
Field Identification - Methodology
• Dictionaries
– ParsCit dictionary: detects male names, female names, popular last names, month names, place names, publisher names; each is a single feature
– Major dictionary: helps identify researchers' majors in their educational history; may also help with Research Interests
– Research dictionary: entries classified into high/mid/low confidence
– Universities dictionary: names of most universities, according to the Open Directory
Field Identification - Methodology
• Stylistic features
– List feature
– Table features
– Section feature: based on HTML tags like <div>, <p>, <title>, header tags, list elements, table
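One way to derive list, table and section features from the page markup is to track the stack of open tags while parsing, as in the sketch below. It uses Python's standard `html.parser` module; the class name `StyleTagger`, the helper `style_features`, the tag sets and the "innermost enclosing section tag" rule are assumptions for illustration, not the definitions used in RICO.

```python
from html.parser import HTMLParser

LIST_TAGS = {"ul", "ol", "li"}
TABLE_TAGS = {"table", "tr", "td", "th"}
SECTION_TAGS = {"div", "p", "title", "h1", "h2", "h3", "h4", "h5", "h6", "li", "table"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # never have a closing tag


class StyleTagger(HTMLParser):
    """Collects (text, features) pairs, one per non-empty text chunk."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        section = next((t for t in reversed(self.stack) if t in SECTION_TAGS), "none")
        self.chunks.append((text, {
            "in_list": any(t in LIST_TAGS for t in self.stack),
            "in_table": any(t in TABLE_TAGS for t in self.stack),
            "section": section,
        }))


def style_features(html: str):
    tagger = StyleTagger()
    tagger.feed(html)
    tagger.close()
    return tagger.chunks


if __name__ == "__main__":
    page = "<body><h1>Jane Doe</h1><ul><li>Phone: 123</li></ul></body>"
    for text, feats in style_features(page):
        print(text, feats)
```

Real home pages often contain malformed HTML, so a production version would likely need a more forgiving parser than this stack-based sketch.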
Field Identification - Performance
Data set of 40 home pages, cross validation
Overall Precision: 70.66 – Recall: 62.73 – F1: 64.87

Classes             Precision   Recall    F1
name                75.66%      51.34%    61.17
phone               53.38%      89.25%    66.80
fax                 47.73%      72.41%    57.53
email               79.31%      70.77%    74.80
address             78.90%      74.57%    76.67
affiliation         30.27%      59.47%    40.12
position            79.46%      64.49%    71.20
research-interest   48.48%      36.04%    41.34
publications        71.05%      43.27%    53.79
bs-major            88.89%      78.05%    83.12
bs-uni              68.67%      57.00%    62.30
bs-year             90.00%      72.00%    80.00
ms-major            71.43%      32.26%    44.44
ms-uni              52.94%      52.94%    52.94
ms-year             77.78%      56.00%    65.12
phd-major           83.33%      73.17%    77.92
phd-uni             74.56%      72.03%    73.28
phd-year            100.00%     74.07%    85.11
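As a check on how the figures relate, the short Python sketch below recomputes F1 as the harmonic mean of precision and recall for one class, and averages the per-class F1 values. The numbers are copied from the table; the observation that the overall F1 matches an unweighted (macro) mean of the 18 per-class scores comes from this arithmetic, not from anything stated on the slide.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Per-class F1 values, in the order they appear in the table.
PER_CLASS_F1 = [
    61.17, 66.80, 57.53, 74.80, 76.67, 40.12, 71.20, 41.34, 53.79,
    83.12, 62.30, 80.00, 44.44, 52.94, 65.12, 77.92, 73.28, 85.11,
]


def macro_average(values) -> float:
    """Unweighted mean over classes."""
    return sum(values) / len(values)


if __name__ == "__main__":
    print(round(f1(75.66, 51.34), 2))              # F1 for the 'name' class
    print(round(macro_average(PER_CLASS_F1), 2))   # mean of per-class F1
```

A macro average weights every class equally, so weak classes such as affiliation and research-interest pull the overall score down regardless of how often they occur.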
Field Identification - Discussion
• Data fields to be annotated are similar to those of ArnetMiner.
– Extra: Name, Research Areas, Publications
– Missing: Image
• Use of stylistic features is minimal
Field Identification - Discussion
• F1 value is significantly lower than ArnetMiner’s
– ArnetMiner has the researcher name as input, and uses features referring to the researcher name to identify other fields. RICO has no prior knowledge about the page to be parsed.
→ Heuristic to improve confidence of ‘Name’
→ Make use of Affiliation name input
– Identifying ‘Research Interest’ and ‘Publications’ is challenging.
→ Improve ‘Publications’
Home page Identification - Purpose
• Add-on component
• To complete automation of the system
Home page Identification – Related works
• Ahoy!
– Input: Researcher name and (optional) institution name
– “Home page”: allocated page, classified by URL patterns
• RICO
– Input: Institution name
– “Home page”: allocated page with biographical information, classified by contents
Home page Identification – Methodology
• Collect a list of university domains
• Use Yahoo! BOSS to search for professors in the institutions
• For each valid web page, fetch the page and scan for words indicating ‘phone’, ‘mail’ and ‘professor’.
• Classify by the number of appearances of the keywords.
• Home pages are passed to the Field Identification component.
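The keyword-counting step could be sketched as follows. The three keywords come from the slide; the substring matching, the function names `keyword_counts` and `looks_like_home_page`, and both thresholds are assumptions, since the slides do not state the actual decision rule. Fetching pages and querying the search API are left out.

```python
KEYWORDS = ("phone", "mail", "professor")


def keyword_counts(page_text: str) -> dict:
    """Count case-insensitive substring occurrences of each keyword."""
    lower = page_text.lower()
    return {k: lower.count(k) for k in KEYWORDS}


def looks_like_home_page(page_text: str, min_total: int = 3, min_distinct: int = 2) -> bool:
    """Hypothetical rule: enough keyword hits, spread over enough keywords."""
    counts = keyword_counts(page_text)
    total = sum(counts.values())
    distinct = sum(1 for c in counts.values() if c > 0)
    return total >= min_total and distinct >= min_distinct


if __name__ == "__main__":
    sample = ("Professor Jane Doe. Phone: 123. "
              "Email: jane at example dot edu. Mailing address: 1 Main St.")
    print(keyword_counts(sample), looks_like_home_page(sample))
```

Substring matching means 'mail' also counts 'email' and 'mailing', which seems to fit the slide's wording "words indicating" but would need tuning against real pages.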
Home page Identification – Discussion
• The query used cannot retrieve all relevant pages. It is tuned for the majority case: professors in institutions.
– Target researchers in research organizations.
• Drawback: the result set from Yahoo! BOSS may contain duplicate pages, or sub-pages of a researcher’s home page, which are treated as 2 different records.
– Needs high confidence in overall system performance. But researcher names are not unique.
– Best if duplicates can be eliminated by analyzing URLs. But domain hierarchies differ within a department, between departments, and between institutions.
Post-processing - Purpose
• Input: CRF++ output file from Field Identification.
• Group neighboring tokens identified with the same annotation tag
• Deduplication
• Store into database (current size ~ 170,000 researchers)
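The grouping and deduplication steps can be illustrated with a small Python sketch. It assumes the CRF++ output has already been read into `(token, tag)` pairs and that untagged tokens carry a placeholder tag `"O"`; both the pair format and the `"O"` label are assumptions, as are the function names and the case-insensitive duplicate test.

```python
from itertools import groupby


def group_tokens(tagged, outside="O"):
    """Merge runs of neighbouring tokens that share a tag into (tag, text) records."""
    records = []
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        if tag == outside:
            continue  # skip tokens that belong to no field
        records.append((tag, " ".join(token for token, _ in run)))
    return records


def deduplicate(records):
    """Drop repeated (tag, value) records, keeping the first occurrence."""
    seen = set()
    unique = []
    for tag, value in records:
        key = (tag, value.lower())
        if key not in seen:
            seen.add(key)
            unique.append((tag, value))
    return unique


if __name__ == "__main__":
    tagged = [("Jane", "name"), ("Doe", "name"), ("is", "O"),
              ("Professor", "position"), ("Jane", "name"), ("Doe", "name")]
    print(deduplicate(group_tokens(tagged)))
```

`groupby` only merges adjacent items, which matches the "neighboring tokens" wording: the same tag appearing later in the page yields a separate record that deduplication can then collapse.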
Contribution
• Produced an automated system for fetching researchers’ information from the World Wide Web.
• Introduced a number of features for Field Identification machine learning.
Future improvements
• Field Identification
– Introduce more features, especially stylistic features
– Strengthen features targeting Name, Research Interest and Publications tags
– Cater for the <image> tag
– Be able to handle pages using HTML frames
– Be able to follow links on the page if necessary
• Home page Identification
– Improve heuristics
• Post-processing
– Be able to refine output from Field Identification
• A new component providing a front end for users to query the database
THANK YOU!
Question?