Information Extraction on the Web
Chia-Hui Chang
Department of Computer Science & Information Engineering
National Central [email protected]
Outline
What is information extraction?
Document types
Applications
Wrapper induction
Automatic wrapper generator
Conclusions
What is information extraction?
An information extraction system is a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically.
Example: a parser takes as input a sequence of lexical items and perhaps small-scale structures (phrases), and outputs a set of parse tree fragments, possibly complete.
Modules
Text Zoner: turns a text into a set of text segments.
Preprocessor: turns a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes.
Filter: turns a set of sentences into a smaller set of sentences by filtering out the irrelevant ones.
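The three modules above can be chained into a small cascade. The following is only a minimal sketch under stated assumptions: the function names, the blank-line segmentation, the punctuation-based sentence splitting, the single lexical attribute (`is_capitalized`) and the keyword filter are all illustrative choices, not the modules of any particular system.

```python
import re

def text_zoner(text):
    """Turn a text into text segments (assumption: segments are blank-line separated)."""
    return [seg.strip() for seg in re.split(r"\n\s*\n", text) if seg.strip()]

def preprocessor(segment):
    """Turn a segment into sentences; each sentence is a list of lexical items."""
    sentences = []
    for raw in re.split(r"(?<=[.!?])\s+", segment):
        words = re.findall(r"[A-Za-z0-9()'-]+", raw)
        if words:
            # A lexical item is a word plus its attributes (only one toy attribute here).
            sentences.append([{"word": w, "is_capitalized": w[0].isupper()} for w in words])
    return sentences

def sentence_filter(sentences, keywords):
    """Keep only the sentences that mention at least one keyword."""
    return [s for s in sentences if any(item["word"].lower() in keywords for item in s)]

def run_cascade(text, keywords):
    """Apply zoner, preprocessor and filter in sequence."""
    kept = []
    for segment in text_zoner(text):
        kept.extend(sentence_filter(preprocessor(segment), keywords))
    return kept

sample = ("The weather is nice today.\n\n"
          "The department has a faculty position opening. Please call for details.")
relevant = run_cascade(sample, {"position", "call"})
print([" ".join(item["word"] for item in s) for s in relevant])
```

Each stage adds structure (segments, sentences, lexical items) and the filter discards information judged irrelevant, which mirrors the definition of a cascade given earlier.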
Document types
Plain text (sentence after sentence, plain narrative): uses lexical and semantic analysis. Examples: AutoSlog (Riloff 93), LIEP (Huffman 95), CRYSTAL (Soderland 95), HASTEN (Krupka 95).
Web page (semi-structured document): uses features of HTML syntax, i.e. tags. Heuristics derived from observation: layout.
Applications
Meta search engines
Information agents: oriented toward a specific purpose, for example:
  News agent (news spider) that gathers news
  Shopping price comparison
  Job hunting
  ShopBot (Doorenbos 97), Software LEGO (Hsu 99)
Information Integration Systems
[Figure: information integration architecture. Heterogeneous data sources (text, images/video, spreadsheets; hierarchical and network databases; relational databases; object and knowledge bases) are accessed through wrappers (SQL, ORB) that perform translation and wrapping. Mediators perform semantic integration and mediation. The information integration service offers user services (query, monitor, update) and agent/module coordination to human and computer users. Information moves from unprocessed, unintegrated details to abstracted information.]
What is a wrapper?
Wrapper: a program that extracts desired information from Web pages.
Semi-structured document → (wrapper) → structured information
Web Wrappers
Web wrappers wrap:
  "Query-able" or "search-able" Web sites
  Web pages with large itemized lists
The primary issue is: how to build the extractor quickly?
Free Text Extraction v.s. Semi-structured Text Extraction
Example: extract the attributes job title, employer and phone number from a job item list.
Free text extraction can depend on natural language (NL) knowledge:
"The department of computer science at Cranberry Lemon University has a faculty position opening. Please call (555)333-5555 for more details."
Semi-structured text extraction depends on appearance and regularity:
"Faculty position, department of computer science, Cranberry Lemon University. Call (555)333-5555"
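To make "depends on appearance and regularity" concrete, here is a minimal sketch that extracts the three attributes from the semi-structured item using only its punctuation layout. The pattern, the field names and the decision to treat everything between the first comma and the period as the employer are assumptions made for illustration.

```python
import re

# Assumed layout: "<job title>, <employer>. Call <phone>"
ITEM_PATTERN = re.compile(
    r"^(?P<job_title>[^,]+),\s*"                  # text up to the first comma
    r"(?P<employer>.+?)\.\s*"                     # text up to the period before "Call"
    r"Call\s*(?P<phone>\(\d{3}\)\d{3}-\d{4})$"    # phone number in (ddd)ddd-dddd form
)

def extract_job_item(item):
    """Return a dict of attributes, or None if the item does not follow the layout."""
    match = ITEM_PATTERN.match(item.strip())
    return match.groupdict() if match else None

semi_structured = ("Faculty position, department of computer science, "
                   "Cranberry Lemon University. Call (555)333-5555")
free_text = ("The department of computer science at Cranberry Lemon University "
             "has a faculty position opening. Please call (555)333-5555 for more details.")

record = extract_job_item(semi_structured)
print(record)
print(extract_job_item(free_text))  # the layout rule does not apply to free text
```

The same rule returns nothing for the free-text sentence, which is why free text extraction has to fall back on linguistic knowledge instead of layout.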
Wrapper Representations
Delimiter-based finite state automata
<HTML><TITLE>Some Country Codes</TITLE><BODY>
<B>Congo</B><I>242</I><BR>
<B>Egypt</B><I>20</I><BR>
<B>Belize</B><I>501</I><BR>
<B>Spain</B><I>34</I><BR>
</BODY></HTML>
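[Figure: four-state automaton (states 1 to 4) that alternates between skip and extract, with transitions on the delimiters <B>, </B>, <I> and </I>.]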
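The delimiter-based automaton can be approximated by a simple left-to-right scan: skip to a left delimiter, extract up to the matching right delimiter, and repeat for the next attribute. The sketch below is an illustration under assumptions (the function name `extract_lr` and the use of plain string search are mine), applied to the country-code page with well-formed closing tags.

```python
def extract_lr(page, delimiters):
    """Alternate skip/extract over the page using (left, right) delimiter pairs."""
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)       # skip until the left delimiter
            if start == -1:
                return rows
            start += len(left)
            end = page.find(right, start)      # extract until the right delimiter
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))

page = ("<HTML><TITLE>Some Country Codes</TITLE><BODY>"
        "<B>Congo</B><I>242</I><BR>"
        "<B>Egypt</B><I>20</I><BR>"
        "<B>Belize</B><I>501</I><BR>"
        "<B>Spain</B><I>34</I><BR>"
        "</BODY></HTML>")

codes = extract_lr(page, [("<B>", "</B>"), ("<I>", "</I>")])
print(codes)
```

The wrapper is fully described by the list of delimiter pairs, which is what a wrapper induction system would have to learn from labeled examples.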
Related Work
ShopBot: Doorenbos, Etzioni, Weld, AA-97
Ariadne: Ashish, Knoblock, CoopIS-97
WIEN: Kushmerick, Weld, IJCAI-97
Related Work (Cont.)
SoftMealy wrapper representation: Hsu, IJCAI-99
STALKER: Muslea, Minton, Knoblock, AA-99. A hierarchical FST.
IEPAD: Chang, WWW01
WIEN
HLRT (Head-Left-Right-Tail)
Labeling: by PageOracle and LabelOracle.
PAC analysis.
Successfully extracts 48% of Web pages.
Weaknesses: missing attributes, attributes not in order, tabular data, etc.
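An HLRT wrapper adds a head and a tail delimiter around the left/right pairs, so that text before the head and after the tail is never mistaken for data. The following is a hedged sketch of how such a wrapper might be executed; the function name `exec_hlrt`, the sample page and the chosen delimiters (`<P>` as head, `<HR>` as tail) are assumptions for illustration, not taken from the slides.

```python
def exec_hlrt(page, head, tail, lr_pairs):
    """Skip past the head, then extract tuples until the tail comes before the next tuple."""
    start = page.find(head)
    if start == -1:
        return []
    pos = start + len(head)
    first_left = lr_pairs[0][0]
    rows = []
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no tuple is left, or when the tail occurs before the next tuple.
        if next_left == -1 or (next_tail != -1 and next_tail < next_left):
            return rows
        row = []
        for left, right in lr_pairs:
            begin = page.find(left, pos)
            if begin == -1:
                return rows
            begin += len(left)
            end = page.find(right, begin)
            if end == -1:
                return rows
            row.append(page[begin:end])
            pos = end + len(right)
        rows.append(tuple(row))

# Hypothetical page: the bold title and the bold "End" would confuse a pure left/right scan.
hlrt_page = ("<HTML><BODY><B>Some Country Codes</B><P>"
             "<B>Congo</B><I>242</I><BR>"
             "<B>Egypt</B><I>20</I><BR>"
             "<HR><B>End</B></BODY></HTML>")

hlrt_rows = exec_hlrt(hlrt_page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
print(hlrt_rows)
```

Because every tuple must list its attributes in the fixed order of `lr_pairs`, a sketch like this also shows why missing or out-of-order attributes are a weakness of this representation.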
SoftMealy
Finite-State Transducers for Semi-Structured Text Mining
Labeling: examples are labeled manually through an interface.
FST (Finite-State Transducer): single-pass or multi-pass
SoftMealy wrapper representation
Uses a finite-state transducer in which each distinct attribute permutation can be encoded as a successful path.
Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes.
Finite State Transducer
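[Figure: finite-state transducer with a begin state b, an end state e, attribute states U, N, A and M labeled "extract", and states -U, -N and -A labeled "skip".]
Additionally handles the two cases (N, M) and (N, A, M).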
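The idea that each attribute permutation corresponds to a successful path can be illustrated with a transition table over the attribute states. This is only a path check, not a full transducer: the skip states and the contextual rules are omitted, and the transition set below is an assumption chosen so that (U, N, A, M), (U, N, M), (N, M) and (N, A, M) are accepted.

```python
# Assumed transitions between the begin state b, attribute states and the end state e.
TRANSITIONS = {
    "b": {"U", "N"},
    "U": {"N"},
    "N": {"A", "M"},
    "A": {"M"},
    "M": {"e"},
}

def is_successful_path(attributes):
    """Return True if the attribute sequence traces a path from b to e."""
    state = "b"
    for nxt in list(attributes) + ["e"]:
        if nxt not in TRANSITIONS.get(state, set()):
            return False
        state = nxt
    return True

for sequence in [("U", "N", "A", "M"), ("U", "N", "M"), ("N", "M"), ("N", "A", "M"), ("M", "N")]:
    print(sequence, is_successful_path(sequence))
```

Supporting a new permutation means adding edges to the table, which is why a single-pass transducer only covers the permutations it has been given paths for.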
STALKER
"STALKER: Learning Extraction Rules for Semi-structured, Web-based Information Sources". AAAI-98, Muslea.
The Embedded Catalog Description is a tree-like structure.
Multi-Pass or Hierarchical Wrapper
First extract Body
Then extract Tuples
Pass 1: extract U
Pass 2: extract N
Pass 3: extract A
Pass 4: extract M
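The passes listed above can be sketched as follows: first isolate the body, then split it into tuples, then run one independent pass per attribute. The attribute names U, N, A and M come from the slide; the sample page, the regular expressions standing in for learned rules, and the way each attribute is marked up are assumptions for illustration.

```python
import re

# Hypothetical page; the second item omits A and lists M before U and N.
PAGE = ("<html><h1>Staff</h1><ul>"
        "<li><a href='u1'>Ann</a> <i>Professor</i> <b>Chair</b></li>"
        "<li><b>Dean</b> <a href='u2'>Bob</a></li>"
        "</ul><p><b>footer</b></p></html>")

# One assumed rule per attribute; each rule is applied in its own pass.
RULES = {
    "U": r"href='([^']*)'",
    "N": r"<a [^>]*>([^<]*)</a>",
    "A": r"<i>([^<]*)</i>",
    "M": r"<b>([^<]*)</b>",
}

def multi_pass_extract(page):
    """Extract the body, then the tuples, then each attribute in a separate pass."""
    body = re.search(r"<ul>(.*?)</ul>", page, re.S).group(1)
    tuples = re.findall(r"<li>(.*?)</li>", body, re.S)
    records = [{} for _ in tuples]
    for attribute, pattern in RULES.items():      # pass 1: U, pass 2: N, ...
        for record, chunk in zip(records, tuples):
            found = re.search(pattern, chunk)
            record[attribute] = found.group(1) if found else None
    return records

records = multi_pass_extract(PAGE)
print(records)
```

Since every pass looks at the whole tuple on its own, the order of attributes inside a tuple does not matter and a missing attribute simply yields `None`, at the cost of scanning each tuple once per attribute.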
Rule Generating
1st iteration: terminals: {; reservation _Symbol_ _Word_}; candidates: {; <i> _Symbol_ _HtmlTag_}; perfect disjuncts: {<i> _HtmlTag_}; positive examples: D3, D4
2nd iteration: uncovered examples: {D1, D2}; candidates: {; _Symbol_}
Extract Credit info.
Features
Processing is performed in a hierarchical manner.
Attributes that are not in order pose no problem.
Using disjunctive rules solves the problem of missing attributes.
Comparison
Both: can handle irregular missing attributes; attributes that have not been seen before require training.
Single-pass: only a limited number of attribute permutations is allowed; good for tabular pages; faster.
Multi-pass: attribute permutations have no effect; good for tagged-list pages; slower.
Comparison
Quote Server
  STALKER: 79%, 10 example tuples, 500 test
  WIEN: the collection is beyond the learner's capability
  SoftMealy: multi-pass 85%, single-pass 97%
Internet Address Finder
  STALKER: 80% ~ 100%, 500 test
  WIEN: the collection is beyond the learner's capability
  SoftMealy: multi-pass 68%, single-pass 41%
Comparison
Okra (tabular pages)
  STALKER: 97%, 1 example tuple
  WIEN: 100%, 13 example tuples, 30 test
  SoftMealy: single-pass 100%, 1 example tuple, 30 test
Big-book (tagged-list pages)
  STALKER: 97%, 8 example tuples
  WIEN: perfect, 18 example tuples, 30 test
  SoftMealy: single-pass 97%, 4 examples, 30 test; multi-pass 100%, 6 examples, 30 test