LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb:...
Transcript of LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb:...
![Page 1: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/1.jpg)
LaSEWeb: Automating Search Strategies over Semi-Structured Web Data
Oleksandr PolozovUniversity of Washington
Sumit GulwaniMicrosoft Research
KDD 2014 — August 27, 2014
![Page 2: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/2.jpg)
Motivation: search engine micro-segments
![Page 3: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/3.jpg)
Motivation: search engine micro-segments
![Page 4: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/4.jpg)
Motivation: search engine micro-segments
![Page 5: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/5.jpg)
Motivation: search engine micro-segments
![Page 6: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/6.jpg)
Repetitive search tasks
Structured databases
Precise, but limited in content
No time-sensitive information
Provide no context (sources)
![Page 7: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/7.jpg)
Repetitive search tasks
Structured databases
Precise, but limited in content
No time-sensitive information
Provide no context (sources)
Web mining scripts
Two extremes: Powerful ML, which has to be re-
learned for each micro-segment
Fragile HTML layout parser
Inaccessible for end-users
![Page 8: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/8.jpg)
LaSEWeb Query Language
• A semantic scripting language for semi-structural information extraction from the Web
• Models natural patterns from the humans’ search strategies
LaSEWeb interpreter• Explores multiple webpages, clusters different answer candidates,
and provides context for each answer
• Makes use of state-of-the-art NLP/ML/PL algorithms
![Page 9: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/9.jpg)
Example: phone number
𝑣 = (“Sumit Gulwani”)
let 𝜂𝑡 = 𝐸𝑚𝑝ℎ𝑎𝑠𝑖𝑧𝑒𝑑 𝑣1 inlet 𝜂𝑏 = 𝐴𝑡𝑡𝑟𝑖𝑏𝑢𝑡𝑒𝐿𝑜𝑜𝑘𝑢𝑝 𝑆𝑦𝑛("phone"), ℓ𝑎 in
𝑈𝑛𝑖𝑜𝑛 𝜂𝑡 , 𝜂𝑏where 𝑅𝑒𝑔𝑒𝑥 ℓ𝑎, "\(\d+\)\W ∗ \d + \W ∗ \d+"where 𝐿𝑎𝑦𝑜𝑢𝑡 𝜂𝑡 , 𝜂𝑏 , Down and 𝑁𝑒𝑎𝑟𝑏𝑦 𝜂𝑡 , 𝜂𝑏
![Page 10: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/10.jpg)
Example: phone number
𝑣 = (“Sumit Gulwani”)
let 𝜂𝑡 = 𝐸𝑚𝑝ℎ𝑎𝑠𝑖𝑧𝑒𝑑 𝑣1 inlet 𝜂𝑏 = 𝐴𝑡𝑡𝑟𝑖𝑏𝑢𝑡𝑒𝐿𝑜𝑜𝑘𝑢𝑝 𝑆𝑦𝑛("phone"), ℓ𝑎 in
𝑈𝑛𝑖𝑜𝑛 𝜂𝑡 , 𝜂𝑏where 𝑅𝑒𝑔𝑒𝑥 ℓ𝑎, "\(\d+\)\W ∗ \d + \W ∗ \d+"where 𝐿𝑎𝑦𝑜𝑢𝑡 𝜂𝑡 , 𝜂𝑏 , Down and 𝑁𝑒𝑎𝑟𝑏𝑦 𝜂𝑡 , 𝜂𝑏
• Visual attributes
![Page 11: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/11.jpg)
Example: phone number
𝑣 = (“Sumit Gulwani”)
let 𝜂𝑡 = 𝐸𝑚𝑝ℎ𝑎𝑠𝑖𝑧𝑒𝑑 𝑣1 inlet 𝜂𝑏 = 𝐴𝑡𝑡𝑟𝑖𝑏𝑢𝑡𝑒𝐿𝑜𝑜𝑘𝑢𝑝 𝑆𝑦𝑛("phone"), ℓ𝑎 in
𝑈𝑛𝑖𝑜𝑛 𝜂𝑡 , 𝜂𝑏where 𝑅𝑒𝑔𝑒𝑥 ℓ𝑎, "\(\d+\)\W ∗ \d + \W ∗ \d+"where 𝐿𝑎𝑦𝑜𝑢𝑡 𝜂𝑡 , 𝜂𝑏 , Down and 𝑁𝑒𝑎𝑟𝑏𝑦 𝜂𝑡 , 𝜂𝑏
• Visual attributes• Implicit table detection
![Page 12: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/12.jpg)
Example: phone number
𝑣 = (“Sumit Gulwani”)
let 𝜂𝑡 = 𝐸𝑚𝑝ℎ𝑎𝑠𝑖𝑧𝑒𝑑 𝑣1 inlet 𝜂𝑏 = 𝐴𝑡𝑡𝑟𝑖𝑏𝑢𝑡𝑒𝐿𝑜𝑜𝑘𝑢𝑝 𝑆𝑦𝑛("phone"), ℓ𝑎 in
𝑈𝑛𝑖𝑜𝑛 𝜂𝑡 , 𝜂𝑏where 𝑅𝑒𝑔𝑒𝑥 ℓ𝑎, "\(\d+\)\W ∗ \d + \W ∗ \d+"where 𝐿𝑎𝑦𝑜𝑢𝑡 𝜂𝑡 , 𝜂𝑏 , Down and 𝑁𝑒𝑎𝑟𝑏𝑦 𝜂𝑡 , 𝜂𝑏
• Visual attributes• Implicit table detection• Linguistic patterns
![Page 13: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/13.jpg)
Example: phone number
𝑣 = (“Sumit Gulwani”)
let 𝜂𝑡 = 𝐸𝑚𝑝ℎ𝑎𝑠𝑖𝑧𝑒𝑑 𝑣1 inlet 𝜂𝑏 = 𝐴𝑡𝑡𝑟𝑖𝑏𝑢𝑡𝑒𝐿𝑜𝑜𝑘𝑢𝑝 𝑆𝑦𝑛("phone"), ℓ𝑎 in
𝑈𝑛𝑖𝑜𝑛 𝜂𝑡 , 𝜂𝑏where 𝑅𝑒𝑔𝑒𝑥 ℓ𝑎, "\(\d+\)\W ∗ \d + \W ∗ \d+"where 𝐿𝑎𝑦𝑜𝑢𝑡 𝜂𝑡 , 𝜂𝑏 , Down and 𝑁𝑒𝑎𝑟𝑏𝑦 𝜂𝑡 , 𝜂𝑏
• Visual attributes• Implicit table detection• Linguistic patterns• Clustering across webpages
![Page 14: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/14.jpg)
Language Structure
• Match: webpage layout, style, end-user appearance
• Use: in-memory rendering, DOM analysis
• 𝑁𝑒𝑎𝑟𝑏𝑦, 𝐸𝑚𝑝ℎ𝑎𝑠𝑖𝑧𝑒𝑑, 𝐿𝑎𝑦𝑜𝑢𝑡, 𝐶𝑆𝑆…
Visualpatterns
• Match: relational patterns on implicit tables
• Use: table detection, plain text analysis using programming-by-example technologies
• 𝑉𝐿𝑂𝑂𝐾𝑈𝑃, 𝐴𝑡𝑡𝑟𝑖𝑏𝑢𝑡𝑒𝐿𝑜𝑜𝑘𝑢𝑝…
Structural patterns
• Match: semantic text properties
• Use: POS tagging, sentence parsing, entity recognition, synonymy detection…
• 𝑆𝑦𝑛, 𝑃𝑂𝑆, 𝐸𝑛𝑡𝑖𝑡𝑦, 𝑁𝑃, 𝑆𝑎𝑚𝑒𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒…
Linguistic patterns
[1] J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL, 2005.[2] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In ACL, 2003.[3] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In HLT-NAACL, 2003.
[4] C. Quirk, P. Choudhury, J. Gao, H. Suzuki, K. Toutanova, M. Gamon, W.-t. Yih, L. Vanderwende, and C. Cherry. MSR SPLAT, a language analysis toolkit. In ACL, 2012.[5] W.-t. Yih, G. Zweig, and J. C. Platt. Polarity inducing latent semantic analysis. In ACL, 2012.[6] S. Gulwani. Automating string processing in spreadsheets using input-output examples. In POPL, 2011.[7] M. J. Cafarella., A. Halevy, and J. Madhavan. Structured data on the web. In CACM 54.2 (2011): 72-79.
![Page 15: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/15.jpg)
Program interpreter: “user emulation” algorithm
![Page 16: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/16.jpg)
Program interpreter: “user emulation” algorithm
𝑣 = "computer"
LaSEWebEngine
LaSEWeb“inventors”MS script
![Page 17: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/17.jpg)
Program interpreter: “user emulation” algorithm
𝑣 = "computer"
LaSEWebEngine
LaSEWeb“inventors”MS script
Seed query
![Page 18: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/18.jpg)
Program interpreter: “user emulation” algorithm
𝑣 = "computer"
LaSEWebEngine
LaSEWeb“inventors”MS script
Seed query
“John Atanasoff”
“John Vincent Atanasoff”
“Charles Babbage”
“Babbage, C.”
“konrad zuse”
![Page 19: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/19.jpg)
Program interpreter: “user emulation” algorithm
𝑣 = "computer"
LaSEWebEngine
LaSEWeb“inventors”MS script
Seed query
“John Atanasoff”
“John Vincent Atanasoff”
“Charles Babbage”
“Babbage, C.”
“konrad zuse”
𝑠𝑐𝑜𝑟𝑒 𝐶𝑖 =1
𝑈
𝑗=1
𝑈
𝑠∈𝐶𝑖
𝑐 𝑠, 𝑢𝑗
𝑐 𝑢𝑗
![Page 20: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/20.jpg)
Program interpreter: “user emulation” algorithm
𝑣 = "computer"
LaSEWebEngine
LaSEWeb“inventors”MS script
Seed query
“John Atanasoff”
“John Vincent Atanasoff”
“Charles Babbage”
“Babbage, C.”
“konrad zuse”
𝑠𝑐𝑜𝑟𝑒 𝐶𝑖 =1
𝑈
𝑗=1
𝑈
𝑠∈𝐶𝑖
𝑐 𝑠, 𝑢𝑗
𝑐 𝑢𝑗
John Atanasoff (14.5%)http://www.computerhope.comhttp://www.ehow.comhttp://inventors.about.com
Charles Babbage (10.5%)http://www.buzzle.comhttp://www.ask.com
…
![Page 21: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/21.jpg)
Experiments
• ~95% precision and 71% recall on factoid micro-segments• For micro-segments: Precision measured by random sampling, based on top-3 results• For end-user repetitive search tasks: Precision/recall measured manually
• Average execution time: ~5 sec/webpage• Depends on the rendering settings
• Current setting: offline deployment / database population
![Page 22: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/22.jpg)
Summary & Future work
• Typical patterns of human search strategies in a scripting language for IE• Match semi-structured Web content• Existing cross-disciplinary technologies used as building blocks• Exploit information redundancy across multiple webpages
• Applications:1. Micro-segments of factoid questions in search engines2. Repeatable batch data extraction tasks for end-users3. Structured database population from free Web text4. English language comprehension problem generation
• Future work:• Automatic query execution plans in the language• Integration with “natural language → logic” engines
![Page 23: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/23.jpg)
Summary & Future work
• Typical patterns of human search strategies in a scripting language for IE• Match semi-structured Web content• Existing cross-disciplinary technologies used as building blocks• Exploit information redundancy across multiple webpages
• Applications:1. Micro-segments of factoid questions in search engines2. Repeatable batch data extraction tasks for end-users3. Structured database population from free Web text4. English language comprehension problem generation
• Future work:• Automatic query execution plans in the language• Integration with “natural language → logic” engines
1. The principal characterized his pupils as _________ because they were pampered and spoiled by their indulgent parents.
2. The commentator characterized the electorate as _________ because it was unpredictable and given to constantly shifting moods.
(a) cosseted (b) disingenuous (c) corrosive (d) laconic (e) mercurial
![Page 24: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/24.jpg)
Summary & Future work
• Typical patterns of human search strategies in a scripting language for IE• Match semi-structured Web content• Existing cross-disciplinary technologies used as building blocks• Exploit information redundancy across multiple webpages
• Applications:1. Micro-segments of factoid questions in search engines2. Repeatable batch data extraction tasks for end-users3. Structured database population from free Web text4. English language comprehension problem generation
• Future work:• Automatic query execution plans in the language• Integration with “natural language → logic” engines
![Page 25: LaSEWeb: Automating Search Strategies over Semi-Structured Web Data · 2018-01-04 · LaSEWeb: Automating Search Strategies over Semi-Structured Web Data Oleksandr Polozov University](https://reader034.fdocuments.us/reader034/viewer/2022042207/5ea9c0e11cc62d62cc2c4635/html5/thumbnails/25.jpg)
Thanks for listening!Questions?