KnowItAll

April 5 2007

William Cohen

Announcements

• Reminder: project presentations (or progress report)

– Sign up for a 30-minute presentation (or else)

– First pair of slots is April 17

– Last pair of slots is May 10

• William is out of town April 6-April 9

– So, no office hours Friday

• Next week: no critiques assigned

– But I will lecture

Bootstrapping

[Slide diagram: a map of bootstrapping work]

• Hearst ’92
• Brin ’98
• BM ’98
• Collins & Singer ’99
• Riloff & Jones ’99
• Cucerzan & Yarowsky ’99
• Etzioni et al. 2005
• Stevenson & Greenwood 2005
• Rosenfeld & Feldman 2006

Dimensions along which the work varies:

• Scalability, surface patterns, use of web crawlers…
• Learning, semi-supervised learning, dual feature spaces…
• Deeper linguistic features, free text…

Annotations:

• Clever idea for learning relation patterns & strong experimental results
• De-emphasize duality, focus on distance between patterns.

KnowItAll

Architecture

Set of (disjoint?) predicates to consider + two names for each

~= [H92]

• Context – keywords from user to filter out non-domain pages

• … ?

Architecture

Bootstrapping - 1

[Slide diagram: a template rule instantiated for the class “city”, yielding a search-engine query]

Bootstrapping - 2

Each discriminator U is a function:

fU(x) = hits(“city x”) / hits(“x”)

i.e., fU(“Pittsburgh”) = hits(“city Pittsburgh”) / hits(“Pittsburgh”)

These are then used to create features: fU(x) > θ and fU(x) < θ
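The ratio above can be sketched directly; the hit counts here are hypothetical stand-ins for what KnowItAll obtains from live search-engine queries:

```python
def pmi_score(instance, discriminator, hits):
    """f_U(x) = hits("discriminator x") / hits("x"), per the slide."""
    joint = hits.get(f"{discriminator} {instance}", 0)
    alone = hits.get(instance, 0)
    return joint / alone if alone else 0.0

def threshold_features(instance, discriminators, hits, theta=0.001):
    """Boolean features f_U(x) > theta, one per discriminator."""
    return {d: pmi_score(instance, d, hits) > theta for d in discriminators}

# Hypothetical hit counts standing in for web queries.
hits = {"Pittsburgh": 1_000_000, "city Pittsburgh": 50_000}
print(pmi_score("Pittsburgh", "city", hits))             # 0.05
print(threshold_features("Pittsburgh", ["city"], hits))  # {'city': True}
```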

Bootstrapping - 3

1. Submit the queries & apply the rules to produce initial seeds.

2. Evaluate each seed with each discriminator U: e.g., compute PMI stats like: |hits(“city Boston”)| / |hits(“Boston”)|

3. Take the top seeds from each class and call them POSITIVE then use disjointness of classes to find NEGATIVE seeds.

4. Train a Naive Bayes classifier using thresholded U’s as features.
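Step 4 can be sketched as a tiny Bernoulli Naive Bayes over the boolean discriminator features; the seed examples below are hypothetical:

```python
from collections import defaultdict
import math

def train_nb(examples):
    """Train Naive Bayes from (feature_dict, label) pairs with boolean
    features, using Laplace smoothing for the per-feature conditionals."""
    label_count = defaultdict(int)
    true_count = defaultdict(int)  # (label, feature) -> #examples where feature is True
    features = set()
    for feats, y in examples:
        label_count[y] += 1
        for f, v in feats.items():
            features.add(f)
            if v:
                true_count[(y, f)] += 1
    n = len(examples)
    return {
        y: (c / n, {f: (true_count[(y, f)] + 1) / (c + 2) for f in features})
        for y, c in label_count.items()
    }

def predict(model, feats):
    """Return the label with the highest log posterior under independence."""
    def log_post(y):
        prior, cond = model[y]
        return math.log(prior) + sum(
            math.log(cond[f] if v else 1 - cond[f]) for f, v in feats.items()
        )
    return max(model, key=log_post)

# Hypothetical seeds: positives pass the "city" discriminator threshold, negatives fail it.
examples = [({"city": True}, "pos")] * 3 + [({"city": False}, "neg")] * 3
model = train_nb(examples)
print(predict(model, {"city": True}))  # pos
```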

Bootstrapping - 4

Estimate using the classifier based on the previously-trained discriminators

Some ad hoc stopping conditions… (“signal to noise” ratio)

Architecture - 2

Extensions to KnowItAll

• Problem: unsupervised learning finds clusters; what if the text doesn’t support the clustering we want?

– E.g., target is “scientist”, but the natural clusters are “biologist”, “physicist”, “chemist”

• Solution: subclass extraction

– Modify the template/rule system to extract subclasses of the target class (e.g., scientist → chemist, biologist, …)

– Check extracted subclasses with WordNet and/or a PMI-like method (as for instances)

– Extract from each subclass recursively

Extensions to KnowItAll

• Problem: the set of rules is limited:

– Derived from a fixed set of “templates” (general patterns ~ from H92)

• Solution 1: Pattern learning: augment the initial set of rules derivable from templates

1. Search for instances I on the web

2. Generate patterns: some substring of I in context: “b1 … b4 I a1 … a4”

3. Assume classes are disjoint and estimate the recall/precision of each pattern P

4. Exclude patterns that cover only one seed (very low recall)

5. Take the top 200 remaining patterns and:

• Evaluate them as extractors “using PMI” (?)

• Evaluate them as discriminators (in the usual way?)

Examples: “headquartered in <city>”, “<city> hotels”, …
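Step 2 of the pattern learner can be sketched as a token-window extractor; the sentence and window size below are illustrative:

```python
def generate_patterns(text, instance, window=4):
    """For each occurrence of `instance` in `text`, emit a candidate
    pattern "b1 ... b4 <X> a1 ... a4" built from surrounding tokens."""
    tokens = text.split()
    inst = instance.split()
    patterns = []
    for i in range(len(tokens) - len(inst) + 1):
        if tokens[i:i + len(inst)] == inst:
            before = tokens[max(0, i - window):i]
            after = tokens[i + len(inst):i + len(inst) + window]
            patterns.append(" ".join(before + ["<X>"] + after))
    return patterns

text = "the company is headquartered in Seattle near the lake"
print(generate_patterns(text, "Seattle"))
# ['company is headquartered in <X> near the lake']
```

The full system would also consider substrings of this window as candidates and then score each pattern's precision/recall against the disjoint seed classes.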

Extensions to KnowItAll

• Solution 2: List extraction: augment the initial set of rules with rules that are local to a specific web page

1. Search for pages containing small sets of instances (e.g., “London Paris Rome Pittsburgh”)

2. For each page P:

• Find subtrees T of the DOM tree that contain >k seeds

• Find the longest common prefix/suffix of the seeds in T

– [Some heuristics added to generalize this further]

• Find all other strings inside T with the same prefix/suffix

• Heuristically select the “best” wrapper for the page

– Wrapper = (P, T, prefix, suffix)
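The prefix/suffix induction can be sketched as follows; the HTML and seeds are hypothetical, and the DOM-subtree search and "best wrapper" heuristics are omitted:

```python
import re

def induce_wrapper(html, seeds):
    """Find the longest common context shared by all seeds, then extract
    every other string on the page wrapped the same way."""
    contexts = []
    for s in seeds:
        m = re.search(re.escape(s), html)
        if not m:
            return []
        contexts.append((html[:m.start()], html[m.end():]))

    def shared_suffix(strings):  # longest common suffix, built right to left
        out = ""
        for chars in zip(*(reversed(s) for s in strings)):
            if len(set(chars)) != 1:
                break
            out = chars[0] + out
        return out

    def shared_prefix(strings):  # longest common prefix
        out = ""
        for chars in zip(*strings):
            if len(set(chars)) != 1:
                break
            out += chars[0]
        return out

    prefix = shared_suffix([before for before, _ in contexts])
    suffix = shared_prefix([after for _, after in contexts])
    if not prefix or not suffix:
        return []
    # Lookahead keeps adjacent list items from swallowing each other's markup.
    pattern = re.escape(prefix) + r"(.+?)(?=" + re.escape(suffix) + ")"
    return re.findall(pattern, html)

html = "<ul><li>Italy</li><li>Japan</li><li>France</li><li>Spain</li></ul>"
print(induce_wrapper(html, ["Italy", "France"]))
# ['Italy', 'Japan', 'France']
```

Note that the last list item is missed because its trailing context differs; this is the kind of boundary case the slide's "heuristics added to generalize this further" would address.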

[Slide figure: nested DOM subtrees T1-T4 and wrapper candidates w1-w4]

T1

w1: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

T2

w1: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

w2: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

T3

w1: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

w2: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

w3: Italy, Japan, France, Israel, Spain, Brazil

T4

w1: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

w2: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

w3: Italy, Japan, France, Israel, Spain, Brazil

w4: Italy, Japan

[…]

Results - City

Results - Film

Results - Scientist

Observations

• Corpus is accessed indirectly through the Google API

– Only use the top k discriminators

– Run extractors via query keywords & extract

– Limited by network access time

• Lots of moving parts to engineer

– Rule templates

– Signal-to-noise stopping conditions

– List-extraction wrapper evaluation details

– Parameters: number of discriminators, number of seeds to keep, number of names per concept, …

KnowItNow: Son of KnowItAll

• Goal: faster results, not better results

• Difference 1:

– Store documents locally

– Build a local index (the Bindings Engine) optimized for finding instances of KnowItAll rules and patterns

• Based on an inverted index: term → (doc, position, contextInfo)
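A minimal positional index in this spirit; the documents are hypothetical, and the real Bindings Engine also stores context info with each posting:

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: term -> postings list of (doc_id, position)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return index

def bindings(index, docs, context):
    """Tokens that immediately follow the phrase `context`, located by
    walking the postings of the phrase's first term."""
    context = [t.lower() for t in context]
    found = []
    for doc_id, pos in index.get(context[0], []):
        tokens = docs[doc_id].split()
        window = [t.lower() for t in tokens[pos:pos + len(context)]]
        if window == context and pos + len(context) < len(tokens):
            found.append(tokens[pos + len(context)])
    return found

docs = {1: "cities such as Boston are large", 2: "towns such as Kent"}
index = build_index(docs)
print(bindings(index, docs, ["such", "as"]))  # ['Boston', 'Kent']
```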

KnowItNow: Son of KnowItAll

• Difference 2:

– New model (the URNS model) to merge information from multiple extraction rules

– Intuition: instances generated from each extractor are assumed to be a mixture of two distributions:

1. Random noise from a large instance pool

2. Stuff with known structure (e.g., uniform, Zipf’s law, …)

– Using EM you can estimate the mixture probabilities and the parameters of the non-noisy data, giving Prob(x is noise | x was extracted)
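A simplified version of that EM: assume both component densities are known and estimate only the mixing weight. The pool size and the 41/59 split below echo the slide's color example but are otherwise hypothetical:

```python
def em_noise_weight(xs, p_true, p_noise, iters=100):
    """Estimate w = Prob(noise) in the mixture w*p_noise + (1-w)*p_true.
    Returns (w, resp) where resp[i] = Prob(x_i is noise | x_i extracted)."""
    w = 0.5
    resp = []
    for _ in range(iters):
        # E-step: posterior probability that each extraction is noise
        resp = [w * p_noise(x) / (w * p_noise(x) + (1 - w) * p_true(x)) for x in xs]
        # M-step: the new mixing weight is the mean responsibility
        w = sum(resp) / len(resp)
    return w, resp

def p_true(x):   # "true" component: uniform over 137 instances
    return 1 / 137 if x == "common" else 0.0

def p_noise(x):  # noise component: uniform over a large pool (size assumed)
    return 1 / 15_483

xs = ["common"] * 41 + ["rare"] * 59  # 41% / 59% of the extraction mass
w, resp = em_noise_weight(xs, p_true, p_noise)
```

Here w converges to roughly 0.595, close to the slide's Prob(noise) = 0.59, and every "rare" extraction gets responsibility 1.0.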

KnowItNow: Son of KnowItAll

[Slide figure: fitting the URNS mixture to extracted “color” instances]

• Uniform model: 137 colors = 41% of the mass; 15,346 colors = 59% of the mass. Prob(noise) = 0.59; non-noisy data: uniform over 137 instances.

• Zipf model: 41% of the mass fits a power law; 59% of the mass doesn’t. Prob(noise) = 0.59; non-noisy data: Zipf over >N instances.