Integrating Product Data from Websites offering Microdata Markup

Integrating Product Data from Websites offering Microdata Markup School of Business Informatics and Mathematics Petar Petrovski, Volha Bryl, Christian Bizer Data and Web Science Research Group University of Mannheim, Germany

Petar Petrovski, Volha Bryl, Christian Bizer. Integrating Product Data from Websites offering Microdata Markup. The 4th Workshop on Data Extraction and Object Search (DEOS) @ WWW 2014.

Transcript of Integrating Product Data from Websites offering Microdata Markup

Page 1: Integrating Product Data from Websites offering Microdata Markup

Integrating Product Data from Websites offering Microdata

Markup

School of Business Informatics and Mathematics

Petar Petrovski, Volha Bryl, Christian Bizer

Data and Web Science Research Group University of Mannheim, Germany

Page 2: Integrating Product Data from Websites offering Microdata Markup

Outline

1. HTML-embedded Data on the Web

2. The Data Integration Pipeline

1. Microdata extraction

2. Classification

3. Feature extraction

4. Identity resolution

5. Data Fusion

3. Conclusions

2 Integrating Product Data from Microdata Markup. Petar Petrovski, Volha Bryl, Chris Bizer

Page 3: Integrating Product Data from Websites offering Microdata Markup

HTML-embedded Data

More and more websites semantically mark up the content of their HTML pages.

Microformats

Microdata

RDFa

Page 4: Integrating Product Data from Websites offering Microdata Markup

Schema.org

• Search engines ask site owners to embed data to enrich search results.

• 200+ Classes: Product, Review, LocalBusiness, Person, Place, Event, …

• Encoding: Microdata or RDFa

Page 5: Integrating Product Data from Websites offering Microdata Markup

Usage of Schema.org Data @ Google

Data snippets within search results

Data snippets within info boxes

Page 6: Integrating Product Data from Websites offering Microdata Markup

Websites Containing Structured Data (November 2013)

1.7 million websites (PLDs) out of 12.8 million provide Microformat, Microdata or RDFa data (13%)

585 million of the 2.2 billion pages contain Microformat, Microdata or RDFa data (26%).

http://webdatacommons.org/structureddata/

Google, October 2013: 15% of all websites provide structured data.

Page 7: Integrating Product Data from Websites offering Microdata Markup

Top Classes, Microdata (2013)

• schema = Schema.org

• datavoc = Google's Rich Snippet Vocabulary

Page 8: Integrating Product Data from Websites offering Microdata Markup

Example: Microdata, Local Business

Page 9: Integrating Product Data from Websites offering Microdata Markup

Example: Microdata, Product
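
The slide shows its example as an image, so the markup itself is not in the transcript. As a rough illustration only, the following sketch parses a small, invented Schema.org Product fragment with Python's standard `html.parser` and collects its `itemprop` values. The HTML, the class name `ItempropCollector`, and the flat (non-nested) output are assumptions made for this sketch, not the authors' extraction code.

```python
from html.parser import HTMLParser

# Hypothetical product page fragment; the name and price are invented.
HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Apple iPod nano 8GB</span>
  <span itemprop="brand">Apple</span>
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <span itemprop="price">129.00</span>
    <meta itemprop="priceCurrency" content="USD">
  </div>
</div>
"""

class ItempropCollector(HTMLParser):
    """Collects (itemprop, value) pairs, ignoring item nesting for brevity."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open_prop = None  # itemprop whose text content is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if "content" in attrs:          # value carried in an attribute
            self.pairs.append((prop, attrs["content"]))
        elif "itemscope" not in attrs:  # plain text-valued property
            self._open_prop = prop

    def handle_data(self, data):
        if self._open_prop and data.strip():
            self.pairs.append((self._open_prop, data.strip()))
            self._open_prop = None

collector = ItempropCollector()
collector.feed(HTML)
pairs = collector.pairs
print(pairs)
```

A full Microdata parser would also track `itemscope` nesting so that `price` is attached to the Offer rather than to the Product; the sketch skips that on purpose.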

Page 10: Integrating Product Data from Websites offering Microdata Markup

The Data Integration Pipeline

• Objective: integrate all data found on the web describing a specific entity (e.g. product or organization)

• Motivation: enables creation of powerful applications, e.g. comparison shopping portals

• Use case: product data

• Implemented Pipeline:

Page 11: Integrating Product Data from Websites offering Microdata Markup

Outline

1. HTML-embedded Data on the Web

2. The Data Integration Pipeline

1. Microdata extraction

2. Classification

3. Feature extraction

4. Identity resolution

5. Data Fusion

3. Conclusions

Page 12: Integrating Product Data from Websites offering Microdata Markup

Web Data Commons Extraction Framework

• Web Data Commons project: extracts structured data from the Common Crawl
  – http://webdatacommons.org/
  – http://commoncrawl.org/

• Code available at:
  – https://subversion.assembla.com/svn/commondata/
  – Based on the Anything To Triples (any23) library for extracting structured data: http://any23.apache.org

• Common Crawl 2012
  – 3 billion HTML pages, 40.6 million websites
  – 7.3 billion statements describing 1.15 billion things
  – 9.4 million product offers from 9,240 e-shops

Page 13: Integrating Product Data from Websites offering Microdata Markup

Looking Deeper into E-Commerce Data

Microdata Product (2013)

Page 14: Integrating Product Data from Websites offering Microdata Markup

Looking Deeper into E-Commerce Data

Microdata Product (2012)

Page 15: Integrating Product Data from Websites offering Microdata Markup

Example: Title and Description

Example 1

Title: Apple MacBook Air MC968/A 11.6-Inch Laptop

Description: Faster Flash Storage with 64 GB Solid State Drive and USB 3.0. 720p FaceTime HD Camera. The new 1.6 GHz Intel Core i5 Processor with Intel HD Graphics 3000 enabling beautiful rendering and 4GB DDR3 RAM. 11.6” LED display with the best resolution…

Example 2

Title: Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 4 GB, 64 GB, Mac OS X Lion 10.7

Description: The MacBook Air MC 968/A powered by Intel Core i5 (1.6GHz, 3MB L3). 64 GB SSD and 4096 MB of DDR3 RAM. 29.464cm (11.6”) TFT 1366x768, Intel HD Graphics, IEEE 802.11a/b/g, Bluetooth 4.0, FaceTime camera, OS X Lion

Observations

• Various abbreviations describe the same features

• Numeric values are often imprecise due to rounding

• Different descriptions follow different levels of detail

Page 16: Integrating Product Data from Websites offering Microdata Markup

Outline

1. HTML-embedded Data on the Web

2. The Data Integration Pipeline

1. Microdata extraction

2. Classification

3. Feature extraction

4. Identity resolution

5. Data Fusion

3. Conclusions

Page 17: Integrating Product Data from Websites offering Microdata Markup

Product Classification

• Starting from 9.4 million products:
  – Products with English descriptions longer than 20 words
  => 1,986,359 products from 9,240 e-shops

• Training set
  – 18,000 labeled products, 9 classes

• Training the model
  – Naïve Bayes classifier

• Feature generation
  – 4-step process: tokenizing and removing stop words, pruning, n-grams, TF-IDF
  – ~3,600 features
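
The feature-generation steps and the classifier named on this slide can be sketched in plain Python as follows. This is a toy illustration under several assumptions: the stop-word list, the four training texts, the `min_df` pruning rule, and the use of TF-IDF weights as fractional counts in a multinomial Naive Bayes are choices made for the sketch, not details confirmed by the slides.

```python
import math
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "an", "and", "of", "with", "for", "in", "by"}  # toy list

def features(text, n=2):
    """Tokenize, drop stop words, and emit all n-grams of length 1..n."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]
    grams = []
    for size in range(1, n + 1):
        grams += [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]
    return grams

def train(labeled_docs, min_df=1):
    """Fit TF-IDF weights and a multinomial Naive Bayes model on (text, label) pairs."""
    docs = [(Counter(features(text)), label) for text, label in labeled_docs]
    df = Counter(f for counts, _ in docs for f in counts)
    vocab = {f for f, c in df.items() if c >= min_df}      # pruning step
    idf = {f: math.log(len(docs) / df[f]) + 1.0 for f in vocab}
    weights = defaultdict(Counter)  # label -> feature -> summed TF-IDF
    priors = Counter()
    for counts, label in docs:
        priors[label] += 1
        for f, tf in counts.items():
            if f in vocab:
                weights[label][f] += tf * idf[f]
    return {"vocab": vocab, "idf": idf, "weights": weights,
            "priors": priors, "n_docs": len(docs)}

def classify(model, text):
    """Return the label with the highest Laplace-smoothed log posterior."""
    counts = Counter(f for f in features(text) if f in model["vocab"])
    best_label, best_score = None, -math.inf
    for label, weights in model["weights"].items():
        total = sum(weights.values()) + len(model["vocab"])
        score = math.log(model["priors"][label] / model["n_docs"])
        for f, tf in counts.items():
            score += tf * model["idf"][f] * math.log((weights[f] + 1.0) / total)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

TRAINING = [  # invented examples, not taken from the 18,000-product training set
    ("Paperback novel by a bestselling author, 320 pages", "Books"),
    ("Hardcover cookbook with 100 recipes and colour photos", "Books"),
    ("Laptop with Intel Core i5 processor and 4 GB RAM", "Electronics & Computers"),
    ("Digital camera with 12 megapixel sensor and 4 GB memory card", "Electronics & Computers"),
]
model = train(TRAINING)
prediction = classify(model, "Notebook with Intel processor and 8 GB RAM")
print(prediction)
```

With real data, raising `min_df` is one plausible way to prune the vocabulary down to a few thousand features.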

Page 18: Integrating Product Data from Websites offering Microdata Markup

Classification Performance

Category                      Precision %   Recall %   # Products

Books                               86.58      87.95      233,249
Movies, Music & Games               89.81      70.63      186,832
Electronics & Computers             92.98      88.00      219,118
Home, Garden & Tools                73.81      60.78      186,495
Grocery, Health & Beauty            70.20      72.86      120,573
Toys, Kids, Baby & Pets             75.00      64.85      114,236
Clothing, Shoes & Jewelry           88.56      89.93      206,315
Sports & Outdoors                   72.83      67.90      143,156
Automotive & Industrial             73.06      65.50      168,567

Average (# Products: total)         80.31      74.26    1,578,541

The offers originate from 9,240 e-shops

Page 19: Integrating Product Data from Websites offering Microdata Markup

Outline

1. HTML-embedded Data on the Web

2. The Data Integration Pipeline

1. Microdata extraction

2. Classification

3. Feature extraction

4. Identity resolution

5. Data Fusion

3. Conclusions

Page 20: Integrating Product Data from Websites offering Microdata Markup

Product Feature Extraction

• Low precision (69%) for identity resolution without product feature extraction
  – Used later as a baseline for identity resolution

• We developed the Free Text Preprocessor
  – Makes the data more structured by extracting new property-value pairs from free-text properties
  – https://www.assembla.com/spaces/silk/wiki/Silk_Free_Text_Preprocessor

Page 21: Integrating Product Data from Websites offering Microdata Markup

Free Text Preprocessor by Example

<http://wdc.org/resource/2> <http://schema.org/Product/title> "Apple iPod nano (8 GB, 6th generation, Graphite)" .

<http://wdc.org/resource/2> <http://schema.org/Product/description>

"Memory size: 8GB. Colour: Graphite Generation: 6th generation. Memory type: Integrated. Weight: 21.1g. Radio: With Radio. Audio/Video formats: AAC, AIFF, Audible, MP3, WAV, VBR Display: 1.5-inch" .

Page 22: Integrating Product Data from Websites offering Microdata Markup

Free Text Preprocessor by Example

<http://wdc.org/resource/2> <http://schema.org/Product/title> "Apple iPod nano (8 GB, 6th generation, Graphite)" .

<http://wdc.org/resource/2> <http://schema.org/Product/description>

"Memory size: 8GB. Colour: Graphite Generation: 6th generation. Memory type: Integrated. Weight: 21.1g. Radio: With Radio. Audio/Video formats: AAC, AIFF, Audible, MP3, WAV, VBR Display: 1.5-inch" .

<http://wdc.org/resource/2> <http://schema.org/Product/Brand> "Apple" .

<http://wdc.org/resource/2> <http://schema.org/Product/Model> "iPod nano" .

<http://wdc.org/resource/2> <http://schema.org/Product/Storage> "8GB" .

<http://wdc.org/resource/2> <http://schema.org/Product/Display> "1.5-inch" .

Page 23: Integrating Product Data from Websites offering Microdata Markup

Silk Free Text Preprocessor by Example

<http://wdc.org/resource/2> <http://schema.org/Product/title> "Apple iPod nano (8 GB, 6th generation, Graphite)" .

<http://wdc.org/resource/2> <http://schema.org/Product/description>

"Memory size: 8GB. Colour: Graphite Generation: 6th generation. Memory type: Integrated. Weight: 21.1g. Radio: With Radio. Audio/Video formats: AAC, AIFF, Audible, MP3, WAV, VBR Display: 1.5-inch" .

<http://wdc.org/resource/2> <http://schema.org/Product/Brand> "Apple" .

<http://wdc.org/resource/2> <http://schema.org/Product/Model> "iPod nano" .

<http://wdc.org/resource/2> <http://schema.org/Product/Storage> "8GB" .

<http://wdc.org/resource/2> <http://schema.org/Product/Display> "1.5-inch" .

Free Text Preprocessor Specification

Page 24: Integrating Product Data from Websites offering Microdata Markup

Extractors – Bag-of-words

• Learning
  – Creating a list of words for every feature in the training set

• Extraction
  – Matching tokens against the learned lists

• Pros
  – Good for extracting nominal and numerical (with units of measurement) attributes

• Cons
  – Bad for extracting multi-token values
  – Inconclusive for values that refer to more than one feature

Example word lists:

Brand: Samsung Benq Apple Canon …

Storage: 64 GB megabytes 512GB …

Display: 42-inch 3.5-inches Inches 15.24cm …
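
A minimal sketch of the bag-of-words idea, assuming whitespace tokenization and exact, case-insensitive token matching. The function names `learn_word_lists` and `extract` and the two training records are hypothetical and are not taken from the Free Text Preprocessor.

```python
import re
from collections import defaultdict

def learn_word_lists(training_records):
    """Build one set of lower-cased tokens per feature from structured records."""
    word_lists = defaultdict(set)
    for record in training_records:
        for feature, value in record.items():
            word_lists[feature].update(value.lower().split())
    return word_lists

def extract(text, word_lists):
    """Assign each token of the text to every feature whose word list contains it."""
    found = defaultdict(list)
    for token in re.findall(r"[\w.\-]+", text.lower()):
        for feature, words in word_lists.items():
            if token in words:
                found[feature].append(token)
    return dict(found)

TRAINING = [  # invented structured records
    {"Brand": "Apple", "Storage": "8GB", "Display": "1.5-inch"},
    {"Brand": "Samsung", "Storage": "16GB", "Display": "4.3-inch"},
]
lists = learn_word_lists(TRAINING)
result = extract("Apple iPod nano (8GB, 6th generation) with 1.5-inch display", lists)
print(result)
```

Because values are split into single tokens, a multi-token value such as "iPod nano" cannot be recovered as one unit, and a token that occurs in two word lists is assigned to both features. This matches the two cons listed on the slide.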

Page 25: Integrating Product Data from Websites offering Microdata Markup

Extractors – Feature-Value Pairs

• Learning
  – Learns feature-value pairs from the structured data

• Extraction
  – Tagging: taking n-grams of up to 4 tokens and matching them against the values from the training set
  – Parsing: taking the combination of feature-value pairs that best describes an object from the training dataset

• Pros
  – Extracting multi-token values

• Cons
  – Inconclusive for values that refer to more than one feature

Example feature-value pairs:

<Model, Asus EEE 10.1 Inch> <Processor, 1.66 GHz Intel Atom N445> <Display, 10.1-inches>

..

<Model, Panasonic Viera> <Display, 42-Inch>
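
The tagging step can be sketched as follows, assuming a greedy longest-match strategy over n-grams of at most 4 tokens. The greedy strategy, the names `learn_pairs` and `tag`, and the sample text are assumptions for illustration. The parsing step, which picks the best combination of pairs, is not covered here.

```python
def learn_pairs(training_records):
    """Map each lower-cased value to the set of features it was seen with."""
    value_to_features = {}
    for record in training_records:
        for feature, value in record.items():
            value_to_features.setdefault(value.lower(), set()).add(feature)
    return value_to_features

def tag(text, value_to_features, max_n=4):
    """Greedy longest-match tagging over token n-grams of length max_n down to 1."""
    tokens = text.lower().split()
    tagged, i = [], 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in value_to_features:
                for feature in sorted(value_to_features[candidate]):
                    tagged.append((feature, candidate))
                i += n
                break
        else:
            i += 1  # no n-gram starting at this token matched
    return tagged

TRAINING = [  # pairs taken from the slide's example
    {"Model": "Asus EEE 10.1 Inch", "Display": "10.1-inches"},
    {"Model": "Panasonic Viera", "Display": "42-Inch"},
]
pairs = learn_pairs(TRAINING)
tags = tag("New Panasonic Viera 42-Inch plasma TV", pairs)
print(tags)
```

A value seen with more than one feature yields one tag per feature, which is where the "inconclusive" case arises and where a subsequent parsing step would have to choose.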

Page 26: Integrating Product Data from Websites offering Microdata Markup

Extractors – Manual Configuration

Manually configure features and extraction methods

1. Regular expressions
   • E.g. Processor: \d*\.?\d+GHz

2. Dictionary search
   • E.g. dictionary of brands (Samsung, Panasonic, Lenovo, Apple)

• Pros
  – Extraction process can be fine-tuned according to the data
  – Good solution when no training (structured) data are available

• Cons
  – Needs domain knowledge
  – Non-trivial to pick extraction methods efficiently by hand
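
As a small illustration of a manual configuration, the sketch below applies the regular expression and the brand dictionary from the slide to two invented product texts. The function name `extract_manual` and both input strings are assumptions for this example.

```python
import re

PROCESSOR_RE = re.compile(r"\d*\.?\d+GHz")             # pattern from the slide
BRANDS = {"samsung", "panasonic", "lenovo", "apple"}   # dictionary from the slide

def extract_manual(text):
    """Apply the regular expression and the brand dictionary to one text."""
    features = {}
    match = PROCESSOR_RE.search(text)
    if match:
        features["Processor"] = match.group(0)
    for token in re.findall(r"[A-Za-z]+", text):
        if token.lower() in BRANDS:
            features["Brand"] = token
            break
    return features

with_unit_attached = extract_manual("Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 4 GB")
with_space = extract_manual("Lenovo laptop, 1.6 GHz Intel Core i5")
print(with_unit_attached)
print(with_space)
```

The second text writes the clock speed as "1.6 GHz" with a space, so the pattern as given does not match it. This is the kind of data-specific tuning (here, for example, allowing optional whitespace before the unit) that manual configuration requires.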

Page 27: Integrating Product Data from Websites offering Microdata Markup

Extraction Experiments

• Dataset for extraction: 5,000 electronics products from WDC

• Training dataset (structured data)
  – Amazon dataset of 20 electronics products

Page 28: Integrating Product Data from Websites offering Microdata Markup

Extraction Accuracy

Product          Brand   Model   Storage   Display   Processor   Dimension

iPod Nano         .92     .98      .86       .49        .12         .78
Galaxy SII        .72     .87      .89       .81        .40         .91
GalaxyTab 7.7     .80     .92      .89       .85        .72         .93
Ixus 120IS        1       .96      N/A       .89        N/A         .56
Vaio VPC          .99     .65      .81       .77        .73         .32
Viera 42          .95     .72      N/A       .82        N/A         .64
Sandisk           1       1        .85       N/A        N/A         .31

• Extraction using the Combination configuration (bag-of-words for Brand, Storage and Display; feature-value pairs for Model and Dimension; custom regular expression for Processor)

Page 29: Integrating Product Data from Websites offering Microdata Markup

Outline

1. HTML-embedded Data on the Web

2. The Data Integration Pipeline

1. Microdata extraction

2. Classification

3. Feature extraction

4. Identity resolution

5. Data Fusion

3. Conclusions

Page 30: Integrating Product Data from Websites offering Microdata Markup

Identity Resolution

• We used Silk, a tool for discovering relationships between data items within different linked data sources
  – Provides an expressive language for defining linkage rules
  – Uses genetic programming to learn linkage rules
  – Has shown high performance on various datasets
  – https://www.assembla.com/spaces/silk/wiki/Home

Page 31: Integrating Product Data from Websites offering Microdata Markup

Identity Resolution Experiments

• Gold standard: 5,000 manually annotated links
  – 2,500 positive / 2,500 negative

• Reference set: Amazon dataset of 20 electronics products

• Experiment on 5 configurations

– Baseline (no feature extraction step)

– Bag-of-words

– Feature-value pairs

– Manual configuration

– Combinations

Page 32: Integrating Product Data from Websites offering Microdata Markup

Silk Output: Learned Linkage Rule

[Figure: linkage rule learned by Silk, drawn as a tree of operators. The tree layout is lost in this transcript; the operators it contains are listed below.]

• Property pairs compared: wdc:Model / amazon:Model, wdc:Display / amazon:Display, wdc:Storage / amazon:Storage

• Transforms: lowerCase (two nodes), tokenize (two nodes)

• Comparisons: Levenshtein (threshold = 1.134), Jaccard (threshold = 0.23), Jaccard (threshold = 0.02)

• Aggregations: max, average
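
To make the operators on this slide concrete, here is one way such a rule could be evaluated in Python. Several things are assumed, because the tree layout is not recoverable from the transcript: that Model is compared with Levenshtein after lowerCase, that Display is tokenized and compared with Jaccard at threshold 0.23, that Storage uses Jaccard at threshold 0.02, that `max` combines the two Jaccard comparisons, and that `average` sits at the root. The scoring function (1 at distance 0, 0 at the threshold, clipped at -1) is likewise an assumption and should be checked against the Silk documentation.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def jaccard_distance(tokens_a, tokens_b):
    """One minus the Jaccard coefficient of two token sets."""
    a, b = set(tokens_a), set(tokens_b)
    return 1.0 - len(a & b) / len(a | b) if a | b else 0.0

def score(distance, threshold):
    """Assumed scoring: 1 at distance 0, 0 at the threshold, clipped at -1."""
    return max(-1.0, 1.0 - distance / threshold)

def linkage_score(wdc, amazon):
    """Assumed wiring: average(model, max(display, storage))."""
    model = score(levenshtein(wdc["Model"].lower(), amazon["Model"].lower()), 1.134)
    display = score(jaccard_distance(wdc["Display"].split(), amazon["Display"].split()), 0.23)
    storage = score(jaccard_distance([wdc["Storage"]], [amazon["Storage"]]), 0.02)
    return (model + max(display, storage)) / 2

# Invented records for illustration.
offer = {"Model": "iPod Nano", "Display": "1.5 inch", "Storage": "8GB"}
reference = {"Model": "ipod nano", "Display": "1.5 inch display", "Storage": "8GB"}
other = {"Model": "Galaxy SII", "Display": "4.3 inch", "Storage": "16GB"}
same = linkage_score(offer, reference)
different = linkage_score(offer, other)
print(same, different)
```

Under these assumptions a positive score suggests a match and a negative score a non-match.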

Page 33: Integrating Product Data from Websites offering Microdata Markup

Identity Resolution Results

Configuration          Precision %   Recall %   F-Measure %

Baseline                    69           90         78.1
Bag-of-words                75           82         77.9
Feature-value pairs         80           77         78.4
Custom                      82           80         80.9
Combination                 85           80         82.4
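
For reference, the F-measure is the harmonic mean of precision and recall. The short sketch below recomputes it for the Baseline and Combination rows. The function name is chosen for this example, and values recomputed from the rounded percentages can differ somewhat from the reported ones.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)

baseline = round(f_measure(69, 90), 1)
combination = round(f_measure(85, 80), 1)
print(baseline, combination)
```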

Page 34: Integrating Product Data from Websites offering Microdata Markup

Outline

1. HTML-embedded Data on the Web

2. The Data Integration Pipeline

1. Microdata extraction

2. Classification

3. Feature extraction

4. Identity resolution

5. Data Fusion

3. Conclusions

Page 35: Integrating Product Data from Websites offering Microdata Markup

Data Fusion

• Input: clusters of products after identity resolution

• Properties worth fusing/combining – AggregateRating and Review
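
The slides do not state how the two properties are fused. One plausible sketch, assuming reviews are merged with verbatim duplicates dropped and ratings are combined as a mean weighted by each shop's rating count, is shown below. The function `fuse_cluster`, the record layout, and the sample cluster are hypothetical; `ratingValue` and `ratingCount` are the Schema.org AggregateRating property names.

```python
def fuse_cluster(offers):
    """Fuse one cluster of offers that identity resolution mapped to the same product."""
    reviews, seen = [], set()
    rating_sum, rating_count = 0.0, 0
    for offer in offers:
        for review in offer.get("reviews", []):
            if review not in seen:  # drop verbatim duplicate reviews
                seen.add(review)
                reviews.append(review)
        rating = offer.get("aggregateRating")
        if rating:  # weight each shop's mean rating by its rating count
            rating_sum += rating["ratingValue"] * rating["ratingCount"]
            rating_count += rating["ratingCount"]
    fused_rating = rating_sum / rating_count if rating_count else None
    return {"offers": len(offers), "reviews": reviews,
            "ratingValue": fused_rating, "ratingCount": rating_count}

CLUSTER = [  # invented offers for one product
    {"shop": "shop-a.example", "reviews": ["Great battery life."],
     "aggregateRating": {"ratingValue": 4.0, "ratingCount": 10}},
    {"shop": "shop-b.example", "reviews": ["Great battery life.", "Screen is small."],
     "aggregateRating": {"ratingValue": 5.0, "ratingCount": 30}},
    {"shop": "shop-c.example"},
]
fused = fuse_cluster(CLUSTER)
print(fused)
```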

Page 36: Integrating Product Data from Websites offering Microdata Markup

Fusion Results

Product                      Offers   Reviews   Ratings

iPod Nano 8GB                  829       84         0
iPhone 4 16GB                  624       35        52
Sony Ericsson Xperia Mini      450       31        12
iPad 16GB                      423       40        48
Motorola XOOM 32GB             270       12         0
Samsung Galaxy SII             142        8         0

Page 37: Integrating Product Data from Websites offering Microdata Markup

Conclusions

• By using Microdata, thousands of websites help us to understand their content

• We have implemented a 5-step data integration pipeline
  – From Microdata markup to an integrated dataset

• The newly introduced feature extraction step is crucial for the precision of data integration
  – Identity resolution precision increases from 69% to 85%

• Future work
  – Automatically learning regular expressions
  – Automatically discovering combinations of extractors

Page 38: Integrating Product Data from Websites offering Microdata Markup

Questions?
