Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information...
![Page 1: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/1.jpg)
Web Science
Introduction to Information Integration
Julien Gaugaz, October 26, 2010
![Page 9: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/9.jpg)
Topics
• 1. Information Integration
• 2. Web Information Retrieval
• 3. Entity Search
• 4. Web Usage
• 5. Collaborative Web
• 6. Web Archiving
• 7. Medical Social Web
![Page 10: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/10.jpg)
Scenarios
Why Integrate Information?
![Page 11: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/11.jpg)
Company Mergers
![Page 15: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/15.jpg)
Travelling Agent
Agent
![Page 16: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/16.jpg)
Booking Flights
Agent
![Page 17: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/17.jpg)
Leveraging Wikipedia Infoboxes
Query
Data Contribution
![Page 18: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/18.jpg)
Evolution
Figure: number of sources (log scale, 1E+00 to 1E+06) over time (1960 to 2010). Annotated eras: Beginning of Databases; Rise of Internet & Wrapping Websites; Wikipedia & Social Web.
![Page 19: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/19.jpg)
Kinds of discrepancies
What is the Problem?
![Page 20: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/20.jpg)
Wikipedia Infoboxes
http://en.wikipedia.org/wiki/Berlin
| leader_title = [[List of mayors of Berlin|Governing Mayor]]
| leader = Klaus Wowereit
| elevation = 34 - 115
| pop_date = 2010-03-31
| population = 3440441
| pop_metro = 5000000
http://de.wikipedia.org/wiki/Berlin
| [[(...)|Reg. Bürgermeister]]: || [[Klaus Wowereit]]
| [[Höhe]] : || 34–115 m ü. NN
| [[Einwohner]] : || {{Metadaten Einwohnerzahl DE-BE|Berlin}}[...] (rendered as: 3.443.735 (31. Mai 2010))
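The two Berlin infoboxes state comparable facts in different value formats. The following is a minimal sketch, not part of the original slides, of how the German rendered value could be normalized before it is compared with the English one. The helper names (`parse_population_de`, `parse_date_de`) and the hand-written month table are assumptions made for illustration.

```python
from datetime import date

# German month names (assumption: a hand-written table, not a full locale).
GERMAN_MONTHS = {
    "Januar": 1, "Februar": 2, "März": 3, "April": 4, "Mai": 5, "Juni": 6,
    "Juli": 7, "August": 8, "September": 9, "Oktober": 10,
    "November": 11, "Dezember": 12,
}

def parse_population_de(text: str) -> int:
    """Parse a German-formatted integer such as '3.443.735' (dots group thousands)."""
    return int(text.replace(".", ""))

def parse_date_de(text: str) -> date:
    """Parse a German date such as '31. Mai 2010'."""
    day, month, year = text.split()
    return date(int(year), GERMAN_MONTHS[month], int(day.rstrip(".")))

# Values copied from the two infoboxes above.
en = {"population": int("3440441"), "pop_date": date.fromisoformat("2010-03-31")}
de = {
    "population": parse_population_de("3.443.735"),
    "pop_date": parse_date_de("31. Mai 2010"),
}

# Once both sides share one format, what remains is a genuine difference:
# the two editions report figures taken on different dates.
difference = de["population"] - en["population"]
print(difference, en["pop_date"], de["pop_date"])
```

Only the formats are reconciled here; deciding which of the two figures to trust is a separate integration step.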
![Page 27: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/27.jpg)
Wikipedia Infoboxes
http://en.wikipedia.org/wiki/San_Francisco
|leader_title = [[Mayor of San Francisco|Mayor]]
|leader_name = [[Gavin Newsom]] ([[Democratic [...]|D]])
|elevation_ft = 52
|elevation_max_ft = 925
|elevation_min_ft = 0
|population_as_of = 2008
|population_total = 815358
|population_metro = 4203898
|population_urban = 3228605
http://en.wikipedia.org/wiki/Berlin
| leader_title = [[List of mayors of Berlin|Governing Mayor]]
| leader = Klaus Wowereit
| elevation = 34 - 115
| pop_date = 2010-03-31
| population = 3440441
| pop_metro = 5000000
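The San Francisco and Berlin infoboxes use different property names and units for the same kinds of facts. As a rough sketch under stated assumptions, the two schemas could be mapped onto one common schema as below. The common property names, the helper names, and the reading of Berlin's `elevation` as metres (suggested by the German infobox, "m ü. NN") are all assumptions, not something the slides prescribe.

```python
import re

FEET_TO_METRES = 0.3048

# Field values copied from the two English infoboxes above.
san_francisco = {
    "leader_name": "[[Gavin Newsom]] ([[Democratic [...]|D]])",
    "elevation_min_ft": "0",
    "elevation_max_ft": "925",
    "population_total": "815358",
}
berlin = {
    "leader": "Klaus Wowereit",
    "elevation": "34 - 115",
    "population": "3440441",
}

def first_link_text(value: str) -> str:
    """Return the target of the first [[wiki link]], or the plain value if none."""
    match = re.search(r"\[\[([^\]|]+)", value)
    return match.group(1) if match else value.strip()

def san_francisco_to_common(box: dict) -> dict:
    return {
        "leader": first_link_text(box["leader_name"]),
        "elevation_min_m": round(int(box["elevation_min_ft"]) * FEET_TO_METRES),
        "elevation_max_m": round(int(box["elevation_max_ft"]) * FEET_TO_METRES),
        "population": int(box["population_total"]),
    }

def berlin_to_common(box: dict) -> dict:
    # Assumption: the range "34 - 115" is given in metres.
    low, high = (int(part) for part in box["elevation"].split("-"))
    return {
        "leader": first_link_text(box["leader"]),
        "elevation_min_m": low,
        "elevation_max_m": high,
        "population": int(box["population"]),
    }

sf_common = san_francisco_to_common(san_francisco)
berlin_common = berlin_to_common(berlin)
print(sf_common)
print(berlin_common)
```

Each source needs its own hand-written mapping in this sketch, which is exactly the effort that grows with the number of sources.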
![Page 33: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/33.jpg)
Causes of Discrepancies
• Information sources are diverse
  • Different cultural background
  • Different domain of activity
  • Different model of information
• Typos and other kinds of errors
• Evolution over time
  • Use, usage and users of one source may change over time
![Page 39: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/39.jpg)
Places of Discrepancies
Information levels where discrepancies appear:
• Semantic: meaning, sense
• Representational
  • Lexical: the word or term representing the meaning
  • Structural: how the terms are arranged to represent the meaning
• Syntactic: how the lexical and structural levels are encoded into characters (and bits)
Discrepancies may concern:
• Schema elements (properties and structure) and values
![Page 44: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/44.jpg)
Schema Discrepancies
• Semantic: Einstein’s full name is “Albert Einstein”
• Representational:
  • Einstein → name → first: “Albert”, last: “Einstein”
  • Einstein → full_name → “Albert Einstein”
• Syntactic:
  • <Einstein> <full_name> “Albert Einstein”.
  • <Einstein> <full_name>Albert Einstein</full_name></Einstein>
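The two representational variants on this slide carry the same meaning with a different structure. A small sketch, assuming plain Python dictionaries as a stand-in for the graphs and using hypothetical function names, of how one structure could be translated into the other:

```python
# Structure 1: the name is split into parts.
nested = {"name": {"first": "Albert", "last": "Einstein"}}

# Structure 2: the name is a single value.
flat = {"full_name": "Albert Einstein"}

def nested_to_flat(record: dict) -> dict:
    """Join first and last name into one full_name value."""
    name = record["name"]
    return {"full_name": name["first"] + " " + name["last"]}

def flat_to_nested(record: dict) -> dict:
    """Split full_name at the last space.

    Assumption: the last token is the family name. This does not hold for
    every name, so this direction of the mapping is only a heuristic.
    """
    first, _, last = record["full_name"].rpartition(" ")
    return {"name": {"first": first, "last": last}}

print(nested_to_flat(nested))
print(flat_to_nested(flat))
```

The mapping from the split form to the joined form is safe, while the reverse direction has to guess where the name splits.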
![Page 47: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/47.jpg)
Schema Ambiguity
• Representational: xyz → title → “Prof. Dr. techn.” (semantic: person title)
• Representational: xyz → title → “The Theory of Relativity” (semantic: article title)
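One common way to resolve this kind of ambiguity is to replace the bare property name with a globally qualified one. The sketch below is an illustration and not taken from the slides: it assumes the class of the described entity is known, and it picks the FOAF and Dublin Core Terms vocabularies as example targets.

```python
# Qualified property identifiers from two existing vocabularies.
FOAF_TITLE = "http://xmlns.com/foaf/0.1/title"    # honorific title of a person
DCTERMS_TITLE = "http://purl.org/dc/terms/title"  # name given to a work

# Hypothetical lookup table: (class of the entity, local property) -> qualified property.
QUALIFIED_PROPERTY = {
    ("Person", "title"): FOAF_TITLE,
    ("Article", "title"): DCTERMS_TITLE,
}

def qualify(entity_class: str, local_property: str) -> str:
    """Map an ambiguous local property name to an unambiguous identifier."""
    return QUALIFIED_PROPERTY[(entity_class, local_property)]

print(qualify("Person", "title"))   # used for "Prof. Dr. techn."
print(qualify("Article", "title"))  # used for "The Theory of Relativity"
```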
![Page 48: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/48.jpg)
Value Discrepancies
• Semantic: Einstein’s full name is “Albert Einstein”
• Representational: Einstein → full_name → “Albert Einstein”, “Albert Einstin”, “A. Einstein”, or “Einstein, Albert”
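The four spellings above differ through a typo, an abbreviation, and a different word order. A minimal sketch of how such variants might be matched, using `difflib` from the Python standard library; the function names, the normalization rules, and the 0.9 threshold are assumptions chosen for this example:

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, drop dots, and turn 'Last, First' into 'first last'."""
    name = name.strip()
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        name = first + " " + last
    return " ".join(name.lower().replace(".", " ").split())

def similarity(a: str, b: str) -> float:
    """Character-level similarity of the normalized names, between 0 and 1."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def same_person(a: str, b: str, threshold: float = 0.9) -> bool:
    tokens_a, tokens_b = normalize(a).split(), normalize(b).split()
    if tokens_a and tokens_b:
        same_last = tokens_a[-1] == tokens_b[-1]
        same_initial = tokens_a[0][0] == tokens_b[0][0]
        one_is_initial = 1 in (len(tokens_a[0]), len(tokens_b[0]))
        # "A. Einstein" matches "Albert Einstein" through the initial.
        if same_last and same_initial and one_is_initial:
            return True
    return similarity(a, b) >= threshold

variants = ["Albert Einstein", "Albert Einstin", "A. Einstein", "Einstein, Albert"]
print([same_person("Albert Einstein", v) for v in variants])
```

Rules like these trade precision for recall: matching on an initial alone would also accept a different person named "A. Einstein".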
![Page 49: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/49.jpg)
Where discrepancies are addressed with standards
Syntactic Level
![Page 52: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/52.jpg)
Encoding Bytes
• Basic unit
  • Universal standard: bit (binary digit)
  • Ternary digit (base 3, USSR, 1950s, out of use)
• Bits into bytes
  • Big or little endian
  • System-wide convention, easily convertible, defined in communication protocols
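To illustrate how easily the two byte orders convert into each other, here is a short sketch using only the Python standard library. The example value is arbitrary.

```python
import struct

value = 0x12345678

# The same integer laid out in the two byte orders.
big = value.to_bytes(4, byteorder="big")        # most significant byte first
little = value.to_bytes(4, byteorder="little")  # least significant byte first

# Converting between the two is just a reversal of the bytes.
assert little == big[::-1]

# Both decode to the same number when read with the matching convention.
assert int.from_bytes(big, "big") == int.from_bytes(little, "little") == value

# Communication protocols fix the convention: '!' in struct means
# network byte order, which is big endian.
network = struct.pack("!I", value)
print(big.hex(), little.hex(), network.hex())
```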
![Page 56: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/56.jpg)
Encoding Characters
• De facto standards:
  • UTF-8/16
• Many others exist: ASCII, ISO-8859’s, KOI-8, ...
• Trivial dictionary-based translation
  • When the corresponding code exists in the target character map...
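The caveat in the last bullet can be made concrete with Python's built-in codecs. This is a small sketch; the helper name `can_encode` and the sample strings are chosen for illustration.

```python
text = "Höhe"

# The same characters produce different bytes under different encodings.
utf8_bytes = text.encode("utf-8")         # 'ö' takes two bytes
latin1_bytes = text.encode("iso-8859-1")  # 'ö' takes one byte

# Translation between encodings: decode with the source map, encode with the target map.
converted = latin1_bytes.decode("iso-8859-1").encode("utf-8")
assert converted == utf8_bytes

def can_encode(s: str, encoding: str) -> bool:
    """True if every character of s exists in the target character map."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Cyrillic text fits KOI8-R but has no codes in ISO-8859-1.
print(can_encode("Берлин", "koi8-r"), can_encode("Берлин", "iso-8859-1"))
```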
![Page 58: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/58.jpg)
Encoding Lexico-Structural
• XML, XML Schema
  • Structured document serialization format
  • Base for:
    • (X)HTML
    • SVG: Scalable Vector Graphics
    • DOCX: Microsoft Office Word 2007
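Because XML fixes how structure is written down, a generic parser can recover that structure without knowing the vocabulary. A minimal sketch with Python's standard `xml.etree.ElementTree`, reusing the Einstein example from the earlier slides:

```python
import xml.etree.ElementTree as ET

document = "<Einstein><full_name>Albert Einstein</full_name></Einstein>"

# Parse the serialized characters back into a tree of elements.
root = ET.fromstring(document)
record = {root.tag: {child.tag: child.text for child in root}}
print(record)

def is_well_formed(xml_text: str) -> bool:
    """True if the text can be parsed as XML at all."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# A missing closing tag is a syntactic error that the parser rejects.
print(is_well_formed("<Einstein><full_name>Albert Einstein</Einstein>"))
```

The parser settles the syntactic level only; whether `full_name` means the same thing in two sources remains a schema question.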
![Page 59: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/59.jpg)
Resource Description Framework
Encoding information
RDF
![Page 64: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/64.jpg)
• <subject> <property> <object>
• <subject>
  • URI or blank node
• <property>
  • URI
• <object>
  • URI or blank node or (typed) literal
source: http://www.xml.com/2003/02/05/graphics/graph1.gif
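The rules for what may appear in each position of a triple can be sketched as a small data model. The class names below are chosen for this illustration and do not come from any RDF library, and `http://example.org/` is a placeholder namespace.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class URI:
    value: str

@dataclass(frozen=True)
class BlankNode:
    label: str

@dataclass(frozen=True)
class Literal:
    value: str
    datatype: Optional[URI] = None  # set for typed literals

@dataclass(frozen=True)
class Triple:
    subject: Union[URI, BlankNode]
    property: URI
    object: Union[URI, BlankNode, Literal]

    def __post_init__(self):
        # Enforce the position rules listed above at construction time.
        if not isinstance(self.subject, (URI, BlankNode)):
            raise TypeError("subject must be a URI or a blank node")
        if not isinstance(self.property, URI):
            raise TypeError("property must be a URI")
        if not isinstance(self.object, (URI, BlankNode, Literal)):
            raise TypeError("object must be a URI, a blank node, or a literal")

EX = "http://example.org/"  # hypothetical namespace
statement = Triple(URI(EX + "Einstein"), URI(EX + "full_name"), Literal("Albert Einstein"))
print(statement)
```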
![Page 67: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/67.jpg)
URI
• URI: Uniform Resource Identifier
  • URLs are URIs
  • scheme:scheme-specific-part
  • RDF encourages using URLs
• URL
  • scheme://usr:passwd@domain:port/path?query_string#anchor
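The URL pattern above can be taken apart with `urllib.parse` from the Python standard library. This is a brief sketch; the host `example.org` and the concrete component values are made up for the example.

```python
from urllib.parse import urlsplit

url = "http://usr:passwd@example.org:8080/path?query_string=1#anchor"
parts = urlsplit(url)

components = {
    "scheme": parts.scheme,
    "user": parts.username,
    "password": parts.password,
    "domain": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "query_string": parts.query,
    "anchor": parts.fragment,  # the standard library calls this the fragment
}
print(components)

# A URI that is not a URL still follows scheme:scheme-specific-part.
urn = urlsplit("urn:isbn:0451450523")
print(urn.scheme)
```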
![Page 71: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/71.jpg)
RDF
• Resource Description Framework
• Data model specialized in conceptual information modeling
• Supported by various serialization formats:
  • RDF/XML
  • Notation3 (N3)
  • Turtle
  • ...
![Page 76: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/76.jpg)
RDF Schema (RDF/S)
• Expressed in RDF
• Types subjects and objects with classes
  • Class hierarchy (with multiple inheritance)
  • Type of properties of a class
• Types properties
  • Domain: type of property’s subject
  • Range: type of property’s object
• OWL 2 is more expressive: cardinality, etc.
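To make the domain, range and class hierarchy bullets concrete, here is a minimal sketch that stores RDF statements as plain (subject, predicate, object) tuples and derives types from them. It is a naive fixpoint loop written for illustration, not the API of any RDF library, and all `ex:` names are hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"


def infer_types(triples):
    """Apply domain, range and subclass rules until nothing new is derived."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                # Domain: the subject of property p is an instance of o2.
                if s2 == p and p2 == DOMAIN:
                    new.add((s, RDF_TYPE, o2))
                # Range: the object of property p is an instance of o2.
                if s2 == p and p2 == RANGE:
                    new.add((o, RDF_TYPE, o2))
                # Class hierarchy: instances of a class belong to its superclasses.
                if p == RDF_TYPE and s2 == o and p2 == SUBCLASS:
                    new.add((s, RDF_TYPE, o2))
        if new <= facts:
            return facts
        facts |= new


# Hypothetical schema and data, loosely following the Boxer/Trainer example.
triples = {
    ("ex:Boxer", SUBCLASS, "ex:Athlete"),
    ("ex:Boxer", SUBCLASS, "ex:Taxpayer"),   # multiple inheritance
    ("ex:trainedBy", DOMAIN, "ex:Boxer"),
    ("ex:trainedBy", RANGE, "ex:Trainer"),
    ("ex:boxer1", "ex:trainedBy", "ex:trainer1"),
}

inferred = infer_types(triples)
for triple in sorted(inferred - triples):
    print(triple)
```

The single data statement is enough to type `ex:boxer1` as a Boxer (from the domain), as an Athlete and a Taxpayer (from the two superclasses), and `ex:trainer1` as a Trainer (from the range).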
![Page 79: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/79.jpg)
When to use RDF?
• RDF is good at
  • Modeling information
  • Especially when the schema is unknown or changing
  • When there are multiple schemas
• RDF is not for
  • Representing documents (XHTML, CSS)
  • Internal data management when the schema is known and fixed (relational databases)
![Page 80: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/80.jpg)
Schema Matching
Discrepancies between the representational and semantic levels in the schema
![Page 83: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/83.jpg)
Boxer: name, boxer id, weight, birthdate, total fights, residence
Taxpayer: first name, last name, age, address (street, city), tax id
Company: ...
Trainer: ...
Tax Office: ...
• Input: Schemas to match
  • Possibly data instantiating those schemas
• Output: Mappings between schema elements
  • Possibly with confidence values and alternatives
  • Possibly with value conversion rules (matchings)
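One way to picture the output described above is as a list of candidate correspondences, each carrying a confidence value and, optionally, a value conversion rule. The sketch below is only an illustration: the class names, the confidence numbers and the conversion rule text are assumptions made up for the Boxer/Taxpayer example, not results from the lecture.

```python
from dataclasses import dataclass, field


@dataclass
class Correspondence:
    source: str            # element of the first schema
    target: str            # element of the second schema
    confidence: float      # hypothetical score in [0, 1]
    conversion: str = ""   # free-text value conversion rule, if any


@dataclass
class MatchResult:
    candidates: list = field(default_factory=list)

    def best(self, source):
        """Return the alternatives for one source element, most confident first."""
        hits = [c for c in self.candidates if c.source == source]
        return sorted(hits, key=lambda c: c.confidence, reverse=True)


# Invented values, for illustration only.
result = MatchResult([
    Correspondence("Boxer.name", "Taxpayer.first name", 0.5),
    Correspondence("Boxer.name", "Taxpayer.last name", 0.6),
    Correspondence("Boxer.residence", "Taxpayer.address", 0.8),
    Correspondence("Boxer.birthdate", "Taxpayer.age", 0.7,
                   conversion="age = current year - year of birthdate"),
])

for candidate in result.best("Boxer.name"):
    print(candidate.target, candidate.confidence)
```

Keeping the alternatives instead of a single winner lets a user or a later processing step pick among them.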
![Page 86: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/86.jpg)
Mappings or Matching?
• Schema mapping identifies correspondences between schema elements
• Schema matching actually transforms an instance of one schema into an instance of another schema
![Page 87: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/87.jpg)
How to Use Mappings?
General architectures
![Page 96: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/96.jpg)
Mediated Schemas
[Diagram; labels: Query, Mediated Schema, Schema1, Schema2, Schema3, Schema x]
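The diagram survives only as labels, so the following is a sketch under a stated assumption: a query is posed once against the mediated schema and translated, through one mapping per source, into the vocabulary of each source schema. The mappings, attribute names and records below are all hypothetical.

```python
# Hypothetical mappings: mediated attribute -> attribute name in each source.
MAPPINGS = {
    "Schema1": {"name": "name", "city": "residence"},
    "Schema2": {"name": "last name", "city": "city"},
}

# Hypothetical source data, each in its own schema.
SOURCES = {
    "Schema1": [{"name": "Smith", "residence": "Hannover"}],
    "Schema2": [
        {"last name": "Jones", "city": "Hannover"},
        {"last name": "Brown", "city": "Berlin"},
    ],
}


def query_mediated(attribute, value, select):
    """Answer 'select <select> where <attribute> = <value>' over all sources."""
    answers = []
    for source, mapping in MAPPINGS.items():
        if attribute not in mapping or select not in mapping:
            continue  # this source cannot answer the query
        for record in SOURCES[source]:
            # Rewrite the mediated attribute names into the source's own names.
            if record.get(mapping[attribute]) == value:
                answers.append(record[mapping[select]])
    return answers


print(query_mediated("city", "Hannover", "name"))
```

With n sources this design needs n mappings to the mediated schema rather than a mapping between every pair of sources.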
![Page 97: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/97.jpg)
Peer Data Management
[Diagram; labels: Local Source, Local Mapping, Local Schema, Peer Schema, Peer Mapping]
![Page 101: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/101.jpg)
Why not by hand?
• Size and complexity of source schemas
• Number of schema sources
• Leveraging data instance values
• Schemas not known in advance
Image sources: http://www.geneontology.org/images/diag-godb-er.jpg, http://www.atutor.ca/development/documentation/database.gif
![Page 106: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/106.jpg)
Schema Matching Features
• Schema-only vs schema & instances
• Representational
• Lexical vs structural
• Internal vs external
More in:
• Rahm E, Bernstein PA. A survey of approaches to automatic schema matching. The VLDB Journal. 2001;10(4):334-350.
• Shvaiko P, Euzenat J. A Survey of Schema-Based Matching Approaches. Journal on Data Semantics IV. 2005;3730:146-171.
![Page 120: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/120.jpg)
Schema Matching Techniques
• String-based
• Language-based
• Linguistic resources
• Constraint-based
• Alignment reuse
• Upper-level formal ontologies
• Graph-based
• Taxonomy-based
• Repository of structures
• Model-based
![Page 121: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/121.jpg)
A String-Based Technique
Leveraging lexical features
![Page 126: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/126.jpg)
Edit Distance
• String distance: measures the distance between two strings
• Edit distance: number of operations needed to transform one string into the other
• Common basic operations:
  • Insert, delete or substitute one character
  • Possibly with different weights depending on the operation and characters involved
• Java libraries:
  • SecondString, SimMetrics
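The remark about weights depending on the operation and the characters involved can be sketched as an edit distance whose three costs are passed in as functions. This is plain Python written for illustration and does not reproduce the API of SecondString or SimMetrics; the case-aware substitution cost is an invented example of a character-dependent weight.

```python
def weighted_edit_distance(a, b, insert_cost, delete_cost, substitute_cost):
    """Cost of the cheapest sequence of single-character edits turning a into b."""
    # dist[i][j] holds the cost of turning a[:i] into b[:j].
    dist = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dist[i][0] = dist[i - 1][0] + delete_cost(a[i - 1])
    for j in range(1, len(b) + 1):
        dist[0][j] = dist[0][j - 1] + insert_cost(b[j - 1])
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dist[i][j] = min(
                dist[i - 1][j] + delete_cost(a[i - 1]),
                dist[i][j - 1] + insert_cost(b[j - 1]),
                dist[i - 1][j - 1] + substitute_cost(a[i - 1], b[j - 1]),
            )
    return dist[len(a)][len(b)]


def unit(ch):
    """Every insertion or deletion costs 1."""
    return 1.0


def plain_substitution(x, y):
    """Unit substitution cost; identical characters cost nothing."""
    return 0.0 if x == y else 1.0


def case_aware_substitution(x, y):
    """Invented example: a change of letter case is cheaper than a real substitution."""
    if x == y:
        return 0.0
    if x.lower() == y.lower():
        return 0.1
    return 1.0


print(weighted_edit_distance("kitten", "sitting", unit, unit, plain_substitution))
print(weighted_edit_distance("TaxID", "taxid", unit, unit, case_aware_substitution))
```

With unit costs everywhere the function reduces to the Levenshtein distance; with the case-aware cost, attribute names that differ only in capitalization come out almost identical.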
![Page 143: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/143.jpg)
Levenshtein Distance• Edit operations: insert, delete, substitute
• Each has a weight of 1
38
S a t u r d a y0 1 2 3 4 5 6 7 81 0 1 2 3 4 5 6 72 1 1 2 2 3 4 5 63 2 2 2 3 3 4 5 64 3 3 3 3 4 3 4 55 4 4 4 4 4 4 3 46 5 5 5 5 5 5 4 37 6 6 6 6 6 6 5 4
insert to Sundays
dele
te fr
om S
unda
ys substitute in Sundays
SundaysSatundaysSatundaysSaturdays
Sundays
![Page 144: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/144.jpg)
Levenshtein Distance• Edit operations: insert, delete, substitute
• Each has a weight of 1
38
S a t u r d a y0 1 2 3 4 5 6 7 81 0 1 2 3 4 5 6 72 1 1 2 2 3 4 5 63 2 2 2 3 3 4 5 64 3 3 3 3 4 3 4 55 4 4 4 4 4 4 3 46 5 5 5 5 5 5 4 37 6 6 6 6 6 6 5 4
insert to Sundays
dele
te fr
om S
unda
ys substitute in Sundays
SundaysSatundaysSatundaysSaturdaysSaturdays
Sundays
![Page 145: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/145.jpg)
Levenshtein Distance• Edit operations: insert, delete, substitute
• Each has a weight of 1
38
S a t u r d a y0 1 2 3 4 5 6 7 81 0 1 2 3 4 5 6 72 1 1 2 2 3 4 5 63 2 2 2 3 3 4 5 64 3 3 3 3 4 3 4 55 4 4 4 4 4 4 3 46 5 5 5 5 5 5 4 37 6 6 6 6 6 6 5 4
insert to Sundays
dele
te fr
om S
unda
ys substitute in Sundays
SundaysSatundaysSatundaysSaturdaysSaturdaysSaturdays
Sundays
![Page 146: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/146.jpg)
Levenshtein Distance• Edit operations: insert, delete, substitute
• Each has a weight of 1
38
S a t u r d a y0 1 2 3 4 5 6 7 81 0 1 2 3 4 5 6 72 1 1 2 2 3 4 5 63 2 2 2 3 3 4 5 64 3 3 3 3 4 3 4 55 4 4 4 4 4 4 3 46 5 5 5 5 5 5 4 37 6 6 6 6 6 6 5 4
insert to Sundays
dele
te fr
om S
unda
ys substitute in Sundays
SundaysSatundaysSatundaysSaturdaysSaturdaysSaturdaysSaturdays
Sundays
![Page 147: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/147.jpg)
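Levenshtein Distance
• Edit operations: insert, delete, substitute
• Each has a weight of 1
• Distance matrix for transforming Sundays (rows) into Saturday (columns):

|   |   | S | a | t | u | r | d | a | y |
|---|---|---|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| u | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
| n | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |
| d | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
| a | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 |
| y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 |
| s | 7 | 6 | 5 | 5 | 5 | 6 | 6 | 5 | 4 |

• Moving right in the matrix: insert into Sundays; moving down: delete from Sundays; moving diagonally: substitute in Sundays (or keep a matching character)
• Resulting edit sequence: Sundays → Satundays (insert a and t) → Saturdays (substitute n with r) → Saturday (delete s), so the distance is 4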
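The matrix on this slide can be filled row by row with dynamic programming. The following is a minimal Python sketch of that idea, assuming unit weights for all three operations as stated above; the function name `levenshtein` is chosen here for illustration and keeps only two rows of the matrix in memory.

```python
def levenshtein(source: str, target: str) -> int:
    """Unit-cost edit distance needed to turn source into target."""
    # previous[j] is the distance between the source prefix processed so far
    # and target[:j]; the first row builds target[:j] with j inserts.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # i deletes turn source[:i] into the empty string
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (s_char != t_char),  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Sundays", "Saturday"))  # → 4
```

The value printed is the bottom-right cell of the matrix.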
![Page 148: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/148.jpg)
WordNet
A Linguistic Resource
![Page 152: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/152.jpg)
WordNet
• Fundamental components: Synonym Sets (Synsets)
  • {car, auto, automobile, machine, motorcar}
    • a motor vehicle with four wheels; usually propelled by an internal combustion engine
  • {car, railcar, railway car, railroad car}
    • a wheeled vehicle adapted to the rails of a railroad
![Page 153: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/153.jpg)
Hypernyms / Hyponyms
• Hypernyms: superordinates, isA relationships. A synset may have more than one hypernym.
• Hyponyms: subordinates
• Example for {car, auto, automobile, machine, motorcar}:
  • hypernym: {motor vehicle, automotive vehicle}
  • hyponyms: {cab, hack, taxi, taxicab}, {ambulance}
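The hypernym and hyponym links form a graph over synsets. The sketch below models only the excerpt shown on this slide as a plain dictionary; it does not use the real WordNet database, and the names `HYPERNYMS`, `hyponyms` and `is_a` are hypothetical helpers introduced for illustration. Each synset is abbreviated by one of its words.

```python
# Toy excerpt of the hypernym graph from the slide (not the real WordNet data).
# Each synset is abbreviated by one of its words and maps to its hypernyms.
HYPERNYMS = {
    "car": ["motor vehicle"],
    "cab": ["car"],
    "ambulance": ["car"],
}


def hyponyms(synset):
    """Direct subordinates: all synsets that list `synset` as a hypernym."""
    return sorted(child for child, parents in HYPERNYMS.items() if synset in parents)


def is_a(synset, ancestor):
    """True if `ancestor` is reachable from `synset` by following hypernym links."""
    stack = list(HYPERNYMS.get(synset, []))
    seen = set()
    while stack:
        node = stack.pop()
        if node == ancestor:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(HYPERNYMS.get(node, []))
    return False


print(hyponyms("car"))               # → ['ambulance', 'cab']
print(is_a("cab", "motor vehicle"))  # → True
```

Storing a list of hypernyms per synset reflects the remark that a synset may have more than one hypernym.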
![Page 154: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/154.jpg)
Holonym / Meronym
• Meronym: name of a constituent part of, the substance of, or a member of something. X is a meronym of Y if X is a part of Y.
• Holonym: name of the whole of which the meronym names a part. Y is a holonym of X if X is a part of Y.
• Example:
  • holonym: {car, auto, automobile, machine, motorcar}
  • meronym: {accelerator, accelerator pedal, gas pedal, gas, throttle, gun}
![Page 158: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/158.jpg)
Other relationships in WN
• Antonym
• Entailment (for verbs)
  • A verb X entails Y if X cannot be done unless Y is, or has been, done.
• Attribute (for adjectives)
  • A noun for which adjectives express values. The noun weight is an attribute, for which the adjectives light and heavy express values.
![Page 159: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/159.jpg)
Leveraging structure
A Graph-Matching Technique
![Page 163: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/163.jpg)
Similarity Flooding
• Uses the structure of the data to help match schemas
• Similarity Flooding in Melnik et al. (2002)
  • First maps schema elements using lexical similarity
  • Then improves the matching by assuming that:
    • If two elements are similar, then the elements adjacent to them are more likely to be similar
Selected paper 1: Melnik S, Garcia-Molina H, Rahm E. Similarity flooding: a versatile graph matching algorithm and its application to schema matching. IEEE Comput. Soc; 2002:117-128.
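The propagation idea can be illustrated with a strongly simplified sketch. It is not the algorithm as published by Melnik et al.: edge labels and propagation weights are left out, and re-adding the initial scores in every round is a design choice of this sketch so that the lexical evidence is not washed out. The two toy schemas and the initial similarity values are made up for illustration.

```python
from itertools import product


def similarity_flooding(edges_a, edges_b, initial, iterations=10):
    """Propagate similarity between node pairs of two directed graphs."""
    nodes_a = {node for edge in edges_a for node in edge}
    nodes_b = {node for edge in edges_b for node in edge}
    # Two node pairs are neighbours if one edge in each graph connects them.
    neighbours = {pair: [] for pair in product(nodes_a, nodes_b)}
    for (a1, a2), (b1, b2) in product(edges_a, edges_b):
        neighbours[(a1, b1)].append((a2, b2))
        neighbours[(a2, b2)].append((a1, b1))
    sigma = {pair: initial.get(pair, 0.0) for pair in neighbours}
    for _ in range(iterations):
        updated = {
            pair: initial.get(pair, 0.0) + sigma[pair]
            + sum(sigma[n] for n in neighbours[pair])
            for pair in sigma
        }
        top = max(updated.values()) or 1.0  # normalise so the best pair scores 1
        sigma = {pair: value / top for pair, value in updated.items()}
    return sigma


# Hypothetical schemas: an edge points from an element to one of its attributes.
edges_a = [("Person", "name"), ("Person", "address")]
edges_b = [("Customer", "name"), ("Customer", "addr")]
initial = {("name", "name"): 1.0, ("address", "addr"): 0.6}  # lexical scores

sigma = similarity_flooding(edges_a, edges_b, initial)
print(sigma[("Person", "Customer")] > sigma[("Person", "name")])  # → True
```

Person and Customer share no lexical similarity, yet their pair gains a positive score because their attributes are similar, which is the assumption stated on the slide.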
![Page 164: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/164.jpg)
Detecting duplicate entries
Deduplication
![Page 165: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/165.jpg)
Why are there Duplicates?
Sport Authorities:
• name: Muhammad Ali
• boxer id: 1234567
• weight: 200 lb
• total fights: 61
• residence: 17, Nicestreet Louisville, KY
Tax Authorities:
• first name: Mohamed
• last name: Ali
• age: 68
• address:
  street: Nicestreet 17
  city: Wondercity
• tax id: #7234561
Merged into: Administration-wide database
![Page 166: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/166.jpg)
• Input: 2 entities with matched attributes
• Output: M for matched or U for unmatched.
  • Possibly R (reject) for cases between M and U where a manual (supervised) decision is necessary.
Example records:
• name: Muhammad Ali
• boxer id: 1234567
• weight: 200 lb
• total fights: 61
• residence: 17, Nicestreet Louisville, KY

• first name: Mohamed
• last name: Ali
• age: 68
• address:
  street: Nicestreet 17
  city: Wondercity
• tax id: #7234561

• name: Muhammad Ali
• address:
  city: Cairo
  country: Egypt
• tax id: #8244361
Decisions between pairs of these records: M, U, R
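One common way to obtain the three outcomes is to compare an overall similarity score against two thresholds. The sketch below assumes such a score in [0, 1] is already available; the function name `classify` and the threshold values 0.4 and 0.8 are hypothetical and would have to be tuned for real data.

```python
def classify(similarity: float, lower: float = 0.4, upper: float = 0.8) -> str:
    """Map a similarity score in [0, 1] to M, U, or R (hypothetical thresholds)."""
    if similarity >= upper:
        return "M"  # confident match
    if similarity <= lower:
        return "U"  # confident non-match
    return "R"      # in between: leave the decision to a human


print([classify(s) for s in (0.95, 0.6, 0.1)])  # → ['M', 'R', 'U']
```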
![Page 167: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/167.jpg)
Deduplication Features
![Page 173: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/173.jpg)
Field Distance Metrics
• Value metrics
  • Character-based: string-based metrics seen for schema matching
  • Token-based: similar to Information Retrieval techniques (Topic 2 next week)
  • Phonetic
  • Numeric: few techniques other than treating the values as strings or taking their direct difference
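As an illustration of the token-based and numeric categories, the sketch below uses Jaccard overlap of word tokens and a relative difference of numbers. Both are common choices picked here as assumptions; the slides do not prescribe a particular formula, and the function names are hypothetical.

```python
import re


def tokens(value: str) -> set:
    """Lower-cased word tokens of a field value."""
    return set(re.findall(r"\w+", value.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-based similarity: shared tokens divided by all tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def numeric_similarity(x: float, y: float) -> float:
    """Numeric similarity from the direct difference, clamped to [0, 1]."""
    if x == y:
        return 1.0
    return max(0.0, 1.0 - abs(x - y) / max(abs(x), abs(y)))


# The two address spellings from the Muhammad Ali records share 2 of 4 tokens.
print(jaccard("Nicestreet 17", "17, Nicestreet Louisville, KY"))  # → 0.5
print(numeric_similarity(200, 100))                               # → 0.5
```

Because tokens are compared as a set, the different word order in the two addresses does not matter, which a character-based metric would penalise.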
![Page 180: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/180.jpg)
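Phonex
1. First letter as prefix
2. Encode non-prefix consonants
3. Remove duplicate adjacent codes not separated by a vowel
4. Drop vowels and truncate to the prefix plus at most 3 codes, or pad with zeros if necessary

| consonant | code |
|---|---|
| b, f, p, v | 1 |
| c, g, j, k, q, s, x, z | 2 |
| d, t | 3 |
| l | 4 |
| m, n | 5 |
| r | 6 |
| h, w | dropped |

Examples (result after each step; the gap in step 2 of Ashcraftson marks the dropped h):

| step | Rupert | Robert | Ashcraftson |
|---|---|---|---|
| 1 | Rupert | Robert | Ashcraftson |
| 2 | Ru1e63 | Ro1e63 | A2 26a132o5 |
| 3 | Ru1e63 | Ro1e63 | A26a132o5 |
| 4 | R163 | R163 | A261 |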
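The four steps can be sketched in Python as follows. The coding table on this slide coincides with the one used by the classic Soundex code. The sketch follows the steps exactly as listed here, so unlike some Soundex variants it does not compare the first letter's own code with the code that follows it. The function name `phonetic_code` is hypothetical, and the input is assumed to be a non-empty name.

```python
# Consonant codes from the table on the slide.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit


def phonetic_code(name: str, length: int = 3) -> str:
    """Prefix letter plus `length` digits, following the four steps above."""
    name = name.lower()
    prefix = name[0].upper()           # step 1: first letter as prefix
    digits = []
    last = None                        # last code seen since the previous vowel
    for char in name[1:]:
        if char in "hw":
            continue                   # dropped, does not separate duplicates
        code = CODES.get(char)         # step 2: encode consonants
        if code is None:
            last = None                # a vowel separates duplicate codes
            continue
        if code != last:               # step 3: skip adjacent duplicates
            digits.append(code)
        last = code
    # step 4: vowels are already gone; truncate or pad with zeros
    return prefix + "".join(digits)[:length].ljust(length, "0")


for name in ("Rupert", "Robert", "Ashcraftson"):
    print(name, phonetic_code(name))
```

Rupert and Robert receive the same code, which is the intended effect: names that sound alike are mapped to the same key.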
![Page 182: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/182.jpg)
Other Phonetic Codes
• NYSIIS
  • Developed and still in use at the New York State Division of Criminal Justice Services
  • Encodes vowels (mostly to A)
  • Codes are letters instead of digits
  • Longer codes (6 instead of 4)
![Page 185: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/185.jpg)
Other Phonetic Codes
• Metaphone
  • Codes are letters instead of digits
  • No maximum code length
  • More elaborate coding rules
• Double Metaphone
  • Returns a secondary code to help disambiguate
![Page 186: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/186.jpg)
Detecting Duplicates
![Page 191: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/191.jpg)
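Bayes Decision Rule
• M: matched, U: unmatched
• Decide M if $p(M \mid \vec{x}) \ge p(U \mid \vec{x})$, U otherwise
• Using Bayes rule:

$$
\begin{aligned}
& p(M \mid \vec{x}) \ge p(U \mid \vec{x}) \\
\Leftrightarrow\; & \frac{p(M \wedge \vec{x})}{p(\vec{x})} \ge \frac{p(U \wedge \vec{x})}{p(\vec{x})} \\
\Leftrightarrow\; & p(M)\, p(\vec{x} \mid M) \ge p(U)\, p(\vec{x} \mid U) \\
\Leftrightarrow\; & l(\vec{x}) = \frac{p(\vec{x} \mid M)}{p(\vec{x} \mid U)} \ge \frac{p(U)}{p(M)}
\end{aligned}
$$

• Decision rule: likelihood ratio. Decide M if $l(\vec{x}) = \frac{p(\vec{x} \mid M)}{p(\vec{x} \mid U)} \ge \frac{p(U)}{p(M)}$, U otherwise
• Using independence assumption:

$$
p(\vec{x} \mid M) = \prod_i p(x_i \mid M), \qquad p(\vec{x} \mid U) = \prod_i p(x_i \mid U)
$$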
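The likelihood-ratio rule with the independence assumption can be sketched as follows. The sketch assumes binary comparison features, where each x_i only records whether field i agrees between the two records; this is a simplification not stated on the slide. All probabilities are assumed to lie strictly between 0 and 1, and the numbers in the example are made up.

```python
import math


def likelihood_ratio(x, p_agree_m, p_agree_u):
    """l(x) = p(x|M) / p(x|U), with independent binary agreement features."""
    log_ratio = 0.0
    for agrees, m, u in zip(x, p_agree_m, p_agree_u):
        p_m = m if agrees else 1.0 - m   # p(x_i | M)
        p_u = u if agrees else 1.0 - u   # p(x_i | U)
        log_ratio += math.log(p_m) - math.log(p_u)
    return math.exp(log_ratio)


def decide(x, p_agree_m, p_agree_u, prior_m):
    """Return 'M' if l(x) >= p(U) / p(M), otherwise 'U'."""
    threshold = (1.0 - prior_m) / prior_m
    return "M" if likelihood_ratio(x, p_agree_m, p_agree_u) >= threshold else "U"


# Hypothetical numbers: two fields (name, address), 10% of all pairs match.
p_agree_m = [0.9, 0.8]   # p(field agrees | M)
p_agree_u = [0.1, 0.2]   # p(field agrees | U)
print(decide([True, True], p_agree_m, p_agree_u, prior_m=0.1))   # → M
print(decide([True, False], p_agree_m, p_agree_u, prior_m=0.1))  # → U
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow when many fields are compared.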
![Page 194: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/194.jpg)
Bayes Decision Rule
• The probabilities $p(x_i \mid M)$ and $p(x_i \mid U)$ can be learned on a training set
• Other methods based on the Expectation-Maximisation (EM) algorithm can estimate them without a training set
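Learning from a training set can be as simple as counting. The sketch below assumes binary agreement features and record pairs labelled M or U; the add-one smoothing is an addition of this sketch to avoid probabilities of exactly 0 or 1, and the function name `estimate` is hypothetical. The EM-based alternative is not shown.

```python
def estimate(vectors, labels, smoothing=1.0):
    """Estimate p(x_i = agree | M) and p(x_i = agree | U) from labelled pairs."""
    n_fields = len(vectors[0])
    result = {}
    for label in ("M", "U"):
        rows = [v for v, l in zip(vectors, labels) if l == label]
        result[label] = [
            # smoothed fraction of pairs with this label whose field i agrees
            (sum(row[i] for row in rows) + smoothing) / (len(rows) + 2 * smoothing)
            for i in range(n_fields)
        ]
    return result["M"], result["U"]


# Made-up training data: each vector says whether field 0 and field 1 agree.
vectors = [(1, 1), (1, 0), (0, 0), (0, 1)]
labels = ["M", "M", "U", "U"]
p_m, p_u = estimate(vectors, labels)
print(p_m, p_u)  # → [0.75, 0.5] [0.25, 0.5]
```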
![Page 196: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/196.jpg)
Clustering-Based Decision
• Using clustering techniques with appropriate parameters
  • X-Means
    • Variant of K-Means without a fixed K
• Chaudhuri et al. observed that duplicates tend
  1. to have small distances from each other (compact set property), and
  2. to have only a small number of other neighbors within a small distance (sparse neighborhood property).
Selected paper 2: Chaudhuri S, Ganti V, Motwani R. Robust Identification of Fuzzy Duplicates. ICDE’05. 2005:865-876.
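The two properties can be illustrated with a brute-force sketch. It is a simplification and not the procedure from the paper: only pairs are considered, "compact" is approximated by mutual nearest neighbours, and the growth factor of 2 and the limit `max_growth` are assumed parameters. All function names are hypothetical.

```python
def nearest_neighbor(index, records, dist):
    """Index of the record closest to records[index]."""
    others = [j for j in range(len(records)) if j != index]
    return min(others, key=lambda j: dist(records[index], records[j]))


def neighborhood_growth(index, records, dist, factor=2.0):
    """Number of other records within factor * nearest-neighbour distance."""
    nn = nearest_neighbor(index, records, dist)
    radius = factor * dist(records[index], records[nn])
    return sum(1 for j, r in enumerate(records)
               if j != index and dist(records[index], r) <= radius)


def duplicate_pairs(records, dist, max_growth=2):
    """Mutual nearest neighbours (compact) whose neighbourhoods are sparse."""
    pairs = []
    for i in range(len(records)):
        j = nearest_neighbor(i, records, dist)
        if i < j and nearest_neighbor(j, records, dist) == i:
            growth = max(neighborhood_growth(i, records, dist),
                         neighborhood_growth(j, records, dist))
            if growth <= max_growth:
                pairs.append((i, j))
    return pairs


# Toy one-dimensional data: 1.0 and 1.1 are isolated and close to each other,
# while the values from 5.0 to 6.5 form a dense, evenly spaced group.
values = [1.0, 1.1, 5.0, 5.5, 6.0, 6.5]
print(duplicate_pairs(values, lambda a, b: abs(a - b)))  # → [(0, 1)]
```

The pair 5.0 and 5.5 is also a mutual nearest-neighbour pair, but it is rejected because too many other values lie nearby, which is what the sparse neighborhood property is meant to detect.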
![Page 197: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/197.jpg)
Dealing with O(n²)
[Figure: number of comparisons (y-axis, 0 to 1E+12) against number of entities in repository (x-axis, 0 to 1'000'000)]
![Page 200: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/200.jpg)
Canopies
[Figure: scatter of points representing entities]
• Create canopies using a cheap similarity metric
  • Overlapping clusters
• Compare entities pairwise, within each canopy only, using a more expensive similarity metric
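The two stages can be sketched as below. The use of a loose and a tight distance threshold follows the usual description of canopy clustering and is an assumption here, since the slide does not spell out how canopies are built. The function names, thresholds and toy data are hypothetical.

```python
from itertools import combinations


def canopies(items, cheap_distance, loose, tight):
    """Build overlapping canopies; requires 0 <= tight <= loose."""
    remaining = list(range(len(items)))
    result = []
    while remaining:
        center = remaining[0]
        # Every item within the loose threshold joins this canopy.
        result.append([i for i in range(len(items))
                       if cheap_distance(items[center], items[i]) <= loose])
        # Items within the tight threshold can no longer become centers.
        remaining = [i for i in remaining
                     if cheap_distance(items[center], items[i]) > tight]
    return result


def candidate_pairs(canopy_list):
    """Pairs that share a canopy; only these get the expensive comparison."""
    pairs = set()
    for canopy in canopy_list:
        pairs.update(combinations(sorted(canopy), 2))
    return pairs


values = [1, 2, 3, 4, 10, 11]
found = canopies(values, lambda a, b: abs(a - b), loose=2, tight=1)
print(found)                        # → [[0, 1, 2], [0, 1, 2, 3], [4, 5]]
print(len(candidate_pairs(found)))  # → 7
```

For these six values, all-pairs comparison would need 15 expensive comparisons, while the canopies leave 7 candidate pairs.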
![Page 201: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/201.jpg)
Pay-as-you-go Information Integration
Dataspaces
![Page 205: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/205.jpg)
Dataspaces
• Not a data integration approach per se
• Data co-existence approach
• Pay-as-you-go data integration
• Leveraging human contributions for data integration in a non-invasive manner
Selected paper 3: Halevy AY, Franklin M, Maier D. Principles of dataspace systems. In: PODS ’06. New York, NY, USA; 2006:1-9.
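The pay-as-you-go idea can be illustrated with a toy sketch. It is not taken from the Halevy et al. paper: the `Dataspace` class, its method names, and the two sample sources (`crm`, `tax`) are all hypothetical. The point is only that sources can co-exist and be searched by keyword without any upfront schema work, and that each mapping contributed later improves structured queries a little.

```python
from dataclasses import dataclass, field


@dataclass
class Dataspace:
    """Toy dataspace (hypothetical): sources co-exist, mappings arrive later."""
    sources: dict = field(default_factory=dict)   # source name -> list of records
    synonyms: dict = field(default_factory=dict)  # attribute -> canonical attribute

    def add_source(self, name, records):
        # No schema alignment is required before a source can join.
        self.sources[name] = records

    def add_mapping(self, attribute, canonical):
        # An incremental, human-contributed hint: "attribute" means "canonical".
        self.synonyms[attribute] = canonical

    def keyword_search(self, keyword):
        # Baseline service available from day one: match the keyword in any value.
        kw = keyword.lower()
        return [
            (name, rec)
            for name, records in self.sources.items()
            for rec in records
            if any(kw in str(v).lower() for v in rec.values())
        ]

    def attribute_search(self, attribute, value):
        # Structured query: spans only sources whose attributes are known
        # to map to the same canonical name.
        target = self.synonyms.get(attribute, attribute)
        return [
            (name, rec)
            for name, records in self.sources.items()
            for rec in records
            for attr, v in rec.items()
            if self.synonyms.get(attr, attr) == target
            and str(v).lower() == value.lower()
        ]


# Hypothetical sample data, invented for this sketch.
ds = Dataspace()
ds.add_source("crm", [{"surname": "Ali", "city": "Louisville"}])
ds.add_source("tax", [{"last name": "Ali", "age": 68}])

keyword_hits = ds.keyword_search("ali")           # works with no mappings at all
before = ds.attribute_search("last name", "Ali")  # only the "tax" source answers
ds.add_mapping("surname", "last name")            # one contributed mapping
after = ds.attribute_search("last name", "Ali")   # now both sources answer
```

Keyword search is deliberately the weakest service here: it needs no integration effort, while the structured query only gets better as mappings are paid for one at a time.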
![Page 210: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/210.jpg)
Relationship between Schema Matching and Deduplication

• name: Muhammad Ali
• boxer id: 1234567
• weight: 200 lb
• total fights: 61
• residence: 17, Nicestreet Louisville, KY

• first name: Mohamed
• last name: Ali
• age: 68
• address:
    street: Nicestreet 17
    city: Wondercity
• tax id: #7234561

• Are they duplicates?
• To compare field values we need schema matches
• To find schema matches we need duplicates
• etc.: the two tasks depend on each other
Selected paper 4: Zhou X, Gaugaz J, Balke W-T, Nejdl W. Query relaxation using malleable schemas. SIGMOD 2007. Beijing, China; 2007:545-556.
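The circular dependency can be made concrete with a small sketch on the two records from the slide. This is an illustration only, not the method of Zhou et al.: the function names are hypothetical, `difflib.SequenceMatcher` stands in for a real string matcher, the nested address is flattened into a single string, and `hand_matches` is a hand-written set of schema matches.

```python
from difflib import SequenceMatcher

boxer = {
    "name": "Muhammad Ali",
    "boxer id": "1234567",
    "weight": "200 lb",
    "total fights": "61",
    "residence": "17, Nicestreet Louisville, KY",
}
taxpayer = {
    "first name": "Mohamed",
    "last name": "Ali",
    "age": "68",
    "address": "Nicestreet 17 Wondercity",  # assumption: street and city flattened
    "tax id": "#7234561",
}


def similarity(a, b):
    # Character-level similarity in [0, 1]; a stand-in for a real matcher.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2, matches):
    # Deduplication step: values are compared only along known schema matches.
    # Each match pairs one attribute of r1 with one or more attributes of r2.
    if not matches:
        return 0.0  # without schema matches nothing is comparable
    scores = [
        similarity(r1[attr1], " ".join(r2[a] for a in attrs2))
        for attr1, attrs2 in matches
    ]
    return sum(scores) / len(scores)


def propose_matches(r1, r2, threshold=0.3):
    # Schema matching step: assumes r1 and r2 are duplicates and pairs each
    # attribute of r1 with the attribute of r2 whose value is most similar.
    proposals = {}
    for attr1, value1 in r1.items():
        best_attr, best_score = max(
            ((attr2, similarity(value1, value2)) for attr2, value2 in r2.items()),
            key=lambda pair: pair[1],
        )
        if best_score >= threshold:
            proposals[attr1] = (best_attr, best_score)
    return proposals


no_matches_score = record_similarity(boxer, taxpayer, [])
hand_matches = [
    ("name", ("first name", "last name")),
    ("residence", ("address",)),
]
with_matches_score = record_similarity(boxer, taxpayer, hand_matches)
proposed = propose_matches(boxer, taxpayer)  # only meaningful for a duplicate pair
```

Each function needs the other's output as its input, which is the loop the slide describes. Matches proposed from the values of a single pair can also be spurious (unrelated fields may happen to hold similar strings), so a real system would need evidence from many pairs.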
![Page 211: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.](https://reader034.fdocuments.us/reader034/viewer/2022050511/5f9ba04b1f849666992d55b6/html5/thumbnails/211.jpg)
Selected Topic Papers
1. Schema Matching
• Melnik S, Garcia-Molina H, Rahm E. Similarity flooding: a versatile graph matching algorithm and its application to schema matching. IEEE Comput. Soc; 2002:117-128.
2. Deduplication
• Chaudhuri S, Ganti V, Motwani R. Robust Identification of Fuzzy Duplicates. ICDE’05. 2005:865-876.
3. Dataspaces
• Halevy AY, Franklin M, Maier D. Principles of dataspace systems. In: PODS ’06. New York, NY, USA; 2006:1-9.
4. Interdependence between schema matching and deduplication
• Zhou X, Gaugaz J, Balke W-T, Nejdl W. Query relaxation using malleable schemas. SIGMOD 2007. Beijing, China; 2007:545-556.