12 Knowledge Mining
-
Upload
harald-sack -
Category
Education
-
view
358 -
download
0
Transcript of 12 Knowledge Mining
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Information Service Engineering Lecture 12: Knowledge Mining
Prof. Dr. Harald SackFIZ Karlsruhe - Leibniz Institute for Information InfrastructureAIFB - Karlsruhe Institute of Technology
Summer Semester 2017
This file is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0)
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Information Service EngineeringLast lecture: Information Retrieval - 2
4.1 A Brief History of Libraries and IR
4.2 Fundamental Concepts of IR
4.3 Information Retrieval Models
4.4 Retrieval Evaluation
4.5 Web Information Retrieval
4.6 Indexing
4.7 Query Processing and Ranking
2
● Information Retrieval Models● Average Precision @ rank● Mean Average Precision (MAP)● Web Crawler● Inverted Index● TF-IDF
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
● Term Frequency - Inverse Document Frequency is tf multiplied by idf
TF-IDF
Term frequency tf ● +1 to avoid negative results,● log to lower impact of
frequent words
Normalization usually via cosine similarity,But there are more weighting variants...
4. Information Retrieval / 4.7 Query Processing and Ranking
3
Invers document frequency idf
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
PageRank
● For Web Search besides the relevance of a search term for a document also the “popularity” of a document can be taken into account
● Google’s PageRank is a graph algorithm that determines the “importance” of a web page via its link graph
● PageRank basic assumptions:
○ Importance of a page is dependent on the number of incoming links, i.e. other web pages referring to the web page
○ If a web page links to another web page, the importance of a link is determined by the importance of the originating web page
○ The more outgoing links, the less important is a single outgoing link
4. Information Retrieval / 4.7 Query Processing and Ranking
4
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
PageRank4. Information Retrieval / 4.7 Query Processing and Ranking
5
damping factor
for all incoming links
importance of incoming link j
PRi
PageRank of page iPR
jPageRank of page j that is linking to page i
cj page j is linking to c
j different pages
n number of incoming links to page iN number of all pagesd damping factor d∈[0,1]
(Brin, Page, 1998)
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
PageRank4. Information Retrieval / 4.7 Query Processing and Ranking
6
1.0 1.0
1.0 1.0
DC
Iteration r(A) r(B) r(C) r(D)
1 1,0 1,0 1,0 1,0
2 1,0 0,575 2,275 0,15
3 2,083 0,575 1,1912 0,15
… … … … …
n 1,49 0,7833 1,577 0,15
BA
Initial state
1.49 0.78
1.57 0.15
DC
BA
Final stateIteration of PageRank computation
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Information Service EngineeringLast Lecture: Information Retrieval - 2
4.1 A Brief History of Libraries and IR
4.2 Fundamental Concepts of IR
4.3 Information Retrieval Models
4.4 Retrieval Evaluation
4.5 Web Information Retrieval
4.6 Indexing
4.7 Query Processing and Ranking
7
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Information Service EngineeringLecture Overview
1. Information, Natural Language and the Web
2. Natural Language Processing
3. Linked Data Engineering
4. Information Retrieval
5. Knowledge Mining
6. Exploratory Search and Recommender Systems
8
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Information Service Engineering5. Kowledge Mining
9
5.1 From Data to Knowledge
5.2 Linked Data based Information Visualization
5.3 Linked Data based Knowledge Mining
5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
collective application of knowledge in context
understanding principles(why?, what is best?, doing things right)
experience, context, value applied to a message
understanding patterns(principles: how to?)
a message meant to change the receivers perception
understanding relations(description: what?)
discrete objective facts about event
DIKW Pyramid, Ackoff 1989 [1]
Knowledge Mining
evaluated understanding
information enriched with semantics
in usable form
raw characters and symbols
futurenovelty
pastexperience
5. IKnowledge Engineering/ 5.1 From Data to Knowledge
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Data
● Data is raw.
● It simply exists and has no significance beyond its existence (in and of itself).
● It can exist in any form, usable or not.
5. Knowledge Engineering/ 5.1 From Data to Knowledge
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Information
● Information is data that has been given meaning by way of relational connection.
● This "meaning" can be useful, but does not have to be.
● Information is contained in descriptions, answers to questions that begin with such words as who, what, when, where, and how many
5. Knowledge Engineering/ 5.1 From Data to Knowledge
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Knowledge
● Knowledge is the appropriate collection of information, such that it's intent is to be useful.
● Understanding is a continuum that leads from data, through information and knowledge, and ultimately to wisdom
● Data transforms to information by convention, information to knowledge by cognition, and knowledge to wisdom by contemplation
5. Knowledge Engineering/ 5.1 From Data to Knowledge
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Knowledge5. Knowledge Engineering/ 5.1 From Data to Knowledge
● How do we transform data into knowledge? 1. Collect Data 2. Organize Data 3. Analyze Data
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Information Service Engineering5. Kowledge Mining
15
5.1 From Data to Knowledge
5.2 Linked Data based Information Visualization
5.3 Linked Data based Knowledge Mining
5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology16
Charles Joseph Minard (1781-1870) - Napoleon’s Russian Campaign
http://patrimoine.enpc.fr/document/ENPC01_Fol_10975?image=54#bibnum
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology17
https://en.wikipedia.org/wiki/Charles_Joseph_Minard#/media/File:Minard_map_of_napoleon.png
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology18
A Picture is worth a 1000 words...
● Pictures have been used to convey information long before the development of writing
● A single picture can be processed (“understood”) much faster than a (linear) text page
● Human perception is processing in parallel, text analysis is limited by the sequential process of reading
https://commons.wikimedia.org/wiki/File%3AA_picture_is_worth_a_thousand_words.jpg
5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology19
Information Visualization
● Information Visualization is the study of (interactive) visual representations of abstract data to reinforce human cognition
● Information graphics or infographics are graphic visual representations of information, data or knowledge intended to present information quickly and clearly
○ a static form of information visualization
○ aims to emphasize specific findings gained from the visualized data
○ Mandatory precondition: Data Analysis
5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
SPARQL
e.g. via GoogleDoc
Hands On Data Visualization Example
● Dataset: DBpedia
● Task: Draw a map chart which indicates the number of soccer players per country
5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Task: Draw a Map Chart...
● Look for a representative (prominent) example
5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Task: Draw a Map Chart...
● Dataset: DBpedia
○ Examine a representative example from DBpedia, e.g. http://dbpedia.org/page/Lionel_Messi
5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Task: Draw a Map Chart...
● SPARQL Query:
○ Country and Number of soccerplayers per country
■ ?s rdf:type dbo:SoccerPlayer .
■ ?s dbo:birthPlace ?birthplace .
■ ?birthplace dbo:country ?country .
■ ?country rdfs:label ?country_name
■ FILTER (lang(?countryLabel)=”en”)
■ GROUP BY ?countryLabel
■ COUNT(DISTINCT ?s)
5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Task: Draw a Map Chart...
● SPARQL Query:
DBpedia SPARQL query
5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Task: Draw a Map Chart...
● SPARQL Result:
○ Save as CSV
5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Task: Draw a Map Chart...
● Import the csv data into a spreadsheet, as e.g. in GoogleDocs
5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Task: Draw a Map Chart...
● Display spreadsheet data in some diagram
5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Task: Draw a Map Chart...
● Important step: Data Cleansing
○ Is the displayed data correct?○ Outlier detection
○ In our example: are all displayed countries really existing countries?
○ Potential Solutions:■ Adapt your original query (new SPARQL query)■ Remove evidently wrong data
● manually or procedural
5. Knowledge Engineering/ 5.2 Data Mining
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Task: Draw a Map Chart...
● Redraw after Data Cleansing
5. Knowledge Engineering/ 5.2 Linked Data based Information Visualization
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Information Service Engineering5. Kowledge Mining
30
5.1 From Data to Knowledge
5.2 Linked Data based Information Visualization
5.3 Linked Data based Knowledge Mining
5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Knowledge Mining and Knowledge Discovery5. Knowledge Engineering/ 5.3 Linked Data Based Knowledge Mining
Knowledge Discovery [in Databases] (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in (massive) data sources.(Fayyad et al, 1996 [2])
● valid: to a certain degree the discovered patterns should also hold for new, previously unseen problem instances.
● novel: at least to the system and preferable to the user.
● potentially useful: they should lead to some benefit to the user or task.
● ultimately understandable: the end user should be able to interpret the patterns either immediately or after some postprocessing.
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Knowledge Mining and Knowledge Discovery5. Knowledge Engineering/ 5.3 Linked Data Based Knowledge Mining
Knowledge Discovery [in Databases] (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in (massive) data sources.(Fayyad et al, 1996 [2])
● Goals:
○ Descriptive Modelling: explains the characteristic and the behaviour of the observed data
○ Predictive Modelling: predicts the behaviour of new data based on some model
● Important:○ The extracted model/pattern does not have to apply in 100% of the cases
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
The Knowledge Discovery Process5. Knowledge Engineering/ 5.3 Linked Data Based Knowledge Mining
Data
Selection:Select a relevant dataset or focus on a subset of a dataset
Target Data
Preprocessing/Cleaning:Data integration from different sources, Data Cleaning
Preprocessed Data
Transformation:Select useful features, feature transformation, dimensionality reduction
Transformed Data
Data Mining:Search for patterns of interest
Patterns
Evaluation:Evaluate patterns based on interestingness measures, model validation
Knowledge
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Data Cleaning5. Knowledge Engineering/ 5.3 Linked Data Based Knowledge Mining
● “Dirty” Data:○ Dummy values, absence of data, contradicting data, etc.
● Steps in Data Cleaning○ Parsing: locates and identifies individual data elements in raw data
○ Correcting: corrects parsed individual data components using sophisticated data algorithms
○ Normalization: applies conversion routines to transform data into standard formats
○ Matching: searching and matching records within and across data based on predefined rules
○ Consolidating: merges data into one representation
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Knowledge Mining Functionality5. Knowledge Engineering/ 5.3 Linked Data Based Knowledge Mining
● Characterization: summarizing general features of objects in a target class (concept description)
● Discrimination: comparing general features of objects between a target class and a contrasting class (concept comparison)
● Association: studying the frequency of items occurring together
● Prediction: predicting some unknown or missing attribute values
● Classification: organizing data in given classes based on attribute values (supervised)
● Clustering: organizing data in classes based on attribute values (unsupervised)
● Outlier analysis: identifying and explaining exceptions (surprises)
● Time-series analysis: analyzing trends and deviations
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology36
Data Analysis in the Knowledge Mining Process
● Data Analysis is a fundamental iterative process:1. Formulation and execution of a query2. Analysis of the results 3. Formulation of a consecutive query based on the achieved results
● Goals of Data Analysis:○ maximize understanding of analyzed data○ uncover hidden structures/patterns○ extraction of important variables○ detection of anomalies and outliers○ testing of hypotheses○ development of a simple model
5. Knowledge Engineering/ 5.3 Linked Data Based Knowledge Mining
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Information Service Engineering5. Kowledge Mining
37
5.1 From Data to Knowledge
5.2 Linked Data based Information Visualization
5.3 Linked Data based Knowledge Mining
5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology38
HandsOn: Linked Data Analytics
● Available Toolset:○ Data 1: DBpedia SPARQL endpoint○ Data 2: Wikidata SPARQL endpoint
● Query language: ○ SPARQL
● simple statistics and visualization: ○ R
https://www.r-project.org/
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology39
HandsOn: Linked Data Analytics5. Knowledge Engineering/ 5.4 Linked Data Analytics
https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology40
Linked Data Analytics
● First example: Analyse the number of achieved goals of soccer players with DBpedia
● Solution1. Create appropriate SPARQL query2. Save result as csv file3. Analyze data with R
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology41
Linked Data Analytics
1. Create appropriate SPARQL query
SPARQL Query
SELECT ?goals WHERE { ?s rdf:type dbo:SoccerPlayer ; dbp:totalgoals ?goals FILTER (DATATYPE(?goals)=xsd:integer) .}
2. Save the result as csv file
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology42
Linked Data Analytics
● First example: Analyse the number of achieved goals of soccer players with DBpedia
3. Analyse data with R○ read data:
goals <- read.csv("dbpedia-goals", header=TRUE) ○ summarize data:
summary(goals)
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology43
Linked Data Analytics
First example: Analyse the number of achieved goals of soccer players 3. Analyse data with R
○ analyze data via boxplot:boxplot(goals)
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology44
Linked Data Analytics
First example: Analyse the number of achieved goals of soccer players 3. Analyse data with R
○ analyze data via boxplot:boxplot(goals)
○ adapt ranges for visibility:boxplot(goals,ylim=c(-90,200))
○
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology45
Linked Data Analytics
First example: Analyse the number of achieved goals of soccer players 3. Analyse data with R
○ analyze data via boxplot:boxplot(goals)
○ adapt ranges for visibility:boxplot(goals,ylim=c(-90,200))
○
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology46
Linked Data Analytics
First example: Analyse the number of achieved goals of soccer players 3. Analyse data with R
○ analyze data via boxplot:boxplot(goals)
○ adapt ranges for visibility:boxplot(goals,ylim=c(-90,200))
○
InterQuartile Range (IQR)
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology47
Linked Data Analytics
First example: Analyse the number of achieved goals of soccer players 3. Analyse data with R
○ analyze data via boxplot:boxplot(goals)
○ adapt ranges for visibility:boxplot(goals,ylim=c(-90,200))
○
IQR
Whiskers: IQR x 1.5 = (Q3-Q
1) x 1.5
Q3+ 1.5 IQR
Q1- 1.5 IQR
Outliers
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology48
Linked Data Analytics
First example: Analyse the number of achieved goals of soccer players 3. Analyse data with R
○ Data Cleaning■ remove negative values■ examine outliers (SPARQL query)
○ New data analysis
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology49
Linked Data Analytics
Second example: ● Is there a relationship between the number
of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)
https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology50
Linked Data Analytics
1. Create appropriate SPARQL query
SPARQL Query
SELECT SAMPLE(?goals) as ?goals SAMPLE(xsd:integer(?height)) as ?height SAMPLE(xsd:date(?bday)) as ?bday WHERE { ?s rdf:type dbo:SoccerPlayer ; <http://dbpedia.org/ontology/Person/height> ?height ; dbo:birthDate ?bday ; dbp:totalgoals ?goals FILTER (DATATYPE(?goals)=xsd:integer).} GROUP BY ?s
2. Save the result as csv file
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology51
Linked Data Analytics
Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)
3. Analyse data with R
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology52
Linked Data Analytics
Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)
3. Analyse data with R
> boxplot(goals2$height)
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology53
Linked Data Analytics
Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)
3. Analyse data with R
> boxplot(goals2$bday)
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology54
Linked Data Analytics
Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)
3. Analyse data with R○ How are the values distributed?○ > hist(goals2$goals)
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology55
Linked Data Analytics
Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)
3. Analyse data with R○ How are the values distributed?○ > hist(goals2$goals)○ > hist(goals2$height)
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology56
Linked Data Analytics
Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)
3. Analyse data with R○ How are the values distributed?○ > hist(goals2$goals)○ > hist(goals2$height)○ > hist(goals2$bday)
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology57
Linked Data Analytics
Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)
3. Analyse data with R○ Data Cleaning and 2D plot
1. height vs. goals
> plot(goals2$height,goals2$goals)
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology58
Linked Data Analytics
Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)
3. Analyse data with R○ Data Cleaning and 2D plot
2. birthday vs. goals
> plot(goals2$bday,goals2$goals)
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology59
Linked Data Analytics
Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)
3. Analyse data with R○ Data Cleaning and 2D plot
3. birthday vs. height
> plot(goals2$bday,goals2$height)
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology60
Linked Data Analytics
● But, is there really a relationship between the number of achieved goals, the birthdate, or the height of a soccer player?
● Statistics to the Rescue: ○ Correlation Coefficient
determines correlation and dependence of two or more variables:
○ Applied to out problem in R: ■ cor(goals2$height,goals2$goals)= -0.04350283
■ cor(goals2$bday,goals2$goals)= -0.1098645
■ cor(goals2$bday,goals2$height)= 0.2881508
Covariance of X and Y
Standard deviations of X and Y
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology61
Linked Data Analytics
● Let’s have a closer look at the distribution of heights
looks like a Normal distribution
might probably be an anomaly
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology62
Linked Data Analytics
● Let’s have a closer look at the distribution of heights○ Use SPARQL for further analysis
SPARQL query
might probably be an anomaly
select ?s ?height WHERE {?s rdf:type dbo:SoccerPlayer ; dbo:height ?height .}ORDER BY ?height
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology63
Linked Data Analytics
● Let’s have a closer look at the distribution of heights○ Use SPARQL for further analysis ○ For DBpedia, you always have to consider
the extraction process from the originalWikipedia infoboxes...
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology64
Linked Data Analytics
Third example: ● Use another data source:
● Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in Wikidata)
https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology65
● WIKIDATA follows a different data schema
qualifiers
https://www.wikidata.org/wiki/Q39444
statement
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology66
Linked Data Analytics
● WIKIDATA uses a different data schema
https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg https://www.wikidata.org/wiki/Q39444
Access via different namespaces for properties:
● wdt: connects an item to a valuewd:Q39444 wdt:P54 ?team .
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Object=
subject/context for statement
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology67
Linked Data Analytics
● WIKIDATA uses a different data schema
https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg https://www.wikidata.org/wiki/Q39444
statementAccess via different namespaces for properties:
● wdt: connects an item to a valuewd:Q39444 wdt:P54 ?team .
● p: connects a subject to a statementwd:Q39444 p:P54 ?team_statement .
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology68
Linked Data Analytics
● WIKIDATA uses a different data schema
https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg https://www.wikidata.org/wiki/Q39444
property and object/value of
statement
Access via different namespaces for properties:
● wdt: connects an item to a valuewd:Q39444 wdt:P54 ?team .
● p: connects a subject to a statementwd:Q39444 p:P54 ?team_statement .
● pq: connects statement to qualifier value?team_statement pq:1351 ?statement_value
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology69
Linked Data Analytics
https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg https://www.wikidata.org/wiki/Q39444
pq:P1351 ?goals
Access via different namespaces for properties:
● pq: connects statement to qualifier value
wd:Q39444 p:P54 ?team_statement .?team_statement pq:P1351 ?goals .
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology70
Linked Data Analytics
1. Create appropriate SPARQL query
SPARQL Query
SELECT (SUM(?goals) as ?total_goals) (SAMPLE(?height) as ?height) (SAMPLE(xsd:date(?birthdate)) as ?bday) WHERE { ?s wdt:P106 wd:Q937857 ; p:P54 ?team_statement ; wdt:P2048 ?height ; wdt:P569 ?birthdate . ?team_statement pq:P1351 ?goals .} GROUP BY ?s
2. Save the result as csv file
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology71
Linked Data Analytics
Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in WIKIDATA)
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology72
Linked Data Analytics
Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in WIKIDATA)
● height vs. goals> plot(goals3$height,goals3$goals)
> cor(goals3$height,goals3$goals) [1] -0.046644459
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology73
Linked Data Analytics
Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in WIKIDATA)
● Distribution of birthdays> hist(goals3$bday)
5. Knowledge Engineering/ 5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology74
Linked Data Analytics
Are there any relationships between the number of goals and other properties?
5. Knowledge Engineering/ 5.4 Linked Data Analytics
SPARQL query
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Information Service Engineering5. Kowledge Mining
75
5.1 From Data to Knowledge
5.2 Linked Data based Information Visualization
5.3 Linked Data based Knowledge Mining
5.4 Linked Data Analytics
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
5. Knowledge Mining Bibliography
[1] Ackoff, R. L. (1989). From data to wisdom. Journal of Applied Systems Analysis 15: 3-9
[2] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. (1996). From data mining
to knowledge discovery: an overview. In Advances in knowledge discovery and data mining,
American Association for Artificial Intelligence, Menlo Park, CA, USA 1-34.
[3] The R Project for Statistical Computing, https://www.r-project.org/
76
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
5. Knowledge Mining Syllabus Questions
● Explain the concept of PageRank. What are its basic assumptions?
● What’s the difference: Data, Information, Knowledge, and Wisdom?
● What is Knowledge Discovery?
● What are the goals of Knowledge Discovery?
● Explain the process of Knowledge Discovery
● Why do we need a “Data Cleaning” step in Knowledge Mining?
77