Systematic Development of Data Mining-Based Data Quality Tools


Transcript of Systematic Development of Data Mining-Based Data Quality Tools

Page 1: Systematic Development of Data Mining-Based Data Quality Tools

Introduction | Test Data Generator | Data Auditing Tool | Evaluation | Literature

Systematic Development of Data Mining-Based Data Quality Tools

Dominik Luebbers, Udo Grimmer, Matthias Jarke

Seminar Data Mining, Prof. Dr. Thomas Hofmann

Steffen Hartmann, Xu Jia

12 Jul 2005

1 / 32

Page 2: Systematic Development of Data Mining-Based Data Quality Tools

Overview

1 Introduction

2 Test Data Generator

3 Data Auditing Tool

4 Evaluation

5 Literature

Page 3: Systematic Development of Data Mining-Based Data Quality Tools

Introduction

What is this about?

Page 4: Systematic Development of Data Mining-Based Data Quality Tools

Introduction

Motivation

41% of data warehousing projects have failed!
Reason: poor data quality (garbage in, garbage out).
Manual inspection is nearly impossible.
Reason: the data were collected over long periods, across several generations of database technology.
Solution: (semi-)automatic data auditing tools.

Page 5: Systematic Development of Data Mining-Based Data Quality Tools

Data Quality

What is data quality?

Data quality is goal-oriented ⇒ there is no formal definition. The literature speaks of fitness for use or meeting end-user expectations.

Quality dimensions
accuracy or correctness
completeness
consistency
actuality
relevance

Page 7: Systematic Development of Data Mining-Based Data Quality Tools

Data Auditing

What is data auditing?

The application of data mining algorithms for measuring and (possibly interactively) improving data quality.
Important: the data mining algorithm must be suited to the application domain.

Idea

Data mining algorithms search for regularities in the data,
e.g. Price > 100 Euro ⇒ ShippingCost = 0.
Deviations from these regularities are treated as errors.

Subtasks

Structure induction
Deviation detection
The two subtasks can be executed asynchronously. What is the advantage?
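
The deviation-detection idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names `price` and `shipping_cost`, the sample records, and the helper `find_deviations` are hypothetical. The rule is hard-coded here, whereas an auditing tool would first induce it from the data (structure induction) and only then check it (deviation detection).

```python
# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"id": 1, "price": 150.0, "shipping_cost": 0.0},
    {"id": 2, "price": 40.0, "shipping_cost": 4.9},
    {"id": 3, "price": 220.0, "shipping_cost": 4.9},  # violates the rule
]


def rule_premise(record):
    """Premise of the example rule: Price > 100 Euro."""
    return record["price"] > 100


def rule_conclusion(record):
    """Conclusion of the example rule: ShippingCost = 0."""
    return record["shipping_cost"] == 0


def find_deviations(records):
    """Return the ids of records where the premise holds but the conclusion fails."""
    return [r["id"] for r in records if rule_premise(r) and not rule_conclusion(r)]


deviations = find_deviations(records)
print(deviations)
```

Record 2 is not reported: its premise is false, so the rule says nothing about it.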

Page 10: Systematic Development of Data Mining-Based Data Quality Tools

Data Auditing Tool Development Process

Page 11: Systematic Development of Data Mining-Based Data Quality Tools

Test Environment

Why a test environment?

Generate data that simulate the characteristics of the database.
Pollute the data ⇒ comparing the clean and the polluted test data allows the data auditing tool to be evaluated.

Page 12: Systematic Development of Data Mining-Based Data Quality Tools

Generating Test Data

Rule-pattern-based data generation process

1 Determine the database schema (number and types of the attributes)
2 Generate the TDG rule set
3 Generate the data records
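
The three steps can be sketched as follows. This is a minimal illustration, not the generator described in the talk: the schema, the two rules, and the function names are hypothetical, and records are produced by simple rejection sampling (draw a random record, keep it only if it obeys every rule), which is a stand-in technique chosen for brevity.

```python
import random

# Step 1: a hypothetical schema: attribute name -> nominal domain.
SCHEMA = {
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2"],
    "C": ["c1", "c2"],
}

# Step 2: a hypothetical rule set; each rule is a (premise, conclusion) pair.
RULES = [
    (lambda r: r["A"] == "a1", lambda r: r["B"] == "b1"),
    (lambda r: r["B"] == "b2", lambda r: r["C"] != "c1"),
]


def satisfies(record, rules):
    """A record obeys a rule if its premise is false or its conclusion is true."""
    return all((not premise(record)) or conclusion(record)
               for premise, conclusion in rules)


def generate_records(n, schema, rules, seed=0):
    """Step 3: draw random records and keep only those that obey every rule."""
    rng = random.Random(seed)
    records = []
    while len(records) < n:
        candidate = {attr: rng.choice(domain) for attr, domain in schema.items()}
        if satisfies(candidate, rules):
            records.append(candidate)
    return records


clean_data = generate_records(100, SCHEMA, RULES)
```

Rejection sampling only terminates quickly when a reasonable share of random records satisfies the rule set, which is one reason to keep the rule set satisfiable and free of contradictions.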

Page 13: Systematic Development of Data Mining-Based Data Quality Tools

Inductive Definition of Rule Patterns

Definition 1 (atomic TDG-formulae)

Let A and B be numerical or nominal attributes and let a1 be a numerical or nominal domain value. Furthermore, let N and M be numerical attributes and let n be a numerical domain value. Then

A = a1, A ≠ a1, N < n, N > n, A isnull, A isnotnull (propositional)
A = B, A ≠ B, N < M, N > M (relational)

are called atomic TDG-formulae.

Definition 2 (TDG-formulae)

Each atomic TDG-formula is a TDG-formula.
Let n ∈ ℕ and let α1, ..., αn be TDG-formulae. Then α1 ∨ ... ∨ αn and α1 ∧ ... ∧ αn are TDG-formulae.

Definition 3 (TDG-rule)

Let α and β be TDG-formulae. Then α → β is a TDG-rule.
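
Definitions 1 to 3 map naturally onto a small recursive data structure. The sketch below is one possible encoding, not the one used by the authors: the class names `Atom`, `And`, `Or`, `Rule` and the functions `holds` and `rule_holds` are hypothetical, and treating any comparison with a null value as false is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Tuple, Union


@dataclass(frozen=True)
class Atom:
    """Atomic TDG-formula `left op right`.

    `right` is a domain value (propositional) or, if `relational` is set,
    the name of a second attribute.
    """
    left: str
    op: str  # one of "=", "!=", "<", ">", "isnull", "isnotnull"
    right: Any = None
    relational: bool = False


@dataclass(frozen=True)
class And:
    parts: Tuple["Formula", ...]


@dataclass(frozen=True)
class Or:
    parts: Tuple["Formula", ...]


Formula = Union[Atom, And, Or]


@dataclass(frozen=True)
class Rule:
    premise: Formula
    conclusion: Formula


def holds(f, record):
    """Evaluate a TDG-formula against a record given as a dict."""
    if isinstance(f, And):
        return all(holds(p, record) for p in f.parts)
    if isinstance(f, Or):
        return any(holds(p, record) for p in f.parts)
    value = record.get(f.left)
    if f.op == "isnull":
        return value is None
    if f.op == "isnotnull":
        return value is not None
    other = record.get(f.right) if f.relational else f.right
    if value is None or other is None:
        return False  # assumption: comparisons involving null are false
    if f.op == "=":
        return value == other
    if f.op == "!=":
        return value != other
    if f.op == "<":
        return value < other
    if f.op == ">":
        return value > other
    raise ValueError(f"unknown operator: {f.op}")


def rule_holds(rule, record):
    """A record obeys α → β if α is false or β is true."""
    return (not holds(rule.premise, record)) or holds(rule.conclusion, record)


# Example rule: A = a1 ∧ N > 10 → N < M
example_rule = Rule(
    And((Atom("A", "=", "a1"), Atom("N", ">", 10))),
    Atom("N", "<", "M", relational=True),
)
```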

Page 16: Systematic Development of Data Mining-Based Data Quality Tools

Inductive Definition of Rule Patterns

Meaningless rules

A = Val1 → A = Val2
A = Val1 ∧ A = Val2 → B = Val1
A = Val1 → A ≠ Val2

⇒ Such rules should be avoided.
⇒ Natural TDG-formulae and -rules

Definition 4 (Natural TDG-formulae)

Let α be a TDG-formula. α is a natural TDG-formula iff one of the following holds:

α is an atomic TDG-formula and α is satisfiable.
α = α1 ∧ α2 ∧ ... ∧ αn and the following holds:

∀i: αi is a natural TDG-formula,
α is satisfiable, and
$\forall i: \alpha_i \nLeftarrow \bigwedge_{j,\, i \neq j} \alpha_j$ (no αi is implied by the conjunction of the other conjuncts).

The case of disjunction is analogous.

Page 18: Systematic Development of Data Mining-Based Data Quality Tools

Inductive Definition of Rule Patterns

Definition 5 (Natural TDG-rule)

A TDG-rule α → β is called a natural TDG-rule iff
α and β are natural TDG-formulae,
α ∧ β is satisfiable, and
α ⇏ β.

Contradiction and redundancy

A = Val1 → B = Val1
A = Val1 → B = Val2
A = Val1 ∧ B = Val2 → C = Val1
A = Val1 → C = Val1

⇒ A rule R = α → β should be added to a given rule set ℛ only if:
ℛ ⊭ R
ℛ ∪ {α} is satisfiable

Page 20: Systematic Development of Data Mining-Based Data Quality Tools

Inductive Definition of Rule Patterns

Definition 6 (Natural rule set)

Let ℛ = {α1 → β1, α2 → β2, ..., αn → βn} be a set of natural TDG-rules αi → βi.
ℛ is called a natural rule set iff for any two different rules αi → βi and αj → βj with αj ⇒ αi the following holds:
αj ∧ βi ∧ βj is satisfiable, and
(αj ∧ βi) ⇏ βj.

Idea: Satisfiability Test for TDG-formulae

Transform the TDG-formula α into disjunctive normal form.
α is satisfiable iff at least one of its disjuncts is satisfiable.
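
The satisfiability idea can be sketched for a restricted fragment. The code below is an illustration under explicit assumptions: formulae are nested tuples of the hypothetical form `("atom", attribute, op, value)`, `("and", ...)`, `("or", ...)`; only the propositional atoms `=` and `!=` are handled; and every attribute domain is assumed to be large enough that a set of `!=` constraints alone never exhausts it. Numerical and relational atoms would need additional checks.

```python
from itertools import product


def to_dnf(f):
    """Return the formula as a list of conjunctions, each a list of atoms."""
    kind = f[0]
    if kind == "atom":
        return [[f]]
    parts = [to_dnf(p) for p in f[1:]]
    if kind == "or":
        # A disjunction simply collects the disjuncts of all its parts.
        return [conj for part in parts for conj in part]
    # "and": pick one disjunct from every part and merge them.
    return [[atom for conj in combo for atom in conj] for combo in product(*parts)]


def conjunction_satisfiable(atoms):
    """Check a conjunction of `=` / `!=` atoms for contradictions."""
    equal = {}    # attribute -> the single value it must take
    unequal = {}  # attribute -> values it must not take
    for _, attr, op, value in atoms:
        if op == "=":
            if equal.setdefault(attr, value) != value:
                return False  # A = v1 and A = v2 with v1 != v2
        else:
            unequal.setdefault(attr, set()).add(value)
    return all(equal[a] not in unequal.get(a, ()) for a in equal)


def satisfiable(f):
    """A formula is satisfiable iff at least one DNF disjunct is."""
    return any(conjunction_satisfiable(conj) for conj in to_dnf(f))


a_is_1 = ("atom", "A", "=", "Val1")
a_is_2 = ("atom", "A", "=", "Val2")
b_is_1 = ("atom", "B", "=", "Val1")
print(satisfiable(("and", a_is_1, a_is_2)))                 # contradictory premise
print(satisfiable(("or", ("and", a_is_1, a_is_2), b_is_1)))
```

Note that the DNF can grow exponentially in the size of the formula, so this approach is practical only for the small formulae that make up generated rules.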

Page 22: Systematic Development of Data Mining-Based Data Quality Tools

Data Corruption

Different variants of data pollution

Wrong-value polluter
Null-value polluter
Limiter
Switcher
Duplicator
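
Three of these polluters can be sketched from their names alone. The behaviour below is an assumption, since the slide does not define the polluters: the function names, the `rate` parameter, and the returned set of changed indices are hypothetical. The limiter and the switcher are left out because their semantics are not given here.

```python
import copy
import random


def null_value_polluter(records, attribute, rate, rng):
    """Set `attribute` to None in roughly a fraction `rate` of the records."""
    polluted, changed = copy.deepcopy(records), set()
    for i, rec in enumerate(polluted):
        if rng.random() < rate:
            rec[attribute] = None
            changed.add(i)
    return polluted, changed


def wrong_value_polluter(records, attribute, domain, rate, rng):
    """Replace the value of `attribute` by a different value from `domain`."""
    polluted, changed = copy.deepcopy(records), set()
    for i, rec in enumerate(polluted):
        alternatives = [v for v in domain if v != rec[attribute]]
        if alternatives and rng.random() < rate:
            rec[attribute] = rng.choice(alternatives)
            changed.add(i)
    return polluted, changed


def duplicator(records, rate, rng):
    """Append copies of roughly a fraction `rate` of the records."""
    copies = [copy.deepcopy(r) for r in records if rng.random() < rate]
    return copy.deepcopy(records) + copies


rng = random.Random(42)
clean = [{"A": "a1", "B": "b1"}, {"A": "a2", "B": "b2"}, {"A": "a3", "B": "b1"}]
polluted, changed = wrong_value_polluter(clean, "A", ["a1", "a2", "a3"], 0.5, rng)
```

Returning the indices of the changed records keeps the ground truth that is needed later to compute sensitivity and specificity of the auditing tool.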

Page 23: Systematic Development of Data Mining-Based Data Quality Tools

Error detection by a data auditing tool

Specificity and sensitivity

Specificity (true negative rate) := TN / (TN + FP), e.g. in medical diagnosis, the probability that a test correctly reports a symptom as NOT present.
Sensitivity (true positive rate) := TP / (TP + FN), e.g. the probability that a test correctly reports a symptom as present.
Both values = 1 ⇒ perfect data auditing tool.
False negative: e.g. a sick person is diagnosed as not sick.
False positive: e.g. a healthy person is diagnosed as sick.
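
Applied to data auditing, a "positive" is a record that the tool flags as erroneous. The following sketch computes both measures from the set of records that were actually polluted and the set that the tool flagged; the function name and the sample ids are hypothetical.

```python
def specificity_sensitivity(all_ids, erroneous_ids, flagged_ids):
    """Return (specificity, sensitivity) of an error detector."""
    all_ids = set(all_ids)
    erroneous_ids = set(erroneous_ids)
    flagged_ids = set(flagged_ids)
    tp = len(erroneous_ids & flagged_ids)            # errors that were flagged
    fn = len(erroneous_ids - flagged_ids)            # errors that were missed
    fp = len(flagged_ids - erroneous_ids)            # clean records flagged
    tn = len(all_ids - erroneous_ids - flagged_ids)  # clean records left alone
    return tn / (tn + fp), tp / (tp + fn)


# Ten records, four of them polluted; the tool flags three of those plus one clean record.
spec, sens = specificity_sensitivity(range(10), {1, 2, 3, 4}, {1, 2, 3, 9})
print(spec, sens)
```

Here TP = 3, FN = 1, FP = 1 and TN = 5, so the specificity is 5/6 and the sensitivity is 3/4.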

Page 24: Systematic Development of Data Mining-Based Data Quality Tools

A Small Example

Terms

Class attribute
Base attributes
Training set
Test set

Page 25: Systematic Development of Data Mining-Based Data Quality Tools

Entropy

$\mathrm{Entropy}(S) = -\sum_{I} p(I) \log_2 p(I)$

$\mathrm{Entropy}(S) = -\frac{9}{14} \log_2\left(\frac{9}{14}\right) - \frac{5}{14} \log_2\left(\frac{5}{14}\right) = 0.940$

When is Entropy = 0? When is Entropy = 1?
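
The entropy formula is easy to check numerically. The sketch below takes the class counts of a set S (here 9 records of one class and 5 of the other) and evaluates the sum; the function name `entropy` is hypothetical.

```python
from math import log2


def entropy(class_counts):
    """Entropy of a set, given the number of records in each class."""
    total = sum(class_counts)
    # Empty classes contribute nothing (p * log2(p) tends to 0 as p tends to 0).
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)


print(round(entropy([9, 5]), 3))  # the example from the slide
print(entropy([14, 0]))           # all records in one class
print(entropy([7, 7]))            # two classes of equal size
```

This also answers the two questions above: the entropy is 0 when all records belong to the same class, and with two classes it reaches 1 when both classes are equally frequent.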

Page 29: Systematic Development of Data Mining-Based Data Quality Tools

Entropy and Gain

$\mathrm{Entropy}(Outlook, S) = \frac{5}{14} \mathrm{Entropy}(S_{sunny}) + \frac{5}{14} \mathrm{Entropy}(S_{rain}) + \frac{4}{14} \mathrm{Entropy}(S_{overcast}) = 0.694$

$\mathrm{Entropy}(S_{sunny}) = -\frac{2}{5} \log_2\left(\frac{2}{5}\right) - \frac{3}{5} \log_2\left(\frac{3}{5}\right)$

$\mathrm{Gain}(Outlook, S) = \mathrm{Entropy}(S) - \mathrm{Entropy}(Outlook, S)$
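
The gain computation can be reproduced from class counts per attribute value. In the sketch below the counts for sunny (2 yes, 3 no) follow from the formula above; the counts for rain (3 yes, 2 no) and overcast (4 yes, 0 no) are an assumption taken from the commonly used play-ball example, chosen because they reproduce the stated value 0.694. The function names are hypothetical.

```python
from math import log2


def entropy(class_counts):
    """Entropy of a set, given the number of records in each class."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)


def gain(partition_counts):
    """Return (gain, weighted entropy after the split).

    `partition_counts` maps each attribute value to its class counts.
    """
    class_totals = [sum(column) for column in zip(*partition_counts.values())]
    n = sum(class_totals)
    after_split = sum(sum(counts) / n * entropy(counts)
                      for counts in partition_counts.values())
    return entropy(class_totals) - after_split, after_split


# Class counts [yes, no] per Outlook value (rain and overcast are assumed, see above).
OUTLOOK = {"sunny": [2, 3], "rain": [3, 2], "overcast": [4, 0]}

outlook_gain, outlook_entropy = gain(OUTLOOK)
print(round(outlook_entropy, 3), round(outlook_gain, 3))
```

Computed without intermediate rounding, the gain is about 0.2467; the value 0.246 results from subtracting the rounded figures 0.940 and 0.694.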

Page 33: Systematic Development of Data Mining-Based Data Quality Tools

The Basic Algorithm - ID3

Choosing the attribute

Gain(Outlook, S) = 0.246, Gain(Temperature, S) = 0.029
Gain(Humidity, S) = 0.151, Gain(Wind, S) = 0.048

⇒ Choose the attribute with the largest gain as the root of the decision tree.
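
The selection step itself is a single maximum over the gains. A minimal sketch using the four values above (the dictionary `gains` is just an illustrative container):

```python
# Information gain of each candidate attribute, as listed above.
gains = {"Outlook": 0.246, "Temperature": 0.029, "Humidity": 0.151, "Wind": 0.048}

# ID3 picks the attribute with the largest gain as the next split node.
root = max(gains, key=gains.get)
print(root)
```

ID3 then repeats the same selection recursively on each subset of records produced by the split, using only the attributes that have not been used on the path so far.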

Page 34: Systematic Development of Data Mining-Based Data Quality Tools

The Basic Algorithm - ID3

Page 35: Systematic Development of Data Mining-Based Data Quality Tools

The Basic Algorithm - ID3

Decision tree ⇒ Rules

outlook = sunny ∧ humidity = high → playball = no
outlook = sunny ∧ humidity = normal → playball = yes
outlook = overcast → playball = yes
outlook = rain ∧ wind = true → playball = no
outlook = rain ∧ wind = false → playball = yes
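
The five rules can be stored as data and used both for prediction and for auditing. The sketch below is an illustration: the representation of a rule as a premise dictionary plus a class value, and the helpers `classify` and `is_suspicious`, are hypothetical.

```python
# Each rule: (premise as attribute -> required value, predicted playball value).
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "sunny", "humidity": "normal"}, "yes"),
    ({"outlook": "overcast"}, "yes"),
    ({"outlook": "rain", "wind": True}, "no"),
    ({"outlook": "rain", "wind": False}, "yes"),
]


def classify(record):
    """Return the playball value predicted by the first matching rule, or None."""
    for premise, playball in RULES:
        if all(record.get(attr) == value for attr, value in premise.items()):
            return playball
    return None


def is_suspicious(record):
    """A record deviates if its stored class value contradicts the prediction."""
    predicted = classify(record)
    return predicted is not None and record.get("playball") != predicted


record = {"outlook": "rain", "humidity": "high", "wind": True, "playball": "yes"}
print(classify(record), is_suspicious(record))
```

Because the rules come from the paths of one decision tree, at most one premise matches any record, so the order of the rules does not matter.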

Page 36: Systematic Development of Data Mining-Based Data Quality Tools

Problem with Information Gain

Page 37: Systematic Development of Data Mining-Based Data Quality Tools

Improvement - C4.5

Information Gain Ratio

ID3's information gain favors attributes that have many values.
If attribute A has a distinct value in every record ⇒ Entropy(A, S) = 0 ⇒ Gain(A, S) is maximal.
Improvement: use the information gain ratio
GainRatio(A, S) = Gain(A, S) / SplitInfo(A, S)

Example: $\mathrm{SplitInfo}(Outlook, S) = -\frac{5}{14} \log_2\left(\frac{5}{14}\right) - \frac{5}{14} \log_2\left(\frac{5}{14}\right) - \frac{4}{14} \log_2\left(\frac{4}{14}\right)$

SplitInfo is large when the data are spread evenly over many branches and small when all data belong to one branch; dividing by it therefore penalizes attributes with many values.

Attributes with unknown values

In building a decision tree: simply ignore the record.
In using a decision tree: estimate the probabilities of the possible outcomes.
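
The effect of the normalization can be checked numerically. In this sketch the partition sizes 5, 5 and 4 are those of the Outlook example; the function names are hypothetical, and the gain value 0.246 is the one stated for Outlook in the ID3 section.

```python
from math import log2


def split_info(partition_sizes):
    """Entropy of the split itself: how evenly the records spread over the branches."""
    n = sum(partition_sizes)
    return -sum((s / n) * log2(s / n) for s in partition_sizes if s > 0)


def gain_ratio(gain, partition_sizes):
    """C4.5 criterion: information gain normalized by the split information."""
    return gain / split_info(partition_sizes)


print(round(split_info([5, 5, 4]), 3))         # SplitInfo(Outlook, S)
print(round(gain_ratio(0.246, [5, 5, 4]), 3))  # GainRatio(Outlook, S)
print(round(split_info([1] * 14), 3))          # an attribute with 14 distinct values
```

SplitInfo(Outlook, S) is about 1.577, which gives a gain ratio of about 0.156. An attribute with a different value in each of the 14 records has a split information of log2(14), about 3.807, so its maximal gain is divided by a much larger number.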

Page 39: Systematic Development of Data Mining-Based Data Quality Tools

Improvement - C4.5

Pruning Decision Trees

Purpose: to avoid overfitting.
Method: subtree replacement, i.e. a subtree is replaced by a leaf.
Example: test data with 3 (blue, success) and 2 (red, failure) ⇒ replace the subtree by a leaf labelled <failure>.

Page 40: Systematic Development of Data Mining-Based Data Quality Tools

Error Correction

Using predicted values

The predicted values can be used directly as corrections.

Interactive error correction

Sometimes the error lies in the base attributes.
The predicted values help in the search for the source of the error.

Page 42: Systematic Development of Data Mining-Based Data Quality Tools

Adapting C4.5 for Data Auditing

Error Confidence

How trustworthy is a detected error?
It depends on the number of records.
A low error confidence value is useless ⇒ minimum error confidence ⇒ minimum number of records.

Adjustments of C4.5

A minimum number of instances per partition is required, to avoid unnecessary subtrees.
The pessimistic classification error used in the C4.5 pruning criterion is replaced by the expected error confidence: if the expected error confidence is larger after pruning, the subtree is replaced by a single leaf.
The decision tree is transformed into an equivalent rule set, and the rules that are irrelevant for error detection are deleted.

Page 44: Systematic Development of Data Mining-Based Data Quality Tools

Evaluation

Number of records vs. sensitivity

Page 45: Systematic Development of Data Mining-Based Data Quality Tools

Evaluation

Number of rules vs. sensitivity

Page 46: Systematic Development of Data Mining-Based Data Quality Tools

Evaluation

Pollution factor vs. sensitivity

Page 47: Systematic Development of Data Mining-Based Data Quality Tools

Evaluation

Auditing Evaluation

A database that describes all industrial engines manufactured by Mercedes-Benz
contains 8 attributes and about 200,000 records
the audit ran for 21 minutes on a 900 MHz Athlon
it found about 6,000 suspicious records, which were ranked by their error confidence

For example

The following dependency between the 2 attributes BRV and GBM was induced:
BRV = 404 → GBM = 901
based on 16,118 records.
However, 1 record had a value of 911 for GBM.
The data auditing tool assigned an error confidence of 99.95% to this record.

Page 49: Systematic Development of Data Mining-Based Data Quality Tools

References

Literature and Links

Building Classification Models: ID3 and C4.5 (http://www.cis.temple.edu/~ingargio/cis587/readings/id3-c45.html)
The ID3 Algorithm (http://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm)
Knowledge Discovery and Data Mining Techniques and Practice (http://www.netnam.vn/unescocourse/knowlegde/knowlegd.htm)
Decision Trees (http://dms.irb.hr/tutorial/tut_dtrees.php)

Page 50: Systematic Development of Data Mining-Based Data Quality Tools

The End

:-)

Thank you for your attention!

?!

Questions and discussion...
