BDM_19122013_3
dejan-babic -
-
ETL
Prof. Dražena Gašpar, 19.12.2013
Business Data Management (Upravljanje poslovnim podacima)
-
SCHEDULE
Date         Topic
09.01.2014.  Dimensional modeling; data warehouse vs. data warehouse 2.0
16.01.2014.  Groups present: ETL + dimensional model
20.01.2014.  Data warehouse 2.0: components
30.01.2014.  Data management; information lifecycle management; data quality
06.02.2014.  Groups: final defense
-
BUSINESS INTELLIGENCE
-
ARCHITECTURE
-
DATA CLEANSING AREA
-
WHY IS DATA CLEANSING NECESSARY?
* Data in the real world is dirty:
  - incomplete: attribute values are missing, required attributes are missing, or only aggregated data is present
    e.g. occupation= (empty)
  - noisy: contains errors or outliers (values outside the allowed range)
    e.g. salary=-10
  - inconsistent: discrepancies in codes or names
    e.g. Age=42, Birthday=03/07/1997
    e.g. the rating used to be 1, 2, 3 and is now A, B, C
    e.g. discrepancies between duplicate records
-
WHY IS DATA DIRTY?
* Incomplete data may come from
  - "not applicable" as a data value at collection time
  - different considerations between the time the data was collected and the time it is analyzed
  - human/hardware/software problems
* Noisy data (incorrect values) may come from
  - faulty data collection instruments
  - human or computer errors at data entry
  - errors in data transmission
* Inconsistent data may come from
  - different data sources
  - functional dependency violations (e.g. modifications of linked data)
* Duplicate records also need cleaning
-
WHY IS DATA CLEANSING NECESSARY?
* No quality data, no quality analysis results!
  - Quality decisions must be based on quality data
    e.g. duplicate or missing data may cause incorrect or misleading statistics
  - A data warehouse needs consistent integration of quality data
  - Extracting, cleaning and transforming data makes up most of the work of building a data warehouse
-
MULTI-DIMENSIONAL MEASURE OF DATA QUALITY
* A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
* Broad categories:
  - significance, contextual fit, representativeness, accessibility
-
MAJOR TASKS
* Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
* Data integration
  - Integration of multiple databases, data cubes, or files
* Data transformation
  - Normalization and aggregation
* Data reduction
  - Reduces the volume of data while keeping the same or similar analytical results
* Data discretization
  - Part of data reduction, but especially important for numerical data
-
EXAMPLES OF THE MAJOR TASKS
-
DATA CLEANING
* Handling
  - Incomplete data
  - Noisy data
  - Inconsistencies
-
INCOMPLETE (MISSING) DATA
* Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
* Missing data may be due to
  - equipment malfunction
  - data that was inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
* Missing data may need to be inferred
-
HOW TO HANDLE INCOMPLETE DATA?
* Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
* Fill in the missing value manually: tedious + infeasible?
* Fill it in automatically with
  - a global constant: e.g., "unknown", a new class?!
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or decision tree
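The automatic fill-in strategies above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names, the `income` values and the `risk` class labels are all made up for the example, and `None` stands in for a missing value.

```python
from statistics import mean

def fill_with_constant(values, constant="unknown"):
    """Replace every missing value (None) with one global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace missing values with the mean of the known values."""
    known = [v for v in values if v is not None]
    attribute_mean = mean(known)
    return [attribute_mean if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace missing values with the mean of the known values in the same class.

    Assumes every class has at least one known value.
    """
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical sample: income with two gaps, and a class label per tuple.
income = [30, None, 50, 100, None, 120]
risk = ["low", "low", "low", "high", "high", "high"]

by_constant = fill_with_constant(income)
by_mean = fill_with_mean(income)
by_class_mean = fill_with_class_mean(income, risk)
```

The class-conditional mean keeps the two groups apart (40 for "low", 110 for "high"), whereas the overall mean of 75 fits neither group well, which is why the slide calls it the smarter choice.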
-
NOISY DATA
* Noise: random error or variance in a measured variable
* Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitation
  - inconsistency in naming convention
* Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data
-
HOW TO HANDLE NOISY DATA?
* Binning
  - first sort the data and partition it into (equal-frequency) bins
  - then smooth by bin means, by bin median, by bin boundaries, etc.
* Regression
  - smooth by fitting the data to regression functions
* Clustering
  - detect and remove outliers
* Combined computer and human inspection
  - detect suspicious values and have a human check them (e.g., deal with possible outliers)
-
DATA CLEANING AS A PROCESS
* Data discrepancy detection
  - Use metadata (e.g., domain, range, dependency, distribution)
  - Check field overloading
  - Check uniqueness rule, consecutive rule and null rule
  - Use commercial tools
    - Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
    - Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
* Data migration and integration
  - Data migration tools: allow transformations to be specified
  - ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
* Integration of the two processes
  - Iterative and interactive (e.g., Potter's Wheel)
-
DATA INTEGRATION
* Data integration:
  - Combines data from multiple sources into a coherent store
* Schema integration: e.g., A.cust-id ≡ B.cust-#
  - Integrate metadata from different sources
* Entity identification problem:
  - Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
* Detecting and resolving data value conflicts
  - For the same real world entity, attribute values from different sources are different
  - Possible reasons: different representations, different scales, e.g., metric vs. British units
-
REDUNDANCY AND DATA INTEGRATION
* Redundant data often occur when multiple databases are integrated
  - Object identification: the same attribute or object may have different names in different databases
  - Derivable data: one attribute may be a derived attribute in another table, e.g., annual revenue
* Redundant attributes may be detected by correlation analysis and covariance analysis
* Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
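As a rough illustration of how correlation analysis can flag a derivable attribute, the sketch below computes the Pearson correlation coefficient in plain Python. The attribute names (`monthly_revenue`, `annual_revenue`, `headcount`) and their values are invented for the example; a real check would run over the actual columns being integrated.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance_sum / (spread_x * spread_y)

# Hypothetical columns coming from two source databases.
monthly_revenue = [10.0, 12.0, 9.0, 15.0]
annual_revenue = [12 * m for m in monthly_revenue]  # derivable from monthly_revenue
headcount = [5, 3, 8, 4]                            # unrelated attribute

r_derived = pearson(monthly_revenue, annual_revenue)
r_other = pearson(monthly_revenue, headcount)
```

A coefficient at or very near +1 or -1 (as for `r_derived`) suggests that one attribute can be computed from the other and is a candidate for removal; a weaker value such as `r_other` does not.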
-
DATA INTEGRITY PROBLEMS
* Same person, name spelled differently: Dražena, Dražana, Draženka ...
* Several ways of writing a company name
  - Hera, Hera d.o.o., Hera SW kompanija
* Use of different names: Mumbai, Bombay
* Different codes generated by different applications for the same customer
* Required fields filled with a blank, a period, and the like
* Wrong product code entered at the POS
  - Manual entry leads to errors
  - "in case of a problem, use 9999999"
-
Data Mining: Concepts and Techniques
DATA TRANSFORMATION
* Smoothing: remove noise from data
* Aggregation: summarization, data cube construction
* Generalization: concept hierarchy climbing
* Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
* Attribute/feature construction
  - New attributes constructed from the given ones
-
DATA TRANSFORMATION: NORMALIZATION
* Min-max normalization to [new_min_A, new_max_A]:
  \[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]
  - Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
  \[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
* Z-score normalization (μ: mean, σ: standard deviation):
  \[ v' = \frac{v - \mu_A}{\sigma_A} \]
  - Ex. Let μ = 54,000, σ = 16,000. Then
  \[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
* Normalization by decimal scaling:
  \[ v' = \frac{v}{10^{j}} \]
  where j is the smallest integer such that max(|v'|) < 1
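The three normalization formulas can be checked with a short Python sketch. The function names are chosen here for illustration; the income figures (73,600 in a range of 12,000 to 98,000, mean 54,000, standard deviation 16,000) are the ones used in the formulas above, while the values passed to `decimal_scaling` are an assumed example.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

income_min_max = min_max(73_600, 12_000, 98_000)
income_z = z_score(73_600, 54_000, 16_000)
scaled, j = decimal_scaling([-986, 917])  # assumed sample values
```

Rounded to three decimals, `income_min_max` is 0.716 and `income_z` is 1.225, matching the worked examples; for the assumed sample, j is 3 and the values become -0.986 and 0.917.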
-
DATA TRANSFORMATION - EXAMPLE
* Field name: appl A - zbroj; appl B - zbr; appl C - tekzbr; appl D - zbrtek
* Unit of measure (pipeline): appl A - cm; appl B - in; appl C - feet; appl D - yds
* Encoding: appl A - m, ž; appl B - 1, 0; appl C - x, y; appl D - muško, žensko (male, female)
Target of all transformations: the data warehouse
-
DATA REDUCTION STRATEGIES
* Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
* Why data reduction? A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.
* Data reduction strategies
  - Dimensionality reduction, e.g., remove unimportant attributes
    - Wavelet transforms
    - Principal Components Analysis (PCA)
    - Feature subset selection, feature creation
  - Numerosity reduction (some simply call it: data reduction)
    - Regression and log-linear models
    - Histograms, clustering, sampling
    - Data cube aggregation
  - Data compression
-
SIMPLE DISCRETIZATION METHODS: BINNING
* Equal-width (distance) partitioning
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
  - The most straightforward, but outliers may dominate the presentation
  - Skewed data is not handled well
* Equal-depth (frequency) partitioning
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky
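The difference between the two partitioning schemes is easy to see in code. The following is a minimal sketch, with function names chosen for illustration; it uses the sorted price list from the binning example in this deck and assumes the attribute has at least two distinct values (otherwise the interval width would be zero).

```python
def equal_width_bins(values, n):
    """Split the range [min, max] into n intervals of width W = (B - A) / n."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        # The highest value would fall just past the last interval, so clamp it.
        index = min(int((v - lowest) / width), n - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n):
    """Split the sorted values into n bins holding (almost) the same number of samples."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
by_width = equal_width_bins(prices, 3)
by_depth = equal_depth_bins(prices, 3)
```

With W = (34 - 4)/3 = 10, the equal-width bins hold 3, 3 and 6 values, while the equal-depth bins hold 4 values each; this is the skew sensitivity the slide refers to.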
-
BINNING METHODS FOR DATA SMOOTHING
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
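The two smoothing variants can be reproduced with a small Python sketch. The function names are illustrative; the bins are the three equal-frequency bins listed above, and each bin is assumed to be sorted already. Bin means are rounded to whole dollars, as in the example.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) mean of that bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's lowest and highest value.

    Assumes each bin is sorted; ties go to the lower boundary.
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Both results agree with the values on the slide: the means 9, 22.75 and 29.25 round to 9, 23 and 29, and every inner value snaps to its nearer boundary.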
-
REGRESSION
[Figure: plot with axes x and y showing the regression line y = x + 1 and the labels X1, Y1]
-
CLUSTER ANALYSIS
-
DATA CLEANSING AREA: ETL (Extraction, Transformation, Loading)
* Detecting changes in the source data needed for the data warehouse;
* Extracting data from the source systems;
* Cleaning and transforming data;
* Restructuring data keys;
* Indexing data;
* Summarizing data;
* Maintaining metadata;
* Loading data into the data warehouse.
-
DATA TRANSFORMATION TERMS
* Extracting
* Conditioning
* Cleaning/Scrubbing
* Merging
* Householding
* Enrichment
* Scoring
* Loading
* Validating
* Updating
-
DATA TRANSFORMATION TERMS
* Extracting
  - Extracts data from the operational sources in "as is" status
* Conditioning
  - Conversion of data types from the source to the target database (data warehouse)
-
DATA TRANSFORMATION TERMS
* Householding
  - Identifying all members of a household (living at the same address)
  - Ensures that only one piece of mail is sent to a household
  - Can result in substantial savings on paper and postage
-
DATA TRANSFORMATION TERMS
* Enrichment
  - Using data from external sources to enrich the operational data
* Scoring
  - Computing the probability of an event, e.g. the probability that a customer will buy a new product or switch brands
-
LOADING (LOAD)
* After extracting, scrubbing, cleaning, validating, etc., the data must be loaded into the data warehouse
* Issues
  - Huge volumes of data to be loaded
  - Small time window in which the data warehouse can be taken offline (often not even at night, because of the web)
  - When to build indexes and summary tables
  - Allow the system administrator to monitor, cancel, resume, and change the load rate
  - Graceful recovery: restart after a system failure from where the load stopped, without loss of data integrity
-
LOADING TECHNIQUES
* Using SQL to append or insert new data
  - A record at a time
  - Leads to random disk I/O
* Using batch loading
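To make the contrast concrete, here is a hedged sketch using Python's built-in `sqlite3` module with an in-memory database as a stand-in for the warehouse. The table `sales_fact` and its rows are hypothetical, and real warehouse platforms usually ship dedicated bulk-load utilities that are not shown here.

```python
import sqlite3

# Hypothetical rows to load: (id, customer, amount).
rows = [(i, f"customer-{i}", i * 10.0) for i in range(1, 1001)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# Record at a time: one INSERT statement and one commit per row.
for row in rows[:500]:
    conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?)", row)
    conn.commit()

# Batch: many rows handed over in one call, committed as a single transaction.
with conn:
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows[500:])

loaded = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```

Both halves end up in the table, but the batch variant pays the per-transaction overhead once instead of once per record, which is the main reason batch loading is preferred for large volumes.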
-
LOADING TAXONOMY
* Incremental versus full load
* Online versus offline load
-
UPDATING (REFRESH)
* Propagates updates made to the source data to the data warehouse
* Issues:
  - When to refresh
  - How to refresh: refresh techniques
-
WHEN TO REFRESH?
* Periodically (e.g. every night, every week) or after significant events
* On every update: not warranted unless the DW requires up-to-date data
* The refresh policy is set by the administrator, based on user needs and traffic
* Different policies are possible for different data sources
-
REFRESH TECHNIQUES
* Full extract from the base tables
  - Reads the entire source table: too expensive
  - May be the only choice for legacy systems
-
HOW TO DETECT CHANGES
* Create a snapshot log table to record the ids of updated rows of the source data, together with timestamps
* Detect changes by:
  - Defining "after row" triggers that update the snapshot log when the source table changes
  - Using the regular transaction logs to detect changes in the source data
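The trigger-based variant can be sketched with SQLite through Python's `sqlite3` module. This is only an illustration under stated assumptions: the source table `customer`, the log table `snapshot_log` and the trigger name are hypothetical, and trigger syntax differs between database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

-- Snapshot log: id of the changed source row plus a timestamp.
CREATE TABLE snapshot_log (row_id INTEGER, changed_at TEXT);

-- "After row" trigger: fires once for every updated row of the source table.
CREATE TRIGGER customer_after_update
AFTER UPDATE ON customer
FOR EACH ROW
BEGIN
    INSERT INTO snapshot_log VALUES (NEW.id, CURRENT_TIMESTAMP);
END;
""")

conn.executemany(
    "INSERT INTO customer VALUES (?, ?)", [(1, "Hera"), (2, "Hera d.o.o.")]
)
conn.execute("UPDATE customer SET name = 'Hera SW' WHERE id = 2")

# The refresh job only needs to re-extract the rows listed in the log.
changed_ids = [r[0] for r in conn.execute("SELECT DISTINCT row_id FROM snapshot_log")]
```

Only row 2 appears in the log, so an incremental refresh can skip the unchanged row. Inserts and deletes would need their own triggers, which are left out of this sketch.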
-
DATA EXTRACTION AND CLEANING
* Extracting data from existing operational and legacy data
* Issues:
  - Sources of data for the DW
  - Quality of the source data
  - Merging different data sources
  - Data transformation
  - How to propagate updates (on the source data) to the data warehouse
  - Terabytes of data to be loaded
-
DESCRIPTIVE DATA SUMMARIZATION
* Motivation
  - To better understand the data: central tendency, variation and spread
* Data dispersion characteristics
  - median, max, min, outliers, variance, etc.
* Numerical dimensions correspond to sorted intervals
  - Data dispersion: analyzed with multiple granularities of precision
  - Boxplot or quantile analysis on sorted intervals
* Dispersion analysis on computed measures
  - Folding measures into numerical dimensions
  - Boxplot or quantile analysis on the transformed cube
-
MEASURING THE CENTRAL TENDENCY
* Mean (algebraic measure) (sample vs. population):
  \[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]
  - Weighted arithmetic mean:
    \[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]
  - Trimmed mean: chopping extreme values
* Median: a holistic measure
  - Middle value if there is an odd number of values, or the average of the middle two values otherwise
  - Estimated by interpolation (for grouped data):
    \[ \text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c \]
* Mode
  - Value that occurs most frequently in the data
  - Unimodal, bimodal, trimodal
  - Empirical formula:
    \[ \text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median}) \]
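These measures map directly onto Python's standard `statistics` module. The sketch below reuses the price list from the binning example in this deck; `weighted_mean` and `trimmed_mean` are helper names introduced here, and the weights and trimming depth are assumed values chosen for illustration.

```python
from statistics import mean, median, mode

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, k):
    """Mean after chopping the k smallest and k largest values (requires 2k < n)."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

price_mean = mean(prices)              # sum of 244 over 12 values
price_median = median(prices)          # average of the middle two values, 21 and 24
price_mode = mode(prices)              # 21 is the only value that occurs twice
price_trimmed = trimmed_mean(prices, 1)
example_weighted = weighted_mean([10, 20], [1, 3])  # assumed values and weights
```

For this list the median (22.5) lies above the mean (about 20.33), which hints at a slight negative skew; the trimmed mean (20.6) shows how little the two extreme values move the average here.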
-
SYMMETRIC VS. SKEWED DATA
| Median, mean and mode of symmetric, positively and negatively skewed data
-
MEASURING THE DISPERSION OF DATA
* Quartiles, outliers and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 - Q1
  - Five number summary: min, Q1, M, Q3, max
  - Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually
  - Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
* Variance and standard deviation (sample: s, population: σ)
  - Variance (algebraic, scalable computation):
    \[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right] \]
    \[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{n}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{n} x_i^2 - \mu^2 \]
  - Standard deviation s (or σ) is the square root of variance s² (or σ²)
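A short Python sketch ties the quartile-based and variance-based measures together, again on the price list from the binning example. Quartile conventions differ between tools; the `method="inclusive"` setting of `statistics.quantiles` is one choice among several, and `shortcut_variance` is a helper name introduced here for the one-pass form of the sample variance.

```python
from statistics import quantiles, variance

def shortcut_variance(values):
    """Sample variance via 1/(n-1) * [sum(x^2) - (sum(x))^2 / n]."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Quartiles by linear interpolation (one of several common conventions).
q1, m, q3 = quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1
five_number_summary = (min(prices), q1, m, q3, max(prices))

# Outliers: more than 1.5 x IQR below Q1 or above Q3.
outliers = [x for x in prices if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

s2_library = variance(prices)            # definition-based sample variance
s2_shortcut = shortcut_variance(prices)  # needs only the running sums
```

The shortcut form matters for large data sets because the two sums can be accumulated in a single scan, and partial sums from separate partitions can simply be added; this is what the slide means by scalable computation.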
-
PROPERTIES OF NORMAL DISTRIBUTION CURVE
* The normal (distribution) curve
  - From μ−σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
  - From μ−2σ to μ+2σ: contains about 95% of it
  - From μ−3σ to μ+3σ: contains about 99.7% of it
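The three percentages follow from the normal distribution itself: the probability of falling within k standard deviations of the mean equals erf(k/√2). A minimal check in Python, with `within_k_sigma` as a helper name introduced here:

```python
from math import erf, sqrt

def within_k_sigma(k):
    """Probability that a normally distributed value lies within k sigma of the mean."""
    return erf(k / sqrt(2))

coverage = {k: within_k_sigma(k) for k in (1, 2, 3)}
```

Rounded to three decimals the values are 0.683, 0.954 and 0.997, which the slide states as about 68%, 95% and 99.7%.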
-
BOXPLOT ANALYSIS
* Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
* Boxplot
  - Data is represented with a box
  - The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  - The median is marked by a line within the box
  - Whiskers: two lines outside the box extend to Minimum and Maximum
-
VISUALIZATION OF DATA DISPERSION: BOXPLOT ANALYSIS
-
HANDLING REDUNDANCY IN DATA INTEGRATION
* Redundant data often occur when multiple databases are integrated
  - Object identification: the same attribute or object may have different names in different databases
  - Derivable data: one attribute may be a derived attribute in another table, e.g., annual revenue
* Redundant attributes may be detected by correlation analysis
* Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
-
Questions?