BDM_19122013_3

ETL. Prof. Dražena Gašpar, 19.12.2013. Upravljanje poslovnim podacima (Business Data Management)

description

UPP slides 4

Transcript of BDM_19122013_3

  • ETL

    Prof. Dražena Gašpar, 19.12.2013

    Business Data Management

  • SCHEDULE

    Date         Topic

    09.01.2014.  Dimensional modeling; Data warehouse vs. data warehouse 2.0

    16.01.2014.  Groups present: ETL + dimensional model

    20.01.2014.  Data warehouse 2.0 components

    30.01.2014.  Data management; Information lifecycle management; Data quality

    06.02.2014.  Groups: final defense

  • Business Intelligence

  • ARCHITECTURE

  • DATA CLEANSING AREA

  • WHY IS DATA CLEANSING NECESSARY?

    Data in the real world is dirty:

    - Incomplete: attribute values are missing, required attributes are missing, or only aggregated data is present
      - E.g. occupation= (empty value)

    - Noisy: contains errors or unsuitable (out-of-range) values
      - E.g. salary=-10

    - Inconsistent: discrepancies in coding or naming
      - E.g. Age=42, Birthday=03/07/1997
      - E.g. the rating used to be 1, 2, 3 and is now A, B, C
      - E.g. discrepancies between duplicate records

  • WHY IS DATA DIRTY?

    - Incomplete data may come from
      - "Not applicable" used as a data value at collection time
      - Different considerations at the time the data was collected and at the time it is analyzed
      - Human/hardware/software problems

    - Noisy data (incorrect values) may come from
      - Faulty data collection instruments
      - Human or computer errors at data entry
      - Errors in data transmission

    - Inconsistent data may come from
      - Different data sources
      - Violation of functional dependencies (e.g. modifying linked data)

    - Duplicate records also need cleaning

  • WHY IS DATA CLEANSING NECESSARY?

    - Without quality data there are no quality analysis results!
      - Quality decisions must be based on quality data
        - E.g. duplicate or missing data may cause incorrect or misleading statistics.
      - A data warehouse needs consistent integration of quality data
      - Extracting, cleaning and transforming data makes up the largest part of the work of building a data warehouse

  • MULTIDIMENSIONAL MEASURE OF DATA QUALITY

    - A generally accepted multidimensional view:
      - Accuracy
      - Completeness
      - Consistency
      - Timeliness
      - Believability
      - Value added
      - Interpretability
      - Accessibility

    - Broad categories:
      - Intrinsic, contextual, representational, accessible

  • MAJOR TASKS

    - Data cleaning
      - Filling in missing values, smoothing noise, identifying or removing outliers, and resolving inconsistencies

    - Data integration
      - Integrating multiple databases, data cubes or files

    - Data transformation
      - Normalization and aggregation

    - Data reduction
      - Reducing the volume of data while keeping the same or similar analytical results

    - Data discretization
      - Part of data reduction, but particularly important for numerical data

  • EXAMPLES OF MAJOR TASKS

  • DATA CLEANING

    - Handling
      - Incomplete data
      - Noisy data
      - Inconsistencies

  • INCOMPLETE (MISSING) DATA

    - Data is not always available
      - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

    - Missing data may be due to
      - equipment malfunction
      - data that was inconsistent with other recorded data and thus deleted
      - data not entered due to misunderstanding
      - certain data not being considered important at the time of entry
      - history or changes of the data not being registered

    - Missing data may need to be inferred

  • WHAT TO DO WITH INCOMPLETE DATA?

    - Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably

    - Fill in the missing value manually: tedious + infeasible?

    - Fill it in automatically with
      - a global constant: e.g., "unknown", a new class?!
      - the attribute mean
      - the attribute mean for all samples belonging to the same class: smarter
      - the most probable value: inference-based, such as Bayesian formula or decision tree
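
The two mean-based fill strategies on this slide can be sketched as follows. The sample rows, the use of `None` for a missing value, and the function names are assumptions made for illustration only.

```python
from statistics import mean

# Hypothetical sample: (class label, income); None marks a missing value.
ROWS = [("A", 30.0), ("A", None), ("A", 50.0),
        ("B", 10.0), ("B", None), ("B", 20.0)]


def fill_with_attribute_mean(rows):
    """Replace each missing value with the mean of all known values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(label, overall if v is None else v) for label, v in rows]


def fill_with_class_mean(rows):
    """Replace each missing value with the mean of its own class."""
    known = {}
    for label, v in rows:
        if v is not None:
            known.setdefault(label, []).append(v)
    # Assumes every class has at least one known value.
    class_mean = {label: mean(vs) for label, vs in known.items()}
    return [(label, class_mean[label] if v is None else v) for label, v in rows]


print(fill_with_attribute_mean(ROWS)[1])
print(fill_with_class_mean(ROWS)[1])
```

With this sample the class mean keeps the two groups apart, while the attribute mean pulls both missing values toward the same overall figure.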

  • NOISY DATA

    - Noise: random error or variance in a measured variable

    - Incorrect attribute values may be due to
      - faulty data collection instruments
      - data entry problems
      - data transmission problems
      - technology limitation
      - inconsistency in naming convention

    - Other data problems which require data cleaning
      - duplicate records
      - incomplete data
      - inconsistent data

  • WHAT TO DO WITH NOISY DATA?

    - Binning
      - first sort data and partition into (equal-frequency) bins
      - then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

    - Regression
      - smooth by fitting the data into regression functions

    - Clustering
      - detect and remove outliers

    - Combined computer and human inspection
      - detect suspicious values and check by human (e.g., deal with possible outliers)

  • DATA CLEANING AS A PROCESS

    - Data discrepancy detection
      - Use metadata (e.g., domain, range, dependency, distribution)
      - Check field overloading
      - Check uniqueness rule, consecutive rule and null rule
      - Use commercial tools
        - Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
        - Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)

    - Data migration and integration
      - Data migration tools: allow transformations to be specified
      - ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface

    - Integration of the two processes
      - Iterative and interactive (e.g., Potter's Wheel)

  • DATA INTEGRATION

    - Data integration:
      - Combines data from multiple sources into a coherent store

    - Schema integration: e.g., A.cust-id ≡ B.cust-#
      - Integrate metadata from different sources

    - Entity identification problem:
      - Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton

    - Detecting and resolving data value conflicts
      - For the same real-world entity, attribute values from different sources are different
      - Possible reasons: different representations, different scales, e.g., metric vs. British units

  • REDUNDANCY AND DATA INTEGRATION

    - Redundant data often occur when multiple databases are integrated
      - Object identification: the same attribute or object may have different names in different databases
      - Derivable data: one attribute may be a derived attribute in another table, e.g., annual revenue

    - Redundant attributes may be detected by correlation analysis and covariance analysis

    - Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
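
As a rough illustration of correlation analysis for spotting a derivable attribute, the sketch below computes the Pearson correlation coefficient by hand. The column values, the `is_redundant` helper and its 0.95 threshold are assumptions for illustration, not part of the slides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def is_redundant(xs, ys, threshold=0.95):
    """Flag a pair of attributes as likely redundant (threshold is an assumption)."""
    return abs(pearson(xs, ys)) >= threshold


# Hypothetical columns coming from two source tables.
monthly_revenue = [10.0, 12.0, 9.0, 15.0]
annual_revenue = [12 * m for m in monthly_revenue]  # derivable attribute
employees = [5.0, 3.0, 8.0, 4.0]

print(is_redundant(monthly_revenue, annual_revenue))
print(is_redundant(monthly_revenue, employees))
```

A coefficient close to 1 or -1 only suggests redundancy; whether one attribute can really be dropped still needs a check against the metadata.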

  • DATA INTEGRITY PROBLEMS

    - Same person, differently spelled name: Dražena, Dražana, Draženka ...

    - Multiple ways of writing a company name
      - Hera, Hera d.o.o., Hera SW company

    - Use of different names: Mumbai, Bombay

    - Different codes generated by different applications for the same customer

    - A blank, ".", or similar character entered in mandatory fields

    - Wrong product code entered at the POS
      - Manual entries lead to errors
      - "In case of a problem use 9999999"

  • DATA TRANSFORMATION

    Data Mining: Concepts and Techniques

    - Smoothing: remove noise from data
    - Aggregation: summarization, data cube construction
    - Generalization: concept hierarchy climbing
    - Normalization: scaled to fall within a small, specified range
      - min-max normalization
      - z-score normalization
      - normalization by decimal scaling
    - Attribute/feature construction
      - New attributes constructed from the given ones

  • DATA TRANSFORMATION: NORMALIZATION

    - Min-max normalization to [new_min_A, new_max_A]:

      \[ v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]

      Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

      \[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]

    - Z-score normalization (μ: mean, σ: standard deviation):

      \[ v' = \frac{v - \mu_A}{\sigma_A} \]

      Ex. Let μ = 54,000, σ = 16,000. Then

      \[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]

    - Normalization by decimal scaling:

      \[ v' = \frac{v}{10^{j}} \]

      where j is the smallest integer such that Max(|v'|) < 1
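
The three normalization methods can be checked numerically with a short sketch. It uses the income value 73,600 that appears in the slide's worked arithmetic; the function names and the sample list passed to decimal scaling are assumptions for illustration.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
print(decimal_scaling([-986, 917]))                # hypothetical values, j = 3
```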

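  • DATA TRANSFORMATION - EXAMPLE

    Each group of source conventions is unified into a single form in the data warehouse:

    - Field names (abbreviations of "total" and "current total"): appl A - zbroj; appl B - zbr; appl C - tekzbr; appl D - zbrtek

    - Units: appl A - pipeline - cm; appl B - pipeline - in; appl C - pipeline - feet; appl D - pipeline - yds

    - Encoding: appl A - m, f; appl B - 1, 0; appl C - x, y; appl D - male, female

    Data warehouse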
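
The example slide lists four applications that encode the same attribute differently (for instance 1/0 versus x/y, or cm versus in, feet and yds). A minimal sketch of unifying them on the way into the warehouse might look like this. Which source code corresponds to which gender, the target codes "M"/"F", and the choice of centimetres as the warehouse unit are all assumptions.

```python
# Assumed mapping of each application's gender codes to one warehouse code.
GENDER_MAP = {
    "A": {"m": "M", "f": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"x": "M", "y": "F"},
    "D": {"male": "M", "female": "F"},
}

# Conversion factors to centimetres (assumed warehouse unit).
TO_CM = {"cm": 1.0, "in": 2.54, "feet": 30.48, "yds": 91.44}


def unify_gender(application, code):
    """Translate an application-specific gender code to the warehouse code."""
    return GENDER_MAP[application][str(code).lower()]


def unify_length(value, unit):
    """Convert a pipeline length to centimetres."""
    return value * TO_CM[unit]


print(unify_gender("B", 1), unify_gender("D", "Female"))
print(unify_length(2, "yds"))
```

Unknown applications or codes raise `KeyError` here, which is one simple way to stop bad rows before they reach the warehouse.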

  • DATA REDUCTION STRATEGIES

    - Data reduction: obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results

    - Why data reduction? A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.

    - Data reduction strategies
      - Dimensionality reduction, e.g., remove unimportant attributes
        - Wavelet transforms
        - Principal Components Analysis (PCA)
        - Feature subset selection, feature creation
      - Numerosity reduction (some simply call it: Data Reduction)
        - Regression and Log-Linear Models
        - Histograms, clustering, sampling
        - Data cube aggregation
      - Data compression

  • SIMPLE DISCRETIZATION METHODS: BINNING

    - Equal-width (distance) partitioning
      - Divides the range into N intervals of equal size: uniform grid
      - If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A)/N
      - The most straightforward, but outliers may dominate presentation
      - Skewed data is not handled well

    - Equal-depth (frequency) partitioning
      - Divides the range into N intervals, each containing approximately the same number of samples
      - Good data scaling
      - Managing categorical attributes can be tricky
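
The difference between the two partitioning schemes shows up clearly on a small sample. The sketch below uses the sorted price list from the smoothing example in this deck; the function names and the rule that the maximum value goes into the last bin are assumptions.

```python
PRICES = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]


def equal_width_bins(values, n):
    """Split the value range [A, B] into n intervals of width (B - A) / n."""
    low, high = min(values), max(values)
    width = (high - low) / n  # assumes high > low
    bins = [[] for _ in range(n)]
    for v in values:
        index = min(int((v - low) / width), n - 1)  # maximum goes to last bin
        bins[index].append(v)
    return bins


def equal_depth_bins(values, n):
    """Split the sorted values into n bins of (almost) equal size."""
    ordered = sorted(values)
    size = len(ordered)
    return [ordered[i * size // n:(i + 1) * size // n] for i in range(n)]


print(equal_width_bins(PRICES, 3))
print(equal_depth_bins(PRICES, 3))
```

Equal-width binning leaves the bins unevenly filled (3, 3 and 6 values here), while equal-depth binning puts 4 values in each.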

  • BINNING METHODS FOR DATA SMOOTHING

    Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

    * Partition into equal-frequency (equi-depth) bins:
      - Bin 1: 4, 8, 9, 15
      - Bin 2: 21, 21, 24, 25
      - Bin 3: 26, 28, 29, 34

    * Smoothing by bin means:
      - Bin 1: 9, 9, 9, 9
      - Bin 2: 23, 23, 23, 23
      - Bin 3: 29, 29, 29, 29

    * Smoothing by bin boundaries:
      - Bin 1: 4, 4, 4, 15
      - Bin 2: 21, 21, 25, 25
      - Bin 3: 26, 26, 26, 34
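
The smoothed values on this slide can be reproduced with a few lines of code. This is a sketch under two assumptions: bin means are rounded to whole dollars, and a value equally distant from both boundaries goes to the lower one.

```python
BINS = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]


def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties go to the lower boundary (assumption).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


print(smooth_by_means(BINS))
print(smooth_by_boundaries(BINS))
```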

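  • REGRESSION

    [Figure: regression line y = x + 1 on x-y axes; for a point X1, the observed value Y1 and the fitted value Y1' on the line]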

  • CLUSTER ANALYSIS

  • DATA CLEANSING AREA: ETL (Extraction, Transformation, Loading)

    - Detecting changes in the source data needed for the data warehouse
    - Extracting data from the source systems
    - Cleaning and transforming data
    - Restructuring data keys
    - Indexing data
    - Summarizing data
    - Maintaining metadata
    - Loading data into the data warehouse

  • DATA TRANSFORMATION TERMS

    - Extracting
    - Conditioning
    - Cleaning/Scrubbing
    - Merging
    - Householding
    - Enrichment
    - Scoring
    - Loading
    - Validating
    - Updating

  • DATA TRANSFORMATION TERMS

    - Extracting
      - Extracts data from the operational sources in "as is" status

    - Conditioning
      - Conversion of data types from the source to the target data store (data warehouse)

  • DATA TRANSFORMATION TERMS

    - Householding
      - Identifying all members of a household (living at the same address)
      - Ensures that only one mail piece is sent to a household
      - Can result in significant savings of paper and postage
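
Householding can be sketched as grouping customers by a normalized address and sending one mail piece per group. The customer records and the very simple normalization (lower-casing and collapsing whitespace) are assumptions; real address matching is usually far more involved.

```python
from collections import defaultdict

# Hypothetical customer records: (name, address).
CUSTOMERS = [
    ("Ana", "Main St 1"),
    ("Ivan", "main st  1"),
    ("Eva", "Oak Rd 5"),
]


def normalize_address(address):
    """Lower-case and collapse whitespace so trivial variants match."""
    return " ".join(address.lower().split())


def households(customers):
    """Group customer names by normalized address."""
    groups = defaultdict(list)
    for name, address in customers:
        groups[normalize_address(address)].append(name)
    return dict(groups)


def mailing_list(customers):
    """One recipient per household: the first customer seen at each address."""
    return [names[0] for names in households(customers).values()]


print(households(CUSTOMERS))
print(mailing_list(CUSTOMERS))
```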

  • DATA TRANSFORMATION TERMS

    - Enrichment
      - Using data from external sources to enrich the operational data

    - Scoring
      - Computing the probability of an event, e.g. the probability that a customer will buy a new product or switch product brands

  • LOADING (LOAD)

    - After extracting, scrubbing, cleaning, validating, etc., the data must be loaded into the data warehouse

    - Issues
      - Huge volumes of data to load
      - Short time window in which the data warehouse can be offline (often not even at night, because of the web)
      - When to build indexes and summary tables
      - Allow the system administrator to monitor, cancel, resume, and change the load rate
      - Graceful recovery: resume after a system failure from where the load stopped, without loss of data integrity

  • LOADING TECHNIQUES

    - Using SQL to append or insert new data
      - One record at a time
      - Leads to random disk I/O

    - Using batch load
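
The contrast between record-at-a-time SQL inserts and a batch load can be sketched with Python's built-in `sqlite3` module. An in-memory SQLite database stands in for the warehouse here, and the `sales` table and its rows are hypothetical; a real warehouse would normally use its own bulk loader.

```python
import sqlite3

# Hypothetical rows to load: (id, product, amount).
ROWS = [(i, f"product-{i}", i * 1.5) for i in range(100)]


def new_warehouse():
    """Create an in-memory stand-in for the warehouse with one fact table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, amount REAL)"
    )
    return conn


def load_record_at_a_time(conn, rows):
    """One INSERT and one commit per record."""
    for row in rows:
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", row)
        conn.commit()


def load_batch(conn, rows):
    """All records in a single transaction."""
    with conn:  # commits once at the end, rolls back on error
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


slow, fast = new_warehouse(), new_warehouse()
load_record_at_a_time(slow, ROWS)
load_batch(fast, ROWS)
print(fast.execute("SELECT COUNT(*) FROM sales").fetchone()[0])
```

Both paths load the same rows; the batch variant does it inside one transaction, which also gives a single rollback point if the load fails.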

  • LOAD TAXONOMY

    - Incremental versus full load

    - Online versus offline load

  • UPDATING (REFRESH)

    - Propagates updates made to the source data to the data warehouse

    - Issues:
      - When to refresh
      - How to refresh: refresh techniques

  • WHEN TO REFRESH?

    - Periodically (e.g. every evening, every week) or after significant events

    - On every update: not warranted unless the DW requires up-to-date data

    - The refresh policy is set by the administrator, based on user needs and traffic

    - Different policies are possible for different data sources

  • REFRESH TECHNIQUES

    - Full extract from the base tables
      - Reads the entire source table: too expensive
      - May be the only choice for legacy systems

  • HOW TO DETECT CHANGES

    - Create a snapshot log table to record the ids of updated rows of the source data, together with timestamps

    - Detect changes by:
      - Defining "after row" triggers that update the snapshot log when the source table changes
      - Using regular transaction logs to detect changes in the source data
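
A snapshot log maintained by an after-row trigger can be sketched in SQLite through Python's `sqlite3` module. The `customer` table, the `snapshot_log` columns and the trigger name are hypothetical, and trigger syntax differs between database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE snapshot_log (row_id INTEGER, changed_at TEXT);

    -- After-row trigger: record the id and timestamp of every updated row.
    CREATE TRIGGER customer_after_update
    AFTER UPDATE ON customer
    FOR EACH ROW
    BEGIN
        INSERT INTO snapshot_log (row_id, changed_at)
        VALUES (NEW.id, CURRENT_TIMESTAMP);
    END;

    INSERT INTO customer VALUES (1, 'Hera'), (2, 'Acme');
    """
)

# A change in the source table is captured by the trigger.
conn.execute("UPDATE customer SET name = 'Hera d.o.o.' WHERE id = 1")

# The refresh job reads only the logged ids instead of the whole table.
changed_ids = [row[0] for row in conn.execute(
    "SELECT DISTINCT row_id FROM snapshot_log ORDER BY row_id"
)]
print(changed_ids)
```

Only row 1 ends up in the log, so an incremental refresh would need to re-read a single row rather than extract the full source table.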

  • DATA EXTRACTION AND CLEANING

    - Extracting data from existing operational and legacy data

    - Issues:
      - Sources of data for the DW
      - Quality of the source data
      - Merging different data sources
      - Transforming data
      - How to propagate updates (on the source data) to the data warehouse
      - Terabytes of data to load

  • DESCRIPTIVE DATA SUMMARIZATION

    - Motivation
      - To better understand the data: central tendency, variation and spread

    - Data dispersion characteristics
      - median, max, min, outliers, variance, etc.

    - Numerical dimensions correspond to sorted intervals
      - Data dispersion: analyzed with multiple granularities of precision
      - Boxplot or quantile analysis on sorted intervals

    - Dispersion analysis on computed measures
      - Folding measures into numerical dimensions
      - Boxplot or quantile analysis on the transformed cube

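  • MEASURING THE CENTRAL TENDENCY

    - Mean (algebraic measure) (sample vs. population):

      \[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]

      - Weighted arithmetic mean:

        \[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]

      - Trimmed mean: chopping extreme values

    - Median: a holistic measure
      - Middle value if odd number of values, or average of the middle two values otherwise
      - Estimated by interpolation (for grouped data):

        \[ \mathit{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\mathit{median}}}\right) c \]

        where L_1 is the lower boundary of the median interval, n the number of values, (\sum f)_l the sum of the frequencies of all intervals below the median interval, f_median the frequency of the median interval, and c the interval width

    - Mode
      - Value that occurs most frequently in the data
      - Unimodal, bimodal, trimodal
      - Empirical formula:

        \[ \mathit{mean} - \mathit{mode} = 3 \times (\mathit{mean} - \mathit{median}) \]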
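
The measures of central tendency on this slide can be computed with Python's `statistics` module plus two small helpers. The sample is the price list from the binning example in this deck; the weights and the number of trimmed values are assumptions for illustration.

```python
from statistics import mean, median, multimode

PRICES = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after chopping the k smallest and k largest values."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


print(mean(PRICES))                            # arithmetic mean
print(weighted_mean([1, 2, 3], [3, 2, 1]))     # hypothetical weights
print(trimmed_mean(PRICES, 1))                 # drops 4 and 34
print(median(PRICES))                          # average of 21 and 24
print(multimode(PRICES))                       # most frequent value(s)
```

`multimode` returns a list, so unimodal, bimodal and trimodal data are all handled by the same call.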

  • SYMMETRIC VS. SKEWED DATA

    | Median, mean and mode of symmetric, positively and negatively skewed data

  • MEASURING THE DISPERSION OF DATA

    - Quartiles, outliers and boxplots
      - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
      - Inter-quartile range: IQR = Q3 - Q1
      - Five-number summary: min, Q1, M, Q3, max
      - Boxplot: ends of the box are the quartiles, median is marked, whiskers, and outliers are plotted individually
      - Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1

    - Variance and standard deviation (sample: s, population: σ)
      - Variance (algebraic, scalable computation):

        \[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right] \]

        \[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2 \]

      - Standard deviation s (or σ) is the square root of variance s^2 (or σ^2)
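
A short sketch of the dispersion measures follows. Quartile values depend on the interpolation convention; this example assumes the "inclusive" method of `statistics.quantiles`, and the extra value 90 appended to the price list is a hypothetical outlier. The variance helpers use the single-pass forms (sums of x and x squared).

```python
from statistics import median, quantiles

PRICES = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median(values), q3, max(values)


def outliers(values):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def sample_variance(values):
    """s^2 computed from the sums of x and x^2."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)


def population_variance(values):
    """sigma^2 computed as the mean of x^2 minus mu^2."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2


print(five_number_summary(PRICES))
print(outliers(PRICES + [90]))
print(sample_variance(PRICES), population_variance(PRICES))
```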

  • PROPERTIES OF NORMAL DISTRIBUTION CURVE

    - The normal (distribution) curve
      - From μ-σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
      - From μ-2σ to μ+2σ: contains about 95% of it
      - From μ-3σ to μ+3σ: contains about 99.7% of it
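
These percentages follow from the normal cumulative distribution: the share of values within k standard deviations of the mean equals erf(k / sqrt(2)). A minimal check, with `coverage` as a hypothetical helper name:

```python
from math import erf, sqrt


def coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(k, round(coverage(k) * 100, 1))
```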

  • BOXPLOT ANALYSIS

    - Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum

    - Boxplot
      - Data is represented with a box
      - The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
      - The median is marked by a line within the box
      - Whiskers: two lines outside the box extend to Minimum and Maximum

  • VISUALIZATION OF DATA DISPERSION: BOXPLOT ANALYSIS

  • HANDLING REDUNDANCY IN DATA INTEGRATION

    - Redundant data often occur when multiple databases are integrated
      - Object identification: the same attribute or object may have different names in different databases
      - Derivable data: one attribute may be a derived attribute in another table, e.g., annual revenue

    - Redundant attributes may be detected by correlation analysis

    - Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

  • Questions..