MC0088 Data Warehousing & Data Mining


  • 8/10/2019 MC0088 Data Warehousing & Data Mining


    Que 1 - Differentiate between Data Mining and Data Warehousing.

    Ans: Data Mining: Data Mining is actually the analysis of data. It is the computer-assisted process of digging through and analyzing enormous sets of data that have either been compiled by the computer or have been inputted into the computer. In data mining, the computer will analyze the data and extract the meaning from it. It will also look for hidden patterns within the data and try to predict future behavior. Data Mining is mainly used to find and show relationships among the data.

    The purpose of data mining, also known as knowledge discovery, is to allow businesses to view these behaviors, trends and/or relationships and to be able to factor them within their decisions. This allows the businesses to make proactive, knowledge-driven decisions.

    The term "data mining" comes from the fact that the process of data mining, i.e. searching for relationships between data, is similar to mining and searching for precious materials. Data mining tools use artificial intelligence, machine learning, statistics, and database systems to find correlations between the data. These tools can help answer business questions that traditionally were too time consuming to resolve.

    Data Mining includes various steps, including the raw analysis step, database and data management aspects, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

    Example: Credit card companies have a history of your purchases from the past and know geographically where those purchases have been made. If all of a sudden some purchases are made in a city far from where you live, the credit card companies are put on alert to a possible fraud since their data mining shows that you don't normally make purchases in that city. Then, the credit card company can disable your card for that transaction or just put a flag on your card for suspicious activity.
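
The credit card example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any card issuer actually works: the purchase history, the city names, and the `min_share` threshold are all made up.

```python
from collections import Counter

def flag_unusual_purchases(history, new_purchases, min_share=0.05):
    """Flag purchases made in cities that are rare in the customer's history.

    history and new_purchases are lists of (city, amount) tuples.
    min_share is a hypothetical threshold: a city that accounts for less
    than this fraction of past purchases is treated as unusual.
    """
    city_counts = Counter(city for city, _ in history)
    total = sum(city_counts.values())
    flagged = []
    for city, amount in new_purchases:
        share = city_counts[city] / total if total else 0.0
        if share < min_share:
            flagged.append((city, amount))
    return flagged

# Made-up history: almost all purchases happen in the home city.
history = [("Pune", 40.0)] * 48 + [("Mumbai", 120.0)] * 2
new_purchases = [("Pune", 35.0), ("Reykjavik", 900.0)]
flagged = flag_unusual_purchases(history, new_purchases)
```

Here only the purchase in the city that never appears in the history ends up in `flagged`; what the issuer then does (block the transaction or just flag the card) is a separate business decision.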

    Data Warehousing: In contrast, data warehousing is completely different. However, data warehousing and data mining are interrelated. Data warehousing is the process of compiling information or data into a data warehouse. A data warehouse is a database used to store data. It is a central repository of data in which data from various sources is stored. This data warehouse is then used for reporting and data analysis. It can be used for creating trending reports for senior management reporting such as annual and quarterly comparisons.

    The purpose of a data warehouse is to provide flexible access to the data to the user. Data warehousing generally refers to the combination of many different databases across an entire enterprise.

    The main difference between data warehousing and data mining is that data warehousing is the process of compiling and organizing data into one common database, whereas data mining is the process of extracting meaningful data from that database. In this arrangement, data mining is carried out once the data has been compiled in the warehouse.

    Example: Facebook gathers all of your data (your friends, your likes, who you stalk, etc.) and then stores that data into one central repository. Even though Facebook most likely stores your friends, your likes, etc., in separate databases, they do want to take the most relevant and important information and put it into one central aggregated database.


    Que 2 - Explain briefly about Business Intelligence.

    Ans: Every business intelligence (BI) deployment has an underlying architecture. The BI architecture is much like the engine of a car: a necessary component, often powerful, but one that users, like drivers, don't always understand. For some companies new to business intelligence, the BI architecture may primarily be the operational systems and the BI front-end tools. For more mature BI deployments and particularly for enterprise customers, it will involve ETL (extract, transform, and load) tools, a data warehouse, data marts, BI front-end tools, and other such components.

    When IT discusses BI with users, we readily fall into techno babble, and senseless acronyms abound. Most car drivers know that cars have a battery, a transmission, a fuel tank: an adequate level of knowledge for having a conversation with a mechanic or salesperson but arguably not so much expertise to begin rebuilding an engine. In this chapter, then, I'll present the major architectural technical components that make up BI and that business users should have at least a high-level understanding of to participate in discussions about building and leveraging a BI solution. If you are a technical expert, you might find this chapter to be overly simplified and it is. If you are looking for a reference on any one of these components, consult the list of resources in Appendix B of Successful Business Intelligence.

    Operational and Source Systems

    Operational systems are the starting point for most quantitative data in a company. Operational systems may also be referred to as "transaction processing systems," "source systems," and "enterprise resource planning" (E

    tion order is entered in the manufacturing system.

    The quantity of raw material used and the finished product produced are recorded.

    Sales system: When a customer places an order, the order details are entered in an order entry system.

    Supply chain system: When the product is available, the product is shipped and order fulfillment details are entered.

    Accounting system: Accounting then invoices the customer and collects payment. The invoices and payments may be recorded in an operational system that is different from the order entry system.

    In each step in this process, users are creating data that can eventually be used for business intelligence. As well, to complete a task, operational users may need business intelligence. Perhaps in order to accept an order, the product must be available in inventory. As is the case with many online retailers, customers cannot place an order for a product combination (color, size) that is not available; a report immediately appears with a list of alternative sizes or colors.

    A better approach is to systematically transfer data between the systems or modules. However, even when data is systematically transferred, the Customer ID entered in the order system may not, for example, be the same Customer ID entered in the accounting system, even though both IDs refer to the same customer!
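
The Customer ID mismatch is usually bridged with some kind of mapping table. The following is a minimal sketch of that idea; the record layouts, the IDs, and the `id_crosswalk` dictionary are hypothetical and exist only for this example.

```python
# Hypothetical records: the same customer carries a different ID in each system.
orders = [
    {"order_customer_id": "C-1001", "order_total": 250.0},
    {"order_customer_id": "C-1002", "order_total": 80.0},
]
invoices = [
    {"acct_customer_id": "88017", "amount_paid": 250.0},
    {"acct_customer_id": "88020", "amount_paid": 50.0},
]

# A crosswalk (mapping table): order-system ID -> accounting-system ID.
id_crosswalk = {"C-1001": "88017", "C-1002": "88020"}

def outstanding_balances(orders, invoices, crosswalk):
    """Return the unpaid amount per order-system customer ID."""
    paid = {}
    for invoice in invoices:
        key = invoice["acct_customer_id"]
        paid[key] = paid.get(key, 0.0) + invoice["amount_paid"]
    ordered = {}
    for order in orders:
        key = order["order_customer_id"]
        ordered[key] = ordered.get(key, 0.0) + order["order_total"]
    # The crosswalk is what lets the two systems be compared at all.
    return {cid: total - paid.get(crosswalk[cid], 0.0)
            for cid, total in ordered.items()}

balances = outstanding_balances(orders, invoices, id_crosswalk)
```

Without the crosswalk, the order and accounting records could not be matched even though they describe the same customers.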

    Ideally, consistent information flows through the process seamlessly, Enterprise resource planning (E


    While much of the data warehouse is populated by operational systems, data may also come from additional data sources such as:

    Distributors who supply sales and inventory information.

    Click-stream data from web logs that show the most frequently viewed products or online shopping cart analysis for partially completed orders.

    Whether this additional data gets loaded into a central data warehouse will depend on how consistently it can be merged with corporate data, how common the requirement is, and politics. If the data is not physically stored in the data warehouse, it may be integrated with corporate data in a specific data mart. Disparate data sources may, in some cases, also be accessed or combined within the BI front-end tool.


    Que 3 - Explain the concepts of Data Integration and Transformation

    Ans:

    Data Integration

    Data Integration is the process of combining heterogeneous data sources into a single queryable schema so as to get a unified view of these data.

    Often large companies and enterprises maintain separate departmental databases to store the data pertaining to the specific department. Although such separations of the data provide them better manageability and security, performing any cross departmental analysis on these datasets becomes impossible.

    For example, if marketing department and sales department maintain two secluded databases, then it might not be possible to analyze the effect of a certain advertising campaign by the marketing department on sales of a product. Similarly, if HR department and production department maintain their individual databases, it might not be possible to analyze the correlation between yearly incentives and employees' productivity.

    Data integration provides a mechanism to integrate these data from different departments into a single queryable schema.

    Below is a list of examples where data integration is required. The list, however, is not comprehensive:

    Cross functional analysis - as discussed in the above example

    Finding correlation - statistical intelligence / scientific application

    Sharing information - legal or regulatory requirements, e.g. sharing customers' credit information among banks

    Maintaining single point of truth - higher management overseeing several departments may need to see a single picture of the business

    Merger of Business - after a merger, two companies want to aggregate their individual data assets

    Data integration can be done through 2 major approaches:

    Tight Coupling: Data Warehousing

    In the tight coupling approach, which is often implemented through data warehousing, data is pulled over from disparate sources into a single physical location through the process of ETL (Extraction, Transformation and Loading). The single physical location provides a uniform interface for querying the data. The ETL layer helps to map the data from the sources so as to provide a semantically uniform data warehouse.

    This approach is called tight coupling since in this approach the data is tightly coupled with the physical repository at the time of query.
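
A minimal sketch of the tight coupling idea, using Python's built-in `sqlite3` module as a stand-in for the warehouse. The two source row sets, their field names, and the `product_facts` table are assumptions invented for this example; a real ETL tool would read from actual departmental databases.

```python
import sqlite3

# Two hypothetical departmental sources with different field names and units.
marketing_rows = [{"prod": "P1", "campaign": "spring", "spend_usd": 500.0}]
sales_rows = [{"product_code": "p1", "units": 30, "revenue_cents": 450000}]

def extract():
    """Extract: read raw rows from each source (in-memory lists here)."""
    return marketing_rows, sales_rows

def transform(marketing, sales):
    """Transform: map both sources onto one schema (upper-case ID, USD)."""
    spend = {m["prod"].upper(): m["spend_usd"] for m in marketing}
    unified = []
    for s in sales:
        product_id = s["product_code"].upper()
        unified.append((product_id, s["units"],
                        s["revenue_cents"] / 100.0,
                        spend.get(product_id, 0.0)))
    return unified

def load(conn, rows):
    """Load: copy the unified rows into the single physical store."""
    conn.execute("CREATE TABLE product_facts (product_id TEXT, units INTEGER, "
                 "revenue_usd REAL, ad_spend_usd REAL)")
    conn.executemany("INSERT INTO product_facts VALUES (?, ?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(*extract()))
result = warehouse.execute(
    "SELECT product_id, revenue_usd, ad_spend_usd FROM product_facts").fetchall()
```

After loading, the cross-departmental question (revenue against advertising spend per product) is answered by one query against one physical location, which is the point of the approach.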

    Loose Coupling: Virtual Mediated Schema

    In contrast to the tight coupling approach, a virtual mediated schema provides an interface that takes the query input from the user, transforms the query into a form the source databases can understand, and then sends the query directly to the source databases to obtain the result. In this approach, the data does not really reside in the schema and remains only in the actual source databases. However, the mediated schema contains several "adapters" or "wrappers" that can connect back to the source systems in order to bring the data to the front end. This approach is often implemented through middleware architecture (EAI).
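
The wrapper idea can be sketched as follows. All class names (`SalesWrapper`, `MarketingWrapper`, `MediatedSchema`) and the source layouts are hypothetical; the only point is that the mediator stores no data and forwards each query to the sources.

```python
class SalesWrapper:
    """Wrapper for a hypothetical sales source that stores rows as dicts."""
    def __init__(self, rows):
        self._rows = rows

    def fetch(self, product_id):
        # Translate the mediated query into this source's own field names.
        return [{"product_id": r["product_code"], "source": "sales",
                 "value": r["units"]}
                for r in self._rows if r["product_code"] == product_id]

class MarketingWrapper:
    """Wrapper for a hypothetical marketing source that stores rows as tuples."""
    def __init__(self, rows):
        self._rows = rows

    def fetch(self, product_id):
        return [{"product_id": pid, "source": "marketing", "value": spend}
                for pid, spend in self._rows if pid == product_id]

class MediatedSchema:
    """Holds no data itself; forwards each query to the source wrappers."""
    def __init__(self, wrappers):
        self._wrappers = wrappers

    def query(self, product_id):
        results = []
        for wrapper in self._wrappers:
            results.extend(wrapper.fetch(product_id))
        return results

mediator = MediatedSchema([
    SalesWrapper([{"product_code": "P1", "units": 30},
                  {"product_code": "P2", "units": 5}]),
    MarketingWrapper([("P1", 500.0)]),
])
answer = mediator.query("P1")
```

Because every query goes back to the sources, results are always current, at the cost of depending on each source being reachable at query time.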


    Data Transformation

    In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:

    1. Smoothing, which works to remove the noise from data. Such techniques include binning, clustering, and regression.

    2. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.

    3. Generalization of the data, where low level or "primitive" (raw) data are replaced by higher level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher level concepts, like city or county. Similarly, values for numeric attributes, like age, may be mapped to higher level concepts, like young, middle-aged, and senior.

    4. Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0 to 1.0.

    5. Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.
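
The first three transformations in the list can be sketched in Python. This is an illustrative sketch only: smoothing is shown as smoothing by bin means (one of several binning variants), and the bin size, the sample data, and the age cut-offs of 30 and 60 are arbitrary assumptions.

```python
from collections import defaultdict

def smooth_by_bin_means(values, bin_size):
    """Smoothing by binning: sort, split into equal-size bins,
    and replace each value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

def aggregate_monthly(daily_sales):
    """Aggregation: roll daily (YYYY-MM-DD, amount) records up to monthly totals."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        totals[date[:7]] += amount  # "YYYY-MM" is the month key
    return dict(totals)

def generalize_age(age):
    """Generalization: map a numeric age to a higher-level concept.
    The cut-offs are arbitrary choices for this sketch."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

smoothed = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3)
monthly = aggregate_monthly([("2019-01-05", 100.0), ("2019-01-20", 50.0),
                             ("2019-02-01", 70.0)])
labels = [generalize_age(a) for a in (22, 45, 71)]
```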

    Smoothing is a form of data cleaning. Aggregation and generalization also serve as forms of data reduction. In this section, we therefore discuss normalization and attribute construction.

    An attribute is normalized by scaling its values so that they fall within a small specified range, such as 0 to 1.0.

    Normalization is particularly useful for classification algorithms involving neural networks, or distance measurements such as nearest-neighbor classification and clustering. If using the neural network back-propagation algorithm for classification mining, normalizing the input values for each attribute measured in the training samples will help speed up the learning phase. For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes).

    There are many methods for data normalization. We study three: min-max normalization, z-score normalization, and normalization by decimal scaling.

    Min-max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_minA, new_maxA] by computing

    v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

    Min-max normalization preserves the relationships among the original data values. It will encounter an "out of bounds" error if a future input case for normalization falls outside of the original data range for A.
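
The three normalization methods named above can be sketched as short Python functions. The text only spells out min-max normalization; the z-score and decimal scaling functions below follow the usual textbook definitions, which is an assumption on my part, and all sample values are made up. The functions assume maxA differs from minA and that the standard deviation is not zero.

```python
import math

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    if not min_a <= v <= max_a:
        # The "out of bounds" case: v lies outside the original range.
        raise ValueError(f"{v} is outside the original range [{min_a}, {max_a}]")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(values):
    """Z-score normalization: subtract the mean and divide by the
    (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    return [(x - mean) / std for x in values]

def decimal_scaling(values):
    """Decimal scaling: divide by 10**j, where j is the smallest
    integer such that every scaled absolute value is below 1."""
    largest = max(abs(x) for x in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in values]

income_scaled = min_max(73600, 12000, 98000)        # roughly 0.716
z_values = z_score([2, 4, 4, 4, 5, 5, 7, 9])        # mean 5, std 2
scaled = decimal_scaling([-986, 917])               # divides by 1000
```

Calling `min_max(100000, 12000, 98000)` raises `ValueError`, which mirrors the out-of-bounds situation described for future inputs.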


    Que 4 - Differentiate between database management systems (DBMS) and data mining.

    Ans: A DBMS (Database Management System) is a complete system used for managing digital databases that allows storage of database content, creation/maintenance of data, search and other functionalities. On the other hand, Data Mining is a field in computer science, which deals with the extraction of previously unknown and interesting information from raw data. Usually, the data used as the input for the Data mining process is stored in databases. Users who are inclined toward statistics use Data Mining. They utilize statistical models to look for hidden patterns in data. Data miners are interested in finding useful relationships between different data elements, which is ultimately profitable for businesses.
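
The contrast can be illustrated with a small sketch. The first part uses Python's built-in `sqlite3` module in the DBMS role (storing data and answering a question we already know how to ask); the second part does a very simple form of mining by counting which items are bought together. The `purchases` table and its contents are invented for this example, and pair counting is only a toy stand-in for real mining algorithms.

```python
import sqlite3
from collections import Counter
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (basket_id INTEGER, item TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", [
    (1, "bread"), (1, "butter"), (1, "milk"),
    (2, "bread"), (2, "butter"),
    (3, "milk"), (3, "tea"),
])

# DBMS role: retrieve exactly what the query asks for.
bread_baskets = [row[0] for row in conn.execute(
    "SELECT basket_id FROM purchases WHERE item = 'bread' ORDER BY basket_id")]

# Data mining role: look for a relationship nobody asked about explicitly,
# here the pair of items that occurs together most often.
baskets = {}
for basket_id, item in conn.execute("SELECT basket_id, item FROM purchases"):
    baskets.setdefault(basket_id, set()).add(item)

pair_counts = Counter()
for items in baskets.values():
    pair_counts.update(combinations(sorted(items), 2))

most_common_pair, support = pair_counts.most_common(1)[0]
```

Note that the mining step still reads its input from the database, which matches the observation that the data used for mining is usually stored in a DBMS.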

    DBMS

    DBMS, sometimes just called a database manager, is a collection of computer programs that is dedicated for the management (i.e. organization, storage and retrieval) of all databases that are installed in a system (i.e. hard drive or network). There are different types of Database Management Systems existing in the world, and some of them are designed for the proper management of databases configured for specific purposes. Most popular commercial Database Management Systems are Oracle, DB2 and Microsoft Access. All these products provide means of allocation of different levels of privileges for different users, making it possible for a DBMS to be controlled centrally by a single administrator or to be allocated to several different people. There are four important elements in any Database Management System. They are the modeling language, data structures, query language and mechanism for transactions. The modeling language defines the language of each database hosted in the DBMS. Currently several popular approaches like hierarchical, network, relational and object are in practice. Data structures help organize the data such as individual records, files, fields and their definitions and objects such as visual media. Data query language maintains the security of the database by monitoring login data, access rights to different users, and protocols to add data to the system. SQL is a popular query language that is used in


    Que 5 - Differentiate between K-means and Hierarchical clustering

    Ans: K-means clustering

    Algorithm

    Split the data into k random clusters


    deterministic - cannot correct early "mistakes"

    K-means:

    computationally efficient -> large data sets

    predefined no. of clusters

    non-deterministic -> should be run several times

    iterative improvement
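
The K-means properties above can be seen in a small pure-Python sketch. Only the first step (a random split into k clusters) appears in the text; the remaining steps here follow the standard K-means loop (recompute centroids, reassign points, repeat until nothing changes), which is my assumption about the omitted part. The helper names, the sample points, and the `runs` and `seed` parameters are hypothetical. The sketch assumes at least k points.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))

def kmeans(points, k, rng, max_iter=100):
    """One K-means run; returns (assignment, cost)."""
    # Split the data into k random clusters (shuffle, then deal round-robin).
    order = list(range(len(points)))
    rng.shuffle(order)
    assignment = [0] * len(points)
    for position, index in enumerate(order):
        assignment[index] = position % k
    centroids = [None] * k
    for _ in range(max_iter):
        # Update step: recompute each centroid from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = mean_point(members)
        # Assignment step: move each point to its nearest centroid.
        new_assignment = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:
            break  # iterative improvement has converged
        assignment = new_assignment
    cost = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, cost

def best_kmeans(points, k, runs=10, seed=0):
    """K-means is non-deterministic, so run it several times and keep the best."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(runs)),
               key=lambda result: result[1])

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
assignment, cost = best_kmeans(points, k=2)
```

The cluster labels themselves (0 or 1) are arbitrary; what matters is that the first three points share one label and the last three share the other.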

    Hierarchical k-means: top-down hierarchical clustering using k-means iteratively with k=2 -> best of both worlds.
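
A rough sketch of hierarchical k-means as described: each cluster is split in two with k-means (k=2), and the split is repeated top-down. The function names, the `min_size` stopping rule, and the choice to seed each split with two randomly picked points are assumptions made for this example, not part of the original answer.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def two_means(points, rng, max_iter=50):
    """Split points into two groups with k-means (k=2)."""
    centroids = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(max_iter):
        new_groups = ([], [])
        for p in points:
            nearest = 0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
            new_groups[nearest].append(p)
        if new_groups == groups:
            break
        groups = new_groups
        # Recompute centroids; an empty group keeps its old centroid.
        centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return groups

def hierarchical_kmeans(points, rng, min_size=2):
    """Top-down clustering: recursively bisect with two_means.

    Returns a nested tuple tree whose leaves are lists of points.
    Splitting stops when a cluster has at most min_size points.
    """
    if len(points) <= min_size:
        return points
    left, right = two_means(points, rng)
    if not left or not right:  # could not be split any further
        return points
    return (hierarchical_kmeans(left, rng, min_size),
            hierarchical_kmeans(right, rng, min_size))

rng = random.Random(0)
tree = hierarchical_kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], rng)
```

For these four points the tree has two leaves, one per tight pair; the order of the two branches depends on the random seeding.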


    Que 6 - Differentiate between Web content mining and Web usage mining.

    Ans: Web Content Mining

    Web content mining targets knowledge discovery in which the main objects are the traditional collections of multimedia documents such as images, video, and audio, which are embedded in or linked to the web pages.

    It is also quite different from Data mining because Web data are mainly semi-structured and/or unstructured, while Data mining deals primarily with structured data. Web content mining is also different from Text mining because of the semi-structured nature of the Web, while Text mining focuses on unstructured texts. Web content mining thus requires creative applications of Data mining and/or Text mining techniques and also its own unique approaches. In the past few years, there was a rapid expansion of activities in the Web content mining area. This is not surprising because of the phenomenal growth of Web content and the significant economic benefit of such mining. However, due to the heterogeneity and the lack of structure of Web data, automated discovery of targeted or unexpected knowledge still presents many challenging research problems.

    Web content mining could be differentiated from two points of view: the Agent-based approach or the Database approach. The first approach aims at improving information finding and filtering. The second approach aims at modeling the data on the Web into a more structured form in order to apply standard database querying mechanisms and data mining applications to analyze it.

    Web Content Mining Problems/Challenges

    Data/Information Extraction: Extraction of structured data from Web pages, such as products and search results, is a difficult task. Extracting such data allows one to provide services. Two main types of techniques, machine learning and automatic extraction, are used to solve this problem.

    Web Information Integration and Schema Matching: Although the Web contains a huge amount of data, each web site (or even page) represents similar information differently. Identifying or matching semantically similar data is a very important problem with many practical applications.

    Opinion extraction from online sources: There are many online opinion sources, e.g., customer reviews of products, forums, blogs and chat rooms. Mining opinions (especially consumer opinions) is of great importance for marketing intelligence and product benchmarking.

    Knowledge synthesis: Concept hierarchies or ontologies are useful in many applications. However, generating them manually is very time consuming. A few existing methods that explore the information redundancy of the Web will be presented. The main application is to synthesize and organize the pieces of information on the Web to give the user a coherent picture of the topic domain.

    Segmenting Web pages and detecting noise: In many Web applications, one only wants the main content of the Web page without advertisements, navigation links, copyright notices. Automatically segmenting a Web page to extract the main content of the page is an interesting problem.

    All these tasks, and their solutions, present major research challenges.
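
To make the first challenge above (data/information extraction) concrete, here is a minimal hand-written extractor built on Python's standard `html.parser` module. It is neither of the two technique families named in the text; it only shows what "structured data from a Web page" means. The sample page and the `name` and `price` class names are assumptions, and real pages rarely mark up their data this cleanly, which is exactly why the task is hard.

```python
from html.parser import HTMLParser

class ProductExtractor(HTMLParser):
    """Collects (name, price) pairs from elements whose class attribute
    is 'name' or 'price' (hypothetical markup for this sketch)."""

    def __init__(self):
        super().__init__()
        self._field = None      # field we expect the next text node to fill
        self._current = {}      # partially built product record
        self.products = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field and data.strip():
            self._current[self._field] = data.strip()
            self._field = None
            if len(self._current) == 2:  # both name and price seen
                self.products.append((self._current["name"],
                                      float(self._current["price"])))
                self._current = {}

page = """
<ul>
  <li><span class="name">Kettle</span> <span class="price">19.99</span></li>
  <li><span class="name">Toaster</span> <span class="price">34.50</span></li>
</ul>
"""
extractor = ProductExtractor()
extractor.feed(page)
```

After `feed`, `extractor.products` holds one (name, price) tuple per list item, i.e. the page content in a structured form that database or data mining tools can work with.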


    Web Usage Mining

    Web Usage Mining focuses on techniques that could predict the behavior of users while they are interacting with the WWW. Web usage mining, which discovers user navigation patterns from web data, tries to discover useful information from the secondary data derived from the interactions of users while they surf the Web. Web usage mining collects the data from Web log records to discover user access patterns of web pages. There are several available research projects and commercial tools that analyze those patterns for different purposes. The resulting knowledge could be utilized in personalization, system improvement, site modification, business intelligence and usage characterization.
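
A minimal sketch of discovering access patterns from log records. The log format used here ("user timestamp path") is a simplification assumed for the example; real web server logs carry more fields, and users usually have to be identified from IP addresses or cookies first.

```python
from collections import Counter, defaultdict

# Made-up log lines in a simplified "user timestamp path" format.
log_lines = [
    "u1 1 /home", "u1 2 /products", "u1 3 /cart",
    "u2 1 /home", "u2 2 /products", "u2 3 /home",
    "u3 1 /home", "u3 2 /products", "u3 3 /cart",
]

def navigation_paths(lines):
    """Group page requests by user, ordered by timestamp."""
    visits = defaultdict(list)
    for line in lines:
        user, timestamp, path = line.split()
        visits[user].append((int(timestamp), path))
    return {user: [path for _, path in sorted(hits)]
            for user, hits in visits.items()}

def transition_counts(paths):
    """Count how often one page is followed directly by another."""
    counts = Counter()
    for pages in paths.values():
        counts.update(zip(pages, pages[1:]))
    return counts

paths = navigation_paths(log_lines)
transitions = transition_counts(paths)
```

The most frequent transitions (here, from `/home` to `/products`) are the kind of access pattern that could feed personalization or site modification decisions.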

    The only information left behind by many users visiting a Web site is the path through the pages they have accessed. Most of the Web information retrieval tools only use the textual information, while they ignore the link information that could be very valuable. In general, there are mainly four kinds of data mining techniques applied to the web mining domain to discover the user navigation pattern:

    Association