    (Weather Analysis and Prediction)

    End Semester MINI PROJECT report submitted in partial fulfilment of the
    requirements for the completion of the seventh semester of the
    UNDER GRADUATE PROGRAM in Electronics and
    Communication Technology (B.Tech in ECE).


    Gaurav Satish    Akash Kumar     Salil Kumar
    (IEC2012021)     (IEC2012033)    (IEC2012049)

    Vatsal Mishra    Varsheindra Gautam
    (IEC2012068)     (IEC2012071)

      Under the Supervision of

      Dr. Satish Kumar Singh and Dr. Rajat Kumar Singh

      Indian Institute of Information Technology, Allahabad

      November, 2015

    We hereby declare that the work presented in this project report entitled
    "BIG DATA ANALYTICS (Weather Analysis and Prediction)", submitted
    in partial fulfilment of the requirements for the completion of the seventh
    semester of the UNDER GRADUATE PROGRAM (B.Tech in ECE), is an authenticated
    record of our original work carried out from July 2015 to November 2015
    under the guidance of Dr. Satish Kumar Singh and Dr. Rajat Kumar
    Singh. Due acknowledgements have been made in the text to all other
    material used. The project was done in full compliance with the requirements
    and constraints of the prescribed curriculum.

    Place: Allahabad

    Date: 18/11/2015

    Gaurav Satish (IEC2012068)


    CERTIFICATE FROM THE SUPERVISOR

    I do hereby recommend that the mini project report prepared under my
    supervision by Gaurav Satish, Akash

     ACKNOWLEDGEMENT

    We owe a special debt of gratitude to Dr. Satish Kumar Singh and Dr. Rajat
    Kumar Singh for their constant support and guidance throughout the course
    of our work. Their sincerity, thoroughness and perseverance have been a
    constant source of inspiration for us. It is only because of their cognizant
    efforts that our endeavours have seen the light of day.

    TABLE OF CONTENTS

    1. Introduction

    2. Motivation

    3. Problem definition and scope

    4. Literature survey and analysis of recent similar work

    5. Approach and proposed methodology

    6. Hardware and software requirements

    7. References


    INTRODUCTION

    We live in the data age. It is not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the "digital universe" at 0.18 zettabytes in 2006 and forecast a tenfold growth to 1.8 zettabytes by 2011. A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That is roughly the same order of magnitude as one disk drive for every person in the world. Online searches, store purchases, Facebook posts, Tweets or Foursquare check-ins, cell phone usage, etc. are creating a flood of data that, when organized, categorized and analyzed, reveals trends and habits about ourselves and society at large.

    This flood of data is coming from many sources. Consider the following:

    • The New York Stock Exchange generates about 1 terabyte of new trade data per day.

    • Facebook hosts approximately 10 billion photos, taking up 1 petabyte of storage.

    • Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.

    • The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.

    • The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.

    Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. Big Data refers to the explosion in the quantity (and sometimes, quality) of available and potentially relevant data, largely the result of recent and unprecedented advancements in data recording and storage technology.

    To define big data in competitive terms, we must think about what it takes
    to compete in the business world. Big data is traditionally characterized as a
    rushing river: large amounts of data flowing at a rapid pace. To be
    competitive with customers, big data creates products which are valuable
    and unique. To be competitive with suppliers, big data is freely available
    with no obligations or constraints. To be competitive with new entrants, big
    data is difficult for newcomers to try. To be competitive with substitutes, big
    data creates products which preclude other products from satisfying the
    same need.

    MOTIVATION

    The use of big data will become a key basis of competition and growth for individual firms. From the standpoint of competitiveness and the potential capture of value, all companies need to take big data seriously. In most industries, established competitors and new entrants alike will leverage data-driven strategies to innovate, compete, and capture value from deep and up-to-real-time information. Indeed, we found early examples of such use of data in every sector we examined.

    The use of big data will underpin new waves of productivity growth and consumer surplus. For example, we estimate that a retailer using big data to the full has the potential to increase its operating margin by more than 60 percent. Big data offers considerable benefits to consumers as well as to companies and organizations. For instance, services enabled by personal-location data can allow consumers to capture $600 billion in economic surplus.


    While the use of big data will matter across sectors, some sectors are set for greater gains. We compared the historical productivity of sectors in the United States with the potential of these sectors to capture value from big data (using an index that combines several quantitative metrics), and found that the opportunities and challenges vary from sector to sector. The computer and electronic products and information sectors, as well as finance and insurance, and government are poised to gain substantially from the use of big data.

    There will be a shortage of the talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

    Several issues will have to be addressed to capture the full potential of big
    data. Policies related to privacy, security, intellectual property, and even
    liability will need to be addressed in a big data world. Organizations need
    not only to put the right talent and technology in place but also to structure
    workflows and incentives to optimize the use of big data. Access to data is
    critical: companies will increasingly need to integrate information from
    multiple data sources, often from third parties, and the incentives have to
    be in place to enable this.

    What matters when dealing with Big Data:

    • Smart sampling of data

    • Reducing the original data while not losing the statistical properties of the data

    • Finding similar terms

    • Efficient multi-dimensional indexing

    • Incremental updating of models (crucial for streaming data)

    • Distributed linear algebra (dealing with large sparse matrices)
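
As an illustration of the "incremental updating of models" point, the following is a minimal sketch, not part of the project code, of a summary that is updated one observation at a time without storing the stream. It uses Welford's method for the running mean and variance; the class name `RunningStats` and the temperature readings are made up for this example.

```python
class RunningStats:
    """Incrementally updated mean and variance (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new observation into the summary in O(1) time."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for reading in [31.0, 33.5, 35.0, 32.5]:  # hypothetical temperature readings
    stats.update(reading)
print(stats.n, stats.mean, stats.variance)
```

Because each update touches only three numbers, the same summary can be kept per station or per year while records stream in.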

    In this project we deal with weather prediction, and we follow the approach outlined here:

    Fig. 1: MapReduce logical data flow


    Hadoop Block Diagram:-

    Fig. 2: Hadoop block diagram

    MapReduce Block Diagram:-

    Fig. 3: MapReduce diagram

    Curve Fitting:-

    Curve fitting captures the trend in the data by assigning a single function across the entire range. The example here uses a straight-line function.

    A straight line is described generically by f(x) = ax + b. The goal is to identify the coefficients 'a' and 'b' such that f(x) 'fits' the data well.
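
The straight-line case can be sketched in a few lines of Python. This is an illustrative example under the usual least-squares criterion (minimising the sum of squared vertical distances), which the report does not state explicitly at this point; the function name `fit_line` and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) minimising sum((y - (a*x + b))**2) over the points."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b


years = [0, 1, 2, 3, 4]                 # years counted from a base year
temps = [40.0, 40.5, 41.0, 41.5, 42.0]  # made-up, exactly linear data
a, b = fit_line(years, temps)
print(a, b)
```

Counting years from a base year rather than using raw calendar years keeps the sums small and the arithmetic well behaved.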

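    Polynomial Curve Fitting:-

    Generalizing from a straight line (i.e., a first degree polynomial) to a k-th degree polynomial

    \[ y = a_0 + a_1 x + \cdots + a_k x^k, \]

    the residual is given by

    \[ R^2 = \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \cdots + a_k x_i^k \right) \right]^2 . \]

    The partial derivatives (dropping the index i on the summed quantities) are

    \[ \frac{\partial (R^2)}{\partial a_0} = -2 \sum \left[ y - \left( a_0 + a_1 x + \cdots + a_k x^k \right) \right] = 0, \]
    \[ \frac{\partial (R^2)}{\partial a_1} = -2 \sum \left[ y - \left( a_0 + a_1 x + \cdots + a_k x^k \right) \right] x = 0, \]
    \[ \vdots \]
    \[ \frac{\partial (R^2)}{\partial a_k} = -2 \sum \left[ y - \left( a_0 + a_1 x + \cdots + a_k x^k \right) \right] x^k = 0. \]

    These lead to the equations

    \[ a_0 n + a_1 \sum x + \cdots + a_k \sum x^k = \sum y, \]
    \[ a_0 \sum x + a_1 \sum x^2 + \cdots + a_k \sum x^{k+1} = \sum x y, \]
    \[ \vdots \]
    \[ a_0 \sum x^k + a_1 \sum x^{k+1} + \cdots + a_k \sum x^{2k} = \sum x^k y, \]

    or, in matrix form,

    \[ \begin{bmatrix} n & \sum x & \cdots & \sum x^k \\ \sum x & \sum x^2 & \cdots & \sum x^{k+1} \\ \vdots & \vdots & \ddots & \vdots \\ \sum x^k & \sum x^{k+1} & \cdots & \sum x^{2k} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix} = \begin{bmatrix} \sum y \\ \sum x y \\ \vdots \\ \sum x^k y \end{bmatrix}. \]

    This system is built from a Vandermonde matrix. We can also obtain the matrix for a least squares fit by writing

    \[ \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ 1 & x_2 & \cdots & x_2^k \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}. \]

    Premultiplying both sides by the transpose of the first matrix then gives

    \[ \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \\ \vdots & \ddots & \vdots \\ x_1^k & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} a_0 \\ \vdots \\ a_k \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \\ \vdots & \ddots & \vdots \\ x_1^k & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \]

    which, once the products are carried out, is the matrix form of the equations obtained from the partial derivatives.

    As before, given n points (x_i, y_i) and fitting with polynomial coefficients a_0, ..., a_k gives

    \[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^k \\ 1 & x_2 & x_2^2 & \cdots & x_2^k \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix}. \]

    In matrix notation, the equation for a polynomial fit is given by

    \[ \mathbf{y} = \mathbf{X} \mathbf{a}. \]

    This can be solved by premultiplying by the transpose \mathbf{X}^T,

    \[ \mathbf{X}^T \mathbf{y} = \mathbf{X}^T \mathbf{X} \mathbf{a}. \]

    This matrix equation can be solved numerically, or can be inverted directly if it is well formed, to yield the solution vector

    \[ \mathbf{a} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y}. \]

    Setting k = 1 in the above equations reproduces the linear solution.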
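
The normal-equation route can be sketched in plain Python as follows. This is an illustrative implementation, not the project's code: it builds the matrix of power sums and solves the system by Gaussian elimination with partial pivoting. The function names `polyfit_normal` and `evaluate`, the singularity threshold, and the sample data are assumptions made for this example.

```python
def polyfit_normal(xs, ys, k):
    """Least-squares coefficients [a_0, ..., a_k] via the normal equations."""
    if len(xs) != len(ys) or len(xs) <= k:
        raise ValueError("need more than k points, with matching x and y")
    # Power sums give the entries of X^T X and X^T y directly.
    s = [sum(x ** p for x in xs) for p in range(2 * k + 1)]
    t = [sum((x ** p) * y for x, y in zip(xs, ys)) for p in range(k + 1)]
    # Augmented matrix [X^T X | X^T y].
    m = [[s[r + c] for c in range(k + 1)] + [t[r]] for r in range(k + 1)]
    # Gaussian elimination with partial pivoting.
    for col in range(k + 1):
        pivot = max(range(col, k + 1), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:  # assumed threshold for "singular"
            raise ValueError("normal equations are singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k + 1):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 2):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    a = [0.0] * (k + 1)
    for r in range(k, -1, -1):
        acc = m[r][k + 1] - sum(m[r][c] * a[c] for c in range(r + 1, k + 1))
        a[r] = acc / m[r][r]
    return a


def evaluate(coeffs, x):
    """Evaluate a_0 + a_1 x + ... + a_k x^k using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


xs = [0, 1, 2, 3, 4]
ys = [1 + 2 * x + 3 * x * x for x in xs]  # made-up, exactly quadratic data
coeffs = polyfit_normal(xs, ys, 2)
print(coeffs, evaluate(coeffs, 5))
```

The matrix X^T X becomes badly conditioned when x holds raw calendar years or when k is large, so in practice it helps to shift the x values (for example, years counted from the first year) before fitting.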


    Analyzing big data allows analysts, researchers, and business users to make better and faster decisions using data that was previously inaccessible or unusable.

    Processing large volumes of data has been around for decades (such as in weather, astronomy, and energy applications). It required specialized and expensive hardware (supercomputers), software, and developers with distinct programming and analytical skills. In the 1980s, the database management systems of IBM, Oracle, Cullinet, and Sybase could have been viewed as Big Data tools of that era. But they were not designed to handle the unforeseen explosion of data brought on by the Internet, mobile communications, and sensor networks.

    The Internet companies were the first to be hit by the "data tsunami". Their needs were so pressing that Google, Facebook, Yahoo, eBay, and Twitter developed their own database infrastructures and technologies. The popular Hadoop Big Data application, now maintained as an open source project by the Apache Software Foundation, is the most prominent example. Yahoo has been the largest contributor to the project and has launched Hortonworks to commercialize its Hadoop implementation. Facebook is also a prominent Hadoop user.

    If the first decade of the 21st century belonged to the Internet revolution, social media and cloud computing, then the second decade is surely going to be the decade of Big Data analytics. While everyone is busy talking about how Big Data can revolutionize the way businesses compete, there is another interesting angle from which to look at this revolution: the evolution of data analytics over the past decades.

    Collection of data by states started centuries ago, and that is how the name statistics was derived: from the Latin word status, the Italian word statista, or the German word Statistik, each of which means a political state. The states collected data to calculate the manpower available and to decide taxes (based on property and wealth owned by citizens). This is the earliest origin of what is known today as 'data analysis'. Data analytics has evolved a lot over the years, and Big Data analytics is the latest stage in this evolution. Today Big Data has become the most talked about technological phenomenon after cloud.

    The first data systems were designed with the goal of accurately capturing transactions without losing any data. The relational database systems used for operational data storage could be categorized as the first wave of data management and analysis. Let's call this Data Stack 1.0. The data was queried using SQL, and the architecture of these systems consisted of application logic written on top of the relational databases, with a presentation layer for generating static reports which would then be analyzed by the analysts. The focus of these systems remained on capturing the transactional data accurately and storing it efficiently for all mission critical systems. However, the analytical capabilities of these systems were limited.

    APPROACH AND PROPOSED METHODOLOGY


    After a careful study of the MapReduce techniques, we will use the following approach to implement the temperature problem.

    1. Setting up virtual machines with Ubuntu OS.

    2. Setting up the network so that they are connected to each other and
    can ping each other.

    3. Installing and configuring the Hadoop cluster on these machines.

    4. Finding the appropriate datasets.

    5. Running an openly available example job on this cluster with the data set to
    check the configuration and get an estimate of run time.

    6. Implementing the max/min techniques one by one in Java using the Hadoop
    MapReduce API.

    Fig.: MapReduce logical data flow

    The types of tools typically used in a Big Data scenario:-

    • Where the processing is hosted: distributed server/cloud

    • Where the data is stored: distributed storage (e.g. Amazon S3)

    • What the programming model is: distributed processing (MapReduce)

    • How the data is stored and indexed: high performance schema-free database

    • What operations are performed on the data: analytic/semantic processing


    Programming Model:-

    Input and output: each a set of key/value pairs. The programmer specifies two functions:

    map (in_key, in_value) -> list(out_key, intermediate_value)

    • Processes an input key/value pair

    • Produces a set of intermediate pairs

    reduce (out_key, list(intermediate_value)) -> list(out_value)

    • Combines all intermediate values for a particular key

    • Produces a set of merged output values (usually just one)
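
The two function signatures above can be exercised without a cluster. The following is a minimal single-process sketch in Python, not the Hadoop Java API the project targets, that computes the maximum temperature per year. The record format ("year, temperature"), the function names, and the sample values are all hypothetical.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: parse one 'year, temperature' record and emit (year, temperature)."""
    year, temp = line.split(",")
    yield year.strip(), float(temp)


def reduce_fn(year, temps):
    """Reduce: merge all temperatures for one year into a single maximum."""
    yield max(temps)


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, inter_value in mapper(key, value):
            groups[out_key].append(inter_value)  # shuffle: group by out_key
    return {k: list(reducer(k, v)) for k, v in sorted(groups.items())}


# Input pairs are (line offset, line text), mirroring a text input split.
records = list(enumerate([
    "2001, 38.5", "2001, 41.2", "2002, 39.9", "2002, 42.0", "2001, 40.1",
]))
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)
```

Swapping `max` for `min` in `reduce_fn` gives the minimum-temperature variant; on a real cluster the grouping step is performed by the framework between the map and reduce phases.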

      (Fig.) Maximum temperature curve of Berkeley

      Year    Predicted Maximum Temperature
      2014    42.778537
      2015    42.533997
      2016    42.258980
      2017    41.951812
      2018    41.610778
      2019    41.234116
      2020    40.820017
      2021    40.366625
      2022    39.872041
      2023    39.334313


      (Fig.) Maximum temperature curve of Delhi

      Year    Predicted Maximum Temperature
      2001    41.830555
      2002    42.268227
      2003    42.863457
      2004    43.644936
      2005    44.644318
      2006    45.895973
      2007    47.437233
      2008    49.308640
      2009    51.553875
      2010    54.219872

    HARDWARE AND SOFTWARE REQUIREMENTS


    Processor - Core i5

    Speed - 2.43 GHz

    RAM - 4 GB

    Hard Disk - 80 GB

