Introduction to Big Data. World Cup soccer 2014.07.05 (Money Today) : IoT + Bigdata German soccer...

Transcript of Introduction to Big Data. World Cup soccer 2014.07.05 (Money Today) : IoT + Bigdata German soccer...

  • Slide 1
  • Introduction to Big Data
  • Slide 2
  • World Cup soccer 2014.07.05 (Money Today) : IoT + Bigdata German soccer Team
  • Slide 3
  • What is big data? Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
  • Slide 4
  • Big Data is Everywhere! Lots of data is being collected and warehoused: Web data and e-commerce; purchases at department/grocery stores; bank/credit card transactions; social networks.
  • Slide 5
  • Slide 6
  • How much data? Google processes 20 PB a day (2008). Wayback Machine has 3 PB + 100 TB/month (3/2009). Facebook has 2.5 PB of user data + 15 TB/day (4/2009). eBay has 6.5 PB of user data + 50 TB/day (5/2009). "640K ought to be enough for anybody."
  • Slide 7
  • What does big data do?
  • Slide 8
  • Government: In 2012, the Obama administration announced the Big Data Research and Development Initiative, which explored how big data could be used to address important problems faced by the government. The initiative was composed of 84 different big data programs spread across six departments. Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign. The United States Federal Government owns six of the ten most powerful supercomputers in the world. The Utah Data Center is a data center currently being constructed by the United States National Security Agency. When finished, the facility will be able to handle yottabytes of information collected by the NSA over the Internet.
  • Slide 9
  • Business: Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 it had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB. Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress. Facebook handles 50 billion photos from its user base. The FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts worldwide. The volume of business data worldwide, across all companies, doubles every 1.2 years, according to estimates. Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work at various times of the day.
  • Slide 10
  • Examples of free sites for using big data: Google Trends, Google Flu Trends, Google Correlate, Social Metrics Insight
  • Slide 11
  • Big data in Google Trends
  • Slide 12
  • Big data case: movement of carts and product display
  • Slide 13
  • Wildfires in Korea (1991-2011)
  • Slide 14
  • Google Flu service
  • Slide 15
  • Find a location for your business
  • Slide 16
  • Crime mapping in San Francisco: 71% accuracy
  • Slide 17
  • Similar names for big data: data science, business analytics, data analytics, data mining, business intelligence, machine learning
  • Slide 18
  • Slide 19
  • Slide 20
  • Case 1: A big data analysis case, Market Basket Analysis (MBA)
  • Slide 21
  • 1). POS Data (1000 transactions): bananas plums, lettuce, tomatoes celery, bean bean apples, carrots, tomatoes, potatoes potatoes bean carrots bean apples, oranges, lettuce, tomatoes peaches, oranges, celery, potatoes, bean beans oranges, lettuce, carrots, tomatoes apples, bananas, plums, carrots, tomatoes, onions, bean apples, potatoes lettuce, peas, beans.
  • Slide 22
  • 2). Association Rules as Output (Model). Only 55 rules satisfy the specified constraints.
      tomatoes -> lettuce [Coverage=0.263 (263); Support=0.111 (111); Strength=0.422; Lift=1.94; Leverage=0.0539 (53.9); p=2.35E-019]
      lettuce -> tomatoes [Coverage=0.217 (217); Support=0.111 (111); Strength=0.512; Lift=1.94; Leverage=0.0539 (53.9); p=2.35E-019]
      tomatoes -> carrots [Coverage=0.263 (263); Support=0.085 (85); Strength=0.323; Lift=1.85; Leverage=0.0390 (39.0); p=1.83E-012]
      carrots -> tomatoes [Coverage=0.175 (175); Support=0.085 (85); Strength=0.486; Lift=1.85; Leverage=0.0390 (39.0); p=1.83E-012]
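The measures reported for each rule can be reproduced from four counts. The following is a minimal Python sketch, not the tool used for the slides; the function name `rule_metrics` and its parameter names are illustrative assumptions. It takes the 1000 POS transactions, the 263 baskets containing tomatoes, the 217 containing lettuce, and the 111 containing both, all read off the rule output above.

```python
def rule_metrics(n, count_a, count_b, count_ab):
    """Measures for the rule A -> B, computed from transaction counts.

    n        -- total number of transactions
    count_a  -- transactions containing the antecedent A
    count_b  -- transactions containing the consequent B
    count_ab -- transactions containing both A and B
    """
    coverage = count_a / n              # P(A)
    support = count_ab / n              # P(A and B)
    strength = count_ab / count_a       # P(B | A), often called confidence
    lift = strength / (count_b / n)     # P(B | A) / P(B)
    leverage = support - coverage * (count_b / n)  # P(A and B) - P(A) * P(B)
    return {
        "coverage": coverage,
        "support": support,
        "strength": strength,
        "lift": lift,
        "leverage": leverage,
    }


# tomatoes -> lettuce, using the counts shown in the rule output
metrics = rule_metrics(n=1000, count_a=263, count_b=217, count_ab=111)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Swapping `count_a` and `count_b` gives the reverse rule, lettuce -> tomatoes: support, lift, and leverage stay the same, while coverage and strength change, which matches the pattern in the listed rules.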
  • Slide 23
  • 3). Graphic Representation
  • Slide 24
  • Relationship graph when the link is set to 0
  • Slide 25
  • Association Rule : Relationship graph when the link is set to 0 Graphic Representations of Association Rules
  • Slide 26
  • Relationship graph when the distance is set by value (network form)
  • Slide 27
  • Application of MBA : product recommendation system
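One way such a recommendation system could use the mined rules is sketched below in Python. This is an illustrative assumption, not the system shown on the slide: the `RULES` list copies the four rules and their Strength values from the association-rule output on Slide 22, and the function name `recommend` and its ranking by Strength are hypothetical choices.

```python
# (antecedent, consequent, strength) copied from the rule output on Slide 22
RULES = [
    ("tomatoes", "lettuce", 0.422),
    ("lettuce", "tomatoes", 0.512),
    ("tomatoes", "carrots", 0.323),
    ("carrots", "tomatoes", 0.486),
]


def recommend(basket, rules=RULES, top_n=3):
    """Suggest items not yet in the basket, ranked by rule strength."""
    best = {}
    for antecedent, consequent, strength in rules:
        # A rule fires when its antecedent is in the basket and its
        # consequent is not there yet.
        if antecedent in basket and consequent not in basket:
            best[consequent] = max(best.get(consequent, 0.0), strength)
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:top_n]]


print(recommend({"tomatoes"}))  # prints ['lettuce', 'carrots']
```

Ranking by lift instead of strength would favour items whose purchase is most unusual given the basket rather than most likely; which is preferable depends on the store's goal.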
  • Slide 28
  • Case 2: SNS analysis
  • Slide 29
  • Social Network (http://nexus.ludios.net/view/demo)
  • Slide 30
  • Analysis of Human Relations (NodeXL)
  • Slide 31
  • Friends Networks
  • Slide 32
  • Case 3: Bankruptcy Prediction. The data are yearly financial data collected by the Korea Credit Guarantee Fund. They consist of 944 bankrupt corporations and 944 healthy (non-bankrupt) corporations from fiscal years 1999 to 2002.
  • Slide 33
  • List of financial variables selected (Variable: Definition)
      X13: interest expenses to sales = (interest expenses / sales) × 100
      X17: profit to sales = (profit / sales) × 100
      X24: operating profit to sales = (operating profit / sales) × 100
      X27: ordinary profit to total capital = (ordinary profit / total capital) × 100
      X28: current liabilities to total capital = (current liabilities / total capital) × 100
      X103: growth rate of tangible assets = (tangible assets at the end of the year / tangible assets at the beginning of the year × 100) - 100
      X108: turnover of managerial assets = sales / {total assets - (construction in progress + investment assets)}
      net financing cost = interest expenses - interest incomes
      X127: net working capital to total capital = {(current assets - current liabilities) / total capital} × 100
      X129: growth rate of current assets = (current assets at the end of the year / current assets at the beginning of the year × 100) - 100
      X140: ordinary income to net worth = (ordinary income / net worth) × 100
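As a rough illustration of how a few of these variables could be computed for one firm, here is a minimal Python sketch. It assumes the operators that appear to have been lost in the transcript (a multiplication by 100 for the ratios, and a subtraction of 100 for the growth rates). The field names and the sample figures are hypothetical and are not taken from the Korea Credit Guarantee Fund data.

```python
def pct(numerator, denominator):
    """Ratio expressed as a percentage."""
    return numerator / denominator * 100


def financial_ratios(f):
    """Compute a subset of the listed variables from one firm's figures."""
    return {
        # X13: interest expenses to sales
        "X13": pct(f["interest_expenses"], f["sales"]),
        # X24: operating profit to sales
        "X24": pct(f["operating_profit"], f["sales"]),
        # X127: net working capital to total capital
        "X127": pct(f["current_assets"] - f["current_liabilities"],
                    f["total_capital"]),
        # X129: growth rate of current assets (assumed: ratio * 100 - 100)
        "X129": pct(f["current_assets"], f["current_assets_prev"]) - 100,
    }


# Hypothetical firm, for illustration only
firm = {
    "sales": 2000.0,
    "interest_expenses": 50.0,
    "operating_profit": 160.0,
    "current_assets": 600.0,
    "current_assets_prev": 500.0,
    "current_liabilities": 400.0,
    "total_capital": 1000.0,
}
print(financial_ratios(firm))
```

Each firm then becomes one row of ratios plus a bankrupt/healthy label, which is the form a decision tree learner expects as input.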
  • Slide 34
  • Decision Tree Analysis
  • Slide 35
  • Case 4: Income Prediction. For our study we selected the United States Census (5%) 1990 Public Use Microsample data (Census 1990). This data set, which was divided into 18 files, contained the entire 5% sample released into the public domain from the 1990 U.S. Census, in STATA 6.0 format. Combined, these 18 files included about 4.5 million males and 5 million females, totaling 9.1 million records. Census 1990: http://www.macalester.edu/econdata/United_States/pums.html
  • Slide 36
  • Data Sampling: We converted the 18 data files into flat files; then, using Java code, we merged these 18 flat files into a single file consisting of 9.1 million records with 85 variables (approximately 1.5 GB in size).
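The original merge was done in Java and that code is not shown. As a hedged illustration of the same step, the Python sketch below concatenates several flat files into one, assuming (this is an assumption, not stated in the slides) that each file is comma-separated and starts with the same header row. The function name `merge_flat_files` is hypothetical.

```python
import csv


def merge_flat_files(input_paths, output_path):
    """Concatenate CSV files sharing one header; return data rows written."""
    rows_written = 0
    header = None
    with open(output_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in input_paths:
            with open(path, newline="") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    # First non-empty file: its header is written once.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"{path} has a different header")
                # Rows are streamed one at a time, so the files never
                # need to fit in memory together.
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Streaming row by row matters at this scale: a 1.5 GB result can be produced without loading any input file whole.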
  • Slide 37
  • Algorithm: Analogy of Discovering the Complete Set of Rules (Drawing the Perfect Picture via Coin Scrubbing)
  • Slide 38
  • The Repetitive Methodology of Merging New Rules into the Domain Knowledge Base
  • Slide 39
  • The Relationship Between IRA's Accuracy Level and Number of Iterations for This Study
  • Slide 40
  • Performance Comparison
      Method: CHAID | CART | ANN | LR | DA | See5 | This study
      Tool used: Answer Tree (SPSS) | Answer Tree (SPSS) | Neural Connection (SPSS) | SPSS | See5 (with default rule) | IRA
      Training sample size: 3.24m | 10000 | 300k
      Accuracy (2/3-1/3): 80.195 | 80.30 | RBF: 76.12, MLP: 80.68 | 81.1 | 78.3 | 82.3 | 82.7
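The "(2/3-1/3)" label presumably refers to a holdout evaluation in which two thirds of the records train the model and the remaining third tests it; the slide does not say so explicitly, so treat that reading as an assumption. A minimal Python sketch of such an evaluation follows, with hypothetical function names and a toy data set in place of the census records.

```python
import random


def holdout_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle the records and split them into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, test_set):
    """Percentage of (features, label) pairs that predict() gets right."""
    correct = sum(1 for features, label in test_set
                  if predict(features) == label)
    return 100 * correct / len(test_set)


# Toy records: the feature is an integer, the label is its parity.
records = [(i, i % 2) for i in range(9)]
train, test = holdout_split(records)
print(len(train), len(test))                # prints 6 3
print(accuracy(lambda x: x % 2, test))      # prints 100.0
```

Accuracy measured on the held-out third is what makes the figures for different methods comparable, since none of them has seen those records during training.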
  • Slide 41
  • Mining tools: Enterprise Miner (SAS); Clementine (SPSS); R; Python; many visualisation tools (infographics, etc.); RapidMiner; Hadoop; RHive
  • Slide 42
  • Future direction of bigdata
  • Slide 43
  • bigdata 2013 bigdata 2014
  • Slide 44
  • Google Glass: mashup, big data, visualisation -> analysis of commercial areas
  • Slide 45
  • IoT Key: Smart & Intelligence
  • Slide 46
  • 3D Printer Healthy food, organ, face recommended?
  • Slide 47
  • Evolution of bigdata
  • Slide 48
  • cup
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Cup with Art
  • Slide 59
  • Slide 60
  • Cup with emotion
  • Slide 61
  • Slide 62
  • Slide 63
  • Cup without cup