Introduction to big data
Uploaded by: partagetransparents
Category: Data & Analytics
Table of contents
① Definition: what is big data?
② Dimensioning use cases
③ Why should I be interested?
• Market study
④ How can it benefit
• Companies
• Consumers/citizens
• Society
⑤ Infrastructure
• Gathering
• Storage
• Networks
• Processing
⑥ Tools
⑦ Data models and predictive analytics
⑧ From big data to smart data
DEFINITIONS
Time-keeper genesis (figure; axes: Cost, Time, Users)
Hype curve 2014 (copyright Gartner)
Is big data the latest buzzword bingo?
① http://www.bullshitbingo.net/cards/buzzword/
Why data science?
http://www.personalizemedia.com/garys-social-media-count/
3V by SAP
[Figure: data sources placed along the three axes Volume, Variety and Velocity: CRM* data, GPS, demand, speed, transactions, opportunities, service calls, customer, sales orders, inventory, e-mails, tweets, planning, things, mobile, instant messages]
What is big data? (original idea: Gartner)
① Volume
② Variety
• Structured/unstructured
• Public/private
• Text/image/sound…
③ Velocity
• Generated
• Captured
• Shared
④ Veracity
Four « V » by IBM
Volume
① From the first wave, 10 000 years ago, until 1950, mankind created only 5 exabytes of data
② 1 EB = 1 000 000 000 000 000 000 B = 1 000 000 000 gigabytes = 1 000 000 terabytes = 1 000 petabytes
③ Nowadays we produce 5 exabytes every 2 days!
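The unit arithmetic above can be checked with a short Python sketch. It assumes decimal (SI) units, where each step is a factor of 1 000; the `UNITS` table and the `convert` helper are illustrative names, not part of the original slides.

```python
# Decimal (SI) storage units: each step up is a factor of 1 000.
UNITS = {
    "B": 1,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
}


def convert(value, src, dst):
    """Convert a quantity from unit `src` to unit `dst`."""
    return value * UNITS[src] / UNITS[dst]


# The equalities stated on the slide.
one_eb_in_gb = convert(1, "EB", "GB")
one_eb_in_tb = convert(1, "EB", "TB")
one_eb_in_pb = convert(1, "EB", "PB")

# "5 exabytes every 2 days", expressed as a yearly volume in zettabytes.
yearly_zb = convert(5 / 2 * 365, "EB", "ZB")

print(one_eb_in_gb, one_eb_in_tb, one_eb_in_pb, yearly_zb)
```

With these assumptions, 5 EB every 2 days is a little under one zettabyte per year.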
Volume (cf. Wikipedia)
① According to an IDC study sponsored by EMC Gartner, the digital data created worldwide grew from
• 1.2 zettabytes/year in 2010 to
• 1.8 zettabytes in 2011 and
• 2.8 zettabytes in 2012, and should reach
• 40 zettabytes in 2020.
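As a rough illustration, the growth implied by these figures can be computed as follows. The sketch assumes the four values are yearly volumes and that growth between 2012 and 2020 compounds once per year; the variable names are illustrative.

```python
# Yearly data volumes quoted above, in zettabytes.
volumes_zb = {2010: 1.2, 2011: 1.8, 2012: 2.8, 2020: 40.0}


def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Year-over-year growth for the consecutive years.
growth_2011 = volumes_zb[2011] / volumes_zb[2010] - 1
growth_2012 = volumes_zb[2012] / volumes_zb[2011] - 1

# Average yearly growth needed to go from the 2012 figure to the 2020 forecast.
growth_2012_2020 = cagr(volumes_zb[2012], volumes_zb[2020], 2020 - 2012)

print(f"2010-2011: {growth_2011:.0%}")
print(f"2011-2012: {growth_2012:.0%}")
print(f"2012-2020 (per year): {growth_2012_2020:.1%}")
```

Under these assumptions the forecast corresponds to a bit less than 40% growth per year between 2012 and 2020.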
Volume : storage
Variety : data classifications
① By structure
• Structured (SQL-like databases): about 20%
• Unstructured: about 80%
② By source
• Human originated
• Non-human originated
• In-house
• From outside
③ By movement
• Data in motion
• Data at rest
Velocity: example of cellular data rates
① 2G
• GPRS: 140.6 kbps
• EDGE: 473.6 kbps
② 3G
• UMTS: 384 kbps
• HSPA: 14.4 Mbps
③ 4G
• HSPA+: 42.2 Mbps
• LTE: 173 Mbps
④ What’s next?
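To give a feel for these rates, the sketch below estimates how long a transfer would take at each of them. It assumes the quoted figures are peak rates that are sustained for the whole transfer, and it uses a hypothetical 1 GB file (decimal, 8 × 10^9 bits); real throughput is lower.

```python
# Peak data rates from the list above, in bits per second.
RATES_BPS = {
    "GPRS": 140.6e3,
    "EDGE": 473.6e3,
    "UMTS": 384e3,
    "HSPA": 14.4e6,
    "HSPA+": 42.2e6,
    "LTE": 173e6,
}

# Hypothetical payload: 1 GB (decimal) expressed in bits.
FILE_SIZE_BITS = 1e9 * 8


def transfer_seconds(size_bits, rate_bps):
    """Ideal transfer time, ignoring protocol overhead and radio conditions."""
    return size_bits / rate_bps


times = {name: transfer_seconds(FILE_SIZE_BITS, rate) for name, rate in RATES_BPS.items()}

# Print the technologies from slowest to fastest.
for name, seconds in sorted(times.items(), key=lambda item: -item[1]):
    print(f"{name:6s} {seconds:10.0f} s")
```

With these assumptions the same file takes more than 15 hours over GPRS and under a minute over LTE.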
Big data cycle management
Capture
Organize
Integrate
Analyze
Act
Intelligence cycle (source: C.I.A.): similarity?
① The UKUSA agreement shares
• facilities,
• tasks and
• product
between participating governments.
② What about analysis?
2014 Hear & Know 2014 18
Intelligence cycle applied to Sigint
Interception of messages and
communications data (meta data)
Processing
• Traffic analysis of communications data (who is communicating with whom)
• Cryptanalysis
• Analysis of the content of messages
Analysis with the use of other sources, for
example Open Source Intelligence
(OSINT)
Dissemination
Planning & direction
Big data « food chain »
Personal data
Contacts/calendar, audio/voice/music, mails/notes, photos/videos, identifiers/metadata, positions, navigation history, biometric data (fingerprints, voice…), games data
Storage
Terminal, servers, data centers, cloud
Accessibility
Operators, radio (short or long range), API, OS, development kit
Applications
Millions of applications in the Apple App Store and Google Play
Structured data: human generated
① Input
② Click stream
③ Gaming related (moves)
④ Quantified self
Structured data : machine/computer generated
① Sensor
② Smart meters
③ Web logs
④ Point of sale
⑤ Financial
WHY BIG DATA?
Unsolved problems with « classical » means
① Search engines with RDBMS -> Google's own solutions
Moore’s law miniaturisation and its limits
① « The number of transistors integrated on a silicon chip is multiplied by 4 every 3 years » (Moore, 1965)
② Rock’s law: chip manufacturing costs double every 4 years
③ Below 20 nm: quantum effects
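The two growth rules above can be compared with a small sketch. It takes the figures exactly as quoted on the slide (a factor of 4 every 3 years for transistor count, a factor of 2 every 4 years for manufacturing cost) and assumes smooth exponential growth; the 12-year horizon is an arbitrary illustration.

```python
import math


def doubling_time(factor, period_years):
    """Years needed to double when a quantity grows by `factor` every `period_years`."""
    return period_years * math.log(2) / math.log(factor)


def growth(factor, period_years, years):
    """Total multiplication after `years` of growth by `factor` every `period_years`."""
    return factor ** (years / period_years)


# Transistor count: x4 every 3 years, as quoted on the slide.
transistor_doubling = doubling_time(4, 3)
# Manufacturing cost: x2 every 4 years, as quoted on the slide.
cost_doubling = doubling_time(2, 4)

HORIZON_YEARS = 12  # arbitrary horizon chosen for the illustration
transistor_gain = growth(4, 3, HORIZON_YEARS)
cost_gain = growth(2, 4, HORIZON_YEARS)

print(f"Transistors double every {transistor_doubling:.1f} years")
print(f"Costs double every {cost_doubling:.1f} years")
print(f"After {HORIZON_YEARS} years: transistors x{transistor_gain:.0f}, cost x{cost_gain:.0f}")
```

A factor of 4 every 3 years is the same as doubling every 18 months, which is the form in which the rule is often quoted.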
After the end of Moore’s law
① Around 2020: limits of classical « engraving » (lithography) physics => necessary evolutions
• Pessimistic scenario
o Application/architecture innovation
o Cost/price erosion
• Substitution technologies
o Biology/organic molecular electronics
o DNA
o Neuronal/analog
o Superconductors
o Optics
o Components with one or few electrons
o Quantum computers
o Nanotechnologies
o …
DIMENSIONING EXAMPLES
The « historical » big data crunchers
① Simulation (nuclear…)
② Weather forecasting
③ Sigint
④ The National Security Agency is building the biggest building on earth: the Utah Data Center, designed to hold yottabytes of data collected from the internet.
⑤ Cryptanalysis
Amazon
① From online bookstore to global IT provider
② Big data :
• User initially then
• Provider
Cellular : potential analysis
① Cell activity for urban planning and network planning
② Policy makers
③ Urban planners
④ Traffic engineers
⑤ Weather forecast
⑥ …
Quantified self/Lifelogging
① Position
② Sleeping hours
③ Blood pressure/heart rate
④ Pedometer
⑤ Accelerometer/speedometer/distance
⑥ Food/beverage
⑦ Temperature
⑧ Weight
⑨ Height
⑩ Photo
⑪ Voice recording
Trading
① High-frequency/speed trading (non-distributed)
EDF and metering
① Previous situation
• 35 million households in France
• 2 meter readings/year
② Remote metering
• Every 30 min
③ Result
• Expected savings: xxx MW
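The change of scale between the two situations can be worked out directly. The sketch below assumes one meter per household and a 365-day year; the variable names are illustrative.

```python
HOUSEHOLDS = 35_000_000  # households in France, as quoted above

# Previous situation: 2 readings per household per year.
manual_readings_per_year = HOUSEHOLDS * 2

# Remote metering: one reading every 30 minutes.
readings_per_day = 24 * 60 // 30  # 48 readings per meter per day
remote_readings_per_year = HOUSEHOLDS * readings_per_day * 365

# How many times more data points the remote scheme produces.
scale_factor = remote_readings_per_year // manual_readings_per_year

print(f"Manual: {manual_readings_per_year:,} readings/year")
print(f"Remote: {remote_readings_per_year:,} readings/year")
print(f"Scale factor: x{scale_factor:,}")
```

Under these assumptions the number of readings goes from 70 million to more than 600 billion per year, which is what turns metering into a big data problem.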
Cybersecurity Hypervision
① Digital Attack Map
② Norse
Big data applications
① Social media analytics: an impressive example is LinkedIn's « People you may know »
② Voice analytics: call centers, mobile phones (Siri)
③ Text analytics
④ Video analytics
⑤ Telecom : customer churn
⑥ Behavioural analytics
Marketing
① Knowledge
• Brand
• Competitors
• Customers
• Anticipate the market
• A/B testing
Geolocation
① The Skyhook (Google, Apple…) database may be used to observe people's movements
Intelligence family
Humint, Osint, Imint, Sigint
Sigint and family
Techint
Sigint
Comint Elint Masint
Imint …
Politics
① Obama re-election
② In 2013, big data was named one of France's « 7 strategic ambitions » by the Commission Innovation 2030
Science
① The “Square Kilometre Array” radio telescope will deliver 50 terabytes of analyzed data per day, from 7 000 terabytes of raw data per second
② The Large Hadron Collider has around 150 million sensors delivering data 40 million times per second.
③ About 600 million collisions per second; after filtering, 100 interesting collisions per second remain. 25 petabytes must be stored each year.
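The reduction ratios behind these figures can be made explicit with a short sketch. It uses only the numbers quoted above, with decimal units and a 365-day year as assumptions; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

# Square Kilometre Array: raw data per second vs analyzed data per day.
ska_raw_tb_per_day = 7_000 * SECONDS_PER_DAY
ska_analyzed_tb_per_day = 50
ska_reduction = ska_raw_tb_per_day // ska_analyzed_tb_per_day

# Large Hadron Collider: collisions before and after filtering.
lhc_collisions_per_s = 600_000_000
lhc_kept_per_s = 100
lhc_reduction = lhc_collisions_per_s // lhc_kept_per_s

# 25 petabytes per year expressed as an average rate in bytes per second.
lhc_avg_bytes_per_s = 25e15 / SECONDS_PER_YEAR

print(f"SKA keeps about 1 TB out of every {ska_reduction:,} TB of raw data")
print(f"LHC keeps 1 collision out of every {lhc_reduction:,}")
print(f"LHC average storage rate: {lhc_avg_bytes_per_s / 1e6:.0f} MB/s")
```

In both cases the stored data is a tiny fraction of what the instrument produces, so most of the processing has to happen before storage.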
IoT/M2M/IoO
① According to Yole, the Internet of Things will represent 15% of processed data in 2024
② Electronic components will jump from 9.5 G$ in 2014 to 46 G$ in 2024
IoO
According to Samuel Ropert from Idate, there will be 80 billion connected things in 2020.
IoO alone will account for 85% of the IoT, terminals for 11% and M2M for 4%.
Expected annual growth between 2010 and 2020:
IoO 41%
terminals 22%
M2M 16%
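The shares and growth rates above translate into absolute numbers as follows. This is a sketch under two assumptions: the 80 billion total quoted above is the base for all three shares, and each annual growth rate compounds over the 10 years from 2010 to 2020.

```python
TOTAL_2020 = 80_000_000_000  # total quoted above for 2020

# Share of the 2020 total (in percent) and expected annual growth, per category.
CATEGORIES = {
    "IoO": {"share_pct": 85, "annual_growth": 0.41},
    "terminals": {"share_pct": 11, "annual_growth": 0.22},
    "M2M": {"share_pct": 4, "annual_growth": 0.16},
}

YEARS = 2020 - 2010

counts_2020 = {}
multipliers = {}
for name, info in CATEGORIES.items():
    # Integer arithmetic keeps the absolute counts exact.
    counts_2020[name] = TOTAL_2020 * info["share_pct"] // 100
    # Total multiplication over the decade at the given yearly rate.
    multipliers[name] = (1 + info["annual_growth"]) ** YEARS

for name in CATEGORIES:
    print(f"{name:10s} {counts_2020[name] / 1e9:5.1f} billion in 2020, x{multipliers[name]:.1f} over {YEARS} years")
```

A 41% yearly growth rate means roughly a thirtyfold increase over the decade, against four to seven times for the two other categories.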
Roadmap of the Internet of things
MARKET STUDY
Big data growth
① Annual growth of the big data market for 2011-2016 is expected to be 31.7%.
② The market should reach 23.8 G$ in 2016 (source: IDC, March 2013).
③ Big data should represent 8% of European GNP in 2020 (AFDEL, February 2013).
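The first two figures can be combined to estimate the market size in each year of the period. The sketch assumes the 31.7% rate compounds once per year over the five years from 2011 to 2016 and works backwards from the 2016 value; the intermediate values are derived, not quoted from IDC.

```python
ANNUAL_GROWTH = 0.317     # expected yearly growth, 2011-2016
MARKET_2016_BUSD = 23.8   # expected market size in 2016, in billions of dollars

# Work backwards from 2016 to each earlier year of the period.
market_by_year = {
    year: MARKET_2016_BUSD / (1 + ANNUAL_GROWTH) ** (2016 - year)
    for year in range(2011, 2017)
}

for year, value in market_by_year.items():
    print(f"{year}: {value:5.1f} G$")
```

Under these assumptions the two figures imply a market of about 6 G$ in 2011.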
Risks and opportunities
① How to create value from this flood of data?
② If you don’t do it yourself in your market, the advantage goes to the first mover.
③ Democracy risks: end of privacy? Dictatorship based on data?
INFRASTRUCTURE
Software approach
Traditional
• Monolithic
• Centralised storage
• RDBMS
• Data schema/format defined beforehand
• Proprietary
Big data
• Distributed
• Storage and execution at node level
• Raw data processing
• Open source
Hardware approach
Traditional hardware
• Specific hardware
• Big central server
• NAS
• RAID
• Expensive
• Hard to upgrade
Big data
• Basic (commodity) hardware
• Pizza boxes
• Ethernet
• JBOD
• Inexpensive
• Easy to upgrade
Big data stack (copyright Big Data For Dummies)
Infrastructure criteria
① Performance
② Availability
③ Scalability
④ Flexibility
⑤ Cost
⑥ Redundancy + resilience
Storage characteristics
Characteristic      RDBMS         Big data
Data size           Gigabytes     Petabytes
Access              Interactive   Near real time or batch
Schema/structure    Static        Dynamic
Language            SQL           UQL/procedural (Java, C++…)
Job scheduling      Hard          Simple
Integrity           High          High
Scaling             Non-linear    Linear
TOOLS
Tools
① Hadoop
• MapReduce
② PostgreSQL (www.postgresql.org)
③ R
④ Matlab
⑤ Analyst’s Notebook
⑥ Watson
Hadoop
① Open source
② Fast (parallel processing)
③ Main components
• Distributed file system
• MapReduce engine
MapReduce (made at Google)
① Map
② Reduce
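The two steps can be illustrated with the classic word-count example. This is a minimal single-machine sketch in plain Python, not Hadoop's or Google's actual API: the function names and the sample documents are illustrative, and the grouping step in the middle (often called "shuffle") is done in memory here, whereas a real cluster distributes it across nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between the two steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a collection of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Illustrative input: each string stands for a document stored on a different node.
docs = ["big data is big", "data at rest", "data in motion"]
counts = word_count(docs)
print(counts)
```

Because each call to `map_phase` only looks at one document and each call to `reduce_phase` only looks at one key, both steps can run in parallel on many nodes.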
Why R?
① Open source
② www.r-project.org
③ www.rstudio.com
Big data, and after?
① Open data
② Smart data
③ Linked data
BACK-UP
The cloud
① Shared resources
② Applications
③ Computing
④ Storage
⑤ Networking
⑥ Development and deployment platforms
Cloud vocabulary for delivery models
① IaaS : infrastructure as a service
② PaaS : platform as a service
③ SaaS : software as a service
④ And especially useful for big data: DaaS (data as a service)
Cloud players
① Worldwide
• Apple
• Microsoft
• Amazon
• OpenStack
• Dropbox…
② France
• Cloudwatt
• Numergy
Improvement
① Data modelling
② Data management
At stake
31/08/2014
① To raise awareness of the future impact of IoT, let’s answer these three key questions:
• What kind of data are these devices collecting?
• What are the different types of “Things” or
categories that are getting connected?
• What are the different use cases that are driving
the revenue predictions?