Languages, Systems and Paradigms for Big Datacolazzo/BD/partie1.pdfExtract, transform, and load...
Transcript of Languages, Systems and Paradigms for Big Datacolazzo/BD/partie1.pdfExtract, transform, and load...
Languages, Systems and Paradigms for
Big Data Dario COLAZZO
Plan
• Introduction to Big Data
• MapReduce
• Hadoop and its ecosystem,
• RDD et Spark
• Pig Latin & Hive
Modalités de contrôle
• Projet :
• analytics via MapReduce + Spark
• experimental evaluation on the cluster @Dauphine
• report
• Written exam
• Final grade = 0.7*Ex + 0.3*Pr
About me• PhD in Computer Science at Università di Pisa
• Post-docs: Università di Venezia, Université Paris-Sud
• MdC at Université Paris-Sud, LRI-INRIA Saclay
• Professor at Univ. Paris-Dauphine since September 2013
• Research and teaching topics:
• languages and systems for databases
• focus on Web data (XML, RDF, JSON)
• expressive and efficient processing of Big Data
• Publications : http://dblp.uni-trier.de/pers/hd/c/Colazzo:Dario
Big data and Data science: an introduction
Contents
• What?
• Where?
• Why?
• How?
Contents
2
y What?y Where?y Why?y How?
Some numbers
An introductory videoA short video
3(http://www.intel.it/content/www/it/it/big-data/big-data-101-animation.html)http://www.intel.fr/content/www/fr/fr/big-data/big-data-101-animation.html?
How many, how much?How many data in the world?
44 Zettabytes by 2020
800 Terabytes, 2000
160 Exabytes, 2006 (1EB = 1018B)
4.5 Zettabytes, 2012 (1ZB = 1021B)
How much is a zettabyte?
1,000,000,000,000,000,000,000 bytes
A stack of 1TB hard disks that is 25,400 km high
How many data in a day?
7 TB, Twitter
10 TB, Facebook
90% of world's data:
generated over last two years!
640K ought to be enough for anybody.
Data grows fast!Data grows fast!
How much fast?Growth rate
8http://www.sec.gov/Archives/edgar/data/51143/000110465913015636/a13-6155_18k.htm
Proliferation of data sourcesProliferation of data sources
Proliferation of devicesGlobal Internet device forecast
Users generate dataUser-generated content
Internet of ThinksInternet of Things
Internet of ThingsInternet of Things
Types of dataData types
16
What is more important?• The ‘Big’?
• The ‘Data’?
• Both?
• Neither
What organisations do with data
What is more important?
19
y The “Big”y The “Data” y Bothy Neither
What organizations do with big data
"Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom"
Cliff Stoll
“Data is not information, information is not knowledge,
knowledge is not understanding, understanding is not wisdom”
Cliff Stoll
Not just a matter of VolumeThe four "V’s" of Big Data
17
y Not just a matter of volume..
Big Data: V4 +ValueVolume:Terabyte(1012), Petabyte(1015), Exabyte(1018), Zettabyte (1021)
Variety: Structured, semi-structured, unstructured; Text, image, audio, video, record
Velocity: Periodic, Near Real Time, Real Time
Veracity: Quality of the data can vary greatly
Value: Big data can generate huge competitive advantages
Big Data: V4+VALUE
21
y Volume:Terabyte(1012), Petabyte(1015), Exabyte(1018), Zettabyte (1021)
y Variety: Structured, semi-structured, unstructured; Text, image, audio, video, record
y Velocity: Periodic, Near Real Time, Real Timey Veracity: Quality of the data can vary greatlyy Value: Big data can generate huge competitive advantages
What’s new?
22
The wide availability of data allows us to apply more sophisticated models and you get much more accurate results than in the past!
It is a capital mistake to theorize before one has data
The bigger the data set you have, the more accurate the predictions about the future will be
AnthonyGoldbloom
The wide availability of data allows us to apply more sophisticated
models and you get much more accurate results than in the past!
Bigger = Smarter?YES
algorithms work better
tolerate errors
discover the ‘long tail’ and ‘corner case’
BUT
more heterogeneity
data grows very fast
still need humans to ask right questions
Bigger = Smarter?y YES!y algorithms work much bettery tolerate errorsy discover the "long tail" and "corner cases"
y BUT:y more heterogeneityy data grows faster than energy on chipy still need humans to ask right questions
23
Why now?• Because we have large amount of data
• Data originates already in digital form
• ~40% growth per year
• Because we can
• 500$ for a drive in which to store all the music of the world
• 40 years of Moore's Law
• large computational resources
• 64% of organizations have invested in big data in 2013
• 34 billion $ invested in big data in 2013
• “Because we reached dead end with logic”
Why now?
25
y Because we have datay Data born already in digital form y 40% of data growth per year
y Because we can y 500$ for a drive in which to store all the music of the world y 40 years of Moore's Law o large computational resources y 64% of organizations have invested in big data in 2013 y 34 billion $ invested in big data in 2013
y “Because we reached dead end with logic”
Why now?
25
y Because we have datay Data born already in digital form y 40% of data growth per year
y Because we can y 500$ for a drive in which to store all the music of the world y 40 years of Moore's Law o large computational resources y 64% of organizations have invested in big data in 2013 y 34 billion $ invested in big data in 2013
y “Because we reached dead end with logic”
Why now?
25
y Because we have datay Data born already in digital form y 40% of data growth per year
y Because we can y 500$ for a drive in which to store all the music of the world y 40 years of Moore's Law o large computational resources y 64% of organizations have invested in big data in 2013 y 34 billion $ invested in big data in 2013
y “Because we reached dead end with logic”
bigger=smarter, an example
Google Translate you collect snippets of translations
you match sentences to snippets
you continuously debug your system
Why does it work? there are tons of snippets on the Web
the accuracy improves as the training set grows
A simple example of bigger=smarter
26
y Google Translate y you collect snippets of translationsy you match sentences to snippetsy you continuously debug your system
y Why does it work?y there are tons of snippets on the Weby the accuracy improves as the training
set grows
Other success storiesMore success stories
Other casesCrime Prevention in Los Angeles
Diagnosis and treatment of genetic diseases
Investments in the financial sector
Generation of personalized advertising
Astronomical discoveries
...
Use casesUse cases
30
Today’s Challenge New Data What’s Possible
HealthcareExpensive office visits
Remote patientmonitoring
Preventive care, reduced hospitalization
ManufacturingIn-person support Product sensors Automated diagnosis,
support
Location-Based ServicesBased on position Real time location data Geo-advertising, traffic,
local search
Public SectorStandardized services Citizen surveys Tailored services,
cost reductions
RetailOne size fits all
marketingSocial media Sentiment analysis
segmentation
The Big Data ProcessThe big data process
31
Acquisition
Extraction
Integration
Analysis
Interpretation
Decision
Goal: to make effective strategic decisions exploiting the availability of big data
Big Data in actionBig Data in action
32
Requires:y selectiony filteringy Metadata
generation y managing
provenance
Acquisition
Extraction
Integration
Analysis
Interpretation
Decision
Big Data in actionBig Data in action
33
Requires: y transformation y normalization y cleaning y aggregation y error handling
Acquisition
Extraction
Integration
Analysis
Interpretation
Decision
Big Data in actionBig Data in action
34
Requires:y standardization y conflict
management y reconciliation y mapping definition
Acquisition
Extraction
Integration
Analysis
Interpretation
Decision
Big Data in actionBig Data in action
35
Requires:y exploration y data mining y machine learning y visualization
Acquisition
Extraction
Integration
Analysis
Interpretation
Decision
Big Data in actionBig Data in action
36
Requires:y Knowledge of the
domain y Knowledge of the
provenancey Identification of
patterns of interest y Flexibility of the
process
Acquisition
Extraction
Integration
Analysis
Interpretation
Decision
Big Data in actionBig Data in action
37
Requires:y managerial skillsy continuous
improvement of the process
Acquisition
Extraction
Integration
Analysis
Interpretation
Decision
A simple exampleProblem: The sale of lollipops is going down! Acquisition:
Sales by customer region and time Surveys of users Social networks
Extraction: Data loading from receipts Automatic reading of questionnaires Data extraction from twitter
Integration: On the basis of user types
Analysis : lollipops bought by people older than 25 lollipops preferred by people younger than 10
Interpretation: Moms believe: lollipops = bad teeth Boys and girls believe that lollipops are for babies
Decision: We make lollipops without sugar We ask dentists to advertise our lollipops We make commercials targeted to boys and girls
A simple example of a big data process
38
y Problem: The sale of lollipops is going down! y Acquisition:
y Sales by customer, region and time y Surveys of users y Social networks
y Extraction: y Data loading from receipts y Automatic reading of questionnaires y Data extraction from twitter
y Integration: y On the basis of user types
y Analysis: y lollipops bought by people older than 25y lollipops preferred by people younger than 10
y Interpretation: y Moms believe: lollipops = bad teethy Boys and girls believe that lollipops are for babies
y Decision: y We make lollipops without sugar y We ask dentists to advertise our lollipops y We make commercials targeted to boys and girls
Risks and Challenges Performance, performance, performance!
Data grows faster than energy on chip
Efficiency
Scalability
Effectiveness
Heterogeneity
Flexibility
Privacy
Costs
Risks and Challenges of Big Data
39
y Performance, performance, performance!y Data grows faster than energy on chipy Efficiencyy Scalability
y Effectivenessy Heterogeneityy Flexibility y Privacyy Costs
Effectiveness : a failure story Google Flu Trends:
over-estimated the relevance of flu for 100 of 108 weeks
‘’Google would not comment on thisyear’s difficulties. But several researchers suggest that the problems may be due to widespread media coverage of this year’s severe US flu season, including the declaration of a public-health emergency by New York state last month. The press reports may have triggered many flu-related searches by people who were not ill.’’ Nature (http://www.nature.com/news/when-google-got-flu-wrong-1.12413)
Effectiveness: a failure storyy Google Flu Trendsy over-estimated the prevalence of flu for 100 out of 108 weeks
Risk of bad interpretationsRisks of bad interpretation
Privacy: unpleasant drawbacksAOL search data leak (NYT, 8/9/2006)
Anonymous Netflix vs IMDb database (Wired, 12/13/2007)
Why Johnny Can’t Browse The Internet In Peace (Forbes, 8/1/2012)
How Companies Learn Your Secrets (N Y T, 16/2/2012)
Privacy: Unpleasant drawbacks
46
y AOL search data leak (NYT, 8/9/2006)y Anonymous Netflix vs IMDb database (Wired, 12/13/2007)y Why Johnny Can’t Browse The Internet In Peace (Forbes,
8/1/2012)y How Companies Learn Your Secrets (NYT, 16/2/2012)
PerformancePerformance: taming Big Data..
Distribution and parallelism Distributed Architecture
Clusters of computers that work together to a common goal
Scale out not up!
Fault- tolerance
Resource replication
Eventual consistency
Distributed processing
Shared-nothing model
New programming paradigms
Distribution of resources and services
48
y Distributed Architecture y Clusters of computers that work together to a common goaly Scale out not up!
y Fault- tolerance y Resource replicationy Eventual consistency
y Distributed processingy Shared-nothing modely New programming paradigms
Distribution of resources and services
48
y Distributed Architecture y Clusters of computers that work together to a common goaly Scale out not up!
y Fault- tolerance y Resource replicationy Eventual consistency
y Distributed processingy Shared-nothing modely New programming paradigms
The Big Data landscapeThe Big Data Landscape
49
The new software stack
New programming environments designed to get their parallelism not from a supercomputer, but from computing clusters
Bottom of the stack: distributed file system (DFS)
We have a winner!
The New Software Stacky New programming environments designed to get their parallelism
not from a supercomputer but from computing clustersy Bottom of the stack: distributed file system (DFS)y We have a winner!
50
I think there is a world market for about five computers.
On top of Hadoop Hundreds of different (high-level) programming solutions
Two main scenarios:
Analytics (mainly batch)
collecting, transforming, and modelling data with the goal of discovering useful information and supporting decision-making
Real-time processing
processing data and returning the results sufficiently quickly to affect the environment at that time
e-commerce, search engines, booking,…
On the top of Hadoopy Hundreds of different (high-level)
programming solutions
y Two main scenarios:y Analytics (mainly batch)y collecting, transforming, and modeling data with the goal of discovering useful
information and supporting decision-making
y Real-time processingy processing data and returning the results sufficiently quickly to affect the
environment at that timey e-commerce, search engines, booking, …
The Big Data flow
52
The Big Data flow
ETL
Real TimeStreams
Distributed File System (HDFS)
NoSQL(HBase,
Cassandra,MongoDB)
New SQL(Oracle,VoltDB,
Teradata)
Real TimeProcessing
Analytics (Vertica,
Cloudera,Greenplum)
BatchProcessing
Techniques for Big Data analysis
Extract, transform, and load (ETL) Data fusion and data integration Distributed file system NoSQL database systems Cloud computing Analytics
Data mining Association rule learning Classification Cluster analysis Regression
Machine learning Supervised learning Unsupervised learning
Crowdsourcing …..
Techniques for big data analysis
53
y Extract, transform, and load (ETL)y Data fusion and data integrationy Distributed file systemy NoSQL database systemsy Cloud computingy Analytics
y Data miningy Association rule learningy Classification y Cluster analysisy Regression
y Machine learningy Supervised learningy Unsupervised learning
y Crowdsourcing y …
Goal of AnalyticsGoals of analytics
Data scientist: a brand new profession
Data Scientist: The Sexiest Job of the 21st Century [Harward Business Review 2013]
Data scientist? A guide to 2015's hottest profession [Mashable 2015]
Data scientist: a brand new profession
57
y Data Scientist: The Sexiest Job of the 21st Century [HarwardBusiness Review 2013]
y Data scientist? A guide to 2015's hottest profession [Mashable 2015]
Skills of data scientistsSkills of data scientists
58
Conclusions We live in the era of Big Data
Wide range of availability in different areas
Big opportunities to solve big problems
They can create value
The challenge is how to manage and use them
New technologies are needed
Methodological aspects are important
A rapidly evolving area
Data scientists: the current hottest profession in IT
Conclusions
61
y We live in the era of Big Datay Wide range of availability in different areas y Big opportunities to solve big problemsy They can create value y The challenge is how to manage and use themy New technologies are neededy Methodological aspects are important y A rapidly evolving areay Data scientists: the current hottest profession in IT
So, let us face big data projects…So, let us face big data projects..
62
…with a Bruce Willis’ attitude!..with a Bruce Willis attitude!
63
References
"Big Data: The next frontier for innovation, competition, and productivity". McKinsey&Company, 2012.
"Challenges and Opportunities with Big Data". A community white paper developed by leading researchers across the United States, 2012.
"Taming The Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics". Bill Franks, John Wiley & Sons, 2012.
Questions?
Photo credit: Jimmy Lin
Thanks and credits for this introductory part
Jimmy Lin (University of Maryland)
Riccardo Torlone (Università di Roma)