Post on 11-Sep-2021
Data Science Starter ProgramIntroduction to Data Science
E. Le Pennec, A. Fermin
Spring 2015
Introduction to Data ScienceOutline
1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography
Data Science in the mediaOutline
1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography
Data Science in the mediaLe Monde
Data Science in the mediaNY Times
Data Science in the mediaWorld Bank
Data Science in the mediaCriteo
Data Science in the mediaWe are in the press as well...
Data Science in the mediaData is the new oil?
From Data to ProductOutline
1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography
From Data to ProductWeb search
From Data to ProductRecommendation system
From Data to ProductAdvertisement
From Data to ProductIntrusion detection
From Data to ProductCrime prevention
From Data to ProductMarketing
Marketing Technology Landscape January 2014
INFRA&'
STRU
CTURE
'BA
CKBO
NE'
PLATFORM
S'MIDDLE&'
WARE
'
Databases' Big'Data'
by'Sco?'Brinker'''@chiefmartec'''h?p://chiefmartec.com'
Cloud'
CRM' MarkeNng'AutomaNon'/'Integrated'MarkeNng' Web'Site'/'WCM'/'WEM' E&commerce'
User'Mgmt' Cloud'Connectors' APIs'
MARKETING'EXPERIENCES'
Channel/Local'Mktg'
MarkeNng'Resource'Mgmt'
MARKETING'OPERATIONS'
Agile'&'Project'Mgmt'
Dashboards'
MarkeNng'AnalyNcs'
Business'Intelligence'
Digital'Asset'Mgmt'
MarkeNng'Data'
Sales'Enablement'
Content'MarkeNng'PersonalizaNon'
TesNng'&'OpNmizaNon'
SEO'
MarkeNng'Apps'
Customer'Experience/VoC'
Calls'&'Call'Centers'
Events'&'Webinars'
Loyalty'&'GamificaNon'
Social'Media'MarkeNng'
CommuniNes'&'Reviews'
Video'Ads'&'MarkeNng'
Email'MarkeNng'
Display'AdverNsing'
Search'&'Social'Ads'
Tag'Management'
INTERN
ET'Web'Dev' MarkeNng'Environment'
Data'Management'PlaYorms/Customer'Data'PlaYorms'
Web'&'Mobile'AnalyNcs'
Mobile'App'Dev'
Mobile'MarkeNng'
CreaNve'&'Design'
From Data to ProductHealth
From Data to ProductLinkedin
From Data to ProductSmart city
From Data to ProductSports
From Data to ProductGenomics
From Data to ProductPhysics
An example: Real Time BiddingOutline
1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography
An example: Real Time BiddingAn example: Real Time Bidding
A customer visits a webpage with his browser: a complex processof content selection and delivery begins.
An advertiser might want to display an ad to this customer on thewebpage he is going to
The webpage belongs to a publisher. The publisher deliverscontents: news, music, information, sports, etc. This content drawsan audience
The publisher sells ad space to advertisers who want to reach thataudience
An example: Real Time BiddingAn example: Real Time Bidding
1. 2. The customer visits a publisher’s webpage: the browser opensa connection to the publisher’s content server. It returns thecontent for the page (html code).
The html code describing this content is retrieved by the browser,and it starts to render an interpret it.
But... there is a line in this html code that says “follow this URL toretrieve ad content”
An example: Real Time BiddingAn example: Real Time Bidding
3. The publisher has an ad server: it answers the request byconsidering possibilities: can I put an ad for my premium buyers ?Do I have data about this consumer viewing my content? (it couldhelp me to decide to which buyer I could give this displayopportunity). Only logical rules apply (no machine-learning here).
The ad display opportunity is not premium, and this space or type ofcustomer is not already reserved by a buyer. Publisher’s ad serverputs this opportunity of ad display in the open ad market.
An example: Real Time BiddingAn example: Real Time Bidding
4. The publisher ad server connects to an SSP (Supply-SidePlatform). This platform monetizes its programmable displayinventory.
The SSP asks: have I already seen this consumer before ? Do I haveadditional data on him? The SSP requests extra information to aDMP (Data-Management Platform) about the user: profiling,audience segments, etc. Here machine learning is applied.
5. Using this information, the SSP sends the ad request to anad-exchange
An example: Real Time BiddingAn example: Real Time Bidding
6. Meanwhile, the ad-exchange is connected and exchanges withmany potential buying systems: DSP (Demand-Side Platform),ad-networks, even other ad-exchange networks.
Ad-network and DSP can have pre-cached bid: I’m paying 1$ for1000 displays of 25years-old males in France, I buy 100 displays assoon as the price is below some threshold (like a broker).
If no pre-cached bids, the ad-exchange says: no direct buyer for thisdisplay. Let’s us an auction rule! The RTB (Real Time Bidding)begins.
An example: Real Time BiddingAn example: Real Time Bidding
6. RTB: buyers have 10ms (!) to give a price to thead-exchange. Buyers assess in real-time how willing they are todisplay an ad to this customer.
Machine learning is used here, but only the prediction step, e.g. toassess the probability that the customer will click on some ads. Themodel must contain few parameters to answer quickly: the use offeature selection in the training step is crucial here.
An example: Real Time BiddingAn example: Real Time Bidding
7. The ad-exchange selects the highest bidder. The winning DSPgives instruction to the ad-exchange to retrieve the ad creative.
8. The ad-exchange passes these instructions to the SSP
9. The SSP send the request to the publisher ad server
10. The publisher ad server responds to the still existing httpconnection of the browser,11. 12. and tells to the browser to go to the agency’s ad serverto download the ad.
An example: Real Time BiddingAn example: Real Time Bidding
Now the ad can be displayed in the browser.
Full process takes < 100ms !Where is data science:
DMP side, to cluster audience into marketing segments and toprofile customers: clustering and classificationBuyer’s side (DSP, ad-network) to compute the price proposedfor RTB. Need to estimate the probability of a click on ads:regression and classification
An example: Real Time BiddingAn example: Real Time Bidding
Some numbers for a large web-advertisement company:10 million prediction of click probability per secondanswers within 10msstores 20Terabytes of data daily
Data Science ecosystemOutline
1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography
Data Science ecosystemA new Context
Data everywhereHuge volume,Huge variety...
Affordable computation unitsCloud computingGraphical Processor Units (GPU)...
Growing academic and industrial interest!
Data Science ecosystemBig Data is (quite) Easy
Example of off the shelves solution
Data Science ecosystemBig Data is (quite) Easy
Example of off the shelves solution
export AWS_ACCESS_KEY_ID=<your-access-keyid>export AWS_SECRET_ACCESS_KEY=<your-access-key-secret>cellule/spark/ec2/sparl-ec2 -i cellule.pem -k cellule -s <number of machines> launch <cluster-name>ssh -i cellule.pem root@<your-cluster-master-dns>spark-ec2/copy-dir ephemeral-hdfs/confephemeral-hdfs/bin/hadoop distcp s3n://celluledecalcul/dataset/raw/train.csv /data/train.csvscp -i cellule.pem cellule/challenge/target/scala-2.10/target/scala-2.10/challenges_2.10-0.0.jar
cellule/spark/bin/spark-submit \--class fr.cc.challenge.Preprocess \challenges_2.10-0.0.jar \/data/train.csv \/data/train2.csv
cellule/spark/bin/spark-submit \--class fr.cc.sparktest.LogisticRegression \challenges_2.10-0.0.jar \/data/train2.csv
⇒ Logistic regression for arbitrary large dataset!
Data Science ecosystemBig Data is (quite) Easy
Example of off the shelves solution
Data Science ecosystemBig Data is (quite) Easy
Example of off the shelves solution
export AWS_ACCESS_KEY_ID=<your-access-keyid>export AWS_SECRET_ACCESS_KEY=<your-access-key-secret>cellule/spark/ec2/sparl-ec2 -i cellule.pem -k cellule -s <number of machines> launch <cluster-name>ssh -i cellule.pem root@<your-cluster-master-dns>spark-ec2/copy-dir ephemeral-hdfs/confephemeral-hdfs/bin/hadoop distcp s3n://celluledecalcul/dataset/raw/train.csv /data/train.csvscp -i cellule.pem cellule/challenge/target/scala-2.10/target/scala-2.10/challenges_2.10-0.0.jar
cellule/spark/bin/spark-submit \--class fr.cc.challenge.Preprocess \challenges_2.10-0.0.jar \/data/train.csv \/data/train2.csv
cellule/spark/bin/spark-submit \--class fr.cc.sparktest.LogisticRegression \challenges_2.10-0.0.jar \/data/train2.csv
⇒ Logistic regression for arbitrary large dataset!
Data Science ecosystemA Complex Ecosystem!
Data Science ecosystemA Complex Ecosystem!
Data cycleOutline
1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography
Data cycleData Cycle
DataAcquisition,
Cleaning andStorage
Integration Analysis Visualizationand Interface
DecisionProcess
Practical Issue
Data/Information flow vision
Goal orientedIterative and interactive process
Data cycleData Cycle
DataAcquisition,
Cleaning andStorage
Integration Analysis Visualizationand Interface
DecisionProcess
Practical Issue
Data/Information flow visionGoal oriented
Iterative and interactive process
Data cycleData
Raw material:Structured and unstructured data (Variety)Data quality issue (Veracity)Quantity (Volume and Velocity)
Various sources:Open data,Proprietary data
Data cycleAcquisition, Cleaning, Storage and Integration
Get the data from the sources.Storage issue and availability for processing.Cleaning and formatingIntegration: Data preparation for analysisTime consuming!
Data cycleAnalysis
Extract information from the dataStatistics/Machine learningBig Data: hardware is the limit (time/volume)
Data cycleVisualization and Interface
Reporting part: Visualization, text...Also used for data explorationVery important aspect!
Data cycleDecision and goal oriented analysis
Better decisions: ValueNeed to answer a problem/question!Need to formalize the problem: no answerwithout a question!Feedback everywhere...
Data cycleReal Data Cycle
DataAcquisition,
Cleaning andStorage
Integration Analysis Visualizationand Interface
DecisionProcess
Practical Issue
Data/Information flow vision
Goal orientedIterative and interactive process
Data cycleReal Data Cycle
DataAcquisition,
Cleaning andStorage
Integration Analysis Visualizationand Interface
DecisionProcess
Practical Issue
Data/Information flow visionGoal oriented
Iterative and interactive process
Data cycleReal Data Cycle
DataAcquisition,
Cleaning andStorage
Integration Analysis Visualizationand Interface
DecisionProcess
Practical Issue
Data/Information flow visionGoal orientedIterative and interactive process
Data cycleDoing Data Science
Doing Data Science: Straight talk from the frontline.Rachel Schutt, Cathy O’NeilO’Reilly
Data Science projectOutline
1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography
Data Science projectA 7 step program
1. Identify the problemType of problems and metric used to measure successIdentify key people within your organization and outsideGet specifications, requirements, priorities, budgetsHow accurate the solution needs to be?Do we need all the data?Outsourcing?
Data Science projectA 7 step program
2. Identify available data sourcesExtract and check sample data / Perform Exploratory DataAnalysisAssess quality of data, and value available in dataIdentify data glitches, find work-aroundData quality improvement?Verify with field expert that you understand the dataInfrastructure?
Data Science projectA 7 step program
2. Identify available data sourcesExtract and check sample data / Perform Exploratory DataAnalysisAssess quality of data, and value available in dataIdentify data glitches, find work-aroundData quality improvement?Verify with field expert that you understand the dataInfrastructure?
3. Identify if additional data sources are neededWhat? How much? How to?Real time?Do we need experimental design?
Data Science projectA 7 step program
4. Data preparation and analysesData preparation and cleaningExplore methodologiesSelect variables and modelsDetect / remove outliersValidate chosen methodologyMeasure accuracy, provide confidence intervalsProvide visualization and ask for feedback
Data Science projectA 7 step program
4. Data preparation and analysesData preparation and cleaningExplore methodologiesSelect variables and modelsDetect / remove outliersValidate chosen methodologyMeasure accuracy, provide confidence intervalsProvide visualization and ask for feedback
5. Implementation, developmentFSSRR: Fast, simple, scalable, robust, re-usableDebuggingNeed to create an API to communicate with other apps?
Data Science projectA 7 step program
6. Communicate resultsIntegration and visualizationDiscuss potential improvements (with cost estimates)Provide trainingCode and methodology documentation
Data Science projectA 7 step program
6. Communicate resultsIntegration and visualizationDiscuss potential improvements (with cost estimates)Provide trainingCode and methodology documentation
7. MaintenanceTest the model or implementation; stress testsRegular updatesOutsourcing?
Data scientistsOutline
1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography
Data scientistsSkills
Business:Business analysis, market knowledge, product usage,. . .
Data Management:Data collection, storage, cleaning, filtering,integration,. . .
Statistic and Machine Learning:Data modeling, inference, prediction, patternrecognition,. . .
Programming:Software development, Large-scale or parallel dataprocessing,. . .
Interface and Data Visualization:HCI design, visualization, story-telling,. . .
Data scientistsProfiles
No one masters all the skills!
Data scientistsData science team
Gather people having different skills
Data scientistsMain types of data scientists
There are the ones...
Strong in statistics: develop new statistical theories for bigdata: statistical modeling, experimental design, sampling,clustering, data reduction, confidence intervals, testing,modeling, predictive modeling, etc.
Strong in mathematics: operations research, analyticbusiness (inventory management and forecasting, pricingoptimization, supply chain, quality control, yield optimization)
Strong in data engineering, Hadoop, database/memory/filesystems optimization and architecture, API’s, Analytics as aService, optimization of data flows
Data scientistsMain types of data scientists
Strong in computer science (algorithms, computationalcomplexity, optimization)
Strong in business, ROI optimization, decision sciences(dashboards design, metric mix selection and metricdefinitions, ROI optimization, high-level database design)
Strong in production code development, software engineering
Data scientistsMore than data scientists?
Big Data, Data Science, StatisticsOutline
1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography
Big Data, Data Science, StatisticsWikipedia
Big Data, Data Science, StatisticsWikipedia
Big data is an all-encompassing term for any collection ofdata sets so large and complex that it becomes difficult toprocess using traditional data processing applications.Data science is the study of the generalizable extraction ofknowledge from data, yet the key word is science.Statistics is the study of the collection, analysis,interpretation, presentation and organization of data.
Big Data, Data Science, StatisticsData science evolution
Main Paradigmatic Changes in Big Data Analytics Environment
Big Analytics >2008 -up to now
(Unconstrained Data Mining)
Data storingLine & column dimensions fixedFlat Files, Hierarchical DBs, &first Relational DBs
Column dimensions fixedSQL DBs: MySQL, DB2, ORACLE &OLAP Cubes
No dimensions fixedNoSQL DBs:Column oriented DBs, object oriented DBs etc.
Basic Analytical Principles
Hypotheses driven mode: Power use of sampling Techniques
Mix Hypotheses driven &Data driven: Dimensions Reduction & Populations Segmentations
Full Data driven mode:Power use of learning techniques, mainly unsupervised
Main Algorithmic approaches
Regression Analysis, Factorial Analysis, Statistical Inference thru sampling, Linear general Models, Decision Trees.Etc.
Clustering (K- means, K Neighbours), Classification & Support Vector Machines Multi layers Neural Nets, Scoring Techniques, Sequential Patterns, etc.
Deep adaptive learning techniques, Auto encoded neural NetsHuge Graph Modularization, & Visual Analytics, Full unsupervised linear Clustering, etc.
New types of Business deliverables
Score Cards, Decisional Models based on sampling
Populations Profiling: CRM, Churn & Attrition Analysis, Loyalty & Propensity Programs,Cross selling
Data types Homogeneous Structured Data (proprietary)
Homogeneous Structured & Homogeneous Unstructured Data, separately
Mix of Heterogeneous Unstructured & Structured Data(proprietary + open data)
VolumeCost/volume Exponential volume increase
Statistical Data Analysis<1985
(Pure Statistical Inference)
Business Intelligence 1985-2008
(Constrained Data Mining)
Exponential cost decrease
Big Data, Data Science, Statistics3Vs of Big Data
Big Data, Data Science, Statistics5Vs of Big Data
Big Data, Data Science, StatisticsData science or statistics?
A vocabulary problem:
data scientist or statistician?
statistics or data science?
Big Data, Data Science, StatisticsData science or statistics?
A possible answer:
Computing and Distributed ComputingOutline
1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography
Computing and Distributed ComputingComputer Architecture
Everything should go through the CPU...
Computing and Distributed ComputingMemories
CPU register 64 b × 16Level 1 cache access 8-128 kbLevel 2 cache access 32-1024 kbLevel 3 cache access 1-8 MBMain memory access 2-16 GBSolid-state disk I/O 250 GB-1 TB (4TB)Rotational disk I/O 500 GB-4 TB
Computing and Distributed ComputingMemories
1 CPU cycle 0.3 ns 1 sLevel 1 cache access 0.9 ns 3 sLevel 2 cache access 2.8 ns 9 sLevel 3 cache access 12.9 ns 43 sMain memory access 120 ns 6 minSolid-state disk I/O 50-150 µs 2-6 daysRotational disk I/O 1-10 ms 1-12 monthsInternet: SF to NYC 40 ms 4 yearsInternet: SF to UK 81 ms 8 yearsInternet: SF to Australia 183 ms 19 yearsOS virtualization reboot 4 s 423 yearsSCSI command time-out 30 s 3000 yearsHardware virtualization reboot 40 s 4000 yearsPhysical system reboot 5 m 32 millenia
Computing and Distributed ComputingDistributed/Parallel Computing
Processor
Memory
Processor
Memory
Processor
Memory
Processor
Memory
Processor
Memory
Processor Processor
(a)
(b)
(c)
Distributed (a/b)Parallel (c)
Computing and Distributed ComputingMultiCore
Several processors/cores with the same shared ram.No too expensive transfer between core.Strategies:
Independent batchParallelization technique limiting information transfer...
System memory limitation!
Computing and Distributed ComputingHadoop and Map/Reduce
Data transfer through disk and networked file system!Hadoop: Node failure handling and ecosystem.
Computing and Distributed ComputingSpark
Strategy: keep everything as much as possible in memory...
Computing and Distributed ComputingGP-GPU
Combine different processor types...CPU < DSP < FPGA < ASICS
Data Science ChallengesOutline
1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography
Data Science ChallengesNew Interdisciplinary Challenges
Applied math AND Computer scienceHuge importance of domain specific knowledge: physics,signal processing, biology, health, marketing...
Some joint math/computer science challengesData acquisitionUnstructured data and their representationHuge dataset and computationHigh dimensional data and model selectionLearning with less supervisionVisualization
Data Science ChallengesData acquisition
Some challengesHow to measure new things?How to choose what to measure?How to deal with distributed sensors?How to look for new sources of informations?
Data Science ChallengesUnstructured Data
Some challengesHow to store efficiently the data?How to describe (model) them to be able to process them?How to combine data of different nature?How to learn dynamics?
Data Science ChallengesHuge Dataset
Some challengesHow to take into account the locality of the data?How to construct distributed architectures?How to design adapted algorithms?
Data Science ChallengesHigh Dimensional Data
Main Paradigmatic Changes in Big Data Analytics Environment
Big Analytics >2008 -up to now
(Unconstrained Data Mining)
Data storingLine & column dimensions fixedFlat Files, Hierarchical DBs, &first Relational DBs
Column dimensions fixedSQL DBs: MySQL, DB2, ORACLE &OLAP Cubes
No dimensions fixedNoSQL DBs:Column oriented DBs, object oriented DBs etc.
Basic Analytical Principles
Hypotheses driven mode: Power use of sampling Techniques
Mix Hypotheses driven &Data driven: Dimensions Reduction & Populations Segmentations
Full Data driven mode:Power use of learning techniques, mainly unsupervised
Main Algorithmic approaches
Regression Analysis, Factorial Analysis, Statistical Inference thru sampling, Linear general Models, Decision Trees.Etc.
Clustering (K- means, K Neighbours), Classification & Support Vector Machines Multi layers Neural Nets, Scoring Techniques, Sequential Patterns, etc.
Deep adaptive learning techniques, Auto encoded neural NetsHuge Graph Modularization, & Visual Analytics, Full unsupervised linear Clustering, etc.
New types of Business deliverables
Score Cards, Decisional Models based on sampling
Populations Profiling: CRM, Churn & Attrition Analysis, Loyalty & Propensity Programs,Cross selling
Data types Homogeneous Structured Data (proprietary)
Homogeneous Structured & Homogeneous Unstructured Data, separately
Mix of Heterogeneous Unstructured & Structured Data(proprietary + open data)
VolumeCost/volume Exponential volume increase
Statistical Data Analysis<1985
(Pure Statistical Inference)
Business Intelligence 1985-2008
(Constrained Data Mining)
Exponential cost decrease
Some challengesHow to describe (model) the data?How to reduce the data dimensionality?How to select/mix models?
Data Science ChallengesLearning and Supervision
Some challengesHow to learn with the less possible interactions?How to learn simultaneously several related tasks?
Data Science ChallengesVisualization
Some challengesHow to look at the data?How to present results?How to help taking better informed decision?
Vocabulary of Data ScienceOutline
1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography
Vocabulary of Data ScienceLots of words
Vocabulary of Data ScienceLots of words
Vocabulary of Data ScienceMain Fields
Data mining. Extract patterns from data by combining methodsfrom statistics, machine learning and data processing technologies.Example: market basket analysis to model the purchase behaviorof customers.
Machine learning. Design and develop algorithms allowingcomputers to learn from data, in order to take intelligent decisionsautomatically. Example: Natural language processing.
Statistics. Collection, organization, and interpretation of data.Mathematical methods to construct quantitative assessments oferrors and risks when taking decisions, estimating parameters anddoing predictions. Example: quantitative assessments ofrelationships between variables, computing confidence intervals formodel parameters, hypothesis testing.
Vocabulary of Data ScienceMain Fields
Natural language processing (NLP). Specialization of machinelearning and linguistics that builds algorithms to analyze human(natural) language. Example: sentiment analysis on socialnetworks.
Network analysis. Characterize relationships among nodes in agraph or a network, understand the communities, the influence ofnodes on the others, understand how information travels in thenetwork. Example: identify key opinion leaders in a social network,identify the information flows in a large company
Predictive modeling. Use of a mathematical model to predict anoutcome, e.g. regression, classification, etc. Example. Predict theprobability that a customer will churn.
Vocabulary of Data ScienceMain Fields
Supervised learning. Machine learning techniques that infer afunction or a relationship from a set of training data. Examples:classification, regression
Unsupervised learning. Machine learning techniques that findsstructure in unlabeled data. Example: clustering is a part ofunsupervised learning
Visualization. Techniques used for representation of data bycreating images, diagrams, animations, in order to communicate,understand, explore and improve understanding of data.
Vocabulary of Data ScienceMachine Learning
Labels. Characteristics / categories of interest in points of data.This is the information one wants to predict in supervised learning.
Features. A set of information about a point of data (a customer,a company, a country, etc.)
Clustering. Algorithms used in unsupervised learning to assign agroup to each data point. Groups are called clusters. Example:customer segmentation in a e-commerce platform
Classification. Algorithms used in supervised learning to predictthe labels of data points. It relies on the training of a model oralgorithm using labeled data. Keywords: Logistic Regression,SVM, CART, Boosting, etc.
Feature selection. Algorithms in machine learning that selectfeatures that best explain an outcome. Example. In biology, findthe genetic informations that best explains a patient’s response toa drug.
Vocabulary of Data ScienceMathematical Concepts
Generalization. Ability of a predictive algorithm to generalize:give good predictive results on a sample different than to one usedto train the algorithm.
Parameters. A set of coefficients (vectors, matrices), thatspecifies a model. Example: the mean and standard deviation of aGaussian distribution
Statistical model. A mathematical formulation that attempts toexplain how data is generated. Example: data is generated by amultivariate Gaussian distribution
Likelihood. The probability that data is generated by a model forsome parameters choice
Vocabulary of Data ScienceMathematical Concepts
Optimization. Study and design of numerical algorithms used to(but not only) minimize or maximize functions. Example:optimization of a likelihood is called the training step in machinelearning.
Goodness-of-fit. A quantity that assesses the closeness of a(trained) model to data. Example. The least-squares error forlinear regression.
Over-fitting. Something that must be avoided to have a goodpredictive performance on new data.
Cross-validation. Splitting of data into several subsets. A modelis trained on a subset and tested on another, to check itsgoodness-of-fit both on data used for training, but on new data aswell.
Vocabulary of Data ScienceData Engineering
Structured data. Data structured in fixed fields. Example:relational databases or excel spreadsheets.
Semi-structured data. Data not structured in fixed fields butcontain markers to separate data elements. Example: XML orHTML-tagged text.
Unstructured data. Data not structured in fixed fields. Example:books, articles, body of e-mail messages, audio, image and videodata, etc.
Metadata. Data that describes the content and context of data:creation, purpose, time and date, author, etc.
Vocabulary of Data ScienceData Engineering
Data fusion and data integration. Set of techniques thatintegrate and analyze data from multiple sources, instead ofanalyzing single sources of data. Example: combine analysis ofsocial network data with NLP and real-time sales data, to assessthe effect of a marketing campaign on customer sentiment andpurchases
Vocabulary of Data ScienceDatabases and Data Procesing
Cloud computing. A computing paradigm where computingresources are configured as a distributed system, which provides aservice through a network.
Distributed system. Several computers, communicating througha network, used to solve a common storing or computationalproblem. Aim is higher performance at a lower cost, higherreliability and scalability.
Relational database. A database consisting of collections oftables (relations), namely data are stored in rows and columns.SQL is the most widely used language for managing relationaldatabases.
Non-relational database. A database that does not store data intables (rows and columns).
Vocabulary of Data ScienceDatabases and Data Procesing
SQL. Acronym for Structured Query Language. It is a computerlanguage designed for managing data in relational databases.Example. insert, query, update, delete data, manage databasestructures, and control access to data in the database.
NoSQL. A group of database management systems. Data is notstored in tables like in Relational database. It does not rely on themathematical relationship between tables. It gives a way of storingand retrieving unstructured data quickly.
Hbase. A distributed, non-relational database. It is managed as aproject of the Apache Software foundation and a part of Hadoop.
Vocabulary of Data ScienceDatabases and Data Procesing
Hadoop. A framework that supports large scale data processingby allowing the decomposition of large tasks into smaller tasks,that are executed in parallel, on independently slices of the dataand then finally merged to answer to the task.
MapReduce. A software framework introduced by Google forprocessing huge datasets on a distributed system. Implemented inHadoop. It supports large scale data processing by decomposinglarge tasks into smaller tasks, executed in parallel, on independentparts of data and finally merged to answer to the task.
Stream processing. Technologies designed to process largereal-time streams of event data. Example: high-frequencyalgorithmic trading, analysis of Twitter data streams
BibliographyOutline
1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography
BibliographyBibliography
T. Hastie, R. Tibshirani, and J. Friedman (2009)The Elements of Statistical LearningSpringer Series in Statistics.
G. James, D. Witten, T. Hastie and R. Tibshirani (2013)An Introduction to Statistical Learning with Applications in RSpringer Series in Statistics.
B. Schölkopf, A. Smola (2002)Learning with kernels.The MIT Press
R. Schutt, and C. O’Neil (2014)Doing Data Science: Straight talk from the frontlineO’Reilly