Data Science Starter Program Introduction to Data Science

Post on 11-Sep-2021

21 views 1 download

Transcript of Data Science Starter Program Introduction to Data Science

Data Science Starter ProgramIntroduction to Data Science

E. Le Pennec, A. Fermin

Spring 2015

Introduction to Data ScienceOutline

1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography

Data Science in the mediaOutline

1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography

Data Science in the mediaLe Monde

Data Science in the mediaNY Times

Data Science in the mediaWorld Bank

Data Science in the mediaCriteo

Data Science in the mediaWe are in the press as well...

Data Science in the mediaData is the new oil?

From Data to ProductOutline

1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography

From Data to ProductWeb search

From Data to ProductRecommendation system

From Data to ProductAdvertisement

From Data to ProductIntrusion detection

From Data to ProductCrime prevention

From Data to ProductMarketing

Marketing Technology Landscape January 2014

INFRA&'

STRU

CTURE

'BA

CKBO

NE'

PLATFORM

S'MIDDLE&'

WARE

'

Databases' Big'Data'

by'Sco?'Brinker'''@chiefmartec'''h?p://chiefmartec.com'

Cloud'

CRM' MarkeNng'AutomaNon'/'Integrated'MarkeNng' Web'Site'/'WCM'/'WEM' E&commerce'

User'Mgmt' Cloud'Connectors' APIs'

MARKETING'EXPERIENCES'

Channel/Local'Mktg'

MarkeNng'Resource'Mgmt'

MARKETING'OPERATIONS'

Agile'&'Project'Mgmt'

Dashboards'

MarkeNng'AnalyNcs'

Business'Intelligence'

Digital'Asset'Mgmt'

MarkeNng'Data'

Sales'Enablement'

Content'MarkeNng'PersonalizaNon'

TesNng'&'OpNmizaNon'

SEO'

MarkeNng'Apps'

Customer'Experience/VoC'

Calls'&'Call'Centers'

Events'&'Webinars'

Loyalty'&'GamificaNon'

Social'Media'MarkeNng'

CommuniNes'&'Reviews'

Video'Ads'&'MarkeNng'

Email'MarkeNng'

Display'AdverNsing'

Search'&'Social'Ads'

Tag'Management'

INTERN

ET'Web'Dev' MarkeNng'Environment'

Data'Management'PlaYorms/Customer'Data'PlaYorms'

Web'&'Mobile'AnalyNcs'

Mobile'App'Dev'

Mobile'MarkeNng'

CreaNve'&'Design'

From Data to ProductHealth

From Data to ProductLinkedin

From Data to ProductSmart city

From Data to ProductSports

From Data to ProductGenomics

From Data to ProductPhysics

An example: Real Time BiddingOutline

1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography

An example: Real Time BiddingAn example: Real Time Bidding

A customer visits a webpage with his browser: a complex processof content selection and delivery begins.

An advertiser might want to display an ad to this customer on thewebpage he is going to

The webpage belongs to a publisher. The publisher deliverscontents: news, music, information, sports, etc. This content drawsan audience

The publisher sells ad space to advertisers who want to reach thataudience

An example: Real Time BiddingAn example: Real Time Bidding

1. 2. The customer visits a publisher’s webpage: the browser opensa connection to the publisher’s content server. It returns thecontent for the page (html code).

The html code describing this content is retrieved by the browser,and it starts to render an interpret it.

But... there is a line in this html code that says “follow this URL toretrieve ad content”

An example: Real Time BiddingAn example: Real Time Bidding

3. The publisher has an ad server: it answers the request byconsidering possibilities: can I put an ad for my premium buyers ?Do I have data about this consumer viewing my content? (it couldhelp me to decide to which buyer I could give this displayopportunity). Only logical rules apply (no machine-learning here).

The ad display opportunity is not premium, and this space or type ofcustomer is not already reserved by a buyer. Publisher’s ad serverputs this opportunity of ad display in the open ad market.

An example: Real Time BiddingAn example: Real Time Bidding

4. The publisher ad server connects to an SSP (Supply-SidePlatform). This platform monetizes its programmable displayinventory.

The SSP asks: have I already seen this consumer before ? Do I haveadditional data on him? The SSP requests extra information to aDMP (Data-Management Platform) about the user: profiling,audience segments, etc. Here machine learning is applied.

5. Using this information, the SSP sends the ad request to anad-exchange

An example: Real Time BiddingAn example: Real Time Bidding

6. Meanwhile, the ad-exchange is connected and exchanges withmany potential buying systems: DSP (Demand-Side Platform),ad-networks, even other ad-exchange networks.

Ad-network and DSP can have pre-cached bid: I’m paying 1$ for1000 displays of 25years-old males in France, I buy 100 displays assoon as the price is below some threshold (like a broker).

If no pre-cached bids, the ad-exchange says: no direct buyer for thisdisplay. Let’s us an auction rule! The RTB (Real Time Bidding)begins.

An example: Real Time BiddingAn example: Real Time Bidding

6. RTB: buyers have 10ms (!) to give a price to thead-exchange. Buyers assess in real-time how willing they are todisplay an ad to this customer.

Machine learning is used here, but only the prediction step, e.g. toassess the probability that the customer will click on some ads. Themodel must contain few parameters to answer quickly: the use offeature selection in the training step is crucial here.

An example: Real Time BiddingAn example: Real Time Bidding

7. The ad-exchange selects the highest bidder. The winning DSPgives instruction to the ad-exchange to retrieve the ad creative.

8. The ad-exchange passes these instructions to the SSP

9. The SSP send the request to the publisher ad server

10. The publisher ad server responds to the still existing httpconnection of the browser,11. 12. and tells to the browser to go to the agency’s ad serverto download the ad.

An example: Real Time BiddingAn example: Real Time Bidding

Now the ad can be displayed in the browser.

Full process takes < 100ms !Where is data science:

DMP side, to cluster audience into marketing segments and toprofile customers: clustering and classificationBuyer’s side (DSP, ad-network) to compute the price proposedfor RTB. Need to estimate the probability of a click on ads:regression and classification

An example: Real Time BiddingAn example: Real Time Bidding

Some numbers for a large web-advertisement company:10 million prediction of click probability per secondanswers within 10msstores 20Terabytes of data daily

Data Science ecosystemOutline

1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography

Data Science ecosystemA new Context

Data everywhereHuge volume,Huge variety...

Affordable computation unitsCloud computingGraphical Processor Units (GPU)...

Growing academic and industrial interest!

Data Science ecosystemBig Data is (quite) Easy

Example of off the shelves solution

Data Science ecosystemBig Data is (quite) Easy

Example of off the shelves solution

export AWS_ACCESS_KEY_ID=<your-access-keyid>export AWS_SECRET_ACCESS_KEY=<your-access-key-secret>cellule/spark/ec2/sparl-ec2 -i cellule.pem -k cellule -s <number of machines> launch <cluster-name>ssh -i cellule.pem root@<your-cluster-master-dns>spark-ec2/copy-dir ephemeral-hdfs/confephemeral-hdfs/bin/hadoop distcp s3n://celluledecalcul/dataset/raw/train.csv /data/train.csvscp -i cellule.pem cellule/challenge/target/scala-2.10/target/scala-2.10/challenges_2.10-0.0.jar

cellule/spark/bin/spark-submit \--class fr.cc.challenge.Preprocess \challenges_2.10-0.0.jar \/data/train.csv \/data/train2.csv

cellule/spark/bin/spark-submit \--class fr.cc.sparktest.LogisticRegression \challenges_2.10-0.0.jar \/data/train2.csv

⇒ Logistic regression for arbitrary large dataset!

Data Science ecosystemBig Data is (quite) Easy

Example of off the shelves solution

Data Science ecosystemBig Data is (quite) Easy

Example of off the shelves solution

export AWS_ACCESS_KEY_ID=<your-access-keyid>export AWS_SECRET_ACCESS_KEY=<your-access-key-secret>cellule/spark/ec2/sparl-ec2 -i cellule.pem -k cellule -s <number of machines> launch <cluster-name>ssh -i cellule.pem root@<your-cluster-master-dns>spark-ec2/copy-dir ephemeral-hdfs/confephemeral-hdfs/bin/hadoop distcp s3n://celluledecalcul/dataset/raw/train.csv /data/train.csvscp -i cellule.pem cellule/challenge/target/scala-2.10/target/scala-2.10/challenges_2.10-0.0.jar

cellule/spark/bin/spark-submit \--class fr.cc.challenge.Preprocess \challenges_2.10-0.0.jar \/data/train.csv \/data/train2.csv

cellule/spark/bin/spark-submit \--class fr.cc.sparktest.LogisticRegression \challenges_2.10-0.0.jar \/data/train2.csv

⇒ Logistic regression for arbitrary large dataset!

Data Science ecosystemA Complex Ecosystem!

Data Science ecosystemA Complex Ecosystem!

Data cycleOutline

1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography

Data cycleData Cycle

DataAcquisition,

Cleaning andStorage

Integration Analysis Visualizationand Interface

DecisionProcess

Practical Issue

Data/Information flow vision

Goal orientedIterative and interactive process

Data cycleData Cycle

DataAcquisition,

Cleaning andStorage

Integration Analysis Visualizationand Interface

DecisionProcess

Practical Issue

Data/Information flow visionGoal oriented

Iterative and interactive process

Data cycleData

Raw material:Structured and unstructured data (Variety)Data quality issue (Veracity)Quantity (Volume and Velocity)

Various sources:Open data,Proprietary data

Data cycleAcquisition, Cleaning, Storage and Integration

Get the data from the sources.Storage issue and availability for processing.Cleaning and formatingIntegration: Data preparation for analysisTime consuming!

Data cycleAnalysis

Extract information from the dataStatistics/Machine learningBig Data: hardware is the limit (time/volume)

Data cycleVisualization and Interface

Reporting part: Visualization, text...Also used for data explorationVery important aspect!

Data cycleDecision and goal oriented analysis

Better decisions: ValueNeed to answer a problem/question!Need to formalize the problem: no answerwithout a question!Feedback everywhere...

Data cycleReal Data Cycle

DataAcquisition,

Cleaning andStorage

Integration Analysis Visualizationand Interface

DecisionProcess

Practical Issue

Data/Information flow vision

Goal orientedIterative and interactive process

Data cycleReal Data Cycle

DataAcquisition,

Cleaning andStorage

Integration Analysis Visualizationand Interface

DecisionProcess

Practical Issue

Data/Information flow visionGoal oriented

Iterative and interactive process

Data cycleReal Data Cycle

DataAcquisition,

Cleaning andStorage

Integration Analysis Visualizationand Interface

DecisionProcess

Practical Issue

Data/Information flow visionGoal orientedIterative and interactive process

Data cycleDoing Data Science

Doing Data Science: Straight talk from the frontline.Rachel Schutt, Cathy O’NeilO’Reilly

Data Science projectOutline

1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography

Data Science projectA 7 step program

1. Identify the problemType of problems and metric used to measure successIdentify key people within your organization and outsideGet specifications, requirements, priorities, budgetsHow accurate the solution needs to be?Do we need all the data?Outsourcing?

Data Science projectA 7 step program

2. Identify available data sourcesExtract and check sample data / Perform Exploratory DataAnalysisAssess quality of data, and value available in dataIdentify data glitches, find work-aroundData quality improvement?Verify with field expert that you understand the dataInfrastructure?

Data Science projectA 7 step program

2. Identify available data sourcesExtract and check sample data / Perform Exploratory DataAnalysisAssess quality of data, and value available in dataIdentify data glitches, find work-aroundData quality improvement?Verify with field expert that you understand the dataInfrastructure?

3. Identify if additional data sources are neededWhat? How much? How to?Real time?Do we need experimental design?

Data Science projectA 7 step program

4. Data preparation and analysesData preparation and cleaningExplore methodologiesSelect variables and modelsDetect / remove outliersValidate chosen methodologyMeasure accuracy, provide confidence intervalsProvide visualization and ask for feedback

Data Science projectA 7 step program

4. Data preparation and analysesData preparation and cleaningExplore methodologiesSelect variables and modelsDetect / remove outliersValidate chosen methodologyMeasure accuracy, provide confidence intervalsProvide visualization and ask for feedback

5. Implementation, developmentFSSRR: Fast, simple, scalable, robust, re-usableDebuggingNeed to create an API to communicate with other apps?

Data Science projectA 7 step program

6. Communicate resultsIntegration and visualizationDiscuss potential improvements (with cost estimates)Provide trainingCode and methodology documentation

Data Science projectA 7 step program

6. Communicate resultsIntegration and visualizationDiscuss potential improvements (with cost estimates)Provide trainingCode and methodology documentation

7. MaintenanceTest the model or implementation; stress testsRegular updatesOutsourcing?

Data scientistsOutline

1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography

Data scientistsSkills

Business:Business analysis, market knowledge, product usage,. . .

Data Management:Data collection, storage, cleaning, filtering,integration,. . .

Statistic and Machine Learning:Data modeling, inference, prediction, patternrecognition,. . .

Programming:Software development, Large-scale or parallel dataprocessing,. . .

Interface and Data Visualization:HCI design, visualization, story-telling,. . .

Data scientistsProfiles

No one masters all the skills!

Data scientistsData science team

Gather people having different skills

Data scientistsMain types of data scientists

There are the ones...

Strong in statistics: develop new statistical theories for bigdata: statistical modeling, experimental design, sampling,clustering, data reduction, confidence intervals, testing,modeling, predictive modeling, etc.

Strong in mathematics: operations research, analyticbusiness (inventory management and forecasting, pricingoptimization, supply chain, quality control, yield optimization)

Strong in data engineering, Hadoop, database/memory/filesystems optimization and architecture, API’s, Analytics as aService, optimization of data flows

Data scientistsMain types of data scientists

Strong in computer science (algorithms, computationalcomplexity, optimization)

Strong in business, ROI optimization, decision sciences(dashboards design, metric mix selection and metricdefinitions, ROI optimization, high-level database design)

Strong in production code development, software engineering

Data scientistsMore than data scientists?

Big Data, Data Science, StatisticsOutline

1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography

Big Data, Data Science, StatisticsWikipedia

Big Data, Data Science, StatisticsWikipedia

Big data is an all-encompassing term for any collection ofdata sets so large and complex that it becomes difficult toprocess using traditional data processing applications.Data science is the study of the generalizable extraction ofknowledge from data, yet the key word is science.Statistics is the study of the collection, analysis,interpretation, presentation and organization of data.

Big Data, Data Science, StatisticsData science evolution

Main Paradigmatic Changes in Big Data Analytics Environment

Big Analytics >2008 -up to now

(Unconstrained Data Mining)

Data storingLine & column dimensions fixedFlat Files, Hierarchical DBs, &first Relational DBs

Column dimensions fixedSQL DBs: MySQL, DB2, ORACLE &OLAP Cubes

No dimensions fixedNoSQL DBs:Column oriented DBs, object oriented DBs etc.

Basic Analytical Principles

Hypotheses driven mode: Power use of sampling Techniques

Mix Hypotheses driven &Data driven: Dimensions Reduction & Populations Segmentations

Full Data driven mode:Power use of learning techniques, mainly unsupervised

Main Algorithmic approaches

Regression Analysis, Factorial Analysis, Statistical Inference thru sampling, Linear general Models, Decision Trees.Etc.

Clustering (K- means, K Neighbours), Classification & Support Vector Machines Multi layers Neural Nets, Scoring Techniques, Sequential Patterns, etc.

Deep adaptive learning techniques, Auto encoded neural NetsHuge Graph Modularization, & Visual Analytics, Full unsupervised linear Clustering, etc.

New types of Business deliverables

Score Cards, Decisional Models based on sampling

Populations Profiling: CRM, Churn & Attrition Analysis, Loyalty & Propensity Programs,Cross selling

Data types Homogeneous Structured Data (proprietary)

Homogeneous Structured & Homogeneous Unstructured Data, separately

Mix of Heterogeneous Unstructured & Structured Data(proprietary + open data)

VolumeCost/volume Exponential volume increase

Statistical Data Analysis<1985

(Pure Statistical Inference)

Business Intelligence 1985-2008

(Constrained Data Mining)

Exponential cost decrease

Big Data, Data Science, Statistics3Vs of Big Data

Big Data, Data Science, Statistics5Vs of Big Data

Big Data, Data Science, StatisticsData science or statistics?

A vocabulary problem:

data scientist or statistician?

statistics or data science?

Big Data, Data Science, StatisticsData science or statistics?

A possible answer:

Computing and Distributed ComputingOutline

1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography

Computing and Distributed ComputingComputer Architecture

Everything should go through the CPU...

Computing and Distributed ComputingMemories

CPU register 64 b × 16Level 1 cache access 8-128 kbLevel 2 cache access 32-1024 kbLevel 3 cache access 1-8 MBMain memory access 2-16 GBSolid-state disk I/O 250 GB-1 TB (4TB)Rotational disk I/O 500 GB-4 TB

Computing and Distributed ComputingMemories

1 CPU cycle 0.3 ns 1 sLevel 1 cache access 0.9 ns 3 sLevel 2 cache access 2.8 ns 9 sLevel 3 cache access 12.9 ns 43 sMain memory access 120 ns 6 minSolid-state disk I/O 50-150 µs 2-6 daysRotational disk I/O 1-10 ms 1-12 monthsInternet: SF to NYC 40 ms 4 yearsInternet: SF to UK 81 ms 8 yearsInternet: SF to Australia 183 ms 19 yearsOS virtualization reboot 4 s 423 yearsSCSI command time-out 30 s 3000 yearsHardware virtualization reboot 40 s 4000 yearsPhysical system reboot 5 m 32 millenia

Computing and Distributed ComputingDistributed/Parallel Computing

Processor

Memory

Processor

Memory

Processor

Memory

Processor

Memory

Processor

Memory

Processor Processor

(a)

(b)

(c)

Distributed (a/b)Parallel (c)

Computing and Distributed ComputingMultiCore

Several processors/cores with the same shared ram.No too expensive transfer between core.Strategies:

Independent batchParallelization technique limiting information transfer...

System memory limitation!

Computing and Distributed ComputingHadoop and Map/Reduce

Data transfer through disk and networked file system!Hadoop: Node failure handling and ecosystem.

Computing and Distributed ComputingSpark

Strategy: keep everything as much as possible in memory...

Computing and Distributed ComputingGP-GPU

Combine different processor types...CPU < DSP < FPGA < ASICS

Data Science ChallengesOutline

1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography

Data Science ChallengesNew Interdisciplinary Challenges

Applied math AND Computer scienceHuge importance of domain specific knowledge: physics,signal processing, biology, health, marketing...

Some joint math/computer science challengesData acquisitionUnstructured data and their representationHuge dataset and computationHigh dimensional data and model selectionLearning with less supervisionVisualization

Data Science ChallengesData acquisition

Some challengesHow to measure new things?How to choose what to measure?How to deal with distributed sensors?How to look for new sources of informations?

Data Science ChallengesUnstructured Data

Some challengesHow to store efficiently the data?How to describe (model) them to be able to process them?How to combine data of different nature?How to learn dynamics?

Data Science ChallengesHuge Dataset

Some challengesHow to take into account the locality of the data?How to construct distributed architectures?How to design adapted algorithms?

Data Science ChallengesHigh Dimensional Data

Main Paradigmatic Changes in Big Data Analytics Environment

Big Analytics >2008 -up to now

(Unconstrained Data Mining)

Data storingLine & column dimensions fixedFlat Files, Hierarchical DBs, &first Relational DBs

Column dimensions fixedSQL DBs: MySQL, DB2, ORACLE &OLAP Cubes

No dimensions fixedNoSQL DBs:Column oriented DBs, object oriented DBs etc.

Basic Analytical Principles

Hypotheses driven mode: Power use of sampling Techniques

Mix Hypotheses driven &Data driven: Dimensions Reduction & Populations Segmentations

Full Data driven mode:Power use of learning techniques, mainly unsupervised

Main Algorithmic approaches

Regression Analysis, Factorial Analysis, Statistical Inference thru sampling, Linear general Models, Decision Trees.Etc.

Clustering (K- means, K Neighbours), Classification & Support Vector Machines Multi layers Neural Nets, Scoring Techniques, Sequential Patterns, etc.

Deep adaptive learning techniques, Auto encoded neural NetsHuge Graph Modularization, & Visual Analytics, Full unsupervised linear Clustering, etc.

New types of Business deliverables

Score Cards, Decisional Models based on sampling

Populations Profiling: CRM, Churn & Attrition Analysis, Loyalty & Propensity Programs,Cross selling

Data types Homogeneous Structured Data (proprietary)

Homogeneous Structured & Homogeneous Unstructured Data, separately

Mix of Heterogeneous Unstructured & Structured Data(proprietary + open data)

VolumeCost/volume Exponential volume increase

Statistical Data Analysis<1985

(Pure Statistical Inference)

Business Intelligence 1985-2008

(Constrained Data Mining)

Exponential cost decrease

Some challengesHow to describe (model) the data?How to reduce the data dimensionality?How to select/mix models?

Data Science ChallengesLearning and Supervision

Some challengesHow to learn with the less possible interactions?How to learn simultaneously several related tasks?

Data Science ChallengesVisualization

Some challengesHow to look at the data?How to present results?How to help taking better informed decision?

Vocabulary of Data ScienceOutline

1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography

Vocabulary of Data ScienceLots of words

Vocabulary of Data ScienceLots of words

Vocabulary of Data ScienceMain Fields

Data mining. Extract patterns from data by combining methodsfrom statistics, machine learning and data processing technologies.Example: market basket analysis to model the purchase behaviorof customers.

Machine learning. Design and develop algorithms allowingcomputers to learn from data, in order to take intelligent decisionsautomatically. Example: Natural language processing.

Statistics. Collection, organization, and interpretation of data.Mathematical methods to construct quantitative assessments oferrors and risks when taking decisions, estimating parameters anddoing predictions. Example: quantitative assessments ofrelationships between variables, computing confidence intervals formodel parameters, hypothesis testing.

Vocabulary of Data ScienceMain Fields

Natural language processing (NLP). Specialization of machinelearning and linguistics that builds algorithms to analyze human(natural) language. Example: sentiment analysis on socialnetworks.

Network analysis. Characterize relationships among nodes in agraph or a network, understand the communities, the influence ofnodes on the others, understand how information travels in thenetwork. Example: identify key opinion leaders in a social network,identify the information flows in a large company

Predictive modeling. Use of a mathematical model to predict anoutcome, e.g. regression, classification, etc. Example. Predict theprobability that a customer will churn.

Vocabulary of Data ScienceMain Fields

Supervised learning. Machine learning techniques that infer afunction or a relationship from a set of training data. Examples:classification, regression

Unsupervised learning. Machine learning techniques that findsstructure in unlabeled data. Example: clustering is a part ofunsupervised learning

Visualization. Techniques used for representation of data bycreating images, diagrams, animations, in order to communicate,understand, explore and improve understanding of data.

Vocabulary of Data ScienceMachine Learning

Labels. Characteristics / categories of interest in points of data.This is the information one wants to predict in supervised learning.

Features. A set of information about a point of data (a customer,a company, a country, etc.)

Clustering. Algorithms used in unsupervised learning to assign agroup to each data point. Groups are called clusters. Example:customer segmentation in a e-commerce platform

Classification. Algorithms used in supervised learning to predictthe labels of data points. It relies on the training of a model oralgorithm using labeled data. Keywords: Logistic Regression,SVM, CART, Boosting, etc.

Feature selection. Algorithms in machine learning that selectfeatures that best explain an outcome. Example. In biology, findthe genetic informations that best explains a patient’s response toa drug.

Vocabulary of Data ScienceMathematical Concepts

Generalization. Ability of a predictive algorithm to generalize:give good predictive results on a sample different than to one usedto train the algorithm.

Parameters. A set of coefficients (vectors, matrices), thatspecifies a model. Example: the mean and standard deviation of aGaussian distribution

Statistical model. A mathematical formulation that attempts toexplain how data is generated. Example: data is generated by amultivariate Gaussian distribution

Likelihood. The probability that data is generated by a model forsome parameters choice

Vocabulary of Data ScienceMathematical Concepts

Optimization. Study and design of numerical algorithms used to(but not only) minimize or maximize functions. Example:optimization of a likelihood is called the training step in machinelearning.

Goodness-of-fit. A quantity that assesses the closeness of a(trained) model to data. Example. The least-squares error forlinear regression.

Over-fitting. Something that must be avoided to have a goodpredictive performance on new data.

Cross-validation. Splitting of data into several subsets. A modelis trained on a subset and tested on another, to check itsgoodness-of-fit both on data used for training, but on new data aswell.

Vocabulary of Data ScienceData Engineering

Structured data. Data structured in fixed fields. Example:relational databases or excel spreadsheets.

Semi-structured data. Data not structured in fixed fields butcontain markers to separate data elements. Example: XML orHTML-tagged text.

Unstructured data. Data not structured in fixed fields. Example:books, articles, body of e-mail messages, audio, image and videodata, etc.

Metadata. Data that describes the content and context of data:creation, purpose, time and date, author, etc.

Vocabulary of Data ScienceData Engineering

Data fusion and data integration. Set of techniques thatintegrate and analyze data from multiple sources, instead ofanalyzing single sources of data. Example: combine analysis ofsocial network data with NLP and real-time sales data, to assessthe effect of a marketing campaign on customer sentiment andpurchases

Vocabulary of Data ScienceDatabases and Data Procesing

Cloud computing. A computing paradigm where computingresources are configured as a distributed system, which provides aservice through a network.

Distributed system. Several computers, communicating througha network, used to solve a common storing or computationalproblem. Aim is higher performance at a lower cost, higherreliability and scalability.

Relational database. A database consisting of collections oftables (relations), namely data are stored in rows and columns.SQL is the most widely used language for managing relationaldatabases.

Non-relational database. A database that does not store data intables (rows and columns).

Vocabulary of Data ScienceDatabases and Data Procesing

SQL. Acronym for Structured Query Language. It is a computerlanguage designed for managing data in relational databases.Example. insert, query, update, delete data, manage databasestructures, and control access to data in the database.

NoSQL. A group of database management systems. Data is notstored in tables like in Relational database. It does not rely on themathematical relationship between tables. It gives a way of storingand retrieving unstructured data quickly.

Hbase. A distributed, non-relational database. It is managed as aproject of the Apache Software foundation and a part of Hadoop.

Vocabulary of Data ScienceDatabases and Data Procesing

Hadoop. A framework that supports large scale data processingby allowing the decomposition of large tasks into smaller tasks,that are executed in parallel, on independently slices of the dataand then finally merged to answer to the task.

MapReduce. A software framework introduced by Google forprocessing huge datasets on a distributed system. Implemented inHadoop. It supports large scale data processing by decomposinglarge tasks into smaller tasks, executed in parallel, on independentparts of data and finally merged to answer to the task.

Stream processing. Technologies designed to process largereal-time streams of event data. Example: high-frequencyalgorithmic trading, analysis of Twitter data streams

BibliographyOutline

1 Data Science in the media2 From Data to Product3 An example: Real Time Bidding4 Data Science ecosystem5 Data cycle6 Data Science project7 Data scientists8 Big Data, Data Science, Statistics9 Computing and Distributed Computing10 Data Science Challenges11 Vocabulary of Data Science12 Bibliography

BibliographyBibliography

T. Hastie, R. Tibshirani, and J. Friedman (2009)The Elements of Statistical LearningSpringer Series in Statistics.

G. James, D. Witten, T. Hastie and R. Tibshirani (2013)An Introduction to Statistical Learning with Applications in RSpringer Series in Statistics.

B. Schölkopf, A. Smola (2002)Learning with kernels.The MIT Press

R. Schutt, and C. O’Neil (2014)Doing Data Science: Straight talk from the frontlineO’Reilly