Data Science Starter Program Introduction to Data Science
![Page 1: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/1.jpg)
Data Science Starter Program
Introduction to Data Science
E. Le Pennec, A. Fermin
Spring 2015
![Page 2: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/2.jpg)
Introduction to Data Science: Outline
1. Data Science in the media
2. From Data to Product
3. An example: Real Time Bidding
4. Data Science ecosystem
5. Data cycle
6. Data Science project
7. Data scientists
8. Big Data, Data Science, Statistics
9. Computing and Distributed Computing
10. Data Science Challenges
11. Vocabulary of Data Science
12. Bibliography
![Page 3: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/3.jpg)
Data Science in the media
![Page 4: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/4.jpg)
Data Science in the media: Le Monde
![Page 5: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/5.jpg)
Data Science in the media: NY Times
![Page 6: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/6.jpg)
Data Science in the media: World Bank
![Page 7: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/7.jpg)
Data Science in the media: Criteo
![Page 8: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/8.jpg)
Data Science in the media: We are in the press as well...
![Page 9: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/9.jpg)
Data Science in the media: Data is the new oil?
![Page 10: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/10.jpg)
From Data to Product
![Page 11: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/11.jpg)
From Data to Product: Web search
![Page 12: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/12.jpg)
From Data to Product: Recommendation system
![Page 13: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/13.jpg)
From Data to Product: Advertisement
![Page 14: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/14.jpg)
From Data to Product: Intrusion detection
![Page 15: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/15.jpg)
From Data to Product: Crime prevention
![Page 16: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/16.jpg)
From Data to Product: Marketing
Marketing Technology Landscape, January 2014, by Scott Brinker (@chiefmartec, http://chiefmartec.com)
[Figure: vendor landscape grouped into Marketing Experiences, Marketing Operations, Middleware, Backbone Platforms, Infrastructure and Internet]
![Page 17: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/17.jpg)
From Data to Product: Health
![Page 18: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/18.jpg)
From Data to Product: LinkedIn
![Page 19: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/19.jpg)
From Data to Product: Smart city
![Page 20: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/20.jpg)
From Data to Product: Sports
![Page 21: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/21.jpg)
From Data to Product: Genomics
![Page 22: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/22.jpg)
From Data to Product: Physics
![Page 23: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/23.jpg)
An example: Real Time Bidding
![Page 24: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/24.jpg)
An example: Real Time Bidding
A customer visits a webpage with his browser: a complex process of content selection and delivery begins.
An advertiser might want to display an ad to this customer on the webpage he is going to.
The webpage belongs to a publisher. The publisher delivers content: news, music, information, sports, etc. This content draws an audience.
The publisher sells ad space to advertisers who want to reach that audience.
![Page 25: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/25.jpg)
An example: Real Time Bidding
1-2. The customer visits a publisher’s webpage: the browser opens a connection to the publisher’s content server, which returns the content for the page (HTML code).
The browser retrieves the HTML code describing this content and starts to interpret and render it.
But... one line in this HTML code says “follow this URL to retrieve ad content”.
![Page 26: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/26.jpg)
An example: Real Time Bidding
3. The publisher has an ad server. It answers the request by considering the possibilities: can I place an ad for one of my premium buyers? Do I have data about the consumer viewing my content? (Such data could help me decide which buyer should get this display opportunity.) Only logical rules apply here (no machine learning).
The ad display opportunity is not premium, and neither this space nor this type of customer is already reserved by a buyer. The publisher’s ad server therefore puts this display opportunity on the open ad market.
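The rule-based decision in step 3 can be illustrated with a minimal sketch. The slide gives no implementation, so the reservation tables, the field names and the `route` function below are hypothetical; the only point is that plain lookups decide the outcome, with no learned model involved.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AdRequest:
    placement: str               # hypothetical ad slot identifier
    user_segment: Optional[str]  # None when the publisher knows nothing about the visitor

# Hypothetical reservation tables; a real ad server would load them from its booking system.
PREMIUM_PLACEMENTS = {"homepage_banner": "premium_buyer_A"}
RESERVED_SEGMENTS = {"sports_fans": "buyer_B"}

def route(request: AdRequest) -> str:
    """Return the buyer holding a reservation, or 'open_market' if there is none."""
    if request.placement in PREMIUM_PLACEMENTS:
        return PREMIUM_PLACEMENTS[request.placement]
    if request.user_segment in RESERVED_SEGMENTS:
        return RESERVED_SEGMENTS[request.user_segment]
    return "open_market"
```

A request for an unreserved slot from an unknown visitor, such as `route(AdRequest("sidebar", None))`, falls through both rules and goes to the open market, which is the case the walkthrough follows.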
![Page 27: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/27.jpg)
An example: Real Time Bidding
4. The publisher’s ad server connects to an SSP (Supply-Side Platform). This platform monetizes the publisher’s programmable display inventory.
The SSP asks: have I seen this consumer before? Do I have additional data on him? The SSP requests extra information about the user from a DMP (Data-Management Platform): profiling, audience segments, etc. Machine learning is applied here.
5. Using this information, the SSP sends the ad request to an ad-exchange.
![Page 28: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/28.jpg)
An example: Real Time Bidding
6. Meanwhile, the ad-exchange is connected to many potential buying systems: DSPs (Demand-Side Platforms), ad-networks, even other ad-exchanges.
Ad-networks and DSPs can have pre-cached bids: I’m paying $1 for 1000 displays to 25-year-old males in France; I buy 100 displays as soon as the price is below some threshold (like a broker).
If there are no pre-cached bids, the ad-exchange says: no direct buyer for this display. Let’s use an auction rule! The RTB (Real Time Bidding) begins.
![Page 29: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/29.jpg)
An example: Real Time Bidding
6. RTB: buyers have 10 ms (!) to give a price to the ad-exchange. Buyers assess in real time how willing they are to display an ad to this customer.
Machine learning is used here, but only its prediction step, e.g. to assess the probability that the customer will click on some ads. The model must contain few parameters to answer quickly: feature selection in the training step is crucial here.
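To see why a small model answers quickly, here is a minimal sketch of the prediction step alone, assuming a logistic model over binary features. The weights and feature names are hypothetical; in practice they would come from an offline training step with feature selection.

```python
import math

# Hypothetical weights kept after feature selection in the training step.
# Every feature not listed here has weight zero and costs nothing at prediction time.
WEIGHTS = {"bias": -4.0, "segment=sports": 1.2, "hour=evening": 0.5, "device=mobile": -0.3}

def click_probability(active_features):
    """Logistic model: sigmoid of the summed weights of the active binary features."""
    score = WEIGHTS["bias"]
    for name in active_features:
        score += WEIGHTS.get(name, 0.0)  # features dropped by selection are ignored
    return 1.0 / (1.0 + math.exp(-score))

p = click_probability(["segment=sports", "hour=evening", "country=FR"])
```

The cost of one prediction is a handful of dictionary lookups and one exponential, so it grows with the number of active features rather than with the size of the training data.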
![Page 30: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/30.jpg)
An example: Real Time Bidding
7. The ad-exchange selects the highest bidder. The winning DSP gives the ad-exchange instructions for retrieving the ad creative.
8. The ad-exchange passes these instructions to the SSP.
9. The SSP sends the request to the publisher’s ad server.
10. The publisher’s ad server responds over the browser’s still-open HTTP connection
11-12. and tells the browser to go to the agency’s ad server to download the ad.
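Step 7 can be sketched as follows, assuming each bid arrives as a hypothetical `(buyer, price, response_time_ms)` tuple and that bids arriving after the deadline are discarded. The slides only say that the highest bidder is selected; how the winner is charged is not specified and is left out here.

```python
def select_winner(bids, deadline_ms=10.0):
    """Return (buyer, price) for the highest on-time bid, or None if no bid arrived in time."""
    on_time = [(buyer, price) for buyer, price, elapsed_ms in bids if elapsed_ms <= deadline_ms]
    if not on_time:
        return None
    return max(on_time, key=lambda bid: bid[1])

# Hypothetical bids: dsp_b offers the most but answers too late.
bids = [("dsp_a", 1.20, 4.0), ("dsp_b", 1.50, 12.0), ("network_c", 0.90, 7.5)]
winner = select_winner(bids)
```

In this example `dsp_a` wins even though `dsp_b` offered more, because a price that arrives after the deadline never enters the auction.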
![Page 31: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/31.jpg)
An example: Real Time Bidding
Now the ad can be displayed in the browser.
The full process takes < 100 ms!
Where is the data science?
- DMP side, to cluster the audience into marketing segments and to profile customers: clustering and classification.
- Buyer’s side (DSP, ad-network), to compute the price proposed for RTB. This requires estimating the probability of a click on ads: regression and classification.
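One common way to turn a click probability into a price is to bid the expected value of the display. This is an assumption made for illustration, not a rule stated in the slides; `value_per_click` and `margin` are hypothetical parameters.

```python
def bid_price(p_click, value_per_click, margin=0.0):
    """Expected value of one display, reduced by an optional margin in [0, 1)."""
    if not 0.0 <= p_click <= 1.0:
        raise ValueError("p_click must be a probability")
    if not 0.0 <= margin < 1.0:
        raise ValueError("margin must be in [0, 1)")
    return p_click * value_per_click * (1.0 - margin)

# A 0.2% click probability and a click worth 0.50 give a bid of 0.001 per display.
price = bid_price(0.002, 0.50)
```

The quality of the bid then depends directly on how well the click probability is estimated, which is why the buyer's side relies on regression and classification.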
![Page 32: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/32.jpg)
An example: Real Time Bidding
Some numbers for a large web-advertisement company:
- 10 million predictions of click probability per second
- answers within 10 ms
- stores 20 terabytes of data daily
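A quick back-of-the-envelope calculation puts these figures in perspective. It assumes 1 TB = 10^12 bytes and a load spread evenly over the day, neither of which is stated in the slides.

```python
PREDICTIONS_PER_SECOND = 10_000_000
DEADLINE_MS = 10
BYTES_PER_DAY = 20 * 10**12        # 20 TB, assuming 1 TB = 10**12 bytes
SECONDS_PER_DAY = 24 * 60 * 60

# Total predictions served in one day.
predictions_per_day = PREDICTIONS_PER_SECOND * SECONDS_PER_DAY

# Requests being processed at any instant if each takes the full 10 ms (rate x latency).
in_flight = PREDICTIONS_PER_SECOND * DEADLINE_MS // 1000

# Sustained write rate needed to store 20 TB per day, in MB/s.
storage_rate_mb_s = BYTES_PER_DAY / SECONDS_PER_DAY / 10**6
```

Under these assumptions the system serves 864 billion predictions per day, holds up to 100,000 requests in flight at once, and writes roughly 231 MB every second.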
![Page 33: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/33.jpg)
Data Science ecosystem
![Page 34: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/34.jpg)
Data Science ecosystem: A new Context
Data everywhere
- Huge volume
- Huge variety
- ...
Affordable computation units
- Cloud computing
- Graphics Processing Units (GPU)
- ...
Growing academic and industrial interest!
![Page 35: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/35.jpg)
Data Science ecosystem: Big Data is (quite) Easy
Example of an off-the-shelf solution
![Page 36: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/36.jpg)
Data Science ecosystem: Big Data is (quite) Easy
Example of an off-the-shelf solution
export AWS_ACCESS_KEY_ID=<your-access-keyid>
export AWS_SECRET_ACCESS_KEY=<your-access-key-secret>
cellule/spark/ec2/spark-ec2 -i cellule.pem -k cellule -s <number of machines> launch <cluster-name>
ssh -i cellule.pem root@<your-cluster-master-dns>
spark-ec2/copy-dir ephemeral-hdfs/conf
ephemeral-hdfs/bin/hadoop distcp s3n://celluledecalcul/dataset/raw/train.csv /data/train.csv
scp -i cellule.pem cellule/challenge/target/scala-2.10/target/scala-2.10/challenges_2.10-0.0.jar

cellule/spark/bin/spark-submit \
  --class fr.cc.challenge.Preprocess \
  challenges_2.10-0.0.jar \
  /data/train.csv \
  /data/train2.csv

cellule/spark/bin/spark-submit \
  --class fr.cc.sparktest.LogisticRegression \
  challenges_2.10-0.0.jar \
  /data/train2.csv

⇒ Logistic regression for arbitrarily large datasets!
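The slides do not show what the `fr.cc.sparktest.LogisticRegression` class does internally. As a single-machine illustration of why logistic regression copes with very large inputs, here is a minimal sketch of stochastic gradient descent that reads one row at a time, so memory use does not grow with the dataset. It is not Spark's algorithm, and the toy data generator is hypothetical.

```python
import math

def sigmoid(z):
    # Clamp the score to avoid overflow in exp for extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(rows, n_features, learning_rate=0.1, epochs=1):
    """rows: callable returning a fresh iterator of (features, label) pairs, label in {0, 1}."""
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for x, y in rows():
            # Gradient of the log-loss for one example is (prediction - label) * feature.
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            bias -= learning_rate * error
            for j, xi in enumerate(x):
                weights[j] -= learning_rate * error * xi
    return weights, bias

def toy_rows():
    # Hypothetical one-feature data: the label is 1 exactly when the feature is positive.
    for i in range(200):
        x = (i % 20 - 9.5) / 10.0
        yield [x], 1 if x > 0 else 0

weights, bias = sgd_logistic(toy_rows, n_features=1, epochs=20)
```

In a real setting `rows` would stream records from a file or a distributed store instead of a generator; only the weight vector has to stay in memory.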
![Page 37: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/37.jpg)
Data Science ecosystem: Big Data is (quite) Easy
Example of an off-the-shelf solution
![Page 38: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/38.jpg)
![Page 39: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/39.jpg)
Data Science ecosystem: A Complex Ecosystem!
![Page 40: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/40.jpg)
Data Science ecosystem: A Complex Ecosystem!
![Page 41: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/41.jpg)
Data cycle
![Page 42: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/42.jpg)
Data cycle: Data Cycle
[Diagram: Data; Acquisition, Cleaning and Storage; Integration; Analysis; Visualization and Interface; Decision Process; Practical Issue]
- Data/Information flow vision
- Goal oriented
- Iterative and interactive process
![Page 43: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/43.jpg)
![Page 44: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/44.jpg)
Data cycle: Data
Raw material:
- Structured and unstructured data (Variety)
- Data quality issue (Veracity)
- Quantity (Volume and Velocity)
Various sources:
- Open data
- Proprietary data
![Page 45: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/45.jpg)
Data cycle: Acquisition, Cleaning, Storage and Integration
- Get the data from the sources.
- Storage issue and availability for processing.
- Cleaning and formatting.
- Integration: data preparation for analysis.
- Time consuming!
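As a small illustration of the cleaning and formatting step, the sketch below trims whitespace, normalises country codes, turns missing or invalid ages into `None`, and drops exact duplicates. The column names and the sample records are hypothetical.

```python
import csv
import io

# Hypothetical raw extract with stray spaces, a missing age, an invalid age and a duplicate row.
RAW = """user_id,age,country
1, 25 ,FR
2,,fr
3,abc,DE
1, 25 ,FR
"""

def clean(rows):
    """Yield (user_id, age, country) tuples, cleaned and de-duplicated."""
    seen = set()
    for row in rows:
        age_text = row["age"].strip()
        record = (
            int(row["user_id"]),
            int(age_text) if age_text.isdigit() else None,  # missing or invalid age becomes None
            row["country"].strip().upper(),
        )
        if record not in seen:
            seen.add(record)
            yield record

cleaned = list(clean(csv.DictReader(io.StringIO(RAW))))
```

Even on four rows, three different problems appear, which hints at why this stage is the time-consuming one.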
![Page 46: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/46.jpg)
Data cycle: Analysis
- Extract information from the data
- Statistics/Machine learning
- Big Data: hardware is the limit (time/volume)
![Page 47: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/47.jpg)
Data cycle: Visualization and Interface
- Reporting part: visualization, text...
- Also used for data exploration
- Very important aspect!
![Page 48: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/48.jpg)
Data cycle: Decision and goal oriented analysis
- Better decisions: Value
- Need to answer a problem/question!
- Need to formalize the problem: no answer without a question!
- Feedback everywhere...
![Page 49: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/49.jpg)
Data cycle: Real Data Cycle
[Diagram: Data; Acquisition, Cleaning and Storage; Integration; Analysis; Visualization and Interface; Decision Process; Practical Issue]
- Data/Information flow vision
- Goal oriented
- Iterative and interactive process
![Page 50: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/50.jpg)
![Page 51: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/51.jpg)
![Page 52: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/52.jpg)
Data cycle: Doing Data Science
Doing Data Science: Straight Talk from the Frontline. Rachel Schutt, Cathy O’Neil. O’Reilly.
![Page 53: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/53.jpg)
Data Science project
![Page 54: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/54.jpg)
Data Science project: A 7 step program
1. Identify the problem
- Type of problem and metric used to measure success
- Identify key people within your organization and outside
- Get specifications, requirements, priorities, budgets
- How accurate does the solution need to be?
- Do we need all the data?
- Outsourcing?
![Page 55: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/55.jpg)
Data Science project: A 7 step program
2. Identify available data sources
- Extract and check sample data / perform Exploratory Data Analysis
- Assess the quality of the data and the value available in it
- Identify data glitches, find workarounds
- Data quality improvement?
- Verify with a field expert that you understand the data
- Infrastructure?
![Page 56: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/56.jpg)
Data Science project: A 7 step program
3. Identify if additional data sources are needed
- What? How much? How to?
- Real time?
- Do we need experimental design?
![Page 57: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/57.jpg)
Data Science project: A 7 step program
4. Data preparation and analyses
- Data preparation and cleaning
- Explore methodologies
- Select variables and models
- Detect / remove outliers
- Validate chosen methodology
- Measure accuracy, provide confidence intervals
- Provide visualization and ask for feedback
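For the "measure accuracy, provide confidence intervals" item, one simple option is the normal-approximation (Wald) interval for a proportion, computed on a held-out test set. The slides do not prescribe a method, so this choice and the toy labels below are assumptions; the approximation is poor for small samples or accuracies near 0 or 1.

```python
import math

def accuracy_with_ci(y_true, y_pred, z=1.96):
    """Accuracy with a normal-approximation interval (z=1.96 gives about 95% coverage)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("need two non-empty sequences of equal length")
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / n
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)

# Hypothetical held-out labels and predictions: 100 examples, 80 of them correct.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 10
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1] * 10
acc, low, high = accuracy_with_ci(y_true, y_pred)
```

With 100 test examples the interval is about 0.72 to 0.88, a reminder that an accuracy of 0.8 reported without its interval says little on a small test set.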
![Page 58: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/58.jpg)
Data Science project: A 7 step program
5. Implementation, development
- FSSRR: Fast, simple, scalable, robust, re-usable
- Debugging
- Need to create an API to communicate with other apps?
![Page 59: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/59.jpg)
Data Science project: A 7 step program
6. Communicate results
- Integration and visualization
- Discuss potential improvements (with cost estimates)
- Provide training
- Code and methodology documentation
![Page 60: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/60.jpg)
Data Science project: A 7 step program
7. Maintenance
- Test the model or implementation; stress tests
- Regular updates
- Outsourcing?
![Page 61: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/61.jpg)
Data scientists
![Page 62: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/62.jpg)
Data scientists: Skills
- Business: business analysis, market knowledge, product usage, ...
- Data Management: data collection, storage, cleaning, filtering, integration, ...
- Statistics and Machine Learning: data modeling, inference, prediction, pattern recognition, ...
- Programming: software development, large-scale or parallel data processing, ...
- Interface and Data Visualization: HCI design, visualization, story-telling, ...
![Page 63: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/63.jpg)
Data scientists: Profiles
No one masters all the skills!
![Page 64: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/64.jpg)
Data scientists: Data science team
Gather people with different skills
![Page 65: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/65.jpg)
Data scientists: Main types of data scientists
There are those who are...
- Strong in statistics: they develop new statistical theories for big data (statistical modeling, experimental design, sampling, clustering, data reduction, confidence intervals, testing, predictive modeling, etc.)
- Strong in mathematics: operations research, analytic business (inventory management and forecasting, pricing optimization, supply chain, quality control, yield optimization)
- Strong in data engineering: Hadoop, database/memory/file systems optimization and architecture, APIs, Analytics as a Service, optimization of data flows
![Page 66: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/66.jpg)
Data scientists: Main types of data scientists
- Strong in computer science (algorithms, computational complexity, optimization)
- Strong in business: ROI optimization, decision sciences (dashboard design, metric mix selection and metric definitions, high-level database design)
- Strong in production code development, software engineering
![Page 67: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/67.jpg)
Data scientists: More than data scientists?
![Page 68: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/68.jpg)
Big Data, Data Science, Statistics
![Page 69: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/69.jpg)
Big Data, Data Science, Statistics: Wikipedia
![Page 70: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/70.jpg)
Big Data, Data Science, Statistics: Wikipedia
- Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.
- Data science is the study of the generalizable extraction of knowledge from data, yet the key word is science.
- Statistics is the study of the collection, analysis, interpretation, presentation and organization of data.
![Page 71: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/71.jpg)
Big Data, Data Science, Statistics: Data science evolution
Main Paradigmatic Changes in Big Data Analytics Environment

| | Statistical Data Analysis, <1985 (Pure Statistical Inference) | Business Intelligence, 1985-2008 (Constrained Data Mining) | Big Analytics, >2008 up to now (Unconstrained Data Mining) |
|---|---|---|---|
| Data storing | Line & column dimensions fixed. Flat files, hierarchical DBs & first relational DBs | Column dimensions fixed. SQL DBs: MySQL, DB2, ORACLE & OLAP cubes | No dimensions fixed. NoSQL DBs: column-oriented DBs, object-oriented DBs, etc. |
| Basic analytical principles | Hypotheses-driven mode: power use of sampling techniques | Mix of hypotheses-driven & data-driven: dimension reduction & population segmentation | Fully data-driven mode: power use of learning techniques, mainly unsupervised |
| Main algorithmic approaches | Regression analysis, factorial analysis, statistical inference through sampling, general linear models, decision trees, etc. | Clustering (K-means, K neighbours), classification & support vector machines, multi-layer neural nets, scoring techniques, sequential patterns, etc. | Deep adaptive learning techniques, auto-encoded neural nets, huge graph modularization & visual analytics, fully unsupervised linear clustering, etc. |
| New types of business deliverables | Score cards, decisional models based on sampling | Population profiling: CRM, churn & attrition analysis, loyalty & propensity programs, cross-selling | |
| Data types | Homogeneous structured data (proprietary) | Homogeneous structured & homogeneous unstructured data, separately | Mix of heterogeneous unstructured & structured data (proprietary + open data) |

Volume and cost/volume over the three periods: exponential volume increase, exponential cost decrease.
![Page 72: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/72.jpg)
Big Data, Data Science, Statistics: 3Vs of Big Data
![Page 73: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/73.jpg)
Big Data, Data Science, Statistics: 5Vs of Big Data
![Page 74: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/74.jpg)
Big Data, Data Science, Statistics: Data science or statistics?
A vocabulary problem:
data scientist or statistician?
statistics or data science?
![Page 75: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/75.jpg)
Big Data, Data Science, Statistics: Data science or statistics?
A possible answer:
![Page 76: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/76.jpg)
![Page 77: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/77.jpg)
Computing and Distributed Computing: Computer Architecture
Everything should go through the CPU...
![Page 78: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/78.jpg)
Computing and Distributed Computing: Memories

| Memory level | Typical size |
|---|---|
| CPU registers | 64 b × 16 |
| Level 1 cache | 8-128 KB |
| Level 2 cache | 32-1024 KB |
| Level 3 cache | 1-8 MB |
| Main memory | 2-16 GB |
| Solid-state disk | 250 GB-1 TB (4 TB) |
| Rotational disk | 500 GB-4 TB |
![Page 79: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/79.jpg)
Computing and Distributed Computing: Memories

| Event | Latency | Scaled (1 CPU cycle = 1 s) |
|---|---|---|
| 1 CPU cycle | 0.3 ns | 1 s |
| Level 1 cache access | 0.9 ns | 3 s |
| Level 2 cache access | 2.8 ns | 9 s |
| Level 3 cache access | 12.9 ns | 43 s |
| Main memory access | 120 ns | 6 min |
| Solid-state disk I/O | 50-150 µs | 2-6 days |
| Rotational disk I/O | 1-10 ms | 1-12 months |
| Internet: SF to NYC | 40 ms | 4 years |
| Internet: SF to UK | 81 ms | 8 years |
| Internet: SF to Australia | 183 ms | 19 years |
| OS virtualization reboot | 4 s | 423 years |
| SCSI command time-out | 30 s | 3000 years |
| Hardware virtualization reboot | 40 s | 4000 years |
| Physical system reboot | 5 min | 32 millennia |
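The scaled column above can be reproduced with a little arithmetic: divide each latency by the duration of one CPU cycle (0.3 ns), so that one cycle lasts one second. The following is a minimal sketch; the function names `scaled_seconds` and `humanize` are illustrative choices, and only a few rows of the table are included.

```python
# Rescale hardware latencies so that one CPU cycle lasts one second.
CPU_CYCLE_S = 0.3e-9  # 0.3 ns, the first row of the latency table

# A few latencies from the table, expressed in seconds.
LATENCIES_S = {
    "Level 1 cache access": 0.9e-9,
    "Main memory access": 120e-9,
    "Rotational disk I/O (10 ms)": 10e-3,
    "Internet: SF to NYC": 40e-3,
    "OS virtualization reboot": 4.0,
}


def scaled_seconds(latency_s, cycle_s=CPU_CYCLE_S):
    """Return the latency in seconds on a scale where one CPU cycle = 1 s."""
    return latency_s / cycle_s


def humanize(seconds):
    """Pick a readable unit (illustrative helper, not part of the slides)."""
    for unit, size in (("years", 365 * 86400), ("days", 86400), ("min", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} s"


for name, latency in LATENCIES_S.items():
    print(f"{name:30s} {humanize(scaled_seconds(latency))}")
```

The results match the table up to rounding; the point of the rescaling is that a disk or network access costs a program what months or years would cost a person.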
![Page 80: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/80.jpg)
Computing and Distributed Computing: Distributed/Parallel Computing
[Figure: three processor/memory layouts, labeled (a), (b) and (c). Distributed: (a) and (b). Parallel: (c).]
![Page 81: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/81.jpg)
Computing and Distributed Computing: MultiCore
Several processors/cores sharing the same RAM.
Transfers between cores are not too expensive.
Strategies:
- Independent batches
- Parallelization techniques limiting information transfer...
System memory limitation!
![Page 82: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/82.jpg)
Computing and Distributed Computing: Hadoop and Map/Reduce
Data transfer through disk and networked file system!
Hadoop: node failure handling and ecosystem.
![Page 83: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/83.jpg)
Computing and Distributed Computing: Spark
Strategy: keep everything as much as possible in memory...
![Page 84: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/84.jpg)
Computing and Distributed Computing: GP-GPU
Combine different processor types...
CPU < DSP < FPGA < ASIC
![Page 85: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/85.jpg)
![Page 86: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/86.jpg)
Data Science Challenges: New Interdisciplinary Challenges
Applied math AND computer science
Huge importance of domain-specific knowledge: physics, signal processing, biology, health, marketing...
Some joint math/computer science challenges:
- Data acquisition
- Unstructured data and their representation
- Huge datasets and computation
- High dimensional data and model selection
- Learning with less supervision
- Visualization
![Page 87: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/87.jpg)
Data Science Challenges: Data acquisition
Some challenges:
- How to measure new things?
- How to choose what to measure?
- How to deal with distributed sensors?
- How to look for new sources of information?
![Page 88: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/88.jpg)
Data Science Challenges: Unstructured Data
Some challenges:
- How to store the data efficiently?
- How to describe (model) them to be able to process them?
- How to combine data of different natures?
- How to learn dynamics?
![Page 89: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/89.jpg)
Data Science Challenges: Huge Dataset
Some challenges:
- How to take into account the locality of the data?
- How to construct distributed architectures?
- How to design adapted algorithms?
![Page 90: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/90.jpg)
Data Science Challenges: High Dimensional Data
Some challenges:
- How to describe (model) the data?
- How to reduce the data dimensionality?
- How to select/mix models?
![Page 91: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/91.jpg)
Data Science Challenges: Learning and Supervision
Some challenges:
- How to learn with the fewest possible interactions?
- How to learn several related tasks simultaneously?
![Page 92: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/92.jpg)
Data Science Challenges: Visualization
Some challenges:
- How to look at the data?
- How to present results?
- How to help make better-informed decisions?
![Page 93: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/93.jpg)
![Page 94: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/94.jpg)
Vocabulary of Data Science: Lots of words
![Page 95: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/95.jpg)
Vocabulary of Data Science: Lots of words
![Page 96: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/96.jpg)
Vocabulary of Data Science: Main Fields
Data mining. Extract patterns from data by combining methods from statistics, machine learning and data processing technologies. Example: market basket analysis to model the purchase behavior of customers.
Machine learning. Design and develop algorithms allowing computers to learn from data, in order to make intelligent decisions automatically. Example: natural language processing.
Statistics. Collection, organization, and interpretation of data. Mathematical methods to construct quantitative assessments of errors and risks when making decisions, estimating parameters and making predictions. Example: quantitative assessments of relationships between variables, computing confidence intervals for model parameters, hypothesis testing.
![Page 97: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/97.jpg)
Vocabulary of Data Science: Main Fields
Natural language processing (NLP). Specialization of machine learning and linguistics that builds algorithms to analyze human (natural) language. Example: sentiment analysis on social networks.
Network analysis. Characterize relationships among nodes in a graph or a network, understand the communities and the influence of nodes on one another, and understand how information travels in the network. Example: identify key opinion leaders in a social network, identify the information flows in a large company.
Predictive modeling. Use of a mathematical model to predict an outcome, e.g. regression, classification, etc. Example: predict the probability that a customer will churn.
![Page 98: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/98.jpg)
Vocabulary of Data Science: Main Fields
Supervised learning. Machine learning techniques that infer a function or a relationship from a set of labeled training data. Examples: classification, regression.
Unsupervised learning. Machine learning techniques that find structure in unlabeled data. Example: clustering.
Visualization. Techniques for representing data through images, diagrams and animations, in order to explore the data, understand it and communicate about it.
![Page 99: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/99.jpg)
Vocabulary of Data Science: Machine Learning
Labels. Characteristics / categories of interest attached to data points. This is the information one wants to predict in supervised learning.
Features. A set of information about a data point (a customer, a company, a country, etc.).
Clustering. Algorithms used in unsupervised learning to assign a group to each data point. Groups are called clusters. Example: customer segmentation in an e-commerce platform.
Classification. Algorithms used in supervised learning to predict the labels of data points. It relies on the training of a model or algorithm using labeled data. Keywords: Logistic Regression, SVM, CART, Boosting, etc.
Feature selection. Algorithms in machine learning that select the features that best explain an outcome. Example: in biology, find the genetic information that best explains a patient's response to a drug.
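To make the clustering entry concrete, here is a minimal sketch of K-means (Lloyd's algorithm) on one-dimensional data. K-means is only one possible clustering algorithm; the data, the initial centers and the function name `kmeans_1d` are hypothetical and chosen purely for illustration.

```python
def kmeans_1d(points, centers, n_iter=10):
    """Lloyd's algorithm on scalars; `centers` holds the initial guesses."""
    centers = list(centers)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda k: abs(x - centers[k]))
            clusters[nearest].append(x)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[k]
            for k, c in enumerate(clusters)
        ]
    # Final labels: index of the nearest center for each point.
    labels = [
        min(range(len(centers)), key=lambda k: abs(x - centers[k]))
        for x in points
    ]
    return centers, labels


# Hypothetical feature: yearly spend of six customers.
spend = [12.0, 15.0, 14.0, 95.0, 102.0, 99.0]
centers, labels = kmeans_1d(spend, centers=[0.0, 50.0])
print(centers, labels)
```

No labels are supplied: the two customer segments emerge from the features alone, which is what distinguishes clustering from classification.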
![Page 100: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/100.jpg)
Vocabulary of Data Science: Mathematical Concepts
Generalization. Ability of a predictive algorithm to give good predictive results on a sample different from the one used to train it.
Parameters. A set of coefficients (vectors, matrices) that specifies a model. Example: the mean and standard deviation of a Gaussian distribution.
Statistical model. A mathematical formulation that attempts to explain how data is generated. Example: data is generated by a multivariate Gaussian distribution.
Likelihood. The probability (or probability density) of the observed data under a model, for a given choice of parameters.
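The parameter, statistical model and likelihood entries fit together in a short computation. The sketch below assumes a one-dimensional Gaussian model with independent observations; the data values and the function name `gaussian_log_likelihood` are hypothetical.

```python
import math


def gaussian_log_likelihood(data, mean, std):
    """Log-likelihood of i.i.d. data under a Gaussian with the given parameters."""
    n = len(data)
    squared = sum((x - mean) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * std ** 2) - squared / (2 * std ** 2)


data = [4.8, 5.1, 5.3, 4.9, 5.4]  # hypothetical measurements

# Maximum-likelihood estimates for a Gaussian:
# the sample mean and the (biased) sample standard deviation.
mean_hat = sum(data) / len(data)
std_hat = math.sqrt(sum((x - mean_hat) ** 2 for x in data) / len(data))

print(gaussian_log_likelihood(data, mean_hat, std_hat))
print(gaussian_log_likelihood(data, 0.0, 1.0))  # a poor parameter choice scores lower
```

The logarithm is used because products of many small densities underflow quickly; maximizing the log-likelihood gives the same parameters as maximizing the likelihood itself.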
![Page 101: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/101.jpg)
Vocabulary of Data Science: Mathematical Concepts
Optimization. Study and design of numerical algorithms used, among other things, to minimize or maximize functions. Example: in machine learning, the optimization of a likelihood is called the training step.
Goodness-of-fit. A quantity that assesses the closeness of a (trained) model to data. Example: the least-squares error for linear regression.
Over-fitting. Fitting the training data too closely, noise included, so that the model predicts poorly on new data. It must be avoided to obtain good predictive performance.
Cross-validation. Splitting of data into several subsets. A model is trained on one subset and tested on another, to check its goodness-of-fit not only on the data used for training but also on new data.
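Goodness-of-fit and cross-validation can be combined in a few lines. The sketch below fits a straight line by least squares and estimates its error on held-out data with k folds. The fold layout (every k-th point held out), the data and all function names are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b (closed form, one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def mean_squared_error(model, xs, ys):
    """Goodness-of-fit: average squared gap between prediction and data."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_cv(xs, ys, k=3):
    """Average test error over k folds; fold i holds out points i, i+k, i+2k, ..."""
    errors = []
    for i in range(k):
        test_idx = set(range(i, len(xs), k))
        pairs = list(enumerate(zip(xs, ys)))
        train = [p for j, p in pairs if j not in test_idx]
        test = [p for j, p in pairs if j in test_idx]
        model = fit_line(*zip(*train))  # train on the remaining folds
        errors.append(mean_squared_error(model, *zip(*test)))  # test on the held-out fold
    return sum(errors) / k


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8, 11.1]  # hypothetical, roughly y = 2x + 1

print("training error:", mean_squared_error(fit_line(xs, ys), xs, ys))
print("cross-validated error:", k_fold_cv(xs, ys))
```

A training error much smaller than the cross-validated error is the usual symptom of over-fitting.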
![Page 102: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/102.jpg)
Vocabulary of Data Science: Data Engineering
Structured data. Data structured in fixed fields. Example: relational databases or Excel spreadsheets.
Semi-structured data. Data not structured in fixed fields but containing markers that separate data elements. Example: XML or HTML-tagged text.
Unstructured data. Data not structured in fixed fields. Example: books, articles, body of e-mail messages, audio, image and video data, etc.
Metadata. Data that describes the content and context of data: creation, purpose, time and date, author, etc.
![Page 103: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/103.jpg)
Vocabulary of Data Science: Data Engineering
Data fusion and data integration. Set of techniques that integrate and analyze data from multiple sources, instead of analyzing single sources of data. Example: combine analysis of social network data with NLP and real-time sales data, to assess the effect of a marketing campaign on customer sentiment and purchases.
![Page 104: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/104.jpg)
Vocabulary of Data Science: Databases and Data Processing
Cloud computing. A computing paradigm where computing resources are configured as a distributed system, which provides a service through a network.
Distributed system. Several computers, communicating through a network, used to solve a common storage or computational problem. The aim is higher performance at a lower cost, with higher reliability and scalability.
Relational database. A database consisting of collections of tables (relations); that is, data are stored in rows and columns. SQL is the most widely used language for managing relational databases.
Non-relational database. A database that does not store data in tables (rows and columns).
![Page 105: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/105.jpg)
Vocabulary of Data Science: Databases and Data Processing
SQL. Acronym for Structured Query Language. It is a computer language designed for managing data in relational databases. Examples: inserting, querying, updating and deleting data, managing database structures, and controlling access to data in the database.
NoSQL. A group of database management systems in which data is not stored in tables as in a relational database. It does not rely on the mathematical relationship between tables. It gives a way of storing and retrieving unstructured data quickly.
HBase. A distributed, non-relational database. It is managed as a project of the Apache Software Foundation and is part of the Hadoop ecosystem.
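The SQL operations listed above (insert, query, update, delete) can be tried without installing a database server, using the `sqlite3` module from Python's standard library. This is a minimal sketch; the `customers` table and its contents are hypothetical.

```python
import sqlite3

# In-memory relational database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Manage database structure: one table with fixed columns.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Insert rows (the ? placeholders keep values separate from the SQL text).
cur.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Alice", "Paris"), ("Bob", "Lyon"), ("Chloe", "Paris")],
)

# Update and delete.
cur.execute("UPDATE customers SET city = ? WHERE name = ?", ("Nice", "Bob"))
cur.execute("DELETE FROM customers WHERE name = ?", ("Chloe",))

# Query.
cur.execute("SELECT name, city FROM customers ORDER BY id")
rows = cur.fetchall()
print(rows)

conn.close()
```

Every row has the same columns, which is the "data stored in rows and columns" property that NoSQL systems relax.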
![Page 106: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/106.jpg)
Vocabulary of Data Science: Databases and Data Processing
Hadoop. A framework that supports large scale data processing by decomposing large tasks into smaller tasks that are executed in parallel on independent slices of the data and then merged to produce the final result.
MapReduce. A software framework introduced by Google for processing huge datasets on a distributed system. Implemented in Hadoop. It supports large scale data processing by decomposing large tasks into smaller tasks, executed in parallel on independent parts of the data and finally merged to produce the result.
Stream processing. Technologies designed to process large real-time streams of event data. Example: high-frequency algorithmic trading, analysis of Twitter data streams.
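The decomposition described in the MapReduce entry (small tasks on independent parts of the data, then a merge) can be imitated on a single machine. The sketch below counts words in three text chunks; it runs sequentially and only illustrates the shape of the computation, not Hadoop's actual API. All function names and the input chunks are hypothetical.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one independent slice of the data into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: merge the values of one key into a single result."""
    return key, sum(values)


# Each chunk could live on a different node; here they are plain strings.
chunks = ["big data big models", "data science", "big science"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because `map_phase` looks at one chunk at a time and `reduce_phase` at one key at a time, both steps can be spread over many machines; the shuffle is where data has to move between them.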
![Page 107: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/107.jpg)
![Page 108: Data Science Starter Program Introduction to Data Science](https://reader035.fdocuments.us/reader035/viewer/2022062404/613c118622e01a42d40e7804/html5/thumbnails/108.jpg)
Bibliography
T. Hastie, R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning. Springer Series in Statistics.
G. James, D. Witten, T. Hastie, and R. Tibshirani (2013). An Introduction to Statistical Learning with Applications in R. Springer Texts in Statistics.
B. Schölkopf and A. Smola (2002). Learning with Kernels. The MIT Press.
R. Schutt and C. O'Neil (2014). Doing Data Science: Straight Talk from the Frontline. O'Reilly.