Presentation
-
Upload
jonathan-dinu -
Category
Technology
-
view
277 -
download
0
Transcript of Presentation
Jonathan DinuVP of Academic Excellence, Galvanize
[email protected]@clearspandex
Questions? tweet @clearspandex
• What is Data Engineering?
• Why is Data Engineering?!
• How is Data Engineering!?!
• Data Architectures
• Building a Pipeline (w/ Luigi)
• Q&A
Outline
Questions? tweet @clearspandex
Data Engineering
Questions? tweet @clearspandex
(In Reality)
http://resource.mmgn.com/Gallery/full/SBS5NKQM.jpg
Data Science
Questions? tweet @clearspandex
Prepared Data
Test Set
TrainingSet Train
ModelSampling
EvaluateCross
Validation
Data Engineering
Questions? tweet @clearspandex
Raw Data
Cleaned Data
Scrubbing
Prepared DataVectorization
New Data
Test Set
TrainingSet Train
ModelSampling
EvaluateCross
Validation
Cleaned Data
Prepared DataVectorizationScrubbing
Predict
Labels/Classes
All the Things weNeed to Do
Acquisition
Parse
Storage
Transform/Explore
Vectorization
Train
Model
Expose
Presentation
requests
BeautifulSoup4
pandas
pymongo
Flask
At Scale Locally
scrapy
Hadoop Streaming (w/ BeautifulSoup4)
mrjob or luigi
Snakebite (HDFS)
Spark ML (pySpark)
Flask
scikit-learn/NLTK
Questions? tweet @clearspandex
Code
Questions? tweet @clearspandex
• Always keep raw data
• Data Lineage
• Apply a series of transforms to data.
• Flexible, Modular, Extensible (and testable!)
Why Pipelines?
Code
Questions? tweet @clearspandex
• Idempotence
• Checkpointing
• Native Hadoop Support
• But Works for arbitrary scripts (like make!)
Why
Code
Questions? tweet @clearspandex
• Idempotence
• Checkpointing
• Native Hadoop Support
• But Works for arbitrary scripts (like make!)
(also has a nice UI and sends emails :)
Why
Data Engineering
Questions? tweet @clearspandex
Raw Data
Cleaned Data
Scrubbing
Prepared DataVectorization
New Data
Test Set
TrainingSet Train
ModelSampling
EvaluateCross
Validation
Cleaned Data
Prepared DataVectorizationScrubbing
Predict
Labels/Classes
Data Engineering
Questions? tweet @clearspandex
Raw Data
Cleaned Data
Scrubbing
Prepared DataVectorization
New Data
Test Set
TrainingSet Train
ModelSampling
EvaluateCross
Validation
Cleaned Data
Prepared DataVectorizationScrubbing
Predict
Labels/Classes
Framework/Library
Big Data (scalability)
Small Data
Bespoke Code
Cloudera ML
Mahout
MLlib (amplab)H20 (0xdata)
C/C++
MapReduce (Streaming)
MapReduce (Java)
Cascading/Crunch
Pig/Hive
Vowpal Rabbit
GiraphGraphLab
SparkStorm
CRANR
Python
Javascikit-learn
pandas
mlpack
Weka
Numpy
Javascript
All the Things weHave to Do it!
Questions? tweet @clearspandex
Machine Learning
Raw Data
Cleaned Data
Events
Prepared DataVectorization
Test Set
TrainingSet Train
ModelFeatures
EvaluateCross
Validation
Prediction/scoreLabels/Classes
Scoring ServerApplication
Machine LearningService
Raw Data
Cleaned Data
Events
Prepared DataVectorization
Test Set
TrainingSet Train
ModelFeatures
EvaluateCross
Validation
Prediction/scoreLabels/Classes
Scoring ServerApplication
Machine LearningService What goes
on in here??!?
Machine Learning
Raw Data
Cleaned Data
Events
Prepared DataVectorization
Features
Prediction/scoreLabels/Classes
Scoring ServerApplication
Machine LearningService What goes
on in here??!?
ContinuouslyTraining
Machine Learning
Architecture
Questions? tweet @clearspandex Source: http://www.fwdlife.in/bizarre-buildings-around-the-globe/
Architecture
Questions? tweet @clearspandex Source: http://slides.com/gchomatas/delivering-a-big-data-ready-minimum-viable-product
Kafka
Architecture
Questions? tweet @clearspandex Source: http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
Architecture
Questions? tweet @clearspandex Source: http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
Our Pipeline
RC P
TT M
EC
L
SA
M
Architecture
Questions? tweet @clearspandex Source: http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
So, why the excitement about the Lambda Architecture? I think the
reason is because people increasingly need to build complex, low-latency
processing systems. What they have at their disposal are two things that
don’t quite solve their problem: a scalable high-latency batch system that
can process historical data and a low-latency stream processing system
that can’t reprocess results. By duct taping these two things together,
they can actually build a working solution.
“- Jay Kreps
Experience
Iteration 2: Data!
Questions? tweet @galvanize
METRICS
METRICS EVERYWHERESaturday, April 9, 2011
Data Science Immersive
Masters in Data Science
Data Engineering Immersive
Weekend Workshops
+
Questions? tweet @clearspandex
Goals
• Full-time Instructors
• TAs
• Mentor (volunteer)
We’re Hiring!
Questions? tweet @clearspandex
Questions?
Thank You!
Jonathan DinuVP of Academic Excellence, Galvanize
[email protected]@clearspandex
Questions? tweet @clearspandex