Staying Ahead of the Data Avalanche
Challenges and Opportunities in Analytics
Prof. Dr. Seppe vanden Broucke
SAS Analytics Experience Rome – 8 November 2016
Presenter: Seppe vanden Broucke
• Assistant professor in Data and Process Science at the Department of Decision Sciences and Information Management, KU Leuven (Belgium)
• PhD in Applied Economics at KU Leuven, Belgium in 2014
• Title: Advances in Process Mining: Artificial Negative Events and Other Techniques
• Research: business data mining and analytics, machine learning, process management, process mining
• Contact: www.dataminingapps.com [email protected]
BIG DATA
“We live in a data flooded world”
“Making sense of mountains of data” aka “Scale your data mountain”
“The data avalanche”
“Data is the new oil”
“The data tsunami”
“It all sounds kind of dangerous”
But so many success stories…
BIG DATA + DATA SCIENCE & ANALYTICS =
“We live in magical times”
Uber
Contextual RNN-GANs for Abstract Reasoning Diagram Generation
Arnab Ghosh*, Viveka Kulharia*, Amitabha Mukerjee, Vinay Namboodiri, Mohit Bansal
Measuring an Artificial Intelligence System's Performance on a Verbal IQ Test For Young Children
Stellan Ohlsson, Robert H. Sloan, György Turán, Aaron Urasky
BIG DATA + DATA ANALYTICS
“Let the good times roll”
So why do so many projects fail?
“During 2015, only 15% of Fortune 500 organizations were able to
exploit big data for competitive advantage” – Gartner
“Data maturity of companies is very disparate, and
the most advanced of them start doubting.”
– Christophe Bourguignat
“75 % have invested in Big Data, but only 10% have
projects in production.”
Companies are becoming disillusioned. They start asking
questions: I know how much it costs, but how much
do I earn? What is my return on investment?
Machine learning and data science have (just) reached “peak hype”
The challenges ahead
• TALENT
• PROCESS
• TOOLS, FILES, FEEDS
• COMMUNICATION
• MEASURING
• PRIVACY, COMPLIANCE
• ETHICS
• QUALITY
TALENT
“A data scientist is like a gold-coloured unicorn: mythical powers, but impossible to find”
Programmer
Or a spider with 25 legs?
PROCESS
Data science as a straight-through process?
Adhering to a data science workflow is A-OK:
• CRISP-DM
• The KDD process
• SEMMA
• BinaryEdge
Data → Selection → Selected Data → Cleaning → Cleaned/Processed Data → Transformation → Transformed Data → Discovery → Mined Model/Patterns → Interpretation/Evaluation → Knowledge/Insights
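Read as a straight-through process, the five steps (Selection, Cleaning, Transformation, Discovery, Interpretation/Evaluation) amount to a chain of functions, each consuming the output of the previous one. The sketch below is only an illustration of that reading: the records, field names, and the trivial "model" (mean spend per age band) are all hypothetical and not taken from the talk.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"customer": "a", "age": 34, "spend": 120.0},
    {"customer": "b", "age": None, "spend": 80.0},
    {"customer": "c", "age": 51, "spend": 300.0},
    {"customer": "d", "age": 29, "spend": 40.0},
]

def select(rows):
    """Selection: keep only the fields of interest."""
    return [{"age": r["age"], "spend": r["spend"]} for r in rows]

def clean(rows):
    """Cleaning: drop rows that contain a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows):
    """Transformation: derive a feature (here, an age band)."""
    return [{**r, "band": "40+" if r["age"] >= 40 else "<40"} for r in rows]

def discover(rows):
    """Discovery: a deliberately trivial 'model', mean spend per age band."""
    bands = {}
    for r in rows:
        bands.setdefault(r["band"], []).append(r["spend"])
    return {band: mean(values) for band, values in bands.items()}

def evaluate(patterns):
    """Interpretation/Evaluation: turn the pattern into a statement."""
    top = max(patterns, key=patterns.get)
    return f"Highest mean spend in age band {top}"

def run_pipeline(rows):
    """Run the five steps once, front to back, with no feedback."""
    return evaluate(discover(transform(clean(select(rows)))))

patterns = discover(transform(clean(select(RAW))))
insight = run_pipeline(RAW)
```

Nothing in `run_pipeline` feeds results back into earlier steps, which is exactly the assumption the following slides question.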
Not really...
More like a loop
Experiments can take a while…
These things are hard
• How to create a sense of urgency?
• What does it mean to be finished?
• You can’t predict the future.
COMMUNICATION
Throw-it-over-the-wall projects
“I want to put this GBM into production, though some steps are done using R and SAS”
“Anyone know what this XGBoost thing is?”
“Why aren’t we deployed yet?”
“We have all this data, why can’t we find interesting customers?”
Talking helps
• Learn each other’s language
• Think with your business hat on
• Teach semantics (why a shorter lead list is not easier
to produce)
• Convert hard problems into simpler ones
• Use examples, metaphors, analogies
• Show them and show them often
• IT and data science can live together
MEASURING
“Not everything that counts can be counted… and not everything that can be counted counts”
• Show before and after
• “When are you happy?”
• Accept failures
• Manual measuring can be a good thing
• Hard to automate subjective feelings…
TOOLS, FILES, FEEDS
“No one ever got fired for installing Hadoop on a cluster… right?”
A fool can ask more questions in an hour than a
wise man can answer in a hundred years
• Focus on the files
• What are we going to use it for?
A data scientist can find, love, and ditch more
tools/libraries/… in an hour than a procurement
officer can vet in a hundred years
Focus on feeds, files, data
• Let them (us) own the data
• Ship fast, ship often
• Focus on format and storage standards, not on
technology:
“Can I get information on X for months A and B with only those
columns that changed?”
... “Can I get it myself?”
• Where’s your golden data set?
• Trust your experts
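The sample question in the list above ("information on X for months A and B with only those columns that changed") is the kind of request that becomes easy once data is stored in a predictable format. A minimal sketch, assuming two monthly snapshots are available as lists of records sharing a key column; the function name `changed_columns`, the column names, and the data are hypothetical.

```python
def changed_columns(snapshot_a, snapshot_b, key):
    """Return {key value: {column: (old, new)}} for rows present in both snapshots."""
    rows_a = {row[key]: row for row in snapshot_a}
    changes = {}
    for row_b in snapshot_b:
        row_a = rows_a.get(row_b[key])
        if row_a is None:
            continue  # row only exists in month B; not treated as a change here
        diff = {
            col: (row_a.get(col), value)
            for col, value in row_b.items()
            if col != key and row_a.get(col) != value
        }
        if diff:
            changes[row_b[key]] = diff
    return changes

# Hypothetical monthly snapshots of the same table.
january = [
    {"id": 1, "segment": "retail", "balance": 100},
    {"id": 2, "segment": "retail", "balance": 250},
]
february = [
    {"id": 1, "segment": "retail", "balance": 130},
    {"id": 2, "segment": "premium", "balance": 250},
    {"id": 3, "segment": "retail", "balance": 10},
]

result = changed_columns(january, february, key="id")
```

Here `result` holds only the balance change for id 1 and the segment change for id 2. Rows that appear in just one month are ignored in this sketch; whether they should count as changes is a design decision for the data owner.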
Technology moves too fast anyway…
• HDFS?
• What about HDF5, or Kudu?
• Do we even have unstructured data?
• Do we know what to do with it?
• V’s of Big Data – yeah right!
• BigSQL, or Hive, or Slurp?
• Cloudera, Hortonworks, Teradata, Oracle, I want Hadoop!?
• What do you mean we need H2O on top of Spark on top of Hadoop? We just installed X
• We did these things before… they weren’t hard then
• True, but…
It’s a difficult balance
The wall of deployment
• Versioning
• Collaboration
• Scalable execution
• Multiple language support
• Multiple kernel support
• Monitoring
• Scheduling
• Acyclic dependency graphs
• Quite different from playing in a notebook
• Vendors are starting to help out
• SAS, SPSS, Domino Data Lab, sense.io, ScienceOps <-> Jupyter, Rodeo, your 3 GB of pip packages
• Uncomfortable for most data scientists (too messy) and for IT shops (too unfamiliar)
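The "Scheduling" and "Acyclic dependency graphs" items in the list above can be made concrete with a small example. The sketch below uses Python's standard-library `graphlib` (Python 3.9+) to derive a run order from job dependencies and to reject cyclic definitions; the job names and the helper `execution_order` are hypothetical, and a production scheduler would add retries, monitoring, and parallel execution on top.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical job names; each job maps to the set of jobs it depends on.
JOBS = {
    "extract": set(),
    "clean": {"extract"},
    "features": {"clean"},
    "train": {"features"},
    "score": {"train", "features"},
    "report": {"score"},
}

def execution_order(jobs):
    """Return one valid run order, or raise ValueError if the graph has a cycle."""
    try:
        return list(TopologicalSorter(jobs).static_order())
    except CycleError as exc:
        # The detected cycle is the second element of the exception's args.
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc

order = execution_order(JOBS)
```

A notebook runs cells in whatever order the analyst clicks them; a deployed workflow needs the order to be declared up front, which is one reason the step from notebook to production feels like a wall.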
The “Joel Test” for Data Science
• Can new hires get set up in the environment to run analyses on their first day?
• Can data scientists utilize the latest tools/packages without help from IT?
• Can data scientists use on-demand and scalable compute resources without help from IT/dev ops?
• Can data scientists find and reproduce past experiments and results, using the original code, data, parameters, and software versions?
• Does collaboration happen through a system other than email or copying files?
• Can predictive models be deployed to production without custom engineering or infrastructure work?
• Is there a single place to search for past research and reusable data sets, code, etc?
• Do your data scientists use the best tools money can buy?
Source: https://blog.dominodatalab.com/joel-test-data-science/
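One of the Joel Test questions asks whether past experiments can be reproduced using the original code, data, parameters, and software versions. A minimal sketch of what recording those four things could look like, using only the standard library; the function names, the manifest layout, and the sample values are assumptions for illustration, not a description of any vendor's product.

```python
import hashlib
import json
import platform

def fingerprint(data_bytes):
    """SHA-256 of the exact bytes the experiment was run on."""
    return hashlib.sha256(data_bytes).hexdigest()

def build_manifest(code_version, data_bytes, params, packages):
    """Collect what is needed to rerun an experiment later.

    code_version: e.g. a commit hash, supplied by the caller
    packages: mapping of package name -> version string
    """
    return {
        "code_version": code_version,
        "data_sha256": fingerprint(data_bytes),
        "params": params,
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }

def to_json(manifest):
    """Serialize deterministically so manifests can be stored and diffed."""
    return json.dumps(manifest, sort_keys=True, indent=2)

# Hypothetical experiment: all values below are placeholders.
manifest = build_manifest(
    code_version="abc123",
    data_bytes=b"id,balance\n1,100\n2,250\n",
    params={"learning_rate": 0.1, "max_depth": 3},
    packages={"numpy": "1.26.0"},
)
```

Storing the output of `to_json(manifest)` next to each result makes it possible to check later whether a rerun used the same inputs, by comparing manifests instead of relying on memory or email threads.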
QUALITY
Garbage in…
“This model is gonna be great!”
Sometimes they are…
• Really: everyone has bad data
• But: more “bad” means more time
• Do make sure to get continuous access to the “bad” data
• Survey: 50+ banks participating world-wide
• Most banks indicated that between 10–20 percent of their data suffer from data quality problems
• Manual data entry is one of the key problems
• Diversity of data sources and consistent corporate-wide data representation are the main challenges for data quality
• Regulatory compliance is the key motive to improve data quality
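A figure such as "10–20 percent of the data suffer from quality problems" presupposes some rule for deciding when a record is problematic. The sketch below shows one simple way to compute such a share, assuming two kinds of checks (required fields and valid ranges); the rules, field names, and records are hypothetical and are not taken from the survey.

```python
# Hypothetical quality rules.
REQUIRED = ("customer_id", "country", "income")
VALID_RANGES = {"income": (0, 10_000_000)}

def quality_problems(record):
    """Return a list of problems found in one record (empty list means clean)."""
    problems = []
    for field in REQUIRED:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    for field, (low, high) in VALID_RANGES.items():
        value = record.get(field)
        if value not in (None, "") and not (low <= value <= high):
            problems.append(f"{field} out of range")
    return problems

def problem_rate(records):
    """Share of records with at least one quality problem."""
    if not records:
        return 0.0
    flagged = sum(1 for r in records if quality_problems(r))
    return flagged / len(records)

# Hypothetical records, including two typical manual-entry errors.
records = [
    {"customer_id": 1, "country": "BE", "income": 42_000},
    {"customer_id": 2, "country": "", "income": 38_000},   # country left blank
    {"customer_id": 3, "country": "IT", "income": -5},     # impossible value
    {"customer_id": 4, "country": "FR", "income": 51_000},
    {"customer_id": 5, "country": "NL", "income": 47_500},
]

rate = problem_rate(records)
```

The resulting share depends entirely on which rules are applied, so any reported percentage is only comparable across sources when the rules are shared as well.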
PRIVACY, COMPLIANCE
Oh boy…
• Datensparsamkeit
• Cookie law
• Basel II / III
• Who knows where the cloud is anyways?
• EU directives outdated
• “It’s all on Facebook anyway”
Academics are just getting started…
In more ways than one...
“If only we didn’t have to worry about this”
Use it as a competitive advantage?
https://backchannel.com/an-exclusive-look-at-how-ai-and-machine-learning-work-at-apple-8dbfb131932b#.crky6nt6k
ETHICS
Data science for good?
• Can an algorithm be racist? Sexist?
• “Will Predictive Models Outliers Be The New Socially Excluded?”
• Organizations like DataKind or Bayes Impact
• Concept of open models
The challenges today
• TALENT
• PROCESS
• TOOLS, FILES, FEEDS
• COMMUNICATION
• MEASURING
• PRIVACY, COMPLIANCE
• ETHICS
• QUALITY
Thank you