Practicing Data Science in the wild (Or the view from the trench) Arijit Laha Senior Principal Data Scientist Infosys Ltd, Hyderabad

Practicing Data Science in the wild
(Or the view from the trench)
Arijit Laha
Senior Principal Data Scientist
Infosys Ltd, Hyderabad

Why this talk?
• I am assuming the audience consists mostly of practicing analysts/data scientists and future ones
• (Hoopla notwithstanding) Given the early stage of this area, there is a lot of confusion and muddy waters flowing around
  • (“…clear as mud but it covered the ground” – from Man Piaba, Harry Belafonte)
• There are no established “best practices” as yet
• Lack of proper knowledge among stakeholders and the resulting opportunism are quite common
• We need to practice rigor and transparency all around to sustain long-term growth, both as individuals and as part of the industry
• I do not offer solutions, but share my approach and understanding with the hope that this will stimulate thoughts and discussions.

A Disclaimer
• I, personally, am not yet very sure what “Data Science” exactly is, what belongs within its scope and what does not. In fact, I feel a little semantic dissonance about this term. Anyway, we are using it to mean certain things (not clearly defined); the future will tell whether it sticks or proves to be a temporary placeholder.

Wikipedia says:

Data Science is the extraction of knowledge from large volumes of data that are structured or unstructured,…

Data science employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, chemometrics, information theory, and computer science, including signal processing, probability models, machine learning, statistical learning, data mining, database, data engineering, pattern recognition and learning, visualization, predictive analytics, uncertainty modeling, data warehousing, data compression, computer programming, and high performance computing. …

• Clearly, there are many trajectories by which one can reach the “Data Science” land
• My own trajectory is Physics -> Computer Science -> Pattern Recognition and Machine Learning -> Data Science
• Hence, I shall have a certain viewpoint/bias, strengths, weaknesses and holes in knowledge (well, we all have them, only they appear at different spots)

Another perspective: The Data Scientist Zoo
Voulgaris, Zacharias (2014-05-09). Data Scientist: The Definitive Guide to Becoming a Data Scientist

• There are five different types of data scientists:
  • Data developers
  • Data researchers
  • Data creatives
  • Data businesspeople
  • Mixed/generic

• The data developers are experts in programming, but may lack other parts of the data scientist skill-set. They usually come from the IT industry.

• The data researchers are experts in data analysis techniques and possess state-of-the-art knowledge in machine learning and other fields. They usually have a PhD and have been or are involved in academic research.

• The data creatives are more holistically developed as data science professionals than the other two types, have a bias towards using open-source software, and are very versatile. They come from all kinds of industries, though usually they are computer scientists already.

• The data businesspeople (aka senior data scientists) are the highest level of data scientist and usually have managerial roles, closer to the business world than to data science per se. They usually come from a mixed background that includes a degree in management.

• The mixed/ generic type of data scientists are the most balanced, having developed all of the aspects of data science more or less equally. They have less breadth of experience than data businesspeople, are very versatile, and come from all types of backgrounds. Usually, the mixed/ generic data scientist evolves into the data businesspeople type.

Progress, rapid progress in analytic prowess …

We are making fast progress
• In sources of large data
  • Business operations/transactions
  • Web
  • Monitoring
  • Logging
  • Sensors
  • Experiments
• In computing infrastructure
  • Hadoop and friends: Distributed batch processing on cheap hardware
  • Apache Spark: Distributed in-memory batch processing; can simulate stream processing by continuously processing short-interval jobs (Spark Streaming)
  • Apache Storm: Distributed in-memory real-time (stream) processing
  • GPU computing
  • Mobile devices, sensors
• In techniques
  • Deep learning
  • LDA and other NLP techniques
  • … many others
• In exciting works – Hardly a day goes by when we cannot find some new and interesting work reported
  • My finds today (18th August 2015) include
    • Flickr photo data used to predict people's locations (http://phys.org/news/2015-08-flickr-photo-people.html) – A media report on the publication “Modelling human mobility patterns using photographic data shared online”, D. Barchiesi, Tobias Preis, S. Bishop and H. S. Moat, Royal Society Open Science, published 12 August 2015
    • Baidu explains how it’s mastering Mandarin with deep learning (https://medium.com/s-c-a-l-e/how-baidu-mastered-mandarin-with-deep-learning-and-lots-of-data-1d94032564a5)
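The Spark Streaming bullet above describes stream processing simulated by a continuous series of short-interval batch jobs. A minimal plain-Python sketch of that idea follows. It is only an illustration: the function name `micro_batches`, the event format and the one-second interval are assumptions made here, and nothing in it is Spark's actual API.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, int]  # (timestamp in seconds, value); a hypothetical format


def micro_batches(events: Iterable[Event], interval: float) -> Iterator[List[Event]]:
    """Group a time-ordered event stream into consecutive fixed-length windows.

    Each yielded list plays the role of one short batch job's input.
    Windows with no events are yielded as empty lists.
    """
    batch: List[Event] = []
    window_end = None
    for ts, value in events:
        if window_end is None:
            window_end = ts + interval  # first window starts at the first event
        while ts >= window_end:
            yield batch                 # close the current window
            batch = []
            window_end += interval
        batch.append((ts, value))
    if batch:
        yield batch                     # flush the last, partially filled window


events = [(0.0, 1), (0.4, 2), (1.1, 3), (2.5, 4)]
for batch in micro_batches(events, interval=1.0):
    # The "batch job" here is just a sum over the window.
    print(sum(value for _, value in batch))
```

The shorter the interval, the closer the behaviour gets to true stream processing, at the price of more per-batch overhead.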

Progress means …

• Enhanced abilities
• Improved confidence
• More problem-solving opportunities
• More research problems
• More things to know about

Who (other than the consumers ultimately) gets the benefits?
• Researchers tackling problems not feasible earlier
  • And in the process contributing to the progress
• New-age tech companies, the tip of the iceberg
  • Their existence depends on this progress
  • In fact, they are usually the torch-bearers, the inventors of most of these advances
  • Continually improve the customer/consumer experience and, of course, make more money
  • They have dedicated armies of analysts/data scientists to leverage the progress
• Rest of the economy, the hidden part below the water
  • Examples: Services (financial, telecom, …), Manufacturing
  • Their core business (apparently) does not require them to participate actively in this progress
  • But they have the potential for great value-add from absorbing and leveraging analytics technologies and techniques

They need help from the practitioners, the analysts/data scientists
– This talk is about how to deliver the benefit to them
Let us call them “clients”

The client
• Firstly, what constitutes a client?
  • A client typically has an organizational identity, but that is not enough; we need to interact with people
• From a practitioner’s (very simplified) perspective, a client consists of the following entities (also called ‘stakeholders’)
  • The Project Sponsor: Owner of the problem (and also of the purse strings!)
  • Domain Experts/Subject Matter Experts (SMEs): Source of the information required to understand the problem, as well as the intended users
  • Project team: Client’s representatives responsible for getting a solution built
• If the solution involves application of “data science” techniques, the data scientist enters the stage.
  • But, at this point in time, data science is just beginning to make inroads – the perceptions and expectations differ wildly

What are their concerns?
• Sponsor
  • How much more money can be made and/or
  • How much money can be saved
• SME/User
  • Accuracy
  • Usability
  • Maintainability
• Project team
  • Infrastructure
  • Budget
  • Deadlines

• A data scientist needs to be equipped to reconcile these and many more

Talking stakeholders’ language
Data pipeline: Access/Acquire, Process, Reconcile
For Project Team: Algorithms: Computation of measures/statistics/ranks, development/application of models – classifiers, regression functions, clusters
For SME/User: Domain tasks: Churn prediction, Customer Segmentation, Product Recommendation, Fault identification/prediction
For Sponsor: Business impacts: Customer retention, Productivity increase
Data Scientist

How to proceed with rigor and transparency

Why is data science difficult to practice?
… at least at this point in time
• To understand, think of research and practice in general.
• Research is perceived to be more difficult than practice. Why?
  • Research is aimed at bringing the “unknown” into the realm of the “known”
  • Practice is aimed at utilizing the “known” for generating value
• Now, what does today’s data scientist face?
  • Almost every (serious, and thus worth more money) problem has some aspects of uniqueness
  • Usually there is a very large solution space to explore in terms of design, techniques, algorithms
  • Data: What is needed? How much is enough? What is the quality?
  • Solution: How to be sure of (or demonstrate) the quality/validity?
• Essentially, a significant walk into the unknown is involved. No wonder we often find excellent research-level work by practitioners
  • “Skill” is always necessary, but often may not be “sufficient”; we may need to bring “talent” into play.

Data Science (mostly ML-oriented perspective)
• Pattern recognition and machine learning
  • Data collected on purpose
  • Data fits in main memory
  • Core issues: Building classification, clustering, regression models using complex but limited data
• Data mining – pattern recognition on steroids
  • Data on the hard disk of a single computer
  • Working with repurposed data
  • Core issues: Modifying PR algorithms, tackling new tasks (e.g., ARM, association rule mining)
• Big data analytics
  • Data repurposed and heterogeneous (mix of structured and unstructured data from multiple sources)
  • May require distributed data and computation
  • Core issues: Building analysis pipelines/architectures, data reconciliation, new generative models (e.g., LDA)
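One practical consequence of moving from "data fits in main memory" to data on disk or spread across machines is that statistics have to be computed from per-chunk summaries that can be merged afterwards. The sketch below illustrates this with mean and variance. The names `Partial`, `summarize` and `chunked` are made up for this example, and the sum-of-squares formula is numerically naive; it is a sketch of the idea, not a recommended production method.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Iterator, Sequence


@dataclass(frozen=True)
class Partial:
    """Summary of one chunk: enough to recover the mean and variance."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def merge(self, other: "Partial") -> "Partial":
        # Merging is associative, so chunks can be combined in any order,
        # e.g. one chunk per file block or per worker node.
        return Partial(self.n + other.n,
                       self.total + other.total,
                       self.total_sq + other.total_sq)

    @property
    def mean(self) -> float:
        return self.total / self.n

    @property
    def variance(self) -> float:
        # Population variance via E[x^2] - E[x]^2 (simple, but not robust
        # against cancellation for large values).
        return self.total_sq / self.n - self.mean ** 2


def summarize(chunk: Iterable[float]) -> Partial:
    """Scan one chunk once, keeping only three numbers in memory."""
    n, total, total_sq = 0, 0.0, 0.0
    for x in chunk:
        n += 1
        total += x
        total_sq += x * x
    return Partial(n, total, total_sq)


def chunked(values: Sequence[float], size: int) -> Iterator[Sequence[float]]:
    """Stand-in for reading a large data set piece by piece."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


data = [float(x) for x in range(1, 11)]
partials = [summarize(chunk) for chunk in chunked(data, 3)]
overall = reduce(Partial.merge, partials, Partial())
print(overall.mean, overall.variance)
```

The same summarize-then-merge shape is what batch frameworks distribute across machines; here everything runs in one process for clarity.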

Data science life-cycle (DSLC)
[Figure: life-cycle diagram with its entry point marked “Start”. Source: Data Science for Business – Provost and Fawcett]

DSLC vs SDLC

Characteristic | SDLC | DSLC
Begins with | Specs | A vague request
Ends with | Successful UAT | Value realization*
Key components are | Deterministic | Probabilistic
Progress can be measured at | Always (milestones reached) | Typically after 75% effort
Knowledge of science & math | Incidental | Fundamental
Design or build first? | Design first | Build first
Cost of poor design | Cost overrun | Catastrophic
Research & Innovation | In niche areas | Diverse and Applied

Courtesy: Sandeep Rajput

Taking the client into confidence
• As practitioners, we should be able to articulate to the stakeholders the benefits, costs, uncertainties and risks
• This may sound commonplace, but in the case of an analytics/data science project, the knowledge gaps are often very significant
• Thus, the perceived complexity of the work and the expectations may differ significantly
• This is essentially a communication problem arising out of the relative novelty and specialized nature of the area
• There is no known remedy for these problems yet
• In my experience, dedicated workshop(s) with the stakeholders before commencement of the project can be highly beneficial

Solution building and deployment
• These should be thought about separately, and the platform and technology need to be chosen accordingly
• Priorities while building
  • Ease of experimentation
  • Quick turn-around time
• Priorities while deploying
  • Deployed artifact (e.g., model built in the development stage)
  • Scale
  • Time constraints

• Obviously, the solution must be adaptable to the deployment environments
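One way to picture the "deployed artifact" is as the fitted parameters of a model, exported from the development environment in a neutral format and reloaded by a lightweight scoring component in production. The sketch below assumes a deliberately trivial one-parameter model; `ThresholdModel`, `build`, `export_artifact`, `load_artifact` and `score_batch` are hypothetical names used only for illustration, and JSON is just one possible exchange format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Sequence


@dataclass
class ThresholdModel:
    """A toy classifier: predicts 1 when the input reaches the threshold."""
    threshold: float

    def predict(self, x: float) -> int:
        return int(x >= self.threshold)


def build(xs: Sequence[float], ys: Sequence[int]) -> ThresholdModel:
    """Development side: fit by cutting halfway between the class means.

    Assumes both labels (0 and 1) are present in the training data.
    """
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    return ThresholdModel((sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2)


def export_artifact(model: ThresholdModel) -> str:
    """Serialize only the fitted parameters, not the training code or data."""
    return json.dumps(asdict(model))


def load_artifact(blob: str) -> ThresholdModel:
    """Deployment side: rebuild the model from the exported parameters."""
    return ThresholdModel(**json.loads(blob))


def score_batch(model: ThresholdModel, xs: Sequence[float]) -> List[int]:
    return [model.predict(x) for x in xs]


model = build([1.0, 2.0, 3.0, 7.0, 8.0, 9.0], [0, 0, 0, 1, 1, 1])
blob = export_artifact(model)      # this string is what gets shipped
deployed = load_artifact(blob)
print(score_batch(deployed, [4.0, 6.0]))
```

Keeping the artifact small and format-neutral lets the development platform be chosen for ease of experimentation, while the scoring side is chosen for scale and time constraints.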

Data science solution building/development
1. Formulation of business question
2. Mapping of the business question into a technical problem
3. Set up acceptable performance/accuracy parameters for the answers
4. Determination of data availability
   a. Data sourcing
   b. Data cataloguing
5. Preliminary verification that the data available can be used for answering the business question
6. Identify target environments
   a. Development
   b. Production/deployment

7. Data Sourcing/Acquisition
8. Data preparation
   a. Data cleaning
   b. Data transformation/feature extraction
   c. Data sampling and partitioning
   d. Data fusion
9. Data exploration and understanding
   a. Data visualization
   b. Data statistics
   c. Data reconciliation
   d. Data preprocessing

10. Design solution architecture
11. Select relevant/possible/available modeling/analysis techniques for solution components
    a. Perform modeling/analysis using each of the selected techniques
    b. Evaluation and comparison of performance
       I. Applying trained model on test data
       II. Analyze the results
           • Result statistics
           • Result visualization
    c. Select and/or combine best performing model(s)
12. Build the solution by putting together the components
13. Analyze and validate the solution
14. Translate the output in business language
15. Create visualization and/or data products (e.g., dashboard) comprehensible to the business user
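A few of the steps above (3, 8c and 11a to 11c) can be sketched end to end in a handful of lines. Everything in the sketch is an assumption for illustration: the data is synthetic, the two "techniques" are toy classifiers, and `MIN_ACCURACY` stands in for whatever acceptance criterion is actually agreed with the stakeholders.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[float, int]          # (single feature value, class label 0 or 1)
Model = Callable[[float], int]   # maps a feature value to a predicted label

# Hypothetical acceptance threshold agreed with the stakeholders (step 3).
MIN_ACCURACY = 0.8


def partition(rows: Sequence[Row], test_fraction: float = 0.3,
              seed: int = 0) -> Tuple[List[Row], List[Row]]:
    """Step 8c: shuffle reproducibly and split into train and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train_majority(train: Sequence[Row]) -> Model:
    """Baseline technique: always predict the most frequent training label."""
    ones = sum(label for _, label in train)
    majority = int(2 * ones >= len(train))
    return lambda x: majority


def train_threshold(train: Sequence[Row]) -> Model:
    """Second technique: cut halfway between the two class means.

    Assumes both classes occur in `train` and class 1 has the larger values.
    """
    zeros = [x for x, label in train if label == 0]
    ones = [x for x, label in train if label == 1]
    cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x >= cut)


def accuracy(model: Model, rows: Sequence[Row]) -> float:
    """Step 11b: apply a trained model to held-out test data."""
    return sum(model(x) == label for x, label in rows) / len(rows)


def select_best(candidates: Dict[str, Model],
                test: Sequence[Row]) -> Tuple[str, float]:
    """Step 11c: pick the candidate with the highest test accuracy."""
    scores = {name: accuracy(model, test) for name, model in candidates.items()}
    best = max(scores, key=lambda name: scores[name])
    return best, scores[best]


# Synthetic, well-separated data purely for illustration.
rows = [(float(x), 0) for x in range(10)] + [(float(x), 1) for x in range(100, 110)]
train, test = partition(rows)
candidates = {"majority": train_majority(train), "threshold": train_threshold(train)}
name, score = select_best(candidates, test)
print(name, score, "accepted" if score >= MIN_ACCURACY else "rejected")
```

Including a trivial baseline among the candidates is a cheap transparency measure: it shows the stakeholders how much of the reported accuracy is actually earned by the model.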

Elements of activity for each stage
• Participants
  • Roles
  • Skills
  • Contribution
• Activity
  • Initial artifacts
  • Transformation/processing of artifacts
• Tools/technology
  • Efficiency
  • Usability
  • Limitations
• End artifacts
  • Validation
  • Utility
  • Quality

Solution Deployment
• Design the data pipeline based on the knowledge acquired during development on:
  7. Data Sourcing/Acquisition
  8. Data preparation
     a. Data cleaning
     b. Data transformation/feature extraction
     c. Data sampling and partitioning
     d. Data fusion
• Adapt the solution to the deployment environment and/or context
• Prepare and implement an effective and transparent testing and validation strategy (e.g., pilot deployment, A/B test, …)
• If required, modify the solution
• Finally, prepare a solution maintenance strategy
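As one illustration of a transparent validation strategy, the outcome of an A/B test can be summarized with a standard two-proportion z-test. The sketch below is one common choice rather than the only valid analysis; the function name and the conversion counts are hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> Tuple[float, float]:
    """Compare the success rates of variant A (control) and variant B (new solution).

    Returns the z statistic and the two-sided p-value under the
    pooled-proportion normal approximation (reasonable for large samples).
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical pilot: 1000 users per arm, 100 vs. 150 successful outcomes.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Agreeing on the metric, the sample size and the decision rule with the stakeholders before the pilot starts is what makes such a test transparent.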

State of the world
• A very strong interest in advanced data science/analytics exists today
• A very large potential clientele also exists
• But not many clients are knowledgeable enough to understand and articulate what they really need
• And there are far from enough competent professionals to guide them and/or deliver useful stuff
• Consequences
  • The business problem itself is being redefined and often trivialized by the professionals
  • The client does not get what is really needed, but is not able to point that out
• Clearly, this is not sustainable
  • In fact, a large proportion of data science projects do not progress beyond the POC stage

Waiting for the second wave
… hopefully it will arrive before it is too late
• That is you, my friends
• Who are willing to invest effort in learning
• You will be there to help clients ask for the right things
• And to deliver what they will really benefit from

Let us open the discussion

Discussion point: Is it similar to Agile development?
[Figure: DSLC diagram with its entry point marked “Start”. Source: Data Science for Business – Provost and Fawcett]

Discussion point: Must we always use, or at least try to use, one or another big data technology?

Discussion point: Do we need to replicate the deployment environment while developing a solution?

Anything else??

Thank you
You can reach me at [email protected] and [email protected]
I am planning to start putting up some stuff at https://sites.google.com/site/arijitlaha/ that some of you might find interesting.
Feel free to have a look when you have spare time