Practical Data Scienceamitsome/datascience/201516/... · • The rules governing the conversation...

Post on 10-Jul-2020

Practical Data Science

Yossi Attas

About me

Yossi Attas – Principal R&D Manager

Microsoft Application Insights

yossia@microsoft.com

About this course

Goals

1.  Understand the processes, methodology and tools of generating and communicating actionable insights on top of big data*

2.  Learn practical use of telemetry data as a key enabler of application success

* To simplify the course, we reduced the volume of data to enable easier insight production with just a laptop

Agenda for today

1.  What is data science? (or why you should be here?)

2.  The data set for the course

3.  The project

4.  Homework

1. What is Data Science?

Data Science is hot…

•  “Data scientists are the new superheroes”

•  KPMG survey of C-level executives: “99% said analysis of big data was important to their strategy for next year”

•  McKinsey: “by 2018 U.S. alone may face a 50%-60% gap between supply and demand of deep analytic talent”

What happened?

•  Exponential growth of data generated and collected:

•  2.5 quintillion bytes of data are created daily

•  90% of the data in the world today has been created in the last two years alone.

•  Enterprise-generated data is expected to exceed 240 exabytes daily by 2020

•  A single connected car generates 25GB data per hour

•  Data storage prices are dropping

•  Over the last 30 years, space per unit cost has doubled roughly every 14 months

•  Affordable tools to analyze massive volumes of data
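
As a back-of-the-envelope check of the storage claim above, the sketch below compounds "doubling roughly every 14 months" over 30 years. The function name is made up for illustration, and the result is only as good as the doubling period it assumes.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """How many times more space one unit of money buys after `years`."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings


if __name__ == "__main__":
    factor = growth_factor(years=30, doubling_months=14)
    print(f"{30 * 12 / 14:.1f} doublings, growth factor of about {factor:.2e}")
```

Under these assumptions, 30 years is a little under 26 doublings, which puts the growth in space per unit cost in the tens of millions.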

Storage prices are dropping

Hard drive storage prices are dropping

“Cloud wars” are driving cloud storage prices even further down

How big is Big Data?

•  Can you imagine a petabyte? Exabyte?

•  1 PB == if you counted one byte per second, it would take 35.7 million years

•  200 PB == the entire written works of mankind, from the beginning of history, in all languages

•  5 EB == all words ever spoken by human beings

•  And yet this is today’s reality:

•  Google processes 100 petabytes of data every single day

•  Stores 15 EB of data on 3 million servers

•  Microsoft’s Cosmos (internal analytical Map/Reduce system) – one of many systems

•  3 exabytes, 160K servers, 200K+ jobs/day

•  Not just Internet companies – Walmart processes 40 PB / day
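
The "one byte per second" figure above can be checked with a few lines of arithmetic. This sketch assumes a year of 365.25 days and compares the binary petabyte (2^50 bytes) with the decimal one (10^15 bytes); the slide's 35.7 million years corresponds to the binary definition.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: average year length


def years_to_count(num_bytes: int) -> float:
    """Years needed to count `num_bytes` at one byte per second."""
    return num_bytes / SECONDS_PER_YEAR


if __name__ == "__main__":
    for label, size in [("binary PB (2**50)", 2 ** 50), ("decimal PB (10**15)", 10 ** 15)]:
        print(f"{label}: {years_to_count(size) / 1e6:.1f} million years")
```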

Big Data Technologies

•  Can you master them all?

•  Hint: no-one can

•  …or needs to

What is Data Science?

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

•  HOW: Hacking skills (aka: master the technology)

•  WHY: Math & Statistical Knowledge (aka: correct interpretation of your findings)

•  Running an ML algorithm today is as easy as calling a function or pushing a button… but beware!

•  WHAT: Substantive expertise (aka: domain knowledge)

Goal: Actionable Insight

•  Insight == actionable, data-driven finding that creates business value

•  Metrics are easy, insights are hard

•  To be actionable, insight needs to be:

•  Useful/Valuable – clear business gain

•  Accessible – easily understood by relevant stakeholders

•  Non-trivial – “tell me something I don’t know”

Insight – Example (Walmart, 2012)

•  Improving page performance by 1-2 sec results in significantly better conversion (% of customers completing a purchase) – worth many millions of $$

http://www.webperformancetoday.com/2012/02/28/4-awesome-slides-showing-how-page-speed-correlates-to-business-metrics-at-walmart-com/
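
To make the conversion metric concrete, here is a minimal sketch that groups sessions by page load time and computes the conversion rate (% of sessions ending in a purchase) per group. The sessions are hypothetical toy values invented for illustration, not Walmart's data, and the 2-second bucket width is an arbitrary assumption.

```python
from collections import defaultdict

# Hypothetical toy sessions: (page load time in seconds, ended in a purchase?)
SESSIONS = [
    (0.8, True), (1.2, False), (1.5, True), (1.9, False),
    (2.4, False), (2.8, True), (3.5, False), (3.9, False),
    (4.6, False), (5.2, False),
]


def conversion_by_bucket(sessions, bucket_seconds=2):
    """Return {bucket start in seconds: conversion rate in %}."""
    totals = defaultdict(int)
    purchases = defaultdict(int)
    for load_seconds, purchased in sessions:
        bucket = int(load_seconds // bucket_seconds) * bucket_seconds
        totals[bucket] += 1
        purchases[bucket] += int(purchased)
    return {b: 100.0 * purchases[b] / totals[b] for b in sorted(totals)}


if __name__ == "__main__":
    for start, rate in conversion_by_bucket(SESSIONS).items():
        print(f"{start}-{start + 2}s: {rate:.0f}% conversion")
```

A gap between buckets like this is a correlation worth investigating; on its own it does not prove that slow pages cause the lost purchases.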

Getting to the Insight – Two approaches

1.  Top down, hypothesis-driven

•  Given the problem → formulate hypothesis → work with data to prove/disprove it

•  Example:

•  Problem: “Should we invest in improving performance?”

•  Hypothesis: “Poor performance causes lower conversion”

•  Insight: previous slide

2.  Bottom up, data-driven

•  Given the data, explore it to find new insights

•  Example:

Getting to the Insight - Process

Formulate the hypothesis

Acquire the data

“Learn” the data

Cleanse the data

Produce the insight

Validate

Visualize / Communicate

Summary

A data scientist must:

1.  Focus on the right problem

2.  Get the data

3.  Produce the insight

4.  Communicate

In this course:

1.  You pick the problem (but we can help)

2.  We give you the data

3.  Use Python with ML packages; use Excel to explore the data

4.  Use Excel to visualize your findings

2. The Data Set

The problem

•  “By 2017, 94.5% of downloads will be for free apps; Less than 0.01% of consumer mobile apps will be considered a financial success”

-Gartner

Situation: building successful apps is hard

•  Fierce Competition: User retention requires constant improvements of apps and services

•  Constant Evolution: Web services & Mobile apps need to evolve rapidly to survive & grow

•  Continuous Delivery: Most major services push updates as often as every day

What is telemetry data?

•  Telemetry data tracks the behavior of the application to establish

•  Operational KPIs

•  Availability

•  Performance

•  COGS (cost of goods sold)

•  Business KPIs

•  Adoption

•  Engagement

•  Retention

•  Conversion funnels

•  Collecting telemetry requires instrumenting client and server code
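
Two of the operational KPIs above, availability and performance, can be sketched directly from request records. The records below are hypothetical, and "availability as the share of successful requests" and "performance as the median duration" are simplifying assumptions; real definitions vary between teams.

```python
import statistics

# Hypothetical request records: (succeeded?, duration in milliseconds)
REQUESTS = [
    (True, 120), (True, 95), (False, 3000), (True, 110), (True, 250),
    (True, 130), (True, 105), (False, 2800), (True, 90), (True, 100),
]


def availability(requests):
    """Share of successful requests, in percent."""
    ok = sum(1 for success, _ in requests if success)
    return 100.0 * ok / len(requests)


def median_duration(requests):
    """Median request duration; less sensitive to outliers than the mean."""
    return statistics.median(duration for _, duration in requests)


if __name__ == "__main__":
    print(f"availability: {availability(REQUESTS):.1f}%")
    print(f"median duration: {median_duration(REQUESTS)} ms")
```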

What is Application Insights?

1.  Telemetry is collected at each tier, incl. browser and server-side

2.  Telemetry arrives in the Application Insights service in the cloud, where it is processed & stored

3.  Get a 360° view of the application, including availability, performance and usage patterns

Data set for this course

•  Requests – observed app behavior

•  Capture the details of each HTTP request processed by the web server, e.g.:

•  URL, success/failure (incl. response code), duration + info about device sending request

•  Can be used to understand: reliability of the site (how many requests succeed); performance of site or certain pages; volumes

•  PageView – observed user behavior

•  Capture the details of each HTML page viewed by the user, e.g.:

•  URL, many details on the devices used (location, OS, browser, screen size, …)

•  Can be used to understand: usage patterns; audience segmentation

•  AJAX events

•  Capture the detailed interactions of the specific page with the server, both system and user originated

•  Exceptions

•  Also, all telemetry types are linked to a user and user session
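
Because every telemetry type carries a user and a session, events of different types can be stitched together per session. The sketch below assumes hypothetical field names (`type`, `session`, `user`); the real column names in the course data may differ.

```python
from collections import Counter, defaultdict

# Hypothetical events; field names are assumptions for illustration only.
EVENTS = [
    {"type": "pageview", "session": "s1", "user": "u1"},
    {"type": "request", "session": "s1", "user": "u1"},
    {"type": "request", "session": "s1", "user": "u1"},
    {"type": "pageview", "session": "s2", "user": "u2"},
    {"type": "exception", "session": "s2", "user": "u2"},
]


def summarize_sessions(events):
    """Count events of each telemetry type per session."""
    summary = defaultdict(Counter)
    for event in events:
        summary[event["session"]][event["type"]] += 1
    return {session: dict(counts) for session, counts in summary.items()}


if __name__ == "__main__":
    for session, counts in summarize_sessions(EVENTS).items():
        print(session, counts)
```

A per-session view like this is a natural starting point for questions such as "which sessions hit an exception?".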

The simplest web application

http://www.site.com/index.html

Hello, world

Browser: communicates with the web server; fetches and renders HTML pages into beautiful screens

(HTTP in both directions)

Web server: accepts HTTP requests, performs business logic and serves (dynamic) HTML pages

An HTTP conversation

Client: I would like to open a connection

Server: OK

Client: GET http://www.site.com/index.html

Server: Send page or error message

Client: Display response

Client: Close connection

Server: OK

<!DOCTYPE html> <html> <body> <h1>My First Page</h1> <p>Hello, world!</p> </body> </html>
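
The conversation above can be tried locally with nothing but Python's standard library. This is a minimal sketch, not how a production server is written: it starts a throwaway server on 127.0.0.1 with an OS-assigned port, sends one GET, and returns the status code and body. The handler class and `fetch` helper are names invented for this example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = (b"<!DOCTYPE html><html><body><h1>My First Page</h1>"
        b"<p>Hello, world!</p></body></html>")


class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Server side: send the page, or an error message for unknown paths.
        if self.path == "/index.html":
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(PAGE)))
            self.end_headers()
            self.wfile.write(PAGE)
        else:
            self.send_error(404, "Not Found")

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch(path):
    """Open a connection, GET `path`, read the response, close the connection."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: OS picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        conn.request("GET", path)
        response = conn.getresponse()
        body = response.read()
        conn.close()
        return response.status, body
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, body = fetch("/index.html")
    print(status, body.decode())
```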

HTTP vs HTML

Both were invented at the same time by the same person: Sir Tim Berners-Lee, 1989

•  HTTP: hypertext transfer protocol

•  The rules governing the conversation between a Web client and a Web server

•  How messages are formatted and transmitted; what actions web servers and browsers should take in response to various commands

•  HTML: hypertext markup language

•  Tag-based language for describing web pages

•  Instructs the browser how to render a page / what actions to perform on certain events

HTTP headers

Accept: text/html, application/xhtml+xml, image/jxr, */*
Accept-Encoding: gzip, deflate, peerdist
Accept-Language: en-US, en; q=0.7, he; q=0.3
Connection: Keep-Alive
Cookie: _ga=GA1.2.1161181038.1455475184; __gads=ID=63f97d8b6f522032:T=1455475184:S=ALNI_MbHACjONbrZtk3Et5JqdFyl_Lg9ow; _gat=1
Host: www.w3schools.com
Referer: http://www.w3schools.com/html/html_examples.asp
User-Agent: Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405
X-P2P-PeerDist: Version=1.1
X-P2P-PeerDistEx: MinContentInformation=1.0, MaxContentInformation=1.0

•  A lot of information is passed in the HTTP headers (metadata) – they are a primary source for telemetry
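
To show how headers like these turn into telemetry fields, the sketch below parses a few of them into a dictionary and ranks the languages in Accept-Language by their q-values. It is deliberately simplified (no multi-line headers, no duplicate header names), and both helper functions are written for this example rather than taken from any library.

```python
RAW_HEADERS = """\
Accept-Language: en-US, en; q=0.7, he; q=0.3
Host: www.w3schools.com
User-Agent: Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405
"""


def parse_headers(raw):
    """Split 'Name: value' lines into a dict (simplified parsing)."""
    headers = {}
    for line in raw.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


def language_preferences(accept_language):
    """Return (language, weight) pairs, most preferred first; default weight is 1.0."""
    prefs = []
    for part in accept_language.split(","):
        tag, _, params = part.partition(";")
        params = params.strip()
        weight = float(params[2:]) if params.startswith("q=") else 1.0
        prefs.append((tag.strip(), weight))
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    headers = parse_headers(RAW_HEADERS)
    print(headers["Host"])
    print(language_preferences(headers["Accept-Language"]))
    print("iPad" in headers["User-Agent"])  # crude device hint from the user agent
```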

AJAX

•  The “classical” web model: for every request you receive back a new HTML page and re-render the entire browser screen

•  Many web sites still work this way today…

•  But we (people) became impatient… we want higher interactivity

•  Example: typing into Google search box, you get instant suggestions…

•  AJAX == Asynchronous JavaScript And XML

•  Send / receive data from a server – asynchronously, in background

•  Still uses HTTP as underlying protocol

Example - Request telemetry

{
  "request": [{
    "id": "4251295413255663004",
    "name": "GET InsightsExtension/Index",
    "responseCode": 200,
    "success": true,
    "durationMetric": { "value": 39751.0 },
    "url": "https://stamp2.app.insightsportal.visualstudio.com/InsightsExtension",
    "urlData": {
      "protocol": "https",
      "host": "stamp2.app.insightsportal.visualstudio.com",
      "base": "/InsightsExtension"
    }
  }],
  …
  "context": {
    "data": { "eventTime": "2015-08-01T00:48:35.3821824Z" },
    "device": {
      "os": "Windows",
      "osVersion": "Windows 7",
      "browser": "Internet Explorer",
      "browserVersion": "Internet Explorer 9.0",
      "locale": "en-US",
      "userAgent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; AppInsights)"
    },
    "user": { "anonId": "us-il-ch1-t4t-edge" },
    "session": { "id": "3dd36639-bf4c-4271-9814-6be1ad1f31b8" },
    "location": {
      "continent": "North America",
      "country": "United States",
      "point": { "lat": 47.674, "lon": -122.1215 },
      "clientip": "0.46.14.57",
      "province": "Washington",
      "city": "Redmond"
    }
  }
}
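
A record like the one above can be read with Python's `json` module. The sketch below uses a trimmed copy of the example and pulls out a handful of fields; `summarize_request` is a helper invented here, and the unit of `durationMetric.value` is not stated in the example, so it is passed through unchanged.

```python
import json

# Trimmed copy of the request example; only some fields are kept.
RECORD = json.loads("""
{
  "request": [{
    "name": "GET InsightsExtension/Index",
    "responseCode": 200,
    "success": true,
    "durationMetric": {"value": 39751.0}
  }],
  "context": {
    "data": {"eventTime": "2015-08-01T00:48:35.3821824Z"},
    "device": {"os": "Windows", "browser": "Internet Explorer"},
    "session": {"id": "3dd36639-bf4c-4271-9814-6be1ad1f31b8"},
    "location": {"country": "United States", "city": "Redmond"}
  }
}
""")


def summarize_request(record):
    """Pick a few fields out of the nested request record."""
    request = record["request"][0]
    context = record["context"]
    return {
        "name": request["name"],
        "ok": request["success"] and request["responseCode"] < 400,
        "duration": request["durationMetric"]["value"],  # unit not stated in the example
        "browser": context["device"]["browser"],
        "city": context["location"]["city"],
        "session": context["session"]["id"],
    }


if __name__ == "__main__":
    print(summarize_request(RECORD))
```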

Example - PageView telemetry

{
  "view": [{
    "urlData": {
      "host": "stamp2.app.insightsportal.visualstudio.com",
      "protocol": "https",
      "base": "/InsightsExtension"
    },
    "name": "AspNetOverview",
    "url": "https://stamp2.app.insightsportal.visualstudio.com/InsightsExtension?sessionId=a9689a39acdf4918996610ba31b2944f&extensionName=AppInsightsExtension&shellVersion=5.0.302.65%20(production%23ede3859.150729-2229)&traceStr=&l=en.en-us&trustedAuthority=portal.azure.com%3A"
  }],
  "context": {
    "device": {
      "os": "Windows",
      "osVersion": "Windows 10",
      "type": "PC",
      "browser": "Internet Explorer",
      "browserVersion": "Internet Explorer 12.10240",
      "screenResolution": { "value": "1707X960" },
      "locale": "en-us",
      "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240"
    },
    …
    "location": {
      "continent": "North America",
      "country": "United States",
      "point": { "lat": 27.7362, "lon": -82.6691 },
      "clientip": "0.185.245.49",
      "province": "Florida",
      "city": "Saint Petersburg"
    },
    "data": { "eventTime": "2015-08-01T00:57:10.489Z" },
    "user": { "anonId": "ac0cd123-b7c8-4658-b17d-7a0d4b552d1e" },
    "custom": {
      "dimensions": [
        { "Prod": "5.0.302.65 (production#ede3859.150729-2229)" },
        { "AppInsightsVersion": "1.0.5688.17982" }
      ]
    },
    "session": { "id": "FA79174B-2622-4FD7-945E-56F0E716D905" }
  }
}
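
Nested records like this are awkward to explore in a spreadsheet, so a common step is to flatten them into one row of columns. The sketch below is one possible way to do that, using dotted column names such as `device.os`; it is an illustration, not the procedure actually used to prepare the course data. Lists (such as `custom.dimensions`) are left as single values for simplicity.

```python
import csv
import io

# Trimmed copy of the "context" part of the PageView example.
CONTEXT = {
    "device": {"os": "Windows", "osVersion": "Windows 10", "type": "PC"},
    "location": {"country": "United States", "city": "Saint Petersburg"},
    "data": {"eventTime": "2015-08-01T00:57:10.489Z"},
}


def flatten(obj, prefix=""):
    """Turn nested dicts into one flat dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def to_csv(rows):
    """Write flat dicts as CSV text, with columns in sorted order."""
    columns = sorted({column for row in rows for column in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv([flatten(CONTEXT)]))
```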

Data set for this course

Good news: we simplified it for you! – smaller size, cleaner and in convenient shape

•  The data set is a representative sample of pageviews, requests and exceptions recorded on the Application Insights site during October 2015 (full month)

•  Total of: XXX requests, YYY pageviews; ZZZ exceptions

•  Data set in Comma Separated Values (CSV) format

•  Separate file for each type, for each date

Telemetry data is…

… Exciting

•  Lots of rich information

•  Many possible questions and insights

•  High potential impact on site/app success

… Hard

•  Data may be complex and confusing (previous examples were simplified ☺)

•  High volumes of data

•  100M pageviews/month for a medium site; billions of requests

•  Frequently requires specialized tools / coding skills

•  Non-friendly format

•  Data is never as clean as you wish it to be

3. The Project

Pick a business goal / a problem

Area | Main business goals | Problems

Operational Intelligence | Keep the site up and running, with minimum downtime, | Detect the problems early (or even before they occur); isolate the problem efficiently

Operational Intelligence | …good performance | Detect performance degradation

Operational Intelligence | … and optimal cost | Optimize capacity to usage patterns

Customer Intelligence | Grow the customer base, | Discover customers that are about to leave (churn)

Customer Intelligence | optimize customer acquisition costs | Discover customer segments you should advertise to

Customer Intelligence | monetize better | Predict customer “stickiness” based on their first session

These are just examples!!!

Generate an insight

•  Formulate a hypothesis you are trying to prove

•  Make sure you understand the data and its semantics

•  Make sure you have sufficient data for the experiment

•  Consult us early when in doubt

•  Use Python/ML to produce the insight

•  Beware of the difference between correlation and causality
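
The correlation-versus-causality warning is easy to demonstrate on simulated data. In the sketch below, which uses purely synthetic numbers, a hidden factor drives two metrics `a` and `b` that have no effect on each other. They still correlate strongly, and the correlation disappears once the hidden factor is subtracted out. The noise levels and sample size are arbitrary assumptions.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=2000, seed=0):
    """Synthetic data: `hidden` influences both a and b; a and b do not influence each other."""
    rng = random.Random(seed)
    hidden = [rng.gauss(0, 1) for _ in range(n)]
    a = [h + rng.gauss(0, 0.5) for h in hidden]
    b = [h + rng.gauss(0, 0.5) for h in hidden]
    return hidden, a, b


if __name__ == "__main__":
    hidden, a, b = simulate()
    raw = pearson(a, b)
    controlled = pearson([x - h for x, h in zip(a, hidden)],
                         [y - h for y, h in zip(b, hidden)])
    print(f"correlation of a and b: {raw:.2f}")
    print(f"after removing the hidden factor: {controlled:.2f}")
```

In real telemetry the hidden factor is rarely observed this cleanly, which is exactly why a strong correlation should prompt a search for confounders before any causal claim.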

“Selling” your insight

•  Your target audience is not data scientists

•  You need to present complex insight as…

•  Easy to understand

•  Using their language (business domain)

•  Visually appealing (people love beautiful graphs)

•  Be prepared to “go interactive” – every time you give an answer, more questions will follow

•  Ideally – something they can play with on their own

•  Excel is your best friend here

4. Homework

Homework

1.  Get access to the data, download to your PC

2.  Familiarize yourself with how data is organized

•  Can you find requests? Pageviews?

3. “Sniff it”

•  Open some of the files in a text editor (Excel is even better)

•  Do you understand some of the columns? Most of the columns? [Don’t worry if you don’t for now]

•  If you want to dig more, read the docs about data structure, prepare questions when unclear
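
For those who prefer to "sniff" the files from Python, here is a small sketch that lists each column of a CSV file together with its first few values. The column names in the sample are invented for the example; the real files will have their own headers.

```python
import csv
import io


def sniff(csv_file, sample_rows=3):
    """Return {column name: first few values} for an open CSV file."""
    reader = csv.DictReader(csv_file)
    rows = [row for _, row in zip(range(sample_rows), reader)]
    return {column: [row[column] for row in rows] for column in (reader.fieldnames or [])}


if __name__ == "__main__":
    # Hypothetical sample; with a real file use: open(path, newline="")
    sample = io.StringIO(
        "url,responseCode,success\n"
        "/index.html,200,True\n"
        "/missing,404,False\n"
    )
    for column, values in sniff(sample).items():
        print(f"{column}: {values}")
```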

Questions?

•  Contact information

Some useful links

•  “14 definitions of a data scientist”: http://bigdata-madesimple.com/what-is-a-data-scientist-14-definitions-of-a-data-scientist/

•  “Doing data science @Twitter” : https://medium.com/@rchang/my-two-year-journey-as-a-data-scientist-at-twitter-f0c13298aee6

•  “Why so many fake data scientists?”: https://www.linkedin.com/pulse/why-so-many-fake-data-scientist-bernard-marr

Thank you

BACKUP

Who is a Data Scientist?

•  Evolution from business or data analyst role?

•  What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge. Good data scientists will not just address business problems, they will pick the right problems that have the most value to the organization.

•  “Part analyst, part artist”

•  Anjul Bhambhri, vice president of big data products at IBM, says, “A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It's almost like a Renaissance individual who really wants to learn and bring change to an organization.”

•  Data scientists are inquisitive: exploring, asking questions, doing “what if” analysis, questioning existing assumptions and processes. Armed with data and analytical results, a top-tier data scientist will then communicate informed conclusions and recommendations across an organization’s leadership structure.

3B minutes of calls daily

Big Data Technologies