
Post on 25-Jan-2020



Introduction to Machine Learning & Data Analytics

Agenda – 2:00 pm – 2:45 pm

2:00 – 2:05 Introductions and Session Overview

2:05 – 2:10 Machine Learning & Predictive Analytics Background

2:10 – 2:30 Sample ML/PA Study & Findings

2:30 – 2:45 Expert Panel Q & A

Infiniti - Who We Are…

Founded in 2004 as a public sector IT consulting firm, Infiniti has evolved into a public sector cloud services and consulting organization with a reputation for delivering results on time and on budget.

Harnessing a deep commitment to state & local government, education, and healthcare, Infiniti aims to improve the lives of students through innovation and technology.

Infiniti - Where We Work

Cloud, Education, Gov’t Agency, Healthcare, IV&V, MSP

Machine Learning & Predictive Analytics

Deploying analytical IT tools is relatively easy.

Understanding how they might be used is much less clear.

Machine Learning & Predictive Analytics

ML/PA projects are more like scientific research than traditional IT initiatives:

> They typically start with sensing problems or potential opportunities, which may initially just be somebody’s hunch.

> They often move on to develop theories about the existence of a particular outcome or effect, generate hypotheses, identify relevant data, and conduct experiments.

> They are opportunities for discovery.

> They lead to specific, targeted actions.

Focus more on the “I” and less on the “T” in IT

Our Predictive Analytics Process

The cycle of analyzing, transforming, and learning can be repeated many times.

Popular Machine Learning Use Cases

Fraud / Anomaly Detection

Targeted Citizen Outreach

Business / Operational efficiency

Educational Outcome Predictions

Content Personalization

Document Classification

John Gray

Sample ML/PA Study & Findings

Problem Statement & Project Objective

Tasks

Tools and Environments

Deliverables

Roles

Schedule/Duration

Problem Statement & Project Objective

The California public sector client has surveys from millions of people who apply online. A small percentage give negative feedback. The feedback is entered as free-form text. The client wants to analyze this text to identify specific areas of the application process that need to be improved.

The objective of this project is to perform text processing, analysis, and clustering to understand survey comments from dissatisfied users and determine the parts of the application process that might need improvement.

(This is a “starter” ML/PA project – the client expects us to help with more complex and higher-benefit projects in the future.)

Tasks - Typical

1. Define the problem. Work with customers to get a good understanding of the specific questions they want answered.

2. Analyze existing customer data. If it is not sufficient, work with the customer to collect additional relevant data.

3. Perform ETL (Extract, Transform & Load) and complete data integrity checks. Make sure there are no issues with the data (missing data, statistical anomalies, etc.).

4. Make predictions and test outcomes. Model development: feature engineering and predictive modeling.

5. Test predictions for accuracy and validity. Improve and refine until results are satisfactory.

6. Deploy in production.

7. Train / transition to customer’s team (or continue to support if required).

8. Discover other potential opportunities. Provide suggestions on other questions that can be asked.
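As an illustration of the data integrity checks in step 3, here is a minimal sketch in plain Python. The field names (`comment`, `rating`), the 1 to 5 rating range, and the sample records are assumptions made up for this example; they are not taken from the client's data.

```python
from collections import Counter


def integrity_report(rows, rating_range=(1, 5)):
    """Count common data problems in a list of survey records (dicts)."""
    low, high = rating_range
    problems = Counter()
    for row in rows:
        comment = (row.get("comment") or "").strip()
        rating = row.get("rating")
        if not comment:
            problems["missing_comment"] += 1
        if rating is None:
            problems["missing_rating"] += 1
        elif not (low <= rating <= high):
            problems["rating_out_of_range"] += 1
    return dict(problems)


# Hypothetical records, invented for illustration only.
sample_rows = [
    {"comment": "Too many pages", "rating": 2},
    {"comment": "", "rating": 4},
    {"comment": "Kept logging me out", "rating": None},
    {"comment": "Confusing", "rating": 9},
]

report = integrity_report(sample_rows)
print(report)
```

A report like this is typically reviewed with the customer before modeling starts, since the right fix (drop, impute, or re-collect) depends on why the data is missing.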

Tasks – This Project

• Perform Sentiment Analysis - Provides insight into positive or negative emotions communicated in the textual data

• Process Word Cloud - Visual representation of key words communicated. Sizes indicate relative importance/frequency.

• Perform Clustering - An unguided (Unsupervised Learning) machine learning technique that reveals underlying themes in the text source

• Temporal Analysis of Negative Sentiment - We look at changes in negative sentiment over time – these might correlate with a client event or with events in the world in general.
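To make the first task above concrete, here is a minimal sketch of a sentiment classifier: a multinomial Naïve Bayes model with add-one smoothing, written in plain Python. It is not the project's code (the project used the NLTK and Scikit-learn packages), and the training comments and labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train_naive_bayes(labeled_comments):
    """Collect the word and label counts that Naive Bayes needs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of comments
    vocabulary = set()
    for text, label in labeled_comments:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (label, word) pairs above zero.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, invented for illustration only.
training = [
    ("the application was quick and easy", "positive"),
    ("great website very clear", "positive"),
    ("easy process thank you", "positive"),
    ("the process was confusing and long", "negative"),
    ("website kept logging me out", "negative"),
    ("too long and a waste of time", "negative"),
]

model = train_naive_bayes(training)
print(predict(model, "confusing website and a waste of time"))
print(predict(model, "quick and easy"))
```

With real survey data the model would be trained on a labeled subset and scored on held-out comments, which is where accuracy, precision, and recall figures come from.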

Sentiment Analysis – Steps and Results

• Evaluated a couple of models
  o Logistic Regression, Naïve Bayes

• Logistic regression performed better

• Classification scores
  o Logistic regression
    • Accuracy – 93.7%
    • Precision – 95.6%, Recall – 97.4%, Fscore – 96.5%
  o Naïve Bayes
    • Accuracy – 90.9%
    • Precision – 96.7%, Recall – 93%, Fscore – 94.8%
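The Fscore values above are consistent with the F1 score, the harmonic mean of precision and recall. Assuming that is the definition used, a short check reproduces both figures from the reported precision and recall:

```python
def f_score(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the slide.
lr_f = f_score(0.956, 0.974)  # logistic regression
nb_f = f_score(0.967, 0.93)   # Naive Bayes

print(round(lr_f, 3), round(nb_f, 3))
```

Naïve Bayes has slightly higher precision, but its lower recall pulls its Fscore below that of logistic regression.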

Sentiment correlates very well with the user-provided experience rating

Sentiment Analysis - Results

Words identified align well with sentiment

Generate Word Cloud - Tasks

• Generate key words that dominate negative comments

• Survey comments transformed using the text pre-processing steps below
  • Stemming
  • Removing most common words (I, we, is, etc.)
  • Spellcheck
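The pre-processing steps above can be sketched as follows. This is a simplified illustration in plain Python: the stop-word set is a small made-up sample, `naive_stem` is a crude suffix stripper standing in for a real stemmer such as those in NLTK, spellcheck is omitted, and the comments are invented.

```python
import re
from collections import Counter

# A tiny illustrative stop-word set; real lists are much longer.
STOP_WORDS = {"i", "we", "is", "the", "a", "to", "and", "it", "was", "of", "me", "in"}


def naive_stem(word):
    """Strip a few common suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keyword_counts(comments):
    """Count stemmed, non-stop-word tokens across all comments."""
    counts = Counter()
    for comment in comments:
        for word in re.findall(r"[a-z']+", comment.lower()):
            if word not in STOP_WORDS:
                counts[naive_stem(word)] += 1
    return counts


# Hypothetical negative comments, invented for illustration only.
comments = [
    "The website kept logging me out",
    "Logging in was confusing",
    "It is confusing and the pages kept timing out",
]

counts = keyword_counts(comments)
print(counts.most_common(4))
```

The resulting counts are what a word cloud renders: the more frequent a stem, the larger it is drawn. Stems such as `logg` or `confus` are not real words, which is normal for stemming.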

Word Cloud - Results

Clustering - Tasks

• Used K-Means Clustering model

• The Process
  • Text pre-processing
  • Convert text to numbers: Term Frequency – Inverse Document Frequency Transformation
  • Run K-Means for assigned number of clusters
  • Generate top-n key words that most represent each cluster
  • Analyze output to identify key insights
  • Tune, Iterate: Algorithm parameters, number of clusters, etc.
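The process above can be sketched end to end in plain Python. This is an illustration of the technique, not the project's code: it uses the basic TF-IDF weighting tf × log(N / df) (library implementations such as Scikit-learn's use a smoothed, normalized variant), seeds K-Means with the first two documents so the run is deterministic, and works on four invented comments.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn tokenized documents into TF-IDF vectors over a shared vocabulary."""
    vocabulary = sorted({word for doc in documents for word in doc})
    doc_freq = Counter(word for doc in documents for word in set(doc))
    n_docs = len(documents)
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)
        vectors.append([
            (term_freq[word] / len(doc)) * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, initial_centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [list(c) for c in initial_centroids]
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(len(centroids)), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


def top_terms(centroid, vocabulary, n=3):
    """Return the n highest-weighted words of a cluster centroid."""
    ranked = sorted(zip(centroid, vocabulary), reverse=True)
    return [word for weight, word in ranked[:n] if weight > 0]


# Hypothetical comments, invented for illustration only.
comments = [
    "process takes way too long",
    "too many personal questions",
    "long tedious process",
    "personal questions about high school",
]
documents = [c.split() for c in comments]

vocabulary, vectors = tfidf_vectors(documents)
assignments, centroids = k_means(vectors, initial_centroids=vectors[:2])

print(assignments)
for cluster_id, centroid in enumerate(centroids):
    print(cluster_id, top_terms(centroid, vocabulary))
```

The top terms per centroid play the role of the "Key Words" column in the results; assigning a theme to each cluster remains a manual, interpretive step.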

Clustering - Results

Cluster | Key Words | Theme

0 | college, times, just, apply, confusing, did, student, need, website, difficult | No clear theme

1 | time, kept, consuming, time consuming, waste, logging, waste time, kept logging, times, page | Potential Website Issues

2 | process, application process, long, times, college, student, just, class, students, online | No clear theme

3 | long, takes, way long, took, way, unnecessary, process, complicated, tedious, personal | Time Consuming

4 | school, personal, sexual, high, high school, orientation, sexual orientation, personal information, personal questions, college | Personal Information

Clustering model identified some generic trends on potential sources of dissatisfaction

Temporal Analysis

2014 had a spike in the number of negative comments
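A temporal analysis of this kind amounts to counting negative comments per period and looking for outliers. The sketch below shows the counting step in plain Python; the records are invented for illustration and are not the client's data.

```python
from collections import Counter
from datetime import date


def negative_counts_by_year(records):
    """Count comments labeled negative, grouped by the year they were made."""
    return Counter(d.year for d, sentiment in records if sentiment == "negative")


# Hypothetical (date, sentiment) records, invented for illustration only.
records = [
    (date(2013, 3, 1), "negative"),
    (date(2013, 7, 9), "positive"),
    (date(2014, 1, 15), "negative"),
    (date(2014, 2, 2), "negative"),
    (date(2014, 9, 30), "negative"),
    (date(2015, 5, 5), "negative"),
]

by_year = negative_counts_by_year(records)
peak_year, peak_count = by_year.most_common(1)[0]
print(sorted(by_year.items()), peak_year, peak_count)
```

In practice it is worth also comparing the share of negative comments per year, since a rise in total survey volume alone can produce a spike in raw counts.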

Tools/ Environment

Environment can be on-premise, hybrid cloud, or cloud. NetApp storage provides excellent performance for this type of application.

Tools and Environment – This Project

Tools/Environment:

• Secure AWS environment with the following open source tools:
  • Python / Natural Language Toolkit (NLTK) package
  • Open source machine learning tools: Python / Scikit-learn package

Roles, Schedule, and Next Steps

Roles on this project:

• Two Data Scientists (Harsha & Ananth)

• Project Manager (part time)

• AWS Solution Architect (to build environment)

Schedule:

• Less than three elapsed months from concept to completion

• Approximately three weeks of actual work

Next Steps:

• This client has at least half a dozen other ML projects

• Fraud, where to apply expert guidance, …

Thank You

Panel Discussion