Data Analysis in a Divan

Post on 08-Feb-2017


Data Analysis on a Divan:

Let’s talk about our problems...

Dr Grazziela Figueredo


Data Analyst

• Who is that?

• What does it do?

• Why are there so many people talking about it?

• http://www.nottingham.ac.uk/adac/meet-the-team/meet-the-team.aspx


The Hype Cycle - 2014


The Hype Cycle - 2015


The Hype Cycle - 2016


Data Analyst – What is expected

• Technical expertise
  • Stats
  • Machine learning
  • HPC, MPI, …
  • Hadoop (Scala, Spark, Storm, the zoo…)
  • R, Matlab, SQL, Python, Tableau, Google graphs, …
  • Sentiment analysis
  • Bioinformatics
  • Maths…

• Interpersonal skills
  • Communication (spoken, written)
  • Salesperson
  • Creative thinking
  • Management skills
  • Teamwork
  • Move between different disciplines
  • Eager to learn
  • Etc., etc.…

InfoGrazzphics (outdated already)


Common Data Analysis Phases

Problem definition

Agreement

Planning

Pre-processing

Analysis

Verification/Validation

Results Report


Talking to the Client

• The client speaks as if you were an expert in their field…
• Multidisciplinary contexts
• New jargon
• If you don’t understand: ask questions
• Ask for literature
• Interaction is the key!

Problem definition Agreement


Work Plan

• Difficult to determine the time required for the analysis
• Prepare the data for analysis
• Define deliverables

• Depends on the data
• Type of analysis
• Amount of money to pay for the analysis
• Availability of the team
• Technical expertise available
• Assessment of the data
• Infrastructure available

Agreement Planning


Data Formats

• Different sources, different formats
• Same data, different formats of files
• Fusion, consistency
• Selection of relevant data

Pre-processing
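
The fusion and consistency points above can be sketched in a few lines. This is only an illustration, assuming two hypothetical sources (a CSV export and a JSON export) that hold the same kind of record under different field names; the field names, the sample values, and the `normalise` and `fuse` helpers are invented for the example and are not from the talk.

```python
import csv
import io
import json

# Hypothetical sample data: the same kind of record in two file formats.
CSV_TEXT = "patient_id,age\n1,34\n2,51\n"
JSON_TEXT = '[{"id": 3, "Age": "47"}, {"id": 2, "Age": "51"}]'


def normalise(record, id_key, age_key):
    """Map a source-specific record onto one shared schema."""
    return {"id": int(record[id_key]), "age": int(record[age_key])}


def fuse(csv_text, json_text):
    """Merge both sources by id and list ids whose duplicates disagree."""
    rows = [normalise(r, "patient_id", "age")
            for r in csv.DictReader(io.StringIO(csv_text))]
    rows += [normalise(r, "id", "Age") for r in json.loads(json_text)]

    merged, conflicts = {}, []
    for row in rows:
        seen = merged.get(row["id"])
        if seen is not None and seen != row:
            conflicts.append(row["id"])  # same id, different content
        else:
            merged[row["id"]] = row
    return merged, conflicts


merged, conflicts = fuse(CSV_TEXT, JSON_TEXT)
print(sorted(merged), conflicts)
```

Keeping the first record and reporting the conflict, rather than silently overwriting, is one possible design choice; which source wins is a question for the client.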


And suddenly your import script is not working anymore… why is that?
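
One common reason is that the source file changed shape: a column was renamed, dropped, or added. A minimal sketch of a check that fails early with a readable message, assuming a hypothetical expected schema (`EXPECTED_COLUMNS` and the sample header are made up for illustration):

```python
import csv
import io

EXPECTED_COLUMNS = {"sample_id", "value"}  # hypothetical schema


def check_header(text, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names for CSV content given as text."""
    reader = csv.DictReader(io.StringIO(text))
    found = set(reader.fieldnames or [])
    return sorted(expected - found), sorted(found - expected)


# A file where "value" was renamed to "Value" and a "notes" column appeared.
missing, unexpected = check_header("sample_id,Value,notes\n1,0.5,ok\n")
if missing or unexpected:
    print(f"Header changed. Missing: {missing}; unexpected: {unexpected}")
```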


Large data, short memory/few resources?
What infrastructure do you need?
Who are you going to share it with?
What are the team priorities for resource allocation?

Pre-processing Analysis
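
When the data does not fit in memory, some summaries can still be computed in a single pass, one row at a time. The sketch below uses Welford's online algorithm for the mean and sample variance; the column name and the sample values are hypothetical, and the technique is offered as one option rather than something the talk prescribes.

```python
import csv
import io


def running_stats(rows, column):
    """Welford's online algorithm: count, mean and sample variance in one pass."""
    n, mean, m2 = 0, 0.0, 0.0
    for row in rows:
        x = float(row[column])
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance


# Hypothetical data. In practice `source` would be an open file handle,
# which csv.DictReader consumes lazily, so only one row is held at a time.
source = io.StringIO("value\n2\n4\n4\n4\n5\n5\n7\n9\n")
n, mean, variance = running_stats(csv.DictReader(source), "value")
print(n, mean, variance)
```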


Incomprehensible errors… back to programming life…

Analysis


Torture the data until it confesses?

• Large data does not always mean useful data
• The more the merrier?
• Difficulties of dealing with small data
• Generalisation
• Models without robustness
• Missing values

• Data with no detectable patterns
• Was the data collected correctly?
• Was the correct data collected?

Analysis Verification/Validation Results Report
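
Before any modelling, it helps to know how much of each field is actually missing. A minimal sketch, assuming records arrive as dictionaries and that empty strings, "NA" and `None` all count as missing; those markers, the field names and the sample records are assumptions for illustration.

```python
def missing_report(records, missing_markers=("", "NA", None)):
    """Fraction of missing entries per field across a list of dicts."""
    fields = sorted({key for record in records for key in record})
    report = {}
    for field in fields:
        # A field that is absent from a record counts as missing too.
        missing = sum(1 for r in records if r.get(field) in missing_markers)
        report[field] = missing / len(records)
    return report


# Hypothetical records.
records = [
    {"age": "34", "weight": "70"},
    {"age": "NA", "weight": "82"},
    {"age": "51"},                  # weight absent entirely
    {"age": "", "weight": "65"},
]
print(missing_report(records))
```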


Clients

• As in any area of CS:
  • Unrealistic deadlines, even when the client doesn’t know what to get from the analysis
  • Unrealistic expectations (e.g. a major analysis breakthrough with 12 data points)
  • Disappointment when the result of the analysis does not produce what was expected (i.e. a major breakthrough)

• They get discouraged and stop believing in data analysis
• You need them to validate your results (mostly)
• Complicated solutions with high performance vs simple solutions with lower performance (different clients, different preferences)
• An interactive/iterative process is always very useful

• Data scientists need love and validation ;)

Planning Verification/Validation Results Report


Disappointing Results?
Says who? According to whom?
Is one scientist’s trash another scientist’s gold?

                         Before VIP selection             After VIP selection
Classifier  Metric       Cross validation  Accuracy       Cross validation  Accuracy
NB          Sensitivity  0.63±0.18         62.12±12.11    0.60±0.18         57.59±12.86
            Specificity  0.60±0.21                        0.54±0.21
SVMs        Sensitivity  0.93±0.08         67.55±8.75     0.80±0.13         64.00±10.98
            Specificity  0.24±0.18                        0.35±0.19
RF          Sensitivity  0.86±0.13         65.12±10.16    0.86±0.12         61.86±9.73
            Specificity  0.28±0.19                        0.19±0.18
RBF         Sensitivity  0.83±0.13         65.29±11.10    0.68±0.18         60±11.57
            Specificity  0.35±0.20                        0.45±0.22
MLP         Sensitivity  0.75±0.17         66.33±12.28    0.76±0.14         66.83±11.57
            Specificity  0.51±0.23                        0.51±0.21

Results Report
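
For readers less familiar with the three measures reported on this slide, the sketch below computes them from the four cells of a binary confusion matrix using the standard definitions. The counts are hypothetical and are not taken from the study behind the table; they are chosen only to show how high sensitivity and low specificity can sit behind a middling accuracy.

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from binary confusion counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts, not from the talk.
sens, spec, acc = classification_metrics(tp=45, fn=5, tn=10, fp=40)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.1%}")
```

Whether such a trade-off is "disappointing" depends on which kind of error the client cares about most.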