The Data Scientist Profession - OSPcon


The Data Scientist Profession

Leonid Zhukov, Department of Applied Mathematics

Director of Data Science, Ancestry.com

Higher School of Economics, Moscow, 2013

www.hse.ru

Conference “Big Data in the National Economy”

Moscow, 2013


Sexiest job of the 21st century


McKinsey estimates a shortage of 140,000-190,000 specialists by 2018.


Data Scientists wanted!


Supply and demand


Who are Data Scientists?

A practitioner of data science is called a data scientist (Wikipedia)


Preferred education:
• Computer Science
• Statistics, mathematics
• Exact sciences: physics, engineering, etc.
• Master's and PhD (Candidate of Sciences) degrees

Data Scientist:
• Loves data
• Has an inquisitive, research-oriented mindset
• Works toward finding patterns in data
• Is a practitioner, not a theorist
• Is able and eager to do hands-on work
• Is an expert in the application domain (*)
• Works in a team

[...] demand for a certain set of skills, while later demand wanes as many of those initial skills are automated by even newer tools. Consider, for instance, the way many data processing and network management jobs that used to require legions of computer operators are now handled by automated monitoring tools. Data science is still in its very early phase, with the amount of data exploding and the right tools to process them just becoming available.

Although data science is generating new opportunities, our capacity to train new data scientists is not keeping up, and nearly two-thirds of respondents foresee a looming shortfall in the number of data scientists over the next five years. This aligns with other research, including a recent McKinsey Global Institute study that predicts a shortage of 190,000 data scientists by the year 2019 [iii]. And when our respondents were asked where the best source for talent was, few looked to today’s business intelligence professional. Instead, nearly two-thirds looked for today’s university students.

Who is the Data Scientist?

Although the term data science has been around for decades (indeed, most scientists use data of some form), the term data scientist in its current context is relatively new, frequently credited to DJ Patil, who started the data science team at LinkedIn [iv]. But as a new term, the field is still very much in flux, and without evidence about the practitioners, we’re left to speculate about what it may mean.

In our survey, we allowed users to self-identify as “data science professionals,” in order to avoid conflicts over terminology in job titles. In this section we’ll attempt to define the data scientist by comparing them with the previous big player in the analytics space, business intelligence professionals.

Twenty years ago, business intelligence was itself a new term, just emerging to take over the various database management and decision support functions within an organization. As the field grew rapidly in the 90s, it also coalesced around a smaller number of tools, more consistent expectations for talent, better training, and more rigorous organizational standards. As our data demonstrates, data scientists are currently going through that transition, [...]

The best source of new Data Science talent is:
• Students studying computer science: 34%
• Students studying fields other than computer science: 24%
• Professionals in disciplines other than IT or computer science: 27%
• Today's BI professionals: 12%
• Other: 3%

Jim Asplund, Chief Scientist at Gallup Consulting, is a data scientist focused on evaluating the role that human perception has on everything from disease conditions and GDP to worker productivity and consumer behavior. He works with massive data sets linking perception with actual behavior, and micro and macroeconomic outcomes. His work has isolated emotional factors that are most highly related to outcomes organizations care about.

EMC Data Science Community Survey, 2011

Drew Conway, 2010


Working tools

• Operating systems:
  • Linux + shell tools
• Big data tools:
  • Hadoop (MapReduce) + Hadoop tools
  • Hive, Pig
  • NoSQL (HBase, MongoDB, Cassandra, Neo4j)
• Databases:
  • SQL
• Programming:
  • Python
  • Java
  • Scala
• Machine learning:
  • R
  • Matlab
  • Python libraries (NumPy, SciPy, NLTK, …)
  • Java libraries (Mahout)


A day in the life of a Data Scientist

Problem statement
Data acquisition
Format parsing, organization
Cleaning, filtering
Data exploration
Model building
Visualization
Discussion of results
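
The workflow steps listed above can be illustrated with a small sketch. It is not taken from the talk: the data set, column names (`visits`, `signups`) and all function names are hypothetical, and a plain least-squares line stands in for "model building". It uses only the Python standard library so that it runs as is.

```python
import csv
import io
import statistics

# Hypothetical raw data; in practice this would come from logs, a database, or an API.
RAW = """day,visits,signups
1,100,10
2,150,16
3,,12
4,200,19
5,250,26
"""

def parse(text):
    """Format parsing: turn raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Cleaning and filtering: drop rows with missing fields, convert values to float."""
    return [
        {key: float(value) for key, value in row.items()}
        for row in rows
        if all(value not in (None, "") for value in row.values())
    ]

def explore(rows, column):
    """Exploration: basic summary statistics for one column."""
    values = [row[column] for row in rows]
    return {"n": len(values), "mean": statistics.mean(values),
            "min": min(values), "max": max(values)}

def fit_line(rows, x, y):
    """Model building: ordinary least-squares fit of y = a + b * x."""
    xs = [row[x] for row in rows]
    ys = [row[y] for row in rows]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
             / sum((xi - mean_x) ** 2 for xi in xs))
    return mean_y - slope * mean_x, slope

def text_plot(rows, x, y, width=30):
    """Visualization: a crude text bar chart of y for each value of x."""
    top = max(row[y] for row in rows)
    return [f"{row[x]:>6.0f} | " + "#" * round(width * row[y] / top) for row in rows]

rows = clean(parse(RAW))
summary = explore(rows, "signups")
intercept, slope = fit_line(rows, "visits", "signups")
print(summary)
print(f"signups ~ {intercept:.2f} + {slope:.3f} * visits")
print("\n".join(text_plot(rows, "visits", "signups")))
```

The first and last steps (problem statement, discussion of results) have no code counterpart here. On real data each function would typically be replaced by the tools named on the previous slide, for example Hive or SQL for acquisition and R or Python libraries for modeling.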


Data Scientist or Analyst

If you write code, you are most likely a Data Scientist; if you use Excel, you are an analyst.

• Data Scientists:
  • Use Hadoop, MapReduce, Hive, R
  • Build specialized systems and tools
  • Work with structured and unstructured data
  • Work with data measured in TB, PB
  • Have research experience and expertise in statistics, machine learning, programming
  • Hold Master's and PhD degrees
  • Develop predictive models
  • Create data products
• Analysts:
  • Use Excel, SQL
  • Use existing tools and systems
  • Work with tabular data
  • Work with data measured in MB, GB
  • Have professional education, no formal research training
  • Hold Bachelor's and similar degrees (BS, BA, MS, MBA)
  • Work closely with BI and marketing
  • Produce reports and describe data
  • Most often work with data on business performance metrics


Survey: Data Scientist roles and skills

From: “Analyzing the Analyzers” by Harlan Harris, Sean Murphy, and Marck Vaisman, O’Reilly Strata 2012


The Data Science team: “the dream team”

From: “Doing Data Science: Straight Talk from the Frontline”, Rachel Schutt, Cathy O'Neil, O'Reilly Media, 2013


Applied problems

• Marketing:
  • Market segmentation
  • Customer acquisition and churn modeling
  • Recommender systems
  • Social media analysis
• Finance and insurance:
  • Fraud prevention
  • Anomalous behavior detection
  • Credit risk analysis
  • Insurance modeling
  • Portfolio optimization
• Healthcare and pharmacology:
  • Genetic analysis
  • Clinical trial analysis
  • Clinical decision support systems


A long road ahead…

• Programming
• Algorithms and data structures
• Databases
• Statistics
• Data analysis
• Machine learning
• Computational text processing
• Distributed systems
• Big Data tools
• Data visualization

From: Swami Chandrasekaran, Executive Architect, IBM, Watson Solutions


Industry training programs


Educational programs

University programs:
• University of Washington: Certificate in Data Science
• UC Berkeley: Master of Information and Data Science program
• New York University: Data Science at NYU
• Columbia University: Institute for Data Sciences and Engineering
• University of Southern California (USC): Master of Science in Data Science

Online courses (MOOCs):
• Coursera
• edX
• Udacity

Accelerated training programs (companies):
• Zipfian Academy (12-week intensive program)
• Insight Data Science Fellows Program (6-week postdoctoral training)


Conferences

Industry conferences and trade shows:
• O’Reilly Strata Conference: Making Data Work
• Hadoop World
• Big Data TechCon
• Big Data Innovation summits

Meetups (informal interest groups)

Scientific and academic conferences (peer reviewed):
• IEEE & ACM Supercomputing
• IEEE Big Data
• ACM KDD Knowledge Discovery and Data Mining
• ACM SIGIR Information Retrieval
• ICML International Conference on Machine Learning
• NIPS Neural Information Processing
• WWW World Wide Web Conference
• VLDB Very Large Data Bases
• IEEE Visualization


Books


Open questions

• How important is it to be an expert in the domain of the problem being solved (domain expertise)?
• Which matters more in the Data Scientist profession: education or practical experience?
• What are the prospects for the Data Scientist profession: will it be replaced by software solutions?


HSE Department of Applied Mathematics

Courses taught at the department:
• Programming (Python, Java, Matlab)
• Data mining methods
• Machine learning
• Statistics
• Computational linguistics
• Social network analysis
• Distributed systems
• Fundamentals of visualization


20 Myasnitskaya St., Moscow, 101000, Russia

Tel.: (495) 621-7983, fax: (495) 628-7931

www.hse.ru