Facebook.com/statisticssweden @SCB_nyheter Unlocking the Full Potential of Big Data Lilli Japec,...

facebook.com/statisticssweden @SCB_nyheter Unlocking the Full Potential of Big Data Lilli Japec, Frauke Kreuter JOS anniversary June 2015


Page 1:

facebook.com/statisticssweden @SCB_nyheter

Unlocking the Full Potential of Big Data

Lilli Japec, Frauke Kreuter

JOS anniversary

June 2015

Page 2:

The report is available at https://www.aapor.org

Page 3:

Task Force Members:
• Lilli Japec, Co-Chair, Statistics Sweden
• Frauke Kreuter, Co-Chair, JPSM at the U. of Maryland, U. of Mannheim & IAB
• Marcus Berg, Stockholm University
• Paul Biemer, RTI International
• Paul Decker, Mathematica Policy Research
• Cliff Lampe, School of Information at the University of Michigan
• Julia Lane, American Institutes for Research
• Cathy O’Neil, Johnson Research Labs
• Abe Usher, HumanGeo Group

Page 4:

AAPOR (American Association for Public Opinion Research)
• a professional organization dedicated to advancing the study of “public opinion,” broadly defined, to include attitudes, norms, values, and behaviors
• promotes best practices and transparency
• works to educate its members as well as policy makers, the media, and the public at large to help them make better use of surveys and survey findings, and to inform them about new developments in the field
• other task force reports available on https://www.aapor.org

Page 5:

Outline of our presentations
• What is Big Data?
• Paradigm shift
• Big Data activities in different organizations
• Skills required
• Big Data process and data quality

Page 6:

UNTIL RECENTLY: three main data sources

Page 7:

Administrative Data

Survey Data

Experiments

Page 8:

NOW

Page 9:

US Aggregated Inflation Series, Monthly Rate, PriceStats Index vs. Official CPI. Accessed January 18, 2015 from the PriceStats website.

Page 10:

Number of vehicles detected in the Netherlands on December 1, 2011 created by Statistics Netherlands (Daas et al. 2013). The vehicle size is shown in different colors; black is small size, red is medium size and green is large size.

Page 11:

Social media sentiment (daily, weekly and monthly) in the Netherlands, June 2010 - November 2013. The development of consumer confidence for the same period is shown in the insert (Daas and Puts 2014).

Page 12:

Big Data

http://www.rosebt.com/blog/data-veracity

Page 13:

Hope that found/organic data

Can replace or augment expensive data collections

More (= better) data for decision making

Information available in (nearly) real time

Page 14:

New paradigm

• New business model
  • Federal agencies no longer major players
• New analytical model
  • Outliers
  • Fine-grained analysis
  • New units of analysis
• New sets of skills
  • Computer scientists
  • Citizen scientists
• Different cost structure

Source: Julia Lane

Page 15:

Eurostat Big Data Action Plan and Roadmap
• Pilots exploring the potential of selected big data sources
• The project will also include activities on: methodological frameworks, quality frameworks, metadata frameworks, IT infrastructures, communication, legal frameworks, ethical frameworks, skills and training, and experience sharing.

Page 16:

UNECE and Big Data
• The “Sandbox” provides a computing environment to load Big Data sets and tools
• Consumer price indices – experimenting with the computation of price indexes
• Mobile telephone data – statistics on tourism and daily commuting
• Smart meters – statistics on power consumption using data collected from smart meter readings
• Traffic loops – traffic statistics using data from traffic loops
• Social media – using Twitter data to analyze sentiment and tourism flows
• Job portals – computing statistics on job vacancies
• Web scraping – tested methods for automatically collecting data from web sources

Page 18:

Statistics Netherlands: Roadmap BIG DATA

Two focus projects:
• the use of traffic loop data for transportation statistics
• the use of mobile phone data for daytime population and tourism statistics

Six other projects:
• the use of internet data for price statistics,
• investigating the use of bank and credit card transactions,
• the use of social media data for detecting trends in social cohesion,
• the use of internet data for encoding enterprise purchases and sales,
• investigating the use of smartcards of public transport for statistics, and
• the use of internet data for statistics about job vacancies.

Source: Pieter Vlag, Statistics Netherlands

Page 19:

Examples from Statistics Sweden

Scanner data to improve the Household Budget Survey

Job vacancy statistics by scraping the web

To evaluate the use of AIS (Automatic Identification System) data. Cooperation between Statistics Sweden and the agency for Transport Analysis (Trafa). Research funding from the Swedish Innovation Agency (Vinnova).

Page 20:

One day data

Source: Moström and Justesen, Statistics Sweden

Page 21:

SKILLS

What tasks are required to get there?

Page 22:

We have to do this jointly …

Research Questions
Examples: Behavior of interest (migration/political participation/job searches)

Data Generating Process
Examples: geolocated social media + survey + administrative data

Data Curation/Storage
Example: Hadoop Distributed File System

Data Analysis
Example: Hadoop MapReduce; High Frequency Data

Data Output/Access
Example: map visualization / privacy

Page 23:

Source: Abe Usher

Page 24:

Big words …

What is big data?

What is Hadoop File System? (HDFS)

What is Hadoop MapReduce? (MR)

How do you link surveys with big data?

Source: Abe Usher
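
As a rough illustration of the "What is Hadoop MapReduce?" question, the sketch below imitates the map, shuffle, and reduce phases in plain Python on a single machine. It does not use Hadoop itself; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample messages are assumptions made up for this example.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (word, 1) pair per word in a text record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over all records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input records (e.g. short social media messages).
messages = ["big data big hopes", "Big errors"]
counts = word_count(messages)
print(counts)
```

In a real cluster the map and reduce steps would run in parallel on many machines, with the framework handling the shuffle; the logic per record and per key stays this simple.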

Page 25:

System Administrator
• Storage systems (MySQL, HBase, Spark)
• Cloud computing:
  • Amazon Web Services (AWS)
  • Google Compute Engine
• Hadoop ecosystem

Computer scientist
• Data preparation
• MapReduce algorithms
• Python/R programming
• Hadoop ecosystem

Source: Abe Usher

Page 26:

RESEARCH

What do we know about the data generating process?

Page 27:

Veracity

Who? What? Why?

Who is missing? Who is counted repeatedly?

What is not said / measured? ..and why?
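
One way to make "Who is missing? Who is counted repeatedly?" concrete is to compare the unit identifiers seen in a found data source against a reference list. The sketch below assumes such a reference list exists and that units carry comparable identifiers, which is often not the case in practice; `coverage_report`, `frame_ids`, and `observed_ids` are hypothetical names.

```python
from collections import Counter


def coverage_report(frame_ids, observed_ids):
    """Compare observed unit IDs against a reference frame."""
    counts = Counter(observed_ids)
    frame = set(frame_ids)
    observed = set(counts)
    return {
        # In the frame but never observed: "Who is missing?"
        "missing": sorted(frame - observed),
        # Observed more than once: "Who is counted repeatedly?"
        "repeated": sorted(u for u, n in counts.items() if n > 1),
        # Observed but not part of the frame at all.
        "out_of_frame": sorted(observed - frame),
    }


# Hypothetical identifiers, for illustration only.
frame_ids = ["p1", "p2", "p3", "p4"]
observed_ids = ["p1", "p1", "p3", "x9"]
report = coverage_report(frame_ids, observed_ids)
print(report)
```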

Page 28:

But (at least) one more V

http://www.rosebt.com/blog/data-veracity

Page 29:

Errors in Big Data: An Illustration

[Figure: Terrorist Detector]

Suppose 1 in 1,000,000 people are terrorists

The Big Data Terrorist Detector is 99.9% accurate

The detector says your friend, Jack, is a terrorist.

What are the odds that Jack is really a terrorist?

Source: Paul Biemer

Page 30:

Errors in Big Data: An Illustration

[Figure: Terrorist Detector]

Suppose 1 in 1,000,000 people are terrorists

The Big Data Terrorist Detector is 99.9% accurate

The detector says your friend, Jack, is a terrorist.

What are the odds that Jack is really a terrorist?

Answer: 1 in 1000, i.e., 99.9% of the terrorist detections will be false!

Source: Paul Biemer
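
The answer follows from Bayes' rule. The sketch below reproduces it under one assumption that the slide does not spell out: "99.9% accurate" is read as both a 99.9% chance of flagging a true terrorist (sensitivity) and a 99.9% chance of clearing a non-terrorist (specificity). The function name `posterior_probability` is made up for this example.

```python
def posterior_probability(prevalence, sensitivity, specificity):
    """P(condition | positive detection) by Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


prevalence = 1 / 1_000_000  # 1 in 1,000,000 people are terrorists
accuracy = 0.999            # assumption: used as sensitivity AND specificity

p = posterior_probability(prevalence, accuracy, accuracy)
print(f"P(terrorist | flagged) = {p:.6f}, roughly 1 in {round(1 / p):,}")
```

Under these assumptions the probability is close to 0.001, which the slide rounds to "1 in 1000": almost every detection is a false positive because true cases are so rare.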

Page 31:

Big Data Process Map

[Diagram: Generate (Source 1, Source 2, ..., Source K) → ETL: Extract, Transform (Cleanse), Load (Store) → Analyze: Filter/Reduction (Sampling), Computation/Analysis (Visualization)]

Source: Paul Biemer
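
The stages of the process map can be sketched as a chain of small functions. This is a toy, in-memory illustration only; the record layout (`unit`, `value`), the cleansing rule, and all function names are assumptions and not part of the task force report.

```python
def extract(sources):
    """Extract: pull raw records from every source into one list."""
    return [record for source in sources for record in source]


def transform(records):
    """Transform (cleanse): drop records without a value, coerce types."""
    cleaned = []
    for record in records:
        if record.get("value") in (None, ""):
            continue  # a cleansing decision that can itself introduce error
        cleaned.append({
            "unit": str(record["unit"]).strip(),
            "value": float(record["value"]),
        })
    return cleaned


def load(records, store):
    """Load (store): append cleaned records to a store (here a list)."""
    store.extend(records)
    return store


def filter_reduce(store, min_value):
    """Filter/Reduction: keep only records at or above a threshold."""
    return [r for r in store if r["value"] >= min_value]


def analyze(records):
    """Computation/Analysis: mean of the remaining values."""
    values = [r["value"] for r in records]
    return sum(values) / len(values) if values else None


# Two hypothetical sources with messy records.
source_1 = [{"unit": " a ", "value": "10"}, {"unit": "b", "value": ""}]
source_2 = [{"unit": "c", "value": 30}, {"unit": "d", "value": None}]

store = load(transform(extract([source_1, source_2])), [])
result = analyze(filter_reduce(store, min_value=0))
print(len(store), result)
```

Each function is a place where the errors listed on the following slides can enter, which is why the map treats the stages separately.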

Page 32:

Big Data Process Map

[Diagram: Generation (Source 1, Source 2, ..., Source K) → ETL: Extract, Transform (Cleanse), Load (Store) → Analyze: Filter/Reduction (Sampling), Computation/Analysis (Visualization)]

Errors include: low signal/noise ratio; lost signals; failure to capture; non-random (or non-representative) sources; meta-data that are lacking, absent, or erroneous.

Source: Paul Biemer

Page 33:

Big Data Process Map

[Diagram: Generation (Source 1, Source 2, ..., Source K) → ETL: Extract, Transform (Cleanse), Load (Store) → Analyze: Filter/Reduction (Sampling), Computation/Analysis (Visualization)]

Errors include: specification error (including errors in meta-data), matching error, coding error, editing error, data munging errors, and data integration errors.

Source: Paul Biemer

Page 34:

Big Data Process Map

[Diagram: Generation (Source 1, Source 2, ..., Source K) → ETL: Extract, Transform (Cleanse), Load (Store) → Analyze: Filter/Reduction (Sampling), Computation/Analysis (Visualization)]

Data are filtered, sampled or otherwise reduced. This may involve further transformations of the data.

Errors include: sampling errors, selectivity errors (or lack of representativity), modeling errors

Source: Paul Biemer
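
To see why selectivity can matter more than sheer size, the sketch below contrasts a large but selective source with a small simple random sample. The population, the selection rule (only units with a value of 30 or more are captured), and the seed are all invented for illustration.

```python
import random

# Hypothetical population: the outcome takes the values 0..99 equally often.
population = [i % 100 for i in range(100_000)]
true_mean = sum(population) / len(population)

# Large but selective source: units with a value below 30 are never captured.
big_source = [y for y in population if y >= 30]
big_mean = sum(big_source) / len(big_source)

# Small probability sample: simple random sample of 1,000 units.
rng = random.Random(2015)  # fixed seed so the run is repeatable
srs = rng.sample(population, 1000)
srs_mean = sum(srs) / len(srs)

print(f"true mean            : {true_mean:.1f}")
print(f"selective source mean: {big_mean:.1f} (n = {len(big_source):,})")
print(f"random sample mean   : {srs_mean:.1f} (n = {len(srs):,})")
```

The selective source is 70 times larger than the random sample, yet its mean is off by a fixed amount that no increase in volume would remove, while the random sample only carries sampling error.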

Page 35:

Big Data Process Map

[Diagram: Generation (Source 1, Source 2, ..., Source K) → ETL: Extract, Transform (Cleanse), Load (Store) → Analyze: Filter/Reduction (Sampling), Computation/Analysis (Visualization)]

Errors include: modeling errors, inadequate or erroneous adjustments for representativity, computation and algorithmic errors.

Source: Paul Biemer
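
One common form of adjustment for representativity is post-stratification: group means from the data are weighted by known population shares. The sketch below is a minimal version that assumes the population shares are known without error and that every group appears in the data; the groups, values, and the name `poststratified_mean` are hypothetical.

```python
def poststratified_mean(sample, population_shares):
    """Weight each group's sample mean by its known population share."""
    groups = {}
    for group, value in sample:
        groups.setdefault(group, []).append(value)
    return sum(
        population_shares[group] * (sum(values) / len(values))
        for group, values in groups.items()
    )


# Hypothetical data: the young are over-represented (80% of the sample)
# compared with their assumed population share of 50%.
sample = [("young", 10.0)] * 80 + [("old", 20.0)] * 20
population_shares = {"young": 0.5, "old": 0.5}

unweighted = sum(value for _, value in sample) / len(sample)
adjusted = poststratified_mean(sample, population_shares)
print(unweighted, adjusted)
```

If the shares are wrong, or if units within a group differ between those captured and those not captured, the adjusted figure is still biased, which is the "inadequate or erroneous adjustments" error named on the slide.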

Page 36:

POTENTIAL

Page 37:

We have to do this jointly …

Research Questions
Examples: Behavior of interest (migration/political participation/job searches)
Any field

Data Generating Process
Examples: geolocated social media + survey + administrative data
Social Science & Psychology, Humanities, Econ, Business

Data Curation/Storage
Example: Hadoop Distributed File System
Math & Computer Science, Applied Statistics

Data Analysis
Example: Hadoop MapReduce; High Frequency Data
Economics, Social Sciences, Business, Math & Comp

Data Output/Access
Example: map visualization / privacy
Psychology, Law, Math & Comp, Business

Page 38:

... and think about the legal framework