Facebook.com/statisticssweden @SCB_nyheter Unlocking the Full Potential of Big Data Lilli Japec,...

Post on 24-Dec-2015


facebook.com/statisticssweden @SCB_nyheter

Unlocking the Full Potential of Big Data

Lilli Japec, Frauke Kreuter

JOS anniversary

June 2015

The report is available at https://www.aapor.org

Task Force Members:
• Lilli Japec, Co-Chair, Statistics Sweden
• Frauke Kreuter, Co-Chair, JPSM at the U. of Maryland, U. of Mannheim & IAB
• Marcus Berg, Stockholm University
• Paul Biemer, RTI International
• Paul Decker, Mathematica Policy Research
• Cliff Lampe, School of Information at the University of Michigan
• Julia Lane, American Institutes for Research
• Cathy O’Neil, Johnson Research Labs
• Abe Usher, HumanGeo Group

AAPOR (American Association for Public Opinion Research)
• a professional organization dedicated to advancing the study of “public opinion,” broadly defined, to include attitudes, norms, values, and behaviors
• promotes best practices and transparency
• works to educate its members as well as policy makers, the media, and the public at large to help them make better use of surveys and survey findings, and to inform them about new developments in the field
• other task force reports available on https://www.aapor.org

Outline of our presentations
• What is Big Data?
• Paradigm shift
• Big Data activities in different organizations
• Skills required
• Big Data process and data quality

UNTIL RECENTLY: three main data sources

Administrative Data

Survey Data

Experiments

NOW

US Aggregated Inflation Series, Monthly Rate, PriceStats Index vs. Official CPI. Accessed January 18, 2015 from the PriceStats website.

Number of vehicles detected in the Netherlands on December 1, 2011 created by Statistics Netherlands (Daas et al. 2013). The vehicle size is shown in different colors; black is small size, red is medium size and green is large size.

Social media sentiment (daily, weekly and monthly) in the Netherlands, June 2010 - November 2013. The development of consumer confidence for the same period is shown in the insert (Daas and Puts 2014).

Big Data

http://www.rosebt.com/blog/data-veracity

Hope that found/organic data

Can replace or augment expensive data collections

More (= better) data for decision making

Information available in (nearly) real time

New paradigm

New business model: Federal agencies no longer major players

New analytical model: Outliers; Fine-grained analysis; New units of analysis

New sets of skills: Computer scientists; Citizen scientists

Different cost structure

Source: Julia Lane

Eurostat Big Data Action Plan and Roadmap
• Pilots exploring the potential of selected big data sources
• The project will also include activities on: methodological frameworks, quality frameworks, metadata frameworks, IT infrastructures, communication, legal frameworks, ethical frameworks, skills and training, and experience sharing.

UNECE and Big Data
• The “Sandbox” provides a computing environment to load Big Data sets and tools
• Consumer price indices – experimenting with the computation of price indexes
• Mobile telephone data – statistics on tourism and daily commuting
• Smart meters – statistics on power consumption using data collected from smart meter readings
• Traffic loops – traffic statistics using data from traffic loops
• Social media – using Twitter data to analyze sentiment and tourism flows
• Job portals – computing statistics on job vacancies
• Web scraping – tested methods for automatically collecting data from web sources

Statistics Netherlands: Roadmap BIG DATA

Two focus projects:
• the use of traffic loop data for transportation statistics
• the use of mobile phone data for daytime population and tourism statistics

Six other projects:
• the use of internet data for price statistics
• investigating the use of bank and credit card transactions
• the use of social media data for detecting trends in social cohesion
• the use of internet data for encoding enterprise purchases and sales
• investigating the use of smartcards of public transport for statistics
• the use of internet data for statistics about job vacancies

Source: Pieter Vlag, Statistics Netherlands

Examples from Statistics Sweden

Scanner data to improve the Household Budget Survey

Job vacancy statistics by scraping of the web

To evaluate the use of AIS (Automatic Identification System) data. Cooperation between Statistics Sweden and the agency for Transport Analysis (Trafa). Research funding from the Swedish Innovation Agency (Vinnova).

One day data

Source: Moström and Justesen, Statistics Sweden

SKILLS: What tasks are required to get there?

We have to do this jointly …

Data Generating Process. Examples: geolocated social media + survey + administrative data

Data Curation/Storage. Example: Hadoop Distributed File System

Data Analysis. Example: Hadoop MapReduce; High Frequency Data

Data Output/Access. Example: map visualization / privacy

Research Questions. Examples: Behavior of interest (migration/political participation/job searches)

Source: Abe Usher

Big words …

What is big data?

What is Hadoop File System? (HDFS)

What is Hadoop MapReduce? (MR)

How do you link surveys with big data?

Source: Abe Usher
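To make the MapReduce question above more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop code and uses no Hadoop API; the function names and the two input records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for each word in one input record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)

# Hypothetical input records (in a real cluster these would be file splits).
records = ["big data big hopes", "data quality matters"]

pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on different machines, with the framework handling the shuffle; the sketch only shows the division of labour between the three steps.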

System Administrator
• Storage systems (MySQL, HBase, Spark)
• Cloud computing: Amazon Web Services (AWS), Google Compute Engine
• Hadoop ecosystem

Computer scientist
• Data preparation
• MapReduce algorithms
• Python/R programming
• Hadoop ecosystem

Source: Abe Usher

RESEARCH: What do we know about the data generating process?

Veracity

Who? What? Why?

Who is missing? Who is counted repeatedly?

What is not said / measured? ... and why?

But (at least) one more V

http://www.rosebt.com/blog/data-veracity

Errors in Big Data: An Illustration

[Figure: Terrorist Detector]

Suppose 1 in 1,000,000 people are terrorists

The Big Data Terrorist Detector is 99.9% accurate

The detector says your friend Jack is a terrorist.

What are the odds that Jack is really a terrorist?

Source: Paul Biemer


Answer: about 1 in 1,000, i.e., roughly 99.9% of the terrorist detections will be false!

Source: Paul Biemer
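The answer follows from Bayes' rule. The sketch below works it through, assuming that "99.9% accurate" means the detector both flags 99.9% of true terrorists and clears 99.9% of non-terrorists; the slide does not say this explicitly, and the function name is hypothetical.

```python
from fractions import Fraction

def posterior_given_flag(prevalence, sensitivity, specificity):
    """P(terrorist | flagged) by Bayes' rule."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

prevalence = Fraction(1, 1_000_000)  # 1 in 1,000,000 people are terrorists
accuracy = Fraction(999, 1000)       # assumption: applies to both error types

p = posterior_given_flag(prevalence, accuracy, accuracy)
print(p)         # exact probability as a fraction
print(float(p))  # the same probability as a decimal
```

Under these assumptions the exact probability is 1 in 1,002, which the slide rounds to 1 in 1,000: the rare true positives are swamped by false positives from the very large non-terrorist population.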


Big Data Process Map

Generate: Source 1, Source 2, …, Source K

ETL: Extract; Transform (Cleanse); Load (Store)

Analyze: Filter/Reduction (Sampling); Computation/Analysis (Visualization)

Source: Paul Biemer
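The stages of the process map can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the record layout (`id`, `value`), the cleansing rule, the filter threshold, and all function names are hypothetical and do not come from the task force report.

```python
def extract(sources):
    """Extract: pull raw records from every source into one list."""
    return [record for source in sources for record in source]

def transform(records):
    """Transform (cleanse): normalize ids and drop records with missing values."""
    cleaned = []
    for record in records:
        if record.get("value") is None:
            continue  # dropped silently here; a real pipeline should log this
        cleaned.append({"id": record["id"].strip().lower(),
                        "value": float(record["value"])})
    return cleaned

def load(records, store):
    """Load (store): keep one record per id; later duplicates overwrite earlier ones."""
    for record in records:
        store[record["id"]] = record
    return store

def analyze(store, minimum):
    """Filter/reduce the stored records, then compute a simple mean."""
    kept = [r["value"] for r in store.values() if r["value"] >= minimum]
    return sum(kept) / len(kept) if kept else None

# Two hypothetical sources with a missing value and an inconsistently written id.
source_1 = [{"id": " A1 ", "value": "10"}, {"id": "b2", "value": None}]
source_2 = [{"id": "a1", "value": "12"}, {"id": "C3", "value": "4"}]

store = load(transform(extract([source_1, source_2])), {})
result = analyze(store, minimum=5)
print(sorted(store), result)
```

Even in this small sketch each stage makes a decision that can introduce error: one record is dropped during cleansing, two records are merged on a normalized id, and the filter threshold changes which values reach the final computation.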

Big Data Process Map


Errors include: low signal/noise ratio; lost signals; failure to capture; non-random (or non-representative) sources; meta-data that are lacking, absent, or erroneous.

Source: Paul Biemer

Big Data Process Map


Errors include: specification error (including errors in meta-data), matching error, coding error, editing error, data munging errors, and data integration errors.

Source: Paul Biemer

Big Data Process Map


Data are filtered, sampled or otherwise reduced. This may involve further transformations of the data.

Errors include: sampling errors, selectivity errors (or lack of representativity), and modeling errors.

Source: Paul Biemer
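A small simulation can illustrate the difference between sampling error and selectivity error. Everything in it is a made-up assumption for illustration: a population of 10,000 people, 30% of whom are covered by a big data source, with a trait that is more common among those covered (60%) than among the rest (20%).

```python
import random

# Hypothetical population: 30% covered by the source; the trait rate is
# 60% among covered people and 20% among everyone else.
population = []
for i in range(10_000):
    covered = i % 10 < 3
    has_trait = (i // 10) % 10 < (6 if covered else 2)
    population.append((covered, has_trait))

def share_with_trait(units):
    """Proportion of units that have the trait."""
    return sum(has_trait for _, has_trait in units) / len(units)

truth = share_with_trait(population)

# "Big" source: every covered person (3,000 records), nobody else.
big_source = [unit for unit in population if unit[0]]
big_estimate = share_with_trait(big_source)

# Small simple random sample of 500 people from the whole population.
rng = random.Random(2015)  # fixed seed so the run is repeatable
small_sample = rng.sample(population, 500)
small_estimate = share_with_trait(small_sample)

print(truth, big_estimate, small_estimate)
```

In this setup the 3,000-record source is off by a wide margin because of who is missing, while the 500-person random sample is subject only to sampling error and lands much closer to the true share. More records do not repair a selective source.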

Big Data Process Map


Errors include: modeling errors, inadequate or erroneous adjustments for representativity, computation and algorithmic errors.

Source: Paul Biemer

POTENTIAL

We have to do this jointly …

Data Generating Process. Examples: geolocated social media + survey + administrative data. Fields: Social Science & Psychology, Humanities, Econ, Business

Data Curation/Storage. Example: Hadoop Distributed File System. Fields: Math & Computer Science, Applied Statistics

Data Analysis. Example: Hadoop MapReduce; High Frequency Data. Fields: Economics, Social Sciences, Business, Math & Comp

Data Output/Access. Example: map visualization / privacy. Fields: Psychology, Law, Math & Comp, Business

Research Questions. Examples: Behavior of interest (migration/political participation/job searches). Fields: Any field

... and think about the legal framework