Nonparametric Importance Sampling for Big Dataasurtg/Projects/RTGSlidesNachtsheimS18.pdfReal Data...

Nonparametric Importance Sampling for Big Data

Abigael C. Nachtsheim

SCHOOL OF MATHEMATICAL AND STATISTICAL SCIENCES

Research Training Group Spring 2018

Advisor: Dr. Stufken

SCHOOL OF MATHEMATICAL AND STATISTICAL SCIENCES Abigael C. Nachtsheim

Motivation

• Goal: build a model that predicts well over the predictor space• Massive amounts of data increasingly available• Big data presents computational challenges• First step: some method of data reduction

Data Reduction Overview

• Our data set consists of n observations• n is very large

• From the full data, select s observations• s << n•  the s observations make up the subdata

• Carry out data analysis on subdata only

Data Reduction Overview: Example

• Full data: 1 response, 9 predictors, 10,000,000 observations• n = 10,000,000

• Choose s = 5,000• Subdata: 1 response, 9 predictors, 5,000 observations

Obs Y X1 X2 X3 X4 X5 X6 X7 X8 X9

But how do we choose?

Selecting Subdata: Approach 1

• Goal: Subdata that is similar to full data• Just take a simple random sample- Fast- Easy

• But this may not be the best sample for prediction

Selecting Subdata: Approach 2

• Goal: select an optimal subsample- Determinant of information matrix- Mean square error for prediction

• Select subdata carefully to optimize some criterion• Improves properties of the estimator

Approach 2: Some Methods

• Leverage-based subsampling• Shrinkage leveraging method• Unweighted leveraging estimator• Information-Based Optimal Subdata Selection (IBOSS)*

*Wang, H., Yang, M., & Stufken, J. (2017). Information-Based Optimal Subdata Selection for Big Data Linear Regression. Journal of the American Statistical Association

Approach 2 Example: IBOSS

• Goal: maximize determinant of subdata information matrix

• Some nice properties- Unbiased estimators- Variance of estimators ! 0 as n ! ∞- Computationally efficient

Approach 2 Example: IBOSS

• Drawback: assumes linear model

• With big data we may not be able to guess the underlying model

Another Possibility?

• Nonparametric approach- We don’t know the underlying model

• Goal: spread the subdata out throughout full region

Today’s Plan

1)  Consider 2 new methods- Clustering- Space-filling design

2)  Perform a simulation study to evaluate the methods

3)  Conclusions

k-means Clustering

• Divide dataset into k initial clusters• Assign each point to cluster with nearest mean• Euclidean distance

• Update means• RepeatMinimizes within cluster sum of squares

Potential Method 1: Clustering

• Cluster full data using k-means

• Choose subsample from clusters based on cluster characteristics

We consider two clustering sampling strategies

Two Possible Strategies

1)  Inversely proportional to density of cluster• Sparse cluster " sample (proportionally) more points• Dense cluster " sample (proportionally) fewer points

2)  Equal subsample size from each cluster• Take s/k points from each cluster

Both are attempts at selecting subsample uniformly from the full sample

Space Filling Designs

• Spread design points through experimental region

• Used when form of underlying model is unknown

Some Examples

• Sphere Packing Design• Uniform Design• Fast Flexible Filling Design• Latin Hypercube Design*

*McKay, M., Beckman, R., & Conover, W. (1979). Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics, 21(2), 239-245.

Potential Method 2: Design

• Construct Latin hypercube design with k points• Cluster full data around these points• Sample equally from each cluster

SCHOOL OF MATHEMATICAL AND STATISTICAL SCIENCES Abigael C. Nachtsheim 20

Simulation Study: Generate X

• One dimensional, mixture of Normals, n = 1000• Z1 ~ N(-100, 10,000)• Z2 ~ N(300, 1)• wi ~ Bernoulli(0.1)

Xi = wi*Z1 + (1 – wi)*Z2

SCHOOL OF MATHEMATICAL AND STATISTICAL SCIENCES Abigael C. Nachtsheim 21

Simulation Study: Generate Y

• E(Yi | Xi ) = -0.002222 * Xi 2• Y(Xi ) = E(Yi | Xi ) + 30*εiεi = independent standard normal errors

Simulation Study Analysis

For each of 1000 data sets with n = 1000:• Select subdata, s = 50 using each method- Simple random sample- IBOSS- Cluster with inverse proportional size, k = 5- Cluster with equal size, k = 5- Space-filling design, k = 5

Simulation Study Analysis

• Using subdata only, estimate a model- Use OLS- Fit quadratic model

• Compute integrated predicted mean squared error

Simulation Results

10% of the data is here

Simulation Results

90% of the the data is here10% of the data is here

Simulation Results

90% of the the data is here10% of the data is here

This is the true response:Y = -0.002222*X2

Simple Random Sample

Cluster: Equal Sizes

Cluster: Inverse Proportional Sizes

Space-filling Design

Full Data

Toy Example: Results

Method Predicted RMSE

59,498

Cluster: Inverse Prop.

Space-Filling Design

Cluster: Equal

Full Data

Toy Example: Results

Method Predicted RMSE

59,498

Cluster: Inverse Prop.

Space-Filling Design

Cluster: Equal

Full Data

Example with Real Data

• n = 4.2 million • p = 15• 1 continuous response• Used in the IBOSS paper

• Construct subdata of size s = 2,000• Consider 4 methods:- Simple random sample- IBOSS- Space-filling design- Cluster: Equal

• Fit two models- First-order linear model (as in IBOSS paper)- Second-order linear model

• Compute holdout predicted mean squared error

Real Data Results: First-Order Model

Method Predicted MSEIBOSS 434.56Simple random sample 0.0106Cluster: Equal 0.0118Space-filling design 0.0148

Using 2,000 observations

Using 4.2 million observations

Predicted MSE from the full data: 0.0105

Real Data Results: Second-Order Model

Method Predicted MSEIBOSS 90,545.1Simple random sample 0.0085Cluster: Equal 0.0053Space-filling design 0.0038

Real Data Results: Second-Order Model

Method Predicted MSEIBOSS 90,545.1Simple random sample 0.0085Cluster: Equal 0.0053Space-filling design 0.0038

Preliminary Conclusions

• We can spread points uniformly using clustering and space-filling methods • If goal is prediction: clustering and space-filling methods as good or better than simple random sample• Space-filling design method performs best with quadratic model

Future work

1) More extensive simulation study involving• Different sizes of k• Different underlying models

2)  Explore alternative methods to choose seed points• Fast Flexible Filling Design• Uniform random sample

3)  Nearest neighbor to seed points rather than cluster4)  Consider large sample properties

Nonparametric Importance Sampling for Big Dataasurtg/Projects/RTGSlidesNachtsheimS18.pdfReal Data...

Documents

Transcript of Nonparametric Importance Sampling for Big Dataasurtg/Projects/RTGSlidesNachtsheimS18.pdfReal Data...

MSE Quasquicentennial

iBOSS global presentationv- ENGLISH (1)

MSE(324(( CASTING(and(SOLIDIFICATION(mse324.cankaya.edu.tr/uploads/files/MSE 324_Casting_Riser Design... · MSE(324((CASTING(and(SOLIDIFICATION ... Caine’s Curve . Caine’s Equation

iboss Secure Cloud Gateway Certified Secure Report...The iboss Secure Cloud Gateway (SCG) offers the flexible, scalable and reliable security that networks should have when the perimeter

iboss, Inc.

MECHANICALLY STABILIZED EARTH RETAINING … · MSE Wall – Mechanically stabilized earth retaining wall, MSE Wall Vendor – Vendor supplying the chosen MSE wall system, MSE Panel

Embankments MSE Permanent and Temporary MSE Walls

iBoss Enterprise Deployment Guide · 2018-04-03 · Rev 2 Version 1.5: May 13, 2011 Page 8 of 55 Figure 1 - iBoss inline deployment diagram 3.1.2.2 iBoss WITH “Management” Network

MSE Magazine

Iboss Brochure

Scholarship_First List_2017-18.pdf · mse eco mse chm csf. mse che ce mse eco bsbe mse me ae cse cse mse csf. cse ... 331 332 334 335 336 337 338 339 340 341 342 34.3 -344 346 347

MSE$307$MATERIALS$ CHARACTERIZATION$IImse307.cankaya.edu.tr/uploads/files/MSE 307_Introduction...MSE 307 Materials Characterization II!""#$ %&'()*$+&,"-".$!" No Laboratory" #" Production

iBoss Enterprise Deployment Guide - · PDF fileiBoss Enterprise Deployment Guide iBoss Web Filters . Page 2 of 53 ... 3.5.3 Configuring the Active Directory Logon/Logoff scripts

MSE Facades Profile

Mse entrepreneurship

iBOSS Investor Deck & Overview

DOCTOR OF ENGINEERING Introduction · PDF fileDOCTOR OF ENGINEERING Introduction ... MSE 331 Thermodynamics of Materials 3 MSE 333 Kinetics 3 MSE 341 Physics of Solids 3 MSE 343 Epitaxial

MSE series

MSE PROJECT

MSE-09CRN1-BQ8W; MSE-09HRN1-BQ8W MSE-12CRN1-BQ8W; MSE ...manuflowhvac.com/pdf_support/MSE18CRN1/SERVICE-MANUAL-FOR … · Service manual - 1 - 1 Precaution 1.1 Safety Precaution To