Employing Grid Technology for Data Analysis Contact :


Page 1: Employing Grid Technology for Data Analysis Contact :

Employing Grid Technology for Data Analysis

Contact: www.business.duq.edu/faculty/davies

Page 2: Employing Grid Technology for Data Analysis Contact :

Computation Solutions

Traditional High-Performance Computer (HPC)
Pro: Node-node communication
Con: 20x to 200x cost of other solutions

Traditional Cluster Computer
Pro: Less expensive to maintain & upgrade
Con: Requires significant infrastructure

Internet Grid Computer
Pro: Massive power on demand
Con: Less adequate for massive data

Enterprise Grid Computer
Pro: Harness existing infrastructure
Con: Limited power

Page 3: Employing Grid Technology for Data Analysis Contact :

Total Annual Spending on HPC

[Figure: annual spending in millions of dollars (vertical axis $0 to $18,000), 1995 to 2003. Series: Purchases of New HPC; Spending on Existing HPC.]

Page 4: Employing Grid Technology for Data Analysis Contact :

Price per GF for HPC-Generated Computation

Worldwide (excluding Departmental class)

Installed base of HPC: 400,000 GF*
Annual cost of installed base: $16.2 billion
Average annual cost per GF: $41,000

*1 gigaflops (GF) = 1 billion calculations per second ~ 5 GHz
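The average cost per GF follows from dividing the annual cost of the installed base by its size. A minimal check of that arithmetic, using only the two figures quoted on this slide (the variable names are illustrative):

```python
# Figures quoted on the slide.
installed_base_gf = 400_000   # installed base of HPC, in GF
annual_cost_usd = 16.2e9      # annual cost of the installed base, in dollars

# Average annual cost per GF.
cost_per_gf = annual_cost_usd / installed_base_gf
print(f"${cost_per_gf:,.0f} per GF per year")
```

The quotient is $40,500, which the slide reports as roughly $41,000.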

Page 5: Employing Grid Technology for Data Analysis Contact :

Cluster Costs

20-Node Cluster Computer

Purchase price per node: $1,100
Effective life: 3 years
Weight (including racks): 1 ton
Power consumption: 8 kilowatts
Required air conditioning: 1 ton
Required space: 16 square feet
Computational power: 12 GF

Page 6: Employing Grid Technology for Data Analysis Contact :

Price per GF for Cluster-Generated Computation

Annual Cost of a Cluster (per GF)

$763 / year for nodes (amortized purchase price)
$275 / year for additional hardware (amortized)
$69 / year for software (amortized)
$69 / year for hardware service contracts
$466 / year for electricity
$33 / year for space
$125 / year for installation and configuration (amortized)
$533 / year for labor

Total = $2,300 per GF per year

Annual cost of a cluster > 125% of the (unamortized) purchase price of the nodes
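The total and the 125% claim can be checked from the itemized figures. The sketch below sums the eight per-GF components and then compares the annual cost of the whole cluster with the purchase price of its nodes. Scaling the per-GF total by the 12 GF and pricing 20 nodes at $1,100 each (both from the Cluster Costs slide) is an assumption about how the 125% figure was derived; the slides do not show that step.

```python
# Annual cost components per GF, as itemized on the slide (dollars).
annual_cost_per_gf = {
    "nodes (amortized purchase price)": 763,
    "additional hardware (amortized)": 275,
    "software (amortized)": 69,
    "hardware service contracts": 69,
    "electricity": 466,
    "space": 33,
    "installation and configuration (amortized)": 125,
    "labor": 533,
}
total_per_gf = sum(annual_cost_per_gf.values())

# Cluster Costs slide: 20 nodes at $1,100 each, delivering 12 GF.
# Assumption: the 125% claim compares these two quantities.
purchase_price = 20 * 1_100
annual_cluster_cost = total_per_gf * 12
ratio = annual_cluster_cost / purchase_price

print(f"Total per GF per year: ${total_per_gf:,}")
print(f"Annual cost / node purchase price: {ratio:.0%}")
```

The components sum to $2,333, which the slide rounds to $2,300, and under the stated assumption the annual cost comes to about 127% of the node purchase price.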

Page 7: Employing Grid Technology for Data Analysis Contact :

Price Comparison

Annual Cost per GF

Traditional HPC: $41,000
Traditional Cluster: $2,300
Internet Grid: $300*

*Assumes ½ availability at $100 per year.
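To put the three figures side by side, the short sketch below expresses the HPC cost as a multiple of each alternative. It uses only the numbers on this slide; the dictionary layout is illustrative.

```python
# Annual cost per GF, as listed on the slide (dollars).
annual_cost_per_gf = {
    "Traditional HPC": 41_000,
    "Traditional Cluster": 2_300,
    "Internet Grid": 300,
}

hpc = annual_cost_per_gf["Traditional HPC"]
# How many times more expensive HPC is than each alternative.
ratios = {
    name: hpc / cost
    for name, cost in annual_cost_per_gf.items()
    if name != "Traditional HPC"
}
for name, ratio in ratios.items():
    print(f"HPC costs about {ratio:.0f}x as much as {name}")
```

The multiples come out near 18x and 137x, close to the range of 20x to 200x quoted on the Computation Solutions slide.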

What can a researcher do with cheap computation?

Page 8: Employing Grid Technology for Data Analysis Contact :

Exhaustive Regression
(in the literature, “all subsets regression”)

Goal: Examine all combinations of factors that have a significant effect on an outcome variable. Evaluate each combination on its ability to predict the outcome variable.

Scope: With K factors, there are 2^K possible factor combinations.

There are statistical issues associated with performing data searches in this manner. But, in the absence of a theoretical model, the alternative is to do nothing.

Page 9: Employing Grid Technology for Data Analysis Contact :

Factor Combinations

Example: Examine all combinations of three factors that might predict presence of natural gas.

Potential factors:
• Rock Pyrolysis
• Organic Mass Spectrometry
• Vitrinite Reflectance

Outcome variable: Presence of Natural Gas

Page 10: Employing Grid Technology for Data Analysis Contact :

Factor Combinations

Example: Examine all combinations of three factors that might predict presence of natural gas.

Combination #1: Rock Pyrolysis, Organic Mass Spectrometry, Vitrinite Reflectance
Combination #2: Rock Pyrolysis, Organic Mass Spectrometry
Combination #3: Rock Pyrolysis, Vitrinite Reflectance
Combination #4: Organic Mass Spectrometry, Vitrinite Reflectance
Combination #5: Rock Pyrolysis
Combination #6: Organic Mass Spectrometry
Combination #7: Vitrinite Reflectance
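The seven combinations can be generated mechanically. This sketch enumerates every non-empty subset of the three factors with the standard library's `itertools.combinations`, largest subsets first so that the numbering matches the slide; the variable names are illustrative.

```python
from itertools import combinations

factors = ["Rock Pyrolysis", "Organic Mass Spectrometry", "Vitrinite Reflectance"]

# All non-empty subsets, from the full set down to single factors.
models = [
    combo
    for size in range(len(factors), 0, -1)
    for combo in combinations(factors, size)
]

for number, combo in enumerate(models, start=1):
    print(f"Combination #{number}: {', '.join(combo)}")
```

With three factors this yields 2^3 - 1 = 7 models, since the empty combination is left out.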

Page 11: Employing Grid Technology for Data Analysis Contact :

Factor Combinations

As the number of possible factors grows, the number of models in the search space rises exponentially.

[Figure: Number of Models in the Search Space (vertical axis 0 to 1,200,000,000) versus Number of Factors (25 to 30).]
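The growth plotted on this slide is simply 2^K. A small sketch that tabulates the count for the 25 to 30 factors shown on the horizontal axis:

```python
# Number of models in the search space for each factor count on the chart.
counts = {k: 2 ** k for k in range(25, 31)}

for k, n_models in counts.items():
    print(f"{k} factors: {n_models:,} models")
```

Each additional factor doubles the search space, from about 33.6 million models at 25 factors to about 1.07 billion at 30.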

Page 12: Employing Grid Technology for Data Analysis Contact :

Factor Combinations

Time required to exhaust all factor combinations with 40 factors when a single PC can compute 1,000 models per second. 40 factors implies over 1 trillion possible models.

Procedure   One PC              10,000-node Grid   100,000-node Grid
OLS ER      35 years            2 days             5 hours
LOGIT ER    Several centuries   10 days            1 day

Typically, researchers would use “stepwise procedures” to avoid having to compute all 1 trillion models.
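The single-PC figure for OLS can be reproduced from the stated rate of 1,000 models per second. The sketch below also computes an idealized grid time by dividing the work evenly across nodes; perfect linear scaling is an assumption made here for illustration, not something the slide states.

```python
MODELS_PER_SECOND = 1_000            # single-PC rate stated on the slide
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY

models = 2 ** 40                     # all combinations of 40 factors
one_pc_seconds = models / MODELS_PER_SECOND
one_pc_years = one_pc_seconds / SECONDS_PER_YEAR

def ideal_grid_seconds(nodes):
    """Time on a grid, assuming the work divides perfectly across nodes."""
    return one_pc_seconds / nodes

print(f"Models: {models:,}")
print(f"One PC: {one_pc_years:.1f} years")
print(f"10,000 nodes (ideal): {ideal_grid_seconds(10_000) / SECONDS_PER_DAY:.1f} days")
print(f"100,000 nodes (ideal): {ideal_grid_seconds(100_000) / 3_600:.1f} hours")
```

One PC needs just under 35 years, matching the table. The idealized grid times come out somewhat below the 2 days and 5 hours in the table; scheduling and communication overhead is one possible reason, though the slide does not say.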

Page 13: Employing Grid Technology for Data Analysis Contact :

Stepwise Procedures

Each square represents one combination of factors (a “model”). The 144 squares shown here correspond (approximately) to all the possible models that can be constructed using just 7 factors.

[Figure: Search Space. A grid of squares shaded by Model Quality: Bad, Poor, Good, Better, Best.]

Page 14: Employing Grid Technology for Data Analysis Contact :

Stepwise Procedures

Stepwise methods pick a single model as a starting point and follow an “improvement path” to a local optimum.

[Figure: Search Space, shaded by Model Quality (Bad, Poor, Good, Better, Best). Starting at the square marked x, stepwise finds the model marked *.]

Page 15: Employing Grid Technology for Data Analysis Contact :

Stepwise Procedures

In this example, depending on where stepwise begins its search, stepwise could return any one of these four models.

[Figure: Search Space, shaded by Model Quality (Bad, Poor, Good, Better, Best). Four starting points, each marked x, lead to four different models, each marked *.]
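The dependence on the starting point can be illustrated with a toy search. The sketch below is a simplified greedy procedure, not the stepwise routine of any particular statistics package: at each step it adds or drops the single factor that most improves a score, and it stops when no such move helps. The scores are invented purely for illustration.

```python
# Hypothetical quality scores for every model built from 3 factors.
# Each model is a bitmask: bit i set means factor i is included.
SCORE = {
    0b000: 0, 0b001: 5, 0b010: 1, 0b100: 2,
    0b011: 2, 0b101: 3, 0b110: 6, 0b111: 4,
}
NUM_FACTORS = 3

def stepwise(start):
    """Greedy search: move to the best one-factor change until none improves."""
    current = start
    while True:
        # Neighbors differ from the current model by exactly one factor.
        neighbors = [current ^ (1 << i) for i in range(NUM_FACTORS)]
        best = max(neighbors, key=SCORE.get)
        if SCORE[best] <= SCORE[current]:
            return current  # local optimum reached
        current = best

from_empty = stepwise(0b000)   # start with no factors
from_full = stepwise(0b111)    # start with all factors
global_best = max(SCORE, key=SCORE.get)

print(f"from empty model: {from_empty:03b}")
print(f"from full model:  {from_full:03b}")
print(f"global optimum:   {global_best:03b}")
```

With these made-up scores, the search that starts from the empty model stops at a local optimum, while the one that starts from the full model reaches the global optimum. Scoring every model, as exhaustive regression does, removes that dependence.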

Page 16: Employing Grid Technology for Data Analysis Contact :

Stepwise Procedures

Stepwise methods would not reveal:

1. That there are four locally optimal models.

2. That there are five models that are as good as the local optima but are not themselves locally optimal.

3. That there are nine models that are ranked “Good” or “Best.”

4. Commonalities among the more preferred models.

5. Commonalities among the less preferred models.

[Figure: Search Space.]

Page 17: Employing Grid Technology for Data Analysis Contact :

Exhaustive Regression

Exhaustive regression looks at all the models in the search space (either within an OLS or LOGIT framework) and:

1. Returns results from all models, or

2. Returns only results from models that contain no insignificant parameter estimates, and/or

3. Returns only models that satisfy a specified minimum goodness of fit.
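A minimal sketch of the OLS case may help make the procedure concrete. It fits every non-empty factor combination by ordinary least squares and keeps the models that meet a minimum goodness of fit (option 3), using R^2 as the fit measure. The function names, the choice of R^2, and the synthetic data are all assumptions made for illustration; this is not the presenter's implementation, and the significance filter of option 2 is not shown. It uses only the standard library, so the linear solve is written out by hand.

```python
import random
from itertools import combinations

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols_r2(rows, y, cols):
    """Regress y on an intercept plus the chosen columns; return (beta, R^2)."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    p = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    mean_y = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - rss / tss

def exhaustive_regression(rows, y, min_r2=0.0):
    """Fit every non-empty factor combination; keep those with R^2 >= min_r2."""
    k = len(rows[0])
    results = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            beta, r2 = ols_r2(rows, y, cols)
            if r2 >= min_r2:
                results.append((cols, r2, beta))
    return sorted(results, key=lambda r: -r[1])

# Synthetic data (hypothetical): the outcome depends on factors 0 and 2 only.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(60)]
y = [2 * r[0] - r[2] + rng.gauss(0, 0.1) for r in rows]

all_models = exhaustive_regression(rows, y)
good_models = exhaustive_regression(rows, y, min_r2=0.99)
for cols, r2, _ in good_models:
    print(cols, round(r2, 4))
```

With three candidate factors the search covers seven models, and only the combinations that contain both factors 0 and 2 pass the 0.99 threshold. On a grid, the loop over combinations is the part that would be split across nodes, since each model is fitted independently.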

Page 18: Employing Grid Technology for Data Analysis Contact :

Exhaustive Regression

Models can be evaluated via:

1. Multiple correlation,
2. k-Fold cross-validation mean squared prediction error, or
3. Other methods

Sample output. Each row corresponds to one of the 2^K models, and the Factor List column lists the factors that appear in each model.

Factor List         N    UnRestr ln L  Restr ln L  MSPE     Parameter Estimates       Standard Errors
[1,2,15,17,19]      316  -78.402       -89.839     0.07390  1.2763                    0.4347
[1,15,16,24,26]     315  -79.193       -89.753     0.07491  -0.0148                   0.0046
[2,15,17,19]        316  -78.948       -89.839     0.07520  1.2664                    0.4372
[1,3,11,15,16]      313  -76.069       -89.580     0.07520  -0.0143, 1.1939           0.0046, 0.4290
[1,2,3,16,24,26]    315  -77.098       -89.753     0.07522  -0.0145, 1.1939, 1.3143   0.0046, 0.4290, 0.4201

The Parameter Estimates columns are X[1], X[2], X[3] and the Standard Errors columns are STDEV(X[1]), STDEV(X[2]), STDEV(X[3]). Values are listed in the order reported; their alignment under the individual columns is not shown here.
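As an illustration of the second criterion, the sketch below computes a k-fold cross-validation mean squared prediction error (MSPE) for a one-factor linear model. The fold assignment (every k-th observation), the single-factor model, and the toy data are assumptions chosen to keep the example short; the slides do not describe how the folds were formed.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single factor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def kfold_mspe(xs, ys, k=5):
    """Mean squared prediction error over k held-out folds."""
    n = len(xs)
    squared_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))          # every k-th observation
        train = [(xs[i], ys[i]) for i in range(n) if i not in test_idx]
        a, b = fit_line([x for x, _ in train], [y for _, y in train])
        # Predict only the observations that were held out of the fit.
        squared_errors += [(ys[i] - (a + b * xs[i])) ** 2 for i in test_idx]
    return sum(squared_errors) / n

# Toy data (hypothetical): an exact line, and the same line with +/-1 noise.
xs = list(range(20))
exact = [3 + 2 * x for x in xs]
noisy = [3 + 2 * x + (1 if x % 2 == 0 else -1) for x in xs]

mspe_exact = kfold_mspe(xs, exact)
mspe_noisy = kfold_mspe(xs, noisy)
print(mspe_exact, mspe_noisy)
```

A model that predicts held-out observations well has a low MSPE, so models can be ranked by this value, as in the MSPE column of the sample output.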

Page 19: Employing Grid Technology for Data Analysis Contact :

Exhaustive Regression

Proposed “other method”: Cross-model stability measure

Assuming:

1. The list of potential factors does not exclude any factors that determine the outcome variable, and

2. The pair-wise between-factor correlations are randomly distributed…

…the expected values, across models, of parameter estimates will equal the values of the parameters.

Page 20: Employing Grid Technology for Data Analysis Contact :

Exhaustive Regression

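For included factors ($X_1$), excluded factors ($X_3$), and extraneous factors ($X_2$), the expected value of the mean, across models, of the parameter estimate vector is:

$$\frac{1}{2^K - 1} \sum_{k=1}^{2^K - 1} \left[ \beta_{1|k} + \left( X_{1|k}' \left( I - P_{2|k} \right) X_{1|k} \right)^{-1} X_{1|k}' \left( I - P_{2|k} \right) X_{3|k} \, \beta_{3|k} \right]$$

where $P_{2|k} = X_{2|k} \left( X_{2|k}' X_{2|k} \right)^{-1} X_{2|k}'$ and $k$ denotes the $k$th factor combination.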

Page 21: Employing Grid Technology for Data Analysis Contact :

Exhaustive Regression

In preliminary Monte-Carlo experiments in which there are three “true” factors (among a set of up to 12 factors), the cross-model procedure correctly identifies:

1. All three of the “true” factors 85% of the time, and

2. Two of the three “true” factors 100% of the time.

(moderate-low correlated data sets; average “true” R^2 = 0.43)

Page 22: Employing Grid Technology for Data Analysis Contact :

Exhaustive Regression: Case Study

Case Study: Amarillo Biosciences

Problem: Amarillo Biosciences collected patient data from a phase II clinical study. Repeated statistical analyses of their experimental drug yielded no conclusive evidence for or against the drug's efficacy.

With patient data comprising 36 factors, there were almost 69 billion possible ways to model the data.

Page 23: Employing Grid Technology for Data Analysis Contact :

Exhaustive Regression: Case Study

Case Study: Amarillo Biosciences

Solution: Looking at all 69 billion models, Exhaustive Regression revealed…

• 250 models in which all factors were statistically significant,

• 8 models that were superior (by stepwise criteria) to the single model found by stepwise methods,

• 15 factors that were more stable (w.r.t. the cross-model criterion) than were the factors that stepwise methods selected,

• 42 factors that did not appear in any of the 250 significant models.
