Biostatistics 615/815 Statistical Computing (csg.sph.umich.edu/abecasis/class/2008/615.01.pdf)

Biostatistics 615/815 Statistical Computing

Gonçalo Abecasis

Course Objective

Introduce skills required for executing statistical computing projects

Applications and examples mostly in C.
• Can be easily translated into R, etc.

But the focus is on an algorithmic way of thinking!

Part I: Key Algorithms

Connectivity
Sorting
Searching
Hashing
Key data structures

Part II: Statistical Methods

Random Numbers

Markov-Chain Monte-Carlo
• Metropolis-Hastings
• Gibbs Sampling

Function Optimization
• Naïve algorithms
• Newton’s Methods
• E-M algorithm

Numerical Integration

Textbooks

Algorithms in C
• Sedgewick (1998)
• 3rd edition printed in 1998

Numerical Recipes in C
• Press, Teukolsky, Vetterling, Flannery
• 2nd edition printed in 2002

Assessment for 615

Weekly Assignments
• About 50% of the final mark

2 Exams
• About 50% of the final mark

Assessment for 815

Weekly Assignments
• About 33% of the final mark

2 Exams
• About 33% of the final mark

Project, to be completed in pairs
• About 33% of the final mark

Office Hours

Please fill out doodle poll with your availability:
• www.doodle.ch/tsuhaiaqe4ar5cer

My office:
• School of Public Health II, Crossroads Level 4

My e-mail:
• [email protected]

Algorithms

Methods for solving problems that are well suited to computer implementation

Good algorithms make apparently impossible problems become simple

Algorithms are ideas …

Focus on approach to a problem

Typically, the actual implementation could take many different forms
• Computer languages
• Pen and paper

Example: DNA Sequence Matches

When the Human Genome Project started, searching through the entire genome sequence seemed impractical…

For example,
• Searching for ~150 sequences of about 500bp each in ~3,000,000,000 bases of sequence would take ~3 hours with the original BLAST or FASTA3 algorithms

Example: DNA Sequence Matches

Mullikin and colleagues (2001) described an improved algorithm, using hash tables, that could do this in < 2 seconds

Reference:
• Ning, Cox and Mullikin (2001) Genome Research 11:1725-1729

Today’s Lecture

Introduce a “Connectivity problem” and some alternative solutions

If you haven’t done much programming before, don’t worry too much about implementation details.
• We’ll fill these in during later lectures.

The Connectivity Problem

N objects
• Integer names 0 .. N – 1

M connections between pairs of objects
• Each connection identifies a pair (p, q)

Possible questions:
• Are all objects connected?
• Are some connections redundant?
• What are the groups of connected objects?

Possible applications

Is a direct connection between two computers required in a network?
• Or can we use some existing connections instead?

Are two individuals part of the same extended family in a genetic study?

Are two genes in the same regulatory network?

Are the two points connected?

A possible approach …

Process connections one-by-one

First, check whether a connection links two previously unconnected items
• If not, proceed to the next connection
• If yes, update list of connected items

With N items, no more than N-1 updates to the list of connected items required…
• After N-1 updates, all items connected to each other!

A simple example …

Connections
• 3-4
• 4-9
• 8-0
• 2-3
• 5-6
• 2-9
• 4-8
• 0-2

A simple example …

Connections
• 3-4 √
• 4-9 √
• 8-0 √
• 2-3 √
• 5-6 √
• 2-9 Redundant: 2-3; 3-4; 4-9
• 4-8 √
• 0-2 Redundant: 0-8; 8-4; 4-3; 3-2

Specific Tasks

As we proceed through list of connections, conduct two tasks:

• Decide if each connection is new.

• Incorporate information about new connections.

The Fundamental Operations

The Find operation
• Identify the set containing a particular item or items.

The Union operation
• Replace the sets containing two groups of objects by their union

The First Step

Developing a solution that works

• Easy to verify correctness
• May not be most efficient
• Should be simple

Useful as check of “better” solutions…

Arrays of Integers

Simple data structure
• Analogous to a vector

The notation a[i] refers to the ith integer in the array
• When programming, we typically pre-specify the total number of entries in an array
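
As a minimal sketch of the array notation above, the following C fragment declares a fixed-size array and fills it so that each entry holds its own index. The names N_ITEMS and init_identity are illustrative assumptions, not part of the lecture. Note that C numbers entries from 0, so a[0] is the first integer and a[N_ITEMS - 1] is the last.

```c
#include <assert.h>

#define N_ITEMS 10   /* hypothetical size, fixed before compilation */

/* Set a[i] = i for every valid index 0 .. n-1. */
void init_identity(int a[], int n)
{
    assert(n > 0);

    for (int i = 0; i < n; i++)
        a[i] = i;
}
```

A caller would declare the storage with "int a[N_ITEMS];" and then call init_identity(a, N_ITEMS).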

Quick Find Algorithm

Data
• Array of N integers
• Objects p and q connected iff a[p] == a[q]

Setup
• Initialize a[i] = i, for 0 ≤ i < N

For each pair
• If a[p] == a[q], objects are already connected (FIND)
• Otherwise, move all entries in set a[p] to set a[q] (UNION)

A Simple C Implementation

#define N 1000

// Helper functions, defined on the next slide
bool read_connection(int &a, int &b);
void print_connection(int a, int b);

int main()
{
    int i, p, q, set, a[N];          // Variable declarations
    int unique_connections = 0;

    for (i = 0; i < N; i++)          // Data Initialization
        a[i] = i;

    while (read_connection(p, q))    // Loop through connections
    {
        if (a[p] == a[q]) continue;  // FIND operation

        set = a[p];                  // UNION operation
        for (i = 0; i < N; i++)
            if (a[i] == set)
                a[i] = a[q];

        print_connection(p, q);
        unique_connections++;
    }

    return 0;
}

A Simple C Implementation

#include <stdio.h>

// Note: the reference parameters (int &) make this C++ rather than
// strict C, so compile it with a C++ compiler.
bool read_connection(int &a, int &b)
{
    // The scanf function is a standard C function for processing input
    // data. It processes input fields (each marked with a % sign)
    // according to a type qualifier. For example, %d fields are stored
    // in integer variables.

    bool success = scanf(" %d %d ", &a, &b) == 2;

    return success;
}

void print_connection(int a, int b)
{
    // The printf function is a standard C function for processing
    // output. It processes fields (each marked with a % sign)
    // according to a type qualifier. For example, %d fields are
    // replaced with the contents of corresponding integer variables.

    printf("The connection from %d to %d is non-redundant.\n", a, b);
}

Pictorial Representation

Array as connections are added:

• 3-4
• 4-9
• 8-0
• 2-3
• 2-9 * Redundant *
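
The array contents for this example can be reproduced with a short sketch. It assumes 10 objects named 0 to 9; the names N_OBJECTS, quick_find_add and quick_find_demo are hypothetical helpers introduced here for illustration, not functions from the lecture.

```c
#include <assert.h>

#define N_OBJECTS 10   /* assumption: the example uses objects 0..9 */

/* Quick Find update for one pair.
   Returns 1 if the connection was new, 0 if it was redundant. */
int quick_find_add(int a[], int n, int p, int q)
{
    assert(p >= 0 && p < n && q >= 0 && q < n);

    if (a[p] == a[q]) return 0;          /* FIND: already in the same set */

    int set = a[p], target = a[q];       /* UNION: relabel the set of p   */
    for (int i = 0; i < n; i++)
        if (a[i] == set)
            a[i] = target;
    return 1;
}

/* Replays the five connections listed above; returns how many were new. */
int quick_find_demo(int a[N_OBJECTS])
{
    static const int pairs[5][2] = { {3, 4}, {4, 9}, {8, 0}, {2, 3}, {2, 9} };
    int added = 0;

    for (int i = 0; i < N_OBJECTS; i++)
        a[i] = i;
    for (int k = 0; k < 5; k++)
        added += quick_find_add(a, N_OBJECTS, pairs[k][0], pairs[k][1]);
    return added;
}
```

Stepping through the updates by hand gives the following array contents (index 0 on the left):

    start      0 1 2 3 4 5 6 7 8 9
    after 3-4  0 1 2 4 4 5 6 7 8 9
    after 4-9  0 1 2 9 9 5 6 7 8 9
    after 8-0  0 1 2 9 9 5 6 7 0 9
    after 2-3  0 1 9 9 9 5 6 7 0 9
    after 2-9  unchanged, because a[2] == a[9]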

How efficient is Quick Find?

If there are N objects and M connections*, the Quick Find algorithm requires on the order of MN operations

Not feasible for very large numbers of objects…

* In this case only non-redundant connections actually count
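
To make the MN estimate concrete, here is a rough sketch that counts how many array entries the UNION loop examines. It assumes the input pairs are 0-1, 1-2, 2-3 and so on, so that every connection is non-redundant; the function name quick_find_scan_count and the 1000-object limit are illustrative choices.

```c
#include <assert.h>

/* Runs Quick Find on the n-1 pairs (0,1), (1,2), ..., (n-2,n-1) and
   returns the number of array entries examined by the UNION loops. */
long quick_find_scan_count(int n)
{
    assert(n > 0 && n <= 1000);

    int a[1000];
    long scanned = 0;

    for (int i = 0; i < n; i++)
        a[i] = i;

    for (int p = 0; p + 1 < n; p++) {
        int q = p + 1;
        if (a[p] == a[q]) continue;      /* FIND */

        int set = a[p], target = a[q];   /* UNION scans the whole array */
        for (int i = 0; i < n; i++) {
            scanned++;
            if (a[i] == set)
                a[i] = target;
        }
    }
    return scanned;
}
```

With N = 1000 objects and M = 999 non-redundant connections, each union scans all 1000 entries, so the count is 999 × 1000 = 999,000.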

Quick-Union Algorithm I

Complementary to Quick Find

More complex data organization

• Each object points to “parent” object in the same set

Quick-Union Algorithm II

For each pair
• Follow pointers until we reach object that points to itself

• If a[p] and a[q] eventually lead to the same object, we are in the same set (FIND)

• Otherwise, link the object to which a[p] leads to the object to which a[q] leads (UNION)

C Implementation

// Loop through connections
while (read_connection(p, q))
{
    // Check that input is within bounds
    if (p < 0 || p >= N || q < 0 || q >= N) continue;

    // FIND operation
    i = a[p];
    while (a[i] != i)
        i = a[i];

    j = a[q];
    while (a[j] != j)
        j = a[j];

    if (i == j) continue;

    // UNION operation
    a[i] = j;

    print_connection(p, q);
    unique_connections++;
}

Pictorial Representation

Array as connections are added:

• 3-4
• 4-9
• 8-0
• 2-3
• 2-9 * Redundant *
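
The parent array for this example can be traced with the sketch below. It assumes 10 objects named 0 to 9; find_root, quick_union_add and quick_union_demo are hypothetical helper names used only for this illustration.

```c
#include <assert.h>

#define N_OBJECTS 10   /* assumption: the example uses objects 0..9 */

/* Follow parent pointers from x until reaching an object that points to itself. */
int find_root(const int a[], int x)
{
    while (a[x] != x)
        x = a[x];
    return x;
}

/* Quick Union update for one pair.
   Returns 1 if the connection was new, 0 if it was redundant. */
int quick_union_add(int a[], int n, int p, int q)
{
    assert(p >= 0 && p < n && q >= 0 && q < n);

    int i = find_root(a, p);             /* FIND */
    int j = find_root(a, q);
    if (i == j) return 0;

    a[i] = j;                            /* UNION: one pointer changes */
    return 1;
}

/* Replays the five connections listed above; returns how many were new. */
int quick_union_demo(int a[N_OBJECTS])
{
    static const int pairs[5][2] = { {3, 4}, {4, 9}, {8, 0}, {2, 3}, {2, 9} };
    int added = 0;

    for (int i = 0; i < N_OBJECTS; i++)
        a[i] = i;
    for (int k = 0; k < 5; k++)
        added += quick_union_add(a, N_OBJECTS, pairs[k][0], pairs[k][1]);
    return added;
}
```

Stepping through by hand, only one entry changes per new connection (index 0 on the left):

    start      0 1 2 3 4 5 6 7 8 9
    after 3-4  0 1 2 4 4 5 6 7 8 9
    after 4-9  0 1 2 4 9 5 6 7 8 9
    after 8-0  0 1 2 4 9 5 6 7 0 9
    after 2-3  0 1 9 4 9 5 6 7 0 9
    after 2-9  unchanged, because 2 and 9 both lead to 9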

How efficient is Quick Union?

Quick Union is typically faster than Quick Find.

However, the data can conspire to make things difficult:
• If objects are paired 1-2; 2-3; 3-4; 4-5; … we’ll build long chains which slow down FIND operations

In the worst case, we might even need more than MN operations
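
The long-chain behaviour can be seen in a small sketch. It assumes objects are numbered from 0, so the pairs are 0-1, 1-2, 2-3 and so on; chain_length, worst_case_chain and MAX_N are hypothetical names chosen for this illustration.

```c
#include <assert.h>

#define MAX_N 1000   /* hypothetical upper limit on the number of objects */

/* Number of pointer hops from x to the root of its set. */
int chain_length(const int a[], int x)
{
    int hops = 0;
    while (a[x] != x) {
        x = a[x];
        hops++;
    }
    return hops;
}

/* Builds sets with plain Quick Union from the pairs 0-1, 1-2, ...,
   (n-2)-(n-1) and returns the chain length seen from object 0. */
int worst_case_chain(int n)
{
    assert(n > 0 && n <= MAX_N);

    int a[MAX_N];
    for (int i = 0; i < n; i++)
        a[i] = i;

    for (int p = 0; p + 1 < n; p++) {
        int i = p, j = p + 1;
        while (a[i] != i) i = a[i];      /* FIND  */
        while (a[j] != j) j = a[j];
        if (i != j) a[i] = j;            /* UNION */
    }
    return chain_length(a, 0);
}
```

With this input every object ends up pointing at its successor, so a later FIND that starts at object 0 must follow N-1 pointers (999 hops for N = 1000).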

Weighted Quick Union

A smarter version of Quick Union that avoids long chains

Keep track of the number of elements in each set (using a separate array)

Link smaller set to larger set
• Union increases length of chains in smaller set by 1

C Implementation

// Initialize weights
for (i = 0; i < N; i++)
    weight[i] = 1;

// Loop through connections
while (read_connection(p, q))
{
    // Check that input is within bounds
    if (p < 0 || p >= N || q < 0 || q >= N) continue;

    // FIND operation
    for (i = a[p]; a[i] != i; i = a[i]) ;
    for (j = a[q]; a[j] != j; j = a[j]) ;
    if (i == j) continue;

    // UNION operation
    if (weight[i] < weight[j])
        { a[i] = j; weight[j] += weight[i]; }
    else
        { a[j] = i; weight[i] += weight[j]; }

    print_connection(p, q);
    unique_connections++;
}

Pictorial Representation

Array as connections are added:

• 3-4
• 4-9
• 8-0
• 2-3
• 2-9 * Redundant *
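
The same example can be traced for Weighted Quick Union with the sketch below. It assumes 10 objects named 0 to 9 and uses the same tie rule as the loop in the C Implementation slide (when the weights are equal, the root reached from q is linked under the root reached from p); weighted_add and weighted_demo are hypothetical names.

```c
#include <assert.h>

#define N_OBJECTS 10   /* assumption: the example uses objects 0..9 */

/* Weighted Quick Union update for one pair.
   Returns 1 if the connection was new, 0 if it was redundant. */
int weighted_add(int a[], int weight[], int n, int p, int q)
{
    assert(p >= 0 && p < n && q >= 0 && q < n);

    int i = p, j = q;
    while (a[i] != i) i = a[i];          /* FIND */
    while (a[j] != j) j = a[j];
    if (i == j) return 0;

    if (weight[i] < weight[j]) {         /* UNION: smaller set under larger */
        a[i] = j;
        weight[j] += weight[i];
    } else {
        a[j] = i;
        weight[i] += weight[j];
    }
    return 1;
}

/* Replays the five connections listed above; returns how many were new. */
int weighted_demo(int a[N_OBJECTS], int weight[N_OBJECTS])
{
    static const int pairs[5][2] = { {3, 4}, {4, 9}, {8, 0}, {2, 3}, {2, 9} };
    int added = 0;

    for (int i = 0; i < N_OBJECTS; i++) {
        a[i] = i;
        weight[i] = 1;
    }
    for (int k = 0; k < 5; k++)
        added += weighted_add(a, weight, N_OBJECTS, pairs[k][0], pairs[k][1]);
    return added;
}
```

Stepping through by hand gives these parent arrays (index 0 on the left):

    start      0 1 2 3 4 5 6 7 8 9
    after 3-4  0 1 2 3 3 5 6 7 8 9
    after 4-9  0 1 2 3 3 5 6 7 8 3
    after 8-0  8 1 2 3 3 5 6 7 8 3
    after 2-3  8 1 3 3 3 5 6 7 8 3
    after 2-9  unchanged, because 2 and 9 both lead to 3

Every object in the set {2, 3, 4, 9} is now at most one hop from its root, whereas plain Quick Union needs two hops to get from 3 to its root on the same input.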

Efficiency of Weighted Quick Union

Guarantees that pointer chains are no more than log2 N elements long

Overall, requires about M log2 N operations

Suitable for very large data sets with millions of objects and connections
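
The log2 N bound on chain length can be checked empirically with a sketch like the one below. It assumes N is a power of two and merges sets in "tournament" order (neighbours first, then pairs of pairs, and so on), which always joins equal-sized sets and is therefore the hardest case for the weighting rule. The names weighted_union, longest_chain_tournament and MAX_N are illustrative assumptions.

```c
#include <assert.h>

#define MAX_N 1024   /* hypothetical upper limit on the number of objects */

/* Weighted union of the sets containing p and q. */
void weighted_union(int a[], int weight[], int p, int q)
{
    int i = p, j = q;
    while (a[i] != i) i = a[i];
    while (a[j] != j) j = a[j];
    if (i == j) return;

    if (weight[i] < weight[j]) { a[i] = j; weight[j] += weight[i]; }
    else                       { a[j] = i; weight[i] += weight[j]; }
}

/* Merges n objects in tournament order and returns the longest
   pointer chain (in hops) found afterwards. */
int longest_chain_tournament(int n)
{
    assert(n > 0 && n <= MAX_N);

    int a[MAX_N], weight[MAX_N];
    for (int i = 0; i < n; i++) {
        a[i] = i;
        weight[i] = 1;
    }

    for (int stride = 1; stride < n; stride *= 2)
        for (int p = 0; p + stride < n; p += 2 * stride)
            weighted_union(a, weight, p, p + stride);

    int longest = 0;
    for (int x = 0; x < n; x++) {
        int hops = 0;
        for (int y = x; a[y] != y; y = a[y])
            hops++;
        if (hops > longest)
            longest = hops;
    }
    return longest;
}
```

For N = 16 the longest chain is 4 hops and for N = 1024 it is 10 hops, matching log2 N in both cases; other input orders give chains that are no longer than this.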

Pictorial Comparison

[Figure panels: Quick Find, Quick Union, Weighted]

Empirical Timings in Seconds

Nodes (Connections)    Quick Find    Quick Union    Weighted Quick Union
50,000 (50,000)        6             1              <1
100,000 (100,000)      12            4              <1
200,000 (200,000)      25            15             <1

Summary

Considered 3 alternative solutions to the “connectivity problem”
• Are any connections in a set redundant?
• Are all objects in a set connected?

Compared some of the computational costs of the different methods

Reading Material

Read Chapter 1 of Sedgewick

www.sph.umich.edu/csg/abecasis/class/