Recitation for BigData Jay Gu Jan 10 HW1 preview and Java Review.


Recitation for BigData

Jay Gu, Jan 10

HW1 preview and Java Review

Outline

• HW1 preview
• Review of Java basics
• An example of gradient descent for linear regression in Java

HW1 Preview

Run on a dataset of size ~1 million.

• Warm up exercise

• Stochastic Gradient Descent for Logistic Regression

• SGD with Hashing Kernel

• Extra credit: Personalized Logistic Regression

Starter Code

– Class for parsing the input file and iterating over the dataset.

Dataset dataset = new Dataset(your_path, is_training, size);
while (dataset.hasNext()) {
    DataInstance d = dataset.next();
    … some action on d …
}

Starter Code

public class DataInstance {
    int clicks;      // number of clicks, -1 if it is testing data.
    int impressions; // number of impressions, -1 if it is testing data.

    // Feature of the session
    int depth;       // depth of the session.
    int[] query;     // List of token ids in the query field

    // Feature of the ad
    ….

    // Feature of the user
    ….
}

Starter Code

public class Weights {
    double w0;
    /*
     * query.get(123) will return the weight for the feature:
     * "token 123 in the query field".
     */
    Map<Integer, Double> query;
    Map<Integer, Double> title;
    Map<Integer, Double> keyword;
    Map<Integer, Double> description;
    double wPosition;
    double wDepth;
    double wAge;
    double wGender;
}
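As a rough illustration of why the token weights live in maps, the linear score of one instance can be computed by visiting only the tokens that instance contains. The class below is a hypothetical, trimmed-down stand-in for Weights (only w0, wDepth and query are kept), and the score method is an assumption, not part of the starter code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the starter code's Weights class.
class SparseScoreSketch {
    double w0 = 0.0;                               // intercept
    double wDepth = 0.0;                           // weight for the depth feature
    Map<Integer, Double> query = new HashMap<>();  // weight per query token id

    // Linear score w . x, touching only the tokens present in this instance.
    double score(int[] queryTokens, int depth) {
        double s = w0 + wDepth * depth;
        for (int token : queryTokens) {
            // A token with no entry in the map is treated as having weight 0.
            s += query.getOrDefault(token, 0.0);
        }
        return s;
    }

    public static void main(String[] args) {
        SparseScoreSketch w = new SparseScoreSketch();
        w.w0 = 0.5;
        w.wDepth = 0.1;
        w.query.put(123, 2.0);
        // Token 456 has no stored weight, so it contributes nothing.
        System.out.println(w.score(new int[] {123, 456}, 3));
    }
}
```

The cost of score is proportional to the number of tokens in the instance, not to the size of the feature space.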

BigData is often sparse

Be as lazy as you can …

Update only when necessary…

Avoid O(d): Sparse and lazy update

• Although the feature space d is huge, each data point only has a few tokens.
  – Only update what is changed.

• But even so, regularization should be applied to all d weights at each step.
  – Delay and batch the regularization.
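One possible way to delay and batch the regularization is sketched below. It assumes L2 regularization with a constant learning rate eta, so that every skipped step multiplies a weight by the same factor (1 - eta * lambda), and the skipped steps can be applied at once as a power of that factor. The class name, its methods, and the per-weight lastStep bookkeeping are all hypothetical; the homework may ask for a different scheme.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: sparse weights with lazily applied L2 shrinkage.
class LazyL2Weights {
    private final Map<Integer, Double> w = new HashMap<>();
    // Last step through which each weight has been brought up to date.
    private final Map<Integer, Integer> lastStep = new HashMap<>();
    private final double eta;     // constant learning rate (assumed)
    private final double shrink;  // per-step factor (1 - eta * lambda)

    LazyL2Weights(double eta, double lambda) {
        this.eta = eta;
        this.shrink = 1.0 - eta * lambda;
    }

    // Apply all shrinkage skipped so far, making w[j] current through `step`.
    double catchUp(int j, int step) {
        double value = w.getOrDefault(j, 0.0);
        int skipped = step - lastStep.getOrDefault(j, 0);
        if (skipped > 0) {
            value *= Math.pow(shrink, skipped);  // batched regularization
            w.put(j, value);
            lastStep.put(j, step);
        }
        return value;
    }

    // One regularized SGD step for feature j at step t (t = 1, 2, ...).
    void update(int j, int t, double grad) {
        double current = catchUp(j, t - 1);       // settle steps 1..t-1 first
        w.put(j, shrink * current - eta * grad);  // step t: shrink, then gradient
        lastStep.put(j, t);
    }

    public static void main(String[] args) {
        LazyL2Weights weights = new LazyL2Weights(0.1, 0.5);
        weights.update(7, 1, 1.0);   // feature 7 appears at step 1
        weights.update(7, 4, -2.0);  // ... and not again until step 4
        System.out.println(weights.catchUp(7, 5));
    }
}
```

Before computing a prediction at step t, catchUp(j, t - 1) would be called for each token j in the instance, so the gradient is computed from current weights. A final catchUp over all stored weights is needed before the model is written out.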

Java Review

• Language: Class, Object, variable, method
• Data Structure: Java Collections
  – Array
  – List: ArrayList
  – Map: HashMap

Not required but good to know: Interface, Inheritance, Access Modifier, I/O, …

Class

public class DataInstance {
    // Members or fields
    // Feature of the session
    int[] query ….
    // Feature of the ad
    int[] title …

    // Constructor
    DataInstance(String line, … ) {
        // parse the line, and set the field
    }

    // Method
    public void print() {
        System.out.println("title: ");
        for (int token : title)
            System.out.print(token + "\t");
    }
}

Object

• DataInstance data = new DataInstance();

• int clicks = data.clicks

• data.print()
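The three operations above (create an object, read a field, call a method) can be tried on a small self-contained class. TinyInstance and its tab-separated input format are invented for this sketch; the real DataInstance parses a different, richer line format.

```java
// Hypothetical, trimmed-down DataInstance for illustration only.
class TinyInstance {
    int clicks;   // field
    int[] title;  // field

    // Constructor: parses "clicks<TAB>tok,tok,..." (an invented format).
    TinyInstance(String line) {
        String[] parts = line.split("\t");
        clicks = Integer.parseInt(parts[0]);
        String[] toks = parts[1].split(",");
        title = new int[toks.length];
        for (int i = 0; i < toks.length; i++) {
            title[i] = Integer.parseInt(toks[i]);
        }
    }

    // Method: prints the title tokens separated by tabs.
    void print() {
        System.out.println("title: ");
        for (int token : title) {
            System.out.print(token + "\t");
        }
        System.out.println();
    }

    public static void main(String[] args) {
        TinyInstance data = new TinyInstance("2\t10,20,30");  // create an object
        int clicks = data.clicks;                             // read a field
        System.out.println("clicks: " + clicks);
        data.print();                                         // call a method
    }
}
```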

Collections

• Array: fixed length, most compact
  – int[] tokens
  – double[] weights

• ArrayList: grows dynamically (capacity is enlarged geometrically when full, by about 1.5x in OpenJDK)
  – ArrayList<DataInstance>

• HashMap: expected constant-time key-value lookup; grows dynamically, uses more memory
  – HashMap<K, V>
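A short sketch that uses all three collection types side by side. The class and method names are made up for illustration; only standard java.util calls are used.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CollectionsSketch {
    // Count how often each token id occurs, using a HashMap as a sparse counter.
    static Map<Integer, Integer> countTokens(int[] tokens) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int token : tokens) {
            // Insert 1 for a new key, or add 1 to the existing count.
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] tokens = {5, 7, 5};                    // array: length fixed at creation
        List<Integer> distinct = new ArrayList<>();  // list: grows as elements are added
        for (int token : tokens) {
            if (!distinct.contains(token)) {
                distinct.add(token);
            }
        }
        Map<Integer, Integer> counts = countTokens(tokens);  // map: key-value lookup
        System.out.println(distinct + " " + counts);
    }
}
```

Note that collections hold objects, so an int is boxed into an Integer when it is stored in a List or Map. This is part of why a HashMap uses more memory than a plain array.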

Variables

• "Everything" in Java is an Object
  – Except for primitive types, such as int and double

• All object variables are references (pointers) to the object

• Methods pass arguments by value; for an object variable, the value that is copied is the reference
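The consequence of "references passed by value" is easiest to see in a small experiment. In this sketch (all names are illustrative), a method can change the object its argument points to, but it cannot make the caller's variable point somewhere else.

```java
class PassByValueSketch {
    // Changes the caller's array, because both references point to the same object.
    static void mutate(int[] arr) {
        arr[0] = 99;
    }

    // Only redirects the local copy of the reference; the caller's array is untouched.
    static void reassign(int[] arr) {
        arr = new int[] {42};
    }

    // A primitive is copied, so the caller's variable never changes.
    static void bump(int x) {
        x = x + 1;
    }

    public static void main(String[] args) {
        int[] a = {1};
        int n = 1;

        reassign(a);
        System.out.println("after reassign: " + a[0]);
        mutate(a);
        System.out.println("after mutate: " + a[0]);
        bump(n);
        System.out.println("after bump: " + n);
    }
}
```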

Example: SGD for linear regression

• Demo
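The demo itself is not part of this transcript. The following is a minimal sketch of what SGD for linear regression could look like in Java, assuming squared-error loss 0.5 * (prediction - y)^2, dense features, a constant learning rate, and uniform random sampling of examples. The class name, the fit signature, and the toy data are all invented here.

```java
import java.util.Random;

class SgdLinearRegression {
    // Fits y ~ w[0] + sum_j w[j + 1] * x[j] by SGD; returns {w0, w1, ..., wd}.
    static double[] fit(double[][] xs, double[] ys, double eta, int epochs, long seed) {
        int d = xs[0].length;
        double[] w = new double[d + 1];  // all weights start at 0
        Random rng = new Random(seed);
        for (int e = 0; e < epochs; e++) {
            for (int k = 0; k < xs.length; k++) {
                int i = rng.nextInt(xs.length);  // pick one example at random
                double pred = w[0];
                for (int j = 0; j < d; j++) {
                    pred += w[j + 1] * xs[i][j];
                }
                double err = pred - ys[i];       // derivative of the loss w.r.t. pred
                w[0] -= eta * err;               // gradient step for the intercept
                for (int j = 0; j < d; j++) {
                    w[j + 1] -= eta * err * xs[i][j];
                }
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // Toy, noise-free data generated from y = 1 + 2x.
        double[][] xs = {{0.0}, {0.25}, {0.5}, {0.75}, {1.0}};
        double[] ys = {1.0, 1.5, 2.0, 2.5, 3.0};
        double[] w = fit(xs, ys, 0.1, 2000, 0L);
        System.out.println("intercept = " + w[0] + ", slope = " + w[1]);
    }
}
```

On the toy data the learned intercept and slope should approach 1 and 2. For the sparse homework data, the dense inner loops would be replaced by loops over the tokens present in each instance.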