Hands on Classification with Learning Based Java Gourab Kundu Adapted from a talk by Vivek Srikumar.

Page 1

Hands on Classification with Learning Based Java

Gourab Kundu

Adapted from a talk by Vivek Srikumar

Page 2

Goals of this tutorial

At the end of these lectures, you will be able to

1. Get started with Learning Based Java

2. Use a generic, black box text classifier for different applications

…and write your own text classifier, if needed

3. Understand how features can impact the classifier performance

… and add features to improve your application

4. Build a badge classifier based on character features

Page 3

A Quick Recap

Given: Examples (x, f(x)) of some unknown function f

Find: A good approximation of f

x provides some representation of the input

• The process of mapping a domain element into a representation is called Feature Extraction. (Hard; ill-understood; important)

• x ∈ {0,1}^n or x ∈ R^n

The target function (label)

• f(x) ∈ {-1,+1}: Binary classification

• f(x) ∈ {1,2,3,…,k-1}: Multi-class classification
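To make the representation x ∈ {0,1}^n concrete, here is a minimal Java sketch of one possible feature extractor: a binary bag of words. The class name `BagOfWords`, the method `extract`, and the four-word vocabulary are illustrative assumptions, not part of the tutorial code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Sketch: maps a document to a binary bag-of-words vector x in {0,1}^n. */
class BagOfWords {
    /** Returns x where x[i] = 1 if vocabulary word i occurs in the text, else 0. */
    static int[] extract(String text, List<String> vocabulary) {
        // Lowercase and split on whitespace; a real tokenizer would be more careful.
        List<String> tokens =
            Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
        int[] x = new int[vocabulary.size()];
        for (int i = 0; i < x.length; i++) {
            x[i] = tokens.contains(vocabulary.get(i)) ? 1 : 0;
        }
        return x;
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary, chosen only for illustration.
        List<String> vocabulary = Arrays.asList("save", "software", "meeting", "touch");
        int[] x = extract("save over 70 % on name brand software", vocabulary);
        System.out.println(Arrays.toString(x)); // prints [1, 1, 0, 0]
    }
}
```

Here n is the vocabulary size, and each coordinate only records whether a word is present, not how often.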

Page 4

What is text classification?

[Figure: a document goes into a classifier (black box), which outputs some labels (✓ ✗ ✗ ✗)]

Page 5

Several applications fit this framework

Spam detection

Sentiment classification

What else can you do, if you had such a black box system that can classify text?

Try to spend 30 seconds brainstorming

Page 6

Outline of this session

Getting started with LBJ

Writing our first classifier: Spam/Ham

Playing with features

Looking inside the black box classifier for feature weights

Page 7

LEARNING BASED JAVA

Writing classifiers

Page 8

What is Learning Based Java?

A modeling language for learning and inference

Supports

• Programming using learned models

• High level specification of features and constraints between classifiers

• Inference with constraints

• Different learning algorithms

The learning operator

• Classifiers are functions defined in terms of data

• Learning happens at compile time

Page 9

What does LBJ do for you?

Abstracts away the feature representation, learning and inference

Allows you to write learning based programs

Application developers can reason about the application at hand

Page 10

Demo

A learning based program

First, we will write an application that assumes the existence of a black box classifier

Page 11

SPAM DETECTION

Page 12

Spam detection

Which of these (if any) are email spam?

Subject: save over 70 % on name brand software

ppharmacy devote fink tungstate brown lexicon pawnshop crescent railroad distaff cytosine barium cain application elegy donnelly hydrochloride common embargo shakespearean bassett trustee nucleolus chicano narbonne telltale tagging swirly lank delphinus bragging bravery cornea asiatic susanne

Subject: please keep in touch

just like to say that it has been great meeting and working with you all . i will be leaving enron effective july 5 th to do investment banking in hong kong . i will initially be based in new york and will be moving to hong kong after a few months . do contact me when you are in the vicinity .

How do you know?

Page 13

What do we need to build a classifier?

1. Annotated documents*

2. A feature representation of the documents

3. A learning algorithm

* Here we are dealing with supervised learning

Page 14

Our first LBJ program

/** A learned text classifier; its definition comes from data. */
discrete TextClassifier(Document d) <-
learn TextLabel
  using WordFeatures
  from new DocumentReader("data/spam/train")
  with SparseAveragedPerceptron {
    learningRate = 0.1;
    thickness = 3.5;
  }
  5 rounds
  testFrom new DocumentReader("data/spam/test")
end

Defines a classifier

The object being classified

The function being learned

The feature representation

The source of the training data

The learning algorithm
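To give a feel for what `learningRate`, `thickness`, and the number of rounds control, here is a minimal Java sketch of a sparse perceptron with a thick separator. It is a simplified stand-in written for illustration, not LBJ's `SparseAveragedPerceptron` (in particular, it does no weight averaging), and the class `ThickPerceptron` and the toy examples are assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a sparse perceptron with a margin; not LBJ's implementation. */
class ThickPerceptron {
    private final Map<String, Double> weights = new HashMap<>();
    private final double learningRate;
    private final double thickness;

    ThickPerceptron(double learningRate, double thickness) {
        this.learningRate = learningRate;
        this.thickness = thickness;
    }

    /** Sum of the weights of the active (binary) features. */
    double score(List<String> features) {
        double s = 0.0;
        for (String f : features) s += weights.getOrDefault(f, 0.0);
        return s;
    }

    /** Updates when the example (label +1 or -1) is wrong or inside the margin. */
    void update(List<String> features, int label) {
        if (label * score(features) <= thickness) {
            for (String f : features) weights.merge(f, learningRate * label, Double::sum);
        }
    }

    /** One round is one pass over all training examples. */
    void train(List<List<String>> examples, int[] labels, int rounds) {
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < examples.size(); i++) update(examples.get(i), labels[i]);
        }
    }

    int predict(List<String> features) {
        return score(features) >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        // Toy data: +1 stands for spam, -1 for ham (hypothetical examples).
        List<List<String>> examples = Arrays.asList(
            Arrays.asList("save", "software"),
            Arrays.asList("meeting", "touch"));
        ThickPerceptron p = new ThickPerceptron(0.1, 3.5);
        p.train(examples, new int[]{1, -1}, 5);
        System.out.println(p.predict(Arrays.asList("save")));    // prints 1
        System.out.println(p.predict(Arrays.asList("meeting"))); // prints -1
    }
}
```

Because the weights live in a map keyed by feature name, only features that actually occur in the training data take up space, which is the sense in which the learner is sparse.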

Page 15

Demo

Let’s build a spam detector

How to train?

How do different learning algorithms perform? Does this choice matter much?

Page 16

Features

Our current spam detector uses words as features

Can we do better?

Let’s try it out

Page 17

MORE TEXT CLASSIFICATION

Page 18

Sentiment classification

Which of these product reviews is positive?

I recently made the switch from PC to Mac, and I can say that I'm not sure why I waited so long. Considering that I have only had my computer a few weeks I can't say much about the durability and longevity of the hardware, but I can say that the operating system (mine shipped with Lion) and software is top notch.

I've been an Apple user for a long time, but my most recent MacBook Pro purchase has convinced me to reconsider. I've had several hardware issues, including a failed keyboard, battery failure, and a bad DVD drive. Now, the backlight on the display fails to turn on when waking from sleep

How do you know?

Page 19

Classifying news groups

Which mailing list should this message be posted to?

I am looking for Quick C or Microsoft C code for image decoding from file for VGA viewing and saving images from/to GIF, TIFF, PCX, or JPEG format. I have scoured the Internet, but its like trying to find a Dr. Seuss spell checker TSR. It must be out there, and there's no need to reinvent the wheel.

How do you know?

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc

Page 20

Demo

Converting our spam classifier into

• A sentiment classifier

• A newsgroup classifier

Note: How different are these at the implementation level?

Page 21

Most of the engineering lies in the features

[Figure: a document goes into a classifier (black box), which outputs some labels (✓ ✗ ✗ ✗)]

Page 22

Summary

What is LBJ? How do we use it?

Writing a simple spam detector

Playing with features

How much do we need to change to move to a different application?

Page 23

Assignment before Next Class (Not Graded)

Download the code & data (http://l2r.cs.uiuc.edu/~danr/Teaching/CS446-12/handsonclassification.html) for this class and play with it

Try to solve the Badges game puzzle with LBJ

• Think about what features are needed

• Write a parser for reading the data

• Write a classifier for solving the puzzle

Page 24

Next Class

We will solve the Badges Game puzzle by Machine Learning

We will look at more text classification examples

We will think about a famous people classifier

Questions

Page 25

Badge Classifier

Brainstorm the possible features

• Characters in entire name

• Two consecutive characters

• Character as vowel, character as consonant

• …
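The brainstormed features can be sketched in plain Java as follows. This is only one possible encoding: the class `BadgeFeatures`, the feature-string formats, and the sample name "Al Ng" are made up for illustration and are not taken from the course code or the Badges data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch: character-level features for a name on a badge. */
class BadgeFeatures {
    static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    /** Emits character, vowel/consonant-by-position, and bigram features. */
    static List<String> extract(String name) {
        String s = name.toLowerCase(Locale.ROOT);
        List<String> features = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isLetter(c)) continue; // skip spaces and punctuation
            // The character itself, anywhere in the name.
            features.add("char=" + c);
            // Whether the character at this position is a vowel or a consonant.
            features.add("pos" + i + "=" + (isVowel(c) ? "vowel" : "consonant"));
            // Two consecutive characters, when the next one is also a letter.
            if (i + 1 < s.length() && Character.isLetter(s.charAt(i + 1))) {
                features.add("bigram=" + c + s.charAt(i + 1));
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // "Al Ng" is a made-up name used only to show the output format.
        for (String f : extract("Al Ng")) System.out.println(f);
    }
}
```

Each string plays the role of one binary feature; a learner such as the one in the spam example would assign a weight to every distinct string it sees.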

Feature Engineering is Important (especially if labeled data is small)

What is the baseline? 70 +, 24 -
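If the figures above are read as 70 positive and 24 negative examples (an assumption about the slide's shorthand), then one natural baseline, always predicting the majority label "+", would have the following accuracy:

```latex
\text{accuracy}_{\text{majority}} = \frac{70}{70 + 24} = \frac{70}{94} \approx 0.745
```

Under that reading, a learned badge classifier is only interesting if it clearly beats roughly 74.5% accuracy.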

Page 26

THE FAMOUS PEOPLE CLASSIFIER

Page 27

The Famous People Classifier

f( ) = Politician

f( ) = Athlete

f( ) = Corporate Mogul

Page 28

The NLP version of the fame classifier

Each person is represented by text:

All sentences in the news in which the string "Barack Obama" occurs

All sentences in the news in which the string "Roger Federer" occurs

All sentences in the news in which the string "Bill Gates" occurs

Page 29

Our goal

Find famous athletes, corporate moguls and politicians

Athlete

• Michael Schumacher

• Michael Jordan

• …

Politician

• Bill Clinton

• George W. Bush

• …

Corporate Mogul

• Warren Buffett

• Larry Ellison

• …

Page 30

Let’s brainstorm

How do we build a fame classifier?

Remember, we start off with just raw text from a news website

Page 31

One solution

Let us label entities using features defined on mentions

Identify mentions using the named entity recognizer

Define features based on the words, parts of speech and dependency trees

Train a classifier

All sentences in the news in which the string "Barack Obama" occurs

Page 32

Summary

1. Get started with Learning Based Java

2. Use a generic, black box text classifier for different applications

…and write your own text classifier, if needed

3. Understand how features can impact the classifier performance

… and add features to improve your application

4. Build a badge classifier based on character features

Questions