Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer...

39
Computer Processing of Handwriting in Documents Sargur (Hari) Srihari [email protected] Center of Excellence for Document Analysis and Recognition (CEDAR) Department of Computer Science and Engineering University at Buffalo, State University of New York

Transcript of Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer...

Page 1: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Computer Processing of Handwriting in Documents

Sargur (Hari) [email protected]

Center of Excellence for Document Analysis and Recognition (CEDAR)

Department of Computer Science and Engineering

University at Buffalo, State University of New York

Page 2: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Outline of Talk

1. Role of Handwriting2. Handwriting Recognition3. Postal Address Interpretation4. Writer Recognition5. Automatic Essay Scoring6. Future- what are the possibilities

Page 3: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Handwriting vs SpeechSpeech Handwriting

Medium ofCommunication

Most Natural Natural

Means of Recording Personal Information

Not Common Most Common

Role in Human History Distinguishes Homo Sapiens species

Led to civilization(writing/printing)

Storage Technology Still in infancy Great Advances*

RecognitionTechnology

Great Advances Still in infancy

*Gutenberg’s movable type considered invention of the millenium

Page 4: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Extracting Information from Handwriting

HandwrittenDocument Image

HandwritingRecognition

LanguageRecognition

WriterRecognition

Words“Listen”

LanguageEnglish

Writer“Joe Smith”

InterpretationEssay Score 6

Addressee“Mr. Ramsey”

Page 5: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Terminology of Handwriting Processing

HandwritingRecognition

Writer Recognition

Graphology

HandwritingInterpretation

Dynamic Static Writer VerificationWriter Identification

Natural Writing Forgery Disguised Writing

Traced Simulated Freehand

Page 6: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Dynamic vs Static HandwritingDynamic(on-line)

Static(off-line)

Writing Surface

Specialized(electrostatic or patterned paper)

Ordinary paper

Pen Specialized Ordinary pen

Ubiquity Slow acceptance Continued popularity

Temporal Information

Available Image only

Text field Boxes Thresholding, Occlusion

Page 7: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Handwriting in SchoolsFCAT Sample Test

Read, Think and Explain Question (Grade 8)

Reading Answer BookRead the story “The Makings of a Star” before answering Numbers 1 through 8 in Answer Book.

Page 8: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

School Test Modalities

• On-Line– Key-boarding skills

• How early to introduce?

– Computer network down-time– Academic integrity

• Paper and Pencil– Natural means of communication

Page 9: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Outline of Talk

1. Role of Handwriting2. Handwriting Recognition3. Postal Address Interpretation4. Writer Recognition

– Demo5. Automatic Essay Scoring6. Future- what are the possibilities

Page 10: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Handwriting Recognition (OHR)Letter Pairs(state abbr)

Characters Digit Pairs

Error

Reject

Words

Page 11: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Line and Word Segmentation

0.010.04

0.35

Page 12: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Line SegmentationTouching lines

Overlapping lines

Page 13: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Line Segmentation Algorithm

STEP 1

Line_Above ~ N(µa,Σa)µa=[404,13,2548]

Σa=[28944 -546

-546 208]?

Line_Below ~N(µa,Σa)µa=[481, 2659]

Σa=[35778 -707

-707 238]

Component points

COMPN points

STEP 2: log P (COMP | Line_Above ) = Пi log P ( COMPi | µa,Σa) = -22161

log P (COMP | Line_Below ) = Пi log P ( COMPi | µb,Σb) = -44391

STEP 3

Page 14: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Outline of Talk

1. Role of Handwriting2. Handwriting Recognition3. Postal Address Interpretation4. Writer Recognition5. Automatic Essay Scoring6. Future- what are the possibilities

Page 15: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Handwritten Address Interpretation*

Postnet Bar Code representing Delivery Point

**Among first handwritten addresses sorted automatically (Oct 1996Among first handwritten addresses sorted automatically (Oct 1996))

Page 16: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Mining Postal DirectoriesSi

ze o

f stre

et n

ame

lexi

con

log (Number of ZIP Codes)

No. of ZIP’s with | f6 | = 1=> 6,264 (14.97%)

| f3 | = 41,840Mean | f6 | = 95.04Max | f6 | = 1,513

(3.80, 1)

Size

of s

treet

nam

e le

xico

n

log (No. of (ZIP, primary) pairs)

No. of (ZIP, pri) with | f6 | = 1=> 34,102,092 (69.11%)

| (f3 , f5 ) | = 49,347,888Mean | f6 | = 2.21Max | f6 | = 542

(7.53, 1)

69% of (ZIP, Primary) pairs have single street

15% of ZIP Codes have single street

Page 17: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Context StrategyStreet address

Lexicon entry(Street name)

ZIP+4add-on

AMHERSTON DR 7006BELVOIR RD 3604CADMAN DR 6948

CLEARFIELD DR 2336FORESTVIEW DR 1438

HARDING RD 7111HUNTERS LN 3330MCNAIR RD 3718

MEADOWVIEW LN 3557OLD LYME DR 2250

RANCH TRL 2340RANCH TRL W 2246

SHERBROOKE AVE 3421SUNDOWN TRL 2242TENNYSON TER 5916

Database query

Records Retrieved

Recognizer choice(after lex. expansion)

ZIP+4: 142213557

ZIP Code: 14221Primary number: 276

Addressencoding

Page 18: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

USPS Handwritten Address Encode Rate

1997: 30% finalization

2001: 75% finalization http://www.usps.com/history/cs01/c2f.htm

2004: 88% finalization

From 1996 to 2004, the encode rate for RCRs increased from 35 percent to 90 percent, reducing the need for manual matching at RECs from 24 billion pieces per year to 6 billion annually. As a result, USPS has been able to reduce the number of RECs in the national network from a high of 55 in 1998 to only 15 today.

Page 19: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Outline of Talk

1. Role of Handwriting2. Handwriting Recognition3. Postal Address Interpretation4. Writer Recognition5. Automatic Essay Scoring6. Future- what are the possibilities

Page 20: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Writer Recognition

Anthrax Letters written by the same person ?Anthrax Letters written by the same person ?Do they match Do they match writershipwritership of known writing?of known writing?

Page 21: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Each Person Writes Differently

Within-writer variation is less than between-writer variation.

Person 1 (Female,API, 25~44)

Person 5(Male, White, 15~24)

Person 2(Male, API, 45~64)

Person 6(Male, White, 45~64)

Person 3(Male, White, 25~44)

Person 7(Male, White, 15~24)

Person 4(Female, White, 45~65)

Person 8(Female, White, 15~24)

Page 22: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Writer Verification Written by Same Person?

Person 1 (Female,API, 25~44)

Person 2(Male, API, 45~64)

Person 3(Male, White, 25~44)

Person 4(Female, White, 45~65)

Person 5(Male, White, 15~24)

Person 6(Male, White, 45~64)

Person 7(Male, White, 15~24)

Person 8(Female, White, 15~24)

Page 23: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

CedarFox Demo

Page 24: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Software System

Pre-processing:Thresh, Line seg

Similarity LLR

Parameters from Training data

FeatureExtraction

Questioned

LLR

Known

Page 25: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

PerformanceDifferent Content Half-Page Images

Same Writer Distribution

-100

0

100

200

300

400

0 100 200 300 400 500

Test Case

LLR V

alue

s

Different Writer Distribution

-600-500-400-300-200-100

0100200300

0 200 400 600 800 1000 1200

Test Case

LLR

Valu

eRejection vs. Error rate

0

5

10

15

20

25

30

Perc

enta

ge

Rejection rateError rate Writer Verification: written by same person?

Full page: 4% error rate Single word: 16% error rate

Page 26: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Outline of Talk

1. Role of Handwriting2. Handwriting Recognition3. Postal Address Interpretation4. Writer Recognition5. Automatic Essay Scoring6. Future- what are the possibilities

Page 27: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

NY English Language Arts Assessment (ELA)-Grade 8

Page 28: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Sample Question and AnswersHow was Martha Washington’s role as First Lady different from

that of Eleanor Roosevelt? Use information from American First Ladies in your answer.

Page 29: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Automatic Handwritten Essay Scoring• Why Automatic Assessment Technologies?

– Timely scoring and reporting results– Need to test later in school year for

• capturing most student growth and • requirement to report scores before summer break

– When large nos. of assignments are submitted at once,

• teachers bogged down to provide consistent evaluations and high quality feedback to students

• within short time frame-- in days not weeks

Page 30: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Holistic Rubric Chart for “American First Ladies”

6 5 4 3 2 1

Understanding of text

Understanding of similarities and differences among the roles

Characteristics of first ladies

•Complete•Accurate•Insightful•Focused•Fluent•engaging

Understanding roles of first ladies

Organized

Not thoroughly elaborate

Logical

Accurate

Only literal understanding of article

Organized

Too generalized

Facts without synchronization

Partial understanding

Drawing conclusions about roles of first ladies

Sketchy

Weak

Readable

Not logical

Limited understanding

Brief

Repetitive

Understood only sections

Page 31: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

OHR ProcessesForm RemovalScanned Answer Line/Word

SegmentationAutomatic WordRecognition

Page 32: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Lexicon of “American First Ladies”martha

meet

miles

much

nation

nations

newspaperp

not

occasions

of

often

on

opened

opinions

or

other

our

outgoing

overseas

own

part

partner

people

play

polio

politicians

politics

residency

president

presidential

presidents

press

prisons

property

proposals

public

quaker

rather

really

receptions

remarkable

rights

initial

inspected

its

james

job

just

known

ladies

lady

lecture

life

light

like

limited

made

madison

madisons

magazines

make

making

many

married

held

helped

her

him

his

homemaking

honor

honored

hospitals

hostess

hosting

human

husband

husbands

ideas

ii

important

in

inaugural

influence

influences

us

usually

very

vote

want

war

was

washington

weakened

well

were

when

where

which

who

whom

whose

wife

will

with

woman

womans

family

fdr

fdrs

few

first

for

former

franklin

from

funeral

garment

gathered

general

george

girls

given

great

had

half

harry

he

than

that

the

their

there

they

this

those

to

tours

travel

traveled

travels

treated

trips

troops

truly

truman

two

united

universal

up

did

diplomats

discussion

doing

dolley

during

early

ears

easily

education

eleanor

elected

encountered

equal

established

even

ever

everything

expanded

eyes

factfinding

1800s

1849

1921

1933

1945

1962

38000

a

able

about

across

adlai

after

allowed

along

also

always

ambassadorcame

american

an

and

anna

appointed

aristocracy

articles

as

at

be

became

began

boys

brought

but

by

call

called

candidate

candle

career

role

roosevelt

roosevelts

royalty

saw

schools

service

sharecroppers

she

should

skills

social

society

some

states

stevenson

strong

students

suggestions

summed

take

taylor

center

century

column

community

conference

considered

contracted

could

country

create

Curse

daily

darkness

days

dc

death

decided

declaration

delano

delegate

depression

women

workers

world

would

wrote

year

years

zachary

Page 33: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Latent Semantic Analysis

• Approximate the T-dimensional term space by k principal components directions in this space– Using the N xT document term matrix to estimate

directions– Results in a N x k matrix– Terms first lady and president’s wife are combined into

a single principal component• Dimensionality Reduction – Using SVD

– Small enough to facilitate elimination of irrelevant representations

– Large enough to represent the structure of the answer documents

Page 34: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Application of LSA to “American First Ladies”: Sample Answer Texts

Score: 5M. Washington's role as first Lady was different from

E. Roosevelt's because she didn't want to called first lady, and because she didn’t want to be treated like royalty or aristocracy.

E. Roosevelt's role as first Lady was different from M. Washington's because she liked to called First Lady. she was always there with suggestions, proposals, and ideas, she also traveled across country on lecture tours, wrote articles for magazines, and even wrote a daily newspaper column. Later in 1945 after her husband's death; she was appointed U.S. delegate to the United Nations, (where she helped to create the Universal Declaration of Human Rights); and at her funeral in 1962, President Harry Truman called her "the First Lady of the World"; and former presidential candidate Adlai Stevenson summed up E. roosevelt's remarkable career by saying: "she would rather light a candle than curse the darkness".

Score: 0Dolley became an outgoing woman with strong opinions,

whose influence on her husband was well known. Eleanor became the "eyes and ears" of her husband, often making fact finding trips for him.

Document Term Matrix

Terms (after word stemming)

Student Answer Scores

Page 35: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Comparison of Human and Machine Scores

Manual Transcription OHR

Mean difference = 1.70 Mean difference = 1.65

Page 36: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Outline of Talk

1. Role of Handwriting2. Handwriting Recognition3. Postal Address Interpretation4. Writer Recognition5. Automatic Essay Scoring6. Conclusion and Future Work

Page 37: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Main Lesson Learned

• Handwriting Interpretation Tasks (Postal Address Interpretation and School Essay Scoring):

• Context overcomes ambiguity to achieve useful performance

Page 38: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Future Possibilities

• Multilingual postal addresses• Searching handwritten databases• Accepted as Standard for Questioned

Document Examination• Handwriting as a diagnostic aid with brain

imaging such as ADHD, dyslexia• Grand Challenge: scoring science and

engineering answer booklets

Page 39: Computer Processing of Handwriting in Documentssrihari/talks/MSU.pdf · 2006-12-19 · Computer Processing of Handwriting in Documents Sargur (Hari) Srihari srihari@cedar.buffalo.edu

Thank You

[email protected]