Learning sign language by watching TV


Page 1: Learning sign language by watching TV

Learning sign language by watching TV

Shishir Agrawal, Lei Fan

Page 2: Learning sign language by watching TV

Introduction and Goal

• Television programs are now routinely broadcast with both subtitles and a person signing (usually as an overlay) to provide simultaneous ‘translation’ of the spoken words for deaf people

• Goal: to learn the translation of English words into British Sign Language signs from these TV broadcasts, using the supervisory information available from the subtitles broadcast simultaneously with the signing.

Page 3: Learning sign language by watching TV

Goal

Page 4: Learning sign language by watching TV

Previous Research

• Required manual training data to be generated for each sign e.g. a signer ‘performing’ each sign in controlled conditions – a time-consuming and expensive procedure.

• Many considered only constrained situations, for example requiring the use of data gloves or coloured gloves to assist with image processing at training and/or test time.

Page 5: Learning sign language by watching TV

Training Data Set

• The source material for learning consists of many hours of video with simultaneous signing and subtitles recorded from BBC digital television.

• This supervisory information is WEAK and NOISY

Page 6: Learning sign language by watching TV

WHY WEAK?

• It is weak due to the correspondence problem since temporal distance between sign and subtitle is unknown and signing does not follow the text order.

• Polysemy: the same English word may have different meanings and therefore different signs, or the same sign may correspond to multiple English words.

Page 7: Learning sign language by watching TV

WHY NOISY?

• The occurrence of a subtitle word does not imply the presence of the corresponding sign

Page 8: Learning sign language by watching TV

Noisy and Weak Data Set

Page 9: Learning sign language by watching TV

Approach: How to Learn

• Find the set of subtitles that contain the word to form the positive training set, and the ones that do not contain the word to form the negative set of subtitles

• Build a visual descriptor for every candidate sign in the video sequences corresponding to these subtitles

• Find the sign that is present in every sequence of the positive set and in none of the negative set (IDEALLY)

• Because of noise and weak training data we need to score the signs and choose the one with best score
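The first step above, splitting subtitles by a target word, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_subtitles`, the whole-word, case-insensitive matching rule and the sample subtitles are all assumptions.

```python
import re

def split_subtitles(subtitles, target_word):
    """Partition subtitle texts into those that contain the target word and those that do not."""
    # Assumption: a whole-word, case-insensitive match decides membership.
    pattern = re.compile(r"\b" + re.escape(target_word) + r"\b", re.IGNORECASE)
    positive, negative = [], []
    for text in subtitles:
        (positive if pattern.search(text) else negative).append(text)
    return positive, negative

# Made-up subtitles, for illustration only.
subs = ["The golf course is open", "He plays Golf every day", "A tree fell down"]
pos, neg = split_subtitles(subs, "golf")
```

A positive subtitle only says the sign is likely to be nearby, which is why the later steps score candidate windows rather than trusting the split directly.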

Page 10: Learning sign language by watching TV

Example

Page 11: Learning sign language by watching TV

Method

• Extract the subtitles from the video

• Generate a feature vector for each frame of the video describing the position, shape and orientation of the hands

• Find the score of each sign (window)

Page 12: Learning sign language by watching TV

Extracting Data from Video

• OCR methods are used to extract subtitles from the video.

• Each subtitle instance consists of a short text, and a start and end frame indicating when the subtitle is displayed.

• Typically a subtitle is displayed for around 100–150 frames.

Page 13: Learning sign language by watching TV

• By processing subtitles we obtain a set of video sequences labeled with respect to a given target English word as ‘positive’ (likely to contain the corresponding sign) or ‘negative’ (unlikely to contain the sign).

Page 14: Learning sign language by watching TV

+ve Sequence Frame Range

• Due to latency between the subtitles and the signing, the positive sequence is wider than the subtitle itself: given the subtitle in which the target word appears, the frame range of the extracted positive sequence is defined as the start frame of the previous subtitle until the end frame of the next subtitle

• Consequently, positive sequences are, on average, around 400 frames in length. In contrast, a sign is typically around 7–13 frames long
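The frame-range rule for positive sequences can be written down directly. The sketch below is illustrative: the `Subtitle` record, the function name `positive_range` and the clamping at the first and last subtitle are assumptions, since the slides do not say what happens at the ends of the video.

```python
from typing import List, NamedTuple, Tuple

class Subtitle(NamedTuple):
    text: str
    start: int  # first frame on which the subtitle is displayed
    end: int    # last frame on which the subtitle is displayed

def positive_range(subtitles: List[Subtitle], index: int) -> Tuple[int, int]:
    """Frame range from the start of the previous subtitle to the end of the next one."""
    # Assumption: at either end of the video, fall back to the subtitle itself.
    prev_idx = max(index - 1, 0)
    next_idx = min(index + 1, len(subtitles) - 1)
    return subtitles[prev_idx].start, subtitles[next_idx].end

# Made-up timings, each subtitle displayed for roughly 100-150 frames.
subs = [Subtitle("a", 0, 120), Subtitle("b golf", 130, 260), Subtitle("c", 270, 400)]
start, end = positive_range(subs, 1)
```

With three subtitles of this length the range covers about 400 frames, in line with the average quoted above, while the sign itself occupies only 7–13 of them.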

Page 15: Learning sign language by watching TV

-ve Sequence Frame Range

• Similarly, negative sequences are determined by searching for subtitles where the target word does not appear.

• For any target word an hour of video yields around 80,000 negative frames which are collected into a single negative set.

Page 16: Learning sign language by watching TV

Visual Processing

• A description of the signer’s actions for each frame in the video is extracted by tracking the hands via an articulated upper-body model.

Page 17: Learning sign language by watching TV

Upper Body Tracking

• It tracks the head, torso, arms and hands of the signer, unlike traditional methods which track only the hands

• It requires a few frames (around forty) of manual initialization to specify the size of the parts and learn their colour and shape, and then tracking proceeds automatically for the length of the video.

• Robust method to track long videos, e.g. an hour, despite the complex and continuously changing background

Page 18: Learning sign language by watching TV

Frame Descriptor

• After the hands are extracted from the frame, a descriptor is built for the left hand, the right hand, and the pair of hands as a whole; the hand-pair descriptor handles the problem of overlapping or touching hands

Page 19: Learning sign language by watching TV

What Does the Descriptor Describe

• Position

• Shape

• Orientation

Page 20: Learning sign language by watching TV

Output of Tracker

• Segmented parts such as the hands, which are represented by their HOG (histogram of oriented gradients) descriptors, as these help convey shape.

Page 21: Learning sign language by watching TV

Exemplar (visual word)

• This HOG descriptor is converted into an ‘exemplar’ hand shape.

• Exemplars are precomputed hand shapes

• Exemplars are learnt separately for the left hand, right hand, and hand pairs, using automatically chosen ‘clean’ images: the hands must not be in front of the face, and should be separate for individual hands or connected for hand pairs. K-means clustering of the corresponding HOG descriptors is used to determine the exemplar set.

• 1,000 clusters for each of left/right hands and hand pairs are used

Page 22: Learning sign language by watching TV

How it works

• Given the exemplars, the segmented hands in each frame are then assigned to their nearest exemplar (as measured by Euclidean distance between HOG descriptors) using the position of the wrists in the frame and in the hand exemplar for approximate alignment.
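The two steps, clustering descriptors into exemplars and assigning each hand to its nearest exemplar by Euclidean distance, can be sketched in plain Python. This is a toy version under stated assumptions: the descriptors are 2-D stand-ins for real HOG vectors, the k-means initialisation simply takes the first k points, and the wrist-based alignment step is omitted.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centres):
    """Index of the centre closest to point in Euclidean distance."""
    return min(range(len(centres)), key=lambda i: sq_dist(point, centres[i]))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns k cluster centres (the 'exemplars')."""
    centres = [list(p) for p in points[:k]]  # assumption: naive deterministic init
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centres)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centre if a cluster received no points
                centres[i] = [sum(c) / len(group) for c in zip(*group)]
    return centres

# Toy 2-D "descriptors"; real HOG descriptors have many more dimensions.
descriptors = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
exemplars = kmeans(descriptors, k=2)
labels = [nearest(d, exemplars) for d in descriptors]
```

In the slides, k would be 1,000 and a separate exemplar set would be built for each of the left hand, the right hand and hand pairs.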

Page 23: Learning sign language by watching TV

Example

Page 24: Learning sign language by watching TV

Frame and window descriptors

• Frame descriptor (position, hand exemplar, hand pair exemplar)

• Window descriptor is the concatenation of the per-frame descriptors
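Concatenating per-frame descriptors into window descriptors amounts to a sliding window over the frame sequence. In the sketch below the per-frame layout `(x, y, hand exemplar id, hand-pair exemplar id)` and the function name `window_descriptors` are hypothetical, chosen only to make the example concrete.

```python
def window_descriptors(frames, n):
    """Return every run of n consecutive per-frame descriptors, concatenated into one tuple."""
    return [
        tuple(value for frame in frames[i:i + n] for value in frame)
        for i in range(len(frames) - n + 1)
    ]

# Hypothetical per-frame layout: (x, y, hand exemplar id, hand-pair exemplar id).
frames = [(10, 20, 3, 7), (11, 21, 3, 7), (12, 22, 4, 8)]
windows = window_descriptors(frames, n=2)
```

A sequence of F frames yields F - n + 1 windows of length n, each of which is a candidate sign.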

Page 25: Learning sign language by watching TV

Visual distance between signs

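• Distance between windows (formula not preserved in this transcript): a weight is learnt for each individual target sign.

• The distances for the left hand and right hand are defined similarly (formula not preserved in this transcript); the weights are learnt offline.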

Page 26: Learning sign language by watching TV

Position distance

• Position is measured relative to the torso. The distance is translation invariant (formula not preserved in this transcript); the maximum translation is learnt from training data and is set at 5 pixels.

• Other transformations (scaling and rotation) were investigated and found to be slightly detrimental.
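One way to make a position distance translation invariant up to a maximum shift is to take the minimum over all allowed shifts. The sketch below assumes exactly that, with integer pixel shifts and a sum of squared differences over the frames of the window; the slides do not confirm this form, and only the 5-pixel limit comes from them.

```python
def position_distance(a, b, max_shift=5):
    """Smallest summed squared distance between two position tracks over bounded shifts.

    a, b: equal-length lists of (x, y) hand positions relative to the torso.
    """
    best = float("inf")
    for dx in range(-max_shift, max_shift + 1):
        for dy in range(-max_shift, max_shift + 1):
            total = sum(
                (xa - xb - dx) ** 2 + (ya - yb - dy) ** 2
                for (xa, ya), (xb, yb) in zip(a, b)
            )
            best = min(best, total)
    return best

a = [(10, 10), (12, 11), (14, 12)]
b = [(13, 8), (15, 9), (17, 10)]     # same trajectory as a, shifted by (3, -2)
c = [(30, 10), (32, 11), (34, 12)]   # same trajectory as a, shifted by 20 pixels in x
near = position_distance(a, b)
far = position_distance(a, c)
```

The small shift is absorbed completely, while the 20-pixel shift exceeds the limit and still incurs a large distance.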

Page 27: Learning sign language by watching TV

Hand shape distance

• Hand shape distance: reliable whether the hands are apart or touching

Page 28: Learning sign language by watching TV

Hand orientation distance

• The square of the angle needed to rotate one hand exemplar onto the other
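The text of the slides does not show how the position, shape and orientation distances are combined. One plausible reading, assumed here, is a weighted sum per hand and a second weighted sum across the two hands. All function names, default weights and the exact form below are assumptions; only the squared-angle orientation term comes from the slide.

```python
import math

def orientation_distance(angle_a, angle_b):
    """Squared smallest rotation (in radians) taking one hand orientation to the other."""
    diff = (angle_a - angle_b + math.pi) % (2 * math.pi) - math.pi
    return diff ** 2

def hand_distance(d_pos, d_shape, d_orient, w_pos=1.0, w_shape=1.0, w_orient=1.0):
    """Assumed form: weighted sum of the three per-hand distances."""
    return w_pos * d_pos + w_shape * d_shape + w_orient * d_orient

def window_distance(d_right, d_left, w_left):
    """Assumed form: right-hand distance plus a weighted left-hand distance."""
    return d_right + w_left * d_left

d_orient = orientation_distance(0.1, 2 * math.pi - 0.1)  # orientations 0.2 rad apart
d_hand = hand_distance(1.0, 2.0, 3.0, w_pos=0.5, w_shape=1.0, w_orient=2.0)
d_window = window_distance(2.0, 4.0, w_left=0.5)
```

Wrapping the angle difference into [-π, π) keeps two nearly identical orientations on either side of 0 from being treated as far apart.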

Page 29: Learning sign language by watching TV

• Each positive sequence in turn is used as a ‘driving sequence’, where each temporal window of length n within the sequence is considered as a template for the sign.

Page 30: Learning sign language by watching TV

Sliding window classifier

• The classifier is used to determine if a temporal window matches the ‘template’ window
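A sliding-window classifier of this kind can be sketched as a threshold on the distance to the template. The function name `matching_windows`, the strict `<` comparison and the toy distance are assumptions made for illustration.

```python
def matching_windows(template, windows, distance, threshold):
    """Indices of the windows whose distance to the template is below the threshold."""
    return [i for i, w in enumerate(windows) if distance(template, w) < threshold]

def abs_diff(a, b):
    """Toy distance: sum of absolute differences (stands in for the visual distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

windows = [(0, 0), (1, 1), (5, 5), (0, 1)]
hits = matching_windows((0, 0), windows, abs_diff, threshold=2)
```

Raising the threshold admits more windows as matches, which is why the threshold is one of the parameters tuned when the score is maximized.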

Page 31: Learning sign language by watching TV

MIL

• For a given target word, a set of positive bags
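In the standard multiple-instance learning (MIL) setting, a bag is treated as positive if at least one of its instances is classified positive. The sketch below assumes that convention, with each positive sequence as a bag and each temporal window as an instance; the slide does not spell this out.

```python
def bag_output(instance_predictions):
    """A bag fires if any of its instances (windows) is predicted positive."""
    return any(instance_predictions)

# Made-up window-level predictions for three positive bags.
positive_bags = [[False, True, False], [False, False], [True, True]]
outputs = [bag_output(bag) for bag in positive_bags]
```

The second bag illustrates the noise in the supervision: a subtitle contained the word, but no window in the sequence matched.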

Page 32: Learning sign language by watching TV

Score function

• Maximize the score function to obtain the parameters (formula not preserved in this transcript). The score combines predictions on positive bags and negative instances with prior knowledge about the likely temporal location of target signs

Page 33: Learning sign language by watching TV

Score function

• The distribution of errors is modeled as (formula not preserved in this transcript), estimated by fitting a parametric model to ground-truth training data.

Page 34: Learning sign language by watching TV

Temporal prior

• Sign instances corresponding to a target word are more likely to be temporally located close to the center of positive sequences

• For bags with negative output

• Otherwise

Page 35: Learning sign language by watching TV

Temporal prior

• For bags with negative output

• Otherwise (Maximum likelihood)

• Temporal locations are scaled to [-1, +1]; the prior is learnt from a subset of signs
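Scaling a temporal location to [-1, +1] is a linear map over the positive sequence. The sketch below shows only that scaling step and assumes a linear mapping of frame indices; the shape of the prior itself is not recoverable from the slides and is not modelled.

```python
def scaled_location(frame, seq_start, seq_end):
    """Map a frame index within [seq_start, seq_end] linearly onto [-1, +1]."""
    if seq_end <= seq_start:
        raise ValueError("sequence must span more than one frame")
    return 2.0 * (frame - seq_start) / (seq_end - seq_start) - 1.0

centre = scaled_location(200, 0, 400)   # middle of a 400-frame positive sequence
first = scaled_location(0, 0, 400)
last = scaled_location(400, 0, 400)
```

After scaling, a location near 0 lies at the centre of the positive sequence, where the target sign is considered most likely.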

Page 36: Learning sign language by watching TV

Maximizing the score

• Given a template window, the score function is maximized by searching over the left-hand weight and a set of thresholds

• The operation is repeated for all template windows; the template window that maximizes the score is deemed to be the sign corresponding to the target word.
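The search described above can be sketched as an exhaustive loop over template windows, candidate weights and thresholds. The function name `best_template`, the candidate grids and the toy score function are assumptions; the real score function is the one defined on the earlier slides.

```python
from itertools import product

def best_template(templates, left_weights, thresholds, score):
    """Return (score, template index, left-hand weight, threshold) with the highest score."""
    best = None
    for idx, template in enumerate(templates):
        for w_left, threshold in product(left_weights, thresholds):
            s = score(template, w_left, threshold)
            if best is None or s > best[0]:
                best = (s, idx, w_left, threshold)
    return best

def toy_score(template, w_left, threshold):
    """Stand-in score that peaks at template 3, weight 0.5 and threshold 2."""
    return -(template - 3) ** 2 - (w_left - 0.5) ** 2 - (threshold - 2) ** 2

result = best_template([1, 3, 4], [0.0, 0.5, 1.0], [1, 2, 3], toy_score)
```

The cost grows with the product of the three grid sizes, so in practice the grids for the weight and thresholds would need to stay small.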

Page 37: Learning sign language by watching TV

Experiment

• Given an English word, the goal is to identify the corresponding sign.

• Success if:

i. the selected template window shows the true sign (at least 50% overlap with ground truth)

ii. at least 50% of all windows in the matched sequence show the true sign.
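The success criterion can be checked mechanically once ground-truth frame intervals are available. The sketch below makes two assumptions that the slide does not settle: overlap is measured as intersection over union of inclusive frame intervals, and both conditions must hold for success.

```python
def overlap_ratio(a, b):
    """Intersection over union of two inclusive frame intervals (start, end)."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def is_success(template, matched, truths, min_overlap=0.5):
    """True if the template shows the sign and at least half the matched windows do too."""
    def shows_sign(window):
        return any(overlap_ratio(window, t) >= min_overlap for t in truths)
    if not shows_sign(template):
        return False
    correct = sum(shows_sign(w) for w in matched)
    return correct >= 0.5 * len(matched)

truths = [(10, 19)]  # made-up ground-truth interval for the sign
ok = is_success((10, 19), [(10, 19), (12, 21), (100, 109)], truths)
bad = is_success((10, 19), [(10, 19), (100, 109), (200, 209)], truths)
```

In the first case two of the three matched windows overlap the ground truth by at least half; in the second only one does, so the criterion fails.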