Antispam Image Filtering Technologies

Post on 15-Jan-2015

Slides from my wildly popular presentation at HP World 2005. Who knew? Grossly over-simplified signal processing methodology and sample photos of models in bikinis was a winning combo, even in San Francisco.

Transcript of Antispam Image Filtering Technologies

Image-Filtering Technologies

Michael Lamont

Senior Software Engineer

Process Software

Overview

• Role of image filtering in anti-spam

filtering

• Two popular image filtering methods:

– Shape recognition

– Skin detection

• Example image filtering

• Image filtering issues

• Tools you can play with on your own

What Isn’t Covered

• Anything requiring advanced math

• Optical character recognition (OCR)

Spam Images

• A picture is worth 1000 words…

• …and it’s a lot harder to filter than

1000 words.

• Especially when spamvertizing

pornography, photos are essential

marketing tools.

Spam Images

• Right now, a spam filter can be very

effective without looking at images.

• This is going to change when the

majority of sites start installing more

accurate filters, and spammers are

forced to adapt.

90-Second Image Review

• To understand how image filtering

technologies work, you need a basic

understanding of how computers

represent images.

• Images are broken into square dots,

which correspond to pixels on a

monitor.

90-Second Image Review

• Example image:

90-Second Image Review

• Each dot’s color is represented by 3

components: red, green, and blue.

• Each of the three color components

has a value of 0 to 255.

• If all three are 0, then the pixel is black.

If all three are 255, then the pixel is

white.

90-Second Image Review

• The higher the number, the more

intense the color component.

• Example: Increasing red value from 0

to 255 while leaving other components

at 0:
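A minimal sketch of that red ramp, assuming each pixel is stored as a (red, green, blue) tuple. The helper names `red_ramp` and `to_hex` are made up for illustration and are not from the slides.

```python
# Each pixel is a (red, green, blue) tuple; every component runs from 0 to 255.
BLACK = (0, 0, 0)
WHITE = (255, 255, 255)


def red_ramp(steps):
    """Return `steps` pixels going from black to full red, with green and blue held at 0."""
    if steps < 2:
        raise ValueError("need at least two steps")
    return [(round(255 * i / (steps - 1)), 0, 0) for i in range(steps)]


def to_hex(pixel):
    """Format a pixel the way web colors are written, e.g. (255, 0, 0) -> '#ff0000'."""
    return "#{:02x}{:02x}{:02x}".format(*pixel)


for pixel in red_ramp(6):
    print(pixel, to_hex(pixel))
```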

Shape Recognition

• Identifies objects in an image using

posterization and edge finding.

• Extracts interesting objects and

searches for similar objects in a

database of “bad” objects.

• For our application, the objects are

human body parts.

Posterization

• Dramatically reduces the number of

colors in an image.

• Has the side effect of lumping most of

an object’s pixels together.

• Called “posterization” because the

same kind of color reduction used to

be done for images printed on posters.

Posterization - Example

Posterization - Method

• A number of color bins are created.

• The number of bins is far smaller than the ~16.7 million colors (256 × 256 × 256) that are possible.

• Each bin holds several hundred closely related colors.

• Every color in a bin is represented by the bin’s average color.

Posterization - Method

• Example: If a bin contained every

shade of red from light pink to dark

blood, every color in the bin would be

represented by plain old red.

• The posterization process itself

consists of replacing the color of every

pixel in the image with its bin’s

representative color.
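One way to sketch the method, assuming the simplest possible binning: each of the three components is cut into equal-width ranges, and the middle of a range stands in for the bin's average color. The function names and the default of 4 levels per component (64 bins) are assumptions for illustration; a real filter may build its bins differently.

```python
def posterize_pixel(pixel, levels=4):
    """Map a pixel to the representative (middle) color of the bin it falls in."""
    width = 256 // levels                            # component values covered by one bin
    out = []
    for value in pixel:
        index = min(value // width, levels - 1)      # which bin this value falls in
        out.append(index * width + width // 2)       # middle value of that bin
    return tuple(out)


def posterize(image, levels=4):
    """Replace every pixel in a list-of-rows image with its bin's representative color."""
    return [[posterize_pixel(p, levels) for p in row] for row in image]


# Two different shades of red end up as the same representative color.
print(posterize_pixel((250, 10, 10)))
print(posterize_pixel((200, 40, 60)))
```

With 32 levels per component each bin would hold 8 × 8 × 8 = 512 colors, which is closer to the "several hundred" mentioned above; 4 levels just makes the effect easy to see.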

Posterization - Example 2

Posterization - Example 3

Edge Finding

• After posterizing the image, edge

finding is used to identify individual

objects.

• Edge finding determines the

boundaries between different patches

of color and contrast.

Edge Finding - Example

Edge Finding - Method

• The edge finding program scans the

image looking for pixels that are very

different from their neighbors.

• When it finds a radically different pixel,

it marks it as part of an edge.

• Good edge finding algorithms look at

lots of neighboring pixels to help

reduce noise.
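A rough sketch of the scan described above, assuming the image is a list of rows of (red, green, blue) tuples. It compares each pixel's brightness with the average difference to all of its neighbors. The names and the threshold of 40 are assumptions, and production edge finders use more careful math than this.

```python
def brightness(pixel):
    """Average of the three color components, 0 (black) to 255 (white)."""
    r, g, b = pixel
    return (r + g + b) / 3


def find_edges(image, threshold=40):
    """Return a grid of booleans: True where a pixel differs sharply from its neighbors."""
    height, width = len(image), len(image[0])
    edges = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            here = brightness(image[y][x])
            diffs = []
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (dy or dx) and 0 <= ny < height and 0 <= nx < width:
                        diffs.append(abs(here - brightness(image[ny][nx])))
            # Averaging over every neighbor means a single noisy neighbor
            # is not enough to mark this pixel as part of an edge.
            if diffs:
                edges[y][x] = sum(diffs) / len(diffs) > threshold
    return edges


# A black patch next to a white patch: only the columns along the boundary are marked.
B, W = (0, 0, 0), (255, 255, 255)
for row in find_edges([[B, B, W, W]] * 4):
    print(row)
```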

Edge Finding - Demonstration

Edge Finding - Example 2

Edge Finding - Example 3

Object Extraction

• Once objects have been identified with

posterization and edge finding, they’re

easy to extract.
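As an illustration of why extraction is easy at this point, here is a sketch that groups touching pixels of the same posterized color into regions with a flood fill. This is one common approach, not necessarily what any particular filter does, and `extract_objects` is a hypothetical name.

```python
from collections import deque


def extract_objects(image, min_pixels=1):
    """Group touching pixels of identical color into regions of (x, y) coordinates."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for y in range(height):
        for x in range(width):
            if seen[y][x]:
                continue
            color = image[y][x]
            queue = deque([(x, y)])
            seen[y][x] = True
            pixels = []
            while queue:
                cx, cy = queue.popleft()
                pixels.append((cx, cy))
                # Spread to the four direct neighbors that share this color.
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and image[ny][nx] == color):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if len(pixels) >= min_pixels:     # drop specks smaller than min_pixels
                regions.append({"color": color, "pixels": pixels})
    return regions


RED, BLUE = (224, 32, 32), (32, 32, 224)
for region in extract_objects([[RED, RED, BLUE], [RED, BLUE, BLUE]]):
    print(region["color"], len(region["pixels"]))
```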

Object Extraction

• For people wearing swimsuits, the filter searches for leg, midriff, and upper-torso objects.

Object Extraction

• A database of known objects is

searched for matches to the extracted

objects.

• Both object shape and color are used

in the search.

• Comparisons are done with a fuzzy

logic algorithm, since it’s unlikely two

objects will be exactly alike.
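The slides don't spell out the fuzzy comparison, so the sketch below swaps in a plain graded similarity score: it combines a shape measure (width/height ratio) with a color distance and accepts the best match above a cutoff. The descriptor fields, the labels, the equal weighting, and the 0.85 cutoff are all assumptions for illustration.

```python
def similarity(a, b):
    """Score from 0.0 to 1.0 for how alike two object descriptors are (1.0 = identical)."""
    # Shape: ratio of the two width/height ratios, so the order of a and b doesn't matter.
    shape = min(a["aspect"], b["aspect"]) / max(a["aspect"], b["aspect"])
    # Color: straight-line distance in RGB, scaled by the largest possible distance.
    dist = sum((x - y) ** 2 for x, y in zip(a["color"], b["color"])) ** 0.5
    color = 1.0 - dist / (3 * 255 ** 2) ** 0.5
    return 0.5 * shape + 0.5 * color


def best_match(candidate, database, threshold=0.85):
    """Return (label, score) for the closest known object, or None if nothing is close enough."""
    if not database:
        return None
    score, label = max((similarity(candidate, known), known["label"]) for known in database)
    return (label, score) if score >= threshold else None


# Hypothetical database entries; a real one would hold many examples per body part.
known_objects = [
    {"label": "leg", "aspect": 0.25, "color": (224, 160, 96)},
    {"label": "torso", "aspect": 0.80, "color": (224, 160, 96)},
]
print(best_match({"aspect": 0.30, "color": (224, 160, 128)}, known_objects))
print(best_match({"aspect": 3.00, "color": (0, 0, 255)}, known_objects))
```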

Skin Detection

• Subset of an image classification

method called color histogram

matching.

• Finds patches of skin tone in an image.

• Calculates the overall percentage of

the image that is skin.

• If more than a specified amount of the

image is skin, it’s filtered.

Skin Tones

• Almost all human skin is the same hue

- saturation differences result in

different skin colors.

• Human skin tones don’t often appear

in other photographed objects, so color

alone can be used to identify skin.

• Skin tones are dominated by red, with noticeably less green and even less blue.
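A small sketch of the hue point, using Python's standard `colorsys` module to convert RGB to hue and saturation. The two sample colors are illustrative guesses at a lighter and a darker skin tone, not measured values.

```python
import colorsys


def hue_and_saturation(pixel):
    """Return (hue in degrees, saturation in percent) for an (r, g, b) pixel."""
    r, g, b = (component / 255 for component in pixel)
    h, s, _v = colorsys.rgb_to_hsv(r, g, b)
    return round(h * 360), round(s * 100)


# Assumed sample colors, chosen only to illustrate the idea.
samples = {"lighter tone": (224, 172, 105), "darker tone": (141, 85, 36)}
for name, pixel in samples.items():
    hue, saturation = hue_and_saturation(pixel)
    print(f"{name}: hue {hue} degrees, saturation {saturation}%")
```

For these two samples the hues land within a few degrees of each other while the saturation differs much more, which is the pattern the slide describes.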

Skin Color Model

• To identify skin tones in an image, a

filter needs to know what colors are

skin tones.

• You could hardcode every skin color,

but there are tens of thousands of

them.

• It’s much more accurate to mark skin patches in sample images and “train” the filter on them.

Skin Color Training

• Works almost like Bayesian filter

training, but with image colors instead

of message tokens.

• Filter maintains one database of skin

colors, and another database of non-

skin colors.

• If a color appears more often in the

skin color database, it’s treated as a

skin color.

Skin Color Training

• This system has the nice side-effect of

dropping out most skin colors that also

appear in non-skin areas of photos.
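The two-database idea can be sketched with a pair of counters. The class name, the method names, and the `step` parameter (which lumps nearby colors together so a handful of samples generalizes) are assumptions for illustration, not details from the talk.

```python
from collections import Counter


class SkinColorModel:
    """Counts how often each color shows up in skin and in non-skin training pixels."""

    def __init__(self, step=8):
        self.step = step           # assumption: group nearby colors into one entry
        self.skin = Counter()      # color -> times seen inside marked skin patches
        self.non_skin = Counter()  # color -> times seen everywhere else

    def _key(self, pixel):
        return tuple(component // self.step for component in pixel)

    def train(self, pixels, is_skin):
        """Add training pixels to the skin or the non-skin database."""
        target = self.skin if is_skin else self.non_skin
        target.update(self._key(p) for p in pixels)

    def is_skin_color(self, pixel):
        """A color counts as skin only if it was seen more often in skin than elsewhere."""
        key = self._key(pixel)
        return self.skin[key] > self.non_skin[key]


model = SkinColorModel()
model.train([(224, 172, 105)] * 3, is_skin=True)   # pixels from a marked skin patch
model.train([(224, 172, 105)], is_skin=False)      # the same color seen once in a background
model.train([(30, 90, 200)] * 2, is_skin=False)    # sky-like background pixels
print(model.is_skin_color((225, 173, 106)))
print(model.is_skin_color((30, 90, 200)))
```

A color that shows up at least as often in backgrounds as in skin loses the comparison, which is the drop-out effect mentioned above.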

Training Sample

Skin Identification

• To analyze an image, the filter

examines the color of each pixel.

• If the color is a skin tone, the filter

marks the pixel as skin.

• When every pixel has been examined,

the % of the image that is skin is

calculated.

• If the % is over a specified threshold,

the image is filtered.
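These steps can be sketched as a short loop. The 30% threshold and the `crude_skin_rule` stand-in are assumptions made so the example runs on its own; in practice the per-color decision would come from a trained skin color model.

```python
def skin_fraction(image, is_skin_color):
    """Fraction (0.0 to 1.0) of the image's pixels that the classifier calls skin."""
    total = skin = 0
    for row in image:
        for pixel in row:
            total += 1
            if is_skin_color(pixel):
                skin += 1
    return skin / total if total else 0.0


def should_filter(image, is_skin_color, threshold=0.30):
    """Filter the image when more than `threshold` of it is skin."""
    return skin_fraction(image, is_skin_color) > threshold


def crude_skin_rule(pixel):
    """Stand-in classifier for this demo only; a trained color model would go here."""
    r, g, b = pixel
    return r > 95 and g > 40 and b > 20 and r > g > b and r - b > 15


SKIN, SKY = (224, 172, 105), (80, 140, 220)
mostly_skin = [[SKIN, SKIN, SKY], [SKIN, SKY, SKY]]
mostly_sky = [[SKY, SKY, SKY, SKY], [SKY, SKY, SKY, SKIN]]
print(skin_fraction(mostly_skin, crude_skin_rule), should_filter(mostly_skin, crude_skin_rule))
print(skin_fraction(mostly_sky, crude_skin_rule), should_filter(mostly_sky, crude_skin_rule))
```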

Skin Detection Example

Correctly Filtered Images - Shape

Correctly Filtered Images - Skin

Shape Recognition Problems

• Following are examples of images that

shape recognition doesn’t handle

correctly.

• Skin detection handles them correctly,

but only because it’s biased to filter

images with a lot of skin.

Shape Recognition Problems

• Unusual angle obscures shapes (skin detection works)

• Shapes are too broken up for the filter to work (skin detection works)

• Not enough “swimsuit” objects (skin detection works)

• Image is so noisy that edge detection goes crazy (amazingly, skin detection still works)

Skin Detection Problems

• Following are examples of images that

skin detection incorrectly filters.

• Shape recognition works for most of

these, mainly because it can’t extract

any useful shapes.

Skin Detection Problems

• Baby photos tend to show lots of skin. Shape recognition doesn’t filter the image.

• Portraits have the same problem as babies. Shape recognition ignores the image.

• In the right light, sand can be the same color as skin. That’s fairly rare - usually skin color models exclude sand colors.

• Black & white images can’t be filtered. They also make life rough on shape recognition filters.

Wedding Photos

• Wedding photos are guaranteed to

make a mess of image filters.

• Skin fades into the background

because of soft lighting, soft filters, and

retouching.

• Turns out that brides get upset if the

image is crystal clear with good

contrast - it shows off skin flaws.

Wedding Photos

• Skin detection filters start identifying

everything as skin (false positive).

• Shape recognition filters give up and

don’t filter the message (accurate, but

not for the right reasons).

• Porn tends not to be shot with soft

lighting - good contrast makes skin

“pop” in photos.

Example Wedding Photo - Shape

Example Wedding Photo - Skin

“Art Porn”

• Usually shot with the same lighting

effects as wedding photos.

• Rarely seen in email.

• In this case, skin detection is accurate

for the wrong reasons while shape

recognition lets the image pass.

“Art Porn” Example - Shape

“Art Porn” Example - Skin

Things I Can’t Show You

• S & M

– Skin tends to be covered with “clothing”

– Shapes are broken up by all of the

paraphernalia

• Simpson’s shocker

• Still images from “interesting” videos

– Images are badly pixelated

– Colors are muddy and smudged

Image Filtering Issues

• Accuracy:

– Shape recognition misses lots of images it

shouldn’t (false negatives)

– Skin detection filters lots of images it

shouldn’t (false positives)

– Best skin detection systems are about

80% accurate

– Best shape recognition systems are about

40% accurate

Image Filtering Issues

• Performance:

– Image filtering requires huge amounts of

memory, CPU time, and disk bandwidth.

– Unacceptably slows down most sites’ email servers and filtering systems.

– DL380 benchmark:

• ~1.2 million messages/hour with no filtering

• ~195,000 messages/hour with skin detection

• ~69,000 messages/hour with shape recognition
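To put those benchmark figures in perspective, this short calculation works out the slowdown each filter implies. The numbers are the ones quoted above; the variable and function names are only for illustration.

```python
BASELINE = 1_200_000  # messages/hour with no filtering

RATES = {
    "skin detection": 195_000,
    "shape recognition": 69_000,
}


def slowdown(baseline, rate):
    """How many times slower the server runs compared with no filtering."""
    return baseline / rate


for name, rate in RATES.items():
    factor = slowdown(BASELINE, rate)
    remaining = 100 * rate / BASELINE
    print(f"{name}: about {factor:.1f}x slower, {remaining:.1f}% of baseline throughput")
```

That is roughly a 6x slowdown for skin detection and a 17x slowdown for shape recognition.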

Image Filtering Issues

• Diminishing returns on accuracy - most

spam filters won’t see a noticeable

increase in accuracy with the addition

of image filtering.

• That’s likely to change in the future as

spammers discover it’s one of the

better options for circumventing current

solutions.

I Wanna Play!

• Shape recognition:

– UC Berkeley’s blobworld (open source): http://elib.cs.berkeley.edu/

• Skin detection:

– No good open-source examples

– Trivial to write your own using ImageMagick: http://www.imagemagick.org/

Quick Review

• We covered:

– How and why images appear in spam

– Why the use of images in spam is likely to

increase

– Two methods for filtering images

– Examples of how the two methods work

and don’t work

– Why image filtering isn’t widely used at

this point.