Searching Images by Color Using Solr

Post on 02-Jul-2015

Slides from "Searching 35 Million Images by Color Using Solr", presented by Chris Becker at Lucene/Solr Revolution 2014 in Washington, D.C.

Transcript of Searching Images by Color Using Solr

Searching Images by Color

Chris Becker

Search Engineering @ Shutterstock

What is Shutterstock?

• Shutterstock sells stock images, videos & music.

• Crowdsourced from artists around the world

• Shutterstock reviews and indexes them for search

• Customers buy a subscription and download them

Why search by color?

Stock photography on the internet…

images from www.shutterstock.com

Color is one of many visual attributes that you can use to create an engaging image search experience

Diving into Color Data

Color Spaces

• RGB

• HSL

• Lab

• LCH

images from www.wikipedia.org

Calculating Distances Between Colors

• Euclidean distance works reasonably well in any color space

distRGB = sqrt((r1-r2)^2 + (g1-g2)^2 + (b1-b2)^2)

distHSL = sqrt((h1-h2)^2 + (s1-s2)^2 + (l1-l2)^2)

distLCH = sqrt((L1-L2)^2 + (C1-C2)^2 + (H1-H2)^2)

distLAB = sqrt((L1-L2)^2 + (a1-a2)^2 + (b1-b2)^2)

• More sophisticated equations that better account for human perception can be found at http://en.wikipedia.org/wiki/Color_difference
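The Euclidean distance formulas above can be sketched in a few lines of Python. The function names here are illustrative and not from the slides. The second function is an added assumption: because hue is an angle, it measures the hue difference the short way around the color wheel, which the plain formula does not do.

```python
import math

def euclidean_distance(c1, c2):
    """Straight-line distance between two colors given as 3-tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def hue_aware_distance(c1, c2):
    """Distance for (hue, s, l) tuples with hue in degrees [0, 360).

    Assumption (not in the slides): hue wraps around, so 350 and 10
    are treated as 20 degrees apart rather than 340.
    """
    dh = abs(c1[0] - c2[0]) % 360
    dh = min(dh, 360 - dh)
    return math.sqrt(dh ** 2 + (c1[1] - c2[1]) ** 2 + (c1[2] - c2[2]) ** 2)

red = (255, 0, 0)
orange = (255, 153, 0)
blue = (0, 0, 255)

print(euclidean_distance(red, orange))  # 153.0
print(euclidean_distance(red, blue) > euclidean_distance(red, orange))  # True
```

In RGB, red comes out closer to orange than to blue, which matches intuition even with this simple metric.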

Images are just numbers

[

[[054,087,058], [054,116,206], [017,226,194], [234,203,215], [188,205,000], [229,156,182]],

[[214,238,109], [064,190,104], [191,024,161], [104,071,036], [222,081,005], [204,012,113]],

[[197,100,189], [159,204,024], [228,214,054], [250,098,125], [050,144,093], [021,122,101]],

[[255,146,010], [115,156,002], [174,023,137], [161,141,077], [154,189,005], [242,170,074]],

[[113,146,064], [196,057,200], [123,203,160], [066,090,234], [200,186,103], [099,074,037]],

[[194,022,018], [226,045,008], [123,023,087], [171,029,021], [040,001,143], [255,083,194]],

[[115,186,246], [025,064,109], [029,071,001], [140,031,002], [248,170,244], [134,112,252]],

[[116,179,059], [217,205,159], [157,060,251], [151,205,058], [036,214,075], [107,103,130]],

[[052,003,227], [184,037,078], [161,155,181], [051,070,186], [082,235,108], [129,233,211]],

[[047,212,209], [250,236,085], [038,128,148], [115,171,113], [186,092,227], [198,130,024]],

[[225,210,064], [123,049,199], [173,207,164], [161,069,220], [002,228,184], [170,248,075]],

[[234,157,201], [168,027,113], [117,080,236], [168,131,247], [028,177,060], [187,147,084]],

[[184,166,096], [107,117,037], [154,208,093], [237,090,188], [007,076,086], [224,239,210]],

[[105,230,058], [002,122,240], [036,151,107], [101,023,149], [048,010,225], [109,102,195]],

[[050,019,169], [219,235,027], [061,064,133], [218,221,113], [009,032,125], [109,151,137]],

[[010,037,189], [216,010,101], [000,037,084], [166,225,127], [203,067,214], [110,020,245]],

[[180,147,130], [045,251,177], [127,175,215], [237,161,084], [208,027,218], [244,194,034]],

[[089,235,226], [106,219,220], [010,040,006], [094,138,058], [148,081,166], [249,216,177]],

[[121,110,034], [007,232,255], [214,052,035], [086,100,020], [191,064,105], [129,254,207]],

]

• getting histograms

• computing median values

• standard deviations / variance

• other statistics

Any operation you can do on a set of numbers, you can do on an image
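As a rough illustration of treating an image as plain numbers, the sketch below runs the operations listed above on a tiny pixel grid using only the Python standard library. The six pixels are the first three from each of the first two rows of the matrix shown earlier; the variable names are illustrative.

```python
import statistics
from collections import Counter

# A tiny 2x3 "image": rows of [r, g, b] pixels
pixels = [
    [[54, 87, 58], [54, 116, 206], [17, 226, 194]],
    [[214, 238, 109], [64, 190, 104], [191, 24, 161]],
]

# Flatten the grid and pull out one channel (index 0 = red)
red = [pixel[0] for row in pixels for pixel in row]

histogram = Counter(red)              # value -> number of pixels
median = statistics.median(red)       # middle value of the channel
std_dev = statistics.pstdev(red)      # population standard deviation
variance = statistics.pvariance(red)  # population variance

print(histogram.most_common(1))  # [(54, 2)]
print(median)                    # 59.0
```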

Extracting Color Data

Tools & Libraries

• ImageMagick

• Python Imaging Library (PIL)

• ImageJ

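# Python example to get a color histogram from an image

from pprint import pprint

from PIL import Image

# convert to RGB so that every color is an (r, g, b) tuple
image = Image.open('./samplephoto.jpg').convert('RGB')
width, height = image.size

# getcolors() returns a list of (pixel_count, (r, g, b)) pairs;
# passing width*height as the limit ensures it never returns None
colors = image.getcolors(width * height)

# map each hex color code to the number of pixels with that color
hist = {}
for count, (r, g, b) in colors:
    hex_code = '%02x%02x%02x' % (r, g, b)
    hist[hex_code] = count

pprint(hist)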

Indexing & Searching in Solr

Indexing color histograms

color_txt = "cfebc2 cfebc2 cfebc2 cfebc2 cfebc2 cfebc2 cfebc2 cfebc2 cfebc2 cfebc2 95bf40 95bf40 95bf40 95bf40 95bf40 95bf40 2e6b2e 2e6b2e 2e6b2e ff0000 …"

• index colors just like you would index text

• amount of color = frequency of the term
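One possible way to turn a histogram into such a text field is sketched below. The slides do not say how pixel counts were scaled into repeated terms, so the `total_terms` parameter and the proportional rounding are assumptions, and `histogram_to_terms` is a hypothetical helper name.

```python
def histogram_to_terms(hist, total_terms=100):
    """Repeat each hex color in proportion to its share of the pixels.

    hist maps hex color -> pixel count. Colors whose share rounds to
    zero terms are dropped (an assumption, not from the slides).
    """
    total_pixels = sum(hist.values())
    tokens = []
    # Most frequent colors first, purely for readability of the output
    for color, count in sorted(hist.items(), key=lambda kv: -kv[1]):
        repeats = round(count * total_terms / total_pixels)
        tokens.extend([color] * repeats)
    return ' '.join(tokens)

# Made-up pixel counts for illustration
hist = {'cfebc2': 500, '95bf40': 300, '2e6b2e': 150, 'ff0000': 50}
color_txt = histogram_to_terms(hist, total_terms=20)
print(color_txt)
```

With these made-up counts, the field contains 10 copies of cfebc2, 6 of 95bf40, 3 of 2e6b2e and 1 of ff0000, so the term frequency of each color tracks how much of the image it covers.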

Solr Schema & Queries

• Solr's default ranking can be used effectively

/solr/select?q=ff0000 e2c2d2&qf=color&defType=edismax…

• or use term frequencies directly for specific sort functions:

sort=product(tf(color,"ff0000"),tf(color,"e2c2d2")) desc

<field name="color" type="text_ws" …>
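The queries above can be assembled programmatically. The sketch below builds the same parameters with the Python standard library; `tf_product_sort` is a hypothetical helper, and the `/solr/select` path is copied from the slide rather than taken from any particular deployment.

```python
from urllib.parse import urlencode

def tf_product_sort(field, colors):
    """Build a sort clause that multiplies the term frequency of each color."""
    tfs = ['tf(%s,"%s")' % (field, color) for color in colors]
    if len(tfs) == 1:
        return tfs[0] + ' desc'
    return 'product(%s) desc' % ','.join(tfs)

colors = ['ff0000', 'e2c2d2']
params = {
    'q': ' '.join(colors),  # match documents containing the color terms
    'qf': 'color',
    'defType': 'edismax',
    'sort': tf_product_sort('color', colors),
}
url = '/solr/select?' + urlencode(params)
print(params['sort'])
print(url)
```

urlencode takes care of escaping the spaces, quotes and parentheses, which are easy to get wrong when building the URL by hand.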

Indexing color statistics

lightness:

    median: 2

    standard dev: 1

    largest bin: 0

    largest bin size: 50

saturation:

    median: 0

    standard dev: 0

    largest bin: 0

    largest bin size: 100

Represent aggregate statistics of each image
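A rough sketch of how such per-image statistics might be computed is shown below. The slides do not define the bins, so this assumes each channel is bucketed into 10 equal bins and that "largest bin size" is the percentage of pixels in the most populated bin; `channel_stats` is a hypothetical helper.

```python
import colorsys
import statistics
from collections import Counter

def channel_stats(values, bins=10):
    """Summarize channel values in [0, 1] after bucketing them into bins."""
    buckets = [min(int(v * bins), bins - 1) for v in values]
    largest_bin, largest_count = Counter(buckets).most_common(1)[0]
    return {
        'median': int(statistics.median(buckets)),
        'standard_dev': round(statistics.pstdev(buckets)),
        'largest_bin': largest_bin,
        # share of pixels in the most populated bin, as a percentage
        'largest_bin_size': round(100 * largest_count / len(buckets)),
    }

# Made-up 4-pixel grayscale image: two black, one gray, one white
pixels = [(0, 0, 0), (0, 0, 0), (128, 128, 128), (255, 255, 255)]

# colorsys returns (hue, lightness, saturation), each in [0, 1]
hls = [colorsys.rgb_to_hls(r / 255, g / 255, b / 255) for r, g, b in pixels]

lightness = channel_stats([l for _, l, _ in hls])
saturation = channel_stats([s for _, _, s in hls])
print(lightness)
print(saturation)
```

Each resulting value could then be stored in its own numeric field, such as a median field per channel.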

Solr Fields & Queries

• Sort by the distance between the input parameter and the median value for each image

/solr/select?q=*&sort=abs(sub($query,hue_median)) asc

<field name="hue_median" type="int" …>
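To make the effect of that sort concrete, the sketch below mimics it in plain Python on a few made-up documents. It only illustrates the ordering that abs(sub($query,hue_median)) asc would produce, not how Solr evaluates it; the document IDs and values are invented.

```python
def sort_by_median_distance(docs, query, field='hue_median'):
    """Order documents by |query - median|, smallest distance first."""
    return sorted(docs, key=lambda doc: abs(query - doc[field]))

docs = [
    {'id': 'a', 'hue_median': 10},
    {'id': 'b', 'hue_median': 200},
    {'id': 'c', 'hue_median': 120},
]

ranked = sort_by_median_distance(docs, query=130)
print([doc['id'] for doc in ranked])  # ['c', 'b', 'a']
```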

Ranking & Relevance

How much of the image has the color [color swatch]?

image from www.shutterstock.com

Is this relevant if I search for [color swatch]?

image from www.shutterstock.com

Which image is more relevant if I search for [color swatch]?

image from www.shutterstock.com

Is this relevant if I search for [color swatch]?

image from www.shutterstock.com

How do we account for these factors?

How much of the image contains the selected color?

• Score each color by the number of pixels

sort=tf(color,"cfebc2") desc

Balance Precision and Recall

• Reduce your color space enough to balance:

    • color accuracy

    • index size

    • query complexity

    • result counts

• You only need 100-200 colors for a good UX
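One simple way to reduce the color space is to snap each RGB channel to a small number of evenly spaced levels. The sketch below uses 5 levels per channel, which gives 5 x 5 x 5 = 125 colors, inside the 100-200 range mentioned above. This uniform quantization is an assumption for illustration; the slides do not say which reduction method was used.

```python
def quantize(rgb, levels=5):
    """Snap each channel (0-255) to the nearest of `levels` evenly spaced values."""
    step = 255 / (levels - 1)
    return tuple(int(round(round(c / step) * step)) for c in rgb)

def to_hex(rgb):
    """Format an (r, g, b) tuple as a 6-digit hex string."""
    return '%02x%02x%02x' % rgb

original = (0xcf, 0xeb, 0xc2)  # cfebc2, one of the colors indexed earlier
reduced = quantize(original)
print(to_hex(original), '->', to_hex(reduced))  # cfebc2 -> bfffbf
```

Fewer levels mean fewer distinct terms in the index and more results per color, at the cost of color accuracy.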

Weighing Multiple Colors Together

• If you search for 2 or more colors, the top result should have the most even distribution of those colors

• Simple option: sort=product(tf(color,"ff9900"),tf(color,"2280e2")) desc

• More complex: compute the standard deviation or variance of the term frequencies of matching color values for each image, and sort the results with the lowest variance first.
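Both options can be compared offline with a small sketch. The term frequencies below are made up, and `rank_by_variance` and `rank_by_product` are hypothetical helpers that mimic the two sort strategies outside of Solr; excluding images that lack one of the colors is an assumption.

```python
import math
import statistics

def matching_freqs(tf, colors):
    """Term frequencies of the query colors in one image (0 if absent)."""
    return [tf.get(color, 0) for color in colors]

def rank_by_product(images, colors):
    """Simple option: highest product of term frequencies first."""
    return sorted(images, key=lambda i: -math.prod(matching_freqs(images[i], colors)))

def rank_by_variance(images, colors):
    """More complex option: lowest variance of term frequencies first.

    Images missing any query color are excluded (an assumption).
    """
    scored = []
    for image_id, tf in images.items():
        freqs = matching_freqs(tf, colors)
        if all(freqs):
            scored.append((statistics.pvariance(freqs), image_id))
    return [image_id for _, image_id in sorted(scored)]

images = {
    'even': {'ff9900': 50, '2280e2': 50},
    'skewed': {'ff9900': 90, '2280e2': 10},
    'one_color': {'ff9900': 100},
}
colors = ['ff9900', '2280e2']
print(rank_by_product(images, colors))   # ['even', 'skewed', 'one_color']
print(rank_by_variance(images, colors))  # ['even', 'skewed']
```

The product works as a simple option because, for a fixed total, it is largest when the frequencies are equal. Variance alone ignores how much of the image the colors cover, so it may need to be combined with the term frequencies themselves.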

Weighing Similar & Different Colors

• The score for one color should reflect all the colors in the image.

• At indexing time, increase the score based on similar colors; decrease it based on differing colors.
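The slides do not give a formula for this adjustment, so the sketch below is only one hypothetical scheme. Each color's score is the sum of every color's pixel share, weighted by how close that color is in RGB space: the weight is 1 for the color itself, falls to 0 at an assumed `radius`, and turns negative beyond it. It uses math.dist, which needs Python 3.8 or later.

```python
import math

def hex_to_rgb(code):
    """Convert a 6-digit hex string to an (r, g, b) tuple."""
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))

def adjusted_scores(hist, radius=100.0):
    """hist maps hex color -> share of the image (for example, a percentage).

    Hypothetical weighting: nearby colors add to a score, distant colors
    subtract from it, and the result is never negative.
    """
    scores = {}
    for color in hist:
        score = 0.0
        for other, share in hist.items():
            distance = math.dist(hex_to_rgb(color), hex_to_rgb(other))
            weight = max(-1.0, 1.0 - distance / radius)
            score += share * weight
        scores[color] = max(0.0, score)
    return scores

# Made-up shares: two similar reds and a small amount of blue
hist = {'ff0000': 40, 'ee1111': 40, '0000ff': 20}
scores = adjusted_scores(hist)
print(scores)
```

In this example the two reds reinforce each other, while the blue, which is far from most of the image, drops to zero. The adjusted scores would then decide how many times each color term is repeated in the indexed text field.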

Conclusion

• Steps for building color search in Solr:

    • Extract colors using a tool like the Python Imaging Library

    • Score colors based on the number of pixels

    • Adjust scores based on similar / different colors

    • Index colors into Solr as a text document

    • In your query, sort by the term frequency values for each color

One more demo…