@IndeedEng: Tokens and Millicents - technical challenges in launching Indeed around the world

go.indeed.com/IndeedEngTalks


Tokens and Millicents
Technical challenges in launching Indeed around the world

Dan Heller, Engineering Director

We help people get jobs.

what | where
job title, keywords or company name | city, state or zip code
software | austin | Find Jobs

was | wo
job title, keywords or company name | city, state or zip code
produktionshelfer | münchen | Jobs finden

キーワード | 勤務地
職種、キーワード、会社名など | 都道府県名または市区町村名
登録栄養士 | 大阪 | 求人検索

τι | που
τίτλος θέσης εργασίας, λέξεις-κλειδιά ή όνομα εταιρείας | πόλη ή πολιτεία
βοηθός λογιστή | Αθήνα | Εύρεση θέσεων εργασίας

Preetha Appan, Software Engineer

Precision and Recall

[Venn diagram: Relevant Jobs and Returned Jobs as overlapping subsets of All Jobs]

Precision: Positive Predictive Value

Precision = (# Returned and relevant) / (# Returned)

Precision

Job seeker searches for “architect”, which usually means “building architect”

10 jobs returned:
● 8 building architect jobs: Relevant
● 2 software architect jobs: Not Relevant

Precision: 8 / 10

Recall: Sensitivity

Recall = (# Returned and relevant) / (# Relevant)

Recall

Job seeker searches for “hr”

Jobs that mention “hr” or “human resources” are both relevant to the job seeker.

10 jobs are relevant:
● 7 hr jobs: Returned
● 3 human resources jobs: Not Returned

Recall: 7 / 10
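The two ratios can be checked with a few lines of Python. This is only a sketch of the definitions on the slides; the job ids and the helper names `precision` and `recall` are made up for illustration.

```python
def precision(returned: set, relevant: set) -> float:
    """Fraction of the returned jobs that are relevant."""
    return len(returned & relevant) / len(returned)


def recall(returned: set, relevant: set) -> float:
    """Fraction of the relevant jobs that were returned."""
    return len(returned & relevant) / len(relevant)


# "architect": 10 jobs returned, 8 of them (building architect) relevant.
arch_returned = {f"job{i}" for i in range(10)}
arch_relevant = {f"job{i}" for i in range(8)}

# "hr": 10 relevant jobs, only the 7 that literally say "hr" are returned.
hr_relevant = {f"job{i}" for i in range(10)}
hr_returned = {f"job{i}" for i in range(7)}

print(precision(arch_returned, arch_relevant))  # 0.8
print(recall(hr_returned, hr_relevant))         # 0.7
```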

Improving Recall in Job Search

Job Description - English

Senior Software Engineer - Search
Indeed - Austin, TX

Indeed.com is seeking a Senior Software Engineer responsible for the information retrieval system that powers Indeed’s job search website. If you are an engineer who's passionate about building innovative products...

Tokenization

Inverted Index

● Like index in the back of a book
● words = tokens, page numbers = doc ids

Inverted Index

Token Job A Job B Job C

assistant ✔

developer ✔

engineer ✔

lawyer ✔ ✔

paralegal ✔ ✔

retrieval ✔

Inverted Indexes

Allow you to:
● Quickly find all documents containing a token
● Perform boolean queries, e.g. “java AND developer”
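A minimal in-memory sketch of the idea, not Lucene itself: tokens map to sets of doc ids, and an AND query is a set intersection. The job texts and the helper names `build_index` and `search_and` are invented for the example, and tokenization is a plain whitespace split.

```python
from collections import defaultdict


def build_index(docs: dict) -> dict:
    """Map each token to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_and(index: dict, *tokens: str) -> set:
    """Boolean AND: doc ids that contain every token."""
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings) if postings else set()


jobs = {
    "A": "senior java developer",
    "B": "java architect",
    "C": "paralegal assistant",
}
index = build_index(jobs)
print(search_and(index, "java", "developer"))  # {'A'}
```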

Apache Lucene

● Open source inverted index implementation
● Fast, widely used

Tokenization with Lucene

StandardAnalyzer
● Uses space and punctuation to determine token boundaries

StandardAnalyzer - problems

● C++, C# → C

● O’Reilly → O, Reilly

Tokenization with Lucene

JobAnalyzer
● Forked StandardAnalyzer
● Modified it to make it work for jobs

Job Description - French

Secrétaire
Saclay

Au sein de la direction de la Qualité et de l'Environnement (DQE) vous seconderez la secrétaire-assistante. Vos principales missions seront :
- organisation de réunions
- l'accueil téléphonique
- la gestion des missions ..

Chinese, Japanese, Korean (CJK)

Job Description - Chinese

岗位描述:1、全厂电气设备的日常检查、记录,在操作工或主操的指导下进行工艺操作. 2、现场液体充装,现场充装安全的管理. 3、负责现场工作环境的整洁. ...

Job Description - Japanese

ちょっと想像してみてください。 ご近所のサーティワンにあなたが企画開発

Three scripts in one sentence:
● Kanji (e.g. 想像)
● Hiragana (e.g. ちょっと)
● Katakana (e.g. サーティワン)

Chinese using JobAnalyzer

全厂电气设备的日常检查、记录,
在操作工或主操的指导下进行工艺操作.

With no spaces between words, each punctuation-delimited run becomes a single token, e.g. 全厂电气设备的日常检查 (“Daily inspection of electrical equipment plant-wide”)

JobAnalyzer in CJK = Poor recall

CJKAnalyzer - bigrams

医療事務兼検査助手
● 医療 → medical
● 療事 → ????
● 事務 → affairs

Use bigram tokenizer on query

“東京都” (Tokyo prefecture)
● 東京 → Tokyo
● 京都 → Kyoto
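The precision problem is easy to reproduce with a toy overlapping-bigram function. This is a sketch of the bigram idea only, not Lucene's CJKAnalyzer; the `bigrams` helper and the sample job text 京都市 (Kyoto city) are assumptions for illustration.

```python
def bigrams(text: str) -> list:
    """Split text into overlapping two-character tokens."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


query_terms = bigrams("東京都")   # Tokyo prefecture
kyoto_terms = bigrams("京都市")   # Kyoto city (hypothetical job location)

print(query_terms)                          # ['東京', '京都']
print(set(query_terms) & set(kyoto_terms))  # {'京都'}
```

A job located in Kyoto shares the bigram 京都 with a query for Tokyo prefecture, so an OR over the query bigrams would match it.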

Bigram tokenizer Drawbacks

● Poor precision
● Too many terms

Properly tokenize CJK

Accent and gender normalization

● secrétaire, secretaire
● vendeur, vendeuse
● promotor@s

Language Detection

Language Detection options

● HTTP Content-Language response header
  ○ Most sites don’t provide this header
  ○ May not be accurate

Language Detection - ICU4J

● ICU4J’s CharsetDetector
  ○ Works well for languages with single byte encoded characters
  ○ Detects that language is one of Danish, Dutch, English, French, German, Italian, Portuguese, Swedish

Naive Bayesian classifier

● Features - words

● Strong independence assumption

● Class label - language

Naive Bayesian Language detector

● Hand labelled training data in each language
● For each language, calculate P(wi ϵ Lj)
  ○ P(“experience” ϵ en) = 0.85
● Score a document for language Lj as the product P(w1 ϵ Lj) * P(w2 ϵ Lj) * P(w3 ϵ Lj) * ..
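A toy version of such a detector, under several assumptions the slides do not state: word likelihoods are estimated from raw counts with add-one smoothing, logs are summed instead of multiplying probabilities (to avoid underflow), and the training sentences are invented. It sketches the general technique and is not Indeed's implementation.

```python
import math
from collections import Counter


def train(samples: dict) -> dict:
    """Count word occurrences per language from hand-labelled text."""
    return {
        lang: Counter(word for text in texts for word in text.lower().split())
        for lang, texts in samples.items()
    }


def detect(text: str, counts: dict) -> str:
    """Return the language with the highest naive Bayes score."""
    vocab_size = len({word for c in counts.values() for word in c})
    best_lang, best_score = None, -math.inf
    for lang, c in counts.items():
        total = sum(c.values())
        # Sum of logs == log of P(w1|L) * P(w2|L) * ...; +1 is add-one smoothing.
        score = sum(
            math.log((c[word] + 1) / (total + vocab_size))
            for word in text.lower().split()
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


samples = {
    "en": ["experience with java required", "we are seeking a developer"],
    "fr": ["vous seconderez la secrétaire", "organisation de réunions"],
}
model = train(samples)
print(detect("seeking java experience", model))  # en
print(detect("la gestion de réunions", model))   # fr
```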

Using Unicode Blocks

[Figure: the Thai and Greek Unicode blocks, each marked with its min and max code point]

● 100% accurate
● Used in:
  ○ Thai
  ○ Greek
  ○ Korean
  ○ Hebrew
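A range check like this is short to write. The block boundaries below are the standard Unicode ranges for Thai, Greek and Coptic, Hebrew, and Hangul Syllables; the function name `detect_by_block` and the 50% threshold are assumptions made for this sketch.

```python
# (min, max) code points of the Unicode block used by each language.
BLOCKS = {
    "th": (0x0E00, 0x0E7F),  # Thai
    "el": (0x0370, 0x03FF),  # Greek and Coptic
    "he": (0x0590, 0x05FF),  # Hebrew
    "ko": (0xAC00, 0xD7AF),  # Hangul Syllables
}


def detect_by_block(text: str, threshold: float = 0.5):
    """Return a language code if enough letters fall inside one block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return None
    for lang, (lo, hi) in BLOCKS.items():
        hits = sum(1 for ch in letters if lo <= ord(ch) <= hi)
        if hits / len(letters) >= threshold:
            return lang
    return None


print(detect_by_block("βοηθός λογιστή"))     # el
print(detect_by_block("software engineer"))  # None
```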

CJ language detection

● Strongly weight Hiragana and Katakana

● Some characters (Kanji) common between Chinese and Japanese

● p(卒 ϵ ja) = 0.99, p(卒 ϵ zh) = 0.000001

Language Results

● Did cross validation on hand labeled testing data

● 99% accurate for text > 30 characters
  ○ Average job description is 200 characters

● Fast - 0.6ms per job

Other language detectors

Google - https://code.google.com/p/cld2/

CJK Tokenization

CJK tokenizers

● Dictionary-based
● Statistical model & dictionary

Dictionary-based tokenizers

● Dictionary of words in language

● Scan input sentence, return all possible tokenizations
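A recursive sketch of "return all possible tokenizations". The seven-word dictionary is hypothetical and chosen only so the example sentence has more than one valid split; real tokenizers use far larger dictionaries and more efficient data structures.

```python
def all_tokenizations(sentence: str, dictionary: set) -> list:
    """Return every way to split sentence into dictionary words."""
    if not sentence:
        return [[]]
    results = []
    for end in range(1, len(sentence) + 1):
        word = sentence[:end]
        if word in dictionary:
            # Keep this word and tokenize whatever follows it.
            for rest in all_tokenizations(sentence[end:], dictionary):
                results.append([word] + rest)
    return results


# Hypothetical mini-dictionary.
dictionary = {"北京", "北京大学", "大学生", "生前", "前来", "来应聘", "应聘"}

for tokens in all_tokenizations("北京大学生前来应聘", dictionary):
    print(" ".join(tokens))
# 北京 大学生 前来 应聘
# 北京大学 生前 来应聘
```

Both splits are valid dictionary tokenizations, which is why a dictionary alone cannot pick the right one.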

Context matters

北京大学生前来应聘

Tokenization 1:
北京 大学生 前来 应聘
Beijing | college students | come to | apply jobs

Tokenization 2:
北京大学 生前 来应聘
Peking University | before death | come to apply jobs

Hidden Markov Model

[Diagram: candidate token paths for 北京大学生前来应聘. The path 北京大学 (Peking University) → 生前 (before death) is marked ✘; the alternative path is 北京 (Beijing) → 大学生 (college student). An earlier frame also shows 中国 (China) as an example token.]

CJK tokenizers

● Chinese - Imdict
● Japanese - Sen
● Korean - LuceneKorean

Chinese tokenization

http://nlp.stanford.edu/projects/chinese-nlp.shtml

More recall challenges

● Different rules per language around
  ○ Gender
  ○ Plurals
  ○ Collation

Apply language specific rules to transform words to canonical form

Use detected language

Stemming

What is stemming?

the process of turning multiple variations of a word into a single equivalent root

Stemming examples

● driver, drivers → driver
● secretaire, secrétaire → secretaire
● vendeur, vendeuse → vendeur

Why stemming matters

● Return all possible relevant jobs given the user’s query, not just exact matches

Stemming - Lucene Analyzers

● Do stemming before adding to inverted index

● Examples
  ○ PorterStemFilter
  ○ SnowballAnalyzer
  ○ EnglishMinimalStemmer

Inverted Index

Job A: Directrice de Documentaires
Job B: Directeur de production

Token          Job A  Job B
de             ✔      ✔
directeur      ✔      ✔
documentaires  ✔
production            ✔

Search with stemming tokenizers

● At search time, use the same analyzer on the query
  ○ “directrice” → “directeur”

● Search for “directrice” returns both jobs

Modifying stem rules requires a full index rebuild

● If roots have changed, need to re-process all jobs

Drawbacks

● Loss of precise information
  ○ “Directrice” search should return exact match only

Decouple stemming from indexing

Term Expansion Maps


● Map from String->List<String>

● Key is root, values are tokens that stem to that root
  ○ driver → driver, drivers
  ○ vendeur → vendeur, vendeuse

Stemmer interface

● One method
  ○ String stem(String token)
● Many implementations
  ○ EnglishStemmer
  ○ FrenchStemmer
  ○ GermanStemmer
  ○ SpanishStemmer

Building term expansion map

for each language
    for each term in language
        root = Stemmer.stem(term)
        termMap[root].append(term)

● Takes ~1.5 minutes on index with 2 million tokens and 18 languages
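The loop above can be sketched in runnable form for a single language. `toy_stem` is a hypothetical stand-in for a real `Stemmer.stem` implementation, with three hard-coded suffix rules that are just enough to reproduce the examples from the slides.

```python
from collections import defaultdict


def toy_stem(term: str) -> str:
    """Hypothetical stand-in for a per-language Stemmer.stem()."""
    for suffix, replacement in (("euse", "eur"), ("trice", "teur"), ("s", "")):
        if term.endswith(suffix):
            return term[: len(term) - len(suffix)] + replacement
    return term


def build_term_map(index_terms, stem) -> dict:
    """Group every term found in the index under the root it stems to."""
    term_map = defaultdict(list)
    for term in sorted(index_terms):
        term_map[stem(term)].append(term)
    return dict(term_map)


terms = {"driver", "drivers", "vendeur", "vendeuse", "directeur", "directrice"}
term_map = build_term_map(terms, toy_stem)
print(term_map["driver"])     # ['driver', 'drivers']
print(term_map["vendeur"])    # ['vendeur', 'vendeuse']
print(term_map["directeur"])  # ['directeur', 'directrice']
```

Because the map is derived from terms already in the index, changing a rule only means rebuilding the map, not re-processing the jobs.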

Using term expansion map

Job A: Directrice de documentaires
Job B: Directeur de production

Before (stemmed at index time):

Token          Job A  Job B
de             ✔      ✔
directeur      ✔      ✔
documentaires  ✔
production            ✔

After (stemming decoupled from indexing):

Token          Job A  Job B
de             ✔      ✔
directeur             ✔
directrice     ✔
documentaires  ✔
production            ✔

Search Service

“directrice” → French Stemmer → “directeur” → Term Expansion Map → Query Rewriter → “directrice” OR “directeur”
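The rewrite step is small once the stemmer and the map exist. In this sketch `STEMS` and `TERM_MAP` are hard-coded, hypothetical stand-ins for the French stemmer and the term expansion map, and `rewrite` is an illustrative name, not the search service's API.

```python
# Hypothetical stemmer output and term expansion map for two French terms.
STEMS = {"directrice": "directeur", "directeur": "directeur"}
TERM_MAP = {"directeur": ["directeur", "directrice"]}


def rewrite(term: str) -> str:
    """Expand one query term into an OR over every term sharing its root."""
    root = STEMS.get(term, term)
    variants = TERM_MAP.get(root, [term])
    # Keep the term the job seeker typed first, then the other variants.
    ordered = [term] + [v for v in variants if v != term]
    return " OR ".join(f'"{v}"' for v in ordered)


print(rewrite("directrice"))  # "directrice" OR "directeur"
print(rewrite("production"))  # "production"
```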


Benefits

● Modifying stem rules doesn’t require index rebuilds
  ○ Takes minutes on index with millions of jobs
  ○ Had flexibility to iteratively implement stemming rules as we came across different use cases
● Precise information
  ○ “directrice” search query returns exact match only

Code deploy to change rules or add languages

49 team members
26 nationalities
18 languages

Scale Stemming

● Indeed continued international expansion

● Needed stemming to scale without code deploys and coordination between developers and country managers.

Goals

● Efficient
  ○ Store term expansion maps efficiently
  ○ Search time as fast as possible
● Generic
  ○ Identify patterns common to all languages
    ■ ies→y in English, se→r in French
● Comprehensive
  ○ Support all use cases we care about:
    ■ plurals
    ■ synonyms
    ■ abbreviations
    ■ accent collation
    ■ gender suffixes
● Scalable
  ○ Adding a new language shouldn’t need a code deploy

Rule driven stemming
one stemmer. all languages.

What is a stemming rule

● Rules transform tokens into their root form

Rule attributes

● Rules have “from” (origin) and “to” (replacement)

Rule attributes

● Rules have a type
  ○ Types define exactly how the text transformation happens

Rule type - exact

● Change origin to replacement when it’s an exact match

Exact rule

English
● sr→senior
● attorney→lawyer

Italian
● colf→domestica

Dutch
● leraar→docent

Rule type - substring

● Change all occurrences of origin to replacement

Substring rule

English - é→e
● résumé → resume
● café → cafe

German - ä→a
● verkäufer → verkaufer

French - ô→o
● hôtesse → hotesse

Rule type - suffix

● Change origin to replacement if it matches at the end of token

Suffix Rule - English

● ies→y
  ○ families → family
  ○ policies → policy
● s→’’
  ○ nurses → nurse
  ○ drivers → driver

Suffix Rule - French

● euse→eur
  ○ serveuse → serveur
● ienne→ien
  ○ gardienne → gardien

Rules are ordered

Order matters

Stem “families”

Rules
● s→’’
● ies→y

Apply s→’’
● families → familie
● ies→y can no longer match, so the more specific rule has to come first

Rules can be marked as terminal

● No more rules applied after terminal rule

Prevent over-stemming

● s → ‘’ can cause this → thi

● Min Length - special terminal rule

● Usually set to anywhere from 3 to 5
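The rule types, ordering, and terminal flag fit in a small interpreter. This is a sketch under assumptions: the `Rule` class and `stem` function are illustrative names, and the min-length behaviour (stop as soon as a rule would shrink the token below `min_length`) is a guess at the semantics, which the slides do not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    kind: str            # "exact", "substring", or "suffix"
    origin: str          # the "from" text
    replacement: str     # the "to" text
    terminal: bool = False


def apply_rule(rule: Rule, token: str) -> str:
    if rule.kind == "exact":
        return rule.replacement if token == rule.origin else token
    if rule.kind == "substring":
        return token.replace(rule.origin, rule.replacement)
    if rule.kind == "suffix":
        if token.endswith(rule.origin):
            return token[: len(token) - len(rule.origin)] + rule.replacement
        return token
    raise ValueError(f"unknown rule kind: {rule.kind}")


def stem(token: str, rules: list, min_length: int = 4) -> str:
    """Apply rules in order; stop at a terminal rule or the min length."""
    for rule in rules:
        candidate = apply_rule(rule, token)
        if candidate == token:
            continue
        # Assumed min-length semantics: never shrink below min_length.
        if len(candidate) < min_length and len(candidate) < len(token):
            break
        token = candidate
        if rule.terminal:
            break
    return token


EN = [
    Rule("exact", "sr", "senior", terminal=True),
    Rule("substring", "é", "e"),
    Rule("suffix", "ies", "y", terminal=True),
    Rule("suffix", "s", ""),
]
FR = [Rule("suffix", "s", ""), Rule("suffix", "trice", "teur")]
WRONG_ORDER = [Rule("suffix", "s", ""), Rule("suffix", "ies", "y")]

print(stem("families", EN))           # family
print(stem("this", EN))               # this
print(stem("directrices", FR))        # directeur
print(stem("families", WRONG_ORDER))  # familie
```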

Babelfish: Stem rule editor

● Webapp to edit and publish rules

● Rules interpreted by generic stemmer

● 27 languages

Stem rule editor


Ability to audit rules

directrices
● → directrice: suffix rule “s” → “”
● → directeur: suffix rule “trice” → “teur”

ingénieur
● → ingenieur: substring rule “é” → “e”

[Architecture diagram components]
● Country Managers
● Stem Rule Editor
  ○ EN: s → ‘’, ces → y, …
  ○ FR: e → é, u → ù, …
● Jobs Index Builder
● Term Expansion Map
  ○ sale → sale, sales
  ○ policy → policy, policies
● Search Service
● Job Seekers (query, results)

Term expansion map storage

● Custom serialization format
  ○ Store string array as UTF8 bytes and offsets
  ○ Front encoding for additional compression
● 2X smaller than using Java native serialization
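Front encoding (also called front coding) stores each term in a sorted list as the length of the prefix it shares with the previous term plus the remaining suffix. The sketch below shows only that idea on Python strings; it is not Indeed's serialization format, which works on UTF8 bytes and offsets, and the function names are illustrative.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_terms: list) -> list:
    """Encode each term as (shared prefix length, remaining suffix)."""
    encoded, prev = [], ""
    for term in sorted_terms:
        shared = shared_prefix_len(prev, term)
        encoded.append((shared, term[shared:]))
        prev = term
    return encoded


def front_decode(encoded: list) -> list:
    """Rebuild the original terms from the (length, suffix) pairs."""
    terms, prev = [], ""
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix
        terms.append(prev)
    return terms


terms = sorted(["policies", "policy", "sale", "sales"])
encoded = front_encode(terms)
print(encoded)  # [(0, 'policies'), (5, 'y'), (0, 'sale'), (4, 's')]
```

Sorted term lists share long prefixes between neighbours, which is where the extra compression comes from.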

Comprehensive

● Gender
● Accents
● Plurals
● Synonyms

Scalable

27 languages use stemming rules

Re-used language detection and stemming libraries in resume search

Efficient

● Term expansion map in Europe index has 2 million terms in 18 languages - 60MB on disk

● Building term expansion maps takes ~ 1.5 minutes

● Doing boolean query for stemming adds ~5ms to median search time (~35ms)

Stemming helps job seekers

Searches that return no jobs reduced by 60% with stemming

3% to 5% more clicks

Multi-currency Sponsored Jobs

Sponsored Jobs at Indeed

● Real-time auction used to determine Sponsored Job impressions
● Auction winner based on expected value

Expected Value = Bid x eCTR

eCTR = Expected Click-Through Rate*

Job  Bid    x  eCTR  =  Value  →  Rank
A    $3.00     5%       $0.15     2
B    $2.00     10%      $0.20     1
C    $1.00     8%       $0.08     3

B could win the auction with a lower bid...
…only charge what’s needed to win!

$1.50 x 10% = $0.15, which only ties A’s value, so B needs one cent more

Cost = $1.51
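The ranking and second-price cost from the table can be reproduced as follows. This is a sketch of the arithmetic on the slides only: bids are held in whole cents, eCTRs as exact fractions, the "one cent above the tying bid" rule is taken from the $1.50 / $1.51 example, and `run_auction` is a made-up name.

```python
from fractions import Fraction


def run_auction(bids: dict):
    """bids maps job -> (bid in cents, eCTR). Returns (winner, cost in cents)."""
    ranked = sorted(
        bids.items(), key=lambda item: item[1][0] * item[1][1], reverse=True
    )
    winner, (bid, ectr) = ranked[0]
    _, (next_bid, next_ectr) = ranked[1]
    # Largest bid that does not beat the runner-up's value, plus one cent.
    cost = (next_bid * next_ectr) // ectr + 1
    return winner, min(cost, bid)


bids = {
    "A": (300, Fraction(5, 100)),
    "B": (200, Fraction(10, 100)),
    "C": (100, Fraction(8, 100)),
}
print(run_auction(bids))  # ('B', 151)
```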

Sponsored Jobs at Indeed

“Generalized Second Price Auction”
● Fair for employers
● Ensures sponsored results are relevant and useful for job seekers

Sponsored Jobs at Indeed

Employers set their bid & budget

employer_id int(10) unsigned,
bid decimal(10,2) unsigned,
daily_budget decimal(10,2) unsigned,

Sponsored Jobs at Indeed

A builder process creates read-optimized data structures for the auction system

On search results page, execute auction to determine sponsored impressions

Sponsored Jobs at Indeed

When a job seeker clicks on a sponsored result, log information from the auction:
● employerId
● jobId
● bid
● cost

Sponsored Jobs at Indeed

Process click logs to update budgets and charge employers

Apply business rules during click processing:
● Fraud detection
● Duplicate click detection

SJ outside the US

Non-US employers wanted their jobs in sponsored results...
...but they don’t have US Dollars

v1: Use credit cards
Credit card company converts charges to employer’s currency

Credit Cards
+ No changes needed
- Bad UX for employers
- Disadvantageous exchange rates
- Employers bear currency risk

Credit Cards: Currency Risk

Desired Daily Budget: CA $100.00
Exchange rate on Jan 1: 0.9351 (USD per CAD)
Set Daily Budget to: $93.51

Exchange rate on Jan 31: 0.8970
Effective Daily Budget: CA $104.25 (+4.25%)
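The drift can be recomputed with decimal arithmetic. The variable names are illustrative and rounding half-up to the cent is an assumption; the rates are read as USD per CAD, which is what makes CA $100.00 come out as $93.51.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

desired_cad = Decimal("100.00")
rate_jan1 = Decimal("0.9351")   # USD per CAD on Jan 1
rate_jan31 = Decimal("0.8970")  # USD per CAD on Jan 31

# The employer fixes a USD budget using the Jan 1 rate...
budget_usd = (desired_cad * rate_jan1).quantize(CENT, rounding=ROUND_HALF_UP)
# ...but by Jan 31 the same USD budget costs more Canadian dollars.
effective_cad = (budget_usd / rate_jan31).quantize(CENT, rounding=ROUND_HALF_UP)
drift_pct = (effective_cad / desired_cad - 1) * 100

print(budget_usd)     # 93.51
print(effective_cad)  # 104.25
```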

Multi-currency Sponsored Jobs Auction

Multi-currency SJ

Employers can set bids and budgets in preferred currency

Canadian Dollars CAD

Australian Dollars AUD

Japanese Yen JPY

Euro EUR

British Pounds GBP

Swiss Francs CHF

Multi-currency SJ

Single auction for all employers using any currency

Multi-currency SJ

Fair exchange rates for employers

Multi-currency SJ

Transparent and repeatable calculations

Multi-currency SJ

Create a new “pseudo-currency” for use within the auction:

millicent

Millicents

Exchange rate between USD and millicents is fixed:

$0.01 == 1,000 millicents
$1.00 == 100,000 millicents (10^5)

Millicents

Exchange rates between other currencies and millicents can vary over time:

€1.00 == 136,170 millicents
¥100 == 98,350 millicents

Millicents

Provide enough granularity to differentiate similar values in different currencies

All of these are about $1.00 (USD). Which is larger?
● £0.60 (GBP)
● €0.73 (EUR)
● ¥102 (JPY)

Converting to USD doesn’t help

USD: $1.00 → $1.00
GBP: £0.60 → $1.00
EUR: €0.73 → $1.00
JPY: ¥102 → $1.00

Millicents

Millicents provide granularity to rank values

USD: $1.00 → 100000 mc

GBP: £0.60 → 100450 mc

EUR: €0.73 → 99519 mc

JPY: ¥102 → 100317 mc
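A sketch of the conversion and ranking. The per-currency rates are hypothetical: they are chosen so that the results match the millicent values in the table above, and only the USD rate (100,000 millicents per dollar) is fixed by definition. Amounts are integers in each currency's minor unit.

```python
from fractions import Fraction

# Hypothetical rates for one day, in millicents per major unit.
MILLICENTS_PER_MAJOR = {
    "USD": Fraction(100_000),      # fixed by definition
    "GBP": Fraction(167_417),      # assumed
    "EUR": Fraction(136_327),      # assumed
    "JPY": Fraction(98_350, 100),  # ¥100 == 98,350 millicents
}
# Minor units per major unit (the yen has no minor unit).
MINOR_PER_MAJOR = {"USD": 100, "GBP": 100, "EUR": 100, "JPY": 1}


def to_millicents(currency: str, minor_units: int) -> int:
    """Convert an amount in a currency's minor unit to whole millicents."""
    rate = MILLICENTS_PER_MAJOR[currency] / MINOR_PER_MAJOR[currency]
    return round(rate * minor_units)


bids = [("USD", 100), ("GBP", 60), ("EUR", 73), ("JPY", 102)]
ranked = sorted(bids, key=lambda bid: to_millicents(*bid), reverse=True)
for currency, amount in ranked:
    print(currency, to_millicents(currency, amount))
# GBP 100450
# JPY 100317
# USD 100000
# EUR 99519
```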

Millicents

32 bit signed values: up to $21,474 USD equivalent

64 bit signed values: up to $92 trillion USD equivalent

Local Currency Values

Values in specific currency are represented with currency code and an integer

Integer represents “minor unit”, depends on the currency type:
● (USD, 543) == $5.43
● (EUR, 543) == €5.43
● (JPY, 543) == ¥543

Local Currency Values

For each currency, preferable that the “minor unit” is roughly equal to $0.01 USD
● Exchange rate representation
● Fairness in auction competition

Local Currency Values

32 bit signed values
● $21 million USD (and others)
● ¥2.1 billion JPY

64 bit signed values
● $90 quadrillion USD (and others)
● ¥9 quintillion JPY

Multi-currency SJ

Change bid and budget representations to use [currency, integer]

Multi-currency SJ

Create process to retrieve and record exchange rates every day

Multi-currency SJ

Auction builder process converts bids to millicents, saves the exchange rate used

Multi-currency SJ

Execute auction in millicents

Multi-currency SJ

Record results in millicents & local currency

Multi-currency SJ

Add multi-currency data to click logs:

Before:
● employerId
● jobId
● bid
● cost
● ...

After:
● employerId
● jobId
● currency
● exchangeRate
● bidInCurrency
● bidMillicents
● costMillicents
● ...

Multi-currency SJ

During click processing, convert auction cost (in millicents) back to employer’s currency using same exchange rate

costInMillicents

currency

exchangeRate

→ costInCurrency

“How much revenue did we make today?”

$1,000
● $548 USD
● €273 EUR
● ¥8,253 JPY

100,000,000 mc

Revenue Reporting

If the auction millicent cost is used, there could be errors!

Millicent Cost: 53,826 millicents
Euro Cost: €0.39483, charged as €0.39
Actual Millicent Cost: 53,168 millicents

1.2% difference!
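The gap comes from rounding the charge to a whole euro cent. In this sketch the rate of 136,327 millicents per euro is an assumption, back-derived from the figures on the slide, and round-to-nearest is an assumed rounding mode.

```python
from fractions import Fraction

# Assumed rate: 136,327 millicents per euro, i.e. 1363.27 per euro cent.
MILLICENTS_PER_EURO_CENT = Fraction(136_327, 100)

auction_cost_mc = 53_826

# Convert the auction cost to euro cents; the employer is charged whole cents.
exact_cents = auction_cost_mc / MILLICENTS_PER_EURO_CENT
charged_cents = round(exact_cents)

# Millicent value of what was actually charged.
actual_mc = round(charged_cents * MILLICENTS_PER_EURO_CENT)
error_pct = float(Fraction(auction_cost_mc - actual_mc, auction_cost_mc) * 100)

print(charged_cents)        # 39
print(actual_mc)            # 53168
print(round(error_pct, 1))  # 1.2
```

Summing `costMillicents` from the auction would therefore overstate revenue relative to what employers were actually charged.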

Active Non-US Employers

Great Britain

Japan

Canada

30% of Sponsored clicks non-USD

2004

2014

International Success

United Kingdom 1.) Indeed 2.) Reed 3.) Totaljobs

France 1.) Indeed 2.) Cadremploi 3.) Monster

Netherlands 1.) Indeed 2.) NVB 3.) Monsterboard

Canada 1.) Indeed 2.) Workopolis 3.) Monster

Italy 1.) Indeed 2.) Infojobs 3.) Jobrapido

Brazil 1.) Indeed 2.) Catho 3.) Infojobs

Japan 1.) Rikunabi 2.) Indeed 3.) Rikunabi Next

Australia 1.) Seek 2.) Indeed 3.) Careerone

India 1.) Naukri 2.) Timesjobs 3.) Indeed

Next @IndeedEng Talk

August 27th, 2014

http://engineering.indeed.com/talks
https://twitter.com/IndeedEng