
Learning to Rank Resumes

Sangameshwar Patil, Girish K. Palshikar, Rajiv Srivastava, Indrajit Das

Tata Research Development and Design Center (TRDDC), Tata Consultancy Services, 54-B, Hadapsar I.E., Pune, 411013, India

{sangameshwar.patil, gk.palshikar, rajiv.srivastava, indrajit.das}@tcs.com

ABSTRACT

Recruiting and assigning the right candidate for the right job consumes significant time and effort in an organization. Automated resume information retrieval systems are increasingly looked upon as the solution to this problem. In this paper, we focus on the problem of learning an end-user-specific ranking in such a resume search engine. We provide experimental results of the SVMrank algorithm [2] for this task.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; I.2.6 [Learning]: Concept Learning; H.3.4 [Information Storage and Retrieval]: Systems and Software, User Profiles

Keywords

Learning to rank, Resume information retrieval

1. INTRODUCTION

Resumes contain vital information that aids the decision-making process during recruitment in an organization. Project managers and HR personnel regularly face the daunting task of selecting the right person for the right job from hundreds or even thousands of candidate resumes. Incorrect recruitment or task assignment decisions often have costly business impact.

In the simplest retrieval scenario, values for a set of structured attributes are extracted from each free-form textual resume; thus each resume is represented as a record in a table. Also, the job description can be specified as a structured condition (query) over the same attributes. In more complex situations, the structured query itself could be supplemented with a free-form textual job description.

Given a query, the search process retrieves and presents the top K closely matching resumes to the end-user, who then manually filters and selects a subset (say, of size M < K) of these resumes for further processing (e.g., interviews). The selected resumes are not necessarily the top M resumes in the retrieved list. Some domain knowledge is used in making such a selection, which can often be explained in terms of background quality attributes. For example, the end-user may prefer persons with higher academic qualifications, higher marks, additional certifications, or awards won. Thus the end-user implicitly re-ranks the K retrieved resumes using an unknown ranking function. This ranking function may vary from user to user, or may depend on the nature of the target position for the recruitment. The question that we address here is: how can this unknown resume ranking function be learnt after observing a specific user's selections from the resumes retrieved for different queries?

2. RESUME RANKING PROBLEM

Let A = {A1, A2, . . . , Am} be a set of m structured attributes for resumes; let DOM(Ai) denote the set of possible values for attribute Ai. Let D = {D1, D2, . . . , DN} be the given set of N resumes, where each Dj is an m-tuple in DOM(A1) × . . . × DOM(Am). A basic query has the form Ai ∈ Xi, where Xi ⊆ DOM(Ai). A query is a conjunction of basic queries such that each attribute in A occurs in at most one basic query. Let K and M ≤ K be two given positive integer constants. Let S denote the non-empty set of at most K resumes retrieved from D for the given query Q. A ranking function f returns an ordered subset L ⊆ S of size at most M, from the given set S of at most K resumes. Let {Q1, Q2, . . . , Qn} denote a given set of n queries, along with the sets Si of retrieved resumes for each query Qi. Let L1, . . . , Ln (Li ⊆ Si) be the corresponding ordered subsets of the retrieved resumes, as ranked by the user. The ranking function f used by the user is assumed to be fixed but unknown, and needs to be estimated using this training data.
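To make this setup concrete, the following minimal Python sketch (with hypothetical attribute names and toy data, not our actual system) models resumes as attribute records, a query as a conjunction of basic queries, and retrieval of at most K matches:

    # Minimal sketch of the retrieval setup described above.
    # Attribute names and data are illustrative assumptions.
    resumes = [
        {"id": 1, "skill": "Java", "experience": 3, "degree": "B.E."},
        {"id": 2, "skill": "Oracle", "experience": 1, "degree": "M.C.A."},
        {"id": 3, "skill": "Java", "experience": 5, "degree": "B.Tech."},
    ]

    # A query is a conjunction of basic queries: each attribute occurs
    # in at most one (attribute, allowed-values) condition.
    query = {"skill": {"Java"}}

    def matches(resume, query):
        """True if the resume satisfies every basic query Ai in Xi."""
        return all(resume[attr] in allowed for attr, allowed in query.items())

    def retrieve(resumes, query, k):
        """Return the set S of at most K resumes matching the query."""
        return [r for r in resumes if matches(r, query)][:k]

    S = retrieve(resumes, query, k=2)  # e.g. resumes with ids 1 and 3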

Learning such ranking functions is an active area of research [3]. In this paper, we report results obtained by a well-known learning algorithm, SVMrank [1], to learn a user-specific resume ranking function. SVMrank is a pairwise ranking method which learns an unbiased linear classification rule. Let Di ≻ Dj denote that the user prefers Di over Dj. Training data for SVMrank contains a set R = {(D1, r1), (D2, r2), . . . , (DM, rM)}, where ri is the ranking of resume (or document) Di, i.e. if Di ≻ Dj then ri < rj. From a given ordering of the result-set per query, a set of pairwise constraints is generated. Let ξij denote the slack variables in the soft-margin formulation of SVM. The goal of SVMrank is to learn a ranking function f that minimizes the objective function:

L1(w, ξij) = (1/2) w · w + C Σ ξij

subject to the constraints:

∀(Di, Dj) : ri < rj ⇒ w · Di ≥ w · Dj + 1 − ξij, and ξij ≥ 0 ∀(i, j).
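For intuition, the pairwise reduction behind this formulation can be sketched in a few lines of Python: each preference Di ≻ Dj becomes a feature-difference vector with label +1 (and its mirror with label -1), and a linear classifier without an intercept trained on these differences yields the weight vector w. This is a simplified illustration using scikit-learn, not the svm_rank implementation [1] used in our experiments:

    import numpy as np
    from sklearn.svm import LinearSVC

    def pairwise_transform(X, ranks):
        """Turn per-query ranks into labeled difference vectors:
        for every pair with r_i < r_j (D_i preferred), emit
        (D_i - D_j, +1) and the mirrored (D_j - D_i, -1)."""
        diffs, labels = [], []
        for i in range(len(X)):
            for j in range(len(X)):
                if ranks[i] < ranks[j]:
                    diffs.append(X[i] - X[j]); labels.append(+1)
                    diffs.append(X[j] - X[i]); labels.append(-1)
        return np.array(diffs), np.array(labels)

    # Toy example: 3 resumes with 2 features each, ranked 1..3.
    X = np.array([[5.0, 1.0], [3.0, 2.0], [1.0, 0.0]])
    Xd, y = pairwise_transform(X, ranks=[1, 2, 3])
    # No intercept: the rule is unbiased, as in the formulation above.
    w = LinearSVC(C=1.0, fit_intercept=False).fit(Xd, y).coef_[0]
    scores = X @ w  # rank any set of resumes by w . D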

3. EXPERIMENTS

Using the resume information extraction engine built in-house, we created a repository of more than 2000 resumes of IT professionals. Each resume is represented using a standardized set of 28 features such as work experience, top 3 technology skills, top 3 business domains, number of job changes, average job duration with a company, number of professional certifications (e.g. CCNA, SCJP), academic qualification, average marks obtained, and so on. We then created a benchmark set of 396 queries representing the typical information needs of HR and workforce management teams. Some sample queries are: Q1 = (Java professionals with 0-2 years of experience), Q2 = (Computer Science graduates with Java and Oracle skills and at least 2 years of project management experience).
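For illustration, one such standardized feature record might be encoded as follows (a hypothetical subset of the 28 features with invented values, not the actual schema of our extraction engine):

    # Hypothetical feature record for one resume (subset of the 28 features).
    resume_features = {
        "experience_years": 4.5,                       # total work experience
        "top_skills": ["Java", "Oracle", "J2EE"],      # top 3 technology skills
        "top_domains": ["Banking", "Insurance", "Retail"],  # top 3 business domains
        "n_job_changes": 2,                            # number of job changes
        "avg_job_duration_years": 2.1,                 # average duration with a company
        "n_certifications": 1,                         # professional certifications (e.g. SCJP)
        "qualification": "B.E.",                       # academic qualification
        "avg_marks_pct": 72.0,                         # average marks obtained
    }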

Depending on the target role/position for a query, the end-users (e.g. HR personnel, project managers) will use additional domain knowledge and/or prior experience to form a preference list from the result-set of a query. For instance, an HR manager may give more preference to the average percentage marks received in college and to candidates from top-tier colleges, whereas a seasoned project manager may prefer candidates who have not changed jobs frequently.

In learning to rank problems, the relevance judgements are typically made by human assessors. However, the feature of learning to rank resumes is still in the process of being added to our resume information extraction tool; hence human relevance judgement data is not available currently. To model end-user-specific ranking behavior, we used the following three representative functions (f1, f2, f3) as approximations of three end-user categories. These are used to simulate the human assessors and obtain preference lists for experimental verification.

f1 = 0.4 · csGrad + 0.8 · experience − 0.7 · nJobChanges,
f2 = 0.4 · (experience² + avgMarks) / (nJobChanges + 1),
f3 = if (technology is searched) then (0.75 · nRelevantCertifications + 0.25 · nOtherCertifications) else (0.25 · nOtherCertifications).
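A small Python sketch of these three simulated assessors (assuming the squared-experience reading of f2, with each resume given as a dict keyed by the feature names above) might look like:

    def f1(r):
        # Prefers CS graduates with more experience and fewer job changes.
        return 0.4 * r["csGrad"] + 0.8 * r["experience"] - 0.7 * r["nJobChanges"]

    def f2(r):
        # Rewards experience and marks; penalizes frequent job changes.
        return 0.4 * (r["experience"] ** 2 + r["avgMarks"]) / (r["nJobChanges"] + 1)

    def f3(r, technology_searched):
        # Weights certifications relevant to the searched technology higher.
        if technology_searched:
            return 0.75 * r["nRelevantCertifications"] + 0.25 * r["nOtherCertifications"]
        return 0.25 * r["nOtherCertifications"]

    # e.g. f1({"csGrad": 1, "experience": 4, "nJobChanges": 1})  ->  2.9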

Note that the constants (e.g. 0.4, 0.8, etc.) used in the definitions of f1, f2, f3 are chosen to indicate the relative importance given to a specific attribute vis-a-vis other attributes in a resume. One could try out different values of such constants. Further, we also assume that a pre-specified database of top-tier colleges, desirable certifications for a technology, etc. is already available to the end-users.

We evaluated the SVMrank algorithm using 10-fold cross-validation. For training, we randomly selected 80% of the queries from the query set. The result-set for each query was ranked using the ranking functions f1, f2, f3 described above. The top 20 resumes from this ranked set per query were given to SVMrank as training data. The evaluation was done using the remaining 20% of the queries from the query set. The evaluation procedure is outlined in Figure 1. To measure how well SVMrank has learned the original ranking function fi, we use the Jaccard coefficient (JC) and Kendall's τ rank correlation coefficient. The JC between two sets A and B is defined as J(A, B) = |A ∩ B| / |A ∪ B|; it measures the similarity between A and B without regard to any ordering of the elements. Kendall's τ takes the ordering of the elements within A and B into account. It is defined as τ = (number of concordant pairs − number of discordant pairs) / (n(n − 1)/2).

Figure 1: Experiment evaluation setup

Figure 2: Experimental Results

We compare the set of top-M resumes as ranked by SVMrank with the original list of top-M resumes ranked by each fi. Figure 2 summarizes the experimental evaluation. As we increase M, the number of top-M resumes used for comparison, Kendall's τ increases with a marginal decrease in JC. For f1, relatively low values of τ and JC are obtained because of the relatively large number of ties due to similar ranking scores assigned by f1 to resumes. The reasonably high values of both measures (for f2 and f3) give us the confidence that SVMrank is able to identify and rank most of the resumes in the top-M list as per the original ordering.
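Both evaluation measures follow directly from their definitions; a minimal Python sketch (with hypothetical top-M lists given as {resume-id: rank} mappings):

    from itertools import combinations

    def jaccard(a, b):
        """J(A, B) = |A intersect B| / |A union B| over two sets of resume ids."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    def kendall_tau(rank_a, rank_b):
        """Kendall's tau between two rankings of the same items:
        (concordant pairs - discordant pairs) / (n(n-1)/2)."""
        items = list(rank_a)
        concordant = discordant = 0
        for x, y in combinations(items, 2):
            s = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
        n = len(items)
        return (concordant - discordant) / (n * (n - 1) / 2)

    # Example: SVMrank's top-M list vs. the list produced by f_i.
    top_svm = {"r1": 1, "r2": 2, "r3": 3}
    top_fi = {"r1": 1, "r3": 2, "r2": 3}
    print(jaccard(top_svm, top_fi), kendall_tau(top_svm, top_fi))  # 1.0 0.333...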

4. CONCLUSIONS AND FURTHER WORK

In this paper we have identified an important problem with significant business impact: learning to rank resumes. Experiments using approximate models of human relevance judgement showed that the SVMrank algorithm is able to predict and rank the set of top-M resumes with good accuracy. Experiments with real-life relevance judgements from humans, and comparison with other algorithms for learning to rank problems, are planned as future work.

5. REFERENCES

[1] T. Joachims. SVMrank implementation. http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html.

[2] T. Joachims. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2006.

[3] T. Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225-331, 2009.