Self-learned Relevancy with Apache Solr

Trey Grainger

SVP of Engineering, Lucidworks

NYC Lucene/Solr, 2017.03.30

About Me

Trey Grainger, SVP of Engineering

• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search – Stanford University

Other fun projects:

• Co-author of Solr in Action, plus numerous research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor

Agenda

• Apache Solr Overview
• Lucidworks Fusion Overview
• Core Search / Relevancy
  - Keyword Search
  - Multi-lingual Text Analysis
  - Relevancy
• Reflected Intelligence
  - Signals (Demo)
  - Recommendations (Demo)
  - Relevancy Tuning
  - Learning to Rank (Demo)
• Semantic Search
  - Entity Extraction (Demo)
  - Query Parsing (Demo)
  - Semantic Knowledge Graph (Demo)
• Streaming Expressions

Data-driven App Sophistication

Basic Keyword Search (inverted index, tf-idf, bm25, multilingual text analysis, query formulation, etc.)

Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)

Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)

Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)

Self-learning

what do you do?

Search-Driven Everything

Customer Service

Customer Insights

Fraud Surveillance

Research Portal

Online Retail

Digital Content

Lucidworks enables Search-Driven Everything

Data Acquisition

Indexing & Streaming

Smart Access API

Recommendations & Alerts

Analytics & Insights

Extreme Relevancy

CUSTOMER SERVICE, RESEARCH PORTAL, DIGITAL CONTENT, CUSTOMER INSIGHTS, FRAUD SURVEILLANCE, ONLINE RETAIL

• Access all your data in a number of ways from one place.
• Secure storage and processing from Solr and Spark.
• Acquire data from any source with pre-built connectors and adapters.

Machine learning and advanced analytics turn all of your apps into intelligent data-driven applications.

Apache Solr

“Solr is the popular, blazing-fast,

open source enterprise search

platform built on Apache Lucene™.”

Key Solr Features:

● Multilingual Keyword search

● Relevancy Ranking of results

● Faceting & Analytics (nested / relational)

● Highlighting

● Spelling Correction

● Autocomplete/Type-ahead Prediction

● Sorting, Grouping, Deduplication

● Distributed, Fault-tolerant, Scalable

● Geospatial search

● Complex Function queries

● Recommendations (More Like This)

● Graph Queries and Traversals

● SQL Query Support

● Streaming Aggregations

● Batch and Streaming processing

● Highly Configurable / Plugins

● Learning to Rank

● Building machine-learning models

● … many more

*source: Solr in Action, chapter 2

The standard for enterprise search.

90% of Fortune 500 uses Solr.

Lucidworks Fusion

All Your Data

• Over 50 connectors to integrate all your data
• Robust parsing framework to seamlessly ingest all your document types
• Point-and-click indexing configuration and iterative simulation of results for full control over your ETL process
• Your security model enforced end-to-end from ingest to search across your different data sources

Experience Management

• Relevancy tuning: Point-and-click query pipeline configuration allows fine-grained control of results.
• Machine-driven relevancy: Signals aggregation learns and automatically tunes relevancy and drives recommendations out of the box.
• Powerful pipeline stages: Customize fields, stages, synonyms, boosts, facets, machine learning models, your own scripted behavior, and dozens of other powerful search stages.
• Turnkey search UI (Lucidworks View): Build a sophisticated end-to-end search application in just hours.

Operational Simplicity

[Architecture diagram, SECURITY BUILT-IN:
 Data sources: logs, file, web, database, cloud
 Core services: connectors, pipelines, NLP, recommenders / signals, blob storage, scheduling, alerting / messaging, …
 Access: REST API, Admin UI, Lucidworks View
 Apache Solr: shards
 Apache Spark: cluster manager, workers
 Apache Zookeeper (ZK 1): leader election, load balancing, shared config management
 HDFS (optional)]

• 75% decrease in development time
• Licensing costs cut by 50%

“With Fusion’s out-of-the-box capabilities, we skipped months in our dev cycle so we could focus our team where they would have the most impact. We cut our licensing costs by 50% and improved application usability. The Lucidworks professional services team amplified our success even further. We’re all Fusion from here on out!”

Lourduraju Pamishetty, Senior IT Application Architect

• Seamless integration of your entire search & analytics platform
• All capabilities exposed through secured APIs, so you can use our UI or build your own.
• End-to-end security policies can be applied out of the box to every aspect of your search ecosystem.
• Distributed, fault-tolerant scaling and supervision of your entire search application


Lucidworks Fusion

Fusion powers search for the brightest companies in the world.

search & relevancy

Basic Keyword Search

The beginning of a typical search journey

The inverted index

What you SEND to Lucene/Solr:

Document   Content Field
doc1       once upon a time, in a land far, far away
doc2       the cow jumped over the moon.
doc3       the quick brown fox jumped over the lazy dog.
doc4       the cat in the hat
doc5       The brown cow said “moo” once.
…          …

How the content is INDEXED into Lucene/Solr (conceptually):

Term    Documents
a       doc1 [2x]
brown   doc3 [1x], doc5 [1x]
cat     doc4 [1x]
cow     doc2 [1x], doc5 [1x]
…       …
once    doc1 [1x], doc5 [1x]
over    doc2 [1x], doc3 [1x]
the     doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…       …
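As a rough illustration of the conceptual index on this slide, the sketch below builds a term-to-documents map with per-document counts from the five example documents. The `tokenize` helper (lowercase, split on non-alphanumerics) is an assumption made for this example and is far simpler than a real Lucene analyzer.

```python
import re
from collections import Counter, defaultdict

# The five example documents from the slide.
docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}


def tokenize(text):
    """Hypothetical analyzer: lowercase and keep alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to {doc_id: number of occurrences in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


index = build_inverted_index(docs)
for term in ("a", "brown", "cow", "the"):
    print(term, index[term])
```

Looking up a term is then a dictionary access, which is why matching is cheap even when the document collection is large.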

Matching queries to documents

/solr/select/?q=apache solr

Term     Documents
…        …
apache   doc1, doc3, doc4, doc5
…        …
hadoop   doc2, doc4, doc6
…        …
solr     doc1, doc3, doc4, doc7, doc8
…        …

[Venn diagram: "apache" only: doc5; "solr" only: doc7, doc8; "apache solr" (both terms): doc1, doc3, doc4]
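The Venn diagram can be reproduced with plain set operations over the postings lists. This is only a sketch: the function names `match_all` and `match_any` are made up for the example, and whether a real query requires all terms or any term depends on the operator the query uses.

```python
# Postings lists copied from the slide.
postings = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def match_all(terms, postings):
    """Documents containing every term (AND semantics)."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def match_any(terms, postings):
    """Documents containing at least one term (OR semantics)."""
    return set().union(*(postings.get(term, set()) for term in terms))


query = "apache solr".split()
print(sorted(match_all(query, postings)))  # overlap of the two circles
print(sorted(match_any(query, postings)))  # everything inside either circle
```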

Text Analysis

Generating terms to index from raw text

Text Analysis in Solr

A text field in Lucene/Solr has an Analyzer containing:

① Zero or more CharFilters: take incoming text and “clean it up” before it is tokenized
② One Tokenizer: splits incoming text into a Token Stream containing zero or more Tokens
③ Zero or more TokenFilters: examine and optionally modify each Token in the Token Stream

*From Solr in Action, Chapter 6
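The three-stage structure can be mimicked in a few lines of Python. This is a conceptual sketch only, not Lucene's API: every function name and the tiny stopword set are assumptions chosen to mirror the CharFilter, Tokenizer, and TokenFilter roles.

```python
import html
import re


def strip_html(text):
    """CharFilter stand-in: remove tags and decode entities before tokenizing."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def whitespace_tokenizer(text):
    """Tokenizer stand-in: split the cleaned text on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """TokenFilter stand-in: normalize case."""
    return [token.lower() for token in tokens]


def stop_filter(tokens, stopwords=frozenset({"a", "an", "the", "of"})):
    """TokenFilter stand-in: drop tokens found in a (hypothetical) stopword set."""
    return [token for token in tokens if token not in stopwords]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run zero or more char filters, exactly one tokenizer, then zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>The Quick Brown Fox</p>",
    char_filters=[strip_html],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter, stop_filter],
)
print(tokens)
```

The order of the token filters matters: here lowercasing runs before stopword removal so that "The" is caught by the lowercase stopword set.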


Multi-lingual Text Analysis

Analyzing text across multiple languages

Example English Analysis Chains

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="lang/en_protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Alternative text_en definition: use one or the other, or give this one a different name -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="lang/en_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Per-language Analysis Chains

*Some of the 32 different language configurations in Appendix B of Solr in Action

Which Stemmer do I choose?



Common English Stemmers



When Stemming goes awry

Fixing Stemming Mistakes:

• Unfortunately, every stemmer will have problem cases that aren’t handled as you would expect
• Thankfully, stemmers can be overridden
  - KeywordMarkerFilter: protects a list of terms you specify from being stemmed
  - StemmerOverrideFilter: applies a list of custom term mappings you specify

Alternate strategy:

• Use Lemmatization (root-form analysis) instead of Stemming
• Commercial vendors help tremendously in this space
• The Hunspell stemmer enables dictionary-based support of varying quality in over 100 languages
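The two override mechanisms can be illustrated with a toy example. The suffix-stripping `naive_stem` below is a deliberately crude stand-in (not Porter or KStem), and the protected list and mapping are hypothetical; the point is only how a protected-terms list and a custom mapping take precedence over the stemmer.

```python
def naive_stem(token):
    """Deliberately crude stemmer: strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stem_with_overrides(tokens, protected=frozenset(), overrides=None):
    """Apply the stemmer unless a token is protected or has a custom mapping."""
    overrides = overrides or {}
    stemmed = []
    for token in tokens:
        if token in protected:        # KeywordMarkerFilter-style protection
            stemmed.append(token)
        elif token in overrides:      # StemmerOverrideFilter-style mapping
            stemmed.append(overrides[token])
        else:
            stemmed.append(naive_stem(token))
    return stemmed


# Without protection the toy stemmer mangles "news" into "new".
print(naive_stem("news"))
print(stem_with_overrides(
    ["running", "news", "mice"],
    protected={"news"},
    overrides={"mice": "mouse"},
))
```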

Relevancy

Scoring the results, returning the best matches

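Classic Lucene Relevancy Algorithm (now switched to BM25):

\[
\mathrm{Score}(q, d) = \Big( \sum_{t \in q} \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t, d) \Big) \cdot \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q)
\]

Where: t = term; d = document; q = query; f = field

\[
\begin{aligned}
\mathrm{tf}(t \in d) &= \mathrm{numTermOccurrencesInDocument}^{1/2} \\
\mathrm{idf}(t) &= 1 + \log\big(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\big) \\
\mathrm{coord}(q, d) &= \mathrm{numTermsInDocumentFromQuery} / \mathrm{numTermsInQuery} \\
\mathrm{queryNorm}(q) &= 1 / \mathrm{sumOfSquaredWeights}^{1/2} \\
\mathrm{sumOfSquaredWeights} &= q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big(\mathrm{idf}(t) \cdot t.\mathrm{getBoost}()\big)^2 \\
\mathrm{norm}(t, d) &= d.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(f) \cdot f.\mathrm{getBoost}()
\end{aligned}
\]

*Source: Solr in Action, chapter 3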

TF * IDF

• Term Frequency: “How well does a term describe a document?”
  – Measure: how often a term occurs per document
• Inverse Document Frequency: “How important is a term overall?”
  – Measure: how rare the term is across all documents

*Source: Solr in Action, chapter 3
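To make the two measures concrete, here is a minimal sketch of the tf and idf definitions from the classic formula. It assumes a natural logarithm and ignores boosts, norms, coord, and queryNorm, so `term_weight` is only the tf · idf² core of one term's contribution, not a full Lucene score.

```python
import math


def tf(num_term_occurrences):
    """Square root of how often the term occurs in the document."""
    return math.sqrt(num_term_occurrences)


def idf(num_docs, doc_freq):
    """1 + log(numDocs / (docFreq + 1)); natural log is an assumption here."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def term_weight(num_term_occurrences, num_docs, doc_freq):
    """tf * idf^2 for one term, with all boosts and norms left at 1."""
    return tf(num_term_occurrences) * idf(num_docs, doc_freq) ** 2


# A rare term outweighs a common one at equal term frequency.
rare = term_weight(1, num_docs=1000, doc_freq=1)
common = term_weight(1, num_docs=1000, doc_freq=500)
print(rare > common)
```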

That’s great, but what about domain-specific knowledge?

News search: popularity and freshness drive relevance

Restaurant search: geographical proximity and price range are critical

Ecommerce: likelihood of a purchase is key

Movie search: more popular titles are generally more relevant

Job search: category of job, salary range, and geographical proximity matter

TF * IDF of keywords can’t hold its own against good domain-specific relevance factors!

what is “reflected intelligence”?

The Three C’s

Content: Keywords and other features in your documents

Collaboration: How others have chosen to interact with your system

Context: Available information about your users and their intent

Reflected Intelligence: “Leveraging previous data and interactions to improve how new data and interactions should be interpreted”

Feedback Loops

[Cycle diagram: User searches → User sees results → User takes an action → Users’ actions inform system improvements]

Examples of Reflected Intelligence

● Recommendation Algorithms
● Building user profiles from past searches, clicks, and other actions
● Identifying correlations between keywords/phrases
● Building out automatically-generated ontologies from content and queries
● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs
● Learning to Rank - using relevancy judgements and machine learning to train a relevance model
● Discovering misspellings, synonyms, acronyms, and related keywords
● Disambiguation of keyword phrases with multiple meanings
● Learning what’s important in your content

Consider what you know about users

John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development.

Irene is a bartender in Dublin and is only interested in jobs within 10 km of her location in the food service industry.

Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job.

Jane is a nurse educator in Boston seeking between $40K and $60K

*Example from chapter 16 of Solr in Action

Query for Jane

Jane is a nurse educator in Boston seeking between $40K and $60K

http://localhost:8983/solr/jobs/select/?
  fl=jobtitle,city,state,salary&
  q=(
    jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10
  )
  AND (
    (city:"Boston" AND state:"MA")^15
    OR state:"MA")
  AND _val_:"map(salary, 40000, 60000,10, 0)"

*Example from chapter 16 of Solr in Action
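A query like this is usually assembled from a user profile rather than typed by hand. The sketch below shows one way to do that with the standard library; `build_job_query` and `map_function` are hypothetical helpers written for this example, and the host, collection, field names, and boosts are simply the ones shown on the slide. `map_function` mirrors what the `map(salary, 40000, 60000, 10, 0)` function query expresses: a value inside the range contributes 10, anything else contributes 0.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_job_query(title, city, state, min_salary, max_salary):
    """Build a Solr select URL that boosts title, location, and salary fit."""
    q = (
        f'(jobtitle:"{title}"^25 OR jobtitle:({title})^10) '
        f'AND ((city:"{city}" AND state:"{state}")^15 OR state:"{state}") '
        f'AND _val_:"map(salary,{min_salary},{max_salary},10,0)"'
    )
    params = {"fl": "jobtitle,city,state,salary", "q": q}
    # urlencode takes care of escaping quotes, spaces, and carets.
    return "http://localhost:8983/solr/jobs/select/?" + urlencode(params)


def map_function(value, low, high, target, default):
    """Plain-Python equivalent of the range mapping used in the query."""
    return target if low <= value <= high else default


url = build_job_query("nurse educator", "Boston", "MA", 40000, 60000)
decoded_q = parse_qs(urlsplit(url).query)["q"][0]
print(decoded_q)
print(map_function(56183, 40000, 60000, 10, 0))
print(map_function(71359, 40000, 60000, 10, 0))
```

Because the salary clause only adds to the score, jobs outside the range can still match; they just rank lower than comparable jobs inside it.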

Search Results for Jane

{ ...
  "response":{"numFound":22,"start":0,"docs":[
    {"jobtitle":" Clinical Educator (New England/ Boston)",
     "city":"Boston",
     "state":"MA",
     "salary":41503},
    {"jobtitle":"Nurse Educator",
     "city":"Braintree",
     "state":"MA",
     "salary":56183},
    {"jobtitle":"Nurse Educator",
     "city":"Brighton",
     "state":"MA",
     "salary":71359},
    …]}}

*Example documents available @ http://github.com/treygrainger/solr-in-action

You just built a recommendation engine!

Can also integrate user behavior (ships with Fusion 3.1):

Demo: Signals & Recommendations

• 200%+ increase in click-through rates
• 91% lower TCO
• Fewer support tickets
• Increased customer satisfaction

Relevancy Tuning

Improving ranking algorithms through experiments and models

How to Measure Relevancy?

[Venn diagram: A = retrieved documents, C = related (relevant) documents, B = their overlap, i.e. retrieved documents that are relevant]

Precision = B/A

Recall = B/C

Problem:

Assume precision = 90% and recall = 100%, but the 10% of irrelevant documents were ranked at the top of the retrieved documents. Is that OK?
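The problem on this slide is easy to demonstrate: precision and recall are computed over sets, so they cannot see ordering. In this sketch the document IDs are invented; one irrelevant document is placed first in one ranking and last in the other, and both rankings get identical numbers.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision (B/A) and recall (B/C); result order is ignored."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical data: nine relevant documents and one irrelevant one.
relevant = {f"rel{i}" for i in range(1, 10)}
bad_first = ["junk"] + sorted(relevant)   # irrelevant document ranked at the top
bad_last = sorted(relevant) + ["junk"]    # same set, irrelevant document last

print(precision_recall(bad_first, relevant))
print(precision_recall(bad_last, relevant))
```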

Normalized Discounted Cumulative Gain

Ranking tables from the slide (one labeled Ideal, one labeled Given):

Rank   Relevancy
3      0.95
1      0.70
2      0.60
4      0.45

Rank   Relevancy
1      0.95
2      0.85
3      0.80
4      0.65

• Position is considered in quantifying relevancy.
• Labeled dataset is required.
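A minimal sketch of the metric follows, assuming the common formulation in which the relevancy at position p is discounted by log2(p + 1) and the result is divided by the DCG of the ideal (descending) ordering. Other gain variants exist, and the relevancy values below are illustrative rather than a reading of the slide's tables.

```python
import math


def dcg(relevancies):
    """Discounted cumulative gain of relevancy values in ranked order."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevancies, start=1)
    )


def ndcg(relevancies):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevancies, reverse=True))
    return dcg(relevancies) / ideal if ideal > 0 else 0.0


best_first = [0.95, 0.70, 0.60, 0.45]    # already in ideal order
best_third = [0.70, 0.60, 0.95, 0.45]    # most relevant document ranked third
worst_first = [0.45, 0.60, 0.70, 0.95]   # fully reversed

for ranking in (best_first, best_third, worst_first):
    print(round(ndcg(ranking), 4))
```

Unlike precision and recall, the score drops as relevant documents move down the list, which is what makes it suitable for comparing ranking algorithms.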

Learning to Rank

Learning to Rank (LTR)

● It applies machine learning techniques to discover the combination of features that provides the best ranking.
● It requires a labeled set of documents with relevancy scores for a given set of queries.
● Features used for ranking are usually more computationally expensive than the ones used for matching.
● It typically re-ranks a subset of the matched documents (e.g. top 1000).
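The re-ranking step in the last bullet can be sketched as follows. A linear model with hand-picked weights stands in for a trained model here (it is not LambdaMART or any model from the talk), and the feature names `bm25` and `clicks` are hypothetical.

```python
def rerank(matched_docs, feature_fn, weights, top_n=1000):
    """Re-score only the top_n matches with the model; leave the tail as is."""
    head, tail = matched_docs[:top_n], matched_docs[top_n:]

    def model_score(doc):
        # Stand-in for a trained model: weighted sum of the feature values.
        return sum(w * f for w, f in zip(weights, feature_fn(doc)))

    return sorted(head, key=model_score, reverse=True) + tail


def features(doc):
    """Hypothetical feature extractor: match score plus a behavioral signal."""
    return [doc["bm25"], doc["clicks"]]


# Documents in their original match order.
matched = [
    {"id": "a", "bm25": 3.0, "clicks": 0.10},
    {"id": "b", "bm25": 2.5, "clicks": 0.90},
    {"id": "c", "bm25": 2.0, "clicks": 0.95},
]

reranked = rerank(matched, features, weights=[1.0, 2.0], top_n=2)
print([doc["id"] for doc in reranked])
```

Limiting the model to the head of the list keeps the expensive features off the long tail of matches, which is the trade-off the bullets describe.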


Common LTR Algorithms

• RankNet* (Neural Network, boosted trees)
• LambdaMart* (set of regression trees)
• SVM Rank** (SVM classifier)

* http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf
** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf

LambdaMart Example

Source: T. Grainger, K. AlJadda. "Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016

Demo: Learning to Rank

Obtaining Relevancy Judgements: Typical Methodologies

1) Hire employees, contractors, or interns
   - Pros:
     Accuracy
   - Cons:
     Expensive
     Not scalable (cost or man-power-wise)
     Data becomes stale

2) Crowdsource
   - Pros:
     Less cost, more scalable
   - Cons:
     Less accurate
     Data still becomes stale

Source: T. Grainger, K. AlJadda. "Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016

Reflected Intelligence: Possible to infer relevancy judgements?

Rank   Document ID
1      Doc1
2      Doc2
3      Doc3
4      Doc4

[Diagram: two queries, each with a 0/1 value per document]

Query   Doc1   Doc2   Doc3
        0      1      1

Query   Doc1   Doc2   Doc3
        1      0      0

Source: T. Grainger, K. AlJadda. "Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016
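One simple way to turn interaction logs into judgements is to use the click-through rate of each (query, document) pair as a graded relevancy label. This is an assumption made for illustration, not necessarily the method used in the talk, and it ignores position bias (higher-ranked results get clicked more regardless of quality). The log entries are invented.

```python
from collections import defaultdict


def infer_judgements(click_log):
    """Estimate relevancy per (query, doc) as clicks divided by impressions."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, doc_id, was_clicked in click_log:
        shown[(query, doc_id)] += 1
        clicked[(query, doc_id)] += int(was_clicked)
    return {key: clicked[key] / shown[key] for key in shown}


# Hypothetical log entries: (query, document shown, whether it was clicked).
log = [
    ("solr", "Doc1", False), ("solr", "Doc2", True), ("solr", "Doc3", True),
    ("solr", "Doc1", True), ("solr", "Doc2", False), ("solr", "Doc3", False),
    ("solr", "Doc2", True),
]

judgements = infer_judgements(log)
for (query, doc_id), score in sorted(judgements.items()):
    print(query, doc_id, round(score, 2))
```

Judgements derived this way refresh themselves as users keep searching, which addresses the staleness problem of hired or crowdsourced labels.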

Automated Relevancy Benchmarking

[Line chart: one series per algorithm (DefaultAlgorithm, Algorithm1, Algorithm2, Algorithm3, Algorithm4, Algorithm5), plotted daily from 10/1/16 to 10/10/16; data labels range from 0.30 to 0.81]

The Relevancy Spectrum

[Diagram labels: Traditional Keyword Search, Recommendations, Semantic Search, User Intent, Personalized Search, Augmented Search, Domain-aware Matching]

semantic search


Building a Taxonomy of Entities

Many ways to generate this:

• Topic Modelling
• Clustering of documents
• Statistical Analysis of interesting phrases
  - Word2Vec / GloVe / Dice Conceptual Search
• Buy a dictionary (often doesn’t work for domain-specific search problems)
• Generate a model of domain-specific phrases by mining query logs for commonly searched phrases within the domain*

* K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.

entity extraction


Demo: Solr Text Tagger

semantic query parsing


Probabilistic Query Parser

Goal: given a query, predict which combinations of keywords should be combined together as phrases

Example:

senior java developer hadoop

Possible Parsings:

senior, java, developer, hadoop
"senior java", developer, hadoop
"senior java developer", hadoop
"senior java developer hadoop"
"senior java", "developer hadoop"
senior, "java developer", hadoop
senior, java, "developer hadoop"

Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.
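A toy version of this idea: enumerate every way to split the query into contiguous phrases (2^(n-1) ways for n keywords, so 8 for this example) and pick the split whose phrases are jointly most probable. The phrase probabilities below are invented for illustration, and the scoring is a simple product of independent phrase probabilities; the parser demonstrated in the talk may work differently.

```python
def segmentations(tokens):
    """All ways to split tokens into contiguous phrases, preserving order."""
    if not tokens:
        return [[]]
    results = []
    for i in range(1, len(tokens) + 1):
        head = " ".join(tokens[:i])
        for rest in segmentations(tokens[i:]):
            results.append([head] + rest)
    return results


def best_parse(query, phrase_prob, unseen=1e-9):
    """Pick the segmentation with the highest product of phrase probabilities."""
    def score(parsing):
        probability = 1.0
        for phrase in parsing:
            probability *= phrase_prob.get(phrase, unseen)
        return probability

    return max(segmentations(query.split()), key=score)


# Hypothetical phrase probabilities, e.g. estimated from query logs.
phrase_prob = {
    "senior": 0.02, "java": 0.03, "developer": 0.03, "hadoop": 0.01,
    "java developer": 0.01, "senior java": 0.0001,
}

print(len(segmentations("senior java developer hadoop".split())))
print(best_parse("senior java developer hadoop", phrase_prob))
```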

Demo: Probabilistic Query Parser

Semantic Query Parsing

Identification of phrases in queries using two steps:

1) Check a dictionary of known terms that is continuously built, cleaned, and refined based upon common inputs from interactions with real users of the system. The SolrTextTagger works well for this.*

2) Also invoke a probabilistic query parser to dynamically identify unknown phrases using statistics from a corpus of data (language model)

*K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.

query augmentation


Knowledge Graph

Semantic Data Encoded into Free Text Content

[Diagram: node types layered over free-text documents]

NodeType: CharacterSequence
e, en, eng, engi, …, engineer, engineers

NodeType: Term
engineer, engineers, engineering, software, …

NodeType: TermSequence
software engineer, software engineers, electrical engineering, …

NodeType: Document
id: 1
text: looking for a software engineer with degree in computer science or electrical engineering
id: 2
text: apply to be a software engineer and work with other great software engineers
id: 3
text: start a great career in electrical engineering
…

Knowledge Graph

Documents

id: 1
job_title: Software Engineer
desc: software engineer at a great company
skills: .Net, C#, java

id: 2
job_title: Registered Nurse
desc: a registered nurse at hospital doing hard work
skills: oncology, phlebotomy

id: 3
job_title: Java Developer
desc: a software engineer or a java engineer doing work
skills: java, scala, hibernate

Terms-Docs Inverted Index

field       term             postings list
                             doc   pos
desc        a                1     4
                             2     1
                             3     1, 5
            at               1     3
                             2     4
            company          1     6
            doing            2     6
                             3     8
            engineer         1     2
                             3     3, 7
            great            1     5
            hard             2     7
            hospital         2     5
            java             3     6
            nurse            2     3
            or               3     4
            registered       2     2
            software         1     1
                             3     2
            work             2     8
                             3     9
job_title   java developer   3     1
…           …                …     …

Docs-Terms Forward Index

field       doc   terms
desc        1     a, at, company, engineer, great, software
            2     a, at, doing, hard, hospital, nurse, registered, work
            3     a, doing, engineer, java, or, software, work
job_title   1     Software Engineer
…           …     …

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.


Knowledge Graph

How the Graph Traversal Works

[Diagram in three views (Set-theory View, Graph View, Data Structure View) relating the skill values Java, Scala, Hibernate, and Oncology to doc 1 through doc 6]

Data Structure View:

Java               docs 1, 2, 6
Scala, Hibernate   docs 3, 4
Oncology           doc 5

Knowledge Graph

Graph Model

Structure:

Single-level Traversal / Scoring:

Multi-level Traversal / Scoring:


Knowledge Graph

Multi-level Traversal

[Diagram, shown as a Data Structure View and a Graph View: skill values (Java, Scala, Hibernate, Oncology) are connected to doc 1 through doc 6, and those documents are connected to job_title values (Software Engineer, Data Scientist, Java Developer, …), alternating Inverted Index Lookups and Forward Index Lookups. In the Graph View, the skills Java, Hibernate, and Scala link to the job titles Java Developer, Software Engineer, and Data Scientist through has_related_job_title edges.]

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

Scoring nodes in the Graph

Foreground vs. Background Analysis

Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.

\[
z = \frac{\mathrm{countFG}(x) - \mathrm{totalDocsFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalDocsFG} \cdot \mathrm{probBG}(x) \cdot \big(1 - \mathrm{probBG}(x)\big)}}
\]

Foreground Query: "Hadoop"

{ "type":"keywords", "values":[
  { "value":"hive", "relatedness": 0.9765, "popularity":369 },
  { "value":"spark", "relatedness": 0.9634, "popularity":15653 },
  { "value":".net", "relatedness": 0.5417, "popularity":17683 },
  { "value":"bogus_word", "relatedness": 0.0, "popularity":0 },
  { "value":"teaching", "relatedness": -0.1510, "popularity":9923 },
  { "value":"CPR", "relatedness": -0.4012, "popularity":27089 } ] }
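The z formula translates directly into code. The sketch below computes only the raw z-score; the relatedness values in the JSON lie between -1 and 1, so some further normalization is presumably applied, which is not reproduced here. The counts used in the example are invented.

```python
import math


def z_score(count_fg, total_docs_fg, prob_bg):
    """How far the foreground count deviates from what the background predicts."""
    expected = total_docs_fg * prob_bg
    variance = total_docs_fg * prob_bg * (1 - prob_bg)
    if variance == 0:
        return 0.0  # term absent from (or present in all of) the background
    return (count_fg - expected) / math.sqrt(variance)


# Hypothetical counts: 100 foreground docs, term in 10% of background docs.
print(z_score(count_fg=50, total_docs_fg=100, prob_bg=0.1))  # over-represented
print(z_score(count_fg=10, total_docs_fg=100, prob_bg=0.1))  # as expected
print(z_score(count_fg=1, total_docs_fg=100, prob_bg=0.1))   # under-represented
```

A positive score means the term is more common in the foreground than the background predicts, a score near zero means no association, and a negative score means the term is less likely to appear in that context.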


Knowledge Graph

Multi-level Graph Traversal with Scores

[Graph diagram: the starting node "software engineer*" (materialized node) connects through has_related_skill edges to skill nodes, then through further has_related_skill edges to more skill nodes, and finally through has_related_job_title edges to job title nodes. Skill nodes shown: Java, C#, .NET, Hibernate, Scala, VB.NET. Job title nodes shown: .NET Developer, Java Developer, Software Engineer, Data Scientist. Each edge carries a score; the values shown range from 0.34 to 0.93.]

Knowledge Graph

Use Case: Document Summarization

Experiment: Pass in raw text (extracting phrases as needed), and rank their similarity to the documents using the SKG.

Additionally, can traverse the graph to “related” entities/keyword phrases NOT found in the original document

Applications: Content-based and multi-modal recommendations (no cold-start problem), data cleansing prior to clustering or other ML methods, semantic search / similarity scoring

Demo: Semantic Knowledge Graph


streaming expressions

Streaming Expressions

• Perform relational operations on streams
• Stream sources: search, jdbc, facets, features, gatherNodes, shortestPath, train, model, random, stats, topic
• Stream decorators: classify, commit, complement, daemon, executor, fetch, having, leftOuterJoin, hashJoin, innerJoin, intersect, merge, null, outerHashJoin, parallel, priority, reduce, rollup, scoreNodes, select, sort, top, unique, update

Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.

Streaming Expressions - Examples

• Shortest-path Graph Traversal
• Parallel Batch Processing
• Train a Logistic Regression Model
• Distributed Joins
• Rapid Export of all Search Results
• Pull Results from External Database
• Classifying Search Results

Sources:
https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html

Additional References:

http://www.treygrainger.com/posts/presentations/crowdsourced-query-augmentation-through-the-semantic-discovery-of-domain-specific-jargon/

http://www.treygrainger.com/posts/presentations/building-a-real-time-big-data-analytics-platform-with-solr/

http://www.treygrainger.com/posts/presentations/scaling-recommendations-semantic-search-data-analytics-with-solr/

http://www.treygrainger.com/posts/presentations/enhancing-relevancy-through-personalization-semantic-search/

http://www.treygrainger.com/posts/presentations/semantic-multilingual-strategies-in-lucenesolr/

http://www.treygrainger.com/posts/presentations/leveraging-lucene-solr-as-a-knowledge-graph-and-intent-engine/

http://www.treygrainger.com/posts/presentations/reflected-intelligence-evolving-self-learning-data-systems/

http://www.treygrainger.com/posts/presentations/searching-on-intent-knowledge-graphs-personalization-and-contextual-disambiguation/

http://www.treygrainger.com/posts/presentations/building-a-cloud-like-knowledge-discovery-platform/

http://www.treygrainger.com/posts/presentations/reflected-intelligence-lucene-solr-as-a-self-learning-data-system/

http://www.treygrainger.com/posts/presentations/building-a-real-time-solr-powered-recommendation-engine/

http://www.treygrainger.com/posts/resume/south-big-data-hub-text-data-analysis-panel/

https://www.researchgate.net/publication/265512095_Augmenting_recommendation_systems_using_a_model_of_semantically-related_terms_extracted_from_user_behavior

https://www.researchgate.net/publication/264160850_PGMHD_A_Scalable_Probabilistic_Graphical_Model_for_Massive_Hierarchical_Data_Problems

https://www.researchgate.net/publication/283329737_Query_Sense_Disambiguation_Leveraging_Large_Scale_User_Behavioral_Data

https://www.researchgate.net/publication/283980991_Improving_the_Quality_of_Semantic_Relationships_Extracted_from_Massive_User_Behavioral_Data

https://www.researchgate.net/publication/282816550_Crowdsourced_Query_Augmentation_through_Semantic_Discovery_of_Domain-specific_Jargon

https://www.researchgate.net/publication/306926620_Entity_Type_Recognition_Using_an_Ensemble_of_Distributional_Semantic_Models_to_Enhance_Query_Understanding

https://www.researchgate.net/publication/304859620_Application_of_Statistical_Relational_Learning_to_Hybrid_Recommendation_Systems

https://www.researchgate.net/publication/288529613_Mining_Massive_Hierarchical_Data_Using_a_Scalable_Probabilistic_Graphical_Model

https://www.researchgate.net/publication/308368512_Macro-optimization_of_email_recommendation_response_rates_harnessing_individual_activity_levels_and_group_affinity_trends

https://www.researchgate.net/publication/307604163_The_Semantic_Knowledge_Graph_A_compact_auto-generated_model_for_real-time_traversal_and_ranking_of_any_relationship_within_a_domain

http://www.treygrainger.com/posts/presentations/apache-solr-smart-data-ecosystem

http://www.treygrainger.com/posts/presentations/the-semantic-knowledge-graph/

Contact Info

Trey Grainger
[email protected]
@treygrainger

http://solrinaction.com
Meetup discount (39% off): 39grainger

Other presentations: http://www.treygrainger.com
