Using the LucidWorks REST API to Support User-Configuration Big Data Search Experience

Kitenga reinventing information

Mark Davis, Founder/CTO

Enabling Big Data Search via the Lucid ReST API

Big Data

•  Enormous transactional data
•  Enormous unstructured information
•  Too big for databases
•  New tools are needed

Decimal unit      Value    Binary unit       Value
kilobyte (kB)     10^3     kibibyte (KiB)    2^10
megabyte (MB)     10^6     mebibyte (MiB)    2^20
gigabyte (GB)     10^9     gibibyte (GiB)    2^30
terabyte (TB)     10^12    tebibyte (TiB)    2^40
petabyte (PB)     10^15    pebibyte (PiB)    2^50
exabyte (EB)      10^18    exbibyte (EiB)    2^60
zettabyte (ZB)    10^21    zebibyte (ZiB)    2^70
yottabyte (YB)    10^24    yobibyte (YiB)    2^80
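The gap between the decimal and binary units in the table widens with each step. The short sketch below computes it; the function name is illustrative and not part of the original slides.

```python
# Decimal (SI) prefixes grow by 10^3 per step; binary (IEC) prefixes by 2^10.
PREFIXES = ["kilo/kibi", "mega/mebi", "giga/gibi", "tera/tebi",
            "peta/pebi", "exa/exbi", "zetta/zebi", "yotta/yobi"]


def binary_excess_percent(step: int) -> float:
    """Percent by which the binary unit at `step` exceeds the decimal unit."""
    decimal = 10 ** (3 * step)
    binary = 2 ** (10 * step)
    return (binary / decimal - 1) * 100


if __name__ == "__main__":
    for step, name in enumerate(PREFIXES, start=1):
        print(f"{name:10s} {binary_excess_percent(step):6.2f}%")
```

The difference is 2.4% at the kilobyte level and roughly 20.9% at the yottabyte level, which is why the distinction matters more as data volumes grow.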

Volume Velocity Variety

Gather Resources

•  Crawl
•  Crack formats

Extract Metadata

•  Named entities

•  Categories
•  Machine learning

•  Semantic analysis

Index

•  Schema definition

•  Collection management

Indexing Challenges

•  Complex, varied data
•  Compute-intensive metadata generation
•  Schema and collection management

Initial Query

•  Keyword guesses

•  Category guidance

Refine Query

•  Analytic tools

•  Facetted guidance

Evaluate Relevance

•  Read KWIC
•  Read metadata

•  Read document
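KWIC (keyword in context) shows each hit with a little surrounding text so a reader can judge relevance without opening the document. A minimal sketch, assuming plain-text input and a fixed character window; the function name and bracket highlighting are illustrative choices:

```python
import re
from typing import List


def kwic(text: str, keyword: str, width: int = 30) -> List[str]:
    """Return each occurrence of `keyword` with up to `width` characters
    of context on either side, marking the hit with square brackets."""
    snippets = []
    for match in re.finditer(re.escape(keyword), text, re.IGNORECASE):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        left = text[start:match.start()]
        right = text[match.end():end]
        snippets.append(f"{left}[{match.group(0)}]{right}")
    return snippets


if __name__ == "__main__":
    sample = "Flash memory wears out. Controllers level flash writes."
    for line in kwic(sample, "flash", width=10):
        print(line)
```

A production search engine would normally generate these snippets itself through its highlighting feature; the sketch only illustrates the idea.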

Search Experience Challenges

•  Complex, varied data
•  Resource discovery
•  Facetted search experience management

The Solution

Enable fast metadata generation:

Hadoop, Mahout, GPUs

Manage and control collections and schema:

LucidWorks Enterprise API

[Diagram labels: SQL, RDBMS, Transactional Data, BI Tools; Search, Documents, Text Classification, Taxonomies, Ontologies; Parts-of-Speech Tagging, Tokenization, Lemmatization, Finite State Transducer, Machine Learning; Query Language, Metadata Extraction, Indexing; Facet Browsing, Facet Charting, Resource Integration, Autosuggest, Spellcheck]

•  Start to POC in a week
•  Open source intelligence problems

GOAL: Be more competitive

SOURCES: Patents, PR announcements, legal documents, whitepapers, crawled websites

ANALYSIS: Extract named entities and relationships, classify and label; visually understand relationships and trends

ACTION: Change R&D priorities and improve marketing approaches

[Diagram: Sources; ZettaVox (metadata, relationships, data entities); ZettaSearch (Facetted Search and Analytics)]

•  Understand IP among competitors
•  Assist legal team with litigation
•  Custom search experience
•  Custom extractors:
    •  Electronic parts
    •  Memory types
    •  Flash memory
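One simple way to build a custom extractor for domain terms such as memory types is a dictionary (gazetteer) lookup. The sketch below is an illustration only: the term lists, category names, and function names are hypothetical and are not taken from the system described in the slides.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical term lists; a real deployment would load curated vocabularies.
GAZETTEER: Dict[str, List[str]] = {
    "memory_type": ["DRAM", "SRAM", "NAND flash", "NOR flash"],
    "electronic_part": ["capacitor", "resistor", "transistor"],
}


def build_patterns(gazetteer: Dict[str, List[str]]) -> Dict[str, Pattern[str]]:
    """Compile one case-insensitive, whole-word pattern per category."""
    patterns = {}
    for category, terms in gazetteer.items():
        # Longest terms first so multi-word terms win over shorter ones.
        ordered = sorted(terms, key=len, reverse=True)
        alternation = "|".join(re.escape(term) for term in ordered)
        patterns[category] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return patterns


def extract(text: str, patterns: Dict[str, Pattern[str]]) -> Dict[str, List[str]]:
    """Return {category: sorted unique lower-cased matches} for the text."""
    found = {}
    for category, pattern in patterns.items():
        hits = {match.group(0).lower() for match in pattern.finditer(text)}
        if hits:
            found[category] = sorted(hits)
    return found


if __name__ == "__main__":
    text = "The patent covers NAND flash controllers and a DRAM buffer."
    print(extract(text, build_patterns(GAZETTEER)))
```

The extracted values can then be stored as metadata fields on each document so that they become available as facets at search time.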

5/15/12 . 14


Company      Documents   Size
Dell           102,508   9 GB
EMC            303,678   14 GB
Huawei          11,912   890 MB
Kingston         2,534   134 MB
Lenovo           8,305   542 MB
NEC              3,900   252 MB
Nokia          174,681   22 GB
Panasonic        5,804   473 MB
RIM                181   8 MB
Sharp USA       31,918   4.9 GB
Total          645,421   60.2 GB

GOAL: Discover new drugs, detect side-effects, speed R&D

SOURCES: Published research reports, patents, adverse effects databases, genomics and proteomics databases

ANALYSIS: [...] relationships, classify and label; visually discover trends and relationships

ACTION: Change R&D priorities

[Diagram: Sources; ZettaVox (relationships, data entities, pathways, sequences); ZettaSearch]

•  Lousy search (Google Search Appliance)
•  Internal regulators can’t find by accession number
•  Custom extractors:
    •  Accession number
    •  Ontology of active ingredients
    •  Drug names

© 2012 Kitenga Proprietary

GOAL: Build “second screen experiences”

SOURCES: Wikipedia, IMDb, blogs

ANALYSIS: [...] relationships, preserve existing structural metadata

ACTION: Enable new media experiences

[Diagram: Sources; ZettaVox (metadata, relationships, data entities); ZettaSearch]

•  Crawlers on Hadoop
•  Document format crackers on Hadoop
•  Extractors on Hadoop
•  Filters on Hadoop
•  HTTP documents to Solr sharded cluster
•  Intermediary files remain on HDFS for reprocessing
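The last hop of the pipeline above, sending processed documents over HTTP to a sharded Solr cluster, can be sketched as follows. This is an illustration under assumptions, not the presenters' implementation: the shard URLs and function names are hypothetical, the hash-based routing is one common manual sharding scheme, and the `/update` path with a JSON array body matches recent Solr versions (older versions used a different handler path). The code builds requests without sending them.

```python
import hashlib
import json
import urllib.request
from typing import Dict, List

# Hypothetical shard URLs; real hosts, ports and core names will differ.
SHARDS = [
    "http://solr1.example.com:8983/solr/docs",
    "http://solr2.example.com:8983/solr/docs",
]


def shard_for(doc_id: str, shard_count: int) -> int:
    """Pick a shard with a stable hash so reprocessing hits the same shard."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def group_by_shard(docs: List[dict], shard_count: int) -> Dict[int, List[dict]]:
    """Split documents into one batch per shard, keyed by shard index."""
    batches: Dict[int, List[dict]] = {i: [] for i in range(shard_count)}
    for doc in docs:
        batches[shard_for(doc["id"], shard_count)].append(doc)
    return batches


def build_update_request(shard_url: str, docs: List[dict]) -> urllib.request.Request:
    """Build (but do not send) a JSON update request for one shard."""
    return urllib.request.Request(
        url=f"{shard_url}/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    docs = [{"id": f"doc-{i}", "title": f"Title {i}"} for i in range(6)]
    for index, batch in group_by_shard(docs, len(SHARDS)).items():
        request = build_update_request(SHARDS[index], batch)
        print(request.get_method(), request.full_url, len(batch), "docs")
```

Passing each request to `urllib.request.urlopen` would perform the actual post. Because routing depends only on the document id, rerunning the job from the intermediary files on HDFS overwrites documents in place instead of duplicating them across shards.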

•  Missing piece of the puzzle
•  Addresses the impedance mismatch between Big Data technologies and Solr search

•  Manage collections
•  Manage schema

•  Create collections
•  Delete collections
•  Update collection properties
•  Create schema
•  Modify schema

•  Schema interrogation
•  Schema binding to user experience
•  Facetted search
•  Embedded analytics
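Binding the schema to the user experience can mean deriving the facet controls from the field list instead of hard-coding them. The sketch below assumes the schema has already been fetched as a list of field descriptions; the dictionary shape, the `facet` flag, and the function names are assumptions for illustration. The query parameters (`q`, `rows`, `wt`, `facet`, `facet.field`, `facet.mincount`) are standard Solr parameters.

```python
from typing import List
from urllib.parse import urlencode


def facetable_fields(schema_fields: List[dict]) -> List[str]:
    """Keep indexed fields that are flagged for facet display.
    The `indexed` and `facet` keys are an assumed field description shape."""
    return [f["name"] for f in schema_fields if f.get("indexed") and f.get("facet")]


def build_facet_query(base_url: str, user_query: str,
                      schema_fields: List[dict], rows: int = 10) -> str:
    """Build a Solr select URL that facets on every facetable schema field."""
    params = [
        ("q", user_query),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.mincount", 1),
    ]
    params += [("facet.field", name) for name in facetable_fields(schema_fields)]
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    fields = [
        {"name": "company", "indexed": True, "facet": True},
        {"name": "memory_type", "indexed": True, "facet": True},
        {"name": "body", "indexed": True, "facet": False},
    ]
    print(build_facet_query("http://localhost:8983/solr/docs", "flash memory", fields))
```

When a user adds or removes a field through the schema API, the facet panel follows on the next query with no change to the front-end code.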

•  Big Data search and analytics have many challenges:
    •  Volume of data
    •  Variety of data
    •  Velocity of data
    •  Extracting structure from unstructured information

•  Hadoop processing enables each of these aspects
•  Controlling indexing and search is enabled by the Lucid Imagination search API
•  We can enable complex user interactions with Big Data on a self-serve basis

[Architecture diagram labels: ZettaVox Author RIA; Analyst Browser; Enterprise servers; Cloud services; Tomcat App Server; Tomcat Web Services; ZettaVox Services Manager; XML + JSON; Amazon S3; GPU Services Manager; GPU MR Service Manager; GPU (x2); Enterprise Cloud; Hadoop Services Manager; Hadoop Server Job Tracker; Hadoop Task Manager (x3); Hadoop Server Name node; Search Indexing; Mahout; Entity Extraction; Crawling; Quantum4D; RDBMS; ReST; JSON]

[Architecture diagram labels: ZettaVox Author RIA; Analyst Browser; Enterprise servers; Hadoop Server Job Tracker; Hadoop Task Manager (x3); Hadoop Server Name node; Search Indexing; Mahout; Entity Extraction; Crawling; ReST; JSON]

•  Get collection information
•  Create new collection
•  Create fields
•  Delete fields
•  Edit fields
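The operations listed above map naturally onto REST verbs. The sketch below shows one way a client could build such requests. The base URL, port, resource paths, and JSON keys such as `field_type` are assumptions about a LucidWorks Enterprise style API and should be checked against the product documentation; all function names are hypothetical. Requests are built but not sent.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import quote

# Assumed API root; host, port and path prefix depend on the installation.
API_BASE = "http://localhost:8888/api"


def _segment(name: str) -> str:
    """URL-encode a collection or field name for use as a path segment."""
    return quote(name, safe="")


def api_request(method: str, path: str,
                payload: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a JSON request against the assumed API root."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(
        url=f"{API_BASE}{path}",
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )


def get_collection(name: str) -> urllib.request.Request:
    return api_request("GET", f"/collections/{_segment(name)}")


def create_collection(name: str) -> urllib.request.Request:
    return api_request("POST", "/collections", {"name": name})


def create_field(collection: str, name: str, field_type: str,
                 **attrs) -> urllib.request.Request:
    payload = {"name": name, "field_type": field_type, **attrs}
    return api_request("POST", f"/collections/{_segment(collection)}/fields", payload)


def edit_field(collection: str, name: str, **attrs) -> urllib.request.Request:
    path = f"/collections/{_segment(collection)}/fields/{_segment(name)}"
    return api_request("PUT", path, attrs)


def delete_field(collection: str, name: str) -> urllib.request.Request:
    path = f"/collections/{_segment(collection)}/fields/{_segment(name)}"
    return api_request("DELETE", path)


if __name__ == "__main__":
    for request in (create_collection("patents"),
                    create_field("patents", "assignee", "string", facet=True),
                    delete_field("patents", "assignee")):
        print(request.get_method(), request.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would issue the call. Keeping request construction separate from sending makes the client easy to test without a running server.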

Indexing

Questions?
