Integrating Hadoop & Solr


Description

Silicon Valley Code Camp 2014: presented by Yann Yu, Systems Engineer, Lucidworks.

Transcript of Integrating Hadoop & Solr

Page 1: Integrating Hadoop & Solr
Page 2: Integrating Hadoop & Solr

Who am I?

Yann Yu, Systems Engineer @ Lucidworks

Page 3: Integrating Hadoop & Solr

Lucidworks is search.

Technology, Retail, Financial Services, Industrial, Healthcare

Page 4: Integrating Hadoop & Solr

Lucidworks is the commercial entity of the Lucene/Solr project.

Solr is both established & growing:

• 8M+ total downloads; 250,000+ monthly downloads
• The largest community of developers; 2,500+ open Solr jobs
• Solr: the most widely used search solution on the planet

Lucidworks: unmatched Solr expertise.

• 1/3 of the active committers
• 70% of the open source code committed

Lucene/Solr Revolution: the world's largest open source user conference dedicated to Lucene/Solr.

Solr has tens of thousands of applications in production.

You use Solr every day.

Page 5: Integrating Hadoop & Solr

Why would you integrate Hadoop and Solr? (And how would you do that?)

Page 6: Integrating Hadoop & Solr

Hadoop:

• Open-source
• Enterprise support
• Cheap, scalable storage
• Distributed computation
• Farm animals and many other related projects for extensibility

Solr:

• Open-source, Lucene-based
• Enterprise support
• Real-time queries
• Full-text search
• NoSQL capabilities
• Repeatedly proven in production environments at massive scales
• Uses ZooKeeper for clustering

Page 7: Integrating Hadoop & Solr

I have Hadoop, why do I need Solr?

• NoSQL front-end to Hadoop: enable fast, ad-hoc search across structured and unstructured big data
• Empower users of all technical abilities to interact with, and derive value from, big data, all through a natural language search interface (no MapReduce, Pig, SQL, etc.)
• Preliminary data exploration and analysis
• Near real-time indexing and querying
• Thousands of simultaneous, parallel requests
• Share machine-learning insights created on Hadoop with a broad audience through an interactive medium

Hadoop excels at storing and working with large amounts of data, but struggles with frequent, random access to it. A query sketch follows below.
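As a rough illustration of that ad-hoc, full-text access, here is a minimal SolrJ sketch (SolrJ 6/7-style API). The URL, the "enterprise" collection, and the field names are assumptions for the example, not details from the talk:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class AdHocSearch {
    public static void main(String[] args) throws Exception {
        // Assumed Solr URL and collection name.
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/enterprise").build();

        // Full-text search plus a structured filter, no MapReduce required.
        SolrQuery query = new SolrQuery("contents:\"quarterly report\"");
        query.addFilterQuery("type:pdf");   // hypothetical field
        query.setRows(10);

        QueryResponse rsp = solr.query(query);
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("id"));
        }
        solr.close();
    }
}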

Page 8: Integrating Hadoop & Solr

I have Solr, why do I need Hadoop?

• Least expensive storage solution on the market
• Leverage Hadoop processing power (MapReduce) to build indexes or send document updates to Solr (see the mapper sketch below)
• Store Solr indexes and transaction logs within HDFS
• Augment Solr data by storing additional information in Hadoop for last-second retrieval

As Solr indexes grow in size, the size and number of the machines hosting Solr must also grow, increasing indexing time and complexity.
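To make the "send document updates to Solr from MapReduce" idea concrete, here is a hedged sketch of a Hadoop mapper that pushes each input line to Solr through SolrJ. This is not the Lucidworks connector's code; the URL, collection, and fields are assumptions:

import java.io.IOException;
import java.util.UUID;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrUpdateMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private HttpSolrClient solr;

    @Override
    protected void setup(Context ctx) {
        // Assumed URL; a ZooKeeper-aware CloudSolrClient would be the
        // SolrCloud-native alternative.
        solr = new HttpSolrClient.Builder(
                "http://solr-host:8983/solr/enterprise").build();
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", UUID.randomUUID().toString()); // hypothetical key scheme
        doc.addField("contents", line.toString());
        try {
            solr.add(doc);
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
        try {
            solr.commit();   // make the batch visible when the task finishes
            solr.close();
        } catch (Exception e) {
            throw new IOException(e);
        }
    }
}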

Page 9: Integrating Hadoop & Solr


So what does this solve?

Page 10: Integrating Hadoop & Solr

The enterprise storage situation today

• Large enterprises often have data distributed across many different stores, making it hard to know where to start looking
• Employees have to check with others to verify versions of documents
• Even with hosting, knowledge is still largely tribal

Page 11: Integrating Hadoop & Solr

Enterprise data deployment

Standard document storage and search:

1. Enterprise documents are stored in HDFS.
2. The Lucidworks HDFS connector processes documents and sends them to SolrCloud.
3. Users make ad-hoc, full-text queries across the full content of all documents in Solr.
4. Users retrieve source files directly from HDFS as necessary.

Page 12: Integrating Hadoop & Solr

Sink documents into HDFS

• Documents can be migrated from other file storage systems via Flume or other scripts (see the sketch below)
• MapReduce allows for batch processing of documents (e.g. OCR, NER, clustering)
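As a minimal sketch of the "other scripts" route, the Hadoop FileSystem API can sink a local document into HDFS. The NameNode URI and paths are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SinkToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // assumed NameNode

        try (FileSystem fs = FileSystem.get(conf)) {
            fs.copyFromLocalFile(
                new Path("/tmp/contract.pdf"),              // local source
                new Path("/data/docs/contract.pdf"));       // HDFS destination
        }
    }
}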

Page 13: Integrating Hadoop & Solr

Index document contents into Solr

• The Lucidworks Hadoop connector parses content from files using many different tools: Tika, GrokIngest, CSV mapping, Pig, etc.
• Content and data are added to fields in a Solr document
• The resulting document is sent to Solr for indexing (see the sketch below)
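A hedged sketch of that parse-then-index step, using Tika and SolrJ directly rather than the connector itself (whose internals the talk doesn't show); paths, URLs, and field names are assumptions:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ParseAndIndex {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/enterprise").build();

        try (FileSystem fs = FileSystem.get(conf);
             InputStream in = fs.open(new Path("/data/docs/contract.pdf"))) {

            BodyContentHandler handler = new BodyContentHandler(-1); // no size limit
            Metadata meta = new Metadata();
            new AutoDetectParser().parse(in, handler, meta);

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "/data/docs/contract.pdf");  // HDFS path as key
            doc.addField("title", meta.get("title"));       // may be null per file type
            doc.addField("contents", handler.toString());
            solr.add(doc);
            solr.commit();
        }
        solr.close();
    }
}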

Page 14: Integrating Hadoop & Solr

Enable users to search and access content

• Users are empowered with ad-hoc, full-text search in Solr
• Solr provides standard search tools such as autocomplete, more-like-this, spellchecking, and faceting (see the sketch below)
• Users only access HDFS as needed
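For the search-tools bullet, a hedged SolrJ sketch of faceting plus spellcheck. It assumes a spellcheck component is configured on the request handler, and the field names are made up:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchTools {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/enterprise").build();

        SolrQuery q = new SolrQuery("contents:bugdet");   // misspelled on purpose
        q.addFacetField("author", "type");                // facet counts per field
        q.set("spellcheck", true);                        // ask for corrections
        q.set("spellcheck.collate", true);

        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getFacetFields());
        System.out.println(rsp.getSpellCheckResponse().getCollatedResult());
        solr.close();
    }
}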

Page 15: Integrating Hadoop & Solr

The data warehouse

• Enterprises are storing data without a clear plan for how to access it
• The "data warehouse" is full of files, but offers no way to pull documents or to find what you're looking for
• In some cases, the data is kept only for compliance and isn't otherwise used

Page 16: Integrating Hadoop & Solr

Log record search

High-volume indexing of many small records:

1. Machine-generated log records are sent to Flume.
2. Flume forwards the raw log record to Hadoop for archiving.
3. Flume simultaneously parses the data in the record into a Solr document and forwards the resulting document to Solr.
4. Lucidworks SiLK exposes real-time statistics and analytics, as well as full-text search, to end users.

Page 17: Integrating Hadoop & Solr

Flume archives data in HDFS

• Flume performs minimal work on log files and sends them directly into HDFS for archival
• Under optimal circumstances, the log files are sized to match the HDFS block size

Page 18: Integrating Hadoop & Solr

Flume submits records to Solr

• Flume processes records, extracting strings, ints, dates, times, and other information into Solr fields (see the sketch below)
• Once the Solr document is created, it is submitted to Solr for indexing
• This happens in real time, allowing for near real-time search
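What "extracting strings, ints, dates" can look like, as a hedged Java sketch rather than the actual Flume/morphline configuration; the log format and the dynamic-field suffixes (_dt, _s, _i) are assumptions:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.solr.common.SolrInputDocument;

public class LogRecordParser {
    // Assumed format, e.g. "2014-10-11 09:15:02 ERROR 503 /checkout"
    private static final Pattern LINE = Pattern.compile(
            "(\\S+ \\S+) (\\w+) (\\d{3}) (\\S+)");

    public static SolrInputDocument parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) return null;            // skip malformed records

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("timestamp_dt", m.group(1).replace(' ', 'T') + "Z");
        doc.addField("level_s", m.group(2));
        doc.addField("status_i", Integer.parseInt(m.group(3)));
        doc.addField("path_s", m.group(4));
        return doc;
    }
}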

Page 19: Integrating Hadoop & Solr

Real-time analytics dashboard

• Lucidworks SiLK allows users to create simple dashboards through a GUI

• The SiLK dashboard will issue queries to Solr, rendering the received data in tables, graphs, and other plots

• Users can also perform full-text search across the data, allowing for extremely fine granularity
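The kind of query a dashboard panel issues can be sketched in SolrJ: a date-range facet that buckets matching log events per hour. Collection and field names are assumptions carried over from the parsing sketch above:

import java.time.Instant;
import java.util.Date;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.RangeFacet;

public class DashboardQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/logs").build();

        SolrQuery q = new SolrQuery("path_s:\\/checkout AND level_s:ERROR");
        q.setRows(0);                                     // counts only
        q.addDateRangeFacet("timestamp_dt",
                Date.from(Instant.parse("2014-10-11T00:00:00Z")),
                Date.from(Instant.parse("2014-10-12T00:00:00Z")),
                "+1HOUR");                                // one bucket per hour

        QueryResponse rsp = solr.query(q);
        RangeFacet<?, ?> perHour = rsp.getFacetRanges().get(0);
        for (RangeFacet.Count c : perHour.getCounts()) {
            System.out.println(c.getValue() + " -> " + c.getCount());
        }
        solr.close();
    }
}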

Page 20: Integrating Hadoop & Solr

High traffic Solr deployments

• Some Solr users, especially in e-commerce, run high-query-volume sites with small document sets
• Master-slave replication works well enough, but doesn't allow for NRT and similar features from SolrCloud

Page 21: Integrating Hadoop & Solr

E-commerce search: lots of queries, not a lot of updates.

1. Solr is pointed at an index on HDFS and pulls it up to begin serving queries.
2. Additional Solr machines can be spun up on demand, pulling the index directly from HDFS.
3. A load balancer (or SolrJ) distributes queries to the active nodes (see the sketch below).
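A hedged sketch of the "or SolrJ" option: LBHttpSolrClient round-robins requests across the ad-hoc nodes, so no external load balancer is needed. Node URLs and the collection are assumptions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.LBHttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BalancedQueries {
    public static void main(String[] args) throws Exception {
        LBHttpSolrClient lb = new LBHttpSolrClient.Builder()
                .withBaseSolrUrls(
                        "http://solr1:8983/solr/products",
                        "http://solr2:8983/solr/products",
                        "http://solr3:8983/solr/products")
                .build();

        // Queries rotate across the listed nodes; dead nodes are skipped.
        QueryResponse rsp = lb.query(new SolrQuery("name:sneakers"));
        System.out.println(rsp.getResults().getNumFound());
        lb.close();
    }
}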

Page 22: Integrating Hadoop & Solr

MapReduce Solr index generation

• Existing product tables or catalogs can be stored in HDFS or HBase, and can continue to be updated as necessary
• Hadoop can use the MapReduceIndexerTool to parallelize index building (see the sketch below)
• As many indexes as necessary can be built this way
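A heavily hedged sketch of driving MapReduceIndexerTool (from the Solr 4.x/5.x map-reduce contrib) via ToolRunner; the same flags are normally passed on the "hadoop jar" command line. Paths, the ZooKeeper address, the collection name, and the flag spellings should all be checked against your version:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.solr.hadoop.MapReduceIndexerTool;

public class BuildIndexes {
    public static void main(String[] args) throws Exception {
        int rc = ToolRunner.run(new Configuration(), new MapReduceIndexerTool(),
                new String[] {
                        "--morphline-file", "morphline.conf", // parse/mapping rules
                        "--output-dir", "hdfs://namenode:8020/indexes",
                        "--zk-host", "zk1:2181/solr",
                        "--collection", "products",
                        "--go-live",                          // merge into live Solr
                        "hdfs://namenode:8020/catalog"        // input documents
                });
        System.exit(rc);
    }
}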

Page 23: Integrating Hadoop & Solr

Ad-hoc scaling without manual replication

• Independent Solr nodes (not SolrCloud) can be started up against the index data stored on HDFS
• These can be spun up in an ad-hoc fashion, allowing for an elastically scalable cluster
• Index updates are versatile: they can be pushed in via new collections or as updates to existing collections

Page 24: Integrating Hadoop & Solr

Highly-available search

• New search nodes are simply added to the load balancer or smart client
• Distributed queries allow for sharded data sets (see the sketch below)
• Results from all nodes are guaranteed to be consistent with one another
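One way to issue such a distributed query over manually sharded nodes is the standard "shards" parameter, sketched below; the shard URLs are assumptions, and SolrCloud performs this routing automatically:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ShardedQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://solr1:8983/solr/products").build();

        SolrQuery q = new SolrQuery("name:sneakers");
        // The receiving node fans the query out to each shard and merges results.
        q.set("shards",
              "solr1:8983/solr/products,solr2:8983/solr/products");

        System.out.println(solr.query(q).getResults().getNumFound());
        solr.close();
    }
}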

Page 25: Integrating Hadoop & Solr
Page 26: Integrating Hadoop & Solr

End

Any questions?

Find me at: [email protected]

@yawnyou

Page 27: Integrating Hadoop & Solr