Integrating Hadoop & Solr

Transcript of Integrating Hadoop & Solr

Page 1: Integrating Hadoop & Solr
Page 2: Integrating Hadoop & Solr

Yann Yu Systems Engineer @ Lucidworks

Who am I?

Page 3: Integrating Hadoop & Solr

Lucidworks is Search.

Technology, Retail, Financial Services, Industrial, Healthcare

Page 4: Integrating Hadoop & Solr

Why would you integrate Hadoop and Solr? (And how would you do that?)

Page 5: Integrating Hadoop & Solr

Hadoop

• Open-source
• Enterprise support
• Cheap, scalable storage
• Distributed computation
• Farm animals for extensibility

Solr

• Open-source, Lucene based
• Enterprise support
• Real-time queries
• Full-text search
• NoSQL capabilities
• Repeatedly proven in production environments at massive scales

Page 6: Integrating Hadoop & Solr

I have Hadoop, why do I need Solr?

• NoSQL front-end to Hadoop: enable fast, ad-hoc search across structured and unstructured big data

• Empower users of all levels of technical ability to interact with, and derive value from, big data, all through a natural-language search interface (no MapReduce, Pig, SQL, etc.)

• Preliminary data exploration and analysis

• Near real-time indexing and querying

• Thousands of simultaneous, parallel requests

• Share machine-learning insights created on Hadoop to a broad audience through an interactive medium

Hadoop excels in storing and working with large amounts of data, but has difficulty with frequent, random access to it

Page 7: Integrating Hadoop & Solr

I have Solr, why do I need Hadoop?

• Least expensive storage solution on the market

• Leverage Hadoop processing power (MapReduce) to build indexes or send document updates to Solr

• Store Solr indexes and transaction logs within HDFS

• Augment Solr data by storing additional information for last-second retrieval in Hadoop

As Solr indexes grow in size, the size and number of the machines hosting Solr must also grow, increasing indexing time and complexity

Page 8: Integrating Hadoop & Solr

So what does this actually look like?

Page 9: Integrating Hadoop & Solr

The enterprise storage situation today

Page 10: Integrating Hadoop & Solr

Enterprise data deployment

Enterprise documents are stored in HDFS

Lucidworks HDFS connector processes documents and sends them to SolrCloud

Users make ad-hoc, full-text queries across the full content of all documents in Solr

And retrieve source files directly from HDFS as necessary

Standard document storage and search
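
Editor's sketch: one possible way to do the last step, fetching a source file from HDFS once a search hit identifies it. It assumes WebHDFS is enabled on the cluster and that each Solr document carries the file's absolute HDFS path; the host name, port, and file path below are placeholders, not values from the talk.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def webhdfs_open_url(namenode: str, hdfs_path: str, user: Optional[str] = None) -> str:
    """Build a WebHDFS OPEN URL for an absolute HDFS path.

    `namenode` is "host:port" of the NameNode HTTP endpoint (assumed).
    """
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be absolute")
    params = {"op": "OPEN"}
    if user:
        # Only relevant on clusters using simple (non-Kerberos) authentication.
        params["user.name"] = user
    # quote() keeps "/" intact and percent-encodes spaces and other unsafe characters.
    return f"http://{namenode}/webhdfs/v1{quote(hdfs_path)}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder host and path, for illustration only.
    print(webhdfs_open_url("namenode.example.com:9870", "/docs/2014/annual report.pdf"))
```

A client would issue an HTTP GET against the returned URL and follow the redirect to the DataNode that holds the data.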

Page 11: Integrating Hadoop & Solr

Sink documents into HDFS

• Documents can be migrated from other file storage systems via Flume or other scripts

• MapReduce allows for batch processing of documents (e.g. OCR, NER, clustering, etc.)

Page 12: Integrating Hadoop & Solr

Index document contents into Solr

• The Lucidworks Hadoop connector parses content from files using many different tools

• Tika, GrokIngest, CSV mapping, Pig, etc.

• Content and data are added to fields in a Solr document

• The resulting document is sent to Solr for indexing
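
Editor's sketch of the last two bullets: this is not the Lucidworks connector, only a minimal illustration of placing extracted content into the fields of a Solr document and preparing it for Solr's JSON update handler. The collection name `documents`, the field name `content_txt`, and the Solr base URL are assumptions; the text is assumed to have been extracted already (for example by Tika).

```python
import json
from urllib.request import Request


def build_solr_doc(hdfs_path, text, extra_fields=None):
    """Map one parsed file to a Solr document (field names are hypothetical)."""
    doc = {"id": hdfs_path, "content_txt": text}
    doc.update(extra_fields or {})
    return doc


def build_update_request(solr_base, collection, docs):
    """Prepare a POST to the collection's /update handler with a JSON array of documents."""
    url = f"{solr_base.rstrip('/')}/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")


if __name__ == "__main__":
    doc = build_solr_doc("/docs/a.txt", "hello", {"author_s": "jdoe"})
    req = build_update_request("http://localhost:8983/solr", "documents", [doc])
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req) against a running Solr.
```

Committing on every request keeps the example short; a bulk load would more likely batch many documents per request and commit less often.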

Page 13: Integrating Hadoop & Solr

Enable users to search and access content

• Users are empowered with ad-hoc, full-text search in Solr

• Provides standard search tools such as autocomplete, more-like-this, spellchecking, faceting, etc.

• Users only access HDFS as needed
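
Editor's sketch of what an ad-hoc, full-text query with faceting might look like against Solr's standard select handler. The base URL, the collection name `documents`, and the facet field `author_s` are assumptions for illustration; the query is assumed to run against whatever default search field the collection configures.

```python
from urllib.parse import urlencode


def build_search_url(solr_base, collection, text, facet_fields=(), rows=10):
    """Build a /select URL for a free-text query, optionally faceting on some fields."""
    params = [("q", text), ("rows", rows), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on.
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{solr_base.rstrip('/')}/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("http://localhost:8983/solr", "documents",
                           "quarterly report", ["author_s"]))
```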

Page 14: Integrating Hadoop & Solr

Log record search

Machine-generated log records are sent to Flume.

Flume forwards raw log records to Hadoop for archiving.

Flume simultaneously parses the data in each record into a Solr document, forwarding the resulting document to Solr.

Lucidworks SiLK exposes real-time statistics and analytics to end users, as well as full-text search.

High volume indexing of many small records

Page 15: Integrating Hadoop & Solr

Flume archives data in HDFS

• Flume performs minimal work on log files and sends them directly into HDFS for archival

• Under optimal circumstances, the log files are sized to the block size of HDFS
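
Editor's sketch of the sizing idea in the second bullet: group incoming records into files that stay within one HDFS block. The 128 MB block size is an assumption (it is configurable per cluster), and `plan_rolls` is a hypothetical helper, not part of Flume; in Flume itself, the HDFS sink's roll settings would play this role.

```python
HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes; check the cluster's setting


def plan_rolls(record_sizes, block_size=HDFS_BLOCK_SIZE):
    """Group consecutive record sizes into files no larger than block_size.

    A single record larger than block_size still gets a file of its own.
    """
    files, current, current_bytes = [], [], 0
    for size in record_sizes:
        # Roll to a new file when adding this record would overflow the block.
        if current and current_bytes + size > block_size:
            files.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        files.append(current)
    return files


if __name__ == "__main__":
    # Four 40-byte records with a 100-byte "block" end up in two files.
    print(plan_rolls([40, 40, 40, 40], block_size=100))
```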

Page 16: Integrating Hadoop & Solr

Flume submits records to Solr

• Flume processes records, extracting strings, ints, dates, times, and other information into Solr fields

• Once the Solr document is created, it is submitted to Solr for indexing

• This process happens in real time, allowing for near real-time search
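
Editor's sketch of the extraction step described above, written in plain Python rather than as Flume configuration. The log line layout (timestamp, host, level, status code, message) and the Solr field names are assumptions made for this example.

```python
import re
from datetime import datetime

# Hypothetical log layout: "<ISO timestamp> <host> <LEVEL> <status> <message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<host>\S+) (?P<level>[A-Z]+) (?P<status>\d{3}) (?P<message>.*)"
)


def log_line_to_solr_doc(line):
    """Parse one log line into a dict of typed Solr fields, or None if it does not match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    # Validate the timestamp and re-emit it in the ISO-8601 UTC form Solr date fields expect.
    ts = datetime.strptime(match["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "id": f"{match['host']}-{match['timestamp']}",  # simplistic id; may collide in practice
        "timestamp_dt": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "host_s": match["host"],
        "level_s": match["level"],
        "status_i": int(match["status"]),
        "message_txt": match["message"],
    }


if __name__ == "__main__":
    print(log_line_to_solr_doc("2014-11-12T08:15:30Z web01 ERROR 500 upstream timed out"))
```

Typed fields (an integer status, a date timestamp) are what make the statistics and range queries on the dashboard side possible, while the message text stays available for full-text search.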

Page 17: Integrating Hadoop & Solr

Real-time analytics dashboard

• Lucidworks SiLK allows users to create simple dashboards through a GUI

• The Banana dashboard will issue queries to Solr, rendering the received data in tables, graphs, and other plots

• Users can also perform full-text search across the data, allowing for extremely fine granularity
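
Editor's sketch of the kind of query a dashboard panel could issue to Solr to draw an events-per-minute histogram, using Solr's range faceting parameters. This is not taken from Banana's source; the collection name `logs` and the field `timestamp_dt` are assumptions.

```python
from urllib.parse import urlencode


def build_histogram_url(solr_base, collection, time_field="timestamp_dt",
                        start="NOW-1HOUR", end="NOW", gap="+1MINUTE", query="*:*"):
    """Build a /select URL that returns per-interval counts instead of documents."""
    params = [
        ("q", query),
        ("rows", 0),  # only the facet counts are needed, not the matching documents
        ("wt", "json"),
        ("facet", "true"),
        ("facet.range", time_field),
        ("facet.range.start", start),
        ("facet.range.end", end),
        ("facet.range.gap", gap),
    ]
    return f"{solr_base.rstrip('/')}/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_histogram_url("http://localhost:8983/solr", "logs", query="level_s:ERROR"))
```

Swapping the `query` argument for a free-text term narrows the same histogram to matching records, which is how full-text search and the dashboard views can combine.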

Page 18: Integrating Hadoop & Solr
Page 19: Integrating Hadoop & Solr

End

Any questions?

Find me at: [email protected]

@yawnyou