Solr + Hadoop: Interactive Search for Hadoop


Page 1: Solr + Hadoop: Interactive Search for Hadoop


Solr + Hadoop: Interactive Search for Hadoop

Gregory Chanan (gchanan AT cloudera.com)
OC Big Data Meetup 07/16/14

Page 2: Solr + Hadoop: Interactive Search for Hadoop

Agenda

• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Security
• Conclusion

Page 3: Solr + Hadoop: Interactive Search for Hadoop

Agenda

• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Security
• Conclusion

Page 4: Solr + Hadoop: Interactive Search for Hadoop

Why Search?

• Hadoop for everyone
• Typical case:
  • Ingest data to storage engine (HDFS, HBase, etc.)
  • Process data (MapReduce, Hive, Impala)
• Experts know MapReduce
• Savvy people know SQL
• Everyone knows Search!

Page 5: Solr + Hadoop: Interactive Search for Hadoop

Why Search?

An Integrated Part of the Hadoop System

One pool of data

One security framework

One set of system resources

One management interface

Page 6: Solr + Hadoop: Interactive Search for Hadoop

Benefits of Search

• Improved Big Data ROI
  • An interactive experience without technical knowledge
• Faster time to insight
  • Exploratory analysis, esp. unstructured data
  • Broad range of indexing options to accommodate needs
• Cost efficiency
  • Single scalable platform; no incremental investment
  • No need for separate systems, storage

Page 7: Solr + Hadoop: Interactive Search for Hadoop

What is Cloudera Search?

• Full-text, interactive search with faceted navigation
• Apache Solr integrated with CDH
  • Established, mature search with vibrant community
  • In production environments for years
• Open Source
  • 100% Apache, 100% Solr
  • Standard Solr APIs
• Batch, near real-time, and on-demand indexing
• Available for CDH4 and CDH5

Page 8: Solr + Hadoop: Interactive Search for Hadoop

Agenda

• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Security
• Conclusion

Page 9: Solr + Hadoop: Interactive Search for Hadoop

Apache Hadoop

• Apache HDFS
  • Distributed file system
  • High reliability
  • High throughput
• Apache MapReduce
  • Parallel, distributed programming model
  • Allows processing of large datasets
  • Fault tolerant

Page 10: Solr + Hadoop: Interactive Search for Hadoop

Apache Lucene

• Full text search library
  • Indexing
  • Querying
• Traditional inverted index
• Batch and Incremental indexing
• We are using version 4.4 in current release

Page 11: Solr + Hadoop: Interactive Search for Hadoop

Apache Solr

• Search service built using Lucene
  • Ships with Lucene (same TLP at Apache)
• Provides XML/HTTP/JSON/Python/Ruby/… APIs
  • Indexing
  • Query
  • Administrative interface
  • Also rich web admin GUI via HTTP

Page 12: Solr + Hadoop: Interactive Search for Hadoop

Apache SolrCloud

• Provides distributed Search capability
• Part of Solr (not a separate library/codebase)
• Shards – provide scalability
  • partition index for size
  • replicate for query performance
• Uses ZooKeeper for coordination
  • No split-brain issues
  • Simplifies operations

Page 13: Solr + Hadoop: Interactive Search for Hadoop

SolrCloud Architecture

• Updates automatically sent to the correct shard

• Replicas handle queries, forward updates to the leader

• Leader indexes the document for the shard, and forwards the index notation to itself and any replicas.
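The update flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Shard` class and the `shard_for` and `route_update` helpers are hypothetical names, and the MD5-modulo routing stands in for whatever hashing scheme SolrCloud actually uses.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard index from a stable hash of the document id (assumed scheme)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


class Shard:
    """Toy shard: one leader plus a fixed number of replicas."""

    def __init__(self, name: str, replica_count: int):
        self.name = name
        self.leader = []  # documents indexed on the leader
        self.replicas = [[] for _ in range(replica_count)]

    def index(self, doc: dict) -> None:
        # The leader indexes the document, then forwards it to every replica.
        self.leader.append(doc)
        for replica in self.replicas:
            replica.append(doc)


def route_update(doc: dict, shards: list) -> str:
    """Send an update to the shard that owns the document id."""
    shard = shards[shard_for(doc["id"], len(shards))]
    shard.index(doc)
    return shard.name
```

Because the hash is stable, repeated updates for the same id always land on the same shard, which is what lets any node forward an update to the right leader.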

Page 14: Solr + Hadoop: Interactive Search for Hadoop

SolrCloud Architecture

Visual representation via admin UI

Page 15: Solr + Hadoop: Interactive Search for Hadoop

Distributed Search on Hadoop

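[Figure: the Hue UI, a custom UI, and a custom app send queries to the Solr servers of a SolrCloud cluster; Flume, MR, and HBase feed indexes into Solr; the Hadoop cluster hosts MR, HDFS, and HBase; ZK (ZooKeeper) coordinates.]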

Page 16: Solr + Hadoop: Interactive Search for Hadoop

Agenda

• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
  • Indexing
  • ETL - morphlines
  • Querying
• Security
• Conclusion

Page 17: Solr + Hadoop: Interactive Search for Hadoop

Indexing

• Near Real Time (NRT)
  • Flume
  • HBase Indexer
• Batch
  • MapReduceIndexerTool
  • HBaseBatchIndexer

Page 18: Solr + Hadoop: Interactive Search for Hadoop

Near Real Time Indexing with Flume

Solr and Flume
• Data ingest at scale
• Flexible extraction and mapping
• Indexing at data ingest

[Figure: log files and other sources feed Flume Agents, each with an Indexer; data is stored in HDFS.]

Page 19: Solr + Hadoop: Interactive Search for Hadoop

Apache Flume - MorphlineSolrSink

• A Flume Source…
  • Receives/gathers events
• A Flume Channel…
  • Carries the event – MemoryChannel or reliable FileChannel
• A Flume Sink…
  • Sends the events on to the next location
• Flume MorphlineSolrSink
  • Integrates Cloudera Morphlines library
    • ETL, more on that in a bit
  • Does batching
  • Results sent to Solr for indexing
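The sink behaviour described above (transform each event, batch the results, send each batch on for indexing) can be sketched as follows. This is a Python illustration, not Flume's Java API: `BatchingSink`, its `transform` and `send` callbacks, and the batch size are all hypothetical.

```python
class BatchingSink:
    """Toy sink: transforms events into records and sends them in batches."""

    def __init__(self, transform, send, batch_size=100):
        self.transform = transform    # stands in for the morphline ETL step
        self.send = send              # stands in for the call that indexes into Solr
        self.batch_size = batch_size
        self._batch = []

    def process(self, event):
        record = self.transform(event)
        if record is not None:        # a transform may drop an event
            self._batch.append(record)
        if len(self._batch) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send whatever has accumulated, then start a new batch.
        if self._batch:
            self.send(list(self._batch))
            self._batch.clear()


sent_batches = []
sink = BatchingSink(
    transform=lambda event: {"text": event.strip()},
    send=sent_batches.append,
    batch_size=2,
)
for line in ["a\n", "b\n", "c\n"]:
    sink.process(line)
sink.flush()  # push the final partial batch
```

Batching trades a little latency for far fewer round trips to the indexing service.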

Page 20: Solr + Hadoop: Interactive Search for Hadoop

Indexing

• Near Real Time (NRT)
  • Flume
  • HBase Indexer
• Batch
  • MapReduceIndexerTool
  • HBaseBatchIndexer

Page 21: Solr + Hadoop: Interactive Search for Hadoop

Near Real Time Indexing of Apache HBase

[Figure: HBase (on HDFS) serves the interactive load; HBase replication feeds the HBase Indexer(s), which update a set of Solr servers used for search. Labels: "planet-sized tabular data", "immediate access & updates", "fast & flexible information discovery", "Big Data Management".]

Page 22: Solr + Hadoop: Interactive Search for Hadoop

Lily HBase Indexer

• Collaboration between NGData & Cloudera
  • NGData are creators of the Lily data management platform
• Lily HBase Indexer
  • Service which acts as an HBase replication listener
    • HBase replication features, such as filtering, supported
  • Replication updates trigger indexing of updates (rows)
  • Integrates Cloudera Morphlines library for ETL of rows
  • AL2 licensed on github https://github.com/ngdata

Page 23: Solr + Hadoop: Interactive Search for Hadoop

Indexing

• Near Real Time (NRT)
  • Flume
  • HBase Indexer
• Batch
  • MapReduceIndexerTool
  • HBaseBatchIndexer

Page 24: Solr + Hadoop: Interactive Search for Hadoop

Scalable Batch Indexing

Solr and MapReduce
• Flexible, scalable batch indexing
• Start serving new indices with no downtime
• On-demand indexing, cost-efficient re-indexing

[Figure: files in HDFS are processed by Indexers into index shards, which go live ("GOLIVE") on the Solr servers.]

Page 25: Solr + Hadoop: Interactive Search for Hadoop

MapReduce Indexer

MapReduce Job with two parts

1) Scan HDFS for files to be indexed
  • Much like Unix “find” – see HADOOP-8989
  • Output is NLineInputFormat’ed file

2) Mapper/Reducer indexing step
  • Mapper extracts content via Cloudera Morphlines
  • Reducer indexes documents via embedded Solr server
  • Originally based on SOLR-1301
    • Many modifications to enable linear scalability
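The mapper/reducer split above can be illustrated with a small in-memory Python sketch. It is not the MapReduceIndexerTool: the `mapper`, `reducer`, and `run_job` names are hypothetical, CRC32 partitioning is an assumed stand-in for the real shard assignment, and a plain inverted index stands in for the embedded Solr server.

```python
import zlib
from collections import defaultdict


def mapper(path: str, raw_text: str, num_shards: int):
    """Extract a document from a file and assign it to a shard."""
    doc = {"id": path, "text": raw_text.lower()}
    shard = zlib.crc32(path.encode("utf-8")) % num_shards
    return shard, doc


def reducer(docs):
    """Build one shard's inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc in docs:
        for term in doc["text"].split():
            index[term].add(doc["id"])
    return dict(index)


def run_job(files: dict, num_shards: int) -> dict:
    """Run map, shuffle (group by shard), and reduce in memory."""
    grouped = defaultdict(list)
    for path, raw_text in files.items():
        shard, doc = mapper(path, raw_text, num_shards)
        grouped[shard].append(doc)
    return {shard: reducer(docs) for shard, docs in grouped.items()}
```

Each reducer produces a complete index for one shard, so adding reducers (and shards) is what lets the job scale with the size of the cluster.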

Page 26: Solr + Hadoop: Interactive Search for Hadoop

MapReduce Indexer “golive”

• Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing

• Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster
  • No downtime for users
  • No NRT expense
  • Linear scale out to the size of your MR cluster

Page 27: Solr + Hadoop: Interactive Search for Hadoop

Indexing

• Near Real Time (NRT)
  • Flume
  • HBase Indexer
• Batch
  • MapReduceIndexerTool
  • HBaseBatchIndexer

Page 28: Solr + Hadoop: Interactive Search for Hadoop

HBase + MapReduce

• Run MapReduce job over HBase tables
• Same architecture as running over HDFS
• Similar to HBase’s CopyTable
• Support for go-live

Page 29: Solr + Hadoop: Interactive Search for Hadoop

Agenda

• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
  • Indexing
  • ETL - morphlines
  • Querying
• Security
• Conclusion

Page 30: Solr + Hadoop: Interactive Search for Hadoop

Cloudera Morphlines

• Open Source framework for simple ETL
• Simplify ETL
  • Built-in commands and library support (Avro format, Hadoop SequenceFiles, grok for syslog messages)
  • Configuration over coding
• Standardize ETL
• Ships as part of Kite SDK, formerly Cloudera Developer Kit (CDK)
• It’s a Java library
• AL2 licensed on github https://github.com/kite-sdk

Page 31: Solr + Hadoop: Interactive Search for Hadoop

Cloudera Morphlines Architecture

[Figure: logs, tweets, social media, HTML, images, PDF, text (anything you want to index) pass through Flume, the MR Indexer, the HBase indexer, or your own application, each embedding the Morphline library, and end up in the Solr servers of a SolrCloud cluster.]

Morphlines can be embedded in any application…

Page 32: Solr + Hadoop: Interactive Search for Hadoop

Extraction and Mapping

• Modeled after Unix pipelines (records instead of lines)

• Simple and flexible data transformation

• Reusable across multiple index workloads

• Over time, extend and re-use across platform workloads

[Figure: a syslog event enters the Solr sink of a Flume agent; inside the Morphline library the record passes through the commands readLine, grok, and loadSolr, and the resulting document is sent to Solr.]

Page 33: Solr + Hadoop: Interactive Search for Hadoop

Morphline Example – syslog with grok

    morphlines : [
      {
        id : morphline1
        importCommands : ["com.cloudera.**", "org.apache.solr.**"]
        commands : [
          { readLine {} }
          {
            grok {
              dictionaryFiles : [/tmp/grok-dictionaries]
              expressions : {
                message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
              }
            }
          }
          { loadSolr {} }
        ]
      }
    ]

Example Input

    <164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.

Output Record

    syslog_pri:164
    syslog_timestamp:Feb 4 10:46:14
    syslog_hostname:syslog
    syslog_program:sshd
    syslog_pid:607
    syslog_message:listening on 0.0.0.0 port 22.
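To make the pipeline idea concrete, here is a rough Python analogue of the readLine, grok, and loadSolr chain. It is a sketch under stated assumptions, not the Morphlines API: the three functions are hypothetical stand-ins, the regular expression is a hand-written approximation of the grok patterns, and a plain list stands in for Solr. Only the extracted fields are kept in each record, for brevity.

```python
import re

# Hand-written approximation of the grok expression on the slide.
SYSLOG = re.compile(
    r"<(?P<syslog_pri>\d+)>"
    r"(?P<syslog_timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>.*?)"
    r"(?:\[(?P<syslog_pid>\d+)\])?: "
    r"(?P<syslog_message>.*)"
)


def read_line(event: str):
    """Emit one record per input line."""
    for line in event.splitlines():
        yield {"message": line}


def grok(records):
    """Extract named fields; records that do not match are dropped."""
    for record in records:
        match = SYSLOG.match(record["message"])
        if match:
            yield match.groupdict()


def load_solr(records, sink: list):
    """Stand-in for loadSolr: collect the documents in a list."""
    for record in records:
        sink.append(record)


docs = []
event = "<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22."
load_solr(grok(read_line(event)), docs)
```

Chaining generators mirrors the Unix-pipeline model: each command consumes records from the previous one and passes its output on.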

Page 34: Solr + Hadoop: Interactive Search for Hadoop

Current Command Library

• Integrate with and load into Apache Solr
• Flexible log file analysis
• Single-line record, multi-line records, CSV files
• Regex based pattern matching and extraction
• Integration with Avro
• Integration with Apache Hadoop Sequence Files
• Integration with SolrCell and all Apache Tika parsers
• Auto-detection of MIME types from binary data using Apache Tika

Page 35: Solr + Hadoop: Interactive Search for Hadoop

Current Command Library (cont)

• Scripting support for dynamic Java code
• Operations on fields for assignment and comparison
• Operations on fields with list and set semantics
• if-then-else conditionals
• A small rules engine (tryRules)
• String and timestamp conversions
• slf4j logging
• Yammer metrics and counters
• Decompression and unpacking of arbitrarily nested container file formats
• Etc…

Page 36: Solr + Hadoop: Interactive Search for Hadoop

Agenda

• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
  • Indexing
  • ETL - morphlines
  • Querying
• Security
• Conclusion

Page 37: Solr + Hadoop: Interactive Search for Hadoop

Querying

• Built-in Solr web UI
• Write your own
• Hue

Page 38: Solr + Hadoop: Interactive Search for Hadoop

Simple, Customizable Search Interface

Hue
• Simple UI
• Navigated, faceted drill down
• Customizable display
• Full text search, standard Solr API and query language

Page 39: Solr + Hadoop: Interactive Search for Hadoop

Agenda

• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
• Security
• Conclusion

Page 40: Solr + Hadoop: Interactive Search for Hadoop

Security

• Upstream Solr doesn’t deal with security
• Cloudera Search supports Kerberos authentication
  • Similar to Oozie / WebHDFS
• Collection-Level Authorization via Apache Sentry
• Document-Level Authorization via Apache Sentry (new in CDH5.1)

Page 41: Solr + Hadoop: Interactive Search for Hadoop

Agenda

• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
  • Indexing
  • ETL - morphlines
  • Querying
• Security
  • Collection-Level Authorization
  • Document-Level Authorization
• Conclusion

Page 42: Solr + Hadoop: Interactive Search for Hadoop

Collection-Level Authorization

• Sentry supports role-based granting of privileges
  • each role can be granted QUERY, UPDATE, and/or administrative privileges on an index (collection)
• Privileges stored in a “policy file” on HDFS

Page 43: Solr + Hadoop: Interactive Search for Hadoop

Policy File

    [groups]
    # Assigns each Hadoop group to its set of roles
    dev_ops = engineer_role, ops_role
    [roles]
    # Assigns each role to its set of privileges
    engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
    ops_role = collection = hbase_logs->action=Query
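How such a policy file maps a group to collection privileges can be sketched in Python. This is only an illustration of the group → role → privilege lookup: `parse_policy`, `privileges`, and `is_allowed` are hypothetical helpers, not Sentry code, and the parser handles only the simple syntax shown on the slide.

```python
POLICY = """
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
# Assigns each role to its set of privileges
engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
ops_role = collection = hbase_logs->action=Query
"""


def parse_policy(text: str) -> dict:
    """Parse sections into {section: {key: [comma-separated values]}}."""
    sections = {"groups": {}, "roles": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, {})
            continue
        key, _, value = line.partition("=")
        sections[current][key.strip()] = [v.strip() for v in value.split(",")]
    return sections


def privileges(entries) -> set:
    """Turn 'collection = X->action=Y' entries into (collection, action) pairs."""
    result = set()
    for entry in entries:
        target, _, action = entry.partition("->")
        collection = target.partition("=")[2].strip()
        result.add((collection, action.partition("=")[2].strip().lower()))
    return result


def is_allowed(policy: dict, group: str, collection: str, action: str) -> bool:
    """True if any role of the group grants the action on the collection."""
    for role in policy["groups"].get(group, []):
        if (collection, action.lower()) in privileges(policy["roles"].get(role, [])):
            return True
    return False


policy = parse_policy(POLICY)
```

With this policy, members of `dev_ops` can query and update `source_code` but can only query `hbase_logs`.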

Page 44: Solr + Hadoop: Interactive Search for Hadoop

Integrating Sentry and Solr

• Solr Request Handlers:

• Specified per collection in solrconfig.xml:

• A request to http://localhost:8983/solr/collection1/select is dispatched to an instance of solr.SearchHandler
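As a small illustration of what such a request looks like from a client, the Python sketch below builds a URL for the `/select` handler. The `select_url` helper is hypothetical; `q`, `rows`, and `wt` are standard Solr query parameters, and the query text is made up. No request is actually sent.

```python
from urllib.parse import urlencode


def select_url(base: str, collection: str, query: str, rows: int = 10) -> str:
    """Build a URL for the collection's /select request handler."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base}/{collection}/select?{params}"


url = select_url("http://localhost:8983/solr", "collection1", "text:hadoop")
```

The path segment after the collection name (`/select` here) is what Solr matches against the request handlers declared in that collection's solrconfig.xml.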

Page 45: Solr + Hadoop: Interactive Search for Hadoop

Sentry Request Handlers

• Sentry ships with its own version of solrconfig.xml with secure handlers, called solrconfig.xml.secure

• Use a SearchComponent to implement the checking
• Update Requests handled in a similar way

Page 46: Solr + Hadoop: Interactive Search for Hadoop

Agenda

• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component Deep Dive
  • Indexing
  • ETL - morphlines
  • Querying
• Security
  • Collection-Level Authorization
  • Document-Level Authorization
• Conclusion

Page 47: Solr + Hadoop: Interactive Search for Hadoop

Document-level authorization Motivation

• Index-level authorization useful when access control requirements for documents are homogeneous

• Security requirements may require restricting access to a subset of documents

Page 48: Solr + Hadoop: Interactive Search for Hadoop

Document-level authorization Motivation

• Consider “Confidential” and “Secret” documents. How to store with only index-level authorization?

• Pushes complexity to application. Doc-level authorization designed to solve this problem

Page 49: Solr + Hadoop: Interactive Search for Hadoop

Document-level authorization model

• Instead of storing in HDFS Policy File:

    [groups]
    # Assigns each Hadoop group to its set of roles
    dev_ops = engineer_role, ops_role
    [roles]
    # Assigns each role to its set of privileges
    engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
    ops_role = collection = hbase_logs->action=Query

• Store authorization tokens in each document
  • Many more documents than collections; doesn’t scale to store document-level info in Policy File
  • Can use Solr’s built-in filtering capabilities to restrict access

Page 50: Solr + Hadoop: Interactive Search for Hadoop

Document-level authorization model

• A configurable token field stores the authorization tokens
• The authorization tokens are Sentry roles, e.g. “ops_role”

    [roles]
    ops_role = collection = hbase_logs->action=Query

• The token field lists the roles that are allowed to view the document. To view a document, the querying user must belong to at least one role whose token is stored in the token field
• Can modify document permissions without restarting Solr
• Can modify role memberships without reindexing

Page 51: Solr + Hadoop: Interactive Search for Hadoop

Document-level authorization impl

• Intercepts the request via a SearchComponent
• SearchComponent adds an “fq” (filter query)
  • Filters out all documents that don’t have “role1” or “role2” in authField
  • Multiple “fq”s work as an intersection, so a malicious user can’t bypass the filter by injecting their own fq
  • Filters are cached, so the construction expense is paid only once
• Note: does not supersede index-level authorization
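The filter-query approach can be sketched in Python. This is an illustration only, not the Sentry SearchComponent: `auth_filter`, `role_filter`, and `apply_filters` are hypothetical helpers, the exact filter string Sentry generates is an assumption, and `roles` is assumed to be non-empty.

```python
def auth_filter(roles, auth_field="sentry_auth") -> str:
    """Build an fq value matching documents that carry any of the given roles."""
    clause = " OR ".join(f'"{role}"' for role in sorted(roles))
    return f"{auth_field}:({clause})"


def role_filter(roles, auth_field="sentry_auth"):
    """In-memory equivalent of the fq above, used to demonstrate the semantics."""
    allowed = set(roles)
    return lambda doc: bool(allowed & set(doc.get(auth_field, [])))


def apply_filters(docs, filters):
    """Keep only documents that satisfy every filter (intersection)."""
    return [doc for doc in docs if all(f(doc) for f in filters)]


docs = [
    {"id": "1", "sentry_auth": ["role1"]},
    {"id": "2", "sentry_auth": ["role3"]},
    {"id": "3", "sentry_auth": ["role2", "role3"]},
]
# A user-supplied filter that matches everything cannot widen the result,
# because every filter must accept a document for it to be returned.
user_injected = lambda doc: True
visible = apply_filters(docs, [user_injected, role_filter({"role1", "role2"})])
```

Since filters only ever narrow the result set, the server-added filter holds no matter what other `fq` parameters the client sends.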

Page 52: Solr + Hadoop: Interactive Search for Hadoop

Document-level authorization config

• Configuration via solrconfig.xml.secure (per collection):

    <!-- Set to true to enable document-level authorization -->
    <bool name="enabled">false</bool>
    <!-- Field where the auth tokens are stored in the document -->
    <str name="sentryAuthField">sentry_auth</str>
    <!-- Auth token defined to allow any role to access the document. Uncomment to enable. -->
    <!--<str name="allRolesToken">*</str>-->

• For backwards compatibility, not enabled by default
• No tokens = no access. To allow all users to access a document, use the allRolesToken. Useful for getting started
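The access rule implied by this configuration can be written as a short Python sketch. `can_view` is a hypothetical helper, not Sentry code; in particular, treating the allRolesToken as granting access to every user, including one with no roles, is an assumption based on the wording of the slide.

```python
def can_view(doc_tokens, user_roles, all_roles_token=None) -> bool:
    """Decide whether a user may see a document under document-level authorization."""
    tokens = set(doc_tokens)
    if not tokens:
        return False  # no tokens = no access
    if all_roles_token is not None and all_roles_token in tokens:
        return True   # document is open to everyone (assumed semantics)
    # Otherwise the user needs at least one role listed in the token field.
    return bool(tokens & set(user_roles))
```

With `allRolesToken` left commented out (modelled as `all_roles_token=None`), a `*` in the token field is just another token and grants nothing by itself.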

Page 53: Solr + Hadoop: Interactive Search for Hadoop

Conclusion

• Cloudera Search
  • Free Download
  • Extensive documentation
  • Send your questions and feedback to [email protected]
  • Take the Search online training
• Cloudera Manager Standard (i.e. the free version)
  • Simple management of Search
  • Free Download
• QuickStart VM also available!