Secure Search - Using Apache Sentry to Add Authentication and Authorization Support to Solr:...

Secure Solr With Apache Sentry
Gregory Chanan, Engineer @ Cloudera
gchanan AT cloudera.com

Who Am I?
•  Software Engineer at Cloudera
•  Apache Solr Committer
•  Apache Sentry Committer (incubating)
•  Apache HBase Committer

Overview
•  Motivation
    •  Why security for Solr / SolrCloud?
    •  Why Apache Sentry?
•  Authentication
•  Authorization
    •  Collection-level
    •  Document-level
•  Secure Impersonation
•  Performance
•  Future Work

Why Security?
•  Apache Solr only provides minimal security features
    “Solr allows any client with access to it to add, update, and delete documents (and of course search/read too), including access to the Solr configuration and schema files and the administrative user interface.” [1]
•  In the past, deployed as a single server
    “It is strongly recommended that the application server containing Solr be firewalled such the only clients with access to Solr are your own.” [1]

Why Security?
•  SolrCloud is driving adoption in the Big Data space
•  Solr is now a component of a multi-tenant Hadoop cluster
    •  Non-Solr users on the cluster
    •  Solr communicates across machines and services

Why Apache Sentry?
•  Sentry is already established in the Hadoop ecosystem
    •  Has a well-understood authentication model (Kerberos)
    •  Has a well-understood privilege/action model
•  Security-focused project
    •  Solr focuses on being a search engine
    •  Sentry focuses on security

Authentication
•  Authentication: verifying the identity of a user or service
•  Solr supports authenticating with dependent services (i.e. HDFS and ZooKeeper*)
•  Sentry goal: support other services / users authenticating with Solr
•  Consistent with other HTTP-level Hadoop services (e.g. Oozie and HttpFs), Apache Sentry uses:
    •  Kerberos: a mutual authentication protocol that works on the basis of “tickets”
    •  SPNego: a negotiation mechanism for selecting an underlying authentication protocol

SPNego advantages
•  HTTP tools have built-in support for SPNego/Kerberos
    •  Web browsers
    •  curl (with --negotiate)
    •  HTTP libraries, including Apache HttpClient (used by SolrJ)
•  Although Kerberos is an authentication (not authorization) protocol, it can be used for cluster-level access control
    •  Only grant Kerberos credentials to users who should have access to the cluster

Authentication Setup
•  Server side: use the Sentry-provided web.xml, which has a Kerberos/SPNego-aware filter
    •  Have to set up keytabs/principals/JAAS configurations
•  Client side: Sentry provides HttpClient / HttpSolrServer configuration for communicating with Kerberos/SPNego-aware Solr servers
    •  Have to set up keytabs/principals/JAAS configurations
•  Cloudera Manager can do the setup for you

Authorization
•  Authorization: controlling access to resources
•  Solr does not provide collection/document authorization support
    •  It does support “hooks” via solr.xml and solrconfig.xml to override the request handler implementation
•  Sentry uses these “hooks” to implement collection and document-level authorization

Collection-level Authorization
•  Sentry supports role-based granting of privileges
    •  Each role can be granted QUERY, UPDATE, and/or administrative privileges on a collection
•  Privileges are stored in a “policy file” on HDFS:

    [groups]
    # Assigns each Hadoop group to its set of roles
    dev_ops = engineer_role, ops_role
    [roles]
    # Assigns each role to its set of privileges
    engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
    ops_role = collection = hbase_logs->action=Query
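The group-to-role-to-privilege lookup that the policy file describes can be sketched in a few lines of Python. This is only an illustration of the file's structure, not Sentry's actual parser; the function names `parse_policy` and `privileges_for_group` are made up for this example, and the sketch assumes one entry per line with comments on their own lines.

```python
POLICY = """
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
# Assigns each role to its set of privileges
engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
ops_role = collection = hbase_logs->action=Query
"""


def parse_policy(text):
    """Parse a policy file into ({group: [roles]}, {role: {(collection, action)}})."""
    groups, roles = {}, {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        name, _, value = line.partition("=")
        name = name.strip()
        if section == "groups":
            groups[name] = [role.strip() for role in value.split(",")]
        elif section == "roles":
            privileges = set()
            for item in value.split(","):
                # Each item looks like "collection = source_code->action=Query".
                resource, _, action = item.partition("->")
                collection = resource.split("=", 1)[1].strip()
                verb = action.split("=", 1)[1].strip()
                privileges.add((collection, verb))
            roles[name] = privileges
    return groups, roles


def privileges_for_group(groups, roles, group):
    """Union of the privileges of every role assigned to the group."""
    granted = set()
    for role in groups.get(group, []):
        granted |= roles.get(role, set())
    return granted


groups, roles = parse_policy(POLICY)
dev_ops_privileges = privileges_for_group(groups, roles, "dev_ops")
```

Here a member of the dev_ops group ends up with Query and Update on source_code and Query on hbase_logs, because the group maps to both roles.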

Integrating Sentry and Solr
•  Sentry is integrated via “hooks” in request handlers:
    •  Specified per collection in solrconfig.xml
    •  Sentry ships with its own version of solrconfig.xml with secure handlers, called solrconfig.xml.secure

Administrative requests
•  That covers queries/updates of collections, but what about administrative actions such as getting the status of the cores?
•  In SolrCloud, admin looks like a collection:
    http://localhost:8983/solr/admin/cores?action=STATUS
•  Can just follow this structure in Sentry:
    sample_role = collection = admin->action=Query,
•  Secure Admin Handlers are controlled via the cluster-wide “solr.xml” in ZooKeeper. By default, you get Secure Admin Handlers if Sentry is enabled

Administrative requests
•  Full privilege model documented here
•  Examples (collection1 = arbitrary collection name):

    Action              Required Privilege   Collection
    select              QUERY                collection1
    update/json         UPDATE               collection1
    ThreadDumpHandler   QUERY                admin
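The three example rows can be read as a lookup from request handler to (privilege, collection). A minimal Python sketch of such a check follows; it covers only the rows shown here, the names `REQUIRED` and `is_authorized` are hypothetical, and it is not how Sentry's secure handlers are actually written.

```python
# Handler -> (required privilege, fixed collection).
# None means "the collection named in the request".
REQUIRED = {
    "select": ("QUERY", None),
    "update/json": ("UPDATE", None),
    "ThreadDumpHandler": ("QUERY", "admin"),
}


def is_authorized(granted, handler, collection):
    """Return True if `granted`, a set of (collection, action) pairs,
    covers what `handler` requires for a request against `collection`."""
    action, fixed_collection = REQUIRED[handler]
    target = fixed_collection or collection
    # Compare actions case-insensitively: the policy file writes "Query",
    # the table writes "QUERY".
    normalized = {(c, a.upper()) for c, a in granted}
    return (target, action) in normalized


reader = {("collection1", "Query")}
operator = {("collection1", "Query"), ("admin", "Query")}
```

With these sets, `reader` may call select on collection1 but not update/json or ThreadDumpHandler, while `operator` may also call ThreadDumpHandler because it holds QUERY on the admin collection.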

Document-level authorization motivation
•  Collection-level authorization is useful when access control requirements for documents are homogeneous
•  Security requirements may require restricting access to a subset of documents
•  Consider “Confidential” and “Secret” documents. How to store them with only collection-level authorization?
    •  Pushes complexity to the application

Document-level authorization model
•  Instead of the Policy File in HDFS:

    [groups]
    # Assigns each Hadoop group to its set of roles
    dev_ops = engineer_role, ops_role
    [roles]
    # Assigns each role to its set of privileges
    engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
    ops_role = collection = hbase_logs->action=Query

•  Store authorization tokens in each document
    •  Many more documents than collections; doesn’t scale to store document-level info in the Policy File
    •  Can use Solr’s built-in filtering capabilities to restrict access

Document-level authorization model
•  A configurable field stores the authorization tokens
•  The authorization tokens are Sentry roles, e.g. “ops_role”

    [roles]
    ops_role = collection = hbase_logs->action=Query

•  The tokens represent the roles that are allowed to view the document. To view a document, the querying user must belong to at least one role whose token is stored in the token field
•  Can modify document permissions without restarting Solr
•  Can modify role memberships without reindexing

Document-level authorization impl
•  Intercepts the request via a SearchComponent
•  The SearchComponent adds an “fq” (FilterQuery) to the request
    •  Filters out all documents that don’t have “role1” or “role2” in authField
•  Filters are cached, so the construction cost is paid only once
•  Note: does not supersede collection-level authorization
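To make the filter idea concrete, here is a small Python sketch that builds such an “fq” from a user's roles and appends it to the request parameters. The exact query string Sentry generates is not shown in these slides, so the `field:("role1" OR "role2")` form, the function names, and the choice to reject users without roles are all assumptions of this example.

```python
def quote_term(term):
    """Quote a role name so that Solr query syntax characters are taken literally."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_auth_filter(auth_field, user_roles):
    """Build a filter query matching documents whose auth field holds any of the roles."""
    if not user_roles:
        # Assumption of this sketch: a user with no roles is rejected outright.
        raise PermissionError("user has no roles; deny the request")
    terms = " OR ".join(quote_term(role) for role in sorted(set(user_roles)))
    return f"{auth_field}:({terms})"


def add_auth_filter(params, auth_field, user_roles):
    """Return a copy of the request parameters with the auth filter appended.

    Existing fq parameters are kept; Solr intersects multiple fq clauses,
    so the auth filter can only narrow the result set.
    """
    return list(params) + [("fq", build_auth_filter(auth_field, user_roles))]


request = [("q", "*:*"), ("fq", "lang:en")]
secured = add_auth_filter(request, "sentry_auth", ["role2", "role1"])
```

Sorting the roles keeps the generated string identical for the same role set, which is what lets a filter cache reuse the entry across requests.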

Document-level authorization config
•  Configuration via solrconfig.xml.secure (per collection):

    <!-- Set to true to enable document-level authorization -->
    <bool name="enabled">false</bool>
    <!-- Field where the auth tokens are stored in the document -->
    <str name="sentryAuthField">sentry_auth</str>
    <!-- Auth token defined to allow any role to access the document.
         Uncomment to enable. -->
    <!--<str name="allRolesToken">*</str>-->

•  No tokens = no access. To allow all users to access a document, use the allRolesToken. Useful for getting started
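The visibility rule (no tokens means no access, a matching role grants access, and the allRolesToken opens a document to everyone) can be written as a small predicate. This Python sketch only restates the rule; in Solr the check happens through the filter query rather than per document in application code, and treating the allRolesToken as granting access even to a user with no roles is an assumption of this example.

```python
def can_view(doc_tokens, user_roles, all_roles_token=None):
    """Return True if a user holding `user_roles` may see a document
    whose auth field contains `doc_tokens`."""
    if not doc_tokens:
        return False  # no tokens = no access
    if all_roles_token is not None and all_roles_token in doc_tokens:
        return True   # document is explicitly open to all roles
    # Otherwise the user needs at least one role stored in the token field.
    return bool(set(doc_tokens) & set(user_roles))


visible_to_ops = can_view(["ops_role"], ["ops_role", "engineer_role"])
hidden_without_tokens = can_view([], ["ops_role"])
open_to_all = can_view(["*"], ["intern_role"], all_roles_token="*")
```

With allRolesToken left commented out (the default shown above), a literal “*” in the field is just another token and grants nothing by itself.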

Secure Impersonation
•  But wait! My users don’t interact with Solr directly
    •  Custom web UI, load balancer, etc.
•  Authorization won’t work!
    •  The “user” is forgotten; the request to Solr comes from the “UI”

Secure Impersonation
•  Secure impersonation: the ability of a “super-user” to submit requests on behalf of another user
•  Conceptually similar to “sudo” on Unix
•  Limited to only groups/hosts that are explicitly configured to support it
•  Identical to functionality provided by HDFS, Oozie
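The groups/hosts restriction can be sketched as a check that runs after the super-user has authenticated. This Python example is illustrative only: `ProxyRule` and `effective_user` are hypothetical names, the “*” wildcard follows the convention of Hadoop's proxy-user settings, and the real Solr/Sentry configuration keys are not shown in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyRule:
    """Where a super-user may connect from, and whose members it may impersonate."""
    hosts: frozenset
    groups: frozenset


def effective_user(auth_user, do_as, remote_host, do_as_groups, rules):
    """Return the user the request should be authorized as.

    auth_user    -- the authenticated (e.g. Kerberos) caller
    do_as        -- the user the caller asks to impersonate, or None
    do_as_groups -- groups of the impersonated user
    rules        -- {super_user_name: ProxyRule}
    """
    if do_as is None or do_as == auth_user:
        return auth_user
    rule = rules.get(auth_user)
    if rule is None:
        raise PermissionError(f"{auth_user} is not allowed to impersonate anyone")
    if "*" not in rule.hosts and remote_host not in rule.hosts:
        raise PermissionError(f"{auth_user} may not impersonate from {remote_host}")
    if "*" not in rule.groups and not (rule.groups & set(do_as_groups)):
        raise PermissionError(f"{auth_user} may not impersonate {do_as}")
    return do_as


RULES = {"hue": ProxyRule(frozenset({"hue.example.com"}), frozenset({"analysts"}))}
user = effective_user("hue", "alice", "hue.example.com", ["analysts"], RULES)
```

Collection and document-level checks then run against the returned user, so the policy file never needs to mention the UI's own account.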

Hue Search App UI
•  Uses Secure Impersonation to integrate with its own security mechanisms
    •  Users can log in to Hue via LDAP or another auth mechanism
    •  Hue makes requests on behalf of the logged-in user
    •  Only the Hue user requires a Kerberos keytab
•  Seamlessly integrates with the collection and document-level access control mechanisms

Performance Testing
•  Goal is to measure the overhead of:
    •  Kerberos Authentication
    •  Sentry Collection-Level Authorization
•  Measure index and query overhead separately

Index Test Setup
•  20-node cluster: 12 cores, 96 GB RAM, 12x 2TB disks, 10G Ethernet
•  Cloudera Search-1.2.0, CDH 4.6, MR1, CentOS 6.4
•  260M tweets/docs, indexed across 17 fields
•  116 GB, ~800 JSON .gz files, ~130MB per file, 3-fold HDFS replication
•  1 Solr server and 1 shard per node (44M docs per shard), no Solr replication
•  Uses the MapReduceIndexerTool contrib; mapper/reducer slots = 2x/1x the number of cores
•  Solr heap size = 20GB
•  Record end-to-end indexing time, i.e., indexing + mtree merge + go live
•  Record the average from 3 repeats

Index Performance Testing
•  Left column is the unsecured baseline.
•  Center column is ~20% lower → HDFS security introduces ~20% performance overhead.
•  Right column is ~same as the center column → Solr security introduces no additional overhead.

Query Test Setup
•  Same setup as MapReduce batch indexing
•  Uses the output of MapReduce batch indexing
•  1 client, 30 threads per client
•  Uses an internal tool, QueryRunner
    •  Similar to SolrMeter and JMeter
•  Queries are randomly sampled from a fixed set of 10,000 strings
•  Record per-thread query throughput for 5 runs of 30 min each

Query Performance Testing
•  Left column is the unsecured baseline.
•  Center column is ~13% lower → HDFS security introduces ~13% performance overhead.
•  Right column is the same as the center column → Solr security introduces no additional overhead.

Future Work
•  Support for the Sentry service with improved APIs / performance / integration
    •  Already supported for Hive/Impala
    •  Currently in development upstream
•  “Lineage” security: data flows from one system to another and retains its security criteria
    •  Example: Index HBase data for full-text queries in Solr. HBase Table and Cell-level security tags are automatically applied to Solr Collections, Documents, and Fields

Questions?
•  Thanks for listening!
•  More information / Want to contribute?
    http://sentry.incubator.apache.org/
•  Questions?
