Secure Search - Using Apache Sentry to Add Authentication and Authorization Support to Solr:...
Who Am I? • Software Engineer at Cloudera • Apache Solr Committer • Apache Sentry Committer (incubating) • Apache HBase Committer
Overview • Motivation
• Why security for Solr / SolrCloud? • Why Apache Sentry?
• Authentication • Authorization
• Collection-level • Document-level
• Secure Impersonation • Performance • Future Work
Why Security? • Apache Solr only provides minimal security features
“Solr allows any client with access to it to add, update, and delete documents (and of course search/read too), including access to the Solr configuration and schema files and the administrative user interface.” [1]
• In the past, deployed as a single server “It is strongly recommended that the application server containing Solr be firewalled such the only clients with access to Solr are your own.” [1]
Why Security? • SolrCloud driving adoption in Big Data space
• Now, a component of a multi-tenant Hadoop cluster • Non-Solr users on cluster • Solr communicates across machines and services
Why Apache Sentry? • Sentry already established in Hadoop ecosystem
• Has a well-understood authentication model (Kerberos) • Has a well-understood privilege/action model
• Security-focused project • Solr focus on Search Engine • Sentry focus on Security
Authentication • Authentication: Verifying identity of a user or service • Solr supports authenticating with dependent services (i.e. HDFS
and ZooKeeper*) • Sentry goal: support other services / users authenticating with
Solr • Consistent with other HTTP-level Hadoop services (e.g. Oozie
and HttpFs), Apache Sentry uses: • Kerberos: a mutual authentication protocol that works on the
basis of “tickets” • SPNego: a negotiation mechanism for selecting an underlying
authentication protocol
SPNego advantages • HTTP Tools have built-in support for SPNego/Kerberos
• Web browsers • curl (with --negotiate) • HTTP libraries, including Apache HttpClient (used by SolrJ)
• Although an authentication (not authorization) protocol, it can be used for cluster-level access control • Only grant Kerberos credentials to users who should have access to the cluster
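The curl bullet above can be sketched as a small helper that assembles the command line. This is a minimal sketch, assuming a valid Kerberos ticket is already in the ticket cache (for example from kinit). The helper name curl_spnego_command is hypothetical, and the URL is the core-status URL that appears later in this deck.

```python
import shlex


def curl_spnego_command(url):
    """Build a curl invocation that authenticates via SPNego/Kerberos.

    Hypothetical helper for illustration; it only builds the argument list
    and does not run curl.
    """
    # --negotiate enables SPNego. "-u :" passes an empty user/password so that
    # curl takes the identity from the Kerberos ticket cache instead.
    return ["curl", "--negotiate", "-u", ":", url]


cmd = curl_spnego_command("http://localhost:8983/solr/admin/cores?action=STATUS")
# Print a shell-safe rendering of the command for copy/paste.
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the list to subprocess.run would execute it; it is only printed here so the sketch has no side effects.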
Authentication Setup • Server side: use the Sentry-provided web.xml, which has a Kerberos/SPNego-aware filter • Have to set up keytabs/principals/JAAS configurations
• Client side: Sentry provides HttpClient / HttpSolrServer configuration for communicating with Kerberos/SPNego-aware Solr servers • Have to set up keytabs/principals/JAAS configurations
• Cloudera Manager can do the setup for you
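For readers who have not seen a JAAS configuration, the sketch below renders a keytab-based login stanza using the standard Krb5LoginModule options. The section name "Client", the principal, and the keytab path are assumptions chosen for illustration, not values taken from Sentry or Cloudera Manager; the function name jaas_stanza is likewise hypothetical.

```python
def jaas_stanza(principal, keytab_path, section="Client"):
    """Render a JAAS login section that logs in from a keytab.

    The section name is an assumption; the client or server reading the file
    decides which section name it looks up.
    """
    return (
        f"{section} {{\n"
        "  com.sun.security.auth.module.Krb5LoginModule required\n"
        "  useKeyTab=true\n"                # log in from a keytab, not a password
        f'  keyTab="{keytab_path}"\n'
        "  storeKey=true\n"
        "  useTicketCache=false\n"          # ignore any ticket cache on disk
        f'  principal="{principal}";\n'
        "};\n"
    )


# Hypothetical principal and path, for illustration only.
text = jaas_stanza("solr/host1.example.com@EXAMPLE.COM", "/etc/solr/solr.keytab")
print(text)
```

A JVM is usually pointed at such a file with the java.security.auth.login.config system property.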
Authorization • Authorization: Controlling access to resources • Solr does not provide collection/document authorization support
• Does support “hooks” via solr.xml and solrconfig.xml to override request handler implementation
• Sentry uses these “hooks” to implement collection and document level authorization
Collection-level Authorization • Sentry supports role-based granting of privileges
• Each role can be granted QUERY, UPDATE, and/or administrative privileges on a collection
• Privileges are stored in a “policy file” on HDFS:
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
# Assigns each role to its set of privileges
engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
ops_role = collection = hbase_logs->action=Query
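The group-to-role-to-privilege lookup that the policy file encodes can be sketched as follows. This is a minimal sketch that parses only the two sections shown on the slide. The functions parse_policy and privileges_for_group are hypothetical helpers and are not part of Sentry.

```python
import re

POLICY = """\
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
# Assigns each role to its set of privileges
engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
ops_role = collection = hbase_logs->action=Query
"""

# Matches one privilege such as "collection = source_code->action=Query".
PRIVILEGE = re.compile(r"collection\s*=\s*(\w+)\s*->\s*action\s*=\s*(\w+)")


def parse_policy(text):
    """Parse the [groups] and [roles] sections into dictionaries."""
    policy = {"groups": {}, "roles": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            continue
        # Split on the first "=" only; role values contain further "=" signs.
        name, _, value = line.partition("=")
        name = name.strip()
        if current == "groups":
            policy["groups"][name] = [r.strip() for r in value.split(",") if r.strip()]
        elif current == "roles":
            policy["roles"][name] = {
                (collection, action.upper())
                for collection, action in PRIVILEGE.findall(value)
            }
    return policy


def privileges_for_group(policy, group):
    """Union of (collection, action) pairs granted through the group's roles."""
    granted = set()
    for role in policy["groups"].get(group, []):
        granted |= policy["roles"].get(role, set())
    return granted


policy = parse_policy(POLICY)
print(sorted(privileges_for_group(policy, "dev_ops")))
```

With the policy above, members of dev_ops can query and update source_code but can only query hbase_logs.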
Integrating Sentry and Solr • Sentry integrated via “hooks” in request handlers: • Specified per collection in solrconfig.xml: • Sentry ships with its own version of solrconfig.xml with secure handlers,
called solrconfig.xml.secure
Administrative requests • That covers queries/updates of collections, but what about administrative
actions such as getting the status of the cores? • In SolrCloud, admin looks like a collection: http://localhost:8983/solr/admin/cores?action=STATUS • Can just follow this structure in Sentry: sample_role = collection = admin->action=Query
• Secure Admin Handlers controlled via cluster-wide “solr.xml” in ZooKeeper. By default, you get Secure Admin Handlers if Sentry is enabled
Administrative requests • Full privilege model documented here • Examples (collection1 = arbitrary collection name):
Action             Required Privilege   Collection
select             QUERY                collection1
update/json        UPDATE               collection1
ThreadDumpHandler  QUERY                admin
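The rows above amount to mapping a request URL to a (collection, privilege) pair. The sketch below covers only the select and update rows plus the core-status request from the previous slide; every other request is reported as not covered rather than guessed. The function required_privilege and the URL layout /solr/<collection>/<handler> are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Handler path segment -> privilege, taken from the example rows only.
HANDLER_PRIVILEGE = {"select": "QUERY", "update": "UPDATE"}


def required_privilege(url):
    """Return the (collection, privilege) pair a request needs.

    Hypothetical helper: assumes URLs shaped like /solr/<collection>/<handler>.
    """
    split = urlsplit(url)
    parts = [p for p in split.path.split("/") if p]
    if len(parts) < 3 or parts[0] != "solr":
        raise ValueError(f"unexpected Solr URL: {url}")
    collection, handler = parts[1], parts[2]
    if collection == "admin":
        # Admin requests are authorized against the pseudo-collection "admin".
        if handler == "cores" and parse_qs(split.query).get("action") == ["STATUS"]:
            return ("admin", "QUERY")
        raise LookupError("admin request not covered by this sketch")
    if handler not in HANDLER_PRIVILEGE:
        raise LookupError(f"handler not covered by this sketch: {handler}")
    return (collection, HANDLER_PRIVILEGE[handler])


print(required_privilege("http://localhost:8983/solr/collection1/update/json"))
print(required_privilege("http://localhost:8983/solr/admin/cores?action=STATUS"))
```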
Document-level authorization motivation • Collection-level authorization useful when access control requirements
for documents are homogeneous • Security requirements may require restricting access to a subset of
documents • Consider “Confidential” and “Secret” documents. How to store with only
collection-level authorization?
• Pushes complexity to application
Document-level authorization model • Instead of Policy File in HDFS:
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
# Assigns each role to its set of privileges
engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
ops_role = collection = hbase_logs->action=Query
• Store authorization tokens in each document • Many more documents than collections; doesn’t scale to store document-level info in Policy File
• Can use Solr’s built-in filtering capabilities to restrict access
Document-level authorization model • A configurable field stores the authorization tokens • The authorization tokens are Sentry roles, i.e. “ops_role”
[roles]
ops_role = collection = hbase_logs->action=Query
• Represents the roles that are allowed to view the document. To view a document, the querying user must belong to at least one role whose token is stored in the token field
• Can modify document permissions without restarting Solr • Can modify role memberships without reindexing
Document-level authorization impl • Intercepts the request via a SearchComponent • SearchComponent adds an “fq” or FilterQuery
• Filter out all documents that don’t have “role1” or “role2” in authField
• Filters are cached, so the construction cost is paid only once • Note: does not supersede collection-level authorization
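The filter described above can be sketched as a function that turns a user's roles into an fq string. This is a minimal sketch in generic Lucene query syntax and is not necessarily the exact query Sentry's SearchComponent generates. The function build_auth_filter is hypothetical, and the default field name sentry_auth is borrowed from the configuration slide.

```python
def build_auth_filter(roles, auth_field="sentry_auth"):
    """Build a filter query matching documents tagged with any of the roles."""
    if not roles:
        # A user with no roles can see nothing; the caller should reject the
        # request instead of sending an unfiltered query.
        raise ValueError("user has no roles")

    def quote(role):
        # Quote each role as a phrase, escaping backslashes and double quotes.
        return '"' + role.replace("\\", "\\\\").replace('"', '\\"') + '"'

    # Sort and de-duplicate so the same role set always yields the same string,
    # which lets identical filters share a cache entry.
    clauses = " OR ".join(quote(r) for r in sorted(set(roles)))
    return f"{auth_field}:({clauses})"


fq = build_auth_filter(["role2", "role1"])
print(fq)  # sentry_auth:("role1" OR "role2")
```

The resulting string would be appended to the request as an additional fq parameter, leaving the user's own q untouched.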
Document-level authorization config • Configuration via solrconfig.xml.secure (per collection):
<!-- Set to true to enable document-level authorization -->
<bool name="enabled">false</bool>
<!-- Field where the auth tokens are stored in the document -->
<str name="sentryAuthField">sentry_auth</str>
<!-- Auth token defined to allow any role to access the document. Uncomment to enable. -->
<!--<str name="allRolesToken">*</str>-->
• No tokens = no access. To allow all users to access a document,
use the allRolesToken. Useful for getting started
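The visibility rule, including the "no tokens = no access" default and the optional allRolesToken, can be sketched as a predicate. This is a minimal sketch of the rule as stated on the slides; the function can_view is hypothetical, and it follows the slide's wording that the all-roles token opens a document to all users.

```python
def can_view(doc_tokens, user_roles, all_roles_token=None):
    """Decide whether a user may see a document under document-level rules."""
    tokens = set(doc_tokens)
    if not tokens:
        return False  # no tokens = no access
    if all_roles_token is not None and all_roles_token in tokens:
        return True   # document is explicitly opened to everyone
    # Otherwise the user needs at least one role stored on the document.
    return bool(tokens & set(user_roles))


print(can_view(["ops_role"], ["ops_role", "engineer_role"]))   # True
print(can_view([], ["ops_role"]))                               # False
print(can_view(["*"], ["engineer_role"], all_roles_token="*"))  # True
```

Because allRolesToken is commented out by default, a "*" stored in the auth field is just an ordinary token until the setting is enabled.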
Secure Impersonation • But wait! My users don’t interact with Solr directly
• Custom web UI, load balancer, etc.
• Authorization won’t work! • “user” is forgotten, request to Solr from “UI”
Secure Impersonation • Secure impersonation: the ability of a “super-user” to submit
requests on behalf of another user • Conceptually similar to “sudo” on Unix • Limited to only groups/hosts that are explicitly configured to support it • Identical to functionality provided by HDFS, Oozie
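The host and group restrictions above can be sketched as a check that runs after authentication. This is a minimal sketch under stated assumptions: the ProxyRule type, the effective_user function, the do_as argument, and the "*" wildcard are illustrative names modeled on Hadoop-style proxy-user settings, not Sentry's or Solr's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyRule:
    """Hosts a super-user may connect from and groups it may impersonate."""
    hosts: frozenset
    groups: frozenset


def effective_user(authenticated_user, remote_host, do_as, do_as_groups, rules):
    """Return the user a request should be authorized as, or raise."""
    if do_as is None or do_as == authenticated_user:
        return authenticated_user  # no impersonation requested
    rule = rules.get(authenticated_user)
    if rule is None:
        raise PermissionError(f"{authenticated_user} may not impersonate anyone")
    if "*" not in rule.hosts and remote_host not in rule.hosts:
        raise PermissionError(f"impersonation not allowed from {remote_host}")
    if "*" not in rule.groups and not (set(do_as_groups) & rule.groups):
        raise PermissionError(f"{do_as} is not in an impersonable group")
    return do_as


# Hypothetical rule: "hue" may impersonate members of "dev_ops" from one host.
rules = {"hue": ProxyRule(frozenset({"hue-host.example.com"}), frozenset({"dev_ops"}))}
print(effective_user("hue", "hue-host.example.com", "alice", ["dev_ops"], rules))
```

Collection-level and document-level checks then run against the returned user, which is how the authorization mechanisms keep working behind a web UI.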
Hue Search App UI • Uses Secure Impersonation to integrate with its own security mechanisms
• Users can log in to Hue via LDAP or another auth mechanism • Hue makes requests on behalf of the logged-in user • Only the Hue user requires a Kerberos keytab
• Seamlessly integrates with the collection and document-level access control mechanisms
Performance Testing • Goal is to measure overhead of:
• Kerberos Authentication • Sentry Collection-Level Authorization
• Measure index, query overhead separately
Index Test Setup • 20-node cluster: 12 cores, 96 GB RAM, 12x 2TB disks, 10G Ethernet • Cloudera Search-1.2.0, CDH 4.6, MR1, CentOS 6.4 • 260M tweets/docs, indexed across 17 fields • 116 GB, ~800 JSON .gz files, ~130MB per file, 3-fold HDFS
replication • 1 Solr server and 1 shard per node (44M docs per shard), no Solr
replication • Uses MapReduceIndexerTool contrib. mapper/reducer slots = 2x/1x
number of cores • Solr heap size = 20GB • Record end-to-end indexing time, i.e., indexing + mtree merge + go
live • Record average from 3 repeats
Index Performance Testing
• Left column is unsecured baseline.
• Center column is ~20% lower → HDFS security introduces ~20% performance overhead.
• Right column is ~same as center column → Solr security introduces no additional overhead.
Query Test Setup • Same setup as MapReduce batch indexing • Uses the output of MapReduce batch indexing • 1 client, 30 threads per client • Uses internal tool - QueryRunner
• Similar to SolrMeter and JMeter • Query randomly sampled from fixed set of 10,000 strings • Record per thread query throughput for 5 runs of 30 min each
Query Performance Testing
• Left column is unsecured baseline.
• Center column is ~13% lower → HDFS security introduces ~13% performance overhead.
• Right column is same as center column → Solr security introduces no additional overhead.
Future Work • Support for Sentry service with improved APIs / performance /
integration • Already supported for Hive/Impala • Currently in development upstream
• “Lineage” security: data flows from one system to another and retains security criteria • Example: Index HBase data for full-text queries in Solr. HBase Table
and Cell-level security tags automatically applied to Solr Collections, Documents, and Fields