CIS13: Big Data Analytics Vendor Perspective: Insights from the Bleeding Edge
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge
Who am I?
• Software Engineer at Cloudera
• Hadoop Committer and PMC Member at the Apache Software Foundation
• Primarily work on Hadoop Security and HDFS
• Master's thesis focused on systems security
Agenda
• What is Hadoop?
• Hadoop Ecosystem Interactions
• Hadoop Authentication
• Hadoop Authorization
• IT Infrastructure Integration
• The Future: Where Hadoop Security is Headed
Hadoop Is…
• A distributed system
  • Designed for massive scaling of storage and compute across many (10s-1000s) nodes
• An ecosystem
  • Hadoop is the kernel, apps on top are user-level programs
  • e.g. Impala, Hive, Oozie, HBase, etc.
• A security pain
  • Designed to run arbitrary code submitted by users
• Another place where many users interact with the system
  • Many orgs provide “Hadoop as a service”
Hadoop Is…
• Not secure by default
  • No authentication whatsoever
  • Usually behind a corporate firewall
• Often accessed by common BI tools
  • Tableau, SAS, MicroStrategy, etc.
• Expected to be integrated into corporate IT infra
  • SSO, etc.
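Hadoop on its Own
[Diagram: a standalone Hadoop cluster. Components: NN, SNN, JT, three worker nodes each running DN and TT with Map and Reduce tasks, and HttpFS. Clients: MR client, HDFS client, WebHDFS client. Service users: hdfs, httpfs & mapred; end users connect through the clients. Protocols: RPC / data transfer / HTTP.]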
The Hadoop Ecosystem
• Storage
  • HBase
  • HDFS
• Processing
  • Map/Reduce
  • YARN
• Querying
  • Hive, Impala (SQL)
  • Pig (DSL)
• Cron, workflows
  • Oozie
• Data ingest
  • Flume (streaming)
  • Sqoop (batch)
• Live data serving
  • HBase
• Pipelines
  • Crunch, Cascading
• GUI
  • Hue
• Management
  • Cloudera Manager
Hadoop and Friends
[Diagram: Hadoop surrounded by ecosystem services and their clients. Services: Hive Metastore, HBase, Oozie, Hue, Impala, ZooKeeper, Flume. Clients: MapRed, Pig, Crunch, Cascading, Sqoop, Hive, HBase, Oozie, Impala, Flume, browser. Legend: service users, end users. Protocols: RPC / data / HTTP / Thrift / Avro RPC, plus WebHDFS over HTTP and ZooKeeper RPC.]
Authentication Details
• Hadoop Authentication based on Kerberos
  • Usually MIT, also Active Directory
• End Users to services, as a user
  • CLI & libraries: Kerberos (kinit or keytab)
  • Web UIs: Kerberos SPNEGO & pluggable HTTP auth
• Services to Services, as a service
  • Credentials: Kerberos (keytab)
• Services to Services, on behalf of a user
  • Proxy-user (after Kerberos for service)
• Job tasks to Services, on behalf of a user
  • Job delegation token
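The "on behalf of a user" case can be sketched as a small check. This is a toy model, not Hadoop's code: it assumes the service has already authenticated with Kerberos, and it mirrors the idea behind the hadoop.proxyuser.<service>.hosts and hadoop.proxyuser.<service>.groups settings. The class, function, host, and user names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProxyUserRule:
    """Toy stand-in for one service's proxy-user settings (hypothetical type)."""
    hosts: set = field(default_factory=set)   # hosts the service may call from, or {"*"}
    groups: set = field(default_factory=set)  # groups whose members may be impersonated, or {"*"}


def may_impersonate(rules, user_groups, service, end_user, caller_host):
    """Return True if `service` may act on behalf of `end_user` from `caller_host`."""
    rule = rules.get(service)
    if rule is None:
        return False  # the service is not configured as a proxy user at all
    host_ok = "*" in rule.hosts or caller_host in rule.hosts
    member_of = user_groups.get(end_user, set())
    group_ok = "*" in rule.groups or bool(rule.groups & member_of)
    return host_ok and group_ok


# Hypothetical setup: Oozie may impersonate members of 'analysts' from one host.
RULES = {"oozie": ProxyUserRule(hosts={"oozie1.example.com"}, groups={"analysts"})}
USER_GROUPS = {"atm": {"analysts", "eng"}, "bob": {"ops"}}

if __name__ == "__main__":
    print(may_impersonate(RULES, USER_GROUPS, "oozie", "atm", "oozie1.example.com"))
```

Both conditions must hold: the call comes from an allowed host, and the impersonated user belongs to an allowed group.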
Authorization Details
• HDFS Data
  • File System permissions (Unix-like user/group permissions)
• HBase Data
  • Read/Write Access Control Lists (ACLs) at table level
• Hive Metastore (Hive, Impala)
  • Leverages/proxies HDFS permissions for tables & partitions
• Hive Server (Hive, Impala) (coming)
  • More advanced GRANT/REVOKE with ACLs for tables
• Jobs (Hadoop, Oozie)
  • Job ACLs for Hadoop Scheduler Queues, manage & view jobs
• ZooKeeper
  • ACLs at znodes, authenticated & read/write
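The Unix-like permission model for HDFS data can be sketched in a few lines. This is a simplified illustration, not HDFS code: it ignores the superuser and directory traversal, and the function name and sample users are hypothetical.

```python
def can_access(mode, owner, group, user, user_groups, want):
    """Check Unix-style permission bits.

    mode        -- octal permission bits, e.g. 0o640
    owner/group -- owner and group recorded on the file
    user        -- the requesting user
    user_groups -- set of groups the requesting user belongs to
    want        -- string made of 'r', 'w', 'x'
    """
    # Exactly one class applies: owner first, then group, then other.
    if user == owner:
        shift = 6
    elif group in user_groups:
        shift = 3
    else:
        shift = 0
    granted = (mode >> shift) & 0o7
    needed = 0
    for ch in want:
        needed |= {"r": 4, "w": 2, "x": 1}[ch]
    return (granted & needed) == needed


if __name__ == "__main__":
    # A file owned by 'hdfs', group 'analysts', mode rw-r-----.
    print(can_access(0o640, "hdfs", "analysts", "atm", {"analysts", "eng"}, "r"))
```

Only one owner and one group can be named per file, which is the expressiveness limit the later slides mention.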
IT Integration: Kerberos
• Users don’t want Yet Another Credential
• Corp IT doesn’t want to provision thousands of service principals
• Solution: local KDC + one-way trust
  • Run a KDC (usually MIT Kerberos) in the cluster
  • Put all service principals here
  • Set up one-way trust of central corporate realm by local KDC
  • Normal user credentials can be used to access Hadoop
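A rough sketch of what the two-realm setup means for names: service principals live in the local realm, user principals in the corporate realm, and a user is accepted only if the realm is trusted. The realm names and helpers below are hypothetical. Real Hadoop maps principals to short names with rules configured in hadoop.security.auth_to_local; this sketch simply takes the primary component.

```python
from typing import NamedTuple, Optional


class Principal(NamedTuple):
    primary: str
    instance: Optional[str]
    realm: str


def parse_principal(name):
    """Split 'primary[/instance]@REALM' into its parts."""
    body, sep, realm = name.rpartition("@")
    if not sep or not body or not realm:
        raise ValueError(f"not a full principal name: {name!r}")
    primary, _, instance = body.partition("/")
    return Principal(primary, instance or None, realm)


def short_name(principal, trusted_realms):
    """Map a principal to a local user name, rejecting untrusted realms."""
    if principal.realm not in trusted_realms:
        raise PermissionError(f"realm not trusted: {principal.realm}")
    return principal.primary


# Hypothetical realms: the cluster-local KDC and the central corporate realm.
LOCAL_REALM = "HADOOP.EXAMPLE.COM"
CORP_REALM = "CORP.EXAMPLE.COM"
TRUSTED = {LOCAL_REALM, CORP_REALM}

if __name__ == "__main__":
    user = parse_principal("atm@" + CORP_REALM)
    print(short_name(user, TRUSTED))
```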
IT Integration: Groups
• Much of Hadoop authorization uses “groups”
  • User ‘atm’ might belong to groups ‘analysts’, ‘eng’, etc.
• Users’ groups are not stored in Hadoop anywhere
  • Refers to external system to determine group membership
  • NN/JT/Oozie/Hive servers all must perform group mapping
• Default plugins for user/group mapping:
  • ShellBasedUnixGroupsMapping: forks/runs ‘/bin/id’
  • JniBasedUnixGroupsMapping: makes a system call
  • LdapGroupsMapping: talks directly to an LDAP server
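The pluggable mapping idea can be sketched as a lookup function wrapped by the server. This is an illustration under assumptions, not Hadoop's implementation: the class and function names are hypothetical, and the time-based cache is an addition for the example rather than something the slides describe.

```python
import time


class CachedGroupMapping:
    """Wrap any user -> groups lookup and remember results for a while."""

    def __init__(self, lookup, ttl_seconds=300.0, clock=time.monotonic):
        self._lookup = lookup      # pluggable backend: shell, system call, LDAP, ...
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}           # user -> (time fetched, list of groups)

    def groups_for(self, user):
        now = self._clock()
        hit = self._cache.get(user)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        groups = list(self._lookup(user))
        self._cache[user] = (now, groups)
        return groups


def os_groups(user):
    """Example backend: ask the local OS account database (Unix only)."""
    import grp
    import os
    import pwd

    gid = pwd.getpwnam(user).pw_gid
    names = []
    for g in os.getgrouplist(user, gid):
        try:
            names.append(grp.getgrgid(g).gr_name)
        except KeyError:
            pass  # numeric group id without a name entry
    return names


if __name__ == "__main__":
    # Hypothetical static backend standing in for an LDAP lookup.
    directory = {"atm": ["analysts", "eng"]}
    mapping = CachedGroupMapping(lambda u: directory.get(u, []))
    print(mapping.groups_for("atm"))
```

Every server that authorizes requests needs the same backend configured, otherwise the same user can resolve to different groups on different services.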
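IT Integration: Kerberos + LDAP
[Diagram: the Hadoop cluster (NN, JT) with a local KDC holding the service principals (hdfs/[email protected], yarn/[email protected], …); a cross-realm trust between the local KDC and the central Active Directory, which holds the user principals ([email protected], [email protected], …); NN and JT use LDAP group mapping.]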
IT Integration: Web Interfaces
• Most web interfaces authenticate using SPNEGO
  • Standard HTTP authentication protocol
  • Used internally by services which communicate over HTTP
  • Most browsers support Kerberos SPNEGO authentication
• Hadoop components which use servlets for web interfaces can plug in custom filter
  • Integrate with intranet SSO HTTP solution
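The HTTP side of SPNEGO can be sketched as a filter: with no credentials the server answers 401 with "WWW-Authenticate: Negotiate", and the client retries with "Authorization: Negotiate <base64 token>". The sketch below is a single-round simplification with hypothetical names; token validation is a caller-supplied stub, where a real filter would hand the token to a GSS-API/Kerberos library.

```python
import base64

CHALLENGE = {"WWW-Authenticate": "Negotiate"}


def negotiate_filter(headers, validate_token):
    """Return (status, response_headers, user) for one request.

    headers        -- dict of request headers
    validate_token -- callable taking the raw token bytes and returning
                      a user name, or None if the token is rejected
    """
    scheme, _, payload = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "negotiate" or not payload:
        return 401, CHALLENGE, None       # ask the client to negotiate
    try:
        token = base64.b64decode(payload, validate=True)
    except ValueError:                    # binascii.Error is a ValueError
        return 401, CHALLENGE, None
    user = validate_token(token)
    if user is None:
        return 401, CHALLENGE, None
    return 200, {}, user


def demo_validator(token):
    """Stub validator for the example only; accepts one fixed token."""
    return "atm" if token == b"ticket-for-atm" else None


if __name__ == "__main__":
    hdr = {"Authorization": "Negotiate " + base64.b64encode(b"ticket-for-atm").decode()}
    print(negotiate_filter(hdr, demo_validator))
```

A custom SSO filter would occupy the same slot: inspect the request, then either challenge the client or pass along an authenticated user name.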
Issues with Hadoop Security
• SSO support is poor and not universal
  • Only supported for the web interfaces, little used, etc.
• Kerberos is the only option
  • Not all orgs are comfortable administering a net-new Kerberos realm
• Not well-suited for cloud deployments
  • Need properly working reverse DNS
  • Pain to provision KDC, distribute keytabs
• Kerberos is tough for management tools
  • No Kerberos administrative API/protocol
Issues with Hadoop Security (cont.)
• Isolation of user tasks currently requires separate local Unix accounts on all boxes
  • Need to integrate with LDAP using PAM or something like it
• HDFS authorization only supports Unix-style permissions
  • Not expressive enough for some applications, e.g. Hive
Future Development
• Full SSO support
  • OAuth the most commonly requested, first goal
• Decouple Hadoop RPC implementation from Kerberos
  • Make authentication system fully pluggable for custom implementations
  • Any service which can provide bidirectional authentication
• Improve management tools
  • Cloudera Manager can manage more of the security infrastructure
Future Development (cont.)
• Use better isolation methods for user tasks
  • Linux containers
  • Solaris “zones”
  • Etc.
• Better authorization capabilities
  • Talk of adding ACL support to HDFS
  • Hive Server 2 will provide rich authorization capabilities