Hadoop & Security - Past, Present, Future

uweseiler

Page 2

About me

Big Data Nerd

Travelpirate

Photography Enthusiast

Hadoop Trainer

Data Architect

Page 3

Agenda

Past

Present

Authentication

Authorization

Auditing

Data Protection

Future

Page 4

Past

Page 5

Hadoop & Security 2010

Owen O'Malley @ Hadoop Summit 2010
http://de.slideshare.net/ydn/1-hadoop-securityindetailshadoopsummit2010


Page 7

Hadoop & Security (Not that long ago…)

User → SSH → SSH Gateway → hadoop fs -put → /user/uwe/ (Hadoop Cluster)

Page 8

Present

Page 9

Security in Hadoop 2015

Authentication - Who am I / prove it?
• Kerberos in native Apache Hadoop
• HTTP/REST API secured with Apache Knox Gateway

Authorization - Restrict access to explicit data
• Fine-grained access control
• HDFS, YARN, MapReduce, Hive & HBase
• Storm & Knox

Audit - Understand who did what
• Centralized audit reporting
• Policy and access history

Data Protection - Encrypt data at rest & in motion
• Wire encryption in Hadoop
• File encryption: built-in since Hadoop 2.6, plus partner tools

Centralized Security Administration

Page 10

Typical Flow - Hive Access with Beeline CLI

Beeline Client → HiveServer2 → HDFS (data files A, B, C)

Page 11

Typical Flow - Authenticate through Kerberos

Beeline Client → KDC → HiveServer2 → HDFS (A, B, C)

1. Client gets Service Ticket for Hive from the KDC
2. Use Hive, submit query
3. Hive gets NameNode (NN) Service Ticket
4. Hive creates MapReduce/Tez job using NN

Page 12

Typical Flow - Authorization through Ranger

Beeline Client → KDC → HiveServer2 (+ Ranger plugin) → HDFS (A, B, C)

1. Client gets Service Ticket for Hive
2. Use Hive, submit query
3. Ranger enforces the authorization policy inside HiveServer2
4. Hive gets NameNode (NN) Service Ticket
5. Hive creates MapReduce/Tez job using NN

Page 13

Typical Flow - Perimeter Security through Knox

Beeline Client → Knox → KDC / HiveServer2 (+ Ranger) → HDFS (A, B, C)

1. Original request goes to Knox with user id/password
2. Knox gets Service Ticket for Hive from the KDC
3. Knox runs as proxy user using Hive
4. Hive gets NameNode (NN) Service Ticket
5. Hive creates MapReduce/Tez job using NN
6. Client gets query result

Page 14

Typical Flow - Wire & File Encryption

Beeline Client → Knox → KDC / HiveServer2 (+ Ranger) → HDFS (A, B, C)

1. Original request goes to Knox with user id/password
2. Knox gets Service Ticket for Hive from the KDC
3. Knox runs as proxy user using Hive
4. Hive gets NameNode (NN) Service Ticket
5. Hive creates MapReduce/Tez job using NN
6. Client gets query result

Wire encryption: SSL on the client ↔ Knox and Knox ↔ HiveServer2 hops; SASL on the RPC hops into HDFS.

Page 15

Authentication: Kerberos

Page 16

Kerberos Synopsis

• Client never sends a password
  – Sends a username + token instead

• Authentication is centralized
  – Key Distribution Center (KDC)

• Client will receive a Ticket-Granting Ticket (TGT)
  – Allows an authenticated client to request access to secured services

• Clients establish a timed session

• Clients establish trust with services by sending KDC-stamped tickets to the service
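The ticket flow above can be exercised from any client node; a minimal sketch, assuming a principal uwe@EXAMPLE.COM exists in the KDC (realm, principal and path names are illustrative):

```shell
# Obtain a Ticket-Granting Ticket (TGT) from the KDC (prompts for the password)
kinit uwe@EXAMPLE.COM

# Inspect the ticket cache: shows the TGT and its validity window (the timed session)
klist

# Service tickets, e.g. for the NameNode, are then requested transparently
hadoop fs -ls /user/uwe
```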

Page 17

Kerberos + Active Directory/LDAP

• AD / LDAP (User Store)
  – Users: [email protected]
  – Use existing directory tools to manage users

• Cluster KDC
  – Hosts: [email protected]
  – Services: hdfs/[email protected]
  – Use Kerberos tools to manage host + service principals

• Client authenticates against AD/LDAP; a cross-realm trust lets the Hadoop cluster accept those credentials

Page 18

Ambari & Kerberos

• Install & Configure Kerberos
  – Server on a single node
  – Client on the rest of the nodes

• Define Principals & Keytabs
  – A keytab (key table) is a file containing a key for a principal
  – Since there are a few dozen principals, Ambari can generate keytab data for your entire cluster as a downloadable CSV file

• Configure User Permissions
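Ambari-generated keytabs can be verified and used for password-less authentication; the keytab path and principal below are illustrative:

```shell
# List the principals stored in a keytab
klist -kt /etc/security/keytabs/nn.service.keytab

# Obtain a TGT from the keytab, with no password prompt
kinit -kt /etc/security/keytabs/nn.service.keytab nn/namenode-host@EXAMPLE.COM
```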

Page 19

Perimeter Security: Apache Knox

Page 20

Knox: Core Concept

• Business User: JDBC/ODBC and REST/HTTP access goes through a Load Balancer and Knox into the cluster's application layer (HDFS, Hive, App X, App C)
• Admin / Data Operator: data ingest and ETL via RPC calls (Falcon, Oozie, Sqoop, Flume)
• Hadoop Admin: SSH to an edge node

Page 21

Knox: Hadoop REST APIs

Service  | Direct URL                          | Knox URL
WebHDFS  | http://namenode-host:50070/webhdfs  | https://knox-host:8443/webhdfs
WebHCat  | http://webhcat-host:50111/templeton | https://knox-host:8443/templeton
Oozie    | http://oozie-host:11000/oozie       | https://knox-host:8443/oozie
HBase    | http://hbase-host:60080             | https://knox-host:8443/hbase
Hive     | http://hive-host:10001/cliservice   | https://knox-host:8443/hive
YARN     | http://yarn-host:yarn-port/ws       | https://knox-host:8443/resourcemanager

Direct access: masters could be on many different hosts.
Via Knox: one host, one port, consistent paths, SSL config at one host.
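A WebHDFS call through the gateway might look like the following; knox-host, the topology name default, and the credentials are illustrative, and the /gateway/&lt;topology&gt; prefix depends on the Knox configuration:

```shell
# List HDFS root through Knox: one host, one port, SSL, HTTP Basic auth
curl -k -u uwe:secret \
  "https://knox-host:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS"
```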

Page 22

Knox: Features

Simplified Access
• Kerberos Encapsulation
• Single Access Point
• Multi-cluster support
• Single SSL certificate

Centralized Control
• Central REST API auditing
• Service-level authorization
• Alternative to SSH "edge node"

Enterprise Integration
• LDAP / AD integration
• SSO integration
• Apache Shiro extensibility
• Custom extensibility

Enhanced Security
• Protect network details
• SSL for non-SSL services
• WebApp vulnerability filter

Page 23

Knox: Architecture

• REST clients enter through the firewall into a DMZ, where a load balancer fronts one or more Knox instances
• Knox authenticates against the enterprise identity provider and forwards HTTP requests to the cluster masters (NN, RM, WebHCat, Oozie, HiveServer2, HBase)
• Edge nodes / Hadoop CLIs continue to use RPC directly through the inner firewall
• The same Knox deployment can front multiple Hadoop clusters (masters and slaves of Hadoop Cluster 1 and Hadoop Cluster 2)

Page 24

Knox: What’s New in Version 0.6

• Knox support for HDFS HA

• Support for YARN REST API

• Support for SSL to Hadoop Cluster Services (WebHDFS, HBase, Hive & Oozie)

• Knox Management REST API

• Integration with Ranger for Knox Service Level Authorization

• Use Ambari for install/start/stop/configuration

Page 25

Authorization

Page 26

The Hadoop Layers

Page 27

Authorization: Overview

• HDFS
  – Permissions
  – ACLs

• YARN
  – Queue ACLs

• Pig
  – No server component to check/enforce ACLs

• Hive
  – Column-level ACLs

• HBase
  – Cell-level ACLs

Page 28

Authorization: HDFS Permissions

hadoop fs -chown maya:sales /sales-data

hadoop fs -chmod 640 /sales-data
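Mode 640 grants the owner maya read/write, the group sales read-only, and denies everyone else; this can be checked with -ls -d:

```shell
# -d lists the directory entry itself rather than its contents;
# the permission column should then show rw-r----- with owner maya, group sales
hadoop fs -ls -d /sales-data
```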

Page 29

Authorization: HDFS ACLs

New requirements:
– Maya, Diana and Clark are allowed to make modifications
– A new group execs should be able to read the sales data

Page 30

Authorization: HDFS ACLs

hdfs dfs -setfacl -m group:execs:r-- /sales-data

hdfs dfs -getfacl /sales-data

hadoop fs -ls /sales-data
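The requirements from the previous slide translate into ACL entries like these (usernames as on the slide; diana and clark are assumed to exist as users):

```shell
# Let Diana and Clark modify the data alongside owner maya
hdfs dfs -setfacl -m user:diana:rw-,user:clark:rw- /sales-data

# Give the new execs group read-only access
hdfs dfs -setfacl -m group:execs:r-- /sales-data

# Verify the combined ACL
hdfs dfs -getfacl /sales-data
```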

Page 31

Authorization: HDFS Best Practices

• Start with traditional HDFS file permissions to implement most permission requirements

• Define a small number of ACLs to handle exceptional cases

• A file/folder with an ACL incurs an additional memory cost in the NameNode compared to a file/folder with traditional permissions

Page 32

Authorization: YARN Permissions

yarn.scheduler.capacity.root.longrunning-jobs.acl_submit_applications="etl,admin,Uwe"

yarn.scheduler.capacity.root.longrunning-jobs.acl_administer_queue="admin,Uwe"

Page 33

Authorization: Hive

• Hive has traditionally offered full-table access control via HDFS access control

• Solution for column-based control:
  – Let HiveServer2 check and submit the query execution
  – Make the table accessible only to a special (technical) user
  – Provide an authorization plugin to restrict UDFs and file formats

• Use standard SQL permission constructs: GRANT / REVOKE

• Store the ACLs in the Hive Metastore

Page 34

Authorization: Hive ATZ-NG

Details: https://issues.apache.org/jira/browse/HIVE-5837

Page 35

Authorization: Hive

CREATE ROLE sales_role;

GRANT ALL ON DATABASE 'sales-data' TO ROLE 'sales_role';

GRANT SELECT ON DATABASE 'marketing-data' TO ROLE 'sales_role';

CREATE ROLE sales_column_role;

GRANT 'c1,c2,c3' TO 'sales_column_role';

GRANT 'SELECT(c1, c2, c3)' ON 'secret_table' TO 'sales_column_role';

Page 36

Authorization: Pig

• There is no Pig (or MapReduce) server to submit and check column-based access

• Pig (and MapReduce) is restricted to full data access via HDFS access control

Page 37

Authorization: HBase

• The HBase permission model traditionally supported ACLs defined at the namespace, table, column family and column level
  – This is sufficient to meet most requirements

• Cell-based security was introduced with HBase 0.98
  – On par with the security model of Accumulo
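In the HBase shell these levels map onto grant commands; user, table and column family names below are illustrative:

```shell
# Table-level read/write for maya, column-family-level read for uwe
hbase shell <<'EOF'
grant 'maya', 'RW', 'sales_table'
grant 'uwe', 'R', 'sales_table', 'cf1'
EOF
```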

Page 38

Authorization & Auditing: Apache Ranger

Page 39

Ranger: Central Security Administration

Apache Ranger
• Delivers a Single Pane for the (Security) Administrator
• Centralizes administration of Security Policies
• Ensures consistent coverage across the entire Hadoop Stack

Page 40

Ranger: Authorization Policies

Page 41

Ranger: Auditing

Page 42

Ranger: Architecture

Page 43

Ranger: What’s New in Version 0.4?

• New Components Coverage
  – Storm Authorization & Auditing
  – Knox Authorization & Auditing

• Deeper Integration with HDP
  – Windows Support
  – Integration with Hive Auth API, support for grant/revoke commands
  – Support for grant/revoke commands in HBase

• Enterprise Readiness
  – REST APIs for the policy manager
  – Store audit logs locally in HDFS
  – Support for Oracle DB
  – Ambari support, as part of the Ambari 2.0 release

Page 44

Data Protection: Encryption

Page 45

Encryption: Data in motion

• Hadoop client to DataNode via Data Transfer Protocol
  – Client reads/writes to HDFS over an encrypted channel
  – Configurable encryption strength

• ODBC/JDBC client to HiveServer2
  – Encryption via SASL Quality of Protection

• Mapper to Reducer during the Shuffle/Sort phase
  – Shuffle is over HTTP(S)
  – Supports mutual authentication via SSL
  – Host name verification enabled

• REST protocols
  – SSL support
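These channels are switched on through standard Hadoop configuration properties, normally set via Ambari; a minimal sketch (values illustrative):

```shell
# hdfs-site.xml: encrypt the HDFS Data Transfer Protocol
#   dfs.encrypt.data.transfer           = true
#   dfs.encrypt.data.transfer.algorithm = 3des    # or rc4
# hive-site.xml: SASL Quality of Protection for JDBC/ODBC clients
#   hive.server2.thrift.sasl.qop = auth-conf      # auth | auth-int | auth-conf
# mapred-site.xml: encrypted shuffle between mappers and reducers
#   mapreduce.shuffle.ssl.enabled = true
```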

Page 46

Encryption: Data at rest

HDFS Transparent Data Encryption

• Install and run the KMS on top of HDP 2.2

• Change the corresponding HDFS parameters (via Ambari)

• Create an encryption key

hadoop key create key1 -size 256

hadoop key list -metadata

• Create an encryption zone using the key

hdfs dfs -mkdir /zone1

hdfs crypto -createZone -keyName key1 /zone1

hdfs crypto -listZones

• Details:
  – http://hortonworks.com/kb/hdfs-transparent-data-encryption/
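Once the zone exists, files written into it are encrypted at rest transparently; a quick smoke test might look like this (paths illustrative):

```shell
# Write a file into the encryption zone and read it back;
# en-/decryption is transparent for users with access to the key
echo "top secret" > /tmp/test.txt
hdfs dfs -put /tmp/test.txt /zone1/
hdfs dfs -cat /zone1/test.txt
```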

Page 47

Future

Page 48

Apache Atlas: Data Classification

Currently in Incubation

– https://wiki.apache.org/incubator/AtlasProposal

Page 49

Apache Atlas: Tag-based Policies

• Data ingestion / ETL: Falcon and Oozie orchestrate Sqoop and Flume to bring source data into the cluster
• A metadata server stores data classifications, e.g. Table1 tagged "marketing"
• Ranger applies tag-based policies (e.g. which group may create, with access logged) when the Beeline client accesses HiveServer2/HDFS (A, B, C)

Page 50

Future: More goodies

Dynamic, Attribute-based Access Control (ABAC)
• Extend Ranger to support data or user attributes in policy decisions
• Example: use the geo-location of users

Enhanced Auditing
• Ranger can stream audit data through Kafka & Storm into multiple stores
• Use Storm for correlation of data

Encryption as a First-Class Citizen
• Build native encryption support into HDFS, Hive & HBase
• Ranger-based key management to support encryption

Page 51

Contact Details

Twitter: @uweseiler

Mail: [email protected]

Phone: +49 176 1076531

XING: https://www.xing.com/profile/Uwe_Seiler