How Hue integrates Hadoop with Django

Django + NoSQL: How Hue Integrates with Hadoop. Abraham Elmahrek, Cloudera, March 5th, 2014 (Monday, March 3, 14)

description

Given the different structure of big data systems, they can be difficult to query, and even more difficult to explore. Hue, a Django-driven web application, integrates with these components and provides a clean, easy-to-use interface. In this discussion, we'll cover how the Hue project addressed communicating with HBase, HDFS, and various query engines. We'll also cover the reasons behind these design decisions.

Transcript of How Hue integrates Hadoop with Django

Page 1: How Hue integrates Hadoop with Django

Django + NoSQL

How Hue Integrates with Hadoop

Abraham Elmahrek

Cloudera - March 5th, 2014

Monday, March 3, 14

Page 2: How Hue integrates Hadoop with Django

What is Hue?

HUE 1

A desktop-like interface in the browser. It did its job but was pretty slow, leaked memory, and was not very IE-friendly. Still, it was definitely advanced for its time (2009-2010).

Page 3: How Hue integrates Hadoop with Django

HISTORY

HUE 2

The first flat structure port, with Twitter Bootstrap all over the place.

Page 4: How Hue integrates Hadoop with Django

HISTORY

HUE 2.5

New apps and an improved UX, with new features such as autocomplete and drag & drop.

Page 5: How Hue integrates Hadoop with Django

HISTORY

HUE 3 ALPHA

Proposed design, didn’t make it.

Page 6: How Hue integrates Hadoop with Django

HISTORY

HUE 3

Transition to the new UI, major improvements and new apps.

Page 7: How Hue integrates Hadoop with Django

HISTORY

HUE 3.5+

Page 8: How Hue integrates Hadoop with Django

APPS

PIG
JOB BROWSER
JOB DESIGNER
OOZIE
HIVE
IMPALA
METASTORE BROWSER
SEARCH
HBASE BROWSER
SQOOP
ZOOKEEPER
USER ADMIN
DB QUERY
SPARK
HOME
...

Diagram labels: GUI DESIGN, FILE BROWSER, USER WORKFLOWS, USER

Page 9: How Hue integrates Hadoop with Django

APPS

Diagram labels: YARN, JobTracker, Oozie, Pig, HDFS, HiveServer2, Hive Metastore, Cloudera Impala, Solr, HBase, Sqoop2, ZooKeeper, LDAP, SAML, Hue Plugins

Page 10: How Hue integrates Hadoop with Django

FAST PACE

LAST MONTH

91 issues created and 90 resolved.

Core team + Community

Page 11: How Hue integrates Hadoop with Django

STACK

BACKEND

Python + Django (2.6+/1.4.5)

FRONTEND

jQuery

Bootstrap

Knockout.js

Love

Page 12: How Hue integrates Hadoop with Django

HADOOP INTERFACES

REST & THRIFT

Many Hadoop interfaces used

WebHDFS
YARN API (RM, NM, MR...)
HiveServer2
Impala
HBase
Oozie
Sqoop2
ZooKeeper
...

CUSTOM CLIENTS

Provide custom clients for more explicit API definitions

Page 13: How Hue integrates Hadoop with Django

PROTOCOLS

REST

Use python-requests and a custom client to streamline RESTful interface calls.

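    # Wrapper function added so the snippet is complete; its name is illustrative
    def make_client(url, security_enabled):
        # HTTP errors from this client surface as WebHdfsException
        client = http_client.HttpClient(url,
                                        exc_class=WebHdfsException,
                                        logger=LOG)
        if security_enabled:
            client.set_kerberos_auth()
        return client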

Thrift

Custom connection pooling and socket multiplexing to streamline thrift calls.

    thrift_util.get_client(TCLIService.Client,
                           query_server['server_host'],
                           query_server['server_port'],
                           service_name=query_server['server_name'],
                           kerberos_principal=kerberos_principal_short_name,
                           use_sasl=use_sasl,
                           mechanism=mechanism,
                           username=user.username,
                           timeout_seconds=conf.SERVER_CONN_TIMEOUT.get(),
                           use_ssl=conf.SSL.ENABLED.get(),
                           ca_certs=conf.SSL.CACERTS.get(),
                           keyfile=conf.SSL.KEY.get(),
                           certfile=conf.SSL.CERT.get(),
                           validate=conf.SSL.VALIDATE.get())
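
The pooling idea can be sketched without Thrift itself. What follows is a minimal, hypothetical pool, not Hue's `thrift_util` implementation: the names `ConnectionPool`, `factory`, and `connection` are invented for illustration. It keeps idle connections per (host, port) and reuses them instead of opening a new socket for every call.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical per-endpoint pool; not Hue's thrift_util implementation."""

    def __init__(self, factory, max_size=4):
        self._factory = factory    # callable(host, port) -> connection object
        self._max_size = max_size  # idle connections kept per endpoint
        self._pools = {}           # (host, port) -> queue of idle connections

    @contextmanager
    def connection(self, host, port):
        key = (host, port)
        if key not in self._pools:
            self._pools[key] = queue.Queue(self._max_size)
        idle = self._pools[key]
        try:
            conn = idle.get_nowait()          # reuse an idle connection if any
        except queue.Empty:
            conn = self._factory(host, port)  # otherwise open a new one
        try:
            yield conn
        finally:
            try:
                idle.put_nowait(conn)         # hand it back for the next caller
            except queue.Full:
                pass                          # pool is full: drop the extra one


# Usage with a fake factory that records how many connections were opened
opened = []


def fake_factory(host, port):
    opened.append((host, port))
    return object()


pool = ConnectionPool(fake_factory)
with pool.connection("hs2.example.com", 10000) as first:
    pass
with pool.connection("hs2.example.com", 10000) as second:
    pass
```

In this sketch the second `with` block receives the same object as the first, so only one connection is opened. A production pool would also need to discard connections that failed mid-call, which this sketch does not attempt.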

Page 14: How Hue integrates Hadoop with Django

ACCESSIBILITY

Middleware

Make Hadoop interfaces accessible in request objects

    class ClusterMiddleware(object):
        def process_view(self, request, ...):
            # Attach the HDFS client to the request and act as the logged-in user
            request.fs = cluster.get_hdfs(request.fs_ref)
            if request.user.is_authenticated():
                if request.fs is not None:
                    request.fs.setuser(request.user.username)

    # Views can then use request.fs directly
    def download(request, path):
        if not request.fs.exists(path):
            raise Http404(_("File not found."))
        if not request.fs.isfile(path):
            raise PopupException(_("not a file."))
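
To see the middleware pattern in isolation, here is a framework-free sketch that can run without Django or a cluster. `FakeFs`, `ClusterMiddlewareSketch`, and the `get_fs` callable are hypothetical stand-ins, and `is_authenticated` is read as an attribute (as in recent Django versions) rather than called as a method as on the slide.

```python
from types import SimpleNamespace


class FakeFs:
    """Stand-in for an HDFS client; it only tracks the effective user."""

    def __init__(self):
        self.user = None

    def setuser(self, username):
        self.user = username


class ClusterMiddlewareSketch:
    """Attaches a filesystem client to each request before the view runs."""

    def __init__(self, get_fs):
        self._get_fs = get_fs  # callable returning a client, or None

    def process_view(self, request, view_func, view_args, view_kwargs):
        request.fs = self._get_fs()
        if request.fs is not None and request.user.is_authenticated:
            # Act on the cluster as the logged-in user
            request.fs.setuser(request.user.username)
        return None  # None tells Django to carry on to the view


middleware = ClusterMiddlewareSketch(FakeFs)

request = SimpleNamespace(
    user=SimpleNamespace(username="alice", is_authenticated=True))
middleware.process_view(request, None, (), {})

anonymous = SimpleNamespace(
    user=SimpleNamespace(username="", is_authenticated=False))
middleware.process_view(anonymous, None, (), {})
```

After these calls, `request.fs.user` is `"alice"` while `anonymous.fs.user` stays `None`, so views never have to look up the client or set the user themselves.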

Page 15: How Hue integrates Hadoop with Django

HDFS

Goal

Easily browse, create, read, update, and delete files in HDFS

Page 16: How Hue integrates Hadoop with Django

HDFS - Communication

REST

The NameNode provides a RESTful server called WebHDFS

    http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE
    http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN
    ...

Request Accessible

Provide a middleware for populating a request member

    def download(request, path):
        if not request.fs.exists(path):
            raise Http404(_("File not found."))
        if not request.fs.isfile(path):
            raise PopupException(_("not a file."))

Explicit Client

Provide an API that is explicit

    class WebHdfs(Hdfs):
        def create(self, path, ...):
            ...
        def read(self, path, ...):
            ...
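
The WebHDFS URL pattern shown on this slide can be assembled with the standard library alone. This is a small sketch, not Hue's `WebHdfs` class: the helper name `webhdfs_url`, the example host, and port 50070 are assumptions chosen for illustration.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for one operation on an absolute HDFS path."""
    query = urlencode({"op": op, **params})
    # quote() escapes characters such as spaces but leaves "/" untouched
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, quote(path), query)


open_url = webhdfs_url("namenode.example.com", 50070,
                       "/user/alice/data.csv", "OPEN")
create_url = webhdfs_url("namenode.example.com", 50070,
                         "/tmp/new file.txt", "CREATE", overwrite="true")

print(open_url)
# http://namenode.example.com:50070/webhdfs/v1/user/alice/data.csv?op=OPEN
print(create_url)
# http://namenode.example.com:50070/webhdfs/v1/tmp/new%20file.txt?op=CREATE&overwrite=true
```

An explicit client such as `create(path)` or `read(path)` hides this string building, so callers never deal with `op=` parameters directly.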

Page 17: How Hue integrates Hadoop with Django

HDFS - Cool Things

MIME Type Detection

Detect the various kinds of files being read: Avro, GZIP, etc.

Pagination

Nice pagination by block size when viewing a file (soon to be more like a PDF reader with content automatically being added)
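
Both ideas are simple to sketch. The helpers below are hypothetical and not taken from Hue: `detect_type` checks the leading magic bytes of gzip and Avro container files, and `page_range` maps a 1-based page number to a byte range, assuming one block per page.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip member
AVRO_MAGIC = b"Obj\x01"   # first four bytes of an Avro object container file


def detect_type(head):
    """Guess a file type from the first bytes of its content."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(AVRO_MAGIC):
        return "avro"
    return "text"  # fallback when no known signature matches


def page_range(page, block_size, file_length):
    """Return (offset, length) for a 1-based page, one block per page."""
    offset = (page - 1) * block_size
    if page < 1 or offset >= file_length:
        raise ValueError("page out of range")
    # The last page may be shorter than a full block
    return offset, min(block_size, file_length - offset)


print(detect_type(gzip.compress(b"hello")))  # gzip
print(page_range(3, 128, 300))               # (256, 44)
```

A viewer would read only `length` bytes starting at `offset`, so opening a large file costs one block rather than the whole file.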

Page 18: How Hue integrates Hadoop with Django

HBase

Goal

Make it easy to view and search HBase

Page 19: How Hue integrates Hadoop with Django

HBase - Technical Risk

2 Dimensions

Infinitely many columns and rows

Sparseness

Column names will often differ per row
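
A short sketch shows why sparseness complicates display. Assuming rows arrive as a mapping of row key to `{column: value}` (a made-up shape for illustration, not Hue's data model), a tabular view first has to take the union of all column names and then fill the gaps.

```python
def to_grid(rows):
    """Turn sparse {row_key: {column: value}} data into a header and dense rows."""
    # Every row may contribute different columns, so collect the union first
    columns = sorted({col for cells in rows.values() for col in cells})
    grid = [[key] + [cells.get(col, "") for col in columns]
            for key, cells in sorted(rows.items())]
    return columns, grid


rows = {
    "user1": {"info:name": "Ada", "info:email": "ada@example.com"},
    "user2": {"info:name": "Bob", "stats:logins": "7"},
}
columns, grid = to_grid(rows)
print(columns)  # ['info:email', 'info:name', 'stats:logins']
for line in grid:
    print(line)
```

With many rows the union of columns grows without bound, which is why a browser for this kind of data needs filtering and paging rather than a single fixed table.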

Page 20: How Hue integrates Hadoop with Django

HBase - Communication

Thrift

Communicate with HBase using Thrift for better filtering

Explicit Client

Provide an API that is explicit

    # Base class changed from Hdfs to object: an HBase client is not an HDFS client
    class HBaseApi(object):
        def createTable(self, cluster, tableName, ...):
            ...
        def getRows(self, cluster, tableName, columns, ...):
            ...

Page 22: How Hue integrates Hadoop with Django

Hive

Goal

Make it easy to run queries in Hive

Page 23: How Hue integrates Hadoop with Django

Hive - Communication

Thrift

Communicate with HiveServer2 using Thrift

    thrift_util.get_client(TCLIService.Client,
                           query_server['server_host'],
                           query_server['server_port'],
                           service_name=query_server['server_name'],
                           ...)

DBMS

Extend the capabilities of the DBMS layer in Hue

    class HiveServer2Dbms(object):
        def get_databases(self):
            return self.client.get_databases()
        ...
        def select_star_from(self, database, table):
            hql = "SELECT * FROM `%s.%s` %s" % (database, table.name, self._get_browse_limit_clause(table))
            return self.execute_statement(hql)
        ...

Explicit Client

Provide a higher level API that is explicit and easy to configure

    class HiveServerClient:
        HS2_MECHANISMS = {'KERBEROS': 'GSSAPI', 'NONE': 'PLAIN', 'NOSASL': 'NOSASL'}

        def __init__(self, query_server, user, ...):
            thrift_util.get_client(TCLIService.Client,
                                   ...
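
The `HS2_MECHANISMS` table maps a HiveServer2 authentication setting to a SASL mechanism name. One way such a table might feed the `use_sasl` and `mechanism` arguments seen earlier is sketched below. The function `sasl_settings` is hypothetical, and treating `NOSASL` as the only non-SASL case is an assumption of this sketch rather than Hue's exact logic.

```python
HS2_MECHANISMS = {"KERBEROS": "GSSAPI", "NONE": "PLAIN", "NOSASL": "NOSASL"}


def sasl_settings(hive_authentication):
    """Map a HiveServer2 authentication setting to (use_sasl, mechanism)."""
    try:
        mechanism = HS2_MECHANISMS[hive_authentication.upper()]
    except KeyError:
        raise ValueError("unsupported authentication: %r" % hive_authentication)
    # Assumption: only the NOSASL setting means a raw, non-SASL transport
    return mechanism != "NOSASL", mechanism


print(sasl_settings("KERBEROS"))  # (True, 'GSSAPI')
print(sasl_settings("nosasl"))    # (False, 'NOSASL')
```

Keeping the mapping in one place means the rest of the client only passes two values through to the Thrift layer.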

Page 24: How Hue integrates Hadoop with Django

Hive - Results

One Page App

Intelligent view that lets users worry only about their queries

Navigation

Able to navigate databases and tables easily

Secure

Achieved some level of security through SASL, Kerberos, and SSL

Page 25: How Hue integrates Hadoop with Django

DEMO TIME

Page 27: How Hue integrates Hadoop with Django

What else does Hue do with Django?

Extensible settings

Configuration of settings.py provided through the hue.ini

Testing

Mocked and functional tests via nose + django-nose

Authentication

LDAP, PAM, OAuth, etc. provided through authentication backends

Security

Configurable session timeouts, SAML authentication, etc.

Doc Model

Polymorphic documents via a base document model

Permissions

Per-app permissions configurable in the UserAdmin
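
For the "Extensible settings" item, the general technique of driving Django-style settings from an ini file can be sketched with the standard library. This is illustrative only: the sample keys and the `load_settings` helper are assumptions, and a real hue.ini uses nested sections that `configparser` would not parse.

```python
import configparser

# Hypothetical, flattened sample; not a real hue.ini
SAMPLE_INI = """
[desktop]
secret_key = change-me
time_zone = America/Los_Angeles
"""


def load_settings(ini_text):
    """Read an ini string and return a dict of Django-style setting names."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    desktop = parser["desktop"]
    return {
        "SECRET_KEY": desktop.get("secret_key", fallback=""),
        "TIME_ZONE": desktop.get("time_zone", fallback="UTC"),
    }


settings = load_settings(SAMPLE_INI)
print(settings["TIME_ZONE"])  # America/Los_Angeles
```

A settings module could then copy these values into module-level names, so operators edit one ini file instead of Python code.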

Page 28: How Hue integrates Hadoop with Django

GET HUE

Try the latest and greatest in advance, but you’ll have to configure everything on your own.

Get to play with Hue and various Hadoop components in 5 minutes. It’s a self-contained CDH environment, ready to use.

Newer version than HDP, close to the original 2.5 minus apps like HBase, Impala, Sqoop, Search.

The newest addition, ships Hue 3.0 through the GreenButton products.

Stable and highly tested releases perfectly integrated with the Hadoop ecosystem, automagically configured by Cloudera Manager.

In HDP there’s an old forked version of Hue 2.3.

CLOUDERA’S CDH, TARBALL, CLOUDERA’S DEMO VM

HORTONWORKS*, MAPR*, HP CLOUD*

BIGTOP, EMBEDDED/DEMO IN IND. COMPANIES

* YOUR MILEAGE MAY VARY.

Page 30: How Hue integrates Hadoop with Django

THANKS.

gethue.com

QUESTIONS?
