Post on 14-Jan-2017
© Cloudera, Inc. All rights reserved.
UIUC Fireside Chats
Building NoSQL Applications, Hadoop in the Cloud, and Other Assorted Topics
Introductions
Aleks Shulman - Software Engineer in Test
Agenda
• Introductions
• The Technicals
  • NoSQL Applications with HBase
  • Considerations for Hadoop in the Cloud
• Parting Thoughts & Suggestions
  • Small company vs. big company
  • Thoughts on becoming a successful engineer
  • Thoughts on the Bay Area
• Q & A
• Happy Hour
A Little About Myself
A_Shulman
aleks@cloudera.com
Aleks Shulman
• Current: Cloudera, Cloud Team, Test Engineer
• Past: Salesforce.com, Platform API Team, Quality Engineering
• School: UIUC ’10, Computer Science (bachelor's) & Aerospace Engineering (bachelor's)
A Little About Cloudera
Engineering - @ClouderaEng
Corporate - @Cloudera
Life at Cloudera - @ClouderaJobs
• Started 2009 by former Facebook, Yahoo!, Google, and Oracle engineers and execs
• 1000+ employees as of 10/2015
• Distributes Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera's 100% open source distribution of Hadoop
• Distributes Cloudera Manager (CM), a proprietary monitoring and management layer atop CDH
• Contributes heavily to the open source community
• Employs 50+ Apache committers across the community (84 committerships); 12+ top-level Apache projects originated at Cloudera
Building NoSQL Applications
Overview and HBase Case Study
Why Are NoSQL Applications Interesting?
Relational databases are SO 2005
• Build higher-scale systems
• Think differently about what it means to store & retrieve data
• Solve different types of problems
  • Data variability
  • Data variety
  • Data velocity
• Expertise is highly coveted in industry
Key Database Terms and Concepts
• Referential Integrity - Valid references across tables
• Transactions - Atomicity, consistency, isolation, and durability (ACID) while doing concurrent sets of multiple R+W requests
• Joins - Constructing a view of data from two or more tables using a common criterion
• Locking - Not permitting access because someone or something else is using it
ClassId  ClassName    ProfessorId
CS125    Intro To CS  003
CS241    Systems      002
CS473    Algorithms   005

ProfessorId  ProfessorFName  ProfessorLName
001          Chandra         Chekuri
002          Robin           Kravets
003          Lawrence        Angrave
004          Jeff            Erickson
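To make "joins" and "referential integrity" concrete, here is a minimal plain-Java sketch over the two example tables. It is only an illustration: the class and method names are made up for this example, professors are identified by surname only, and a real RDBMS would do this with a JOIN and a foreign-key constraint. Note that in the tables as shown, CS473 points at ProfessorId 005, which has no row in the professor table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JoinSketch {
    // Class table rows from the slide: {ClassId, ClassName, ProfessorId}.
    static final String[][] CLASSES = {
        {"CS125", "Intro To CS", "003"},
        {"CS241", "Systems", "002"},
        {"CS473", "Algorithms", "005"},
    };

    // Professor table keyed by ProfessorId (the primary key) -> surname.
    static Map<String, String> professors() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("001", "Chekuri");
        p.put("002", "Kravets");
        p.put("003", "Angrave");
        p.put("004", "Erickson");
        return p;
    }

    // Inner join on ProfessorId: one output row per class whose professor exists.
    public static List<String> join() {
        Map<String, String> profs = professors();
        List<String> rows = new ArrayList<>();
        for (String[] c : CLASSES) {
            String surname = profs.get(c[2]);
            if (surname != null) {
                rows.add(c[0] + " | " + c[1] + " | " + surname);
            }
        }
        return rows;
    }

    // Referential-integrity check: ClassIds whose ProfessorId matches no professor row.
    public static List<String> danglingReferences() {
        Map<String, String> profs = professors();
        List<String> bad = new ArrayList<>();
        for (String[] c : CLASSES) {
            if (!profs.containsKey(c[2])) {
                bad.add(c[0]);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        join().forEach(System.out::println);
        System.out.println("Dangling references: " + danglingReferences());
    }
}
```

A database that enforces referential integrity would reject the CS473 row at insert time instead of leaving the check to application code like `danglingReferences()`.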
What is this NoSQL thing?
• Premise - Full relational access to all data may not be necessary+
  • Performance penalties
  • Implementation complexity
• If we relax those constraints we can…
  • Process more data
  • Have a more flexible schema
  • Scale out instead of scale up
• Successful NoSQL databases
  • Document stores: MongoDB, CouchDB
  • Key-Value stores: HBase, Cassandra, Riak KV
  • Graph stores: Giraph, Neo4j
+ http://hbase.apache.org/acid-semantics.html
Case Study: NoSQL Applications with Hadoop & HBase
• What is Hadoop?
  • Open-source framework for crunching LOTS of data!
  • Grew out of work at Yahoo! in the mid-2000s, inspired by papers published by Google
• Why is it such a big deal?
  • Democratizing access to extremely powerful tools
  • Solve problems that have never been possible to solve
  • Help enterprises & institutions learn from and use all their data!
Hadoop Architecture
• Philosophy
  • Distributed computing
  • Commodity hardware
  • Accept, embrace, and handle failure
  • One set of data - multiple processing engines
  • Linear scalability
• Core Hadoop
  • Storage - HDFS
  • Processing - MapReduce
• ...and we build from there
[Diagram: stack with layers labeled MapReduce, HDFS, Other Components, ZooKeeper, JVM, OS, and Physical Hardware, grouped under "Hadoop" and "Infrastructure"]
Hadoop Architecture - The Rest of the Stack
[Diagram: stack layers, grouped under "Hadoop Ecosystem" and "Core Hadoop"]
• YOUR APPLICATION HERE
• NoSQL: HBase, Accumulo, Kudu
• Processing: MR, YARN, Spark
• Query: Impala, Phoenix, Hive
• Coordination: ZooKeeper
• Storage: HDFS (or Isilon, S3, etc.)
• Infrastructure: Linux
Hadoop - Cluster Topology
[Diagram: a single machine, and a cluster of many machines]
Hadoop - Cluster Command And Control
[Diagram: a single machine, and a cluster of many machines in which three machines are labeled "M"]
What Is HBase?
• Distributed, ColumnFamily-Oriented Key-Value Data-Store
• Modeled after Google’s BigTable paper
• Scalable, low-latency, consistent, random-access
• Non-relational
• Built atop HDFS
• Apache-Licensed Open Source
What Is HBase?
                              RDBMS                          HBase
Data Layout                   Structured & row-oriented      Semi-structured, column-family-oriented
Schema                        Defined at create              Defined at create & runtime
Transactions                  Multi-row ACID                 Single row only
Query Language                SQL                            get/put/scan/increment/etc.
Security                      Authentication, Authorization  Authentication (Kerberos), Authorization (ACLs)
Indexes                       On arbitrary columns           Row-key only
Max Data Size                 TBs                            ~1 PB
Read/write throughput limits  1000s queries/second           Millions of “queries”/second
HBase Is A Set Of Tables Defined by KVs
• The row key is the implicit PRIMARY KEY in RDBMS terms
• Column format is family:qualifier
• Data is all byte[] in HBase
• Different rows may have different sets of columns (table is sparse)
• A single cell might have different values at different timestamps
Key: cutting/info:height/<timestamp>  Value: ‘9ft’
Key: tlipcon/roles:hbase/<timestamp>  Value: ‘Committer’
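One way to picture this data model is as nested sorted maps: row key, then family:qualifier, then timestamp, then value. The sketch below is plain Java and does not use the HBase client. The class and method names are invented for illustration, and row keys and columns are Strings here for readability, whereas HBase stores everything as byte[].

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

public class CellModelSketch {
    // row key -> "family:qualifier" -> timestamp (newest first) -> value bytes
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
        new TreeMap<>();

    // Store one version of a cell; rows and columns are created on demand (sparse table).
    public void put(String row, String column, long timestamp, String value) {
        NavigableMap<String, NavigableMap<Long, byte[]>> columns =
            rows.computeIfAbsent(row, r -> new TreeMap<>());
        NavigableMap<Long, byte[]> versions =
            columns.computeIfAbsent(column, c -> new TreeMap<>(Comparator.<Long>reverseOrder()));
        versions.put(timestamp, value.getBytes(StandardCharsets.UTF_8));
    }

    // Latest version of a cell, or null if the row or column does not exist.
    public String getLatest(String row, String column) {
        NavigableMap<String, NavigableMap<Long, byte[]>> columns = rows.get(row);
        if (columns == null) return null;
        NavigableMap<Long, byte[]> versions = columns.get(column);
        if (versions == null) return null;
        return new String(versions.firstEntry().getValue(), StandardCharsets.UTF_8);
    }

    // Number of timestamped versions stored for a cell.
    public int versionCount(String row, String column) {
        NavigableMap<String, NavigableMap<Long, byte[]>> columns = rows.get(row);
        if (columns == null || !columns.containsKey(column)) return 0;
        return columns.get(column).size();
    }

    public static void main(String[] args) {
        CellModelSketch table = new CellModelSketch();
        table.put("cutting", "info:height", 1L, "8ft");
        table.put("cutting", "info:height", 2L, "9ft");
        table.put("tlipcon", "roles:hbase", 1L, "Committer");
        System.out.println(table.getLatest("cutting", "info:height"));
        System.out.println(table.getLatest("tlipcon", "info:height"));
    }
}
```

The two rows hold different columns, and the `cutting` cell keeps both versions while reads return the newest one.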
Your First HBase Java App
MyClient.java

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.*;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import java.io.IOException;

    public class MyClient {
        public static void main(String[] args) throws IOException {
            final String TABLE_NAME = "myTable_" + System.currentTimeMillis();
            final String CF_NAME = "myColumnFamily";

            // Load hbase-default.xml and hbase-site.xml from the classpath
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin myAdmin = new HBaseAdmin(conf);
            try {
                // Create the table with a single column family
                HTableDescriptor htd = new HTableDescriptor(TableName.valueOf(TABLE_NAME));
                htd.addFamily(new HColumnDescriptor(CF_NAME));
                myAdmin.createTable(htd);

                // List the tables (e.g. select name from tables)
                for (TableName t : myAdmin.listTableNames()) {
                    System.out.println("Table: " + t.getNameAsString());
                }
            } finally {
                // Release the connection held by the admin client
                myAdmin.close();
            }
        }
    }

pom.xml

    <project xmlns="http://maven.apache.org/POM/4.0.0">
      <modelVersion>4.0.0</modelVersion>
      <groupId>myGroup</groupId>
      <artifactId>myClient</artifactId>
      <version>1.0-SNAPSHOT</version>
      <properties>
        <hadoop.version>2.3.0</hadoop.version>
        <hbase.version>0.98.2-hadoop2</hbase.version>
      </properties>
      <dependencies>
        <dependency>
          <groupId>org.apache.hbase</groupId>
          <artifactId>hbase-client</artifactId>
          <version>${hbase.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
          <version>${hadoop.version}</version>
        </dependency>
      </dependencies>
    </project>
Running Your First HBase Java App
1. Get HBase (0.98, for example)
   List of mirrors: http://www.apache.org/dyn/closer.cgi/hbase/
       wget -O hbase.tar.gz <mirror link>
       mkdir hbase && tar -xf ./hbase.tar.gz -C hbase --strip-components=1
2. Start HBase Locally
       ./hbase/bin/start-hbase.sh
3. Build & Run Your App
       mvn clean compile package -DskipTests
       mvn exec:java*
* A little config is required in Maven. To avoid this, add the pom and the client into IntelliJ and run as an application.
So you want to use a NoSQL Data Store?
• What data is being stored?
  • Entity data
  • Event data
• Why is the data being stored?
  • Operational use cases
  • Analytical use cases
• How does the data get in and out?
  • Real time vs. Batch
  • Random vs. Sequential
Why are you storing the data?
• So what kind of questions are you asking the data?
  • Entity-centric questions
    • Give me everything about entity e
    • Give me the most recent event v about entity e
    • Give me the n most recent events V about entity e
    • Give me all events V about e between time [t1,t2]
  • Event and Time-centric questions
    • Give me an aggregate for each entity between time [t1,t2]
    • Give me an aggregate for each time interval for entity e
    • Find events V that match some other given criteria
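In a store indexed only by a sorted row key, questions like these are usually answered by how the key is laid out. The sketch below is one possible layout, not something taken from the slides: a hypothetical key of entity id plus a reversed, zero-padded timestamp, with a `TreeMap` standing in for a table sorted by row key. It assumes non-negative timestamps, t1 <= t2, entity ids that do not contain the `|` separator, and at most one event per entity per timestamp.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

public class EventKeySketch {
    // Sorted by key, standing in for a table that keeps rows ordered by row key.
    private final TreeMap<String, String> table = new TreeMap<>();

    // Hypothetical key layout: entity, separator, zero-padded (Long.MAX_VALUE - timestamp).
    // Reversing the timestamp makes newer events sort before older ones.
    static String rowKey(String entity, long timestamp) {
        return entity + "|" + String.format(Locale.ROOT, "%019d", Long.MAX_VALUE - timestamp);
    }

    public void record(String entity, long timestamp, String event) {
        table.put(rowKey(entity, timestamp), event);
    }

    // "Give me the n most recent events about entity e": range scan, stop after n rows.
    public List<String> mostRecent(String entity, int n) {
        List<String> out = new ArrayList<>();
        for (String v : table.subMap(rowKey(entity, Long.MAX_VALUE), true,
                                     rowKey(entity, 0L), true).values()) {
            if (out.size() == n) break;
            out.add(v);
        }
        return out;
    }

    // "Give me all events about e between time [t1,t2]", returned newest first.
    public List<String> between(String entity, long t1, long t2) {
        return new ArrayList<>(
            table.subMap(rowKey(entity, t2), true, rowKey(entity, t1), true).values());
    }

    public static void main(String[] args) {
        EventKeySketch events = new EventKeySketch();
        events.record("sensor1", 100L, "a");
        events.record("sensor1", 200L, "b");
        events.record("sensor1", 300L, "c");
        System.out.println(events.mostRecent("sensor1", 2));
        System.out.println(events.between("sensor1", 100L, 200L));
    }
}
```

The aggregate-style questions (per entity, or per time interval) do not fall out of this layout as directly; they generally need a scan over a larger key range or a separate batch job.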
Entity Centric Data
• Entity data is information about current state
  • Generally real time reads and writes
• Examples:
  • Accounts
  • Users
  • Geolocation points
  • Click counts and metrics
  • Current sensor readings
• Scales up with # of humans and # of machines/sensors
  • Billions of distinct entities
Event Centric Data
• Event-centric data is time-series data: successive data points recorded at intervals over time
  • Generally real time write, some combination of real time read or batch read
• Examples:
  • Sensor data over time
  • Historical stock ticker data
  • Historical metrics
  • Clicks time-series
• Scales up due to finer grained intervals, retention policies, and the passage of time
If You Need SQL, Cross-Table Transactions, and ACID
• NoSQL -> Not Only SQL
• Hadoop query engines
  • SpliceMachine
  • Apache Phoenix
  • HP’s Trafodion
The Cloud
Possibilities & Considerations
Why Might You Find Hadoop in the Cloud Interesting?
• Flexibility
  • API-defined computing
  • Rapid prototyping and POC
  • Burst capacity
• Exciting new data persistence options
  • Network-attached storage
  • File stores
  • Block stores
• Cluster topologies & lifecycles
  • Short-lived vs. long-lived
Hadoop + Cloud?
• Pros
  • A natural fit
  • Elasticity and workloads
  • Easy to prove out new technology
• Cons
  • Security Policy
  • Cost
  • Lack of transparency/control
  • Relatively untested/unproven
Cloud Providers
• Short-lived Clusters
  • Microsoft HDInsight (Azure)
  • Amazon EMR (AWS)
• Longer-term Clusters
  • Cloudera Director (Azure, AWS, GCP)
What Can Go Wrong
• Scale
  • Provisioning
  • Timeouts
  • Retry counts
• Network
  • Connectivity issues
  • VPC/NAT throttling
  • Latency
• AWS
  • Oversubscribed hardware
  • Network connectivity/throughput issues
  • Opaque topology
• Network-Attached Storage
  • High latency
  • Low throughput
• Running on a file store
  • Semantic mismatches
  • File operation incompatibility
• OS / Machine Image
  • Suboptimal memory/disk tuning
  • Hypervisor issues
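Timeouts and retry counts tend to surface because cloud operations such as provisioning fail transiently. A common way to cope, sketched here in plain Java, is to retry a bounded number of times with an exponentially growing delay. The helper name, its parameters, and the simulated failure are all hypothetical and not part of any Cloudera or cloud-provider API.

```java
import java.util.function.Supplier;

public class RetrySketch {
    // Calls op until it succeeds or maxAttempts is reached, doubling the wait between tries.
    // Rethrows the last failure if every attempt fails.
    public static <T> T withRetries(Supplier<T> op, int maxAttempts, long initialDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMs;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(delay);
                    delay *= 2; // exponential backoff
                }
            }
        }
        throw last;
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException ie) {
            // Preserve the interrupt and stop retrying.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", ie);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated provisioning call that fails twice before succeeding.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("instance not ready");
            }
            return "provisioned";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The retry count and initial delay are exactly the knobs that need retuning at scale: values that work for a handful of instances can be too aggressive or too short when provisioning hundreds.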
What Can Go Wrong - Case Study: Network-Attached Storage
• Network-attached storage
  • Backing store is some kind of block store
  • Blocks organized into logical disks
  • Disks are mounted
  • Should just work, right?!
• What can go wrong
  • High latency
  • Low throughput
  • Write/Read timeouts
• How to fix:
  • OS (memory) tuning
  • Disk tuning
  • Cluster (filesystem) tuning
  • Application tuning
Other Cloud Considerations
• SSD vs. Rotating disk
• PV vs. HVM Machine Images
• Other virtualization considerations
  • # of disks
  • Memory considerations
  • Instance types vs. workloads
Less Technical Matters
Thoughts on engineering, personal growth, living in the Bay Area, and other topics
Small vs. Big Company
• Pros
  • Greater impact/greater leverage
  • Less process/bureaucracy
  • Dedicated staff/founders
  • Ownership of key, visible products/features
  • Closer personal ties with co-workers
• Cons
  • The buck stops with you
  • Nights & Weekends
  • Requirements and business direction change
  • Common problems don’t yet have solutions
Frenemies
• Sometimes companies have shared interests, but compete
• Customers at one level of the stack, collaborators at another level, competitors at a third
• Keep competition professional & ethical - things change often
  • People move companies
  • Companies get acquired
  • Companies decide to partner or merge
Good Engineers...
Apply Good Engineering Patterns & Processes
• Build small, dependable, trusted kernels, and then scale up
• Look to reuse as much as possible
• Think twice, implement once
• Pick their technical battles very carefully
• Knowing which constraints can be relaxed, and when, can be really helpful (don’t boil the ocean!)
Are Always Learning and Thinking In Patterns
• Look for natural interfaces to things
• Aren’t afraid to go beyond abstractions
• Able to quickly understand unfamiliar systems in terms of more familiar systems
• Invest in tools & tools knowledge
Communicate
• Conscious of their communication style and that of others
• Seek out feedback regularly and make sure to use it
• Give feedback compassionately and delicately
• Look for mentorship & mentor others where appropriate
Have a Sense of Self
• Are self-aware & play to their own strengths
• Understand and mitigate their weaknesses
• Self-regulate to avoid burn-out
SF & The Bay Area
Work
• The Technology
  • Hub for Innovation
• The People
  • Very intelligent, driven, and unique
  • Extremely diverse
Play
• Weather
  • No snow!
  • Drive to your choice of weather
• Food
  • Virtually unlimited options and availability, usually a few minutes away
Staying In Touch
aleks@cloudera.com
kathryn.obrien@cloudera.com
@a_shulman @cloudera @clouderaEng @clouderaJobs
We’re Hiring!!!
Internships: Summer 2016
Full Time Engineering