HBase: How to get MTTR below 1 minute
Uploaded by: Hortonworks

Transcript of HBase: How to get MTTR below 1 minute
How to get the MTTR below 1 minute and more
Devaraj Das ([email protected])
Nicolas Liochon ([email protected])
Outline
• What is this? Why are we talking about this topic? Why it matters? …
• HBase Recovery – an overview
• HDFS issues
• Beyond MTTR (performance post recovery)
• Conclusion / Future / Q & A
What is MTTR? Why is it important? …
• Mean Time To Recovery -> the average time required to repair a failed component (Courtesy: Wikipedia)
• Enterprises want an MTTR of ZERO
– Data should always be available with no degradation of perceived SLAs
– Practically hard to obtain, but it's the goal
• Close to zero MTTR is especially important for HBase
– Given it is used in near-realtime systems
• MTTR in other NoSQL systems & databases
HBase Basics
• Strongly consistent
– Writes ordered with reads
– Once written, the data will stay
• Built on top of HDFS
• When a machine fails, the cluster remains available, and so does its data
• We're only talking about the piece of data that was handled by the failed machine
Write path
• WAL – Write Ahead Log
• A write is finished once written on all HDFS nodes
• The client communicates with the region servers
We're in a distributed system
• You can't distinguish a slow server from a dead server
• Everything, or nearly everything, is based on timeouts
• Smaller timeouts mean more false positives
• HBase copes well with false positives, but they always have a cost
• The lower the timeouts, the better
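The timeout tradeoff can be illustrated with a toy failure detector (a minimal sketch; the class, names, and the 30 s value are illustrative, not HBase code):

```python
import math

class FailureDetector:
    """Timeout-based failure detection, as a sketch: a server is
    presumed dead once its last heartbeat is older than the
    configured timeout. A slow-but-alive server (e.g. stuck in a
    long GC pause) is indistinguishable from a dead one, which is
    exactly the false-positive risk of a small timeout."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_heartbeat = {}

    def heartbeat(self, server, now):
        self.last_heartbeat[server] = now

    def presumed_dead(self, server, now):
        last = self.last_heartbeat.get(server, -math.inf)
        return now - last > self.timeout_s

detector = FailureDetector(timeout_s=30)
detector.heartbeat("rs1", now=0)
assert not detector.presumed_dead("rs1", now=20)  # within timeout: alive
assert detector.presumed_dead("rs1", now=40)      # past timeout: presumed dead
```

A smaller `timeout_s` detects real failures faster but turns more GC pauses into false positives, which is the cost mentioned above.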
HBase components for recovery
Recovery in action

Recovery process
• Failure detection: ZooKeeper heartbeats the servers; it expires the session when a server does not reply
• Region assignment: the master reallocates the regions to the other servers
• Failure recovery: read the WAL and rewrite the data
• The client stops the connection to the dead server and goes to the new one
[Diagram: ZooKeeper heartbeats the region servers/DataNodes; the master, RS and ZK drive region assignment and data recovery; the client redirects to the new servers]
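The recovery steps above can be sketched end-to-end (all names are illustrative, not HBase internals): reassign the dead server's regions, then replay its WAL edits on the new hosts.

```python
def recover(dead_server, regions_by_server, live_servers, wal_edits, apply_edit):
    """Sketch of the recovery flow:
    1. Region assignment: move the dead server's regions
       round-robin onto live servers.
    2. Failure recovery: replay the WAL edits of those regions
       on their new hosts."""
    assignment = {}
    for i, region in enumerate(regions_by_server.pop(dead_server, [])):
        target = live_servers[i % len(live_servers)]
        regions_by_server.setdefault(target, []).append(region)
        assignment[region] = target
    for region, edit in wal_edits:
        if region in assignment:
            apply_edit(assignment[region], region, edit)
    return assignment

applied = []
assignment = recover("rs0",
                     {"rs0": ["r1", "r2"]},
                     ["rs1", "rs2"],
                     [("r1", "e1"), ("r2", "e2")],
                     lambda s, r, e: applied.append((s, r, e)))
assert assignment == {"r1": "rs1", "r2": "rs2"}
assert applied == [("rs1", "r1", "e1"), ("rs2", "r2", "e2")]
```

The total MTTR is the sum of the three phases, which is why the rest of the talk attacks each one separately.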
So…
• Detect the failure as fast as possible
• Reassign as fast as possible
• Read / rewrite the WAL as fast as possible
• That's obvious
The obvious – failure detection
• Set the ZooKeeper timeout to 30s instead of the old 180s default
– Beware of GC pauses, but lower values are possible
– ZooKeeper detects errors sooner than the configured timeout
• 0.96: HBase scripts clean the ZK node when the server is kill -9ed
– Detection time becomes 0
– Can be used by any monitoring tool
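The detection timeout above is the ZooKeeper session timeout; a minimal hbase-site.xml sketch (value in milliseconds; verify the key against your HBase version):

```xml
<!-- hbase-site.xml: how long ZooKeeper waits before expiring a
     region server's session, i.e. the failure detection time -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value> <!-- 30s, down from the old 180s default -->
</property>
```

Lower values detect failures faster but must stay comfortably above your worst-case GC pause, or healthy servers will be declared dead.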
The obvious – faster data recovery
• Not so obvious, actually
• Already distributed since 0.92
– The larger the cluster, the better
• Completely rewritten in 0.96
– Will be covered in the second part
The obvious – faster assignment
• Just improving performance
– Parallelism
– Speed
• Globally 'much' faster
• Backported to 0.94
• Still possible to do better for huge numbers of regions
• A few seconds for most cases
With this
• Detection: from 180s to 30s
• Data recovery: around 10s
• Reassignment: from tens of seconds to seconds
Do you think we're better with this?
• The answer is NO
• Actually yes, but only if HDFS is fine
– When you lose a region server, you've just lost a datanode
DataNode crash is expensive!
• One replica of the WAL edits is on the crashed DN
– 33% of the reads during the region server recovery will go to it
• Many writes will go to it as well (the smaller the cluster, the higher the probability)
• The NameNode re-replicates the data (maybe TBs) that was on this node to restore the replica count
– The NameNode starts this work only after a long timeout (10 minutes by default)
HDFS – Stale mode
• Live: as today, used for reads & writes, using locality
• Stale (after 30 seconds, can be less): not used for writes, used as a last resort for reads
• Dead (after 10 minutes, don't change this): as today, not used. And actually, it's better to do the HBase recovery before HDFS replicates the TBs of data of this node
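Stale mode is driven by HDFS configuration; a hedged hdfs-site.xml sketch of the relevant keys (check them against your HDFS version):

```xml
<!-- hdfs-site.xml: mark a DataNode stale after 30s of silence,
     and route reads/writes around stale nodes -->
<property>
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>30000</value> <!-- milliseconds -->
</property>
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value> <!-- stale node only as a last resort for reads -->
</property>
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value> <!-- never write to a stale node -->
</property>
```

The 10-minute dead timeout is left alone on purpose: HBase recovery should finish long before HDFS starts re-replicating terabytes.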
Results
• Do more reads/writes to HDFS during the recovery
• Multiple failures are still possible
– Stale mode will still play its role
– And set dfs.timeout to 30s
– This limits the effect of two failures in a row; the cost of the second failure is 30s if you were unlucky
Are we done?
• We're not bad
• But there is still something
The client
You left it waiting on the dead server
Here it is
The client
• You want the client to be patient
– Retrying while the system is already loaded is not good
• You want the client to learn about region servers dying, and to be able to react immediately
• You want this to scale
Solution
• The master notifies the client
– A cheap multicast message with the "dead servers" list, sent 5 times for safety
– Off by default
– On reception, the client immediately stops waiting on the TCP connection. You can now enjoy a large hbase.rpc.timeout
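In 0.96 this multicast publisher is gated by a configuration flag; a hedged hbase-site.xml sketch (the key name should be verified against your HBase version):

```xml
<!-- hbase-site.xml: let the master publish the dead-servers list
     over multicast so clients can abandon dead connections at once -->
<property>
  <name>hbase.status.published</name>
  <value>true</value> <!-- off by default -->
</property>
```

With the notification in place, hbase.rpc.timeout no longer has to double as a failure detector, so it can be set generously.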
Full workflow
• t0: RegionServer serving reads and writes; client reads and writes
• t1: RegionServer crashes
• t2: affected regions reassigned; client can write again
• t3: data recovered
• t4: client reads and writes
Are we done?
• In a way, yes
– There are a lot of things around asynchronous writes and reads during recovery
– That will be for another time, but there will be some nice things in 0.96
• And a couple of them are presented in the second part of this talk!
Faster recovery
• Previous algo
– Read the WAL files
– Write new HFiles
– Tell the region server it got new HFiles
• Puts pressure on the namenode
– Remember: don't put pressure on the namenode
• New algo
– Read the WAL
– Write directly to the region server
– We're done (we have seen great improvements in our tests)
– TBD: assign the WAL to a RegionServer local to a replica
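The difference between the two algorithms can be sketched as a toy model (illustrative, not HBase internals): log splitting writes one intermediate recovered-edits file per region to HDFS, while log replay streams the edits straight to the newly assigned region servers.

```python
from collections import defaultdict

def split_log(wal_entries):
    """Old algorithm (distributed log split): group WAL edits into
    one intermediate recovered-edits file per region. Every file
    created is extra NameNode metadata work."""
    files = defaultdict(list)
    for region, edit in wal_entries:
        files[f"recovered-edits-for-{region}"].append(edit)
    return dict(files)

def replay_log(wal_entries, region_to_server, send):
    """New algorithm (distributed log replay): stream each edit
    directly to the region server now hosting the region; no
    intermediate files touch HDFS."""
    for region, edit in wal_entries:
        send(region_to_server[region], region, edit)

wal = [("r1", "e1"), ("r2", "e2"), ("r1", "e3")]
files = split_log(wal)
assert files == {"recovered-edits-for-r1": ["e1", "e3"],
                 "recovered-edits-for-r2": ["e2"]}

sent = []
replay_log(wal, {"r1": "rs1", "r2": "rs2"},
           lambda s, r, e: sent.append((s, r, e)))
assert len(sent) == 3  # all edits delivered, zero files created
```

The win is that the per-region file creations (and the NameNode RPCs behind them) disappear from the recovery critical path.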
[Diagram: Distributed log split – RegionServer0/x/y read WAL files (each mixing edits for region1/2/3) from HDFS and write per-region splitlog files back to HDFS, which RegionServer1/2/3 then read while serving client reads and writes]
[Diagram: Distributed log replay – RegionServer0/x/y read the same WAL files from HDFS and replay the edits directly to RegionServer1/2/3, which keep serving client reads and writes; no intermediate splitlog files are written]
Write during recovery
• Hey, you can write during the WAL replay
• Event streams: your new recovery time is the failure detection time: max 30s, likely less!
MemStore flush
• Real life: some tables are updated at a given moment, then left alone
– With a non-empty memstore
– More data to recover
• It's now possible to guarantee that we don't have a MemStore with old data
• Improves real-life MTTR
• Helps snapshots
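The idea can be sketched as a periodic check that flushes any memstore whose oldest unflushed edit is older than a threshold (function names and the threshold are illustrative):

```python
def stale_memstores(memstores, now, max_age_s):
    """Return the memstores that should be flushed because their
    oldest unflushed edit exceeds max_age_s. Flushing them bounds
    how much WAL has to be replayed if the server dies, which is
    what improves real-life MTTR for rarely-updated tables."""
    return [name for name, oldest_edit_ts in memstores.items()
            if oldest_edit_ts is not None and now - oldest_edit_ts > max_age_s]

# memstore -> timestamp of its oldest unflushed edit (None = empty)
memstores = {"t1/r1": 100, "t2/r1": 3500, "t3/r1": None}
assert stale_memstores(memstores, now=3700, max_age_s=3600) == []
assert stale_memstores(memstores, now=3800, max_age_s=3600) == ["t1/r1"]
```

A table written once and then left alone no longer keeps its edits pinned in memory (and in the WAL) indefinitely.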
.META.
• There is no -ROOT- in 0.95/0.96
• But .META. failures are critical
• A lot of small improvements
– The server now tells the client when a region has moved (the client can avoid going to .META.)
• And a big one
– The .META. WAL is managed separately to allow an immediate recovery of .META.
– Together with the new MemStore flush, this ensures a quick recovery
Data locality post recovery
• HBase performance depends on data locality
• After a recovery, you've lost it
– Bad for performance
• Here come region groups
• Assign 3 favored RegionServers for every region
– On failure, assign the region to one of the secondaries
• The data-locality issue is minimized on failures
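A minimal sketch of the favored-nodes idea (illustrative, not the HBase balancer): each region keeps an ordered list of three favored servers, and on failure it moves to the first surviving one, where a replica of its blocks already lives.

```python
def reassign(favored, region, live_servers):
    """Pick the first surviving favored server for the region.
    Because HDFS placed a replica of the region's blocks on each
    favored server, the new host still reads locally; only if all
    three are gone is locality lost."""
    for server in favored[region]:
        if server in live_servers:
            return server
    return None  # fall back to any server (locality lost)

favored = {"r1": ["rs1", "rs2", "rs3"]}
assert reassign(favored, "r1", {"rs2", "rs3"}) == "rs2"  # rs1 died
assert reassign(favored, "r1", {"rs3"}) == "rs3"         # rs1 and rs2 died
assert reassign(favored, "r1", set()) is None
```

The ordered preference list is what makes the failover deterministic: the secondary takes over, not an arbitrary server with zero local blocks.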
[Diagram: Without favored nodes – RegionServer1 serves three regions, and their StoreFile blocks are scattered across the cluster with one replica local to RegionServer1. When RegionServer1 fails, RegionServer4 takes over and reads Blk1, Blk2, and Blk3 remotely.]
[Diagram: With favored nodes – RegionServer1 serves three regions, and their StoreFile blocks are placed on specific machines on the other racks. When RegionServer1 fails, RegionServer4 (a favored secondary) takes over with no remote reads.]
Conclusion
• The target was "from often 10 minutes to always less than 1 minute"
– We're almost there
• Most of it is available in 0.96; some parts were backported
• Real-life testing of the improvements is in progress
• Room for more improvements
Q & A
Thanks!