Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2


Transcript of Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

Page 1: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

Hadoop Admin Best Practices with HDP 2.3

Part-2

Page 2: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

We offer INSTRUCTOR LED training - both Online LIVE & Classroom Sessions

Classroom sessions are available in Bangalore & Delhi (NCR)

We are the ONLY Education delivery partners for Mulesoft, Elastic, Pivotal & Lightbend in India

We have delivered more than 5000 trainings, offer over 400 courses, and have a vast pool of over 200 experts to make YOU the EXPERT!

FOLLOW US ON SOCIAL MEDIA TO STAY UPDATED ON THE UPCOMING WEBINARS

Page 3: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

Online and Classroom Training on Technology Courses at SpringPeople

Certified Partners

Non-Certified Courses

…and many more

…NOW

Page 4: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

The Hadoop Ecosystem

Hadoop

Page 5: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

The HDP 2.3 Platform Versions

Page 6: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

Covered Till Now

1. Use Ambari – Cluster Management Tool

2. More of WebHDFS Access

3. WebHDFS

4. Use More of HDFS Access Control Lists

5. Use HDFS Quotas

6. Understanding of YARN Components

7. Adding, Deleting, or Replacing Worker Nodes

8. Rack Awareness

9. NameNode High Availability

10. ResourceManager High Availability

11. Ambari Metrics System

12. What to Backup?

Page 7: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

13 - Setting an Appropriate Directory Space Quota

• Best practice is to also set space limits on home directories
• To set a 12 TB limit: $ hdfs dfsadmin -setSpaceQuota 12t /user/username

• Includes space for replication - this is the actual use of space
• Example: if storing 1 TB and the replication factor is 3, 3 TB of quota is needed

• Quota can be set on any directory (see the command sketch below)
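
A minimal command-line sketch of working with space quotas, assuming the 12 TB limit and /user/username path from the example above (other values are illustrative):

  # Set a 12 TB space quota on a user's home directory (raw space, including replication)
  $ hdfs dfsadmin -setSpaceQuota 12t /user/username

  # Check the quota and current usage - prints quota, remaining quota, and space consumed
  $ hdfs dfs -count -q /user/username

  # Remove the space quota again if needed
  $ hdfs dfsadmin -clrSpaceQuota /user/username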

Page 8: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

14 - Configuring Trash

• Enable Trash by setting the time delay for trash checkpoint removal in core-site.xml: fs.trash.interval

• The delay is set in minutes (24 hours would be 1440 minutes)
• Recommendation is to set it to 360 minutes (6 hours)
• Setting the value to 0 disables Trash

• Files deleted programmatically are deleted immediately
• Files can be immediately deleted from the command line using -skipTrash (see the sketch below)
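
A minimal sketch of the Trash setup and the related commands, assuming the recommended 360-minute interval; the file path is hypothetical:

  # In core-site.xml (via Ambari or by editing the file directly):
  #   <property>
  #     <name>fs.trash.interval</name>
  #     <value>360</value>
  #   </property>

  # Deleting from the command line moves the file into the user's .Trash directory
  $ hdfs dfs -rm /user/username/old_report.csv

  # Bypass Trash and delete immediately
  $ hdfs dfs -rm -skipTrash /user/username/old_report.csv

  # Force removal of expired trash checkpoints
  $ hdfs dfs -expunge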

Page 9: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

15 - Compression Needs and Tradeoffs

• Compressing data can speed up data-intensive I/O operations
  - MapReduce jobs are almost always I/O bound

• Compressed data can save storage space and speed up data transfers across the network
  - Capital allocation for hardware can go further

• Reduced I/O and network load can result in significant performance improvements
  - MapReduce jobs can finish faster overall

• But CPU utilization and processing time increase during compression and decompression
  - Understanding the tradeoffs is important for the MapReduce pipeline's overall performance (see the job-level example below)
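
As one way to apply these tradeoffs, map output compression can be enabled per job. This is a hedged sketch using the standard MapReduce properties and the bundled examples jar; the jar path and input/output directories are assumptions:

  # Compress intermediate map output with Snappy for a single job
  $ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount \
      -D mapreduce.map.output.compress=true \
      -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
      /data/input /data/output

  # The same properties can be set cluster-wide in mapred-site.xml; final job output is
  # controlled by mapreduce.output.fileoutputformat.compress and its codec property.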

Page 10: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

16 - Sqoop Security

• Database Authentication: Sqoop needs to authenticate to the RDBMS

• How? Usually this involves a username/password (Oracle Wallet is the exception)
• Passwords can be hard-coded in scripts (not recommended)
• The password is usually stored in plaintext in a file protected by the filesystem

• The Hadoop Credential Management Framework was added in HDP 2.2
• Not a keystore, but a way to interface with keystore backends
• Passwords can be stored in a keystore and not in plain text
• Can help with “no passwords in plaintext” requirements (see the sketch below)
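
A hedged sketch of the credential provider workflow; the alias name, JCEKS path, JDBC URL, and table are assumptions, and --password-alias requires a Sqoop release that supports the Credential Management Framework:

  # Store the database password in a JCEKS keystore on HDFS (prompts for the password)
  $ hadoop credential create mydb.password.alias \
      -provider jceks://hdfs/user/sqoop/mydb.password.jceks

  # Reference the alias instead of putting a plaintext password in the script
  $ sqoop import \
      -Dhadoop.security.credential.provider.path=jceks://hdfs/user/sqoop/mydb.password.jceks \
      --connect jdbc:mysql://dbhost/sales \
      --username sqoop_user \
      --password-alias mydb.password.alias \
      --table orders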

Page 11: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

17 - distcp Configurations

• If distcp runs out of memory before copying:
  - Possible cause: the number of files/directories being copied from the source path(s) is extremely large (e.g. 100,000 paths)
  - Change the HEAP size: export HADOOP_CLIENT_OPTS="-Xms64m -Xmx1024m"

• Map sizing:
  - If -m is not specified, distcp defaults to a maximum of 20 maps
  - Tune the number of maps according to the size of the source and destination clusters, the size of the copy, and the available bandwidth (see the sketch below)
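
A minimal sketch combining the heap and map-count tuning above; the cluster paths and the map count of 40 are illustrative:

  # Give the distcp client a larger heap when the file listing is very large
  $ export HADOOP_CLIENT_OPTS="-Xms64m -Xmx1024m"

  # Override the default map count (-m) based on cluster size, copy size, and bandwidth
  $ hadoop distcp -m 40 -update \
      hdfs://source-nn:8020/data/logs \
      hdfs://target-nn:8020/data/logs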

Page 12: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

18 - Falcon: Centrally Manages the Data Lifecycle

• Centralized definition & management of pipelines for data ingest, process and export

• Supports business continuity and disaster recovery
  - Out-of-the-box policies for data replication and retention
  - End-to-end monitoring of data pipelines

• Addresses basic audit & compliance requirements
  - Visualize data pipeline lineage
  - Track data pipeline audit logs
  - Tag data with business metadata

Page 13: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

19 - Running Balancer

• Can be run periodically as a batch job
• Examples: every 24 hours or weekly

• Run after new nodes have been added to the cluster
• To run balancer: hdfs balancer [-threshold <threshold>] [-policy <policy>]
• Runs until there are no blocks to move, or until it has lost contact with the NameNode

• Can be stopped with Ctrl+C (see the example run below)
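
A minimal sketch of a periodic balancer run; the 5% threshold and bandwidth value are illustrative:

  # Rebalance until every DataNode is within 5% of the cluster's average utilization
  $ hdfs balancer -threshold 5

  # Optionally cap the bandwidth each DataNode may use for balancing (bytes per second)
  $ hdfs dfsadmin -setBalancerBandwidth 10485760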

Page 14: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

20 - HDFS Snapshots

• Create HDFS directory snapshots
• Fast operation - only metadata is affected
• Results in a .snapshot/ directory inside the HDFS directory
• Snapshots are named or default to a timestamp
• Directories must be made snapshottable
• Snapshot steps:

– Allow snapshots on the directory: hdfs dfsadmin -allowSnapshot foo/bar/

– Create a snapshot for the directory and optionally provide a snapshot name: hdfs dfs -createSnapshot foo/bar/ mysnapshot_today

– Verify the snapshot: hdfs dfs -ls foo/bar/.snapshot
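
Extending the steps above into a full cycle, a hedged sketch that also recovers a file from a snapshot and cleans up; foo/bar/ follows the slide, the file name is an assumption:

  # Allow and create a snapshot (as above)
  $ hdfs dfsadmin -allowSnapshot foo/bar/
  $ hdfs dfs -createSnapshot foo/bar/ mysnapshot_today

  # Recover an accidentally deleted file by copying it back out of the snapshot
  $ hdfs dfs -cp foo/bar/.snapshot/mysnapshot_today/data.csv foo/bar/

  # Remove the snapshot once it is no longer needed
  $ hdfs dfs -deleteSnapshot foo/bar/ mysnapshot_today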

Page 15: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

21 - HDFS Data – Automate & Restore

• Use Falcon/Oozie to automate backups
• Falcon utilizes Oozie as a workflow scheduler
• distcp is an Oozie action - use -update and -prbugp
• Restoring is the reverse process of backups:

1. On your backup cluster, choose which snapshot to restore
2. Remove/move the target directory on the production system
3. Run distcp without the -update option (see the sketch below)
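
A hedged sketch of the restore steps; the NameNode host names, snapshot name, and paths are assumptions:

  # 1. On the backup cluster, identify the snapshot to restore
  $ hdfs dfs -ls /backups/data/.snapshot/

  # 2. On the production cluster, move the damaged target directory out of the way
  $ hdfs dfs -mv /data /data.corrupt

  # 3. Copy the snapshot contents back, preserving attributes, without -update
  $ hadoop distcp -prbugp \
      hdfs://backup-nn:8020/backups/data/.snapshot/snap_20150901 \
      hdfs://prod-nn:8020/data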

Page 16: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

22 - Apache Ranger

Page 17: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2
Page 18: Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

     

[email protected]

Upcoming Hortonworks Classes at SpringPeople

Classroom (Bengaluru)

05 - 08 Sept
26 - 28 Sept
10 - 13 Oct
07 - 10 Nov
05 - 08 Dec
19 - 21 Dec

Online LIVE

22 - 31 Aug
05 - 17 Sept

19 Sept - 01 Oct