The Last Mile: Why Hadoop Management Is Critical to Success

#TDPARTNERS16, Sept 11, 2016, Georgia World Congress Center
Ron Bodkin and Scott Fleming, Think Big, a Teradata company




The Last Mile

• The open source ecosystem for analytics is complicated
• It’s easy to get started
• Maintaining an optimal, performant environment is not
• Success depends on careful planning and management


Data Lake Design Principles

• Automated and reliable data ingest
• Capture and manage relevant metadata
• Preserve original source data where possible
• Provide cleansing, aggregation, and integration matched to each use
• Balance governance and agility
• Implement security at the right time
• Easily search, access, and consume data
• Make the data ready for analysis


New Data Sources

• It all starts here
• Capture the rawest form
• Determine how it will be used and who will be using it
• Cleanse it, validate it, and profile it
• Make it discoverable (and useful)
• Bottom line: Be consistent and consider tools
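As an illustration of the "profile it" step, here is a minimal Python sketch that counts missing and distinct values per column for a batch of ingested records. The record layout and the `profile` helper are hypothetical, not part of any tool named in this talk; a real pipeline would more likely rely on a dedicated profiling tool.

```python
def profile(rows):
    """Profile a list of dict records.

    Returns {column: {"nulls": int, "distinct": int}}, treating None and
    the empty string as missing values (an assumption for this sketch).
    """
    columns = {key for row in rows for key in row}
    report = {}
    for col in sorted(columns):
        values = [row.get(col) for row in rows]
        present = [v for v in values if v not in (None, "")]
        report[col] = {
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


# Hypothetical sample batch
rows = [
    {"id": 1, "city": "Atlanta"},
    {"id": 2, "city": ""},
    {"id": 3, "city": "Atlanta"},
]
for column, stats in profile(rows).items():
    print(column, stats)
```

Storing a report like this alongside the raw data is one way to make a new source discoverable as well as usable.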


Typical Data Ingestion


Governance

• Clear distinction of roles and responsibilities for curating data
• Common vocabulary for data sets / types
• Implement required security – not too much, not too little
• Ongoing data quality policies
• Data retention / archival policies


Security Challenges

• Residual files following failed jobs
• Compatibility of security tools with major Hadoop distributors
• Multiple types of discoverable data in the environment
• BI and analytics user access
• Lack of mature security tools
• Uncontrolled replication of data
• User authentication and authorization is complex

Without comprehensive security measures, your valuable data could be easily compromised and you may be subject to a security breach.


Security Layers


Ingestion Jobs and Monitoring

• Baseline job performance and resource requirements
• Ensure error handling is robust
• Build alerting into the processes that submit jobs
• Develop and monitor SLAs for job performance. Look for drift.
• Leverage tools where possible
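One simple way to "look for drift" is to compare recent job runtimes against the recorded baseline. The Python sketch below assumes runtimes are already collected as lists of seconds; the `detect_drift` helper, the 20% tolerance, and the sample numbers are all hypothetical.

```python
from statistics import mean


def detect_drift(baseline_secs, recent_secs, tolerance=0.20):
    """Compare the mean recent runtime to the mean baseline runtime.

    Returns (drifted, ratio), where drifted is True when recent runs are
    more than `tolerance` slower than the baseline on average.
    """
    if not baseline_secs or not recent_secs:
        raise ValueError("both runtime lists must be non-empty")
    ratio = mean(recent_secs) / mean(baseline_secs)
    return ratio > 1.0 + tolerance, ratio


# Hypothetical runtimes (seconds) for one ingestion job
baseline = [600, 620, 590, 610]
recent = [700, 760, 780]
drifted, ratio = detect_drift(baseline, recent)
print(f"drifted={drifted} ratio={ratio:.2f}")
```

A check like this can run inside the process that submits the job, so the alert fires before the SLA is actually missed.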


Resource Contention

• SLAs and sandboxes – often in the same environment
• Leverage the capacity scheduler and hierarchical queues
• Don’t be afraid to get granular
• Use YARN containers – be prudent about the resources requested
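When queues get granular, it is easy to misallocate percentages. In the capacity scheduler's percentage mode, the capacities of sibling queues at each level are expected to sum to 100. The Python sketch below checks that rule on a nested dictionary; the queue names, the dictionary layout, and the `capacity_problems` helper are hypothetical and do not read a real capacity-scheduler.xml.

```python
def capacity_problems(children, parent="root"):
    """Recursively check that sibling queue capacities sum to 100.

    `children` maps queue name -> {"capacity": percent, "children": {...}}.
    Returns a list of human-readable problem descriptions.
    """
    problems = []
    total = sum(q["capacity"] for q in children.values())
    if abs(total - 100) > 1e-9:
        problems.append(
            f"{parent}: child capacities sum to {total}, expected 100"
        )
    for name, q in children.items():
        if q.get("children"):
            problems.extend(
                capacity_problems(q["children"], f"{parent}.{name}")
            )
    return problems


# Hypothetical layout: SLA-bound production work next to a sandbox
queues = {
    "prod": {
        "capacity": 70,
        "children": {
            "etl": {"capacity": 60},
            "reports": {"capacity": 40},
        },
    },
    "sandbox": {"capacity": 30},
}
print(capacity_problems(queues) or "queue capacities look consistent")
```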


Capacity Planning

• Capacity planning is an ongoing effort, not one and done
• Includes storage, compute, network, memory and real estate
• Review resource and storage utilization at least monthly
• Implement retention and archiving processes where appropriate
• Be thoughtful and plan when expanding
• Just adding nodes can have unexpected results
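A monthly utilization review can include a rough storage runway estimate. The Python sketch below assumes linear growth and the HDFS default replication factor of 3; the `months_until_full` helper, the 80% utilization ceiling, and the sample figures are hypothetical.

```python
import math


def months_until_full(raw_capacity_tb, used_raw_tb, monthly_ingest_tb,
                      replication=3, max_utilization=0.8):
    """Estimate whole months until raw usage hits the utilization ceiling.

    monthly_ingest_tb is logical (pre-replication) data; each logical TB
    consumes `replication` TB of raw disk.
    """
    if monthly_ingest_tb <= 0:
        raise ValueError("monthly_ingest_tb must be positive")
    headroom = raw_capacity_tb * max_utilization - used_raw_tb
    if headroom <= 0:
        return 0
    return math.floor(headroom / (monthly_ingest_tb * replication))


# Hypothetical cluster: 1000 TB raw, 500 TB used, 10 TB/month of new data
print(months_until_full(1000, 500, 10), "months of headroom")
```

The estimate covers storage only; compute, network, and memory need their own review, as the slide notes.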


Hive Operations

• Bring your own data – User Education
• Sub-optimal storage formats
• Table proliferation
• Over partitioning
• ODBC / JDBC Connectivity
• Canary processes for Hive Server 2
• Impala – compute stats


General Hadoop Operations

• Develop a RACI for operations
• ITIL Processes – minimally Release Management and Change Management
• Stay aligned with the distro versions
• Use configuration management tools like Puppet and Ansible
• Staff appropriately


Hadoop Operations Top 10

1. Continuous Capacity Planning
2. Isolate the LAN
3. Implement proactive monitoring and alerting
4. Establish data balancer schedule and use
5. Periodic review of Hive tables, schemas and data storage
6. Monitor for small files
7. End user education
8. Periodic review of the capacity scheduler and resource management
9. Monitor SLAs for drift
10. Runbook, Runbook, Runbook
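Item 6, monitoring for small files, can be sketched in Python as a check over a file listing. The example assumes the listing is already available as (path, size in bytes) pairs and uses the 128 MB default HDFS block size; the `small_file_report` helper and the 10% threshold are hypothetical.

```python
def small_file_report(files, block_size=128 * 1024 * 1024, threshold=0.1):
    """Count files smaller than `threshold` times the block size.

    `files` is an iterable of (path, size_bytes) pairs.
    Returns (small_count, small_fraction).
    """
    sizes = [size for _, size in files]
    if not sizes:
        return 0, 0.0
    limit = block_size * threshold
    small = sum(1 for size in sizes if size < limit)
    return small, small / len(sizes)


# Hypothetical listing
MB = 1024 * 1024
listing = [
    ("/data/raw/a", 1024),
    ("/data/raw/b", 200 * MB),
    ("/data/raw/c", 5 * MB),
    ("/data/raw/d", 64 * MB),
]
count, fraction = small_file_report(listing)
print(f"{count} small files ({fraction:.0%} of listing)")
```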


Monitoring

• Ambari / Cloudera Manager – basic blocking and tackling
• Nagios – where there are gaps
• PCNG – for application monitoring
• Dr. Elephant – for application heuristics


Engineering and Operations

• Weekly reviews for alignment and planning
• Include operations in engineering design
• New technology preparation, planning and training
• Continuous updates to the runbooks
• DevOps and Agile – rules of the road to be able to fail fast while maintaining a stable environment


Monitoring Adoption

• Knowing who is doing what in the environment is essential to maintenance and planning.

• Determine who the power users are and make them champions

• Helps to understand resource planning and allocation
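Finding power users can start with a simple count of jobs per user. The Python sketch below assumes job records have already been exported as dictionaries with a `user` field; the record layout, the `top_users` helper, and the user names are hypothetical.

```python
from collections import Counter


def top_users(job_records, n=3):
    """Return the n most active users as (user, job_count) pairs."""
    return Counter(record["user"] for record in job_records).most_common(n)


# Hypothetical export of recent job submissions
jobs = [
    {"user": "alice", "queue": "prod"},
    {"user": "bob", "queue": "sandbox"},
    {"user": "alice", "queue": "prod"},
    {"user": "carol", "queue": "sandbox"},
    {"user": "alice", "queue": "sandbox"},
    {"user": "bob", "queue": "prod"},
]
for user, count in top_users(jobs, n=2):
    print(user, count)
```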


Summary

• Getting started is easy
• Getting started to ensure long term success takes some planning
• There is a lot to stay on top of to ensure successful operations
• The platform components and tools vary in every environment
• Capable operations people are hard to find
• Proactive management and monitoring are key to happy users


Thank You

Questions/Comments
Email: [email protected]

Follow Me
Twitter: @Ronbodkin and @scottbfleming

Rate This Session # with the PARTNERS Mobile App

Remember To Share Your Virtual Passes

653
