CDP DATA CENTER 7.1
Laurent Edel: Solution Engineer
Jacques Marchand: Solution Engineer
Mael Ropars: Principal Solution Engineer
June 30, 2020
© 2019 Cloudera, Inc. All rights reserved.
AGENDA
• CDP DATA CENTER OVERVIEW
• DETAILS ABOUT MAJOR COMPONENTS
• PATH TO CDP DC & SMART MIGRATION
• Q/A
CLOUDERA DATA PLATFORM
TARGET ARCHITECTURE: ENTERPRISE DATA CLOUD
CDP Public Cloud (platform-as-a-service)
CDP On-Prem (installable software)
CDP DATA CENTER OVERVIEW
Enterprise analytics and data management platform, built for hybrid cloud, optimized for bare metal, and ready for private cloud.
New CDP Data Center features include:
• High-performance SQL analytics
• Real-time stream processing, analytics, and management
• Fine-grained security, enterprise metadata, and scalable data lineage
• Support for object storage (tech preview)
• Single pane of glass for management, with multi-cluster support
CDP Data Center (installable software)
• Cloudera Runtime
• Cloudera Manager
A NEW OPEN SOURCE DISTRIBUTION FOR BETTER CAPABILITY
Cloudera Runtime - created from the best of CDH and HDP
Deprecate competitive technologies
Merge overlapping technologies
Keep complementary technologies
Upgrade shared technologies
COMPONENT LIST
CDP Data Center 7.1 (May 2020)
• Cloudera Manager 7.1
• Hadoop 3.1
• Spark 2.4 / Spark 3 (b2)
• Hive 3.1
• Impala 3.4
• Oozie 5.1
• Hue 4.5
• Ranger 2.0
• Atlas 2.0
• HBase 2.2
• Phoenix 5.0
• Kudu 1.12
• Sqoop 1.4.7
• Parquet 1.10
• Avro 1.8
• ORC 1.5
• ZooKeeper 3.5
• Solr 8.4
• Key HSM 7.1
• Knox 1.3
• Livy 0.7
• Navigator Encrypt 7.1
• Ranger KMS 7.1
• Zeppelin
• Hive Warehouse Connector 1.0
• Kafka 2.4
• Kafka Schema Registry 0.8
• Streams Messaging Manager 1.0
• Streams Replication Manager 2.1
• Ozone (Beta) 0.6
• Kafka Connect 2.4
• Cruise Control 2.0
• Tez 0.9
• Key Trustee Server 7
• RHEL/CentOS/OEL 7.7
• Postgres 10
• JDK 8
• JDK 11 Runtime
• MySQL 5.7
• Oracle DB 12 (Fresh Install Only)
• PostgreSQL 10
• MariaDB 10.2
• Upgrades from CDP DC 7.0
• Upgrades from CDH 5.13-5.16
• Upgrades from HDP 2.6.5
CDP DATA CENTER OVERVIEW
[What is the scope of CDP Data Center]
• Collect
• Enrich: Spark, Hive
• Report: Impala, Hive, Kudu
• Serve: HBase, Phoenix, Solr
• Predict: Spark, Zeppelin
SECURITY | GOVERNANCE | LINEAGE | MANAGEMENT | AUTOMATION
KAFKA COMPUTE CLUSTERS WITH CLOUDERA MANAGER
Kafka Clusters using Shared Security & Governance Data Lake with Atlas and Ranger
• Kafka 2.4
• Ranger & Atlas Integration
• Support of Kafka Connect, Kafka Streams
• Cruise Control for load balancing
• Create multiple Kafka compute clusters using shared Security Data Lake with Ranger & Atlas
KAFKA MANAGEMENT SERVICES
Kafka Services for Schema Management, Replication and Monitoring
• Schema Registry: new Kafka schema governance
• Streams Replication Manager (SRM): new Kafka replication engine powered by MirrorMaker 2
• Streams Messaging Manager (SMM): new Kafka monitoring service
SPARK
• Spark 2.4
• Integration with Ranger for Fine Grained Authorizations
• Coming soon: Spark 3!
– Better performance
– Enhanced support for Deep Learning
– New modules
– MLlib (RDD-based API) superseded by Spark ML (DataFrame-based API)
– Tech Preview available
SQL USER EXPERIENCE: HUE
DATA WAREHOUSE
Hive 3
Apache Hive 3
• Comprehensive ANSI SQL 2016 coverage
• GDPR: new ACID v2 as fast as regular tables, transactions, UPDATE/DELETE/MERGE
• Cloud-ready: optimized for S3/WASB/GCP
• Support for JDBC/Kafka/Druid out of the box
• EDW offload:
– "DBA" tooling: surrogate keys, materialized views, constraints
– information schema
• Performance:
– workload management
– query result cache
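The row-level operations mentioned above (DELETE and MERGE on transactional tables) could be issued from an application over JDBC roughly as follows. This is a minimal sketch, not an official example: the class name, the tables `customers` and `customer_updates`, and their columns are hypothetical, and the tables are assumed to be transactional (ACID) Hive tables.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

class HiveAcidSketch {
    // Hypothetical tables: customers (id, email) and customer_updates (id, email).
    static final String ERASE_SQL = "DELETE FROM customers WHERE id = ?";

    static final String MERGE_SQL =
        "MERGE INTO customers AS t USING customer_updates AS s ON t.id = s.id "
        + "WHEN MATCHED THEN UPDATE SET email = s.email "
        + "WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.email)";

    // Row-level delete, for example to serve a GDPR erasure request.
    static void eraseCustomer(Connection conn, long customerId) throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(ERASE_SQL)) {
            stmt.setLong(1, customerId);
            stmt.executeUpdate();
        }
    }

    // Applies a batch of changes from a staging table in a single statement.
    static void mergeUpdates(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.execute(MERGE_SQL);
        }
    }
}
```

The `Connection` is assumed to come from the Hive JDBC driver (a `jdbc:hive2://` URL pointing at HiveServer2), with the driver on the classpath.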
DATA WAREHOUSE
Impala and Kudu
Apache Impala
• Leading MPP SQL engine for DW, optimized for Parquet/Kudu
• Ideal for data mart implementations that require interactive/ad-hoc BI
• 1000+ enterprise customers, many running on 10s of PBs and 100s of nodes
• Certified with leading BI tools with broad SQL coverage
• Latest release adds improvements for resiliency, concurrency, and metadata
Apache Kudu
• Leading columnar storage engine for fast analytics on fast data
• Ideal for low-latency time series data ingest and analytics (with Impala SQL engine)
• Combines fast single-row ingest, like HBase, with large scans, like HDFS
• ACID (insert/update/delete) semantics with single rows
WORKLOAD MANAGER
Global view on analytic processing
Deep Dive Query analysis
HBASE + PHOENIX
HBASE: Flexible, scale-out, NoSQL database
• Maximally flexible & customizable
• SQL only for data remediation
• All advanced functionality available
• New async client
• JDK8/G1GC
• Off-heap read path
• API clean-up, HBCK2
    // HBase client API: write one cell
    Put put = new Put(Bytes.toBytes(rowKey));
    put.addColumn(COLUMN_FAMILY_NAME, COLUMN_NAME, Bytes.toBytes(GREETINGS));
    table.put(put);
PHOENIX: RDBMS-like, scale-out database
• Programmatic ANSI SQL support
• RDBMS-like data architecture
• Auto-applies performance best practices
• Can co-exist with HBase apps
    // Phoenix: the same write expressed as SQL over JDBC
    stmt.executeUpdate("UPSERT INTO TABLE_NAME VALUES ('rowKey', 'GREETINGS')");
    conn.commit(); // conn is assumed to be the connection that created stmt; Phoenix does not auto-commit by default
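A slightly fuller version of the Phoenix write path, using a parameterized statement, might look like the sketch below. It relies only on the standard `java.sql` API. The class name, the table `GREETINGS_TABLE`, its columns, and the default JDBC URL are assumptions for illustration, and the Phoenix JDBC driver is assumed to be on the classpath at run time.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class PhoenixUpsertSketch {
    // Hypothetical table and column names, used only for illustration.
    static final String UPSERT_SQL =
        "UPSERT INTO GREETINGS_TABLE (ROW_KEY, GREETING) VALUES (?, ?)";

    // Writes one row through the given JDBC connection and commits it.
    static int upsertGreeting(Connection conn, String rowKey, String greeting)
            throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(UPSERT_SQL)) {
            stmt.setString(1, rowKey);
            stmt.setString(2, greeting);
            int rows = stmt.executeUpdate();
            conn.commit(); // Phoenix connections do not auto-commit by default
            return rows;
        }
    }

    public static void main(String[] args) throws SQLException {
        // Assumed URL shape: "jdbc:phoenix:" followed by the ZooKeeper quorum.
        String url = args.length > 0 ? args[0] : "jdbc:phoenix:localhost:2181";
        try (Connection conn = DriverManager.getConnection(url)) {
            System.out.println(upsertGreeting(conn, "row1", "Hello") + " row(s) upserted");
        }
    }
}
```

Binding values with `?` placeholders avoids building SQL strings by hand and keeps the row key and the value typed, which is the closest JDBC equivalent of the `Put` shown for the HBase client.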
CDP SEARCH
Scalable and Robust Index Storage with Solr 8.4
● Scalable, cost-efficient index storage
● High availability, integrated security with Atlas/Ranger
● Shared data store with other processing tools (Spark, Impala, ...)
● Search AND process data in one platform
[Architecture diagram: Querying API and Indexing API; Solr (extraction, mapping, Lucene indexing engine); SolrCloud (distributed processing coordinator); shared data storage]
DATA SCIENCE AND ENGINEERING TOOLS
APACHE ZEPPELIN
CLOUDERA DATA SCIENCE WORKBENCH
SIMPLIFIED MANAGEMENT
Cloudera Manager
• Management of multiple clusters
• Knox, Ranger, Atlas, Hive-on-Tez, DAS
• Cluster-level configuration history
• Improved global search
• Resume after errors while enabling Kerberos
• Scalability improvements
• Improved alerts configuration
• Upgrade support
• Support for Private Cloud (Beta)
CONSISTENT SECURITY AND GOVERNANCE
Built for multi-functional analytics anywhere
• Data Catalog: a comprehensive catalog of all data sets, spanning on-premises, cloud object stores, structured, unstructured, and semi-structured
• Schema: automatic capture and storage of any and all schema and metadata definitions as they are used and created by platform workloads
• Replication: deliver data as well as data policies wherever the enterprise needs to work, with complete consistency and security
• Security: role-based access control applied consistently across the platform. Includes full stack encryption and key management
• Governance: enterprise-grade auditing, lineage, and governance capabilities applied across the platform with rich extensibility for partner integrations
SECURITY AND GOVERNANCE
• Identity & Perimeter: validate users in enterprise directory
– Technical concepts: authentication, user/group mapping
– Kerberos, Apache Knox
• Data Protection: shielding data in the cluster from unauthorized visibility
– Technical concepts: encryption, key management
– SSL/TLS, HDFS TDE, Apache Ranger (KMS, masking, filtering)
• Access: defining what users and applications can do with data
– Technical concepts: permissions, authorization
– Apache Ranger
• Visibility: reporting on where data came from and how it's being used
– Technical concepts: auditing, lineage
– Apache Atlas
VISIBILITY SECURITY AND AUDITING
Apache Atlas
• Lineage
– What data do I consume?
– What consumes my data?
• Who uses my data?
– Audit who accessed what
– Track access events from Apache Ranger
– Metadata audit and versioning from Apache Atlas
ACCESS CONTROL
Apache Ranger
• Maintain one set of data, control access centrally with fine grained policies down to the column and the row level.
• Anonymize PII with Dynamic column masking
• Customize views for users with Dynamic row filtering
• Manage user access with Role-based Access Control
• Unify policies across many data sets with Attribute-based Access Control
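The effect of dynamic row filtering and column masking on a query result can be pictured with a small plain-Java model. This is a conceptual sketch only and does not use Ranger's API: in Ranger, such policies are defined centrally and enforced by the query engines. The class `AccessPolicySketch`, the helper `showLastFour`, and the column names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

class AccessPolicySketch {
    // Hypothetical policy: one row filter plus per-column masking functions.
    final Predicate<Map<String, String>> rowFilter;
    final Map<String, UnaryOperator<String>> columnMasks;

    AccessPolicySketch(Predicate<Map<String, String>> rowFilter,
                       Map<String, UnaryOperator<String>> columnMasks) {
        this.rowFilter = rowFilter;
        this.columnMasks = columnMasks;
    }

    // Replaces all but the last four characters with 'x'.
    static String showLastFour(String value) {
        int keep = Math.min(4, value.length());
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length() - keep; i++) {
            sb.append('x');
        }
        return sb.append(value.substring(value.length() - keep)).toString();
    }

    // Drops rows the user may not see, then masks the configured columns
    // on a copy of each remaining row; the stored data is left unchanged.
    List<Map<String, String>> apply(List<Map<String, String>> rows) {
        List<Map<String, String>> visible = new ArrayList<>();
        for (Map<String, String> row : rows) {
            if (!rowFilter.test(row)) {
                continue;
            }
            Map<String, String> masked = new LinkedHashMap<>(row);
            columnMasks.forEach((col, mask) ->
                masked.computeIfPresent(col, (k, v) -> mask.apply(v)));
            visible.add(masked);
        }
        return visible;
    }
}
```

The point of the model is that a single stored data set serves every user: the filter and the masks are applied at read time, so no per-audience copies of the data are needed.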
OBJECT STORAGE
Apache Ozone
• Ozone is the next generation of HDFS
– Based on HDFS architecture, but with some fundamental shifts
– Preserves and reuses good parts of HDFS
– Addresses HDFS scale limits and small file problem
• Uses an object store architecture to achieve scale.
• Provides native Hadoop File System API as well as a native S3 API
PATHS TO CDP
THREE PATHS TO CDP
• Migrate to Public Cloud: copy data and metadata to a public cloud; implement new workloads, or migrate existing ones, on CDP Public Cloud.
• Migrate to CDP DC: build a new CDP Data Center cluster on-premises; copy data and metadata from the existing classic cluster; and migrate existing workloads.
• Upgrade to CDP DC: upgrade from the classic cluster to CDP Data Center in place, on the same hardware infrastructure.
PATH2CDP
1. PLAN | 2. LAUNCH | 3. PRODUCTION
[Diagram: offerings CONSUME CDP, UPGRADE2CDP, and SMARTUPGRADE TO CDP DC for CDP Data Center, with listed durations of 1+ weeks, 3+ weeks, and a custom estimate]
GENERAL AND CDP TRAINING
Spark (DC - Q1)
CDW
CML
CDP Public Cloud Admin (Q2)
Stream Processing (Q1) (CDF)
Spark Performance Workshop
Hive/Impala (DC - Q1)
Data Science Workshop (Q2)
CDP Security (Q2-Q3)
Kafka Operations (CDF)
CDP Data Governance (Q2)
DS/ML Modules
ADMINISTRATOR
DATA ANALYST
DEVELOPER
DATA SCIENTIST
AWS Fundamentals for CDP Pub. Cloud
Flow Management with NiFi (CDF)
cloudera.com/training.html
CDH/HDP TO CDP DELTA COURSES
https://www.cloudera.com/about/training/courses/advanced-spark-application-performance-tuning.html
https://www.cloudera.com/about/training/courses/aws-undamentals-for-cdp.html
https://www.cloudera.com/about/training/courses/dataflow-flow-managment-with-nifi.html#?classType=&country=US
https://cloudera.com/training.html
WHAT IS COMING NEXT?
CDP PRIVATE CLOUD: BASED ON CDP DATA CENTER
• New set of data analytics applications, featuring use-case optimized interfaces
• Running on a container cloud: fast provisioning & scaling, efficient, simple
• With access to a shared data lake that is secured and governed
[Architecture diagram: Management Console (Data Catalog, Workload Manager, Replication Manager); Experiences (Data Warehouse, Machine Learning, Data Engineering, DataFlow) on Kubernetes; SDX (Security, Metadata, Governance); Bare Metal Workloads (CDP Data Center) on bare metal]
Questions & Answers
Over to you...