Spark Streaming in K8s with ArgoCD & Spark Operator
Albert Franzi - Data Engineer Lead @ Typeform · 2020-11-25

Transcript:

  • Spark Streaming in K8s with ArgoCD & Spark Operator - Albert Franzi, Data Engineer Lead @ Typeform

  • Agenda

    val sc: SparkContext - Where are we nowadays

    Spark(implicit mode: K8s) - When Spark met K8s

    type Deploy = SparkOperator - How we deploy into K8s

    Some[Learnings] - Why it matters

  • About me - Data Engineer Lead @ Typeform

  • About me

    Data Engineer Lead @ Typeform
      ○ Leading the Data Platform team

    Previously:
      ○ Data Engineer @ Alpha Health
      ○ Data Engineer @ Schibsted Classified Media
      ○ Data Engineer @ Trovit Search

    Medium: albert-franzi · Twitter: FranziCros

    http://typeform.com
    https://www.alpha.company/
    https://schibsted.com/
    https://www.trovit.es/
    https://medium.com/albert-franzi
    https://twitter.com/FranziCros

  • About Typeform

  • val sc: SparkContext - Where are we nowadays

  • val sc: SparkContext - Where are we nowadays - Environments

  • val sc: SparkContext - Where are we nowadays - Executions

    Great for batch processing

    Good orchestrators

    Old school / Area 51 / Next slides

  • Spark(implicit mode: K8s) - When Spark met K8s

  • ● Delayed EMR releases:

    EMR 6.1.0 delivered Spark 3.0.0 roughly 3 months after the Spark release.

    ● Spark fixed version per cluster.

    ● Unused resources.

    ● Same IAM role shared across the entire cluster.

    Spark(implicit mode: K8s) - When Spark met K8s - EMR: The Past

    https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html

  • ● Multiple Spark versions running in parallel in the same cluster.

    ● Use what you need, share what you don’t.

    ● IAM role per Service Account.

    ● Different node types based on your needs.

    ● You define the Docker images.

    Spark(implicit mode: K8s) - When Spark met K8s - The Future

  • Spark(implicit mode: K8s) - When Spark met K8s - Requirements

    Kubernetes Cluster

    v : 1.13+

    AWS SDK

    v : 1.11.788+
    🔗 WebIdentityTokenCredentialsProvider

    IAM Roles

    Fine-grained IAM roles for service accounts
    🔗 IRSA

    Spark docker image

    hadoop: v3.2.1 · aws_sdk: v1.11.788 · scala: v2.12 · spark: v3.0.0 · java: 8
    🔗 hadoop.Dockerfile & spark.Dockerfile

    https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/WebIdentityTokenCredentialsProvider.html
    https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/
    https://gist.github.com/afranzi/4685518e24fd81e07639b97c4a5a2757
    https://gist.github.com/afranzi/85ff3bf47632fc650cec17b0cc16bbca
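
    The AWS SDK requirement is what makes IRSA work end to end: with WebIdentityTokenCredentialsProvider set in hadoopConf (as in the spec later), the driver and executor pods assume the IAM role bound to their service account. A minimal sketch of that binding, assuming an EKS cluster with an OIDC provider; the role ARN below is a placeholder:

    # Hypothetical example - annotate the "spark" service account with an IAM role (IRSA).
    # Create the role and its trust policy for the cluster's OIDC provider first.
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: spark
      namespace: spark
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/spark-irsa-role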

  • type Deploy = SparkOperator - How we deploy into K8s

  • type Deploy = SparkOperator - How we deploy into K8s

    ref: github.com - spark-on-k8s-operator

    Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

    https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
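
    The operator itself is installed once per cluster; a sketch using its Helm chart (the repo URL follows the project's docs at the time and may have moved since):

    $ helm repo add spark-operator https://googlecloudplatform.github.io/spark-operator
    $ helm install spark-operator spark-operator/spark-operator \
        --namespace spark-operator --create-namespace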

  • type Deploy = SparkOperator - How we deploy into K8s - Application Specs

    apiVersion: "sparkoperator.k8s.io/v1beta2"kind: SparkApplicationmetadata: name: our-spark-job-name namespace: sparkspec: type: Scala mode: cluster image: "xxx/typeform/spark:3.0.0" imagePullPolicy: Always imagePullSecrets: [xxx] sparkVersion: "3.0.0" restartPolicy: type: Never volumes: - name: temp-volume emptyDir: {} hadoopConf: fs.s3a.aws.credentials.provider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider mainClass: com.typeform.data.spark.our.class.package mainApplicationFile: "s3a://my_spark_bucket/spark_jars/0.8.23/data-spark-jobs-assembly-0.8.23.jar" arguments: - --argument_name_1 - argument_value_1

      driver:
        cores: 1
        coreLimit: "1000m"
        memory: "512m"
        labels:
          version: 3.0.0
        serviceAccount: "spark"
        deleteOnTermination: true
        secrets:
          - name: my-secret
            secretType: generic
            path: /mnt/secrets
        volumeMounts:
          - name: "temp-volume"
            mountPath: "/tmp"
      executor:
        cores: 1
        instances: 4
        memory: "512m"
        labels:
          version: 3.0.0
        serviceAccount: "spark"
        deleteOnTermination: true
        volumeMounts:
          - name: "temp-volume"
            mountPath: "/tmp"

  • type Deploy = SparkOperator - How we deploy into K8s

    schedule: "@every 5m"
    concurrencyPolicy: Replace | Allow | Forbid

    crontab.guru

    https://crontab.guru
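
    For reference, a sketch of how these fields fit together in a ScheduledSparkApplication (the name is a placeholder; template takes the same fields as the SparkApplication spec shown earlier):

    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: ScheduledSparkApplication
    metadata:
      name: our-scheduled-spark-job   # placeholder
      namespace: spark
    spec:
      schedule: "@every 5m"
      concurrencyPolicy: Forbid   # Allow overlaps runs; Forbid skips a run while one is active; Replace kills the running one
      template:
        type: Scala
        mode: cluster
        sparkVersion: "3.0.0"
        # ... rest of the SparkApplication spec shown earlier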

  • type Deploy = SparkOperator - How we deploy into K8s

    restartPolicy: Never | Always | OnFailure
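
    A sketch of the OnFailure variant with the operator's retry settings (values are illustrative, not from the talk):

    restartPolicy:
      type: OnFailure
      onFailureRetries: 3
      onFailureRetryInterval: 10            # seconds between retries after a run failure
      onSubmissionFailureRetries: 5
      onSubmissionFailureRetryInterval: 20  # seconds between retries after a failed submission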

  • type Deploy = SparkOperator - How we deploy into K8s - Deployment Flow

  • type Deploy = SparkOperator - How we deploy into K8s - Deploying it manually (Simple & easy)

    $ sbt assembly

    $ aws s3 cp \
        target/scala-2.12/data-spark-jobs-assembly-0.8.23.jar \
        s3://my_spark_bucket/spark_jars/data-spark-jobs_2.12/0.8.23/

    $ kubectl apply -f spark-job.yaml

    Build the jar, upload it to S3, and deploy the Spark application.

    $ kubectl delete -f spark-job.yaml

    Delete our Spark Application
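
    Since the operator registers SparkApplication as a custom resource, the job can also be inspected with kubectl once applied (a usage sketch; the name matches the spec above):

    $ kubectl get sparkapplications -n spark
    $ kubectl describe sparkapplication our-spark-job-name -n spark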

  • type Deploy = SparkOperator - How we deploy into K8s - Deploying it automatically (Simple & easy)

    Argo CD is a declarative, GitOps continuous delivery tool for Kubernetes.

    ref: argoproj.github.io/argo-cd

    https://argoproj.github.io/argo-cd/

  • type Deploy = SparkOperator - How we deploy into K8s - Deploying it automatically (Simple & easy)

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: data-spark-jobs
      namespace: argocd
    spec:
      destination:
        namespace: spark
        server: 'https://kubernetes.default.svc'
      project: data-platform-projects
      source:
        helm:
          valueFiles:
            - values.yaml
            - values.prod.yaml
        path: k8s/data-spark-jobs
        repoURL: 'https://github.com/thereponame'
        targetRevision: HEAD
      syncPolicy: {}

    Argo CD Application Spec

  • ArgoCD manual Sync
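
    The empty syncPolicy: {} in the spec above is what keeps the application on manual sync. A sketch of opting into automated sync instead, using Argo CD's standard flags:

    syncPolicy:
      automated:
        prune: true     # delete cluster resources that were removed from Git
        selfHeal: true  # revert drift introduced by manual changes in the cluster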

  • type Deploy = SparkOperator - How we deploy into K8s - Deployment Flow

  • Some[Learnings] - Why it matters

  • Some[Learnings]

    ● It was really easy to set up with the right team and the right infrastructure.

    ● Different teams & projects adopt new Spark versions at their own pace.

    ● A Spark testing cluster is always ready to accept new jobs without extra cost, since the K8s cluster is already available in dev environments.

    ● Monitor pod consumption to tune memory and CPU properly (see the sketch below).

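    For that tuning point, a quick way to check actual usage (assumes metrics-server is installed in the cluster):

    $ kubectl top pods -n spark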

  • Some[Learnings] - Why it matters: Data DevOps makes a difference

    Bring a DevOps engineer into your team and grow them into a Data DevOps.

  • The[team]

    Digital Analytics Specialists (x2)

    BI / DWH Architect (x2)

    Data Devops (x1)

    Data engineers (x4)

    Data Platform : A multidisciplinary team

  • Links of Interest

    Spark structured streaming in K8s with ArgoCD by Albert Franzi (https://www.linkedin.com/in/albertfranzi/)
    https://medium.com/albert-franzi/spark-structured-streaming-in-k8s-with-argo-cd-de4942846161

    Spark on K8s operator
    https://github.com/GoogleCloudPlatform/spark-on-k8s-operator

    ArgoCD - App of apps pattern
    https://argoproj.github.io/argo-cd/operator-manual/cluster-bootstrapping/#app-of-apps-pattern

    Spark History Server in K8s by Carlos Escura (https://www.linkedin.com/in/carlosescura/)
    https://medium.com/@carlosescura/run-spark-history-server-on-kubernetes-using-helm-7b03bfed20f6

    Spark Operator - Specs
    https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/api-docs.md