EAS-030 Spark Scala Kubernetes (Piloting)
Description
Participants work through Apache Spark with Scala, starting with a foundational understanding of Spark's architecture and its advantages over Hadoop's MapReduce.
is issued on the Luxoft Training form
Objectives
- Foundational Spark Principles: Dives into Spark's foundational concepts and architecture, comparing its efficiency to Hadoop's MapReduce, and exploring its diverse resource managers.
- Spark & Kubernetes Synergy: Equips participants with knowledge about the containerization of Spark applications, understanding Kubernetes dynamics, and efficient deployment techniques.
- Data API Proficiency: Delves into Spark's high-level Data APIs, DataFrame and Dataset, highlighting their differences, parallelization, and optimal storage methods.
- External Data Management Mastery: Focuses on robust techniques for data interaction with diverse external storages, optimizing data formats, and efficient data transfers.
- Spark Optimization & Streamlining: Addresses core challenges in Spark, covers optimization strategies, and introduces Structured Streaming techniques and applications.
Target Audience
Prerequisites
Roadmap
- Spark concepts and architecture (theory 2h 30m, practice 1h 30m)
Explore Spark's superiority over Hadoop's MapReduce with hands-on examples. Dive into Lambda architecture, understand batch vs. streaming. Master Spark's resource managers: Kubernetes, YARN, Standalone. Learn to initiate Spark applications. Comprehensive definitions included.
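As a rough illustration of starting a Spark application, the sketch below runs the classic word count, the usual first MapReduce example, in a few lines of Scala. The names `WordCountSketch` and `countWords` are made up for this example, and the `local[*]` master is an assumption for running on a single machine; on Kubernetes, YARN, or Standalone the master would normally be supplied at submit time.

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  // Counts word occurrences in the given lines using the RDD API.
  def countWords(spark: SparkSession, lines: Seq[String]): Map[String, Long] =
    spark.sparkContext
      .parallelize(lines)            // distribute the input across partitions
      .flatMap(_.split("\\s+"))      // "map" phase: split lines into words
      .filter(_.nonEmpty)
      .map(word => (word, 1L))
      .reduceByKey(_ + _)            // "reduce" phase: sum the counts per word
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, using all available cores of one machine.
    val spark = SparkSession.builder()
      .appName("word-count-sketch")
      .master("local[*]")
      .getOrCreate()
    try {
      countWords(spark, Seq("spark on kubernetes", "spark on yarn")).foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```

The whole pipeline is one chain of transformations kept in memory between steps, which is where much of Spark's advantage over a disk-based MapReduce job comes from.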
- Containerizing and deploying Spark applications to Kubernetes (theory 1h, practice 1h)
Master containerization: delve into Kubernetes terminology. Compare Kubernetes with YARN. Grasp dynamic resource allocation. Learn to containerize and deploy Spark on Kubernetes. Kickstart Spark applications seamlessly.
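To make the deployment step more concrete, here is a minimal sketch that assembles a `spark-submit` command line for a Kubernetes cluster. The `K8sJob` case class and `submitArgs` helper are hypothetical names introduced for illustration; the configuration keys are standard Spark settings, but the values (API server address, image name, jar path) are placeholders to be replaced with real ones.

```scala
object SparkOnK8sSubmit {
  // Hypothetical job description; these field names are illustrative only.
  final case class K8sJob(
      apiServer: String,               // URL of the Kubernetes API server
      image: String,                   // container image that bundles Spark and the app
      mainClass: String,
      appJar: String,                  // a local:// path points to a file inside the image
      namespace: String = "default",
      executors: Int = 2,
      dynamicAllocation: Boolean = false
  )

  // Builds the argument list for a cluster-mode submission to Kubernetes.
  def submitArgs(job: K8sJob): Seq[String] = {
    val baseConf: Seq[(String, String)] = Seq(
      "spark.kubernetes.container.image" -> job.image,
      "spark.kubernetes.namespace" -> job.namespace,
      "spark.executor.instances" -> job.executors.toString
    )
    // Without an external shuffle service, dynamic allocation relies on shuffle tracking.
    val dynamicConf: Seq[(String, String)] =
      if (job.dynamicAllocation)
        Seq(
          "spark.dynamicAllocation.enabled" -> "true",
          "spark.dynamicAllocation.shuffleTracking.enabled" -> "true"
        )
      else Seq.empty

    val confArgs = (baseConf ++ dynamicConf).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }

    val head = Seq(
      "spark-submit",
      "--master", s"k8s://${job.apiServer}",
      "--deploy-mode", "cluster",
      "--class", job.mainClass
    )
    (head ++ confArgs) :+ job.appJar
  }
}
```

Printing `submitArgs(job).mkString(" ")` gives a command that can be reviewed before running it; in cluster mode the driver itself runs in a pod, and executors are requested from the Kubernetes scheduler rather than from YARN.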
- High Level Data API: DataFrame, Dataset
Explore the high-level Data APIs, DataFrame and Dataset, and how they differ from RDDs. Learn how to create and parallelize them, analyze DataFrames and Datasets, and inspect their execution through query plans and DAGs. Master saving methods to HDFS, FTP, S3.
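The difference between the typed Dataset API and the untyped DataFrame API can be sketched as follows. The `Trade` case class and `totals` function are invented for this example and are not part of the course material; the sketch assumes a `SparkSession` is passed in by the caller.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object DataApiSketch {
  // Illustrative record type; a case class gives the Dataset its schema.
  final case class Trade(symbol: String, qty: Long)

  def totals(spark: SparkSession, trades: Seq[Trade]): Map[String, Long] = {
    import spark.implicits._

    // Dataset[Trade]: fields are accessed through the case class and checked at compile time.
    val ds: Dataset[Trade] = trades.toDS()
    val positive: Dataset[Trade] = ds.filter(_.qty > 0)

    // DataFrame (Dataset[Row]): columns are referenced by name and resolved at analysis time.
    val totalsDf: DataFrame = positive.groupBy($"symbol").sum("qty")

    // Prints the physical plan Spark will execute for this query.
    totalsDf.explain()

    totalsDf.collect().map(row => row.getString(0) -> row.getLong(1)).toMap
  }
}
```

Calling `explain(true)` instead prints the parsed, analyzed, and optimized logical plans as well, which is one way to inspect how a query is planned before it runs.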
- Loading data from and into external storage
Master techniques for loading data from and into external storage: read from and write to HDFS, S3, FTP, FS. Choose optimal data formats. Learn parallelized JDBC interactions. Create DataFrames and Datasets from Kafka topics. Efficiently load data into Cassandra.
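A parallelized JDBC read might look like the sketch below. The option keys (`partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`) are those of Spark's built-in JDBC data source; the object and function names, and any URL or table passed in, are assumptions made for illustration. Running `read` also requires a matching JDBC driver on the classpath and a reachable database.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ParallelJdbcSketch {
  // Builds the options for a partitioned read. Spark splits the range
  // [lower, upper) of `column` into `partitions` strides and issues one
  // query per stride; the bounds shape the strides but do not filter rows.
  def partitionedOptions(
      url: String,
      table: String,
      column: String,
      lower: Long,
      upper: Long,
      partitions: Int
  ): Map[String, String] = {
    require(partitions > 0, "numPartitions must be positive")
    require(lower < upper, "lowerBound must be smaller than upperBound")
    Map(
      "url" -> url,
      "dbtable" -> table,
      "partitionColumn" -> column,   // must be a numeric, date, or timestamp column
      "lowerBound" -> lower.toString,
      "upperBound" -> upper.toString,
      "numPartitions" -> partitions.toString
    )
  }

  // Loads the table as a DataFrame using the options above.
  def read(spark: SparkSession, options: Map[String, String]): DataFrame =
    spark.read.format("jdbc").options(options).load()
}
```

Choosing a partition column with evenly distributed values matters here: if most rows fall into one stride, a single task ends up reading most of the table.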
- Spark optimization cases
Work through common Spark optimization scenarios: resolving out-of-memory errors, handling small files in HDFS, correcting data skew, speeding up joins, optimizing broadcasts of large tables, sharing resources, and tuning performance with Adaptive Query Execution (AQE) and Dynamic Partition Pruning (DPP).
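Two of these techniques, a broadcast join and the AQE/DPP switches, can be sketched in a few lines. The object and function names are hypothetical; the configuration keys and the `broadcast` hint are standard Spark SQL features, though whether they help depends on the data and the Spark version in use.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object JoinTuningSketch {
  // Turns on adaptive execution, skew-join handling, and dynamic partition pruning.
  def enableAdaptive(spark: SparkSession): Unit = {
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
    spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
  }

  // Joins a large table with a small one, hinting Spark to ship the small
  // table to every executor instead of shuffling the large one.
  def joinWithSmallTable(facts: DataFrame, dims: DataFrame, key: String): DataFrame =
    facts.join(broadcast(dims), Seq(key))
}
```

The broadcast hint only pays off when the smaller side fits comfortably in executor memory; otherwise it can itself become a source of out-of-memory errors.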
- Testing Spark Applications
Four levels of quality for a Spark application
Unit testing a Spark application
Problems with unit testing Spark applications
Libraries and Solutions
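One common way to make Spark code unit-testable, shown here as a sketch rather than as the course's own method, is to keep each transformation as a plain function from DataFrame to DataFrame and run it against a small local session. All names below are illustrative, and the example uses a bare `assert` instead of any particular test library.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object TestableJobSketch {
  // Business logic kept free of I/O so it can be tested with in-memory data.
  def adultNames(people: DataFrame): DataFrame =
    people.filter(col("age") >= 18).select("name")

  // A small local session; the UI is disabled and shuffle partitions are
  // reduced because session startup and shuffles dominate test run time.
  def localSession(): SparkSession =
    SparkSession.builder()
      .master("local[2]")
      .appName("unit-test")
      .config("spark.ui.enabled", "false")
      .config("spark.sql.shuffle.partitions", "1")
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = localSession()
    import spark.implicits._
    try {
      val input = Seq(("Ann", 34), ("Bob", 12)).toDF("name", "age")
      val names = adultNames(input).as[String].collect().toSeq
      assert(names == Seq("Ann"), s"unexpected result: $names")
    } finally {
      spark.stop()
    }
  }
}
```

Sharing one session across a test suite, rather than creating one per test, is a typical way to keep such tests fast.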
- Spark Structured Streaming
Streaming DataFrame & Dataset
DataFrames and Datasets based on a Kafka topic
Loading Data to Cassandra
Working with Spark, Cassandra State
Optimization features
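A streaming DataFrame built on a Kafka topic could be sketched as below. The source options are those of Spark's Kafka integration, which needs the separate `spark-sql-kafka-0-10` package on the classpath; the object name, function names, and the console sink are assumptions chosen to keep the example small, and writing to Cassandra would need an additional connector not shown here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.StreamingQuery

object KafkaStreamSketch {
  // Creates an unbounded DataFrame over a Kafka topic. Each row carries
  // binary `key` and `value` columns plus topic, partition, and offset metadata.
  def kafkaSource(spark: SparkSession, bootstrapServers: String, topic: String): DataFrame =
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrapServers)
      .option("subscribe", topic)
      .option("startingOffsets", "latest")
      .load()

  // Decodes the binary key and value into strings. The same function works
  // on batch and streaming DataFrames, which keeps it easy to test.
  def decode(records: DataFrame): DataFrame =
    records.select(
      col("key").cast("string").as("key"),
      col("value").cast("string").as("value")
    )

  // Starts a query that prints each micro-batch to the console.
  def toConsole(decoded: DataFrame): StreamingQuery =
    decoded.writeStream
      .format("console")
      .outputMode("append")
      .start()
}
```

A typical call chain would be `toConsole(decode(kafkaSource(spark, servers, topic))).awaitTermination()`, with the console sink swapped for a real sink once the decoding logic is verified.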