EAS-030 Spark Scala Kubernetes (Piloting)
Description
Participants work through Spark with Scala, beginning with a foundational understanding of Spark's architecture and its advantages over Hadoop's MapReduce.
is issued on the Luxoft Training form
- Foundational Spark Principles: Dives into Spark's foundational concepts and architecture, comparing its efficiency to Hadoop's MapReduce, and exploring its diverse resource managers.
- Spark & Kubernetes Synergy: Covers the containerization of Spark applications, Kubernetes fundamentals, and efficient deployment techniques.
- Data API Proficiency: Delves into Spark's high-level Data APIs, DataFrame and Dataset, highlighting their differences, parallelization, and storage methods.
- External Data Management Mastery: Focuses on robust techniques for data interaction with diverse external storages, optimizing data formats, and efficient data transfers.
- Spark Optimization & Streamlining: Addresses core performance challenges in Spark, covers optimization strategies, and introduces Structured Streaming techniques and applications.
Spark concepts and architecture (theory 2h 30m, practice 1h 30m)
Explore Spark's superiority over Hadoop's MapReduce with hands-on examples. Dive into Lambda architecture, understand batch vs. streaming. Master Spark's resource managers: Kubernetes, YARN, Standalone. Learn to initiate Spark applications. Comprehensive definitions included.
Containerizing and deploying Spark applications to Kubernetes (theory 1h, practice 1h)
Master containerization: delve into Kubernetes terminology. Compare Kubernetes with YARN. Grasp dynamic resource allocation. Learn to containerize and deploy Spark on Kubernetes. Kickstart Spark applications seamlessly.
High-Level Data API: DataFrame, Dataset
Explore the high-level Data APIs: DataFrame and Dataset. Unravel the differences between RDD, DataFrame, and Dataset. Learn creation and parallelization techniques. Dive into DataFrame and Dataset analysis, and inspect execution through query plans and DAGs. Master methods for saving to HDFS, FTP, and S3.
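As an illustration of the topics above, the following is a minimal sketch, assuming Spark 3.x with `spark-sql` on the classpath and a local master. The `Trade` case class, the object name, and the sample values are hypothetical and exist only for this example. It builds a typed Dataset from a parallelized collection, derives an untyped DataFrame aggregation from it, and prints the query plans.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
final case class Trade(symbol: String, qty: Long)

object DataApiSketch {
  // Local session for experimentation; on a cluster the master is normally
  // supplied by spark-submit rather than hard-coded.
  def localSession(): SparkSession =
    SparkSession.builder().master("local[2]").appName("data-api-sketch").getOrCreate()

  // Typed Dataset built from a local collection split into `slices` partitions.
  def trades(spark: SparkSession, slices: Int): Dataset[Trade] = {
    import spark.implicits._
    spark.sparkContext
      .parallelize(Seq(Trade("AAA", 10L), Trade("BBB", 5L), Trade("AAA", 7L)), slices)
      .toDS()
  }

  // Untyped DataFrame aggregation over the same data.
  def totals(ds: Dataset[Trade]): DataFrame =
    ds.groupBy("symbol").sum("qty").withColumnRenamed("sum(qty)", "total")

  def main(args: Array[String]): Unit = {
    val spark = localSession()
    val df = totals(trades(spark, slices = 2))
    df.explain(true) // parsed, analyzed, and optimized logical plans plus the physical plan
    df.show()
    spark.stop()
  }
}
```

The Dataset keeps the `Trade` type at compile time, while the aggregated result is a DataFrame whose columns are checked only at analysis time; comparing the two is one way to see the difference the module describes.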
Loading data from/to external storage
Master techniques for loading data from and into external storage: read from and write to HDFS, S3, FTP, and the file system (FS). Choose optimal data formats. Learn parallelized JDBC interactions. Create DataFrames and Datasets from Kafka topics. Efficiently load data into Cassandra.
Spark optimization cases
Delve into Spark optimization scenarios: address 'out of memory' issues, manage small files in HDFS, correct skewed data, speed up joins, optimize large table broadcasts, apply resource sharing strategies, and leverage Adaptive Query Execution (AQE) and Dynamic Partition Pruning (DPP) for performance tuning.
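Two of the scenarios above, faster joins and AQE/DPP, can be sketched as follows. This is a minimal sketch, assuming Spark 3.x and a local master; the object name, the column names, and the sample rows are hypothetical. The configuration keys are standard Spark SQL settings, and `broadcast` is the hint from `org.apache.spark.sql.functions`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object JoinTuningSketch {
  // Session with AQE, AQE skew-join handling, and DPP switched on explicitly.
  // Recent Spark 3.x releases enable these by default; they are set here only
  // to make the intent visible.
  def session(): SparkSession =
    SparkSession.builder()
      .master("local[2]")
      .appName("join-tuning-sketch")
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.adaptive.skewJoin.enabled", "true")
      .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
      .getOrCreate()

  // Broadcast hint: asks Spark to ship `dims` to every executor so the join
  // can avoid shuffling the larger `facts` side.
  def enrich(facts: DataFrame, dims: DataFrame): DataFrame =
    facts.join(broadcast(dims), Seq("country_code"))

  def main(args: Array[String]): Unit = {
    val spark = session()
    import spark.implicits._
    val facts = Seq(("US", 10), ("DE", 20), ("US", 5)).toDF("country_code", "amount")
    val dims = Seq(("US", "United States"), ("DE", "Germany")).toDF("country_code", "country_name")
    val joined = enrich(facts, dims)
    joined.explain() // inspect the physical plan chosen for the join
    joined.show()
    spark.stop()
  }
}
```

The hint is only appropriate when the broadcast side fits comfortably in executor memory; otherwise it can cause the very 'out of memory' issues this module sets out to address.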
Testing Spark Applications
Four levels of quality for a Spark application
Unit testing for Spark applications
Problems with unit testing Spark applications
Libraries and Solutions
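One common approach to unit testing can be sketched as follows: keep the transformation free of I/O and run it against a tiny in-memory input on a local session. This is a minimal sketch, assuming Spark 3.x; `Normalizer`, `NormalizerCheck`, and the sample data are hypothetical, and the plain result check stands in for whatever test framework the course actually uses.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, upper}

object Normalizer {
  // Transformation under test: no reads or writes, so it runs on any DataFrame.
  def normalizeSymbols(df: DataFrame): DataFrame =
    df.withColumn("symbol", upper(col("symbol"))).dropDuplicates("symbol")
}

object NormalizerCheck {
  // Runs the transformation on a three-row input and returns the sorted result.
  def run(): Array[String] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("normalizer-check")
      .config("spark.ui.enabled", "false")         // no web UI needed in tests
      .config("spark.sql.shuffle.partitions", "1") // keep tiny jobs fast
      .getOrCreate()
    import spark.implicits._
    try {
      val input = Seq("aaa", "AAA", "bbb").toDF("symbol")
      Normalizer.normalizeSymbols(input).as[String].collect().sorted
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val result = run()
    assert(result.sameElements(Array("AAA", "BBB")), s"unexpected: ${result.mkString(",")}")
  }
}
```

Starting and stopping a session per check is slow; sharing one local session across a test suite is a typical mitigation for that cost.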
Spark Structured Streaming
Streaming DataFrame & Dataset
DataFrames and Datasets based on a Kafka topic
Loading Data to Cassandra
Working with Spark, Cassandra State
Optimization features