Advanced Spark for Developers | Software Development
Advanced Spark for Developers
Duration
28 hours
Location
Online
Language
English
Code
EAS-024
€ 650 *
Training for 7-8 or more people? We can customize the training for your specific needs.
Description
This course gives trainees a solid understanding of the internal structure and functioning of Apache Spark: Spark Core (RDD), Spark SQL, Spark Streaming, and Spark Structured Streaming. We will discuss how Spark cluster components run under various cluster managers, how resource allocation is managed, and how the scheduler operates. The course also covers the advantages of the Tungsten internal data format and the Catalyst optimizer.
After completing the course, a certificate is issued on the Luxoft Training form
Objectives
- Understand Spark’s internal structure
- Understand the deployment, configuration and execution of Spark components on various clusters (Standalone, YARN, Mesos)
- Optimize RDD-based Spark jobs
- Optimize Spark SQL jobs
- Optimize Microbatch and Structured Streaming jobs
Target Audience
- Developers
- Architects
Prerequisites
At least 3 months of development experience with Apache Spark in Java or Scala.
Roadmap
Module 0 - Scala in one day
1. Examine Scala features used in the Spark framework
2. Theory:
1. var and val, val (x, y) destructuring, lazy val, @transient lazy val
2. type and Type; the bottom of the hierarchy: Nil, None, Null => null, Nothing, Unit => (); top types: Any, AnyRef, AnyVal; String and interpolation
3. class, object (case), abstract class, trait
4. Scala functions, methods, lambdas
5. Generics, ClassTag; covariant, contravariant and invariant positions; F[_], *
6. Pattern matching and if/else constructs
7. Mutable and immutable collections, Iterator, collection operations
8. Monads (Option, Either, Try, Future, ...), Try().recover
9. map, flatMap, foreach, for comprehension
10. Implicits, private[sql], package
11. Scala sbt, assembly
12. Encoder, Product
13. Scala libs for Spark: scopt, chimney, jsoniter
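Several of the Module 0 topics above (lazy val, Option/Try, pattern matching, for comprehensions) can be illustrated in a few lines of plain Scala. A standalone sketch, not course material; all names are illustrative:

```scala
import scala.util.{Success, Try}

object ScalaFeatures extends App {
  // lazy val: the body runs once, on first access
  lazy val expensive: Int = { println("computed once"); 42 }

  // Option + pattern matching with a guard
  def describe(x: Option[Int]): String = x match {
    case Some(n) if n > 0 => s"positive: $n"
    case Some(n)          => s"non-positive: $n"
    case None             => "empty"
  }

  // Try with recover: turn a thrown exception into a fallback value
  val parsed: Try[Int] = Try("12a".toInt).recover { case _: NumberFormatException => 0 }

  // for comprehension desugars to flatMap/map
  val pairs = for { a <- List(1, 2); b <- List(10, 20) } yield a * b

  assert(describe(Some(3)) == "positive: 3")
  assert(parsed == Success(0))
  assert(pairs == List(10, 20, 20, 40))
  println(expensive) // prints "computed once", then 42
}
```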
Module 1 – RDD
1. Theory: RDD API
1. RDD creation API: from array, from file, from DS
2. Basic RDD operations: map, flatMap, filter, reduceByKey, sort
3. Time-parsing libraries
2. Theory: RDD under the hood
1. Iterator + mapPartitions()
2. RDD creation path: compute() and getPartitions()
3. Partitions
4. Partitioner: Hash and Range
5. Dependencies: wide and narrow
6. Joins: inner, cogroup, join without shuffle
7. Query plans
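A minimal sketch tying together several of the items above (RDD creation, narrow vs wide operations, mapPartitions). It assumes spark-core on the classpath and a local master; it is illustrative, not course code:

```scala
// Sketch only: assumes spark-core on the classpath.
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))

  val words = sc.parallelize(Seq("spark", "scala", "spark"), numSlices = 2)

  // map is a narrow dependency (runs per partition, no shuffle);
  // reduceByKey is wide (shuffles records by key)
  val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

  // mapPartitions exposes the Iterator each partition is computed from
  val perPartition = words.mapPartitions(it => Iterator(it.size)).collect()

  println(counts.collect().toMap) // Map(spark -> 2, scala -> 1)
  println(perPartition.toList)
  sc.stop()
}
```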
Module 2 - DataFrame & DataSet, Spark DSL & Spark SQL
1. Theory: DataFrame and DataSet API
1. Creating a DataFrame: from memory, from files (HDFS, S3, FS) in Avro, ORC or Parquet format
2. Spark DSL: broadcast joins, grouped operations
3. Spark SQL: window functions, single partitions
4. Scala UDF problem solving
5. Spark catalog
2. Recreating code from query plans
1. Catalyst optimizer: logical & physical plans
2. Codegen
3. Persist vs Cache vs Checkpoint
4. DataFrame creation path
5. Row vs InternalRow
Module 3 - Spark optimization
1. Compare speed and size: RDD vs DataFrame vs DataSet
2. Compare crime counting with SortMergeJoin, BroadcastJoin and BloomFilter
3. Resolve problems with a skewed join
4. Build UDFs for Python and Scala
5. UDF problems
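One standard way to resolve a skewed join (item 3 above) is "salting": spread the hot key over several artificial sub-keys. A hedged sketch, assuming spark-sql on the classpath and a SparkSession `spark`; table and column names are illustrative:

```scala
// Sketch only: assumes a SparkSession named `spark`.
import org.apache.spark.sql.functions._
import spark.implicits._

val facts = Seq.fill(1000)(("hot", 1)).toDF("k", "v")       // heavily skewed key
val dims  = Seq(("hot", "x"), ("cold", "y")).toDF("k", "d") // small dimension table

val saltBuckets = 8

// spread the skewed key across `saltBuckets` sub-keys on the large side...
val salted = facts.withColumn("salt", (rand() * saltBuckets).cast("int"))

// ...and replicate the small side once per salt value so every salted key still matches
val saltedDims = dims.crossJoin(spark.range(saltBuckets).toDF("salt"))

val joined = salted.join(saltedDims, Seq("k", "salt")).drop("salt")
```

The join key now has up to `saltBuckets` times more distinct values, so the hot key's rows land in several partitions instead of one.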
Module 4 - External Systems and Connectors
1. How to read/write data from file storages (HDFS, S3, FTP, FS)
2. What data format to choose (JSON, CSV, Avro, ORC, Parquet, Delta, ...)
3. How to parallelize reading from/writing to JDBC
4. How to create a DataFrame from MPP databases (Cassandra, Vertica, gp)
5. How to work with Kafka
6. How to write your own connectors
7. Write a UDF for joining with Cassandra
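Parallelizing a JDBC read (item 3 above) is typically done with the partitioning options of Spark's JDBC source. A hedged sketch, assuming a SparkSession `spark`; the URL and the `orders` table with a numeric `id` column are illustrative:

```scala
// Sketch only: assumes spark-sql plus a JDBC driver on the classpath
// and a SparkSession named `spark`.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/shop") // illustrative URL
  .option("dbtable", "orders")
  .option("partitionColumn", "id")  // numeric column to split the table on
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")     // 8 concurrent JDBC connections
  .load()
```

Each of the 8 partitions issues its own range query over `id`, so the read proceeds in parallel instead of through a single connection.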
Module 5 – Testing
1. Write tests for the data marts built in this module (Exercise: find the most popular order times, find the most popular boroughs for orders, find the distance distribution of orders grouped by borough)
2. Theory:
1. Unit testing
2. Code review
3. QA
4. CI/CD
5. Problems
6. Libraries that solve these problems
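A common pattern behind the unit-testing topic above is to keep the business logic in plain functions that need no SparkSession, so they can be tested directly. A minimal sketch; the "popular hour" metric and all names are illustrative, not the course's exercise code:

```scala
object PopularHour {
  // given (hour, orderCount) pairs, return the hour with the most orders
  def popularHour(counts: Seq[(Int, Long)]): Int =
    counts.maxBy(_._2)._1
}

object PopularHourTest extends App {
  val sample = Seq(9 -> 12L, 12 -> 40L, 18 -> 33L)
  assert(PopularHour.popularHour(sample) == 12)
  println("ok")
}
```

The same function can then be applied inside a Spark job (e.g. after a groupBy/count), while the unit test runs in milliseconds without a cluster.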
Module 6 - Spark Cluster
1. Build a config with resource allocation
2. Compare several workers
3. Dynamic Resource Allocation
4. Manually managing executors at runtime
Module 7 - Spark Streaming
1. Solve a problem with writing to Cassandra (src/main/scala/mod4connectors/DataSetsWithCassandra.scala)
2. Build a Spark Structured Streaming job reading from Kafka
3. Build a Spark Structured Streaming job using state
4. Build a Spark Structured Streaming job writing to Cassandra
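The Kafka-reading step above can be sketched with Spark's Structured Streaming API. Assumes the spark-sql-kafka connector on the classpath and a SparkSession `spark`; the broker address and topic name are illustrative:

```scala
// Sketch only: assumes the spark-sql-kafka-0-10 package and a SparkSession `spark`.
import org.apache.spark.sql.functions.col

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // illustrative address
  .option("subscribe", "orders")                    // illustrative topic
  .load()
  .select(col("key").cast("string"), col("value").cast("string"))

val query = stream.writeStream
  .format("console")   // a Cassandra sink would replace this for the last exercise
  .outputMode("append")
  .start()

query.awaitTermination()
```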
Schedule and prices
Register for the next course
Registering in advance ensures you have priority. We will notify you when we schedule the next course on this topic.