Apache Spark Fundamentals

This training course delivers the key concepts and methods needed to develop data processing applications with Apache Spark. We'll look at the RDD-based framework for automated distributed code execution and its companion projects in different paradigms: Spark SQL, Spark Streaming, MLlib, Spark ML, and GraphX. We'll work with the Spark data APIs, the built-in connectors, and batch and streaming pipelines.

Duration
26 hours
Location
Online
Language
English
Code
EAS-017
€ 700 *
Training for 7-8 or more people? Customize trainings for your specific needs

Description

This training course delivers the key concepts and methods needed to develop data processing applications with Apache Spark. We'll look at the Spark framework for automated distributed code execution and its companion projects in the Map-Reduce paradigm. We'll work with RDDs, DataFrames, and Datasets, and describe processing logic with Spark SQL and the DataFrame DSL. We'll also cover loading data from and to external storage such as Cassandra, Kafka, Postgres, and S3, and we will work with HDFS and common data formats.
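
To give a flavor of the connector material, here is a minimal, hedged PySpark sketch; the host, database, table, credentials, and the S3 bucket are illustrative placeholders rather than course materials. It reads a Postgres table over JDBC, expresses the same aggregation with the DataFrame DSL and with Spark SQL, and writes the result to S3 as Parquet.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Illustrative sketch: hosts, table names, credentials and paths are placeholders.
    spark = SparkSession.builder.appName("connectors-demo").getOrCreate()

    # Read a table from Postgres over JDBC (the Postgres JDBC driver must be on the classpath).
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://db-host:5432/shop")
              .option("dbtable", "public.orders")
              .option("user", "reader")
              .option("password", "secret")
              .load())

    # The same logic expressed with the DataFrame DSL ...
    by_country_dsl = orders.filter(F.col("amount") > 100).groupBy("country").count()

    # ... and with Spark SQL.
    orders.createOrReplaceTempView("orders")
    by_country_sql = spark.sql(
        "SELECT country, COUNT(*) AS cnt FROM orders WHERE amount > 100 GROUP BY country")

    # Write the result to S3 as Parquet (needs the hadoop-aws package and S3 credentials).
    by_country_dsl.write.mode("overwrite").parquet("s3a://example-bucket/reports/orders_by_country")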

Certificate

After completing the course, a certificate is issued on the Luxoft Training form.

Objectives

During the training, participants will:

  1. Write a Spark pipeline using functional Python and RDDs;
  2. Write a Spark pipeline using Python, the Spark DSL, Spark SQL, and DataFrames;
  3. Design an architecture that combines different data sources;
  4. Write a Spark pipeline that integrates external systems (Kafka, Cassandra, Postgres) and runs in parallel mode;
  5. Resolve problems with slow joins (see the join sketch after this list).
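
One common cure for a slow join, sketched below with invented sample data: when one side of the join is small, asking Spark to broadcast it replaces the shuffle-based sort-merge join with a broadcast hash join, so the large side is never shuffled.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

    # Illustrative data: a large fact-like DataFrame and a small dimension-like one.
    events = spark.createDataFrame(
        [(1, "click"), (2, "view"), (1, "click"), (3, "view")], ["user_id", "action"])
    users = spark.createDataFrame(
        [(1, "DE"), (2, "PL"), (3, "UA")], ["user_id", "country"])

    # The broadcast hint ships the small table to every executor, avoiding a shuffle of events.
    joined = events.join(broadcast(users), on="user_id", how="left")
    joined.explain()   # the plan should show BroadcastHashJoin instead of SortMergeJoin
    joined.show()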

After the training, participants will be able to build a simple PySpark application and execute it on the cluster in parallel mode.
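
For illustration, here is roughly what such an application can look like, kept as a sketch (the file name and input path are assumptions): a plain RDD word count whose tasks run in parallel across the executors once the action is triggered.

    # wordcount.py - a minimal PySpark application (the input path is a placeholder).
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("wordcount").getOrCreate()
        sc = spark.sparkContext

        # Transformations only build the lineage; nothing runs until an action is called.
        counts = (sc.textFile("hdfs:///data/books/*.txt")
                    .flatMap(lambda line: line.split())
                    .map(lambda word: (word.lower(), 1))
                    .reduceByKey(lambda a, b: a + b))

        # The action triggers parallel execution of the tasks on the cluster.
        for word, count in counts.takeOrdered(10, key=lambda kv: -kv[1]):
            print(word, count)

        spark.stop()

It could be submitted with, for example, "spark-submit --master yarn wordcount.py"; the master URL and deploy mode depend on the cluster.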

Target Audience

  • Software developers
  • Software architects

Prerequisites

Basic programming skills in Java, Python, or Scala; familiarity with the Unix/Linux shell. Experience with databases is optional.

Roadmap

  • Spark concepts and architecture
  • Programming with RDDs: transformations and actions
  • Using key/value pairs
  • Loading and storing data
  • Accumulators and broadcast variables (see the sketch after this list)
  • Spark SQL, DataFrames, Datasets
  • Spark Streaming
  • Machine Learning using MLlib and Spark ML
  • Graph analysis using GraphX
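
As a preview of the accumulators and broadcast variables topic, a small sketch with invented sample data: an accumulator counts malformed records on the executors, while a broadcast variable ships a read-only lookup table to each executor once instead of with every task.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shared-variables-demo").getOrCreate()
    sc = spark.sparkContext

    # Illustrative input: one record is malformed.
    lines = sc.parallelize(["1,click", "2,view", "broken", "3,click"])

    bad_records = sc.accumulator(0)                               # written on executors, read on the driver
    country_by_user = sc.broadcast({1: "DE", 2: "PL", 3: "UA"})   # read-only lookup shared with executors

    def parse(line):
        parts = line.split(",")
        if len(parts) != 2:
            bad_records.add(1)                                    # counted on the executors
            return []
        user_id, action = int(parts[0]), parts[1]
        return [(country_by_user.value.get(user_id, "unknown"), action)]

    result = lines.flatMap(parse).countByKey()                    # the action triggers the job
    print(dict(result))        # e.g. {'DE': 1, 'PL': 1, 'UA': 1}
    print(bad_records.value)   # 1 malformed record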