Hadoop Fundamentals
Duration
24 hours
Location
Online
Language
English
Code
EAS-015
€ 600 *
Training for 7-8 or more people? Customize the training to your specific needs
Description
This training provides a foundation in Apache Hadoop concepts and in methods for developing data-processing applications with it. Participants will get acquainted with HDFS, the de facto standard for long-term reliable big data storage; the YARN framework, which manages parallelized execution of applications on a cluster; and the Hadoop ecosystem projects Hive, Spark, and HBase.
After completing the course, a certificate is issued on the Luxoft Training form.
Objectives
- Understand the key concepts and architecture of Hadoop
- Get an idea of the ecosystem that has developed around Hadoop and its key components
- Know how to read data from and write data to HDFS
- Comprehend the MapReduce programming paradigm
- Be able to access tabular data using Hive
- Learn to access tabular data using Spark SQL/DataFrame in batch mode
- Process data streams using Spark Structured Streaming
- Learn to use HBase for low-latency data storage and retrieval
Target Audience
- Software developers
- Software architects
- Database designers
- Database administrators
Prerequisites
- Basic Java programming skills
- Unix/Linux shell familiarity
- Experience with databases is optional
Roadmap
1. Basic concepts of modern data architecture: Lambda
2. External storage: Apache Kafka, Amazon S3, and tools for working with them
3. HDFS: Hadoop Distributed File System
- Architecture, replication, data in/out, HDFS commands
Practice (shell, Hue): connecting to a cluster, working with the file system
4. The MapReduce paradigm, its engines, and its implementation in frameworks
Practice: Launching applications
5. YARN: Distributed application execution management
- YARN architecture, application launch in YARN
Practice: launching applications and monitoring the cluster through the UI
6. Introduction to Hive
- Architecture, Table metadata, File formats, HiveQL query language
Practice (Hue, hive, beeline, Tez UI): creating tables, reading & writing CSV, Parquet, ORC, partitioning, SQL queries with aggregation and joins
7. Introduction to Spark
- DataFrame/SQL, metadata, file formats, data sources, RDD
Practice (Zeppelin, Spark UI): reading from & writing to a database (JDBC), CSV, Parquet, partitioning, SQL queries with aggregation and joins, query execution plans, monitoring
8. Introduction to streaming data processing
- Spark Streaming, Spark Structured Streaming, Flink
Practice: Reading/processing/writing streams between Kafka, a relational database, and the file system
Schedule and prices
Register for the next course
Registering in advance ensures you have priority. We will notify you when we schedule the next course on this topic