Apache Beam is an open source, unified programming model for defining and executing parallel data processing pipelines. Its strength lies in its ability to run both batch and streaming pipelines, with execution carried out by one of Beam's supported distributed processing back-ends: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Apache Beam is useful for ETL (Extract, Transform, and Load) tasks such as moving data between different storage media and data sources, transforming data into a more desirable format, and loading data onto a new system.

In this instructor-led, live training (onsite or remote), participants will learn how to implement the Apache Beam SDKs in a Java or Python application that defines a data processing pipeline for decomposing a big data set into smaller chunks for independent, parallel processing.

By the end of this training, participants will be able to:
- Install and configure Apache Beam.
- Use a single programming model to carry out both batch and stream processing from within their Java or Python application.
- Execute pipelines across multiple environments.

Audience
- Developers

Format of the Course
- Part lecture, part discussion, exercises and heavy hands-on practice

Note
- This course will also be available in Scala in the future. Please contact us to arrange.
Introduction
- Apache Beam vs MapReduce, Spark Streaming, Kafka Streams, Storm, and Flink
Installing and Configuring Apache Beam
Overview of Apache Beam Features and Architecture
- Beam Model, SDKs, Beam Pipeline Runners
- Distributed processing back-ends
Understanding the Apache Beam Programming Model
- How a pipeline is executed
Running a sample pipeline
- Preparing a WordCount pipeline
- Executing the Pipeline locally
Designing a Pipeline
- Planning the structure, choosing the transforms, and determining the input and output methods
Creating the Pipeline
- Writing the driver program and defining the pipeline
- Using Apache Beam classes
- Data sets, transforms, I/O, data encoding, etc.
Executing the Pipeline
- Executing the pipeline locally, on remote machines, and on a public cloud
- Choosing a runner
- Runner-specific configurations
Testing and Debugging Apache Beam
- Using type hints to emulate static typing
- Managing Python Pipeline Dependencies
Processing Bounded and Unbounded Datasets
Making Your Pipelines Reusable and Maintainable
Creating New Data Sources and Sinks
- Apache Beam Source and Sink API
Integrating Apache Beam with other Big Data Systems
- Apache Hadoop, Apache Spark, Apache Kafka
Troubleshooting
Summary and Conclusion