Hadoop is a popular Big Data processing framework. Python is a high-level programming language known for its clear syntax and code readability.
In this instructor-led, live training, participants will learn how to use Python with Hadoop, MapReduce, Pig, and Spark as they step through multiple examples and use cases.
By the end of this training, participants will be able to:
- Understand the basic concepts behind Hadoop, MapReduce, Pig, and Spark
- Use Python with the Hadoop Distributed File System (HDFS), MapReduce, Pig, and Spark
- Use Snakebite to programmatically access HDFS from within Python
- Write MapReduce jobs in Python using mrjob
- Write Spark programs in Python
- Extend the functionality of Pig using Python UDFs
- Manage MapReduce jobs and Pig scripts using Luigi
Audience
Format of the Course
Introduction
Understanding Hadoop's Architecture and Key Concepts
Understanding the Hadoop Distributed File System (HDFS)
- Overview of HDFS and its Architectural Design
- Interacting with HDFS
- Performing Basic File Operations on HDFS
- Overview of HDFS Command Reference
- Overview of Snakebite
- Installing Snakebite
- Using the Snakebite Client Library
- Using the CLI Client
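As a preview of the Snakebite client library covered above, the sketch below lists HDFS entries through a Snakebite-style client. The host and port are placeholders, and the shape of the returned entries (dicts with a `'path'` key) is an assumption based on Snakebite's documented behavior; the helper takes the client as a parameter so it can be exercised without a running cluster.

```python
def list_hdfs_paths(client, path='/'):
    """Collect entry paths from a Snakebite-style client.

    Assumes client.ls() yields one metadata dict per entry,
    each carrying a 'path' key (as Snakebite's Client does).
    """
    return [entry['path'] for entry in client.ls([path])]

# Against a real cluster (host/port below are placeholders --
# point them at your NameNode):
#
#   from snakebite.client import Client
#   client = Client('localhost', 9000)
#   print(list_hdfs_paths(client))
```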
Learning the MapReduce Programming Model with Python
- Overview of the MapReduce Programming Model
- Understanding Data Flow in the MapReduce Framework
- Map
- Shuffle and Sort
- Reduce
- Using the Hadoop Streaming Utility
- Understanding How the Hadoop Streaming Utility Works
- Demo: Implementing the WordCount Application in Python
- Using the mrjob Library
- Overview of mrjob
- Installing mrjob
- Demo: Implementing the WordCount Algorithm Using mrjob
- Understanding How a MapReduce Job Written with the mrjob Library Works
- Executing a MapReduce Application with mrjob
- Hands-on: Computing Top Salaries Using mrjob
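The map, shuffle-and-sort, and reduce phases listed above can be simulated in plain Python with no Hadoop installation. This is a conceptual sketch of the WordCount data flow, not Hadoop Streaming or mrjob code: the `sorted()` call plays the role of the framework's shuffle-and-sort step between map and reduce.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce: sum the counts collected for a single word.
    return (word, sum(counts))

def word_count(lines):
    # Shuffle and sort: group all intermediate pairs by key,
    # as the Hadoop framework does between map and reduce.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return [reducer(word, (c for _, c in group))
            for word, group in groupby(pairs, key=itemgetter(0))]

print(word_count(["the quick fox", "the lazy dog"]))
# → [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```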
Learning Pig with Python
- Overview of Pig
- Demo: Implementing the WordCount Algorithm in Pig
- Configuring and Running Pig Scripts and Pig Statements
- Using the Pig Execution Modes
- Using the Pig Interactive Mode
- Using the Pig Batch Mode
- Understanding the Basic Concepts of the Pig Latin Language
- Using Statements
- Loading Data
- Transforming Data
- Storing Data
- Extending Pig's Functionality with Python UDFs
- Registering a Python UDF File
- Demo: A Simple Python UDF
- Demo: String Manipulation Using Python UDF
- Hands-on: Calculating the 10 Most Recent Movies Using Python UDF
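A minimal Python UDF of the kind registered above might look like the following. The `outputSchema` decorator is supplied by Pig when it loads the file (via `pig_util` for CPython streaming UDFs); the stub here is only so the module also runs standalone, and the function name and script filename are illustrative.

```python
# When Pig loads this file, outputSchema comes from its UDF support;
# the fallback stub lets the module run (and be tested) on its own.
try:
    from pig_util import outputSchema  # provided by Pig's streaming_python
except ImportError:
    def outputSchema(schema):
        def wrap(fn):
            return fn
        return wrap

@outputSchema('title:chararray')
def clean_title(raw):
    """Trim extra whitespace and normalize a movie title to title case."""
    return ' '.join(raw.split()).title()

# In a Pig script you would then register and call it, e.g.:
#   REGISTER 'udfs.py' USING jython AS myfuncs;
#   titles = FOREACH movies GENERATE myfuncs.clean_title(name);
```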
Using Spark and PySpark
- Overview of Spark
- Demo: Implementing the WordCount Algorithm in PySpark
- Overview of PySpark
- Using an Interactive Shell
- Implementing Self-Contained Applications
- Working with Resilient Distributed Datasets (RDDs)
- Creating RDDs from a Python Collection
- Creating RDDs from Files
- Implementing RDD Transformations
- Implementing RDD Actions
- Hands-on: Implementing a Text Search Program for Movie Titles with PySpark
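The text-search exercise above boils down to a `filter` transformation followed by a `collect` action. The function below mirrors that logic in plain Python so it runs without a cluster; the commented PySpark pipeline shows the RDD equivalent (the input file path is a placeholder).

```python
def search_titles(titles, term):
    """Case-insensitive substring search, mirroring an RDD filter()."""
    term = term.lower()
    return [t for t in titles if term in t.lower()]

# The equivalent PySpark pipeline (sketch; 'movies.txt' is a placeholder):
#
#   from pyspark import SparkContext
#   sc = SparkContext('local', 'TitleSearch')
#   hits = (sc.textFile('movies.txt')               # RDD from a file
#             .filter(lambda t: 'star' in t.lower())  # transformation
#             .collect())                            # action
```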
Managing Workflow with Python
- Overview of Apache Oozie and Luigi
- Installing Luigi
- Understanding Luigi Workflow Concepts
- Demo: Examining a Workflow that Implements the WordCount Algorithm
- Working with Hadoop Workflows that Control MapReduce and Pig Jobs
- Using Luigi's Configuration Files
- Working with MapReduce in Luigi
- Working with Pig in Luigi
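Luigi models a workflow as tasks that declare dependencies (`requires`), an output target (`output`), and a body (`run`). The toy scheduler below is not Luigi itself, only a sketch of those three concepts: dependencies run first, and a task whose output already exists is skipped, just as Luigi skips tasks whose `Target.exists()` is true.

```python
def run_workflow(task, done=None):
    """Tiny stand-in for Luigi's scheduler: run dependencies first,
    and skip tasks whose output already exists (idempotent re-runs)."""
    done = done if done is not None else {}
    for dep in task.requires():
        run_workflow(dep, done)
    if task.output() not in done:
        done[task.output()] = task.run(done)
    return done

class FetchText:
    # A Luigi-style task: dependencies, an output target, a run() body.
    def requires(self):
        return []
    def output(self):
        return 'raw.txt'
    def run(self, done):
        return 'the quick fox and the dog'

class CountWords:
    def requires(self):
        return [FetchText()]
    def output(self):
        return 'counts.txt'
    def run(self, done):
        words = done['raw.txt'].split()
        return {w: words.count(w) for w in set(words)}

results = run_workflow(CountWords())
print(results['counts.txt']['the'])  # → 2
```

In real Luigi, `output()` returns a `Target` (e.g. `luigi.LocalTarget`) rather than a string, and the scheduler handles retries and parallelism; the control flow above is the core idea.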
Summary and Conclusion