Introduction to GPU Programming Training
GPU programming is a technique for harnessing the parallel processing power of GPUs to accelerate applications that demand high-performance computing, such as artificial intelligence, gaming, graphics, and scientific computing. Several frameworks and tools enable GPU programming, each with its own strengths and weaknesses. Among the most popular are OpenCL, CUDA, ROCm, and HIP.
This instructor-led, live training (online or onsite) is aimed at beginner-level to intermediate-level developers who wish to learn the fundamentals of GPU programming and the main frameworks and tools for developing GPU applications.
By the end of this training, participants will be able to:
- Understand the difference between CPU and GPU computing, and the benefits and challenges of GPU programming.
- Choose the right framework and tool for their GPU application.
- Create a basic GPU program that performs vector addition using one or more of the frameworks and tools.
- Use the respective APIs, languages, and libraries to query device information, allocate and deallocate device memory, copy data between host and device, launch kernels, and synchronize threads.
- Use the respective memory spaces, such as global, local, constant, and private, to optimize data transfers and memory accesses.
- Use the respective execution models, such as work-items, work-groups, threads, blocks, and grids, to control parallelism.
- Debug and test GPU programs using tools such as CodeXL, CUDA-GDB, CUDA-MEMCHECK, and NVIDIA Nsight.
- Optimize GPU programs using techniques such as coalescing, caching, prefetching, and profiling.
Format of the Course
- Interactive lecture and discussion.
- Lots of exercises and practice.
- Hands-on implementation in a live-lab environment.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Course Outline
Introduction
- What is GPU programming?
- Why use GPU programming?
- What are the challenges and trade-offs of GPU programming?
- What are the frameworks and tools for GPU programming?
- Choosing the right framework and tool for your application
OpenCL
- What is OpenCL?
- What are the advantages and disadvantages of OpenCL?
- Setting up the development environment for OpenCL
- Creating a basic OpenCL program that performs vector addition
- Using the OpenCL API to query device information, allocate and deallocate device memory, copy data between host and device, launch kernels, and synchronize threads
- Using the OpenCL C language to write kernels that execute on the device and manipulate data
- Using OpenCL built-in functions, variables, and libraries to perform common tasks and operations
- Using OpenCL memory spaces, such as global, local, constant, and private, to optimize data transfers and memory accesses
- Using the OpenCL execution model of work-items, work-groups, and ND-ranges to control parallelism
- Debugging and testing OpenCL programs using tools such as CodeXL
- Optimizing OpenCL programs using techniques such as coalescing, caching, prefetching, and profiling
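To make the outline above concrete, here is a sketch of the kind of vector-addition program participants build in this module: host code in C using the OpenCL API, with the kernel written in OpenCL C. It assumes an installed OpenCL SDK and at least one OpenCL device, and omits error checking for brevity, so treat it as an illustrative sketch rather than production code.

```c
/* Minimal OpenCL vector addition (compile with e.g. -lOpenCL). */
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <stdio.h>

/* The kernel, written in OpenCL C: one work-item per element. */
static const char *src =
    "__kernel void vec_add(__global const float *a,\n"
    "                      __global const float *b,\n"
    "                      __global float *c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void) {
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    /* Query a platform and device. */
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

    /* Create a context and command queue. */
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Allocate device buffers; the read-only ones copy host data in. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof a, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof b, b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

    /* Build the program and create the kernel object. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vec_add", NULL);

    /* Set arguments and launch over a 1-D ND-range of N work-items. */
    clSetKernelArg(k, 0, sizeof da, &da);
    clSetKernelArg(k, 1, sizeof db, &db);
    clSetKernelArg(k, 2, sizeof dc, &dc);
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* A blocking read copies the result back and synchronizes. */
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);
    printf("c[0] = %f\n", c[0]); /* expect 3.0 */

    /* Release resources. */
    clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```

Note how the host side walks through exactly the topics listed above: device query, buffer allocation, host-to-device transfer, ND-range kernel launch, and synchronization via the blocking read.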
CUDA
- What is CUDA?
- What are the advantages and disadvantages of CUDA?
- Setting up the development environment for CUDA
- Creating a basic CUDA program that performs vector addition
- Using the CUDA API to query device information, allocate and deallocate device memory, copy data between host and device, launch kernels, and synchronize threads
- Using the CUDA C/C++ language to write kernels that execute on the device and manipulate data
- Using CUDA built-in functions, variables, and libraries to perform common tasks and operations
- Using CUDA memory spaces, such as global, shared, constant, and local, to optimize data transfers and memory accesses
- Using the CUDA execution model of threads, blocks, and grids to control parallelism
- Debugging and testing CUDA programs using tools such as CUDA-GDB, CUDA-MEMCHECK, and NVIDIA Nsight
- Optimizing CUDA programs using techniques such as coalescing, caching, prefetching, and profiling
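The corresponding CUDA vector addition is more compact, since nvcc compiles host and device code together. The sketch below assumes a CUDA-capable GPU and an installed CUDA Toolkit; error checking is again omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // thread/block/grid model
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Query device information.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device: %s\n", prop.name);

    // Host allocations and initialization.
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes),
          *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Allocate device memory and copy host -> device.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch the kernel: one thread per element.
    int block = 256, grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(da, db, dc, n);
    cudaDeviceSynchronize();  // synchronize before reading results

    // Copy device -> host and check one value.
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);  // expect 3.0

    // Deallocate device and host memory.
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

Comparing this with the OpenCL version is itself instructive: CUDA trades portability for a simpler single-source programming model.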
ROCm
- What is ROCm?
- What are the advantages and disadvantages of ROCm?
- Setting up the development environment for ROCm
- Creating a basic ROCm program that performs vector addition
- Using the ROCm API to query device information, allocate and deallocate device memory, copy data between host and device, launch kernels, and synchronize threads
- Using the ROCm C/C++ language to write kernels that execute on the device and manipulate data
- Using ROCm built-in functions, variables, and libraries to perform common tasks and operations
- Using ROCm memory spaces, such as global, local, constant, and private, to optimize data transfers and memory accesses
- Using the ROCm execution model of threads, blocks, and grids to control parallelism
- Debugging and testing ROCm programs using tools such as ROCm Debugger and ROCm Profiler
- Optimizing ROCm programs using techniques such as coalescing, caching, prefetching, and profiling
HIP
- What is HIP?
- What are the advantages and disadvantages of HIP?
- Setting up the development environment for HIP
- Creating a basic HIP program that performs vector addition
- Using the HIP language to write kernels that execute on the device and manipulate data
- Using HIP built-in functions, variables, and libraries to perform common tasks and operations
- Using HIP memory spaces, such as global, shared, constant, and local, to optimize data transfers and memory accesses
- Using the HIP execution model of threads, blocks, and grids to control parallelism
- Debugging and testing HIP programs using tools such as ROCm Debugger and ROCm Profiler
- Optimizing HIP programs using techniques such as coalescing, caching, prefetching, and profiling
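Because HIP deliberately mirrors the CUDA runtime API, the HIP version of vector addition is nearly a mechanical rename of the CUDA one (cudaMalloc becomes hipMalloc, and so on). A sketch, assuming the ROCm/HIP toolchain is installed and the program is compiled with hipcc:

```cpp
#include <cstdio>
#include <hip/hip_runtime.h>

// Kernel syntax is identical to CUDA: one thread per element.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = new float[n], *hb = new float[n], *hc = new float[n];
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Allocate device memory and copy host -> device.
    float *da, *db, *dc;
    hipMalloc(&da, bytes); hipMalloc(&db, bytes); hipMalloc(&dc, bytes);
    hipMemcpy(da, ha, bytes, hipMemcpyHostToDevice);
    hipMemcpy(db, hb, bytes, hipMemcpyHostToDevice);

    // Launch with the same threads/blocks/grid model as CUDA.
    int block = 256, grid = (n + block - 1) / block;
    vec_add<<<grid, block>>>(da, db, dc, n);
    hipDeviceSynchronize();

    // Copy the result back and check one value.
    hipMemcpy(hc, dc, bytes, hipMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);  // expect 3.0

    hipFree(da); hipFree(db); hipFree(dc);
    delete[] ha; delete[] hb; delete[] hc;
    return 0;
}
```

This near-identity is HIP's main selling point: the same source can target AMD GPUs through ROCm or NVIDIA GPUs through a CUDA backend.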
Comparison
- Comparing the features, performance, and compatibility of OpenCL, CUDA, ROCm, and HIP
- Evaluating GPU programs using benchmarks and metrics
- Learning best practices and tips for GPU programming
- Exploring current and future trends and challenges in GPU programming
Summary and Next Steps
Requirements
- An understanding of the C/C++ language and parallel programming concepts
- Basic knowledge of computer architecture and memory hierarchy
- Experience with command-line tools and code editors
Audience
- Developers who wish to learn the fundamentals of GPU programming and the main frameworks and tools for developing GPU applications
- Developers who wish to write portable and scalable code that runs on different platforms and devices
- Programmers who wish to explore the benefits and challenges of GPU programming and optimization
Related Courses
Developing AI Applications with Huawei Ascend and CANN
21 hours
Huawei Ascend is a family of AI processors designed for high-performance inference and training.
This instructor-led, live training (online or onsite) is aimed at intermediate-level AI engineers and data scientists who wish to develop and optimize neural network models using Huawei’s Ascend platform and the CANN toolkit.
By the end of this training, participants will be able to:
- Set up and configure the CANN development environment.
- Develop AI applications using MindSpore and CloudMatrix workflows.
- Optimize performance on Ascend NPUs using custom operators and tiling.
- Deploy models to edge or cloud environments.
Format of the Course
- Interactive lecture and discussion.
- Hands-on use of Huawei Ascend and CANN toolkit in sample applications.
- Guided exercises focused on model building, training, and deployment.
Course Customization Options
- To request a customized training for this course based on your infrastructure or datasets, please contact us to arrange.
Deploying AI Models with CANN and Ascend AI Processors
14 hours
CANN (Compute Architecture for Neural Networks) is Huawei’s AI compute stack for deploying and optimizing AI models on Ascend AI processors.
This instructor-led, live training (online or onsite) is aimed at intermediate-level AI developers and engineers who wish to deploy trained AI models efficiently to Huawei Ascend hardware using the CANN toolkit and tools such as MindSpore, TensorFlow, or PyTorch.
By the end of this training, participants will be able to:
- Understand the CANN architecture and its role in the AI deployment pipeline.
- Convert and adapt models from popular frameworks to Ascend-compatible formats.
- Use tools like ATC, OM model conversion, and MindSpore for edge and cloud inference.
- Diagnose deployment issues and optimize performance on Ascend hardware.
Format of the Course
- Interactive lecture and demonstration.
- Hands-on lab work using CANN tools and Ascend simulators or devices.
- Practical deployment scenarios based on real-world AI models.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
GPU Programming on Biren AI Accelerators
21 hours
Biren AI Accelerators are high-performance GPUs designed for AI and HPC workloads with support for large-scale training and inference.
This instructor-led, live training (online or onsite) is aimed at intermediate-level to advanced-level developers who wish to program and optimize applications using Biren’s proprietary GPU stack, with practical comparisons to CUDA-based environments.
By the end of this training, participants will be able to:
- Understand Biren GPU architecture and memory hierarchy.
- Set up the development environment and use Biren’s programming model.
- Translate and optimize CUDA-style code for Biren platforms.
- Apply performance tuning and debugging techniques.
Format of the Course
- Interactive lecture and discussion.
- Hands-on use of Biren SDK in sample GPU workloads.
- Guided exercises focused on porting and performance tuning.
Course Customization Options
- To request a customized training for this course based on your application stack or integration needs, please contact us to arrange.
Cambricon MLU Development with BANGPy and Neuware
21 hours
Cambricon MLUs (Machine Learning Units) are AI chips optimized for inference and training in edge and data center scenarios.
This instructor-led, live training (online or onsite) is aimed at intermediate-level developers who wish to build and deploy AI models on Cambricon MLU hardware using the BANGPy framework and the Neuware SDK.
By the end of this training, participants will be able to:
- Set up and configure the BANGPy and Neuware development environments.
- Develop and optimize Python- and C++-based models for Cambricon MLUs.
- Deploy models to edge and data center devices running the Neuware runtime.
- Integrate machine learning workflows with MLU-specific acceleration features.
Format of the Course
- Interactive lecture and discussion.
- Hands-on use of BANGPy and Neuware for development and deployment.
- Guided exercises focused on optimization, integration, and testing.
Course Customization Options
- To request a customized training for this course based on your Cambricon device model or use case, please contact us to arrange.
Introduction to CANN for AI Framework Developers
7 hours
CANN (Compute Architecture for Neural Networks) is Huawei’s AI computing toolkit used to compile, optimize, and deploy AI models on Ascend AI processors.
This instructor-led, live training (online or onsite) is aimed at beginner-level AI developers who wish to understand how CANN fits into the model lifecycle from training to deployment, and how it works with frameworks like MindSpore, TensorFlow, and PyTorch.
By the end of this training, participants will be able to:
- Understand the purpose and architecture of the CANN toolkit.
- Set up a development environment with CANN and MindSpore.
- Convert and deploy a simple AI model to Ascend hardware.
- Gain foundational knowledge for future CANN optimization or integration projects.
Format of the Course
- Interactive lecture and discussion.
- Hands-on labs with simple model deployment.
- Step-by-step walkthrough of the CANN toolchain and integration points.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
CANN for Edge AI Deployment
14 hours
Huawei's Ascend CANN toolkit enables powerful AI inference on edge devices such as the Ascend 310. CANN provides essential tools for compiling, optimizing, and deploying models where compute and memory are constrained.
This instructor-led, live training (online or onsite) is aimed at intermediate-level AI developers and integrators who wish to deploy and optimize models on Ascend edge devices using the CANN toolchain.
By the end of this training, participants will be able to:
- Prepare and convert AI models for Ascend 310 using CANN tools.
- Build lightweight inference pipelines using MindSpore Lite and AscendCL.
- Optimize model performance for limited compute and memory environments.
- Deploy and monitor AI applications in real-world edge use cases.
Format of the Course
- Interactive lecture and demonstration.
- Hands-on lab work with edge-specific models and scenarios.
- Live deployment examples on virtual or physical edge hardware.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Understanding Huawei’s AI Compute Stack: From CANN to MindSpore
14 hours
Huawei’s AI stack — from the low-level CANN SDK to the high-level MindSpore framework — offers a tightly integrated AI development and deployment environment optimized for Ascend hardware.
This instructor-led, live training (online or onsite) is aimed at beginner-level to intermediate-level technical professionals who wish to understand how the CANN and MindSpore components work together to support AI lifecycle management and infrastructure decisions.
By the end of this training, participants will be able to:
- Understand the layered architecture of Huawei’s AI compute stack.
- Identify how CANN supports model optimization and hardware-level deployment.
- Evaluate the MindSpore framework and toolchain in relation to industry alternatives.
- Position Huawei's AI stack within enterprise or cloud/on-prem environments.
Format of the Course
- Interactive lecture and discussion.
- Live system demos and case-based walkthroughs.
- Optional guided labs on model flow from MindSpore to CANN.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Optimizing Neural Network Performance with CANN SDK
14 hours
The CANN SDK (Compute Architecture for Neural Networks) is Huawei’s AI compute foundation that allows developers to fine-tune and optimize the performance of deployed neural networks on Ascend AI processors.
This instructor-led, live training (online or onsite) is aimed at advanced-level AI developers and system engineers who wish to optimize inference performance using CANN’s advanced toolset, including the Graph Engine, TIK, and custom operator development.
By the end of this training, participants will be able to:
- Understand CANN's runtime architecture and performance lifecycle.
- Use profiling tools and Graph Engine for performance analysis and optimization.
- Create and optimize custom operators using TIK and TVM.
- Resolve memory bottlenecks and improve model throughput.
Format of the Course
- Interactive lecture and discussion.
- Hands-on labs with real-time profiling and operator tuning.
- Optimization exercises using edge-case deployment examples.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
CANN SDK for Computer Vision and NLP Pipelines
14 hours
The CANN SDK (Compute Architecture for Neural Networks) provides powerful deployment and optimization tools for real-time AI applications in computer vision and NLP, especially on Huawei Ascend hardware.
This instructor-led, live training (online or onsite) is aimed at intermediate-level AI practitioners who wish to build, deploy, and optimize vision and language models using the CANN SDK for production use cases.
By the end of this training, participants will be able to:
- Deploy and optimize CV and NLP models using CANN and AscendCL.
- Use CANN tools to convert models and integrate them into live pipelines.
- Optimize inference performance for tasks like detection, classification, and sentiment analysis.
- Build real-time CV/NLP pipelines for edge or cloud-based deployment scenarios.
Format of the Course
- Interactive lecture and demonstration.
- Hands-on lab with model deployment and performance profiling.
- Live pipeline design using real CV and NLP use cases.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Building Custom AI Operators with CANN TIK and TVM
14 hours
CANN TIK (Tensor Instruction Kernel) and Apache TVM enable advanced optimization and customization of AI model operators for Huawei Ascend hardware.
This instructor-led, live training (online or onsite) is aimed at advanced-level system developers who wish to build, deploy, and tune custom operators for AI models using CANN’s TIK programming model and TVM compiler integration.
By the end of this training, participants will be able to:
- Write and test custom AI operators using the TIK DSL for Ascend processors.
- Integrate custom ops into the CANN runtime and execution graph.
- Use TVM for operator scheduling, auto-tuning, and benchmarking.
- Debug and optimize instruction-level performance for custom computation patterns.
Format of the Course
- Interactive lecture and demonstration.
- Hands-on coding of operators using TIK and TVM pipelines.
- Testing and tuning on Ascend hardware or simulators.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Migrating CUDA Applications to Chinese GPU Architectures
21 hours
Chinese GPU architectures such as Huawei Ascend, Biren, and Cambricon MLU offer CUDA alternatives tailored to the domestic AI and HPC markets.
This instructor-led, live training (online or onsite) is aimed at advanced-level GPU programmers and infrastructure specialists who wish to migrate and optimize existing CUDA applications for deployment on Chinese hardware platforms.
By the end of this training, participants will be able to:
- Evaluate the compatibility of existing CUDA workloads with Chinese chip alternatives.
- Port CUDA codebases to Huawei CANN, Biren SDK, and Cambricon BANGPy environments.
- Compare performance and identify optimization opportunities across platforms.
- Address practical challenges in cross-architecture support and deployment.
Format of the Course
- Interactive lecture and discussion.
- Hands-on code translation and performance comparison labs.
- Guided exercises focused on multi-GPU adaptation strategies.
Course Customization Options
- To request a customized training based on your platform or CUDA project, please contact us to arrange.
Performance Optimization on Ascend, Biren, and Cambricon
21 hours
Ascend, Biren, and Cambricon are leading AI hardware platforms in China, each offering unique acceleration and profiling tools for production-scale AI workloads.
This instructor-led, live training (online or onsite) is aimed at advanced-level AI infrastructure and performance engineers who wish to optimize model inference and training workflows across multiple Chinese AI chip platforms.
By the end of this training, participants will be able to:
- Benchmark models on Ascend, Biren, and Cambricon platforms.
- Identify system bottlenecks and memory/compute inefficiencies.
- Apply graph-level, kernel-level, and operator-level optimizations.
- Tune deployment pipelines to improve throughput and reduce latency.
Format of the Course
- Interactive lecture and discussion.
- Hands-on use of profiling and optimization tools on each platform.
- Guided exercises focused on real-world tuning scenarios.
Course Customization Options
- To request a customized training for this course based on your performance environment or model type, please contact us to arrange.