My One-Month Journey into GPU Programming
For the past month, I have embarked on an intense and rewarding journey into the world of GPU programming. Starting from scratch, I dedicated myself to learning the intricacies of parallel computation and harnessing the power of GPUs for a wide range of applications. This blog post chronicles my experiences, challenges, and key takeaways from this exciting adventure.
The Initial Spark
My motivation stemmed from a desire to delve into the heart of modern computing—understanding how to leverage the massive parallelism offered by GPUs. With the guidance of my mentor (hkproj) and inspiration from fellow learners like 1y33, I committed to a 100-day challenge, pushing myself to code and learn something new every single day.
Learning CUDA and Expanding to AMD
While CUDA was my primary tool, I also focused on ensuring compatibility with AMD GPUs. This broadened my understanding of different GPU architectures and programming models. By Day 30, I had optimized multiple kernels for AMD hardware, testing them on an AMD MI250 with 128 cores per node and 1TB of memory.
Some key kernel implementations included:
- Vector addition
- Matrix-vector multiplication
- GELU activation
- Layer normalization
- Matrix transpose
- 2D convolution
- Flash Attention (forward and backward passes)
- Prefix sum and partial sum
- Parallel merge
- Sparse matrix-vector multiplication
- RoPE (Rotary Position Embedding)
- Matrix addition using rocBLAS
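The prefix-sum kernels were among the most instructive: they follow the classic work-efficient (Blelloch) up-sweep/down-sweep pattern. Here is a simplified CPU sketch of that pattern (not my actual kernel code, and assuming a power-of-two array length) — on the GPU, each inner-loop iteration runs as a separate thread:

```c
/* Work-efficient exclusive prefix sum (Blelloch scan), CPU sketch.
   On the GPU each iteration of the inner loops runs as a parallel thread;
   n is assumed to be a power of two. */
void exclusive_scan(int *data, int n) {
    /* Up-sweep (reduce): build partial sums in a balanced tree. */
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 2 * stride - 1; i < n; i += 2 * stride)
            data[i] += data[i - stride];

    data[n - 1] = 0; /* clear the root before the down-sweep */

    /* Down-sweep: distribute partial sums back down the tree. */
    for (int stride = n / 2; stride >= 1; stride /= 2)
        for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
            int t = data[i - stride];
            data[i - stride] = data[i];
            data[i] += t;
        }
}
```

Both phases do O(n) total work, which is what makes this formulation "work-efficient" compared to the naive O(n log n) scan.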
This hands-on experience provided valuable insights into optimizing performance across different GPU architectures.
Challenges and Triumphs
Like any learning journey, this one wasn’t without hurdles. Some of the biggest challenges included:
- Understanding grids, blocks, and threads: Mapping parallel computations efficiently took time.
- Memory management: Properly handling device memory (cudaMalloc, cudaMemcpy, cudaFree) was critical.
- Debugging race conditions: Tracking down synchronization issues in CUDA kernels required patience.
- Cross-platform compatibility: Adapting code for both NVIDIA and AMD GPUs was more complex than expected.
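The grid/block/thread mapping clicked for me once I thought of it as two nested loops. This simplified CPU emulation of a vector-add kernel (illustrative only) shows the canonical global-index formula and the bounds check that guards the last, partially filled block:

```c
/* CPU emulation of the canonical CUDA 1-D index mapping:
   global index = blockIdx.x * blockDim.x + threadIdx.x.
   The two loops stand in for the grid of blocks and the threads per block. */
void vec_add_emulated(const float *a, const float *b, float *c,
                      int n, int block_dim) {
    int grid_dim = (n + block_dim - 1) / block_dim; /* ceil-divide */
    for (int block_idx = 0; block_idx < grid_dim; block_idx++)
        for (int thread_idx = 0; thread_idx < block_dim; thread_idx++) {
            int i = block_idx * block_dim + thread_idx;
            if (i < n)          /* bounds check: last block may overhang */
                c[i] = a[i] + b[i];
        }
}
```

Forgetting that `if (i < n)` guard was the source of more than one of my early out-of-bounds bugs.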
However, the triumphs made it all worthwhile:
- Massive speedups: Seeing algorithms run orders of magnitude faster due to parallelization was exhilarating.
- Optimized performance: Using shared memory, tiling, and coalesced access led to significant efficiency gains.
- Successfully running complex kernels: Implementing Flash Attention and sparse matrix operations felt like major victories.
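Tiling was the single biggest optimization win. The idea is easiest to see in matrix transpose: process the matrix in small square tiles so that both the reads and the writes stay within contiguous regions (coalesced through shared memory on a GPU, cache-friendly on a CPU). A simplified CPU analogue, with an illustrative tile size:

```c
/* Tiled matrix transpose, CPU analogue of the shared-memory GPU version.
   Processing TILE x TILE blocks keeps both the reads from `in` and the
   writes to `out` within small contiguous regions. TILE = 4 is illustrative;
   GPU kernels typically use 16 or 32 to match the warp size. */
#define TILE 4

void transpose_tiled(const float *in, float *out, int rows, int cols) {
    for (int br = 0; br < rows; br += TILE)
        for (int bc = 0; bc < cols; bc += TILE)
            for (int r = br; r < br + TILE && r < rows; r++)
                for (int c = bc; c < bc + TILE && c < cols; c++)
                    out[c * rows + r] = in[r * cols + c];
}
```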
Key Projects and Learnings
Throughout the month, I tackled a variety of projects that deepened my understanding of GPU programming:
Fundamental Algorithms
- Implemented vector/matrix addition, matrix multiplication, and matrix transpose.
- Focused on optimizing these implementations for both NVIDIA and AMD GPUs.
Convolutional Neural Networks (CNNs)
- Developed CNN implementations in CUDA, including forward and backward passes.
- Explored deep learning acceleration on different hardware architectures.
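The heart of the CNN forward pass is the 2-D convolution itself. In the GPU version each output pixel is computed by one thread; this simplified single-channel sketch (zero padding, odd kernel width — not my full CNN code) shows what that per-thread computation looks like:

```c
/* Direct 2-D convolution with zero padding ("same" output size).
   On the GPU, each output pixel (y, x) is computed by one thread;
   here the two outer loops play that role. k is the (odd) kernel width. */
void conv2d_same(const float *img, const float *kernel, float *out,
                 int h, int w, int k) {
    int r = k / 2; /* kernel radius */
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            float acc = 0.0f;
            for (int ky = 0; ky < k; ky++)
                for (int kx = 0; kx < k; kx++) {
                    int iy = y + ky - r, ix = x + kx - r;
                    if (iy >= 0 && iy < h && ix >= 0 && ix < w)
                        acc += img[iy * w + ix] * kernel[ky * k + kx];
                }
            out[y * w + x] = acc;
        }
}
```

On the GPU, the big win comes from loading the input tile (plus its halo of padding pixels) into shared memory before the inner loops run.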
Flash Attention
- Implemented both forward and backward passes for Flash Attention.
- Aimed for portability across different GPU architectures.
Sparse Matrix-Vector Multiplication (SpMV)
- Optimized SpMV using a hybrid ELL-COO approach.
- Addressed the challenges of handling sparse data efficiently on GPUs.
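The hybrid idea is simple: ELL stores up to a fixed number of nonzeros per row in padded, column-major arrays (regular and coalesced, ideal for one-thread-per-row), and any overflow spills into COO triplets handled separately. A simplified CPU sketch of that split — the layout details here are illustrative, not my exact format:

```c
/* Hybrid ELL-COO sparse matrix-vector product y = A*x (sketch).
   ELL holds up to ell_width nonzeros per row in column-major padded arrays
   (padding marked by col = -1); rows with more nonzeros spill the excess
   into COO triplets. On a GPU the ELL loop maps one thread per row with
   coalesced accesses; the COO loop uses atomic adds. */
void spmv_ell_coo(int n_rows, int ell_width,
                  const int *ell_col, const float *ell_val,
                  int coo_nnz, const int *coo_row,
                  const int *coo_col, const float *coo_val,
                  const float *x, float *y) {
    for (int r = 0; r < n_rows; r++) {            /* ELL: one thread per row */
        float acc = 0.0f;
        for (int j = 0; j < ell_width; j++) {
            int c = ell_col[j * n_rows + r];      /* column-major for coalescing */
            if (c >= 0)
                acc += ell_val[j * n_rows + r] * x[c];
        }
        y[r] = acc;
    }
    for (int k = 0; k < coo_nnz; k++)             /* COO: atomicAdd on a GPU */
        y[coo_row[k]] += coo_val[k] * x[coo_col[k]];
}
```

Choosing `ell_width` is the tuning knob: too wide wastes memory on padding, too narrow pushes most of the work into the irregular COO path.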
Monte Carlo Tree Search (MCTS)
- Parallelized the rollout phase of MCTS to explore GPU applications in game AI.
- Focused on cross-platform compatibility.
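The rollout phase parallelizes naturally because each playout is independent: give every thread its own RNG stream, run the playouts, then reduce the results. A toy CPU sketch of that structure (the playout itself is a stand-in coin-flip game, not a real game engine):

```c
#include <stdlib.h>

/* Toy parallelized-rollout sketch: each simulated "thread" runs an
   independent random playout with its own RNG state (rand_r), and the
   outcomes are averaged. On a GPU this is one rollout per thread with a
   per-thread RNG, followed by a parallel reduction. */
double mean_rollout_value(int n_rollouts, unsigned base_seed) {
    long wins = 0;
    for (int t = 0; t < n_rollouts; t++) {  /* one iteration per GPU thread */
        unsigned seed = base_seed + t;      /* independent per-thread stream */
        /* toy playout: 10 coin flips, "win" if more than half succeed */
        int score = 0;
        for (int step = 0; step < 10; step++)
            score += rand_r(&seed) & 1;
        wins += (score > 5);
    }
    return (double)wins / n_rollouts;
}
```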
Other Implemented Algorithms
- Breadth-First Search
- Merge Sort
- Expectation-Maximization (EM)
- Stochastic Gradient Descent (SGD)
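BFS in particular needs a different formulation on a GPU than the textbook queue-based version: a level-synchronous, frontier-based traversal, where every vertex in the current frontier is expanded in parallel. A simplified CPU sketch over a CSR graph (on a GPU, the visited check becomes an atomic compare-and-swap):

```c
#include <stdlib.h>

/* Level-synchronous (frontier-based) BFS, the formulation that maps to a
   GPU: each level expands every frontier vertex in parallel. The graph is
   in CSR form (row_ptr/col_idx); dist[v] = -1 marks unvisited. */
void bfs_frontier(int n, const int *row_ptr, const int *col_idx,
                  int src, int *dist) {
    int *frontier = malloc(n * sizeof(int));
    int *next = malloc(n * sizeof(int));
    for (int v = 0; v < n; v++) dist[v] = -1;
    dist[src] = 0;
    frontier[0] = src;
    int f_size = 1, level = 0;
    while (f_size > 0) {
        int n_size = 0;
        for (int i = 0; i < f_size; i++) {        /* one thread per frontier vertex */
            int u = frontier[i];
            for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
                int v = col_idx[e];
                if (dist[v] == -1) {              /* atomicCAS on a GPU */
                    dist[v] = level + 1;
                    next[n_size++] = v;
                }
            }
        }
        int *tmp = frontier; frontier = next; next = tmp;
        f_size = n_size;
        level++;
    }
    free(frontier);
    free(next);
}
```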
I also explored optimized libraries such as cuBLAS, rocBLAS, and cuDNN, which significantly simplified complex computations on NVIDIA and AMD architectures.
Tools and Resources
My learning was significantly aided by the following resources:
- "Programming Massively Parallel Processors" (PMPP) Book: My primary guide to understanding CUDA and parallel computing.
- Online Communities: Engaging with forums helped me troubleshoot issues and gain insights from experienced developers.
- Profiling Tools: NVIDIA Visual Profiler and AMD ROCm Profiler were invaluable for identifying performance bottlenecks.
Looking Ahead
This first month has laid a strong foundation. Going forward, I plan to:
- Improve cross-platform GPU programming: Develop kernels that run efficiently on both NVIDIA and AMD GPUs.
- Explore advanced topics: Investigate CUDA streams, dynamic parallelism, and more complex algorithms.
- Optimize performance further: Experiment with new techniques to maximize GPU resource utilization.
My journey is documented on GitHub (a-hamdi-cuda), where you can find all my code and track my progress. I also share my learnings on my blog (https://hamdi.bearblog.dev/) so feel free to check it out!
This is just the beginning of my GPU programming adventure—I’m eager to continue pushing the boundaries of parallel computation.