My 20-Day Journey Learning CUDA from Scratch
Over the past 20 days, I embarked on an incredible learning journey to master CUDA, NVIDIA's platform for parallel programming on GPUs. Below, I document my progress, highlights, and key insights gained throughout this experience.
Day 1: Vector Addition
I kicked off my journey by implementing vector addition using a simple CUDA program. This involved:
- Writing my first CUDA kernel.
- Understanding the grid, block, and thread hierarchy.
- Allocating and managing device memory with cudaMalloc, cudaMemcpy, and cudaFree.
Reading: Chapter 1 of the PMPP book (Programming Massively Parallel Processors) provided an introduction to parallel programming, the CUDA architecture, and the GPU execution model.
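A minimal sketch of that first program, with illustrative sizes and no error checking, looks roughly like this:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;       // one thread per element
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);                  // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}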
Day 2: Matrix Addition
I extended my knowledge to matrix addition by designing a grid and block layout for 2D matrices. Key learning points included:
- Thread indexing in 2D grids.
- Techniques for synchronizing threads to prevent race conditions.
Reading: Chapter 2 of the PMPP book on GPU scalability and massive parallelism.
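The core of that day was 2D indexing. A sketch of the kernel and an illustrative launch configuration (a 16x16 block is an assumption, not a requirement):

__global__ void matrixAdd(const float* A, const float* B, float* C, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x dimension indexes columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y dimension indexes rows
    if (row < rows && col < cols) {
        int idx = row * cols + col;                    // row-major flattening
        C[idx] = A[idx] + B[idx];
    }
}

// Launch with a 2D grid that covers the whole matrix, e.g.:
// dim3 block(16, 16);
// dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
// matrixAdd<<<grid, block>>>(d_A, d_B, d_C, rows, cols);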
Day 3: Matrix-Vector Multiplication
I implemented matrix-vector multiplication, where each thread computed the dot product between a row of the matrix and the vector. Key learning point:
- Improving efficiency through the use of shared memory.
Reading: Half of Chapter 3, focusing on scalable parallel execution.
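A sketch of the idea: one thread per row, with the vector staged through shared memory in tiles so every thread in a block reuses it (assumes blockDim.x equals the tile size):

#define TILE 256

__global__ void matVecMul(const float* A, const float* x, float* y, int rows, int cols) {
    __shared__ float xTile[TILE];
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;

    // Walk over the vector in tiles; each tile is loaded once per block and reused.
    for (int t = 0; t < cols; t += TILE) {
        int idx = t + threadIdx.x;
        xTile[threadIdx.x] = (idx < cols) ? x[idx] : 0.0f;
        __syncthreads();
        if (row < rows) {
            int limit = min(TILE, cols - t);
            for (int j = 0; j < limit; ++j)
                sum += A[row * cols + t + j] * xTile[j];
        }
        __syncthreads();
    }
    if (row < rows) y[row] = sum;
}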
Day 4: Parallel Reduction
Working on computing partial sums, I learned about tree-based reduction algorithms, emphasizing:
- Minimizing warp divergence.
Reading: Finished Chapter 3, exploring resource assignment and latency tolerance.
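A sketch of a tree-based block reduction with sequential addressing, which keeps the active threads packed into contiguous warps and so minimizes divergence (assumes a block size of 256):

__global__ void reduceSum(const float* in, float* out, int n) {
    __shared__ float sdata[256];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x * 2 + threadIdx.x;  // each thread loads two elements

    float sum = 0.0f;
    if (i < n) sum = in[i];
    if (i + blockDim.x < n) sum += in[i + blockDim.x];
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = sdata[0];   // one partial sum per block
}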
Day 5: Layer Normalization
I tackled implementing layer normalization, a critical operation in deep learning, and focused on:
- Computing mean and variance in parallel.
Reading: Chapter 4, delving into memory optimizations and performance tuning.
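A sketch of the parallel mean/variance computation, one block per row, using a shared-memory reduction (the learnable scale and bias are omitted, and blockDim.x = 256 is assumed):

__global__ void layerNorm(const float* in, float* out, int cols, float eps) {
    __shared__ float sSum[256];
    __shared__ float sSqSum[256];
    int row = blockIdx.x;
    int tid = threadIdx.x;

    // Each thread accumulates a strided partial sum and sum of squares.
    float sum = 0.0f, sqSum = 0.0f;
    for (int j = tid; j < cols; j += blockDim.x) {
        float v = in[row * cols + j];
        sum += v;
        sqSum += v * v;
    }
    sSum[tid] = sum;
    sSqSum[tid] = sqSum;
    __syncthreads();

    // Tree reduction over the block for both statistics.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sSum[tid] += sSum[tid + s];
            sSqSum[tid] += sSqSum[tid + s];
        }
        __syncthreads();
    }

    float mean = sSum[0] / cols;
    float var = sSqSum[0] / cols - mean * mean;
    float invStd = rsqrtf(var + eps);

    for (int j = tid; j < cols; j += blockDim.x)
        out[row * cols + j] = (in[row * cols + j] - mean) * invStd;
}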
Day 6: Matrix Transposition
In this phase, I optimized matrix transposition by leveraging shared memory to reduce global memory accesses.
Reading: Chapter 5, where I learned about performance considerations and advanced shared memory usage.
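A sketch of the shared-memory tiled transpose; the extra column of padding avoids shared-memory bank conflicts (assumes 32x32 thread blocks):

#define TILE_DIM 32

__global__ void transpose(const float* in, float* out, int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();

    // Swap the block indices so the write is also coalesced.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}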
Day 7: One-Dimensional Convolution
I implemented both simple and tiled versions of 1D convolution:
- Optimizing memory access patterns to cut down on redundant global memory accesses.
Reading: Chapter 7 on convolution techniques.
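A sketch of the simple version, with the convolution mask kept in constant memory and out-of-range neighbors treated as zero (the maximum mask width is illustrative):

#define MAX_MASK_WIDTH 7
__constant__ float d_mask[MAX_MASK_WIDTH];   // filled on the host with cudaMemcpyToSymbol

__global__ void conv1D(const float* in, float* out, int n, int maskWidth) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float sum = 0.0f;
    int start = i - maskWidth / 2;              // center the mask on element i
    for (int j = 0; j < maskWidth; ++j) {
        int idx = start + j;
        if (idx >= 0 && idx < n)                // ghost elements contribute zero
            sum += in[idx] * d_mask[j];
    }
    out[i] = sum;
}

// Host side: cudaMemcpyToSymbol(d_mask, h_mask, maskWidth * sizeof(float));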
Day 8: Prefix Sum
I implemented the Brent-Kung algorithm for parallel prefix sum computations, learning about:
- Hierarchical scan algorithms and thread synchronization.
Reading: Chapter 8 focusing on parallel patterns for prefix sum.
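A sketch of a single-block Brent-Kung inclusive scan, with the up-sweep (reduction) and down-sweep (distribution) phases in shared memory; it assumes the input fits in one block of 2 * BLOCK_SIZE elements:

#define BLOCK_SIZE 256   // one block scans up to 2 * BLOCK_SIZE elements

__global__ void brentKungScan(const float* in, float* out, int n) {
    __shared__ float temp[2 * BLOCK_SIZE];
    int tid = threadIdx.x;

    // Each thread loads two elements; out-of-range slots are padded with zero.
    temp[2 * tid]     = (2 * tid < n)     ? in[2 * tid]     : 0.0f;
    temp[2 * tid + 1] = (2 * tid + 1 < n) ? in[2 * tid + 1] : 0.0f;

    // Up-sweep (reduction) phase.
    for (int stride = 1; stride <= BLOCK_SIZE; stride *= 2) {
        __syncthreads();
        int idx = (tid + 1) * 2 * stride - 1;
        if (idx < 2 * BLOCK_SIZE) temp[idx] += temp[idx - stride];
    }

    // Down-sweep (distribution) phase.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride /= 2) {
        __syncthreads();
        int idx = (tid + 1) * 2 * stride - 1;
        if (idx + stride < 2 * BLOCK_SIZE) temp[idx + stride] += temp[idx];
    }
    __syncthreads();

    // Write back the inclusive prefix sums.
    if (2 * tid < n)     out[2 * tid]     = temp[2 * tid];
    if (2 * tid + 1 < n) out[2 * tid + 1] = temp[2 * tid + 1];
}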
Days 9-10: Flash Attention
I developed a forward pass for Flash Attention but faced challenges with numerical stability, and I optimized it further on Day 10.
And I made a trending post about it!
Reading: Explored the Flash Attention paper, deepening my understanding of attention mechanisms.
Day 11: Sparse Matrix-Vector Multiplication
I created an optimized sparse matrix-vector multiplication algorithm and wrote a benchmarking script to compare performance against PyTorch.
Reading: Chapter 10 on sparse matrix computations.
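A sketch of a row-per-thread SpMV over the CSR format (the storage layout is spelled out in the comments; my actual kernel and benchmark had more to them):

// CSR layout: rowPtr has rows + 1 entries; colIdx and vals hold the nonzeros,
// with row r occupying positions rowPtr[r] .. rowPtr[r+1] - 1.
__global__ void spmvCsr(const int* rowPtr, const int* colIdx, const float* vals,
                        const float* x, float* y, int rows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows) {
        float dot = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            dot += vals[j] * x[colIdx[j]];
        y[row] = dot;
    }
}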
Day 12: Merge Sort
I implemented the Merge Sort algorithm using a parallel approach, gaining insight into how two sorted arrays can be merged efficiently in parallel.
Reading: Chapter 11, focusing on merge sort parallelization strategies.
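One way to merge two sorted arrays in parallel, sketched below, is to have each thread compute its element's final rank with a binary search over the other array and write it straight into place (this is an illustrative scheme, not necessarily the co-rank approach from the chapter):

// Number of elements of arr strictly less than v (strict = true),
// or less than or equal to v (strict = false), found by binary search.
__device__ int rankOf(const float* arr, int len, float v, bool strict) {
    int lo = 0, hi = len;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        bool goRight = strict ? (arr[mid] < v) : (arr[mid] <= v);
        if (goRight) lo = mid + 1; else hi = mid;
    }
    return lo;
}

// Each thread places one element of A or B directly at its final slot in C.
__global__ void parallelMerge(const float* A, int n, const float* B, int m, float* C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // A[i] goes before any equal elements of B (keeps the merge stable).
        C[i + rankOf(B, m, A[i], true)] = A[i];
    } else if (i < n + m) {
        int j = i - n;
        // B[j] goes after any equal elements of A.
        C[j + rankOf(A, n, B[j], false)] = B[j];
    }
}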
Day 13: BFS and GELU Activation
I explored more advanced algorithms, implementing an optimized breadth-first search kernel and the GELU activation function, which is essential in modern neural networks.
Reading: Chapters 12 and 13, enhancing my understanding of graph algorithms and dynamic parallelism.
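For the GELU part, a sketch of an elementwise kernel using the common tanh approximation:

__global__ void gelu(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        // GELU(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
        float inner = 0.7978845608f * (x + 0.044715f * x * x * x);
        out[i] = 0.5f * x * (1.0f + tanhf(inner));
    }
}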
Day 14: MRI Reconstruction
I worked on the FHD algorithm for non-Cartesian MRI reconstruction, combining theoretical knowledge with practical application in medical imaging.
Reading: Chapter 14, studying iterative reconstruction techniques.
Day 15: Flash Attention Backpropagation
I implemented the backpropagation step for Flash Attention, tackling gradient calculations and optimizing memory usage.
Reading: Chapters 15-17 on application case studies in molecular visualization and machine learning.
Day 16: Naive Bayes Classifier
I created a CUDA-accelerated Naive Bayes classifier, optimizing the training process to efficiently handle feature probabilities.
Reading: I updated my blog with insights on using NVCC in Colab, and that post trended as well!
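For the training step, a sketch of the counting phase under the assumption of binary features, using atomics to accumulate class and feature counts (the host would then turn the counts into smoothed log-probabilities):

// features[s * numFeatures + f] is 0 or 1; labels[s] is the class of sample s.
__global__ void countStats(const int* features, const int* labels,
                           int numSamples, int numFeatures,
                           int* classCounts, int* featureCounts) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= numSamples) return;

    int c = labels[s];
    atomicAdd(&classCounts[c], 1);                              // class prior count
    for (int f = 0; f < numFeatures; ++f)
        if (features[s * numFeatures + f])
            atomicAdd(&featureCounts[c * numFeatures + f], 1);  // per-class feature count
}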
Day 17: Vector Addition with cuBLAS
I learned the basics of the cuBLAS library and implemented vector addition using cublasSaxpy, enhancing performance via optimized routines.
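A sketch of the cuBLAS version: with alpha = 1, cublasSaxpy computes y = x + y in place (data initialization and error checking omitted; d_x and d_y are assumed to be filled device buffers):

#include <cublas_v2.h>

void addVectors(float* d_x, float* d_y, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    // y = alpha * x + y; with alpha = 1 this is a plain elementwise vector add.
    const float alpha = 1.0f;
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // stride 1 over both vectors

    cublasDestroy(handle);
}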
Day 18: Matrix Multiplication with cuBLAS
Continuing with cuBLAS, I implemented matrix multiplication, deepening my understanding of high-performance linear algebra operations.
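A sketch of the core call; cuBLAS assumes column-major storage, which is what the leading dimensions below reflect (the handle, device buffers, and the sizes m, n, k are assumed to be set up already):

// C = alpha * A * B + beta * C, where A is m x k, B is k x n, C is m x n,
// all stored column-major on the device.
const float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            m, n, k,
            &alpha,
            d_A, m,    // leading dimension of A
            d_B, k,    // leading dimension of B
            &beta,
            d_C, m);   // leading dimension of C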
Day 19: Fully Connected Neural Network
I constructed a fully connected neural network using cuDNN, exploring tensor descriptors, filter descriptors, and the complexities of neural networks on GPUs.
Day 20: Rotary Positional Encoding
I wrapped up this learning adventure by implementing the Rotary Positional Encoding (RoPE) mechanism in CUDA, enhancing transformer models for sequential data processing.
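A sketch of a RoPE kernel that rotates adjacent channel pairs of each position's vector; some implementations pair channel i with i + dim/2 instead, and the base of 10000 is just the usual default:

// x is a (seqLen x dim) matrix of query or key vectors, rotated in place.
// Launch with grid(ceil((dim / 2) / blockDim.x), seqLen).
__global__ void applyRope(float* x, int seqLen, int dim, float base) {
    int pos = blockIdx.y;                                  // token position
    int pair = blockIdx.x * blockDim.x + threadIdx.x;      // channel pair index
    if (pos >= seqLen || pair >= dim / 2) return;

    float theta = powf(base, -2.0f * pair / dim);          // per-pair rotation frequency
    float angle = pos * theta;
    float c = cosf(angle), s = sinf(angle);

    float* v = x + pos * dim;
    float x0 = v[2 * pair], x1 = v[2 * pair + 1];
    v[2 * pair]     = x0 * c - x1 * s;                     // 2D rotation of the pair
    v[2 * pair + 1] = x0 * s + x1 * c;
}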
Conclusion
These 20 days of learning CUDA have been transformative. I balanced the theoretical foundations provided by the PMPP book with hands-on coding projects, deepening my understanding of parallel programming and GPU architecture. As I move forward, I plan to continue refining the implementations and exploring more advanced topics in CUDA programming.
Stay tuned for further updates on my journey!