Deep Learning Framework in C++ & CUDA

A lightweight, device-agnostic deep learning framework architected from scratch.

Overview

To understand how modern deep learning frameworks work internally, I built a lightweight, device-agnostic framework in C++ from scratch. Its design mirrors the torch.nn.Module API and supports dynamic graph construction and automatic differentiation.
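
To make the idea of dynamic graph construction with automatic differentiation concrete, here is a minimal scalar sketch of reverse-mode autodiff. It is an illustration under simplifying assumptions, not the framework's actual code: the names Value, make_value, add, mul, and backward are hypothetical, and a real framework would operate on tensors rather than single doubles.

```cpp
#include <functional>
#include <memory>
#include <unordered_set>
#include <vector>

// Hypothetical scalar graph node; each operation creates a new node at runtime,
// so the graph is built dynamically as the forward computation executes.
struct Value {
    double data = 0.0;
    double grad = 0.0;
    std::vector<std::shared_ptr<Value>> parents;  // inputs that produced this node
    std::function<void()> backward_fn;            // pushes this node's grad to its parents
};
using ValuePtr = std::shared_ptr<Value>;

ValuePtr make_value(double x) {
    auto v = std::make_shared<Value>();
    v->data = x;
    return v;
}

ValuePtr add(const ValuePtr& a, const ValuePtr& b) {
    auto out = make_value(a->data + b->data);
    out->parents = {a, b};
    Value* o = out.get();  // raw pointer avoids a shared_ptr reference cycle
    out->backward_fn = [a, b, o] {
        a->grad += o->grad;  // d(a + b)/da = 1
        b->grad += o->grad;  // d(a + b)/db = 1
    };
    return out;
}

ValuePtr mul(const ValuePtr& a, const ValuePtr& b) {
    auto out = make_value(a->data * b->data);
    out->parents = {a, b};
    Value* o = out.get();
    out->backward_fn = [a, b, o] {
        a->grad += b->data * o->grad;  // d(a * b)/da = b
        b->grad += a->data * o->grad;  // d(a * b)/db = a
    };
    return out;
}

// Depth-first post-order traversal: every node appears after all of its inputs.
void topo_sort(const ValuePtr& v, std::unordered_set<Value*>& seen,
               std::vector<ValuePtr>& order) {
    if (!seen.insert(v.get()).second) return;  // already visited
    for (const auto& p : v->parents) topo_sort(p, seen, order);
    order.push_back(v);
}

// Seeds the root gradient with 1 and applies the chain rule from outputs to inputs.
void backward(const ValuePtr& root) {
    std::unordered_set<Value*> seen;
    std::vector<ValuePtr> order;
    topo_sort(root, seen, order);
    root->grad = 1.0;
    for (auto it = order.rbegin(); it != order.rend(); ++it) {
        if ((*it)->backward_fn) (*it)->backward_fn();
    }
}
```

For example, building y = add(mul(w, x), b) with w = 3, x = 2, b = 1 and calling backward(y) leaves w->grad at 2, x->grad at 3, and b->grad at 1. Gradients accumulate with +=, so a node used more than once in the graph receives the sum of its contributions.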

Key Features

  • Custom CUDA Kernels: Implemented optimized CUDA kernels for linear layers and activation functions.
  • Performance Optimization: Achieved a 13x speedup over a naive baseline implementation by combining OpenMP multi-threading with GPU acceleration.
  • Memory Management: Engineered a robust training pipeline with manual host-device memory management to ensure efficiency and stability.
  • Validation: Trained a fully connected neural network on the MNIST dataset to validate the framework’s correctness and convergence.
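
As a rough illustration of the CPU side of the features above, the sketch below shows a linear layer forward pass and a ReLU activation parallelized with OpenMP. It is a sketch under stated assumptions rather than the framework's actual kernels: the function names linear_forward and relu_inplace, the row-major layout, and the [out_features, in_features] weight shape are all hypothetical choices made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical CPU path for a linear layer:
//   y[batch, out] = x[batch, in] * w[out, in]^T + b[out]
// All buffers are assumed to be dense and row-major.
void linear_forward(const std::vector<float>& x, const std::vector<float>& w,
                    const std::vector<float>& b, std::vector<float>& y,
                    int batch, int in_features, int out_features) {
    y.assign(static_cast<std::size_t>(batch) * out_features, 0.0f);
    // Every output element is independent, so the two outer loops can be
    // divided among threads. Without -fopenmp the pragma is ignored and the
    // loops run serially with the same result.
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < batch; ++i) {
        for (int j = 0; j < out_features; ++j) {
            float acc = b[j];
            for (int k = 0; k < in_features; ++k) {
                acc += x[static_cast<std::size_t>(i) * in_features + k] *
                       w[static_cast<std::size_t>(j) * in_features + k];
            }
            y[static_cast<std::size_t>(i) * out_features + j] = acc;
        }
    }
}

// Element-wise ReLU applied in place: v[i] = max(v[i], 0).
void relu_inplace(std::vector<float>& v) {
    const int n = static_cast<int>(v.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        v[i] = std::max(v[i], 0.0f);
    }
}
```

A GPU version would typically map the same (i, j) output index to one CUDA thread, which is why the loop body is written so that each output element is computed without touching any other.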

Tech Stack

  • C++
  • CUDA
  • OpenMP
  • Deep Learning Systems