Project Detail · 2025-11-05

Transformer Implementation From Scratch

End-to-end transformer reimplementation with component-level ablation tests.

Key Result: Matched baseline perplexity to within 2% while exposing bottlenecks in the attention kernels.

1. Overview

Reimplemented the transformer architecture to deeply understand each component and failure mode.

2. Architecture Diagram

Token Embeddings -> [Multi-Head Attention -> MLP -> LayerNorm] x N -> Decoder Head
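
As a rough illustration of this flow, the sketch below wires the same stages together in PyTorch. The layer sizes, post-norm ordering, and class names are illustrative placeholders rather than the project's actual code, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One post-norm decoder block: attention -> MLP -> LayerNorm, per the flow above."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: each position may only attend to earlier tokens.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)      # residual + LayerNorm after attention
        x = self.norm2(x + self.mlp(x))   # residual + LayerNorm after MLP
        return x

class TinyTransformer(nn.Module):
    """Token embeddings -> N decoder blocks -> decoder (LM) head."""
    def __init__(self, vocab_size=32000, d_model=512, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList([DecoderBlock(d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens):
        x = self.embed(tokens)            # (batch, seq_len) -> (batch, seq_len, d_model)
        for block in self.blocks:
            x = block(x)
        return self.head(x)               # logits over the vocabulary
```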

3. Technical Stack

  • PyTorch
  • NumPy
  • Weights & Biases

4. Experimental Results

  • Perplexity: within 2% of baseline (exponential of mean per-token cross-entropy; see the sketch after this list)
  • Training throughput: 1700 tokens/s on A100
  • Memory savings: 12% with fused ops
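
For reference, the perplexity figure above follows the usual definition: the exponential of the mean per-token cross-entropy over the evaluation set. A minimal evaluation sketch, where `model` and `data_loader` are placeholder names rather than the project's actual interfaces:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, data_loader, device="cuda"):
    """Corpus perplexity = exp(mean token-level cross-entropy)."""
    total_loss, total_tokens = 0.0, 0
    model.eval()
    for tokens in data_loader:                  # tokens: (batch, seq_len) token ids
        tokens = tokens.to(device)
        logits = model(tokens[:, :-1])          # predict the next token at each position
        targets = tokens[:, 1:]
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
            reduction="sum",                    # sum so we can divide by total token count
        )
        total_loss += loss.item()
        total_tokens += targets.numel()
    return math.exp(total_loss / total_tokens)
```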

5. Tradeoffs / Lessons

Kernel-level optimizations improved throughput but increased implementation complexity and debugging time.
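
As one concrete example of this kind of change (an assumption about what the fused ops were, not a record of the exact optimization), replacing an explicit softmax(QK^T)V with PyTorch's fused `scaled_dot_product_attention` avoids materializing the full score matrix, at the cost of a kernel that is harder to step through and instrument:

```python
import torch
import torch.nn.functional as F

def attention_naive(q, k, v):
    """Unfused attention: materializes the full (seq, seq) score matrix."""
    scale = q.size(-1) ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale
    seq_len = q.size(-2)
    mask = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), diagonal=1
    )
    scores = scores.masked_fill(mask, float("-inf"))   # causal masking
    return torch.softmax(scores, dim=-1) @ v

def attention_fused(q, k, v):
    """Fused kernel (PyTorch >= 2.0): never materializes the score matrix."""
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# q, k, v: (batch, heads, seq_len, head_dim). Both paths give the same result up to
# numerical precision; the fused path uses less peak memory but is harder to debug.
```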

6. Links