Transformer Implementation From Scratch
End-to-end transformer reimplementation with component-level ablation tests.
Key Result: Matched baseline perplexity to within 2% while exposing bottlenecks in attention kernels.
1. Overview
Reimplemented the transformer architecture to deeply understand each component and failure mode.
2. Architecture Diagram
Token Embeddings -> [Multi-Head Attention -> MLP -> LayerNorm] x N decoder blocks (with residual connections) -> Decoder Head
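As a concrete illustration of the diagram, a single decoder block might look like the sketch below. It mirrors the ordering shown above (attention, then MLP, with LayerNorm after each sublayer); the dimensions (d_model, n_heads, d_ff) are illustrative placeholders, not the configuration used in this project.

```python
# Minimal sketch of one decoder block from the diagram above (illustrative
# dimensions, not the project's actual configuration).
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: True entries are blocked, so each position only sees earlier tokens.
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask, need_weights=False)
        x = self.ln1(x + attn_out)      # residual connection around attention
        x = self.ln2(x + self.mlp(x))   # residual connection around the MLP
        return x
```

A component-level check in the spirit of the ablation tests mentioned above could then verify that the causal mask really isolates earlier positions, for example:

```python
# Perturbing the last token must not change the block's output at earlier positions.
block = DecoderBlock().eval()
x = torch.randn(1, 16, 512)
x_perturbed = x.clone()
x_perturbed[0, -1] += 1.0
with torch.no_grad():
    assert torch.allclose(block(x)[0, :-1], block(x_perturbed)[0, :-1], atol=1e-6)
```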
3. Technical Stack
- PyTorch
- NumPy
- Weights & Biases (experiment tracking)
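Since Weights & Biases appears in the stack for experiment tracking, a minimal logging loop might look like the sketch below. The project name, config fields, metric keys, and the synthetic loss values are placeholders, not the ones used in this repo.

```python
# Minimal sketch of metric logging with Weights & Biases; project name, config,
# and the synthetic values below are placeholders, not real results.
import wandb

run = wandb.init(
    project="transformer-from-scratch",   # hypothetical project name
    config={"d_model": 512, "n_layers": 6, "lr": 3e-4},
)

for step in range(100):
    loss = 4.0 * 0.99 ** step             # placeholder standing in for the real training loss
    wandb.log({"train/loss": loss, "throughput/tokens_per_sec": 1700.0}, step=step)

run.finish()
```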
4. Experimental Results
- Perplexity: within 2% of baseline
- Training throughput: 1700 tokens/s on an NVIDIA A100
- Memory savings: 12% with fused ops
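For context on how numbers like these are typically obtained, the sketch below measures validation perplexity (exponentiated mean next-token cross-entropy), training throughput in tokens/s, and peak GPU memory. It is a generic benchmark harness under assumed interfaces (a model returning logits of shape (batch, seq, vocab)), not the project's actual measurement code.

```python
# Generic sketch of how perplexity, tokens/s throughput, and peak memory can be
# measured; the model, optimizer, and batch are assumed interfaces, not project code.
import time
import torch
import torch.nn.functional as F


def next_token_loss(model, input_ids):
    # Cross-entropy of each position predicting the next token.
    logits = model(input_ids)               # (batch, seq, vocab) assumed
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )


@torch.no_grad()
def perplexity(model, input_ids):
    # Perplexity is the exponentiated mean next-token cross-entropy.
    return torch.exp(next_token_loss(model, input_ids)).item()


def training_throughput(model, optimizer, input_ids, n_steps=20, device="cuda"):
    model.to(device)
    input_ids = input_ids.to(device)
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    for _ in range(n_steps):
        loss = next_token_loss(model, input_ids)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    torch.cuda.synchronize(device)
    elapsed = time.perf_counter() - start

    tokens_per_sec = input_ids.numel() * n_steps / elapsed
    peak_mem_gb = torch.cuda.max_memory_allocated(device) / 1e9
    return tokens_per_sec, peak_mem_gb
```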
5. Tradeoffs / Lessons
Kernel-level optimizations improved throughput but increased implementation complexity and debugging time.
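One concrete instance of this tradeoff (illustrative, not the project's actual kernels): a naive attention implementation materializes the full (seq x seq) score matrix, which is easy to inspect and debug, while a fused kernel such as PyTorch's scaled_dot_product_attention avoids that intermediate, which is where memory savings of the kind reported above usually come from, at the cost of far less visibility when something goes wrong.

```python
# Illustrative comparison of a readable, naive attention against PyTorch's fused
# scaled_dot_product_attention kernel; not the kernels used in this project.
import torch
import torch.nn.functional as F


def naive_attention(q, k, v, causal=True):
    # Materializes the (seq, seq) score matrix explicitly: easy to inspect, memory-hungry.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    if causal:
        seq = q.size(-2)
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=q.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


def fused_attention(q, k, v, causal=True):
    # Dispatches to a fused kernel (e.g. FlashAttention) when one is available.
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)


if __name__ == "__main__":
    q, k, v = (torch.randn(1, 8, 256, 64) for _ in range(3))
    print(torch.allclose(naive_attention(q, k, v), fused_attention(q, k, v), atol=1e-5))
```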