AMD

Place AMD, San Jose, CA, USA
Keywords Quantization, GPU Parallelism, Mixture-of-Experts
Role AI Frameworks Intern
Timeline May 2025 - Aug 2025

Model Optimization

  • Proposed and implemented multi-GPU solutions for GPTQ quantization of MoE models and for model evaluation
  • Bypassed Python's GIL to quantize experts concurrently, distributing the computation across GPU devices via model parallelism (see the first sketch after this list)
  • Achieved 1.65x and 10x quantization speedups on Qwen1.5-MoE-A2.7B and DeepSeek-R1, respectively, with a 0.5% improvement in perplexity
  • Implemented data parallelism, partitioning the evaluation dataset across ranks to achieve a 5.01x speedup in evaluation time (see the second sketch below)
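
A minimal sketch of the GIL-bypass pattern described above, not the actual AMD implementation: torch.multiprocessing spawns one OS process per GPU, and each process quantizes a disjoint, strided shard of an MoE layer's experts. Here quantize_expert is a hypothetical placeholder for a per-expert GPTQ routine.

```python
import torch
import torch.multiprocessing as mp


def quantize_expert(expert, device):
    """Hypothetical stand-in for per-expert GPTQ weight quantization."""
    expert.to(device)
    # ... collect calibration activations, then quantize this expert's
    # linear layers with GPTQ ...


def worker(rank, world_size, experts):
    # One process per GPU: separate interpreters mean the GIL never
    # serializes the per-expert quantization work.
    device = torch.device(f"cuda:{rank}")
    for expert in experts[rank::world_size]:  # strided shard of the experts
        quantize_expert(expert, device)


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    experts = [torch.nn.Linear(64, 64) for _ in range(8)]  # toy MoE experts
    # Processes (unlike threads) sidestep the GIL entirely.
    mp.spawn(worker, args=(world_size, experts), nprocs=world_size)
```

Because each expert's GPTQ pass is independent, the strided assignment needs no inter-process synchronization; the only shared state is the read-only expert list copied to each worker at spawn time.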
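A minimal sketch of the data-parallel evaluation scheme, assuming a single-node torchrun launch and a toy stand-in language model: each rank scores a strided shard of the shared evaluation set, and a single all_reduce combines the per-rank loss sums into a global perplexity. The model and dataset here are placeholders, not the originals.

```python
import math

import torch
import torch.distributed as dist
import torch.nn.functional as F


def evaluate(model, dataset):
    rank, world_size = dist.get_rank(), dist.get_world_size()
    device = torch.device(f"cuda:{rank}")  # assumes one GPU per rank, single node
    model.to(device).eval()

    # Running [sum of token NLLs, token count] over this rank's shard only.
    totals = torch.zeros(2, device=device)
    with torch.no_grad():
        for batch in dataset[rank::world_size]:  # this rank's partition
            batch = batch.to(device)
            logits = model(batch)  # (batch, seq, vocab)
            nll = F.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                batch[:, 1:].reshape(-1),  # next-token targets
                reduction="sum",
            )
            totals[0] += nll
            totals[1] += batch[:, 1:].numel()

    # Sum partial losses/counts across all ranks; every rank sees the totals.
    dist.all_reduce(totals, op=dist.ReduceOp.SUM)
    return math.exp((totals[0] / totals[1]).item())


if __name__ == "__main__":
    dist.init_process_group("nccl")  # launch via: torchrun --nproc_per_node=N
    torch.manual_seed(0)  # every rank builds the identical dataset, then shards it
    vocab = 100
    model = torch.nn.Sequential(torch.nn.Embedding(vocab, 32),
                                torch.nn.Linear(32, vocab))  # toy stand-in LM
    data = [torch.randint(0, vocab, (4, 16)) for _ in range(32)]
    ppl = evaluate(model, data)
    if dist.get_rank() == 0:
        print(f"global perplexity: {ppl:.3f}")
    dist.destroy_process_group()
```

Reducing the summed loss and token count (rather than per-rank perplexities) keeps the result exact even when shards end up with unequal token counts.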