
Efficient Training and Inference: Techniques for Large Language Models Using Llama
  • Sophia R. Cunningham, AI-dealistic Lab
  • Dominique Archambault, AI-dealistic Lab
  • Austin Kung, AI-dealistic Lab

Corresponding Author: [email protected]

Abstract

Enhancing the efficiency of large language models involves optimizing their training and inference processes to reduce computational demands while maintaining high performance. This research applies model compression, quantization, and hardware acceleration techniques to the Llama model. Pruning and knowledge distillation effectively reduce model size, resulting in faster training times and lower resource consumption. Quantization techniques, including 8-bit and 4-bit representations, significantly decrease memory usage and improve computational speed without substantial accuracy loss. The integration of GPUs and TPUs further accelerates training and inference, demonstrating the crucial role of hardware in optimizing large-scale models. The study highlights the practical implications of these techniques, paving the way for more sustainable and scalable AI solutions.
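As a concrete illustration of the quantization mentioned in the abstract, the following is a minimal sketch of loading a Llama checkpoint in 4-bit precision, assuming the Hugging Face transformers and bitsandbytes libraries and a CUDA-capable GPU; the model identifier and generation settings are illustrative assumptions, not the exact configuration evaluated in this study.

```python
# Minimal sketch: 4-bit quantized inference with a Llama checkpoint.
# Assumes transformers + bitsandbytes are installed; model id is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical checkpoint for illustration

# NF4 quantization with bfloat16 compute: weights are stored in 4 bits,
# roughly quartering weight memory relative to fp16, while matrix
# multiplications run in higher precision to limit accuracy loss.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on available accelerators automatically
)

prompt = "Efficient inference for large language models"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Swapping `load_in_4bit=True` for `load_in_8bit=True` gives the 8-bit variant referred to in the abstract; both trade a small amount of accuracy for substantially lower memory use during inference.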
Submitted to TechRxiv: 18 May 2024
Published in TechRxiv: 24 May 2024