
Efficient Training and Inference: Techniques for Large Language Models Using Llama
  • Sophia R. Cunningham, AI-dealistic Lab
  • Dominique Archambault, AI-dealistic Lab
  • Austin Kung, AI-dealistic Lab

Corresponding Author: [email protected]

Abstract

Enhancing the efficiency of large language models involves optimizing their training and inference processes to reduce computational demands while maintaining high performance. This research applies model compression, quantization, and hardware acceleration techniques to the Llama model. Pruning and knowledge distillation effectively reduce model size, resulting in faster training times and lower resource consumption. Quantization techniques, including 8-bit and 4-bit representations, significantly decrease memory usage and improve computational speed without substantial accuracy loss. The integration of GPUs and TPUs further accelerates training and inference, demonstrating the crucial role of hardware in optimizing large-scale models. The study highlights the practical implications of these techniques, paving the way for more sustainable and scalable AI solutions.
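As a concrete illustration of the quantization mentioned in the abstract, the following is a minimal sketch of loading a Llama checkpoint in 4-bit precision, assuming the Hugging Face transformers and bitsandbytes libraries and a CUDA-capable GPU; the model identifier and generation settings are illustrative assumptions, not the exact configuration evaluated in this study.

```python
# Minimal sketch: 4-bit quantized inference with a Llama checkpoint.
# Assumes transformers + bitsandbytes are installed; model id is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical checkpoint for illustration

# NF4 quantization with bfloat16 compute: weights are stored in 4 bits,
# roughly quartering weight memory relative to fp16, while matrix
# multiplications run in higher precision to limit accuracy loss.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on available accelerators automatically
)

prompt = "Efficient inference for large language models"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Swapping `load_in_4bit=True` for `load_in_8bit=True` gives the 8-bit variant referred to in the abstract; both trade a small amount of accuracy for substantially lower memory use during inference.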
Submitted to TechRxiv: 18 May 2024
Published in TechRxiv: 24 May 2024