Academic Journal

LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models

Bibliographic Details
Title: LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models
Authors: Park, Gunho; Park, Baeseong; Kim, Minsub; Lee, Sungjae; Kim, Jeonghoon; Kwon, Beomseok; Kwon, Se Jung; Kim, Byeongwook; Lee, Youngjoo; Lee, Dongsoo
Publication Year: 2022
Collection: ArXiv.org (Cornell University Library)
Subject Terms: Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Computation and Language
Description: The recent advancements in self-supervised learning, combined with the Transformer architecture, have enabled natural language processing (NLP) to achieve remarkably low perplexity. However, powerful NLP models require ever-larger model sizes, leading to substantial computational and memory requirements. In this paper, we introduce an efficient inference framework tailored for large-scale generative language models. To reduce the model size, we employ a weight-only quantization strategy while preserving full precision for activations, attaining sub-4-bit quantization for each weight through non-uniform or uniform quantization techniques. Our proposed kernel, called LUT-GEMM, then accelerates quantized matrix multiplications, offering a flexible balance between compression ratio and accuracy. Unlike earlier matrix multiplication kernels that accommodate weight-only quantization, LUT-GEMM eliminates the resource-demanding dequantization step for both uniform and non-uniform quantization methods. By reducing the latency of individual GPU operations and of the overall inference process for large-scale language models, LUT-GEMM delivers significant performance improvements in inference. Its impact is amplified by the high compression ratios achieved through low-bit quantization and efficient LUT-based operations, which reduce the number of GPUs required. For the OPT-175B model with 3-bit quantization, we show that LUT-GEMM reduces per-token generation latency by 2.1x compared to OPTQ, which requires costly dequantization. Consequently, LUT-GEMM enables inference of the OPT-175B model on a single GPU without noticeable degradation in accuracy or performance, whereas the non-quantized OPT-175B model requires a minimum of 8 GPUs. Comment: Extension of "nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models"
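The core mechanism described in the abstract can be illustrated with a minimal NumPy sketch, assuming binary-coding quantization (w ≈ Σ_b α_b·B_b with B_b ∈ {-1,+1}) and per-group lookup tables of signed partial sums of the activations. The helper names (quantize_bcq, lut_dot) and the group size MU are illustrative assumptions, not taken from the paper, and this is not the authors' CUDA kernel; it only shows why indexing a table by packed weight bits can replace dequantization and multiply-accumulate.

```python
import numpy as np

# Hypothetical parameters for illustration only; the paper's actual kernel,
# group sizes, and quantization algorithm are not reproduced here.
MU = 8        # activations covered by one lookup table (assumed value)
N_BITS = 3    # sub-4-bit weight precision, as in the 3-bit OPT-175B result

def quantize_bcq(w, n_bits=N_BITS):
    """Greedy binary-coding quantization: w ~= sum_b alphas[b] * planes[b], planes in {-1,+1}."""
    residual = w.astype(np.float64).copy()
    alphas, planes = [], []
    for _ in range(n_bits):
        b = np.where(residual >= 0, 1.0, -1.0)
        a = np.abs(residual).mean()          # least-squares scale for a +-1 code
        alphas.append(a)
        planes.append(b)
        residual -= a * b
    return np.array(alphas), np.array(planes)

def lut_dot(x, alphas, planes, mu=MU):
    """Compute x . w_quantized via per-group lookup tables, never materializing dequantized weights."""
    n = x.size
    assert n % mu == 0
    total = 0.0
    for g in range(n // mu):
        xg = x[g * mu:(g + 1) * mu]
        # All 2^mu signed partial sums of this activation group; in a real GPU
        # kernel one such table is built once and reused across many output rows.
        lut = np.empty(1 << mu)
        for key in range(1 << mu):
            signs = np.array([1.0 if (key >> j) & 1 else -1.0 for j in range(mu)])
            lut[key] = signs @ xg
        for b in range(alphas.size):
            bits = planes[b, g * mu:(g + 1) * mu]
            key = sum(1 << j for j in range(mu) if bits[j] > 0)
            total += alphas[b] * lut[key]    # scale * table lookup replaces multiply-accumulate
    return total

rng = np.random.default_rng(0)
w = rng.standard_normal(64)
x = rng.standard_normal(64)
alphas, planes = quantize_bcq(w)
print(float(x @ w), lut_dot(x, alphas, planes))  # LUT result approximates the FP32 dot product
```

Because the binary weight bits index the table directly, the inner loop performs only lookups and scaled additions; the weights are never expanded back to full precision, which is the dequantization cost the abstract says LUT-GEMM avoids.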
Document Type: text
Language: English
Relation: http://arxiv.org/abs/2206.09557
Availability: http://arxiv.org/abs/2206.09557
Accession Number: edsbas.2B2FCC1
Database: BASE