GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Date:

Generative Pre-trained Transformer models not only deliver breakthrough performance across complex language modelling tasks, but are also marked by extremely high computational and storage costs.

While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques are limited by the scale and complexity of GPT models.

In this paper, the authors propose GPTQ, a new one-shot weight quantization method based on approximate second-order information.
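To make the idea concrete, below is a minimal NumPy sketch of layer-wise, second-order weight quantization in the spirit of GPTQ. It quantizes the weights of one layer column by column and compensates the resulting error on the not-yet-quantized columns using the inverse Hessian H = 2XXᵀ of the layer reconstruction objective. The function names, the toy dimensions, the uniform round-to-nearest grid, and the use of the full inverse Hessian (instead of the paper's Cholesky-based reformulation and batched updates) are simplifying assumptions, not the authors' actual implementation.

```python
import numpy as np

def quantize_rtn(w, scale):
    """Round-to-nearest onto a uniform grid with the given scale (illustrative assumption)."""
    return np.round(w / scale) * scale

def gptq_like_quantize(W, X, scale=0.05, damp=0.01):
    """Quantize W column by column, spreading each column's quantization error
    over the remaining columns using second-order information.
    W: (out_features, in_features) layer weights; X: (in_features, n_samples) calibration inputs.
    """
    rows, cols = W.shape
    H = 2.0 * X @ X.T                                   # Hessian of ||WX - What X||^2 w.r.t. each weight row
    H += damp * np.mean(np.diag(H)) * np.eye(cols)      # dampening for numerical stability
    Hinv = np.linalg.inv(H)

    W = W.copy()
    Q = np.zeros_like(W)
    for j in range(cols):
        w_j = W[:, j]
        q_j = quantize_rtn(w_j, scale)
        Q[:, j] = q_j
        err = (w_j - q_j) / Hinv[j, j]
        # compensate the error on the columns that have not been quantized yet
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q

# Toy usage: the error is measured on the layer output, not on the weights themselves.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(16, 128))
Q = gptq_like_quantize(W, X)
print("output MSE:", np.mean((W @ X - Q @ X) ** 2))
```

The key design point this sketch illustrates is that quantization error is judged by its effect on the layer's output over calibration data, and the remaining full-precision weights are adjusted to absorb that error, rather than rounding every weight independently.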

Powerpoint for this talk

Reference Paper
