NGT Tutorials: PQuantML
This tutorial introduces PQuantML, a practical framework for pruning and quantization-aware training (QAT) of deep neural networks. It walks through the core concepts behind model compression: why pruning redundant weights and training with low-precision arithmetic can substantially reduce model size, latency, and energy consumption with little or no loss in accuracy. The tutorial also explains how PQuantML fits into a typical training pipeline, highlighting the modular design and clear APIs that make it easy to experiment with different compression strategies across common neural network architectures.
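To make the pruning side concrete before diving into PQuantML itself, here is a minimal sketch of the mechanics the framework builds on, written with PyTorch's built-in torch.nn.utils.prune utilities rather than PQuantML's own API (which the tutorial's examples cover). The model, layer sizes, and 50% sparsity target are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model; the architecture and sizes are illustrative only.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Unstructured magnitude pruning: zero out the 50% of weights
# with the smallest absolute value in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)

# Monitoring sparsity: the fraction of weights pinned to exactly zero.
zeros = sum(int((m.weight == 0).sum()) for m in model.modules() if isinstance(m, nn.Linear))
total = sum(m.weight.numel() for m in model.modules() if isinstance(m, nn.Linear))
print(f"weight sparsity: {zeros / total:.2%}")

# The pruning mask is re-applied on every forward pass, so an ordinary
# training step fine-tunes only the surviving weights.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
optimizer.zero_grad()
loss_fn(model(x), y).backward()
optimizer.step()
```

PQuantML wraps this kind of mask-and-fine-tune loop behind its own configuration API; the sketch only shows the mechanics being automated.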
Through step-by-step examples, the tutorial demonstrates how to configure structured and unstructured pruning, enable quantization-aware training, and fine-tune compressed models to recover accuracy. Readers learn how to monitor sparsity, accuracy, and computational cost throughout training, and how to export optimized models for deployment on resource-constrained hardware. By the end, readers will have a solid understanding of how to use PQuantML to build efficient, production-ready neural networks while maintaining strong predictive performance.
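As a complementary sketch, the snippet below shows quantization-aware training in its generic PyTorch form (torch.ao.quantization), not PQuantML's interface: fake-quantization is inserted during training so the network adapts to int8 arithmetic, and the model is then converted for deployment. All names and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert
)

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()    # observes and fake-quantizes the inputs
        self.fc1 = nn.Linear(784, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

# Attach a QAT configuration and swap in fake-quantized modules.
model = TinyNet()
model.qconfig = get_default_qat_qconfig("fbgemm")
model.train()
qat_model = prepare_qat(model)

# Train with fake-quantized weights and activations so the network
# learns to compensate for rounding and clamping error.
optimizer = torch.optim.SGD(qat_model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
optimizer.zero_grad()
loss_fn(qat_model(x), y).backward()
optimizer.step()

# Convert the trained model to true int8 kernels for deployment.
qat_model.eval()
int8_model = convert(qat_model)
```

Training against fake-quantized values is what lets the converted int8 model keep its accuracy: by conversion time, the optimizer has already adjusted the weights to tolerate low-precision arithmetic.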