SPEECH DENOISING USING ADVANCED DIFFUSION TECHNIQUES IN CONSTRAINED RESOURCE ENVIRONMENTS THROUGH INFERENCE PIPELINES
Keywords:
speech enhancement, diffusion models, generative modelling, inference optimization, batch inference, resource-constrained environments
Abstract
Noisy speech degrades the value of audio data, and speech enhancement techniques play a crucial role in recovering it. Speech denoising has been an active research problem for decades, and its approaches fall broadly into three categories: classical statistical methods (1970s–2000s), early deep learning methods (2010s–2020s), and modern state-of-the-art diffusion-based methods (2022–present). Diffusion-based methods require substantial computational resources during both training and inference. In this paper, we investigate the inference performance of the pretrained SGMSE+ model under constrained resources. Using an inference pipeline on Google Colab's free-tier GPUs, we run two systematic batch-processing experiments over the entire test set of the VoiceBank+DEMAND dataset [1]. The results demonstrate that high-quality speech enhancement is achievable on free-tier consumer hardware without any model modifications, fine-tuning, or architectural changes. Experiment 1 achieves a PESQ of 2.90 and an SI-SDR of 17.4 dB, statistically equivalent to the published baseline, whereas Experiment 2 reduces the number of inference calls by 84% and preserves PESQ at 2.893. These findings show that careful inference-time optimization enables stable, reproducible inference of pretrained diffusion-based models under limited resources, underscoring the importance of such optimization for diffusion-based systems.
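To make the SI-SDR figure quoted above concrete, the following is a minimal sketch of the metric in pure Python, assuming the common zero-mean, scale-invariant formulation: the clean reference is rescaled by the optimal gain onto the estimate, and the score is the energy ratio (in dB) between that scaled target and the residual error. It is an illustration only, not the evaluation code used in this work; the function name `si_sdr` and the `eps` guard are hypothetical choices made for this sketch.

```python
import math
from typing import Sequence


def si_sdr(estimate: Sequence[float], reference: Sequence[float], eps: float = 1e-12) -> float:
    """Scale-invariant signal-to-distortion ratio in dB (illustrative sketch)."""
    if len(estimate) != len(reference) or len(reference) == 0:
        raise ValueError("signals must be non-empty and of equal length")

    # Remove the DC offset so a constant bias does not affect the score.
    m_est = sum(estimate) / len(estimate)
    m_ref = sum(reference) / len(reference)
    est = [x - m_est for x in estimate]
    ref = [x - m_ref for x in reference]

    # Optimal gain that projects the estimate onto the reference direction.
    alpha = sum(e * r for e, r in zip(est, ref)) / (sum(r * r for r in ref) + eps)
    target = [alpha * r for r in ref]

    # Everything in the estimate not explained by the scaled reference is error.
    error = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    error_energy = sum(n * n for n in error)
    return 10.0 * math.log10((target_energy + eps) / (error_energy + eps))
```

Because the reference is rescaled by the optimal gain before the error is measured, multiplying the enhanced signal by a constant leaves the score unchanged, so SI-SDR rewards removing noise rather than matching loudness. For example, adding noise orthogonal to the reference at one tenth of its amplitude yields an energy ratio of 100, i.e. 20 dB.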