Thesis Information

Title (Chinese): 基于改进CRN的单通道语音增强研究 (Research on Single-Channel Speech Enhancement Based on an Improved CRN)

Author: 陈宏炎

Student ID: 1049721801643

Confidentiality level: Public

Thesis language: Chinese (chi)

Discipline code: 083500

Discipline: Engineering - Software Engineering

Student type: Master's student

Degree: Master of Engineering

University: 武汉理工大学 (Wuhan University of Technology)

School: 计算机科学与技术学院 (School of Computer Science and Technology)

Major: Software Engineering

Research direction: Speech enhancement

Primary supervisor: 胡燕

Supervisor's affiliation: 武汉理工大学 (Wuhan University of Technology)

Completion date: 2021-09-01

Defense date: 2021-09-23

Keywords (Chinese): speech enhancement; full-scale connections; feature fusion; sub-band analysis; deep complex network

Abstract (Chinese):

With the rapid development of society and information technology, voice interaction has been widely adopted across devices and application scenarios. In real-world applications, however, speech signals are often corrupted by complex environmental noise and reverberation, which degrades speech intelligibility and harms the performance of downstream speech applications. The basic task of speech enhancement is to effectively suppress the noise in speech and to improve the intelligibility and overall perceptual quality of the distorted signal. This thesis focuses on single-channel (monaural) speech enhancement.

Mainstream encoder-decoder models for single-channel speech enhancement use a convolutional encoder to reduce the dimensionality of the speech features and a decoder to expand the resulting high-level features back to the target output. During this encoding-decoding process, skip connections pass the feature maps produced by each encoder layer to the decoder layer at the same depth, helping the decoder recover the denoised speech. However, existing methods do not fully exploit the full-scale features generated during encoding and decoding; full-band models ignore the differences between local spectral patterns of speech; and time-frequency-domain methods simply reuse the phase of the noisy input when reconstructing clean speech, without adequately exploiting the phase information carried by the complex (real and imaginary) STFT representation. These issues limit enhancement performance. Addressing them, this thesis studies CRN-based speech enhancement in the time-frequency domain. The main contributions are as follows:
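The phase limitation mentioned above can be made concrete with a minimal sketch of a magnitude-only pipeline: the model operates on the STFT magnitude, and the noisy phase is simply reused at reconstruction. This sketch is illustrative only and is not the thesis implementation; `estimate_mask` is a hypothetical stand-in for any trained masking network.

```python
# Minimal sketch of magnitude-only enhancement with noisy-phase reconstruction.
import numpy as np
from scipy.signal import stft, istft

def enhance_magnitude_only(noisy, fs=16000, n_fft=512, hop=256, estimate_mask=None):
    _, _, spec = stft(noisy, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    mag, phase = np.abs(spec), np.angle(spec)          # split magnitude and phase
    mask = estimate_mask(mag) if estimate_mask else np.ones_like(mag)
    enhanced_mag = mask * mag                           # enhance the magnitude only
    enhanced_spec = enhanced_mag * np.exp(1j * phase)   # reuse the *noisy* phase
    _, out = istft(enhanced_spec, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return out

# usage: enhanced = enhance_magnitude_only(noisy_waveform, estimate_mask=my_model)
```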

(1) To address the fact that popular encoder-decoder models do not make full use of full-scale features, a speech enhancement model with full-scale connections, FSC-SENet, is proposed. First, a CRN-based enhancement method is constructed, in which a convolutional encoder and decoder extract and restore speech features and an LSTM module at the narrowest point of the encoder-decoder captures the temporal dynamics of the features. A full-scale connection scheme and a multi-feature dynamic fusion mechanism are then proposed, so that the decoder can exploit features from all scales when recovering clean speech. Experiments on the TIMIT corpus show that, compared with the CRN backbone, FSC-SENet improves PESQ by 0.39 and STOI by 2.8% under seen noise, and improves PESQ by 0.43 and STOI by 3.1% under unseen noise, demonstrating that the proposed full-scale connection mechanism gives encoder-decoder networks such as CRN better enhancement performance.
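For illustration, the sketch below shows one possible PyTorch realization of a CRN-style encoder, LSTM bottleneck, and decoder in which every decoder stage concatenates resized feature maps from all encoder scales (a full-scale connection). The layer sizes, the nearest-neighbour resizing, and the omission of the dynamic fusion mechanism are simplifying assumptions; this is not the FSC-SENet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullScaleCRN(nn.Module):
    """CRN-style encoder -> LSTM bottleneck -> decoder with full-scale skips."""

    def __init__(self, n_freq=161, channels=(16, 32, 64)):
        super().__init__()
        self.channels = channels
        # Encoder: each stage halves the frequency axis with a strided convolution.
        self.encoders = nn.ModuleList()
        in_ch, freqs = 1, [n_freq]
        for ch in channels:
            self.encoders.append(nn.Sequential(
                nn.Conv2d(in_ch, ch, (1, 3), stride=(1, 2), padding=(0, 1)),
                nn.BatchNorm2d(ch), nn.ELU()))
            in_ch = ch
            freqs.append((freqs[-1] + 1) // 2)
        self.freqs = freqs
        # LSTM bottleneck over time on the flattened (channel x freq) features.
        feat = channels[-1] * freqs[-1]
        self.lstm = nn.LSTM(feat, feat, batch_first=True)
        # Decoder: every stage fuses feature maps from *all* encoder scales,
        # resized to its own frequency resolution (the full-scale connection).
        self.decoders = nn.ModuleList()
        dec_in = channels[-1]
        for i in reversed(range(len(channels))):
            out_ch = channels[i - 1] if i > 0 else 1
            self.decoders.append(nn.Sequential(
                nn.Conv2d(dec_in + sum(channels), out_ch, (1, 3), padding=(0, 1)),
                nn.ELU()))
            dec_in = out_ch

    def forward(self, spec):                       # spec: (batch, 1, time, freq)
        enc_feats, x = [], spec
        for enc in self.encoders:
            x = enc(x)
            enc_feats.append(x)
        b, c, t, f = x.shape
        y, _ = self.lstm(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
        x = y.reshape(b, t, c, f).permute(0, 2, 1, 3)
        for i, dec in enumerate(self.decoders):
            target_f = self.freqs[len(self.channels) - 1 - i]
            x = F.interpolate(x, size=(t, target_f), mode="nearest")
            skips = [F.interpolate(e, size=(t, target_f), mode="nearest")
                     for e in enc_feats]
            x = dec(torch.cat([x] + skips, dim=1))
        return torch.sigmoid(x)                    # e.g. a mask on the magnitude spectrogram

# usage: mask = FullScaleCRN()(torch.randn(2, 1, 100, 161))  # -> (2, 1, 100, 161)
```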

(2) To address the problem that a single full-band model ignores local spectral patterns, sub-band analysis of speech is introduced and a cascaded full-band/sub-band model is proposed, in which the sub-band and full-band models compensate for each other's shortcomings. A simplified feature fusion module is also proposed to fuse the original noisy speech features with the intermediate-stage estimate, helping the later stage produce a better estimate. Experiments on the TIMIT dataset show that the cascaded full-band/sub-band enhancement model achieves the highest objective scores among the compared models, demonstrating that the proposed two-stage model outperforms purely full-band and purely sub-band models and that the two kinds of models are complementary.
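As a rough illustration of the cascade idea (not the thesis architecture), the sketch below splits the frequency axis into sub-bands for a first-stage sub-band network, then stacks the noisy input with the stage-one estimate as a two-channel input to a second-stage full-band network. The band width, the stacking-based fusion, and the toy networks in the usage example are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SubbandFullbandCascade(nn.Module):
    """Stage 1 enhances each frequency sub-band; stage 2 refines the full band."""

    def __init__(self, subband_net, fullband_net, band_width=40):
        super().__init__()
        self.subband_net = subband_net        # models local spectral patterns
        self.fullband_net = fullband_net      # models the full spectrum
        self.band_width = band_width

    def forward(self, noisy_mag):             # noisy_mag: (batch, time, freq)
        # Stage 1: split the frequency axis into sub-bands, enhance each with a
        # shared sub-band network, then re-assemble the full spectrum.
        bands = torch.split(noisy_mag, self.band_width, dim=-1)
        stage1 = torch.cat([self.subband_net(b) for b in bands], dim=-1)
        # Feature fusion: the second stage sees the original noisy features and
        # the intermediate estimate as a two-channel input.
        fused = torch.stack([noisy_mag, stage1], dim=1)   # (batch, 2, time, freq)
        return self.fullband_net(fused)

# usage with toy networks (freq must be a multiple of band_width here):
# sub = nn.Sequential(nn.Linear(40, 40), nn.ReLU(), nn.Linear(40, 40))
# full = nn.Conv2d(2, 1, kernel_size=1)
# est = SubbandFullbandCascade(sub, full)(torch.rand(2, 100, 160))  # (2, 1, 100, 160)
```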

(3) To address the insufficient use of phase information in current time-frequency-domain enhancement models, a deep complex speech enhancement network is proposed on top of the two-stage model from the previous contribution, enabling the network to operate directly on complex-valued speech features; a complex dynamic feature fusion module supporting complex features is also proposed. The network can therefore make better use of phase information, rather than predicting clean speech from magnitude features alone as before. Experiments on an open-source dataset show that, with phase information incorporated, the proposed model achieves better enhancement and outperforms the baseline algorithms on the evaluation metrics.
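The core operation that lets a network act directly on complex STFT features is complex-valued convolution, in which real and imaginary kernels are combined following the rule of complex multiplication. The sketch below is a generic complex 2-D convolution layer, not the thesis network; the shapes in the usage comment are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex convolution: (Wr + jWi) * (Xr + jXi) = (WrXr - WiXi) + j(WrXi + WiXr)."""

    def __init__(self, in_ch, out_ch, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)):
        super().__init__()
        self.conv_re = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.conv_im = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)

    def forward(self, x_re, x_im):
        # Complex multiplication of the kernel with the complex-valued spectrogram.
        out_re = self.conv_re(x_re) - self.conv_im(x_im)
        out_im = self.conv_re(x_im) + self.conv_im(x_re)
        return out_re, out_im

# usage: real/imaginary parts of a noisy STFT, shape (batch, 1, time, freq)
# conv = ComplexConv2d(1, 16)
# yr, yi = conv(torch.randn(2, 1, 100, 161), torch.randn(2, 1, 100, 161))
```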


CLC classification number: TN912.3

Barcode number: 002000062960

Call number: TD10049613

Holding location: 403

Notes: 403 - West Campus branch library, master's and doctoral thesis collection; 203 - Yujiatou branch library, master's and doctoral thesis collection
