Thesis Information

Title: 基于深度学习的声音信号增强关键技术研究 (Research on Key Technologies for Sound Signal Enhancement Based on Deep Learning)
Author: 邱翔宇 (Qiu Xiangyu)
Student ID: 1049722003773
Confidentiality Level: Public
Thesis Language: Chinese (chi)
Discipline Code: 080703
Discipline: Engineering - Power Engineering and Engineering Thermophysics - Power Machinery and Engineering
Degree Level: Master's
University: 武汉理工大学 (Wuhan University of Technology)
School: 船海与能源动力工程学院 (School of Naval Architecture, Ocean and Energy Power Engineering)
Major: Power Machinery and Engineering
Research Area: Noise and Vibration Control
Primary Supervisor: 胡甫才 (Hu Fucai)
Supervisor's School: 船海与能源动力工程学院 (School of Naval Architecture, Ocean and Energy Power Engineering)
Completion Date: 2023-03-15
Defense Date: 2023-05-13
Keywords: signal separation; deep learning; signal enhancement; reverberation suppression; beamforming

Abstract:

In recent years, monitoring methods based on acoustic signals have become an important means of non-contact inspection. Other sounds in the environment, together with reverberation caused by reflections from walls and other surfaces, interfere with the target acoustic signal, so the quality of the target signal captured by the microphone is degraded to varying degrees. Signal enhancement algorithms designed for specific scenarios are therefore the core task of the signal processing module of an acoustic monitoring system. This thesis investigates deep learning models for sound signal enhancement across tasks including target signal extraction, source separation, and joint denoising and reverberation suppression. The main work and conclusions are as follows:

(1) Current target signal extraction algorithms use traditional time-frequency features as input, which leads to low accuracy and a heavy computational load. To address this, a target signal extraction algorithm based on a U-Net masking model with self-learned features is proposed. The U-Net structure is adjusted and long short-term memory (LSTM) modules are embedded, improving the model's ability to extract features from the signal feature maps. In the improved U-Net masking model, the self-learned features are passed through a ReLU activation to guarantee their sparsity, and the waveform is encoded by multiple groups of trainable convolutional sampling, avoiding the phase estimation problem of time-frequency-domain methods. Experimental results show that self-learned features markedly improve the deep learning model's performance and signal modeling ability, and the proposed single-channel target signal extraction algorithm achieves good performance metrics on separation tasks with different source signals.
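
To make the encoder-mask-decoder idea above concrete, here is a minimal PyTorch sketch of a learned-feature masking pipeline: a trainable 1-D convolution with ReLU activation encodes the raw waveform (so no STFT phase needs to be estimated), a mask network scales the encoded features, and a transposed convolution reconstructs the waveform. The layer sizes and the simple convolutional `mask_net` are illustrative assumptions; the thesis's mask estimator is an LSTM-augmented U-Net, abbreviated here to a placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedFeatureMasker(nn.Module):
    """Minimal sketch of a learned-feature masking pipeline (sizes are assumptions)."""

    def __init__(self, n_filters=256, kernel_size=16, stride=8):
        super().__init__()
        # Trainable convolutional sampling of the raw waveform replaces the STFT,
        # so no phase estimation is needed.
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        # Placeholder for the thesis's LSTM-augmented U-Net mask estimator.
        self.mask_net = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, 3, padding=1),
            nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, 3, padding=1),
        )
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride, bias=False)

    def forward(self, wav):                          # wav: (batch, 1, time)
        feats = F.relu(self.encoder(wav))            # ReLU keeps learned features sparse
        mask = torch.sigmoid(self.mask_net(feats))   # per-bin mask in [0, 1]
        return self.decoder(feats * mask)            # masked features back to waveform

x = torch.randn(2, 1, 16000)                         # two 1-second clips at 16 kHz
print(LearnedFeatureMasker()(x).shape)               # torch.Size([2, 1, 16000])
```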

(2) Deep learning source separation models struggle to balance computational cost and accuracy. To address this, an improved end-to-end, fully convolutional, time-domain source separation model based on Conv-TasNet is proposed. Because Conv-TasNet makes insufficient use of the signal feature maps, a TCN structure with both temporal and channel-domain feature extraction ability is designed, and the self-learned feature configuration is adapted to the new mask estimation network. Experimental results show that the improved source separation model significantly outperforms the baseline on classic speech separation tasks. A computational complexity analysis shows that the proposed Conv-TasNet-based model has few parameters and a small computational footprint, giving it an advantage over other single-channel source separation models in real-time applications.
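
The "temporal plus channel-domain" TCN block can be illustrated with a short sketch. The temporal path below follows Conv-TasNet's dilated depthwise convolution; the channel path is written as a squeeze-and-excitation-style gate, which is one common way to add channel-wise feature modeling and is an assumption here, not necessarily the thesis's exact design. All sizes are illustrative.

```python
import torch
import torch.nn as nn

class TemporalChannelBlock(nn.Module):
    """Sketch of a TCN block with temporal and channel-wise paths
    (an illustrative assumption, not the thesis's exact design)."""

    def __init__(self, channels=128, dilation=1):
        super().__init__()
        # Temporal path: dilated depthwise conv, as in Conv-TasNet's TCN blocks.
        self.temporal = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=dilation,
                      dilation=dilation, groups=channels),
            nn.PReLU(),
            nn.Conv1d(channels, channels, kernel_size=1),
        )
        # Channel path: squeeze-and-excitation-style gate re-weighting channels.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(channels, channels // 8, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(channels // 8, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (batch, channels, frames)
        y = self.temporal(x)
        y = y * self.channel_gate(y)      # channel-domain re-weighting
        return x + y                      # residual connection keeps gradients stable

feats = torch.randn(2, 128, 500)
print(TemporalChannelBlock(channels=128, dilation=2)(feats).shape)  # (2, 128, 500)
```

Stacking such blocks with increasing dilation, as Conv-TasNet does, grows the temporal receptive field while the gate adds channel-domain modeling at negligible parameter cost.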

(3) Existing beamforming methods for multi-channel signal enhancement depend too heavily on contextual information. To address this, an improved neural-network filter-and-sum beamforming model is proposed. Building on the FaSNet beamforming model, a gated recurrent unit (GRU) based network is designed as the filter used in the model's filtering stage. Experiments on an RIR-simulation validation dataset show that the proposed beamforming model outperforms traditional beamforming methods and the FaSNet baseline on multi-channel reverberant, noisy speech enhancement tasks. Compared with traditional beamforming methods, the proposed model requires far less temporal context, so it can be applied to real-time microphone array signal processing.
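
A filter-and-sum beamformer with neural filter estimation can be sketched as follows: a GRU consumes short multi-channel frames and emits time-varying FIR taps for each microphone; each channel is filtered with its taps and the channels are summed. The frame length, tap count, and hidden size are illustrative assumptions, and a real FaSNet-style model also includes reference-channel selection and cross-channel features omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRUFilterAndSum(nn.Module):
    """Sketch of filter-and-sum beamforming with GRU-estimated filters
    (frame length, tap count, and hidden size are illustrative assumptions)."""

    def __init__(self, n_mics=4, frame_len=256, taps=32, hidden=128):
        super().__init__()
        self.n_mics, self.taps = n_mics, taps
        # The GRU sees one short frame from all microphones per step, so it
        # needs far less temporal context than statistics-based beamformers.
        self.gru = nn.GRU(input_size=n_mics * frame_len, hidden_size=hidden,
                          batch_first=True)
        self.to_taps = nn.Linear(hidden, n_mics * taps)

    def forward(self, frames):
        # frames: (batch, n_frames, n_mics, frame_len)
        b, t, m, f = frames.shape
        h, _ = self.gru(frames.reshape(b, t, m * f))
        taps = self.to_taps(h).reshape(b, t, m, self.taps)   # per-channel FIR taps
        # Filter each channel with its estimated taps (sliding windows via unfold),
        # then sum across channels: the classic filter-and-sum structure.
        windows = F.pad(frames, (self.taps - 1, 0)).unfold(3, self.taps, 1)
        filtered = (windows * taps.unsqueeze(3)).sum(-1)     # (b, t, m, frame_len)
        return filtered.sum(2)                               # (b, t, frame_len)

x = torch.randn(2, 10, 4, 256)          # 2 utterances, 10 frames, 4 microphones
print(GRUFilterAndSum()(x).shape)       # torch.Size([2, 10, 256])
```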

CLC Number: TN912.35
Barcode Number: 002000074553
Call Number: YD10002543
Holding Location: 203
Notes: 403 - West Campus Branch, Master's and Doctoral Theses Collection; 203 - Yujiatou Branch, Master's and Doctoral Theses Collection
