A SURVEY OF SPEECH SEPARATION BY DEEP LEARNING
In the last decade, image- and voice-based applications have come to play an important part in daily life and are widely used for purposes such as recognition, tracking, and security. The most widely used methods in these fields are based on machine learning. For example, conventional neural networks have been applied in many studies to improve accuracy. In parallel with this progress in learning algorithms, new processors have been developed that support running learning algorithms on big data and matrix-based operations. The latest step in parallel processing is the application of deep learning to image and voice applications. From the viewpoint of hardware implementation, GPUs support both the training and testing of deep learning algorithms. In the last decade, exploiting the capacity of GPUs for mathematical operations has provided high-performance, low-cost hardware for large matrix calculations. Combining this hardware with novel machine learning techniques has made it possible to deal with big data and has led to the emergence of the deep learning concept.
To separate an individual voice from other voices in a noisy environment, this study surveys recent work to identify an acceptably accurate approach to the problem using deep learning, and suggests a deterministic approach for similar studies that use conventional methods.