Computer Science Artificial Intelligence

Speech Recognition and Synthesis

Description

This cluster of papers covers advances in speech recognition and synthesis, including acoustic modeling with deep neural networks, speaker verification, convolutional neural networks for speech recognition, end-to-end speech recognition systems, hidden Markov models, sequence-to-sequence models, automatic speech recognition, speaker diarization, and statistical language modeling.

Keywords

Deep Neural Networks; Acoustic Modeling; Speaker Verification; Convolutional Neural Networks; End-to-End Speech Recognition; Hidden Markov Models; Sequence-to-Sequence Models; Automatic Speech Recognition; Speaker Diarization; Statistical Language Modeling

We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state automata (using the freely available OpenFst), together with detailed documentation and a comprehensive set of scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.
We introduce the DET Curve as a means of representing performance on detection tasks that involve a tradeoff of error types. We discuss why we prefer it to the traditional ROC Curve and offer several examples of its use in speaker recognition and language recognition. We explain why it is likely to produce approximately linear curves. We also note special points that may be included on these curves, how they are used with multiple targets, and possible further applications.
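A minimal sketch of the idea behind a DET curve: sweep a threshold over detection scores, compute miss and false-alarm probabilities, and plot both on a normal-deviate (probit) scale. The scores below are synthetic stand-ins, not data from the paper.

```python
# Minimal DET-curve sketch: miss vs. false-alarm rates on a normal-deviate scale.
# Scores and the operating-point sweep here are synthetic, for illustration only.
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
target_scores = rng.normal(1.0, 1.0, 1000)      # scores for true (target) trials
nontarget_scores = rng.normal(-1.0, 1.0, 1000)  # scores for impostor trials

thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
p_miss = np.array([(target_scores < t).mean() for t in thresholds])
p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])

# Mapping error rates through the inverse normal CDF is what makes roughly
# Gaussian score distributions appear as approximately straight lines.
eps = 1e-6
plt.plot(norm.ppf(np.clip(p_fa, eps, 1 - eps)), norm.ppf(np.clip(p_miss, eps, 1 - eps)))
plt.xlabel("False alarm probability (probit scale)")
plt.ylabel("Miss probability (probit scale)")
plt.title("DET curve (synthetic scores)")
plt.show()
```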
The article describes a database of emotional speech. Ten actors (5 female and 5 male) simulated the emotions, producing 10 German utterances (5 short and 5 longer sentences) which could be used in everyday communication and are interpretable in all applied emotions. The recordings were taken in an anechoic chamber with high-quality recording equipment. In addition to the sound, electro-glottograms were recorded. The speech material comprises about 800 sentences (seven emotions * ten actors * ten sentences + some second versions). The complete database was evaluated in a perception test regarding the recognisability of emotions and their naturalness. Utterances recognised better than 80% and judged as natural by more than 60% of the listeners were phonetically labelled in a narrow transcription with special markers for voice quality, phonatory and articulatory settings, and articulatory features. The database can be accessed by the public via the internet (http://www.expressive-speech.net/emodb/).
A new recurrent neural network based language model (RNN LM) with applications to speech recognition is presented. Results indicate that it is possible to obtain around 50% reduction of perplexity by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model. Speech recognition experiments show around 18% reduction of word error rate on the Wall Street Journal task when comparing models trained on the same amount of data, and around 5% on the much harder NIST RT05 task, even when the backoff model is trained on much more data than the RNN LM. We provide ample empirical evidence to suggest that connectionist language models are superior to standard n-gram techniques, except for their high computational (training) complexity. Index Terms: language modeling, recurrent neural networks, speech recognition
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.
Gaussian mixture models are currently the dominant technique for modeling the emission distribution of hidden Markov models for speech recognition. We show that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters. These networks are first pre-trained as a multi-layer generative model of a window of spectral feature vectors without making use of any discriminative information. Once the generative pre-training has designed the features, we perform discriminative fine-tuning using backpropagation to adjust the features slightly to make them better at predicting a probability distribution over the states of monophone hidden Markov models.
Recently, the hybrid deep neural network (DNN)-hidden Markov model (HMM) has been shown to significantly improve speech recognition performance over the conventional Gaussian mixture model (GMM)-HMM. The performance improvement is partially attributed to the ability of the DNN to model complex correlations in speech features. In this paper, we show that further error rate reduction can be obtained by using convolutional neural networks (CNNs). We first present a concise description of the basic CNN and explain how it can be used for speech recognition. We further propose a limited-weight-sharing scheme that can better model speech features. The special structure such as local connectivity, weight sharing, and pooling in CNNs exhibits some degree of invariance to small shifts of speech features along the frequency axis, which is important to deal with speaker and environment variations. Experimental results show that CNNs reduce the error rate by 6%-10% compared with DNNs on the TIMIT phone recognition and the voice search large vocabulary speech recognition tasks.
A software formant synthesizer is described that can generate synthetic speech using a laboratory digital computer. A flexible synthesizer configuration permits the synthesis of sonorants by either a cascade or parallel connection of digital resonators, but frication spectra must be synthesized by a set of resonators connected in parallel. A control program lets the user specify variable control parameter data, such as formant frequencies as a function of time, as a sequence of 〈time, value〉 points. The synthesizer design is described and motivated in Secs. I–III, and Fortran listings for the synthesizer and control program are provided in an appendix. Computer requirements and necessary support software are described in Sec. IV. Strategies for the imitation of any speech utterance are described in Sec. V, and suggested values of control parameters for the synthesis of many English sounds are presented in tabular form.
In this paper, a framework for maximum a posteriori (MAP) estimation of hidden Markov models (HMM) is presented. Three key issues of MAP estimation, namely, the choice of prior distribution family, the specification of the parameters of prior densities, and the evaluation of the MAP estimates, are addressed. Using HMMs with Gaussian mixture state observation densities as an example, it is assumed that the prior densities for the HMM parameters can be adequately represented as a product of Dirichlet and normal-Wishart densities. The classical maximum likelihood estimation algorithms, namely, the forward-backward algorithm and the segmental k-means algorithm, are expanded, and MAP estimation formulas are developed. Prior density estimation issues are discussed for two classes of applications, parameter smoothing and model adaptation, and some experimental results are given illustrating the practical interest of this approach. Because of its adaptive nature, Bayesian learning is shown to serve as a unified approach for a wide range of speech recognition applications.
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.
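A small sketch of how the Connectionist Temporal Classification objective is wired up in practice, here using PyTorch's nn.CTCLoss on random features; the paper's actual deep bidirectional LSTM and its expected-loss modification are omitted, and all dimensions are illustrative.

```python
# Sketch of the CTC objective: alignment-free training of a character-level model.
import torch
import torch.nn as nn

T, N, C = 50, 4, 28                 # time steps, batch size, characters + blank
ctc = nn.CTCLoss(blank=0)            # index 0 reserved for the CTC blank symbol

logits = torch.randn(T, N, C, requires_grad=True)        # stand-in network outputs
log_probs = logits.log_softmax(dim=2)
targets = torch.randint(1, C, (N, 12), dtype=torch.long)  # character targets per utterance
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                      # gradients flow back to whatever produced the logits
print(float(loss))
```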
The basic theory of Markov chains has been known to mathematicians and engineers for close to 80 years, but it is only in the past decade that it has been applied explicitly to problems in speech processing. One of the major reasons why speech models, based on Markov chains, have not been developed until recently was the lack of a method for optimizing the parameters of the Markov model to match observed signal patterns. Such a method was proposed in the late 1960's and was immediately applied to speech processing in several research institutions. Continued refinements in the theory and implementation of Markov modelling techniques have greatly enhanced the method, leading to a wide range of applications of these models. It is the purpose of this tutorial paper to give an introduction to the theory of Markov models, and to illustrate how they have been applied to problems in speech recognition.
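As a concrete illustration of the HMM machinery the tutorial introduces, the forward algorithm below computes the probability of an observation sequence for a toy discrete-observation model; the transition, emission, and initial probabilities are made up for the example.

```python
# Forward algorithm for a discrete-observation HMM: P(observations | model),
# summing over all possible state sequences. Toy 2-state, 3-symbol model.
import numpy as np

A = np.array([[0.7, 0.3],       # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],  # emission probabilities per state
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])        # initial state distribution
obs = [0, 2, 1, 2]               # observed symbol indices

alpha = pi * B[:, obs[0]]        # initialization
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]   # induction step
print("P(O | model) =", alpha.sum())
```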
The authors present a time-delay neural network (TDNN) approach to phoneme recognition which is characterized by two important properties: (1) using a three-layer arrangement of simple computing units, a hierarchy can be constructed that allows for the formation of arbitrary nonlinear decision surfaces, which the TDNN learns automatically using error backpropagation; and (2) the time-delay arrangement enables the network to discover acoustic-phonetic features and the temporal relationships between them independently of position in time and therefore not blurred by temporal shifts in the input. As a recognition task, the speaker-dependent recognition of the phonemes B, D, and G in varying phonetic contexts was chosen. For comparison, several discrete hidden Markov models (HMM) were trained to perform the same task. Performance evaluation over 1946 testing tokens from three speakers showed that the TDNN achieves a recognition rate of 98.5% correct while the rate obtained by the best of the HMMs was only 93.7%.
A tutorial on the design and development of automatic speaker-recognition systems is presented. Automatic speaker recognition is the use of a machine to recognize a person from a spoken phrase. These systems can operate in two modes: to identify a particular person or to verify a person's claimed identity. Speech processing and the basic components of automatic speaker-recognition systems are shown and design tradeoffs are discussed. Then, a new automatic speaker-recognition system is given. This recognizer performs with 98.9% correct identification. Last, the performances of various systems are compared.
The description of a novel type of m-gram language model is given. The model offers, via a nonlinear recursive procedure, a computation and space efficient solution to the problem of estimating probabilities from sparse data. This solution compares favorably to other proposed methods. While the method has been developed for and successfully implemented in the IBM Real Time Speech Recognizers, its generality makes it applicable in other areas where the problem of estimating probabilities from sparse data arises.
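To make the sparse-data problem concrete, here is a deliberately simplified back-off bigram estimate: seen bigrams get a discounted relative-frequency estimate and unseen bigrams fall back to the unigram distribution. This only illustrates the idea of reserving probability mass for unseen events; it is not the paper's exact nonlinear recursion, and a proper back-off model would also renormalize the lower-order term over unseen continuations.

```python
# Simplified back-off bigram estimate (illustration only, not Katz's exact method).
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus[:-1], corpus[1:]))
discount = 0.5
total = sum(unigrams.values())

def p_backoff(w1, w2):
    if (w1, w2) in bigrams:
        # discounted relative frequency for observed bigrams
        return (bigrams[(w1, w2)] - discount) / unigrams[w1]
    # probability mass freed by discounting, redistributed via the unigram model
    reserved = discount * sum(1 for (a, _) in bigrams if a == w1) / unigrams[w1]
    return reserved * unigrams[w2] / total

print(p_backoff("the", "cat"), p_backoff("the", "mat"))
```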
Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However, RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.
We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum-likelihood (ML) criteria, respectively.
This paper presents an extension of our previous work which proposes a new speaker representation for speaker verification. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis. This space is named the total variability space because it models both speaker and channel variabilities. Two speaker verification systems are proposed which use this new representation. The first system is a support vector machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. We tested three channel compensation techniques in the total variability space, which are within-class covariance normalization (WCCN), linear discriminant analysis (LDA), and nuisance attribute projection (NAP). We found that the best results are obtained when LDA is followed by WCCN. We achieved an equal error rate (EER) of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. We also obtained 4% absolute EER improvement for both-gender trials on the 10 s-10 s condition compared to the classical joint factor analysis scoring.
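The second system described above reduces verification to a cosine score between two total-variability factors (i-vectors). A minimal sketch of that scoring step, with random vectors standing in for i-vectors extracted from a trained total-variability model (and without the LDA/WCCN compensation), and a made-up decision threshold:

```python
# Cosine-distance scoring of i-vectors, as in the paper's second system.
import numpy as np

rng = np.random.default_rng(0)
enroll_ivec = rng.normal(size=400)   # i-vector for the enrolled speaker (stand-in)
test_ivec = rng.normal(size=400)     # i-vector for the test utterance (stand-in)

score = enroll_ivec @ test_ivec / (np.linalg.norm(enroll_ivec) * np.linalg.norm(test_ivec))
accept = score > 0.2                 # threshold would be tuned on development trials
print(score, accept)
```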
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
This paper introduces and motivates the use of Gaussian mixture models (GMM) for robust text-independent speaker identification. The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identity. The focus of this work is on applications which require high identification rates using short utterances from unconstrained conversational speech and robustness to degradations produced by transmission over a telephone channel. A complete experimental evaluation of the Gaussian mixture speaker model is conducted on a 49 speaker, conversational telephone speech database. The experiments examine algorithmic issues (initialization, variance limiting, model order selection), spectral variability robustness techniques, large population performance, and comparisons to other speaker modeling techniques (uni-modal Gaussian, VQ codebook, tied Gaussian mixture, and radial basis functions). The Gaussian mixture speaker model attains 96.8% identification accuracy using 5 second clean speech utterances and 80.8% accuracy using 15 second telephone speech utterances with a 49 speaker population and is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task.
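A compact sketch of the closed-set GMM speaker-identification scheme the paper describes: fit one mixture per speaker on that speaker's feature frames, then pick the model with the highest average log-likelihood for a test utterance. Random vectors stand in for cepstral features here, and the mixture order is arbitrary.

```python
# GMM-based closed-set speaker identification (toy features, illustrative settings).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train = {spk: rng.normal(loc=spk, size=(500, 13)) for spk in range(3)}   # per-speaker frames
models = {spk: GaussianMixture(n_components=8, covariance_type="diag").fit(X)
          for spk, X in train.items()}

test_utterance = rng.normal(loc=1, size=(200, 13))   # frames from "speaker 1"
scores = {spk: gmm.score(test_utterance) for spk, gmm in models.items()}  # mean log-likelihood
print("identified speaker:", max(scores, key=scores.get))
```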
We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, resulting in a 7x speedup over our previous system. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.
Long Short-Term Memory (LSTM) is a specific recurrent neural network (RNN) architecture that was designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we explore LSTM RNN architectures for large scale acoustic modeling in speech recognition. We recently showed that LSTM RNNs are more effective than DNNs and conventional RNNs for acoustic modeling, considering moderately-sized models trained on a single machine. Here, we introduce the first distributed training of LSTM RNNs using asynchronous stochastic gradient descent optimization on a large cluster of machines. We show that a two-layer deep LSTM RNN where each LSTM layer has a linear recurrent projection layer can exceed state-of-the-art speech recognition performance. This architecture makes more effective use of model parameters than the others considered, converges quickly, and outperforms a deep feed forward neural network having an order of magnitude more parameters. Index Terms: Long Short-Term Memory, LSTM, recurrent neural network, RNN, speech recognition, acoustic modeling.
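Recent PyTorch versions expose a proj_size argument on nn.LSTM that implements the linear recurrent projection layer (LSTMP) the paper describes. The sketch below stacks two such layers over filterbank-like inputs; the cell, projection, and output sizes are illustrative, not the paper's configuration.

```python
# Two-layer LSTM with a linear recurrent projection layer (LSTMP-style stack).
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=40, hidden_size=800, proj_size=256,
               num_layers=2, batch_first=True)
output_layer = nn.Linear(256, 4000)    # e.g. posteriors over tied HMM states

features = torch.randn(8, 100, 40)     # (batch, frames, filterbank features)
hidden, _ = lstm(features)             # (8, 100, 256): projected LSTM outputs
log_posteriors = output_layer(hidden).log_softmax(dim=-1)
print(log_posteriors.shape)
```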
Neural networks have become increasingly popular for the task of language modeling. Whereas feed-forward networks only exploit a fixed context length to predict the next word of a sequence, conceptually, standard recurrent neural networks can take into account all of the predecessor words. On the other hand, it is well known that recurrent networks are difficult to train and therefore are unlikely to show the full potential of recurrent models. These problems are addressed by the Long Short-Term Memory neural network architecture. In this work, we analyze this type of network on an English and a large French language modeling task. Experiments show improvements of about 8% relative in perplexity over standard recurrent neural network LMs. In addition, we gain considerable improvements in WER on top of a state-of-the-art speech recognition system.
We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. We apply SpecAugment on Listen, Attend and Spell networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work. On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, we achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5'00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.
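A minimal SpecAugment-style sketch of the two masking operations described above, applied to a stand-in log-mel spectrogram; time warping and the paper's exact per-dataset policies (mask widths, numbers of masks) are omitted.

```python
# Frequency and time masking on a (mel channels, frames) feature matrix.
import numpy as np

rng = np.random.default_rng(0)
spec = rng.normal(size=(80, 300))          # stand-in log-mel spectrogram

F, T = 15, 40                              # illustrative maximum mask widths
f = rng.integers(0, F); f0 = rng.integers(0, spec.shape[0] - f)
t = rng.integers(0, T); t0 = rng.integers(0, spec.shape[1] - t)

augmented = spec.copy()
augmented[f0:f0 + f, :] = spec.mean()      # frequency mask
augmented[:, t0:t0 + t] = spec.mean()      # time mask
```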
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.
Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of "posterior collapse" -- where the latents are ignored when they are paired with a powerful autoregressive decoder -- typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
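The core of the VQ bottleneck is snapping each encoder output to its nearest codebook vector and passing gradients straight through the non-differentiable lookup. The sketch below shows only that step; the commitment and codebook losses and the autoregressive prior from the paper are omitted, and the sizes are arbitrary.

```python
# Nearest-codebook quantisation with a straight-through gradient estimator.
import torch

num_codes, dim = 512, 64
codebook = torch.randn(num_codes, dim, requires_grad=True)
z_e = torch.randn(16, dim, requires_grad=True)        # encoder outputs (flattened)

distances = torch.cdist(z_e, codebook)                # pairwise L2 distances
indices = distances.argmin(dim=1)                     # nearest code per vector
z_q = codebook[indices]                               # quantised latents

# Straight-through: the decoder sees z_q, gradients flow back to z_e unchanged.
z_q_st = z_e + (z_q - z_e).detach()
print(indices[:8], z_q_st.shape)
```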
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the conditioning input to WaveNet instead of linguistic, duration, and F0 features. We further show that using this compact acoustic intermediate representation allows for a significant reduction in the size of the WaveNet architecture.
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
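For readers who want to try a fine-tuned wav2vec 2.0 model, a short inference sketch with the Hugging Face transformers library is given below. It assumes the publicly released facebook/wav2vec2-base-960h checkpoint, a 16 kHz mono recording at a hypothetical path "clip.wav", and network access to download the model.

```python
# CTC decoding with a fine-tuned wav2vec 2.0 checkpoint (greedy argmax decoding).
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, sample_rate = sf.read("clip.wav")          # expected: 16 kHz mono audio
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                # CTC logits over characters
ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(ids)[0])
```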
Recently, Transformer and convolutional neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of both worlds by studying how to combine convolution neural networks and transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way. To this end, we propose the convolution-augmented transformer for speech recognition, named Conformer. Conformer significantly outperforms the previous Transformer and CNN based models, achieving state-of-the-art accuracies. On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/test-other. We also observe competitive performance of 2.7%/6.3% with a small model of only 10M parameters.
Akshay Kumar | International Journal of Scientific Research in Engineering and Management
In this project, we have worked on creating a voice cloning system using deep learning. The main idea was to build a model that can listen to one person's voice and then convert it into another person's voice, in such a way that it sounds real and natural. We used the LibriSpeech dataset for training our model because it contains a large number of voice recordings from many different speakers, which helped us teach the model how various people speak. To process the audio, we first convert the voice into features like mel spectrograms and pitch (F0), which help capture the sound and style of someone's voice. The captured features were then used to train a neural network that learns how to copy the target speaker's voice style and apply it to a new voice. We used a multi-speaker training method so that the system doesn't just work for one or two speakers, but can handle many different voices. After training, we tested our model by giving it new voice samples and asking it to clone those voices into different speaker styles. The results were quite good. The converted voices sounded very close to the target speakers and were easy to understand. We also checked the waveforms and did listening tests to compare the original and cloned voices. The output was smooth and clear, showing that the model was able to learn speaker characteristics effectively. Overall, this project shows that voice cloning using deep learning is possible and can give good results even without a huge amount of data. It has many future uses, like helping people who can't speak, making virtual assistants more personal, or even dubbing videos in different voices. In future work, we can try adding emotions or working on real-time voice conversion as well. Keywords: Voice Cloning, Deep Learning, Mel Spectrogram, Speaker Conversion, Speech Synthesis, LibriSpeech.
Chinese, a tonal language with inherent homophonic ambiguity, poses significant challenges for semantic disambiguation in natural language processing (NLP), hindering applications like speech recognition, dialog systems, and assistive technologies. Traditional static disambiguation methods suffer from poor adaptability in dynamic environments and low-frequency scenarios, limiting their real-world utility. To address these limitations, we propose BLAF, a novel MacBERT-BiLSTM hybrid architecture that synergizes global semantic understanding with local sequential dependencies through dynamic multimodal feature fusion. This framework incorporates innovative mechanisms for the principled weighting of heterogeneous features, effective alignment of representations, and sensor-augmented cross-modal learning to enhance robustness, particularly in noisy environments. Employing a staged optimization strategy, BLAF achieves state-of-the-art performance on the SIGHAN 2015 benchmark (with data fine-tuning and supplementation): 93.37% accuracy and 93.25% F1 score, surpassing pure BERT by 15.74% in accuracy. Ablation studies confirm the critical contributions of the integrated components. Furthermore, the sensor-augmented module significantly improves robustness under noise (speech SNR to 18.6 dB at 75 dB noise, 12.7% reduction in word error rates). By bridging gaps among tonal phonetics, contextual semantics, and computational efficiency, BLAF establishes a scalable paradigm for robust Chinese homophone disambiguation in industrial NLP applications. This work advances cognitive intelligence in Chinese NLP and provides a blueprint for adaptive disambiguation in resource-constrained and dynamic scenarios.
Sayali B. Patil | International Journal of Scientific Research in Engineering and Management
The eBook to Audio Converter project aims to transform written eBooks into high-quality audio formats, enhancing accessibility and convenience for users. This project addresses the growing demand for accessible digital content by converting various eBook formats (e.g., PDF, EPUB, DOCX) into natural-sounding spoken audio. The process involves several key steps: extracting text from eBooks, cleaning and refining the content by removing non-essential elements, and converting the refined text into speech using advanced Text-to-Speech (TTS) technology. The system supports multiple eBook formats and utilizes TTS engines to provide customizable and lifelike speech synthesis. Key features include support for diverse file types, sophisticated text parsing to ensure clarity, and integration with TTS services that offer voice customization and SSML support. This project aims to facilitate a seamless transition from text to audio, making written content more accessible to individuals with visual impairments, those who prefer auditory learning, or anyone needing to multitask. By leveraging cutting-edge TTS technology and focusing on user experience, the eBook to Audio Converter project provides a robust solution that enhances digital content accessibility and usability. Index Terms - eBook-to-Audio Conversion, Text-to-Speech (TTS), Natural Language Processing (NLP), Neural Speech Synthesis, Voice Cloning, Audio Signal Processing, Digital Content Accessibility, Automated Audio Rendering.
Speech emotion identification is one of the most difficult areas of human-computer interaction, with significant ramifications for assistive technologies, customer support, and mental health monitoring. Despite significant advances in machine learning, accurately identifying emotional states from speech remains difficult due to the complex, nuanced nature of vocal emotional expressions across diverse speakers and contexts. This study presents a comprehensive evaluation of Speech Emotion Recognition (SER) systems across multiple machine learning paradigms using four benchmark datasets (CREMA-D, RAVDESS, SAVEE, and TESS). We implement a multi-feature extraction approach incorporating prosodic, spectral, and voice quality features, while employing data augmentation techniques to enhance model robustness. Our investigation spans traditional machine learning algorithms, ensemble methods, and deep learning architectures including CNN and RNN implementations. Performance evaluation reveals the superiority of the Stacking Classifier (accuracy: 72.54%, F1-score: 72.47%), with strong performances from Random Forest (68.31% accuracy) and ResNet (66% accuracy). This comparative analysis advances affective computing by providing detailed insights into the effectiveness of various approaches for emotion recognition in speech, with significant implications for developing more sophisticated emotional intelligence systems.
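A wiring sketch of the kind of stacking ensemble reported as strongest above: base learners whose predictions are combined by a logistic-regression meta-learner. The base estimators, their settings, and the random feature vectors standing in for prosodic/spectral features are assumptions for illustration, not the study's configuration.

```python
# Stacking classifier sketch for utterance-level emotion recognition features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 60))                 # stand-in utterance-level feature vectors
y = rng.integers(0, 6, size=600)               # six emotion classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_tr, y_tr)
print("held-out accuracy:", stack.score(X_te, y_te))
```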
Prahallad Kishore | International Journal for Research in Applied Science and Engineering Technology
This project focuses on developing a human-level text-to-speech (TTS) system using advanced deep learning techniques, particularly style diffusion models. Traditional TTS systems often struggle with generating speech that sounds truly natural and expressive, especially when dealing with diverse speaking styles. In this work, we explore StyleTTS 2, a novel approach that models speech styles as latent variables and uses diffusion processes to generate high-quality audio without the need for reference speech during inference. By integrating large-scale speech language models and adversarial training, our system significantly improves the naturalness, expressiveness, and generalization of synthesized speech. The model was trained and tested on benchmark datasets like LJSpeech and VCTK, where it achieved performance that matches or exceeds human recordings based on Mean Opinion Scores (MOS) and Comparative MOS (CMOS). Our results demonstrate that combining diffusion models with deep learning and style modeling can bring TTS systems closer to real human speech in both quality and variability. We also conducted extensive evaluations on out-of-distribution text inputs, where our model maintained high-quality output, showcasing its robustness. Overall, this work highlights the potential of diffusion-based models to push the boundaries of human-like speech synthesis in real-world applications.
Speech recognition models, predominantly trained on standard speech, often exhibit lower accuracy for individuals with accents, dialects, or speech impairments. This disparity is particularly pronounced for economically or socially marginalized communities, including those with disabilities or diverse linguistic backgrounds. Project Euphonia, a Google initiative originally launched in English dedicated to improving Automatic Speech Recognition (ASR) of disordered speech, is expanding its data collection and evaluation efforts to include international languages like Spanish, Japanese, French and Hindi, in a continued effort to enhance inclusivity. This paper presents an overview of the extension of processes and methods used for English data collection to more languages and locales, progress on the collected data, and details about our model evaluation process, focusing on meaning preservation based on Generative AI.
Speaker verification (SV) is an exceptionally effective method of biometric authentication. However, its performance is heavily influenced by the effectiveness of the extracted speaker features and their suitability for use in resource-limited environments. Transformer models and convolutional neural networks (CNNs), leveraging self-attention mechanisms, have demonstrated state-of-the-art performance in most Natural Language Processing (NLP) and Image Recognition tasks. However, previous studies indicate that standalone Transformer and CNN architectures present distinct challenges in speaker verification. Specifically, while Transformer models deliver good results, they fail to meet the requirements of low-resource scenarios and computational efficiency. On the other hand, CNNs perform well in resource-constrained environments but suffer from significantly reduced recognition accuracy. Several existing approaches, such as Conformer, combine Transformers and CNNs but still face challenges related to high resource consumption and low computational efficiency. To address these issues, we propose a novel solution that enhances the Transformer model by introducing multi-scale convolutional attention and a Global Response Normalization (GRN)-based feed-forward network, resulting in a lightweight backbone architecture called the lightweight simple transformer (LST). We further improve LST by incorporating the Res2Net structure from CNN, yielding the Res2Former model, a low-parameter, high-precision SV model. In Res2Former, we design and implement a time-frequency adaptive feature fusion (TAFF) mechanism that enables fine-grained feature propagation by fusing features at different depths at the frame level. Additionally, holistic fusion is employed for global feature propagation across the model. To enhance performance, multiple convergence methods are introduced, improving the overall efficacy of the SV system. Experimental results on the VoxCeleb1-O, VoxCeleb1-E, VoxCeleb1-H, and CN-Celeb(E) datasets demonstrate that Res2Former achieves excellent performance, with the Large configuration attaining Equal Error Rate (EER)/Minimum Detection Cost Function (minDCF) scores of 0.81%/0.08, 0.98%/0.11, 1.81%/0.17, and 8.39%/0.46, respectively. Notably, the Base configuration of Res2Former, with only 1.73M parameters, also delivers competitive results.
With the rapid development of synthetic speech and deepfake technology, fake speech poses a severe challenge to voice authentication systems. Traditional detection methods generally rely on manual feature extraction, facing problems such as limited feature expression ability and insufficient cross-scenario generalization performance. To this end, this paper proposes an improved ResNet network based on a Convolutional Block Attention Module (CBAM) for end-to-end fake speech detection. This method introduces channel attention and spatial attention mechanisms into the ResNet network structure to enhance the model's attention to the temporal characteristics of speech, thereby improving the ability to distinguish between real and fake speech. The proposed model adopts an end-to-end training strategy, directly processes the original spectrogram input, uses the residual structure to alleviate the gradient vanishing problem in the deep network, and enhances the collaborative expression ability of local details and global context through the CBAM module. The experiment is conducted on the ASVspoof2019 LA dataset, and the equal error rate (EER) is used as the main evaluation indicator. The experimental results show that compared with traditional deepfake speech detection methods, the proposed model achieves better performance in indicators such as EER, verifying the effectiveness of the CBAM attention mechanism in forged speech detection.
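For readers unfamiliar with CBAM, a minimal block (channel attention followed by spatial attention) that could sit between ResNet stages operating on spectrogram feature maps is sketched below. The reduction ratio and kernel size follow common CBAM defaults and are not necessarily this paper's configuration.

```python
# Minimal CBAM block: channel attention, then spatial attention, on 2-D feature maps.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared MLP for channel attention
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))              # channel attention from average pool
        mx = self.mlp(x.amax(dim=(2, 3)))               # ...and from max pool
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        s = torch.cat([x.mean(dim=1, keepdim=True),     # spatial attention over channel stats
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

feature_map = torch.randn(4, 64, 40, 100)               # (batch, channels, freq, time)
print(CBAM(64)(feature_map).shape)
```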
This paper proposes a Channel-Aware Speech Network (CAs-Net) for low-resource speech recognition tasks, aiming to improve recognition performance for languages such as Uyghur under complex noisy conditions. The proposed model consists of two key components: (1) the Channel Rotation Module (CIM), which reconstructs each frame's channel vector into a spatial structure and applies a rotation operation to explicitly model the local structural relationships within the channel dimension, thereby enhancing the encoder's contextual modeling capability; and (2) the Multi-Scale Depthwise Convolution Module (MSDCM), integrated within the Transformer framework, which leverages multi-branch depthwise separable convolutions and a lightweight self-attention mechanism to jointly capture multi-scale temporal patterns, thus improving the model's perception of compact articulation and complex rhythmic structures. Experiments conducted on a real Uyghur speech recognition dataset demonstrate that CAs-Net achieves the best performance across multiple subsets, with an average Word Error Rate (WER) of 5.72%, significantly outperforming existing approaches. These results validate the robustness and effectiveness of the proposed model under low-resource and noisy conditions.
Dysarthria frequently occurs in individuals with disorders such as stroke, Parkinson's disease, cerebral palsy, and other neurological disorders. Timely detection and management of dysarthria in these patients is imperative for efficiently handling the development of their condition. Several previous studies have concentrated on detecting dysarthric speech using machine learning-based methods. However, the false positive rate is high due to the varying nature of speech and environmental factors such as background noise. Therefore, in this work, we employ a model based on the Swin transformer (ST), namely DSR-Swinoid. First, the speech is converted into mel-spectrograms to reflect the maximum patterns of voice signals. Despite the ST's initial aim to effectively extract local and global visual features, it still prioritizes global features. Meanwhile, in mel-spectrograms, the specific gaps due to slurred speech must be considered. Therefore, our objective is to improve the ST's capacity for learning local features by introducing four modules: a network for local feature capturing (NLF), convolutional patch concatenation, multi-path (MP), and multi-view block (MVB). The NLF module enriches the existing Swin transformer by enhancing its capability to capture local features effectively. MP integrates features from different Swin phases to emphasize local information. Meanwhile, the MVB-ST block surpasses classical Swin blocks by integrating diverse receptive fields, focusing on a more comprehensive extraction of local features. Experimental results show that DSR-Swinoid attains the best accuracy of 98.66%, exceeding the results of existing methods.
Speech Emotion Recognition (SER) has emerged as a vital research domain that aims to imbue machines with the capability to discern human emotional states from vocal cues. The efficacy of deep learning (DL) models in SER is profoundly dependent on the initial data preprocessing stage. This study provides an in-depth exploration of data preprocessing techniques critical for DL-based SER, including noise reduction, signal and feature normalization, diverse acoustic feature extraction methodologies (e.g., Mel-Frequency Cepstral Coefficients (MFCCs), Mel-spectrograms, Chroma features, and standardized sets such as eGeMAPS), and various data augmentation strategies. Furthermore, we propose a comprehensive framework for the systematic evaluation of these preprocessing pipelines. This framework advocates a rigorous, incremental approach to experimentation designed to isolate and quantify the impact of individual and combined preprocessing steps. The objective is to foster the development of evidence-based guidelines and best practices in SER preprocessing, thereby contributing to the creation of more accurate, robust, and generalizable emotion recognition systems for diverse, real-world applications.
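A short sketch of the feature-extraction and normalization steps mentioned above, using librosa: MFCCs, a log-mel spectrogram, and chroma features with per-feature z-score normalization. The file path is hypothetical, and eGeMAPS extraction (typically done with openSMILE) is not shown.

```python
# Typical SER preprocessing: load audio, extract MFCC / log-mel / chroma, z-score normalize.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)              # resample to 16 kHz mono

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)            # (13, frames)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)                            # (64, frames)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)              # (12, frames)

def zscore(feat):
    # per-coefficient normalization over time
    return (feat - feat.mean(axis=1, keepdims=True)) / (feat.std(axis=1, keepdims=True) + 1e-8)

features = np.concatenate([zscore(mfcc), zscore(log_mel), zscore(chroma)], axis=0)
print(features.shape)
```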
Detecting malicious domains generated by Domain Generation Algorithms (DGAs) remains a significant challenge, particularly for wordlist-based DGAs that mimic legitimate domain patterns. In this work, we present an interpretable and adaptable DGA detection framework that employs Large Language Models, specifically LLaMA 3 8B. Our approach integrates Supervised Fine-Tuning, In-Context Learning (ICL), and SHAP-based explainability to enhance both performance and transparency. We evaluate our system on a large-scale dataset comprising 68 DGA families, including difficult wordlist-based variants, as well as benign domains from the Tranco dataset. The fine-tuned model surpasses existing state-of-the-art detectors in accuracy and false positive rate, especially on challenging word-based DGAs. Moreover, we demonstrate how SHAP can identify failure cases and guide lightweight updates via ICL, improving detection without full retraining. This combination of interpretability and adaptability offers a practical approach for maintaining high-performance DGA detection systems over time, establishing LLMs as effective and explainable tools for real-world cybersecurity applications.
International Journal of Advanced Trends in Computer Science and Engineering
Automatic Speech Recognition (ASR) has experienced remarkable progress, transitioning from rule-based systems to deep learning methodologies that enhance interactions between humans and machines. Earlier ASR systems depended on Hidden Markov Models (HMMs) and manual feature extraction, whereas contemporary frameworks like Wav2Vec2 utilize self-supervised learning to boost both efficiency and precision. Created by Facebook AI Research (FAIR), Wav2Vec2 analyzes raw audio by masking certain segments and predicting them using contextual information, thereby minimizing the reliance on extensive labeled datasets. This technique proves especially beneficial for languages with limited resources and diverse speech conditions. The architecture of Wav2Vec2 includes a convolutional feature encoder and a transformer-based context network, facilitating exceptional speech recognition with minimal labeled data. Its practical uses encompass automated email generation, where transcribed speech is organized into properly formatted email content. In comparison to traditional ASR systems, Wav2Vec2 offers enhanced accuracy, quicker learning, and better generalization across various languages and accents. This study examines the most recent developments in Wav2Vec2, focusing on its effectiveness in speech recognition workflows, comparisons with conventional ASR systems, and its practical use in converting speech to email. Although Wav2Vec2 enhances transcription accuracy, it still faces challenges such as background noise, accents, and the need for real-time processing. Future investigations aim to refine Wav2Vec2 for specific domains, further enhancing its ASR capabilities.
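A rough sketch of the speech-to-email idea described above, using the transformers ASR pipeline with a Wav2Vec2 checkpoint and a trivial email template. The model name and audio path are assumptions, and a real system would add punctuation restoration and proper formatting rather than this toy wrapper.

```python
# Speech-to-email sketch: transcribe dictation with Wav2Vec2, wrap it in a template.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
transcript = asr("dictation.wav")["text"]          # assumes a 16 kHz mono recording

email = (
    "Subject: Dictated note\n\n"
    f"Hi team,\n\n{transcript.capitalize()}\n\nBest regards,\nSender"
)
print(email)
```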