Posted by Johan Schalkwyk, Google Fellow, Speech Team

In 2012, speech recognition research showed significant accuracy improvements with deep learning, leading to early adoption in products such as Google's Voice Search. It was the beginning of a revolution in the field: each year, new architectures were developed that further increased quality, from deep neural networks (DNNs) to recurrent neural networks (RNNs), long short-term memory networks (LSTMs), convolutional neural networks (CNNs), and more. During this time, latency remained a prime focus - an automated assistant feels a lot more helpful when it responds quickly to requests.

Today, we're happy to announce the rollout of an end-to-end, all-neural, on-device speech recognizer to power speech input in Gboard. In our recent paper, "Streaming End-to-End Speech Recognition for Mobile Devices", we present a model trained using RNN transducer (RNN-T) technology that is compact enough to reside on a phone. This means no more network latency or spottiness - the new recognizer is always available, even when you are offline. The model works at the character level, so that as you speak, it outputs words character-by-character, just as if someone were typing out what you say in real time, exactly as you'd expect from a keyboard dictation system.

This video compares the production, server-side speech recognizer (left panel) to the new on-device recognizer (right panel) when recognizing the same spoken sentence. Video credit: Akshay Kannan and Elnaz Sarbar
Traditionally, speech recognition systems consisted of several components: an acoustic model that maps segments of audio (typically 10-millisecond frames) to phonemes, a pronunciation model that connects phonemes together to form words, and a language model that expresses the likelihood of given phrases. In early systems, these components were optimized independently.

Around 2014, researchers began to focus on training a single neural network to directly map an input audio waveform to an output sentence. This sequence-to-sequence approach - learning to generate a sequence of words or graphemes given a sequence of audio features - led to the development of "attention-based" and "listen-attend-spell" models. While these models showed great promise in terms of accuracy, they typically work by reviewing the entire input sequence and do not allow streaming outputs as the input comes in, a necessary feature for real-time voice transcription. Meanwhile, an independent technique called connectionist temporal classification (CTC) had helped halve the latency of the production recognizer at that time. This proved to be an important step in creating the RNN-T architecture adopted in this latest release, which can be seen as a generalization of CTC.

RNN-Ts are a form of sequence-to-sequence model that does not employ attention mechanisms. Unlike most sequence-to-sequence models, which typically need to process the entire input sequence (the waveform, in our case) to produce an output (the sentence), the RNN-T continuously processes input samples and streams output symbols, a property that is welcome for speech dictation. In our implementation, the output symbols are the characters of the alphabet. The RNN-T recognizer outputs characters one by one, as you speak, with white spaces where appropriate. It does this with a feedback loop that feeds symbols predicted by the model back into it to predict the next symbols, as described in the figure below.

Representation of an RNN-T, with the input audio samples, x, and the predicted symbols, y. The predicted symbols (outputs of the Softmax layer) are fed back into the model through the Prediction network as y_{u-1}, ensuring that the predictions are conditioned both on the audio samples so far and on past outputs. The Prediction and Encoder networks are LSTM RNNs; the Joint model is a feedforward network (paper).
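To make that feedback loop concrete, here is a minimal sketch of greedy, streaming RNN-T decoding in Python/PyTorch. The paper does not provide code; the `encoder`, `prediction_net`, and `joint` modules, the blank index, and the per-frame symbol cap are all illustrative assumptions, not the actual implementation.

```python
import torch

BLANK = 0  # assumed index of the special "blank" (emit-nothing) symbol

@torch.no_grad()
def greedy_decode(encoder, prediction_net, joint, audio_frames,
                  max_symbols_per_frame=10):
    """Stream characters from audio, frame by frame (illustrative sketch)."""
    hyp = []                                 # decoded character indices
    y_prev = torch.tensor([[BLANK]])         # y_{u-1}: start from blank
    pred_out, pred_state = prediction_net(y_prev, state=None)

    for enc_t in encoder(audio_frames):      # one acoustic encoding per frame
        for _ in range(max_symbols_per_frame):
            # The Joint network fuses the acoustic (enc_t) and label
            # (pred_out) representations; Softmax/argmax picks a symbol.
            logits = joint(enc_t, pred_out)
            k = int(logits.argmax(dim=-1))
            if k == BLANK:
                break                        # nothing more here; next frame
            hyp.append(k)                    # emit the character immediately
            # Feedback loop: the prediction becomes y_{u-1} and is fed back
            # through the Prediction network before the next prediction.
            y_prev = torch.tensor([[k]])
            pred_out, pred_state = prediction_net(y_prev, state=pred_state)
    return hyp
```

Because a symbol (or blank) is emitted as each frame arrives, transcription can begin before the utterance ends, which is what distinguishes the RNN-T from attention-based models that wait for the full input.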
The Prediction Network comprises 2 layers of 2048 units, with a 640-dimensional projection layer.
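As a rough illustration of those dimensions, the following PyTorch sketch builds a prediction network with 2 LSTM layers of 2048 units and a 640-dimensional projection; the vocabulary size and embedding width are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class PredictionNetwork(nn.Module):
    """Sketch of the Prediction network described above: 2 LSTM layers of
    2048 units, each projected down to 640 dimensions (via LSTM proj_size).
    The vocabulary (characters + blank) and embedding size are assumed."""

    def __init__(self, vocab_size=30, embed_dim=640, hidden=2048, proj=640):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=2,
                            proj_size=proj, batch_first=True)

    def forward(self, y_prev, state=None):
        # y_prev: (batch, 1) index of the previously emitted symbol y_{u-1}
        out, state = self.lstm(self.embed(y_prev), state)
        return out[:, -1], state             # (batch, 640) label features
```

A feedforward Joint network would then combine this 640-dimensional label representation with the Encoder's output to score the next character, as in the decoding loop above.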