Speech to text:
Everything about speech & voice recognition

Speech Recognition, also known as Automatic Speech Recognition (ASR), Computer Speech Recognition, Voice to Text or Speech to Text, is the field of computer science concerned with developing technologies that turn the spoken word into text.

Natural Language Processing

Siri, Alexa, Cortana and OK Google are widely known examples of interfaces built on advanced ASR models. Speech Recognition is a branch of the larger scientific domain known as Natural Language Processing (NLP).

NLP involves everything related to modern computational linguistics. Other fields associated with NLP are Natural Language Understanding (NLU) and Natural Language Generation (NLG).

The first aims to extract analytical insights from speech; the second transforms data into natural language, in other words text to speech. Both domains require speech to text as a starting point.

After all, you cannot create natural language from data without the text, just as you cannot extract relevant analytics without first turning the spoken word into text.

An ASR system consists of statistical models that map continuous phonetic sound sequences (spoken utterances or speech waveforms) to recognizable text output in a human language. The ASR model contains a Language Model, a Pronunciation Model (Lexicon/Dictionary) and an Acoustic Model. When the models are consistently trained with new speech data from multiple speakers and an extended vocabulary (Language Model), the accuracy of the transcription increases. Statistically, this accuracy is measured by the Word Error Rate (WER).

Ideally, a highly accurate model achieves a WER below 10%. In the end, the single most important factor when training a speech recognition model is the quality of the dataset: the audio and the lexicon. After all, that data is what the models learn from. The same goes for using the models: if the audio is of low quality, so will be the output. It is the same with humans: if someone speaks unclearly, we cannot understand them, and if the audio is unclear, a machine cannot understand what is being said either.
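To make the metric concrete: WER is the word-level edit distance (substitutions, insertions and deletions) between a reference transcript and the recognizer's output, divided by the number of words in the reference. A minimal sketch in Python, using a standard Levenshtein distance over words:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One misrecognized word in a four-word reference gives a WER of 0.25 (25%)
print(word_error_rate("the cat sat down", "the hat sat down"))  # 0.25
```

A model that transcribes 100 reference words with 5 errors in total thus scores a WER of 5%, comfortably below the 10% threshold mentioned above.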

Speech Recognition makes audio content accessible. A lot of useful information is locked away in audio, but audio is not easy to search through. By applying speech to text, audio is turned into text and therefore becomes both accessible and searchable at word level. The output can be used as automatic subtitles for people with hearing loss, and as a means of indexing content archives in addition to existing metadata, making large archives easier to search. Think of a journalist looking up footage for a story he or she is creating. Instead of trying to guess the date a certain event occurred, the journalist can simply look up the footage based on keywords about the event.


Figure 1: Speech to text process generic model

  1. User uploads recorded audio content to the platform.
  2. The acoustic model within the Speech Recognition Engine analyses sounds.
  3. The lexicon model syncs the sounds with the right words.
  4. The language model structures the results and delivers a raw text file (JSON) with all words having a confidence score, speaker ID and timestamp.
  5. The file can be restructured as a transcript or subtitle file.

Figure 2: Speech to text process custom model

  1. User uploads recorded audio content to the platform.
  2. The acoustic model trained with customer data (audio) within the Speech Recognition Engine analyses sounds.
  3. The lexicon model trained with customer data (transcripts) syncs the sounds with the right words.
  4. The custom language model structures the results and delivers a raw text file (JSON) with all words having a confidence score, speaker ID and timestamp.
  5. The file can be restructured as a transcript or subtitle file.
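Step 5 of both flows can be done with a few lines of code. Vendors use different JSON schemas, so the field names below (word, start, end, confidence) are illustrative assumptions rather than an actual output format. A minimal sketch that turns a timestamped word list into a single SRT subtitle cue:

```python
import json

def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3.5 -> 00:00:03,500."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt_cue(words: list, index: int = 1) -> str:
    """Turn a list of timestamped words into one numbered SRT cue."""
    start = to_srt_timestamp(words[0]["start"])
    end = to_srt_timestamp(words[-1]["end"])
    text = " ".join(w["word"] for w in words)
    return f"{index}\n{start} --> {end}\n{text}\n"

# Invented output fragment for illustration; real schemas differ per vendor.
raw = json.loads('[{"word": "hello", "start": 0.0, "end": 0.4, "confidence": 0.98},'
                 ' {"word": "world", "start": 0.5, "end": 0.9, "confidence": 0.95}]')
print(words_to_srt_cue(raw))
```

A full subtitle file is simply a sequence of such cues, typically grouped per sentence or per speaker turn rather than per word.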

Step 1: Choose the functions and features of speech recognition

Speech recognition services come in many forms. Some companies focus on transcription; others focus on using speech to text for subtitling, and there are those that offer speech to text as a means of indexing large content archives. Whatever the use case may be, there is surely an option out there that fits your requirements.

It is important to bear in mind that, in the end, speech recognition services provide the means to turn the spoken word into text, and with text you can do all sorts of things. At Scriptix we provide users with an API platform to integrate this process of turning speech into text into their existing workflows.

Step 2: Convert speech to text with an API and in different languages

The great thing about automatic speech recognition is that models can be built for any language out there; all that is needed is the right dataset. This means that in order to build a model in a certain language, you need thousands of hours of audio in that specific language as well as hundreds of hours of flawless transcripts in that same language.

Using the audio data, engineers build an acoustic model that contains specific sounds; with the transcript data, they build a lexicon that contains specific words. Together these make up the language model, and by applying artificial intelligence and running multiple iterations over this data, the language model becomes better and better at making the right combinations between sounds and words. No vendor supports all the languages and dialects of the world, but in theory this is possible, as long as each model can be trained with the right datasets.
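Conceptually, "making the right combinations between sounds and words" means scoring candidate words: the acoustic model says how well a word matches the observed sounds, the language model says how likely that word is in context, and the recognizer picks the candidate with the best combined score. A deliberately simplified toy version of that decision, with probabilities invented purely for illustration:

```python
import math

# Toy scores: how well each candidate matches the observed sounds (acoustic
# model) and how likely it is in the sentence context (language model).
# All numbers are made up for illustration.
acoustic_score = {"speech": 0.60, "beach": 0.55}   # the two sound similar
language_score = {"speech": 0.30, "beach": 0.01}   # e.g. after "recognize"

def best_candidate(candidates):
    """Pick the word with the highest combined log-probability."""
    return max(
        candidates,
        key=lambda w: math.log(acoustic_score[w]) + math.log(language_score[w]),
    )

print(best_candidate(["speech", "beach"]))  # "speech" wins on the combined score
```

Even though "beach" is almost as good an acoustic match, the language model makes "speech" the far more plausible word in context, which is exactly why training with domain-specific transcripts improves accuracy.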

Step 3: Integrate speech recognition with Python

Integrating an API platform such as the one Scriptix offers is a no-brainer for developers. Our online API documentation gives you all the info you need to set up a speech recognition workflow in no time. In other words, if you can connect to APIs, you can integrate a service such as Scriptix speech to text into workflows you already have in place; it is an extra piece of the puzzle, complementary to the services you already offer your clients. As a quick reference implementation, users can check out our Python SDK.
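To give a feel for what such an integration looks like, here is a sketch of submitting a transcription job over HTTP. The endpoint URL, payload fields and token below are hypothetical placeholders, not the actual Scriptix API; the real interface is defined in the online API documentation and Python SDK.

```python
import json
import urllib.request

# Hypothetical endpoint and token, for illustration only; the real paths
# and authentication scheme are defined in the vendor's API documentation.
API_URL = "https://api.example.com/v1/speech-to-text"
API_TOKEN = "your-api-token"

def build_transcription_request(audio_url: str, language: str = "en"):
    """Build (but do not send) a POST request that submits a transcription job."""
    payload = json.dumps({"audio_url": audio_url, "language": language}).encode()
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_transcription_request("https://example.com/interview.wav", "nl")
print(req.get_method(), req.full_url)
```

In a real workflow the request would be sent with urllib.request.urlopen (or an HTTP client of your choice), after which the returned JSON transcript can be post-processed as described earlier.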

Users without any technical background can use the system as well, by simply logging in with their credentials and uploading files on the homepage. Once done, in the transcripts section users can check the results, make corrections using our editor, and download them in various formats.

There are many options for automatic speech to text software out there, from paid services to free and open-source options. The difference between the two lies mainly in the quality of the output they generate. Paid services such as Scriptix speech to text are aimed at generating the best possible output for the user. To that end we work together with customers to update and customize models based on their content to generate much more accurate transcripts. With free services the approach is always a generic one, what you see is what you get. For some use cases this can be just fine, but when accuracy is important, a paid service will surely be the way to go.

Moreover, open-source projects such as Kaldi, which Scriptix also contributes to, may be free, but actually applying them requires specific expertise. You need qualified machine learning engineers who know how to build and curate the right datasets to make an open-source project such as Kaldi work for you.

Free services can be just fine but are always limited. For people who occasionally need to process a few minutes of content this can be enough, but for larger content producers that need to process, say, a couple of hours per week, such a restriction does not work.

Finally, free services usually do come with a price: you give away your data. At Scriptix, for example, we strongly believe in privacy and by default delete all customer data right after processing.

Free Services                                                  | Paid Services
Low accuracy                                                   | High accuracy
Limited processing                                             | Unlimited processing
No support                                                     | Support
No clear data storage guidelines                               | Clear data storage guidelines
Applying an open-source framework requires specific expertise  | Specific expertise in house

Automatic Speech Recognition is an exciting field, but it can also be a complicated one. That is why we want to use the FAQ section below to help visitors with questions they might have. Can't find your question here? No worries, simply reach out to us directly using the contact form or [email protected].

What is speech recognition?

Speech Recognition, also known as Automatic Speech Recognition (ASR), Computer Speech Recognition, Voice to Text or Speech to Text, is the field of computer science concerned with developing technologies that turn the spoken word into text automatically.

How does speech recognition work?

A speech recognition engine is made up of a language model that consists of an acoustic model and a lexicon (or vocabulary). When you upload audio to the engine, it analyses the audio for sounds it recognizes (this is what the acoustic model is for); the lexicon then calculates the probability of which words those sounds belong to. In this way audio is matched with words, and your audio files are turned into text. Each word is also linked to the specific moment in the audio at which it occurs, i.e. timestamped. The generated output is a transcript that can be used to subtitle movies or make audio archives searchable.

What is offline speech recognition data?

Working in the cloud is increasingly becoming the standard. Microsoft Azure, AWS and Google are building datacentres around the world, enabling users to build scalable applications at an increasing rate. With the cloud, however, there is always the question of privacy. That is why certain organizations that hold sensitive data, such as police forces or banks, prefer not to send their audio to a platform because of the risk of data leaks. Offline speech recognition provides a solution: the speech recognition model is containerized and deployed on the customer's servers. The system then runs within their own environment, and there is no need to send audio to the cloud over an internet connection.

How do I use speech recognition?

It depends on your use case. If you are looking to transcribe interviews, for example, it is sufficient to create an account with Scriptix, log in and upload your files on the homepage. You can download the results and use them as you wish. If you would like to add Scriptix speech to text as an additional functionality to your product portfolio, you would integrate our API platform into your workflows. We have made this as easy as possible: by following our online API documentation, any developer should be able to integrate our platform. And the good thing is, if you do struggle, we are always here to help. Simply get in touch using the contact form.

How do I add speech recognition?

If you want to integrate the Scriptix speech to text APIs, follow our online API documentation or have a quick look at our Python SDK as a reference implementation. Need help? Get in touch using the contact form.