Speech to text:
Everything about speech & voice recognition

Speech Recognition, or Automatic Speech Recognition (ASR), Computer Speech Recognition, Voice to Text or Speech to Text, are all names for the field of computer science that deals with the development of technologies which can turn the spoken word into text.

Natural Language Processing

Siri, Alexa, Cortana and Ok-Google are widely known examples of interfaces that have been developed using advanced ASR models. Speech Recognition is a branch of a large scientific domain known as Natural Language Processing (NLP). 

NLP involves everything related to modern computational linguistics. Other fields associated with NLP are Natural Language Understanding and Natural Language Generation.

The first is aimed at extracting analytical insights from language; the second is a process that transforms data into natural language. When the source material is audio, both of these domains require speech to text as a starting point.

After all, you cannot create natural language from data if you do not have the text, just as you cannot extract relevant analytics if you have not first turned the spoken word into text.

An ASR system consists of statistical models that map continuous sequences of speech sounds (spoken utterances, or speech waveforms) to readable text in a human language. The ASR model contains a Language Model, a Pronunciation Model (Lexicon/Dictionary) and an Acoustic Model. As the models are trained on more speech data from multiple speakers and with an extended vocabulary (Language Model), the accuracy of the transcription increases. This accuracy is measured by the Word Error Rate (WER): the number of substituted, deleted and inserted words divided by the number of words in a reference transcript.

Ideally, a highly accurate model has a WER below 10%. The single most important factor when training a speech recognition model is the quality of the dataset: the audio and the lexicon. After all, that data is what the models learn from. The same goes for using the models: if the audio is of low quality, so is the output. It is the same with humans. If someone speaks unclearly, we cannot understand them, and if the audio is unclear, a machine cannot work out what is being said either.
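
To make the WER figure concrete, here is a minimal Python sketch that computes it as the word-level edit distance between a reference transcript and the recognized text, divided by the number of reference words. The function name and the simple lowercase-and-split normalization are assumptions made for illustration; production scoring tools usually normalize punctuation and numbers more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "please send the report to the finance team before friday"
hypothesis = "please send the report to the finance theme before friday"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")  # one wrong word out of ten
```

Because insertions count as errors too, the WER can exceed 100% when the output contains many more words than the reference.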

Speech Recognition enables users to make audio content accessible. A lot of useful information is locked away in audio, where it is not easy to search. By applying speech to text, audio is turned into text and therefore becomes both accessible and searchable at word level. This means the output can be used as automatic subtitles for people with hearing loss, and as a means of indexing content archives, in addition to existing metadata, to make large archives easier to search. Think of a journalist looking up footage for a story. Instead of trying to guess the date on which a certain event occurred, the journalist can simply look up the footage using keywords about that event.

Figure 1: Speech to text process generic model

  1. User uploads recorded audio content to the platform.
  2. The acoustic model within the Speech Recognition Engine analyses sounds.
  3. The lexicon model syncs the sounds with the right words.
  4. The language model structures the results and delivers a raw text file (JSON) with all words having a confidence score, speaker ID and timestamp.
  5. The file can be restructured as a transcript or subtitle file.

Figure 2: Speech to text process custom model

  1. User uploads recorded audio content to the platform.
  2. The acoustic model trained with customer data (audio) within the Speech Recognition Engine analyses sounds.
  3. The lexicon model trained with customer data (transcripts) syncs the sounds with the right words.
  4. The custom language model structures the results and delivers a raw text file (JSON) with all words having a confidence score, speaker ID and timestamp.
  5. The file can be restructured as a transcript or subtitle file.
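
The last two steps, turning the raw word-level JSON into a subtitle file, can be sketched in Python as follows. The JSON field names (`word`, `start`, `end`, `confidence`, `speaker`) are assumptions made for this example and not the actual Scriptix output schema; the SRT cue layout itself is the standard one.

```python
import json


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=7):
    """Group timestamped words into numbered SRT cues of at most max_words each."""
    cues = []
    for number, i in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[i:i + max_words]
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{number}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Hypothetical raw output: one entry per recognized word.
raw_json = """
[
  {"word": "Welcome", "start": 0.00, "end": 0.42, "confidence": 0.98, "speaker": "S1"},
  {"word": "to",      "start": 0.42, "end": 0.55, "confidence": 0.99, "speaker": "S1"},
  {"word": "the",     "start": 0.55, "end": 0.63, "confidence": 0.97, "speaker": "S1"},
  {"word": "council", "start": 0.63, "end": 1.10, "confidence": 0.91, "speaker": "S1"},
  {"word": "meeting", "start": 1.10, "end": 1.58, "confidence": 0.95, "speaker": "S1"}
]
"""

print(words_to_srt(json.loads(raw_json), max_words=3))
```

Splitting on a fixed number of words keeps the sketch short. A real subtitle workflow would more likely break cues on pauses, speaker changes and maximum line length.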

Step 1: Choose the functions and features of speech recognition

Speech recognition services come in many forms. Some companies focus on transcription, others on using speech to text for subtitling, and others offer speech to text as a means of indexing large content archives. Whatever the use case may be, there is likely an option out there that fits your requirements.

It is important to bear in mind that in the end speech recognition services provide the means to turn the spoken word into text, and with text you can do all sorts of things. At Scriptix we provide users with an API-platform to integrate this process of turning speech to text into their existing workflows.

Step 2: Convert speech to text with an API and in different languages

The great thing about automatic speech recognition is that models can be built for any language; all that is needed is the right dataset. In practice, building a model for a given language requires thousands of hours of audio in that language, as well as hundreds of hours of perfect transcripts in that language.

Using the audio data, engineers can build an acoustic model that contains the specific sounds of the language, and with the transcript data they can build a lexicon that contains its specific words. Together with the language model, these make up the speech recognition model. By applying artificial intelligence and running multiple training iterations with this data, the model becomes better and better at making the right combinations between sounds and words. No vendor supports all the languages and dialects of the world, but in theory this is possible as long as the model can be trained with the right data sets.

Step 3: Integrate speech recognition with Python

Integrating an API platform such as the one Scriptix offers is straightforward for developers. Our online API documentation contains the information you need to set up a speech recognition workflow. In other words, if you can connect to APIs, you can integrate a service such as Scriptix speech to text into the workflows you already have in place. It is like an extra piece of the puzzle that complements the services you already offer your clients. As a quick reference implementation, users can check out our Python SDK.
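
As a rough illustration of what such an integration looks like, the Python sketch below prepares an HTTP request that submits an audio file for transcription. Everything specific in it is a placeholder: the endpoint URL, the bearer-token header, the JSON field names and the helper function are assumptions made for this example, not the actual Scriptix API. The API documentation and the Python SDK remain the reference for real endpoints and parameters.

```python
import json
import urllib.request

# Placeholder endpoint for illustration only, not a real Scriptix URL.
API_URL = "https://api.example.com/v1/transcriptions"


def build_transcription_request(audio_url: str, language: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request asking for a transcript."""
    payload = json.dumps({"audio_url": audio_url, "language": language}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed authentication scheme
            "Content-Type": "application/json",
        },
    )


request = build_transcription_request(
    audio_url="https://example.com/interview.mp3",
    language="en",
    token="YOUR_API_TOKEN",
)

# Sending the request needs a real endpoint and a valid token:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Keeping request construction separate from sending makes it easy to inspect or unit-test the payload before any audio leaves your environment.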

Users without any technical background can use the system as well, by simply logging in with their credentials and uploading files on the homepage. Once done, in the transcripts section users can check the results, make corrections using our editor, and download them in various formats.

There are many options for automatic speech to text software out there, from paid services to free and open-source options. The difference between the two lies mainly in the quality of the output they generate. Paid services such as Scriptix speech to text are aimed at generating the best possible output for the user. To that end we work together with customers to update and customize models based on their content, which produces much more accurate transcripts. With free services the approach is generic: what you see is what you get. For some use cases this can be just fine, but when accuracy is important, a paid service is usually the way to go.

Moreover, open-source projects such as Kaldi, which Scriptix also contributes to, may be free, but applying them requires specific expertise. You would need qualified machine learning engineers who know how to build and curate the right data sets in order to make an open-source project such as Kaldi work for you.

Free services are also typically limited in how much content they can process. For people who occasionally need to process a few minutes of content this can be fine, but for larger content producers that need to process a couple of hours per week, for example, such a restriction does not work.

Finally, free services usually do come with a price, and that is that you give away your data for free. At Scriptix for example, we strongly believe in privacy and by default delete all customer data right after processing.

Free Services | Paid Services
Low accuracy | High accuracy
Limited processing | Unlimited processing
No support |
No clear data storage guidelines | Clear data storage guidelines
Applying an open-source framework requires specific expertise | Specific expertise in house

Creating an archive of subtitled video content with speech to text

Subtitles are important and useful in all sorts of media. They help you reach a broader audience, improve entertainment value, and enhance the accessibility and visibility of your content. In addition, the converted transcript of the subtitles can be used to turn your media into a searchable archive. The challenge lies in the fact that creating subtitles can be extremely time consuming. Fortunately, speech to text software like Scriptix helps you resolve this problem.

Turn your media into a searchable archive

Imagine being able to transcribe any video, graduation speech, seminar or course and turn it into a searchable archive. Speech to text software, like Scriptix, offers you the possibility to process any conversation, be it live or recorded, and convert it into written text. The only thing left for you is to index the transcript in order to create a searchable archive.
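
Indexing a transcript can be as simple as building an inverted index that maps each word to the recordings and timestamps where it was spoken. The Python sketch below shows the idea; the data layout (a dictionary of `(start_seconds, word)` pairs per file) and the function names are assumptions made for illustration. A production archive would more likely use a dedicated search engine.

```python
import re
from collections import defaultdict


def build_index(transcripts):
    """Map each word to a list of (file name, start time in seconds) hits.

    transcripts: {file_name: [(start_seconds, word), ...]}
    """
    index = defaultdict(list)
    for file_name, words in transcripts.items():
        for start, word in words:
            token = re.sub(r"[^\w]", "", word.lower())  # drop punctuation
            if token:
                index[token].append((file_name, start))
    return index


def search(index, keyword):
    """Return every (file name, start time) where the keyword was spoken."""
    return index.get(keyword.lower(), [])


# Hypothetical timestamped transcripts of two recordings.
transcripts = {
    "council_meeting.mp4": [(12.4, "The"), (12.7, "budget"), (13.2, "was"), (13.5, "approved.")],
    "economics_lecture.mp4": [(3.0, "Budget,"), (3.6, "deficit"), (4.1, "and"), (4.4, "debt")],
}

index = build_index(transcripts)
for file_name, start in search(index, "budget"):
    print(f"{file_name} at {start:.1f}s")
```

Because every hit carries a timestamp, a search result can jump straight to the moment in the recording where the keyword is spoken.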

Increasing your videos’ value and searchability with customized speech to text models

Subtitling your video content offers dozens of great benefits. But having your videos automatically transcribed can be a challenge, especially if they contain unfamiliar jargon or unique dialects. Fortunately, you don’t have to settle for time-consuming manual transcriptions. With the power of customization, we can create a custom speech to text model that will transcribe your videos far more accurately. Then, you can take full advantage of your subtitled videos, increasing accessibility, creating a searchable video archive, and more.

Full how to: subtitling livestreams, movies, webinars, classes, debates and lectures in batches or in real-time

For too many people creating educational video content, subtitles are little more than an afterthought. But they shouldn’t be. Whether you’re creating entertainment or educational content, subtitles are one of the most powerful tools in your toolbox. They increase the accessibility of your videos, empowering those with hearing disabilities to watch and engage. They give users flexibility with where and how they’ll watch your content. And they give you the ability to archive your content in a searchable format so it’s easier for your users to find exactly what they’re looking for. And what’s more, with speech-to-text software like Scriptix, you can even transcribe content in real time and process multiple videos. In other words, it’s never been easier to subtitle your video content, from webinars and debates to lectures and more.

3 tips for making your meetings more accessible and inclusive with speech to text

In today’s world, transparency, inclusivity, and accessibility are more than just popular buzzwords. They’re the future of ethical governance and public sector practices. That’s why it’s essential that you begin building a more open culture as soon as possible. And one of the easiest places to begin is in the way you conduct your meetings. By building greater inclusivity into your meetings, you’ll be planting seeds that will bear fruit across your entire organization. And with Scriptix, you can implement some best practices for ethical meetings, including boosting accessibility with archived recordings and meeting transcriptions along with subtitling your meetings in real time. But this just scratches the surface of how you can make your meetings more inclusive and accessible. Read on to discover our tips for taking openness to the next level in your meetings.

How speech to text empowers students & educational institutions

Speech to text technology offers students, teachers, and educational institutions a host of valuable benefits. Obviously, closed captions ensure that students who are deaf or hard of hearing are able to consume the same content as everyone else. But that’s only the tip of the iceberg. Research has revealed that subtitles help students focus better, remember more information, boost literacy skills, and more. Speech to text software makes it possible for students to transcribe lectures so that they can review in-class content more effectively. With speech to text, the entire educational enterprise can be enhanced in a number of important ways. And with Scriptix, taking advantage of speech to text has never been easier.

Why good subtitles are essential

Good subtitles are essential. They make your content more accessible to deaf and hearing impaired people. But that’s only one of the reasons that all of your video content should be subtitled. They also create a better, more consistent viewing experience for all people. They make it possible to turn any video into a learning opportunity. And they even give content creators the ability to build their platform by boosting their site’s SEO score and making it a snap to repurpose old content. If you’ve been under the impression that subtitles were just a nice thing to do for people who are hard of hearing, think again. Subtitles can empower you and your entire audience in dozens of incredible ways. And with Scriptix, it’s never been simpler to get your content transcribed and subtitled.

Full guide on creating a transcript for your podcast

The podcast industry shows no signs of slowing down. In fact, recent estimates say that 2021 will see a 10% increase in the number of monthly US listeners. So right now is the perfect time for you to make your voice heard. But if you’re thinking about starting a podcast, you should consider the difference a transcription can make. By publishing a transcription of your podcast with the audio, you’ll boost your SEO score and be more likely to be found by your target audience. And if you create a video version of your podcast, a transcription will allow you to enhance it with subtitles that make your content more accessible. And then there’s the possibility of repurposing podcast content into blogs and articles, so your audience will have access to your content the way they want it – by listening, watching, or reading. To put it simply, transcriptions can make a world of difference for your podcast.

How speech to text benefits your SEO strategy

Getting people to your site without using PPC ads can be a major challenge. And yet, if you don’t find a way to do it, you’ll waste your entire marketing budget on digital advertising. That’s why you need a solid SEO strategy. By optimizing your site and giving search engines what they want, you’ll ensure that your web pages rank higher and you get more organic traffic.

If you’re interested in making your site as SEO-friendly as possible, you’ll want to incorporate a variety of tactics, including creating high-quality, relevant content, streamlining your UI, and making sure that your site’s structure is clean and makes sense. But those aren’t the only ways to improve your site’s SEO. With speech to text technology, you can multiply your SEO efforts by transcribing and subtitling all of your video content, enjoying a wide variety of benefits in the process.

How voice bots can benefit your website

Businesses and other organizations have been using chatbots to enhance their users’ online experience for years. Chatbots allow users to get answers to their questions immediately, without searching or browsing. But they aren’t perfect. Fortunately, recent advances in natural language processing and speech to text technology have made it possible to create voicebots, online virtual assistants that respond to the spoken word. Voicebots allow users to ask questions, search, and navigate a site by simply speaking. And since the speech is processed in real-time, voicebots give the illusion of talking to an actual human being.

If you’d like to increase efficiency and offer a more streamlined user experience for visitors to your website, voicebots are a fantastic solution. While they aren’t everywhere just yet, they will be in the near future. So, it’s a great time to learn more about them and begin implementing them into your website. By doing so, you’re sure to stand out from the crowd.

Automatic Speech Recognition is an exciting field, but it can also be a complicated one. That is why we use the FAQ section below to answer questions our visitors might have. Can’t find your question here? No worries, simply reach out to us directly using the contact form or [email protected].

What is speech recognition?

Speech Recognition, or Automatic Speech Recognition (ASR), Computer Speech Recognition, Voice to Text or Speech to Text, are all names for the field of computer science that deals with the development of technologies which can turn the spoken word into text automatically. 

How does speech recognition work?

A speech recognition engine is made up of an acoustic model, a lexicon (or vocabulary) and a language model. When you upload audio to that engine, it first analyses the audio for sounds it recognizes (this is what the acoustic model is for). The lexicon is then used to calculate the probability that a given sequence of sounds belongs to a given word. In this way audio is matched with words, and your audio files are turned into text. Each word is also linked to the specific moment in the audio at which it was spoken, that is, timestamped. The output is a transcript that can be used to subtitle movies or make audio archives searchable.

What is offline speech recognition data?

Working in the cloud is increasingly becoming the standard. Microsoft Azure, AWS and Google are building datacentres around the world, enabling users to build scalable applications at an increasing rate. With the cloud, however, there is always the question of privacy. Certain organizations that hold sensitive data, such as the police or banks, prefer not to send their audio to an external platform because of the risk of data leaks. Offline speech recognition can provide a solution: the speech recognition model is containerized and deployed on the customer’s servers. The system then runs within their own environment, and there is no need to send audio to the cloud over an internet connection.

How do I use speech recognition?

It depends on your use case. If you are looking to transcribe interviews, for example, it is sufficient to create an account with Scriptix, log in and upload your files on the homepage. You can then download the results and use them as you wish. If you would like to add Scriptix speech to text as an additional functionality to your product portfolio, you would have to integrate our API platform into your workflows. We’ve made this as easy as possible: by following our online API Documentation, any developer should be able to integrate our platform. And if you do struggle, we are always here to help. Simply get in touch using the contact form.

How do I add speech recognition?

If you want to integrate the Scriptix speech to text APIs, follow our online API Documentation or have a quick look at our Python SDK as a reference implementation. Need help? Get in touch by using the contact form.