In-Browser Speech to Text Using the Web Speech API

Nil Seri
4 min read · Nov 28, 2024

--

Exploring the Web Speech API through an Angular App

Photo by Zdeněk Macháček on Unsplash

SpeechRecognition

The SpeechRecognition interface of the Web Speech API is the controller interface for the recognition service.

The Web Speech API has two functions:
speech synthesis — text to speech
speech recognition — speech to text.

You can check the browser compatibility table here.

It is apparently not supported in the Brave browser; it gave me a “network” error there, so I switched back to Chrome.

Brave Browser Error
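Because of these support gaps, it is worth feature-detecting the API before constructing a recognizer: Chrome still exposes the constructor under a webkit prefix, and unsupported browsers expose neither name. A minimal sketch (the helper name is mine, and the lookup takes an explicit global object only so the pattern is easy to test):

```javascript
// Feature-detect the SpeechRecognition constructor.
// Chrome exposes it as webkitSpeechRecognition; browsers without
// support expose neither name, so null is returned.
function getSpeechRecognition(globalObj = globalThis) {
  return globalObj.SpeechRecognition || globalObj.webkitSpeechRecognition || null;
}

// In a browser you would then do:
// const Ctor = getSpeechRecognition();
// const recognition = Ctor ? new Ctor() : null;
```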

The SpeechRecognitionEvent interface of the Web Speech API represents the event object for the results and nomatch events, and contains all the data associated with an interim or final speech recognition result.

The results property is a list of SpeechRecognitionResult objects. Inspecting a result shows a list of SpeechRecognitionAlternative objects; the first one includes the transcript of what you said and a confidence value between 0 and 1.

SpeechRecognitionEvent
SpeechRecognitionResultList
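That nested structure can be walked with plain indexing. A sketch (extractTranscript and the fixture shape are my own, but the indexing mirrors event.results[i][0].transcript as described above):

```javascript
// Join the first (highest-confidence) alternative of each result into
// one transcript string. Works on a real SpeechRecognitionResultList
// or any array-like of array-likes with the same shape.
function extractTranscript(resultList) {
  let transcript = '';
  for (let i = 0; i < resultList.length; i++) {
    transcript += resultList[i][0].transcript;
  }
  return transcript;
}

// Inside an onresult handler you would call:
//   extractTranscript(event.results)
```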

SpeechRecognition Properties:

continuous:
true — results are captured continuously for the whole session.
false — just a single result is returned each time recognition is started.

interimResults:
true — the speech recognition system returns interim (partial) results.
false — the speech recognition system returns just final results.

lang:
a string representing the BCP 47 language tag such as en-US, en-GB, de-DE, fr-FR, es-ES, tr-TR, etc.

BCP 47 is the IETF Best Current Practice for language tags. You will most commonly find language tags written with two subtags: language and region.
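Put together, the three properties might be set like this (a sketch; configureRecognition is a hypothetical helper, and recognition stands for a SpeechRecognition instance):

```javascript
// Apply the three properties discussed above to a recognition instance.
function configureRecognition(recognition) {
  recognition.continuous = true;      // keep capturing results until stop()
  recognition.interimResults = true;  // deliver partial transcripts too
  recognition.lang = 'en-US';         // BCP 47 language-region tag
  return recognition;
}
```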

SpeechRecognition Methods:

start(): starts the speech recognition service listening to incoming audio with intent to recognize grammars.

stop(): stops the speech recognition service from listening to incoming audio, and attempts to return a SpeechRecognitionResult using the audio captured so far.
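A common UI pattern is a single button that alternates between these two methods. A sketch (makeToggle is a hypothetical helper; recognition is anything exposing start() and stop()):

```javascript
// Returns a function that starts recognition on the first call,
// stops it on the next, and so on. The return value reports whether
// recognition is now listening.
function makeToggle(recognition) {
  let listening = false;
  return function toggle() {
    if (listening) {
      recognition.stop();
    } else {
      recognition.start();
    }
    listening = !listening;
    return listening;
  };
}
```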

Events:

  • listen to these events using addEventListener(), or assign a handler to the corresponding oneventname property (e.g. onresult).

start: Fired when the speech recognition service has begun listening to incoming audio with intent to recognize grammars.

result: Fired when the speech recognition service returns a result — a word or phrase has been positively recognized.

error: Fired when a speech recognition error occurs.

end: Fired when the speech recognition service has disconnected.

Besides start, there are also audiostart, soundstart, and speechstart events, along with their end counterparts (audioend, soundend, speechend).

start vs audiostart vs soundstart vs speechstart events

Their flow occurs as follows:

start → audiostart → soundstart → speechstart → speechend → soundend → audioend → result → end
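One way to observe this order yourself is to register a listener for every lifecycle event and log the names as they arrive. A sketch (logLifecycleEvents is a hypothetical helper; it works with any EventTarget, so a real SpeechRecognition instance fits):

```javascript
// Lifecycle events in the order the article describes.
const LIFECYCLE_EVENTS = [
  'start', 'audiostart', 'soundstart', 'speechstart',
  'speechend', 'soundend', 'audioend', 'result', 'end',
];

// Push the name of each lifecycle event into `log` as it fires.
function logLifecycleEvents(target, log = []) {
  for (const type of LIFECYCLE_EVENTS) {
    target.addEventListener(type, () => log.push(type));
  }
  return log;
}
```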


for Speech Recognition (Speech-to-Text):

Calling this feature speech recognition “in the browser” is not exactly accurate. When using the SpeechRecognition interface of the Web Speech API, your speech input is often sent to remote servers (e.g., Google’s servers in Chrome) for processing.

for Speech Synthesis (Text-to-Speech):

SpeechSynthesis, the text-to-speech part of the API, generally does not send data to servers. It works locally using the speech synthesis engines installed in the operating system or the browser itself.

I have developed an Angular project with Angular v19 and Bootstrap. You can find the code here:

Since SpeechRecognition event callbacks fire outside of Angular's zone, Angular's change detection may not notice changes to the transcript property. To fix this, you need to wrap the update in Angular's NgZone.
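The fix can be sketched without pulling in Angular itself: in a real component you would inject NgZone and call zone.run() inside the onresult callback. Below, zone is any object with a run(fn) method, component stands for the component instance, and the event shape follows results[0][0].transcript from earlier (the helper name is mine):

```javascript
// Wrap the transcript update in zone.run() so Angular's change
// detection notices the new value. In a real component, `zone` is the
// injected NgZone and `component` is `this`.
function makeZonedResultHandler(zone, component) {
  return function onResult(event) {
    const transcript = event.results[0][0].transcript;
    zone.run(() => {
      component.transcript = transcript;
    });
  };
}
```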

You can try it out here.

Here are some screenshots:

start screen
started — interim results
stopped — result

Happy Coding!

