Reducing latency in AI Speech Synthesis

By Dave Bitter

AI-powered speech synthesis is getting incredibly realistic. This opens up many possibilities to generate realistic audio based on the text you provide. Whilst relatively fast, the latency still isn’t low enough for “real-time synthesis”. Let’s optimise that!

In my previous article, I showed how you can interact with ChatGPT through a Voice UI on the web. If you haven’t already, read that article first to see what I built. What sells the illusion of having a real-time conversation with the AI is the low latency. Because the response is so quick, it doesn’t feel like the AI needs to process your input and create an audio file to play back to you. Even though this part feels realistic, the robotic voice of the speech synthesis doesn’t. My colleague Christofer Falkman pointed me to a way I could make Aiva, the ChatGPT-powered assistant, even more realistic. Using ElevenLabs’s AI-powered speech synthesis, I can replace this robotic voice with an incredibly realistic one. Time to upgrade Aiva!

Let’s implement it

As you might remember from the previous article, Aiva works like this:

Schema showing the turn based conversation flow of Aiva

Instead of using the native SpeechSynthesis Web API, we now call the ElevenLabs API, which returns binary data in the form of a buffer. Or, simplified, we get a bit of data we can play as audio. As soon as the entire piece of text has been sent to that API, returned to the application, and played as audio, the user gets to talk again. This yields a pretty cool result with an incredibly realistic voice:

Schema showing the turn based conversation flow of Aiva with sidestep to the ElevenLabs API
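
To make the sidestep to the API a bit more concrete, here is a minimal sketch of such a call in TypeScript. It is not the code Aiva uses. The endpoint path and the `xi-api-key` header follow the ElevenLabs text-to-speech REST API as I understand it, so check the current documentation before relying on them. The names `buildSpeechRequest`, `fetchSpeech` and `FetchLike` are made up for this example.

```typescript
// Shape of the HTTP request we send. Defined locally so the sketch is
// self-contained; these names are illustrative, not part of any SDK.
interface SpeechRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// The small subset of `fetch` this sketch relies on. The real `fetch`
// satisfies this type, and a fake can be passed in for testing.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number; arrayBuffer(): Promise<ArrayBuffer> }>;

// Build the request for one piece of text.
// Assumption: endpoint and header names match the current ElevenLabs API.
function buildSpeechRequest(text: string, voiceId: string, apiKey: string): SpeechRequest {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}`,
    method: "POST",
    headers: {
      "xi-api-key": apiKey,
      "Content-Type": "application/json",
      Accept: "audio/mpeg",
    },
    body: JSON.stringify({ text }),
  };
}

// Send the text and resolve with the raw audio bytes.
async function fetchSpeech(
  text: string,
  voiceId: string,
  apiKey: string,
  fetchImpl: FetchLike
): Promise<ArrayBuffer> {
  const { url, ...init } = buildSpeechRequest(text, voiceId, apiKey);
  const response = await fetchImpl(url, init);
  if (!response.ok) {
    throw new Error(`Speech synthesis failed with status ${response.status}`);
  }
  return response.arrayBuffer();
}
```

In the browser you could pass the global `fetch` as `fetchImpl` and hand the resulting buffer to something like `AudioContext.decodeAudioData` to play it.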

It’s slow!

You might notice that the AI-powered speech synthesis is quite a bit slower. The latency increases with the length of the text. It first needs to turn all of the text into audio before it can play the first sentence. The longer the text, the slower the response. This is breaking the user experience of having a natural conversation.

Low latency over realistic voice

In my opinion, having a low-latency robotic voice in a real-time conversation is a better user experience. It doesn’t take you out of the conversation as much. So what now? Don’t use the AI version? Of course not, let’s fix this!

Time to take some well-known approaches

First, I looked into whether the audio could be streamed from the API. Unfortunately, I couldn’t find this option and didn’t want to limit my choice of which AI-powered speech synthesis products I could use. Then, I thought about how, as a developer, I would normally handle large, slow requests. Imagine you are making a dashboard with 10,000 rows of data. You could opt for pagination where you click a button to go to the next page of results. Chunking the data into pages of rows works great. Taking this into a conversation, we could chunk the text into sentences and play the audio for each sentence. Whilst this is the approach I took, it posed a new problem. Let’s imagine I retrieve audio for a sentence of the text, play the audio, and then do this over and over again for the entire piece of text. This means that the latency is pretty much just split up, but still present:

Schema showing the conversation chunked into sentences in a serialized manner
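
The chunking step itself can be sketched in a few lines of TypeScript. This is a simple illustration, not necessarily how Aiva splits text: `splitIntoSentences` is a made-up helper that breaks on `.`, `!` and `?`, so abbreviations such as "e.g." would be split too eagerly.

```typescript
// Split a piece of text into sentences so each one can be sent to the
// speech synthesis API separately.
// Assumption: a sentence ends at one or more of . ! ? (optionally followed
// by a closing quote or bracket). Trailing text without punctuation is kept
// as a final chunk.
function splitIntoSentences(text: string): string[] {
  const matches = text.match(/[^.!?]+[.!?]+["')\]]*|[^.!?]+$/g);
  return (matches ?? [])
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);
}
```

For example, `splitIntoSentences("Hello there. How are you?")` gives the two chunks `"Hello there."` and `"How are you?"`.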

As you can see, there is quite a bit of time while the audio is playing which I could utilize to retrieve the audio data for the next sentence. This is pretty similar to how humans talk. You think of what to say and while you are talking you’re already thinking of the next sentence. This results in a fluent flow where there is always audio playing:

Schema showing the conversation chunked into sentences in a synchronous manner
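
Here is one way to sketch this overlap in TypeScript: while one sentence is playing, the request for the next one is already in flight. It is an illustration under assumptions, not Aiva’s actual code. `speakPipelined`, `synthesize` and `play` are hypothetical names, where `synthesize` stands for whatever call turns a sentence into audio and `play` resolves once that audio has finished playing.

```typescript
type Synthesize<A> = (sentence: string) => Promise<A>;
type Play<A> = (audio: A) => Promise<void>;

// Play the sentences in order, fetching the audio for the next sentence
// while the current one is still playing.
async function speakPipelined<A>(
  sentences: string[],
  synthesize: Synthesize<A>,
  play: Play<A>
): Promise<void> {
  // Start a request and mark a possible rejection as handled for now.
  // The error still surfaces later, when the promise is awaited.
  const start = (sentence: string): Promise<A> => {
    const request = synthesize(sentence);
    request.catch(() => {});
    return request;
  };

  const [first, ...rest] = sentences;
  if (first === undefined) return;

  let current = start(first);
  for (const sentence of rest) {
    const audio = await current; // wait for the audio we need right now
    current = start(sentence);   // kick off the next request before playing
    await play(audio);           // the next request runs during playback
  }
  await play(await current);     // play the final sentence
}
```

Only the first sentence still has to be waited for in full. As long as each request finishes before the previous sentence has stopped playing, there is always audio playing.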

The end result

That did the trick! The latency is now low enough to not intrude on the natural flow of the conversation.

