Reducing latency in AI Speech Synthesis
Reducing latency in AI Speech Synthesis
By Dave Bitter
4 min read
AI-powered speech synthesis is getting incredibly realistic. This opens up many possibilities to generate realistic audio based on the text you provide. Whilst relatively fast, the latency still isn’t low enough for “real-time synthesis”. Let’s optimise that!
- Authors
- Name
- Dave Bitter
- linkedinDave Bitter
- twitter@dave_bitter
- Github
- githubDaveBitter
- Website
- websiteBlog
In my previous article, I showed how you can interact with ChatGPT through Voice UI on the web. If you haven’t already, read that article first to know what I built. What sells the illusion of having a real-time conversation with the AI is the low latency. Because the response is so quick, it doesn’t feel like the AI needs to process your information and create an audio file to playback to you. Even though this part feels realistic, the robotic voice for the speech synthesis doesn’t. My colleague Christofer Falkman pointed me to a way I could make Aiva, the ChatGPT-powered assistant, even more realistic. Using ElevenLabs’s AI-powered speech synthesis I can replace this robotic voice with an incredibly realistic voice. Time to upgrade Aiva!
Let’s implement it
As you might remember from the previous article, Aiva works like this:
Instead of using the native SpeechSynthesis Web API, we now replace it with a call to the ElevenLabs API where we get returned binary data in the form of a buffer. Or, simplified, we get a bit of data we can play as audio. As soon as the entire piece of text is sent to that API, returned to the application, and played as audio, the user gets to talk again. This yields a pretty cool result with an incredibly realistic voice:
It’s slow!
You might notice that the AI-powered speech synthesis is quite a bit slower. The latency increases with the length of the text. It first needs to turn all of the text into audio before it can play the first sentence. The longer the text, the slower the response. This is breaking the user experience of having a natural conversation.
Low latency over realistic voice
In my opinion, having a low-latency robotic voice in a real-time conversation is a better user experience. It doesn’t take you out of the conversation as much. So what now? Don’t use the AI version? Of course not, let’s fix this!
Time to take some well-known approaches
First, I looked into whether the audio could be streamed from the API. Unfortunately, I couldn’t find this option and didn’t want to limit my choice of which AI-power speech synthesis products I could use. Then, I thought about how, as a developer, I would normally handle large slow requests. Imagine you are making a dashboard with 10.000 rows of data. You could opt for pagination where you click a button to go to the next page of results. Chunking the data in pages of rows works great. Taking this into a conversation, we could chunk the text into sentences and play the audio for each sentence. Whilst this is the approach I took, this posed a new problem. Let’s imagine I retrieve audio for a sentence of the text, play the audio, and then do this over and over again for the entire piece of text. This means that the latency is pretty much just split up, but still present:
As you can see, there is quite a bit of time while the audio is playing which I could utilize to retrieve the audio data for the next sentence. This is pretty similar to how humans talk. You think of what to say and while you are talking you’re already thinking of the next sentence. This results in a fluent flow where there is always audio playing:
The end result
That did the trick! The latency is now low enough to not intrude on the natural flow of the conversation:
Upcoming events
The Test Automation Meetup
PLEASE RSVP SO THAT WE KNOW HOW MUCH FOOD WE WILL NEED Test automation is a cornerstone of effective software development. It's about creating robust, predictable test suites that enhance quality and reliability. By diving into automation, you're architecting systems that ensure consistency and catch issues early. This expertise not only improves the development process but also broadens your skillset, making you a more versatile team member. Whether you're a developer looking to enhance your testing skills or a QA professional aiming to dive deeper into automation, RSVP for an evening of learning, delicious food, and the fusion of coding and quality assurance! 🚀🚀 18:00 – 🚪 Doors open to the public 18:15 – 🍕 Let’s eat 19:00 – 📢 First round of Talks 19:45 – 🍹 Small break 20:00 – 📢 Second round of Talks 20:45 – 🍻 Drinks 21:00 – 🙋♀️ See you next time? First Round of Talks: The Power of Cross-browser Component Testing - Clarke Verdel, SR. Front-end Developer at iO How can you use Component Testing to ensure consistency cross-browser? Second Round of Talks: Omg who wrote this **** code!? - Erwin Heitzman, SR. Test Automation Engineer at Rabobank How can tests help you and your team? Beyond the Unit Test - Christian Würthner, SR. Android Developer at iO How can you do advanced automated testing for, for instance, biometrics? RSVP now to secure your spot, and let's explore the fascinating world of test automation together!
| Coven of Wisdom - Amsterdam
Go to page for The Test Automation MeetupCoven of Wisdom - Herentals - Winter `24 edition
Worstelen jij en je team met automated testing en performance? Kom naar onze meetup waar ervaren sprekers hun inzichten en ervaringen delen over het bouwen van robuuste en efficiënte applicaties. Schrijf je in voor een avond vol kennis, heerlijk eten en een mix van creativiteit en technologie! 🚀 18:00 – 🚪 Deuren open 18:15 – 🍕 Food & drinks 19:00 – 📢 Talk 1 20:00 – 🍹 Kleine pauze 20:15 – 📢 Talk 2 21:00 – 🙋♀️ Drinks 22:00 – 🍻 Tot de volgende keer? Tijdens deze meetup gaan we dieper in op automated testing en performance. Onze sprekers delen heel wat praktische inzichten en ervaringen. Ze vertellen je hoe je effectieve geautomatiseerde tests kunt schrijven en onderhouden, en hoe je de prestaties van je applicatie kunt optimaliseren. Houd onze updates in de gaten voor meer informatie over de sprekers en hun specifieke onderwerpen. Over iO Wij zijn iO: een groeiend team van experts die end-to-end-diensten aanbieden voor communicatie en digitale transformatie. We denken groot en werken lokaal. Aan strategie, creatie, content, marketing en technologie. In nauwe samenwerking met onze klanten om hun merken te versterken, hun digitale systemen te verbeteren en hun toekomstbestendige groei veilig te stellen. We helpen klanten niet alleen hun zakelijke doelen te bereiken. Samen verkennen en benutten we de eindeloze mogelijkheden die markten in constante verandering bieden. De springplank voor die visie is talent. Onze campus is onze broedplaats voor innovatie, die een omgeving creëert die talent de ruimte en stimulans geeft die het nodig heeft om te ontkiemen, te ontwikkelen en te floreren. Want werken aan de infinite opportunities van morgen, dat doen we vandaag.
| Coven of Wisdom Herentals
Go to page for Coven of Wisdom - Herentals - Winter `24 editionMastering Event-Driven Design
PLEASE RSVP SO THAT WE KNOW HOW MUCH FOOD WE WILL NEED Are you and your team struggling with event-driven microservices? Join us for a meetup with Mehmet Akif Tütüncü, a senior software engineer, who has given multiple great talks so far and Allard Buijze founder of CTO and founder of AxonIQ, who built the fundaments of the Axon Framework. RSVP for an evening of learning, delicious food, and the fusion of creativity and tech! 🚀 18:00 – 🚪 Doors open to the public 18:15 – 🍕 Let’s eat 19:00 – 📢 Getting Your Axe On Event Sourcing with Axon Framework 20:00 – 🍹 Small break 20:15 – 📢 Event-Driven Microservices - Beyond the Fairy Tale 21:00 – 🙋♀️ drinks 22:00 – 🍻 See you next time? Details: Getting Your Axe On - Event Sourcing with Axon Framework In this presentation, we will explore the basics of event-driven architecture using Axon Framework. We'll start by explaining key concepts such as Event Sourcing and Command Query Responsibility Segregation (CQRS), and how they can improve the scalability and maintainability of modern applications. You will learn what Axon Framework is, how it simplifies implementing these patterns, and see hands-on examples of setting up a project with Axon Framework and Spring Boot. Whether you are new to these concepts or looking to understand them more, this session will provide practical insights and tools to help you build resilient and efficient applications. Event-Driven Microservices - Beyond the Fairy Tale Our applications need to be faster, better, bigger, smarter, and more enjoyable to meet our demanding end-users needs. In recent years, the way we build, run, and operate our software has changed significantly. We use scalable platforms to deploy and manage our applications. Instead of big monolithic deployment applications, we now deploy small, functionally consistent components as microservices. Problem. Solved. Right? Unfortunately, for most of us, microservices, and especially their event-driven variants, do not deliver on the beautiful, fairy-tale-like promises that surround them.In this session, Allard will share a different take on microservices. We will see that not much has changed in how we build software, which is why so many “microservices projects” fail nowadays. What lessons can we learn from concepts like DDD, CQRS, and Event Sourcing to help manage the complexity of our systems? He will also show how message-driven communication allows us to focus on finding the boundaries of functionally cohesive components, which we can evolve into microservices should the need arise.
| Coven of Wisdom - Utrecht
Go to page for Mastering Event-Driven Design