Are voice assistants getting good?
Finally, we're getting close to voice apps being not shit. 2024 could be the breakout year.
One thing that has always left me a little bit stumped is how bad Siri and Alexa are.
That might be a little strong, given the technology powering both of these runs on hundreds of millions of devices, but putting that aside, it really should be better.
I might be in the minority here, but an always-on voice assistant that helps you complete a task the instant you think of it is exactly how I want my tech to work!
It’s SO close but it’s not quite there.
iPhone + AirPods = Cyborg?
Each day we walk around with a supercomputer in our pockets in the shape of an iPhone, and many of us also have either the best headphones ever created in the AirPods or a crazy-smart watch in the Apple Watch.
Add all these things together and you have a system that can listen to what we say and do stuff.
This can be as simple as sending someone an SMS or as complicated as triggering a multi-step Apple Shortcut that ties together 10 apps, grabbing and formatting data and firing off webhooks into other apps you may use.
By the way, if you don’t know what Shortcuts are, take a look at this guy.
While this is great, there is one part of this chain that’s a little bit hit or miss.
The Voice bit.
More often than I’d like, Siri and Alexa don’t understand what you’re saying, give you a random answer you don’t want, or cut in when you haven’t finished speaking.
Before speaking to Brian Atwood of Sindarin, my small brain had no idea how hard a problem this was. All I knew was that saying, “Hey Siri, <do the thing>” and having it not work just doesn’t feel like living in the future.
That said, this is changing quickly and I’m confident we’ll be there very soon.
What do we need to make this better?
There are a lot of people who are way smarter than me working on making voice assistants better and we’re getting a lot closer very quickly.
Here’s a simple list of the things we need to make voice assistants Jarvis-level.
Understand what we are saying
Sound like a human
Be fast at responding
Have conversational flexibility
Now let’s walk through each part of how to make these better.
Understand what’s being said
For a voice assistant to be useful, it has to know what you’re saying. Sounds simple, right? Well, nope.
Until recently, this was pretty difficult.
Open up a voice-to-text app and speak into it, and a large percentage of the time the app or assistant will recognise some words and not others, or even finish the sentence before you’re ready. This leads to a bit of a word mess and a sentence that doesn’t make sense and lacks context.
Luckily though, recent advances in machine learning (too technical to explain in this short post) have made these a lot more accurate and WAY WAY faster.
Products are upgrading their capabilities here every day, but we can credit OpenAI’s Whisper model for making this better for everyone.
Try it here on Replicate.
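To give you a sense of how little code this now takes, here’s a minimal sketch using OpenAI’s hosted Whisper endpoint via their Python SDK (the Replicate route works much the same way). The file name is just a placeholder and you’ll need your own API key.

```python
# Minimal speech-to-text sketch using OpenAI's hosted Whisper model.
# Assumes OPENAI_API_KEY is set in the environment; "voice_note.m4a" is a placeholder file.
from openai import OpenAI

client = OpenAI()

with open("voice_note.m4a", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # OpenAI's hosted Whisper model
        file=audio_file,
    )

print(transcript.text)  # the recognised sentence as plain text
```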
Sound like a human
This is also something that may sound simple and obvious on the surface, but it’s actually quite difficult to do.
Human voices are full of small and subtle variations in tone, pitch, speed and volume that carry the emotion and meaning we’re trying to convey to the listener.
Doing this with an AI…tricky.
So far the startup that has solved this in the most complete way is called Eleven Labs. They’ve built an incredibly robust platform that not only allows anyone to create entirely synthetic voices and make them programmable but also to create a digital clone of your own voice in a matter of seconds.
There are a couple of tiers to this too, so if you want to go full professional, the option is there.
With this, we can bring these voices into our apps without them sounding like robots (counterintuitive, I know)!
For technical people, you can check out the documentation here, and for less technical folk, you can use Eleven Labs via something like Zapier.
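To give a flavour of the technical route, here’s a rough sketch of what a text-to-speech request to Eleven Labs can look like over plain HTTP. The voice ID, model name and API key are placeholders, so check their documentation for the current endpoint and parameters.

```python
# Rough sketch of an Eleven Labs text-to-speech request over HTTP.
# The voice ID, model name and API key are placeholders; see their docs for specifics.
import requests

ELEVEN_API_KEY = "your-api-key"
VOICE_ID = "your-voice-id"  # e.g. a voice you created or cloned in the dashboard

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": ELEVEN_API_KEY, "Content-Type": "application/json"},
    json={
        "text": "Hey, this is my synthetic voice speaking.",
        "model_id": "eleven_multilingual_v2",  # assumption: pick whichever model their docs recommend
    },
)
response.raise_for_status()

# The response body is audio, ready to play or save.
with open("output.mp3", "wb") as f:
    f.write(response.content)
```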
Another player here is Play HT if you want to try an alternative.
I’m not going to turn this post into a tool comparison, but I may do one in the future.
Ok onto the next part…
Be fast at responding
Have a conversation with Alexa, Siri and, to some extent, ChatGPT Voice, and you’ll notice a bit of a delay between when you finish talking and when the response comes back.
The delay between question and reply is called latency. Explained below (five-year-old style).
To put latency into context, the goal here is to make the assistant’s reply come back quickly enough to make the conversation seem natural, but not so fast that it cuts in whenever you pause to think about the next word.
Here is where a problem arises.
Natural conversations flow because we know how to read people’s expressions, which gives us an unconscious understanding of when we should speak and when to keep quiet. As you can imagine, this is hard to do when the AI can’t see your face (in this example at least).
This is the last 5% problem.
Many can get 95% of the way, as ChatGPT does with its voice assistant on mobile, but there’s just enough latency there to make you realise you’re talking to an AI.
Now, we’re not far off these reaching the uncanny valley. Just try SmarterChild by Sindarin to get a taste of what’s coming down the pipe.
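If you’re building one of these pipelines yourself, the first step is usually just measuring where the time goes. Here’s a tiny sketch of that idea; the three stage functions are hypothetical stubs (they just sleep) standing in for your real speech-to-text, LLM and text-to-speech calls.

```python
# Sketch: time each stage of a voice pipeline to see where the latency actually lives.
# transcribe(), generate_reply() and synthesize() are hypothetical stubs that just sleep,
# standing in for real speech-to-text, LLM and text-to-speech calls.
import time

def transcribe(audio):     # stand-in for e.g. a Whisper call
    time.sleep(0.4)
    return "book me a table for two"

def generate_reply(text):  # stand-in for the LLM round trip
    time.sleep(0.8)
    return "Sure, what time works for you?"

def synthesize(text):      # stand-in for text-to-speech
    time.sleep(0.3)
    return b"fake-audio-bytes"

def timed(label, fn, arg):
    start = time.perf_counter()
    result = fn(arg)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

text = timed("speech-to-text", transcribe, "mic-audio")
reply = timed("llm reply", generate_reply, text)
audio = timed("text-to-speech", synthesize, reply)
# The sum of those three numbers is roughly how long the user sits in silence.
# The "last 5%" problem is squeezing that total down without cutting people off.
```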
Conversational Flexibility
The year is 2024, and by this point all of us have been on the phone with our bank or some other random service where a pre-recorded bot asks you questions with the aim of getting you through to the right department.
More often than not, they’re not great.
Simple yes-no trees to put you through to the right department, with not much flexibility to just communicate exactly what you need. Give a response that doesn’t fit the keywords it’s been given and it’s a no-go.
AI changes this by being able to communicate with you like a real person having a real conversation, understanding the context and interpreting what it is you’re looking for.
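To make that concrete, here’s a small sketch of the difference: instead of matching the caller’s words against a fixed keyword list, you hand the free-form sentence to a model and ask it which department fits. The department names and the example utterance are made up for illustration.

```python
# Sketch: routing a free-form caller request with an LLM instead of a keyword tree.
# The department names and the example utterance are made up for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

DEPARTMENTS = ["fraud", "mortgages", "card replacement", "general enquiries"]

def route(utterance: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat model works here
        messages=[
            {
                "role": "system",
                "content": f"Pick the best department for the caller from {DEPARTMENTS}. "
                           "Reply with the department name only.",
            },
            {"role": "user", "content": utterance},
        ],
    )
    return response.choices[0].message.content.strip()

# No keyword list needed: the model reads the intent behind the phrasing.
print(route("someone's been using my card in a country I've never visited"))  # likely: fraud
```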
Finally, let’s talk about the last piece of the puzzle…
Using Tools
Since their release, GPTs from OpenAI have enabled anyone to create AI-powered assistants that can not only do the above but also use tools and integrate with other applications via Actions.
Powered by a nuanced understanding of user requests, these give more accurate, context-aware responses.
This is where the idea of integrating Actions and APIs with GPTs becomes transformative. Such integration allows voice assistants not only to understand and respond to queries but also to execute tasks and interact with a wide range of services and applications.
Imagine asking your voice assistant to book a table at a restaurant, and it not only understands your request but also knows your preferences, checks availability, and completes the booking through a restaurant's API.
Or consider a scenario where you ask your assistant to compile a report by pulling data from various business analytics tools, something achievable through API integrations. This level of functionality, combined with the conversational flexibility and rapid response times of GPT models, could indeed bring us closer to the Jarvis-like assistants many of us envision.
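To show what that restaurant example can look like in practice, here’s a hedged sketch using OpenAI’s function-calling (tools) interface. The book_table function, its parameters and the booking API behind it are all hypothetical, but the shape of the request mirrors how Actions describe an API to the model.

```python
# Sketch: exposing a hypothetical restaurant-booking API to a model as a tool.
# book_table, its parameters and the backend it would call are made up for illustration.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

tools = [{
    "type": "function",
    "function": {
        "name": "book_table",
        "description": "Book a restaurant table for the user",
        "parameters": {
            "type": "object",
            "properties": {
                "restaurant": {"type": "string"},
                "time": {"type": "string", "description": "e.g. 19:30 tonight"},
                "party_size": {"type": "integer"},
            },
            "required": ["restaurant", "time", "party_size"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any tool-capable chat model
    messages=[{"role": "user", "content": "Book the usual Thai place for two at 7:30 tonight"}],
    tools=tools,
)

# In real code you'd check whether the model actually chose to call the tool.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
# From here your code would hit the real booking API and feed the result back to the model.
```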
That’s it for today. Catch you on the next one.
Tom