Last week OpenAI released their Realtime API, which allows developers to create speech-to-speech experiences.
This makes interacting with AI more human-like thanks to lower latency and voice-only communication. Previously, a speech-to-speech interaction took three API calls: speech-to-text -> chat completions -> text-to-speech. With the new Realtime API, it takes one.
One of the many sectors that benefits from this advancement is language learning. Speak's CTO Andrew Hsu demoed how they are using the Realtime API for roleplay scenarios in Spanish.
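To make the older three-call flow concrete, here is a minimal Python sketch. It is not tied to any SDK: `three_call_turn` and its `transcribe`, `complete`, and `synthesize` parameters are hypothetical names, and in a real app each one would wrap a network request to the provider of your choice.

```python
from typing import Callable


def three_call_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],   # hypothetical wrapper: speech-to-text
    complete: Callable[[str], str],       # hypothetical wrapper: chat completion
    synthesize: Callable[[str], bytes],   # hypothetical wrapper: text-to-speech
) -> bytes:
    """Run one conversational turn as three sequential calls."""
    user_text = transcribe(audio_in)    # call 1
    reply_text = complete(user_text)    # call 2
    return synthesize(reply_text)       # call 3


if __name__ == "__main__":
    # Stand-in functions so the sketch runs without any network access.
    reply_audio = three_call_turn(
        b"ciao",
        transcribe=lambda audio: audio.decode(),
        complete=lambda text: text.upper(),
        synthesize=lambda text: text.encode(),
    )
    print(reply_audio)
```

Because each step waits for the previous one, the user hears nothing until all three round trips finish, which is where the extra latency comes from.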
https://x.com/adhsu/status/1841172452938039536
Having an on-demand conversation partner is one of the benefits the Realtime API adds to language learning, but there are some drawbacks. Let's look at both the pros and cons.
prompt field in the API request body, which would guide the processing by telling it what to expect in the user's audio. This led to more accuracy. The Realtime API does not have something like this yet, and from playing with it in the playground, it sometimes picked up the wrong audio.
We at JoJo found that for Italian, ElevenLabs has more native-sounding voices. So we are sticking with the previous three-API-call model because it offers a higher degree of accuracy. Despite the higher latency and users having to confirm the transcription, at least they know that their side of the conversation is accurate.
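The confirmation step could look roughly like the sketch below. This is an illustration under assumptions, not JoJo's actual code: `confirmed_transcript`, `record`, `transcribe`, `confirm`, and `hint` are hypothetical names. Here `hint` stands in for the text passed in the speech-to-text prompt field, and `confirm` stands in for whatever UI asks the user whether the transcription is right.

```python
from typing import Callable, Optional


def confirmed_transcript(
    record: Callable[[], bytes],              # hypothetical: captures the user's audio
    transcribe: Callable[[bytes, str], str],  # hypothetical: speech-to-text with a prompt hint
    confirm: Callable[[str], bool],           # hypothetical: asks the user "is this what you said?"
    hint: str,
    max_attempts: int = 2,
) -> Optional[str]:
    """Return a transcript the user has approved, or None if none was approved."""
    for _ in range(max_attempts):
        audio = record()
        text = transcribe(audio, hint)
        if confirm(text):
            return text  # only confirmed text moves on to the chat step
    return None
```

Only the confirmed text would then go to the chat completion and text-to-speech calls, so a bad transcription never reaches the model. The cost is an extra tap for the user on every turn.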
Just as the GPT models have gotten increasingly cheaper, I think the same will have to happen for the Realtime API to work at scale. Right now, it seems only established companies could afford to run this in production, or it would have to be a premium-only feature for higher-tier paid users.
I also think they’d need significant guardrails and high-quality prompt engineering to maximize effectiveness and reduce hallucinations.
I do see a world where the Realtime API, combined with a source of guided, high-quality language learning content, could become the main way people learn languages. People still prefer and trust a human teacher more, so it will take some time for people to trust AI for language learning and change their behavior. This is why accuracy in AI language learning is so important. I’m excited to see what happens and how JoJo will be a part of the new age of education with AI.
That's good, but why so much?
Are you talking about the API cost?