How will OpenAI's Realtime API impact language learning?

Last week OpenAI released its Realtime API, which allows developers to create speech-to-speech experiences.

This makes interacting with AI feel more human-like, thanks to lower latency and voice-only communication. Previously, a spoken exchange took three API calls (speech-to-text -> chat completions -> text-to-speech); with the new Realtime API, a speech-to-speech interaction takes one call.
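To make the difference concrete, here is a minimal sketch of one conversational turn in each style. The function names (chained_turn, realtime_turn) and the injected stage callables are hypothetical placeholders, not real SDK calls; in practice each stage would wrap an HTTP request, and the Realtime API is a persistent WebSocket session rather than a single request.

```python
from typing import Callable, Tuple


def chained_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    complete: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> Tuple[str, str, bytes]:
    """One turn through the older pipeline: three sequential network calls."""
    user_text = transcribe(audio_in)      # call 1: speech-to-text
    reply_text = complete(user_text)      # call 2: chat completion
    reply_audio = synthesize(reply_text)  # call 3: text-to-speech
    # The intermediate text is available, so the app can show or verify it.
    return user_text, reply_text, reply_audio


def realtime_turn(
    audio_in: bytes,
    speech_to_speech: Callable[[bytes], bytes],
) -> bytes:
    """One turn in the speech-to-speech style: audio in, audio out."""
    return speech_to_speech(audio_in)
```

The chained version waits on three round trips per turn, which is where the extra latency comes from, but it also exposes the transcript and the reply text at each step.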

One of the many sectors that benefit from this advancement is language learning. Speak's CTO Andrew Hsu demoed how they are using the Realtime API for roleplay scenarios in Spanish.

https://x.com/adhsu/status/1841172452938039536

Having an on-demand conversation partner is one of the benefits the Realtime API adds to language learning, but there are some drawbacks. Let's look at both the pros and cons.

Pros

An on-demand tutor and conversation partner in a low-pressure environment, where learners can get comfortable before engaging in real-world conversations.
Conversations simulate the low latency of an in-person conversation.
More exposure to speaking and listening, which are harder than reading and writing and are usually the last skills language learners conquer.

Cons

The voices do not sound native. ElevenLabs has more native-sounding voices, in my opinion.
The whisper-1 STT model has a prompt field in the API request body that guides the transcription toward what the user's audio is expected to contain, which led to more accuracy. The Realtime API does not have something like this yet, and from playing with it in the playground, it sometimes picked up the wrong audio.
At $0.06 per minute of audio input and $0.24 per minute of audio output, it gets expensive very quickly, especially at scale. Someone said it cost them $14 for 5 minutes of conversation.
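
For a rough sense of the pricing, here is a back-of-the-envelope sketch using only the two per-minute rates quoted above. The function names, the 50/50 split between user and AI speaking time, and the usage numbers in the example are assumptions for illustration. It also assumes each minute of audio is billed exactly once, so treat the result as a lower bound; anecdotes like the $14 one suggest real bills can come out much higher.

```python
INPUT_USD_PER_MIN = 0.06   # audio input rate quoted above
OUTPUT_USD_PER_MIN = 0.24  # audio output rate quoted above


def conversation_cost(input_minutes: float, output_minutes: float) -> float:
    """Naive cost of one conversation, billing each audio minute once."""
    return input_minutes * INPUT_USD_PER_MIN + output_minutes * OUTPUT_USD_PER_MIN


def monthly_cost(
    users: int,
    minutes_per_user_per_day: float,
    days: int = 30,
    output_share: float = 0.5,  # assumed fraction of time the AI is speaking
) -> float:
    """Naive monthly cost for a user base with a fixed daily practice time."""
    total_minutes = users * minutes_per_user_per_day * days
    output_minutes = total_minutes * output_share
    input_minutes = total_minutes - output_minutes
    return conversation_cost(input_minutes, output_minutes)


if __name__ == "__main__":
    # A 5-minute conversation split evenly between the user and the AI.
    print(f"5-minute chat: ${conversation_cost(2.5, 2.5):.2f}")
    # Hypothetical scale: 1,000 users practicing 10 minutes a day.
    print(f"1,000 users, 10 min/day: ${monthly_cost(1000, 10):,.2f} per month")
```

Even under these optimistic assumptions, a modest user base practicing a few minutes a day adds up to a large monthly bill.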

We at JoJo found that for Italian, ElevenLabs has more native-sounding voices. So we are sticking with the previous three-call model because it offers a higher degree of accuracy. Despite the higher latency and users having to confirm the transcription, at least they know that their side of the conversation was captured accurately.
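
As an illustration of the prompt hint and the confirmation step, here is a minimal sketch. It is not JoJo's actual code: the helper names, the example phrases, and the confirmation logic are hypothetical. The model, language, and prompt keys correspond to parameters of OpenAI's audio transcription endpoint.

```python
from typing import Dict, List, Optional


def build_transcription_kwargs(
    expected_phrases: List[str],
    language: str = "it",  # ISO-639-1 code; Italian in this example
) -> Dict[str, str]:
    """Build request parameters that bias whisper-1 toward expected vocabulary."""
    cleaned = [p.strip() for p in expected_phrases if p.strip()]
    return {
        "model": "whisper-1",
        "language": language,
        "prompt": ", ".join(cleaned),  # hint about what the learner is likely to say
    }


def confirmed_text(transcript: str, user_correction: Optional[str]) -> str:
    """Return the text the learner approved: the transcript, or their correction."""
    if user_correction is not None and user_correction.strip():
        return user_correction.strip()
    return transcript


if __name__ == "__main__":
    # Hypothetical phrases for a cafe roleplay scenario.
    kwargs = build_transcription_kwargs(["un cappuccino", "per favore", "il conto"])
    print(kwargs)
```

With the official Python SDK, these parameters would be passed along with the audio file, roughly as client.audio.transcriptions.create(file=audio_file, **kwargs). Only the confirmed text is then sent on to the chat completion step.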

What's next?

Just as the GPT models have gotten increasingly cheaper, I think the same will have to happen for the Realtime API to work at scale. For now, it seems that only established companies can afford to run this in production, or that it has to be a premium-only feature for higher-tier paid users.

I also think products built on it will need significant guardrails and high-quality prompt engineering to maximize effectiveness and reduce hallucinations.

I do see a world where the Realtime API, combined with a source of guided, high-quality language learning content, could become the main way people learn languages. People still prefer and trust a human teacher more, so it will take some time for them to trust AI for language learning and change their behavior. This is why accuracy in AI language learning is so important. I'm excited to see what happens and how JoJo will be a part of the new age of education with AI.