This is Snapshot, where members of German Autolabs discuss unique working challenges, from overseas hardware manufacturing to natural language understanding. Today our senior Android developer, Mohsen Mirhoseini, explains how we send voice over Bluetooth Low Energy (BLE) for a faster, more robust experience.
What about classic Bluetooth?
Three years ago, when we first started thinking about Chris, one of our biggest challenges was: how are we going to transfer voice?
Initially, we tried to send the audio using classic Bluetooth, the kind that you find in a normal pair of wireless headphones. But at German Autolabs we have a very particular use case: in-car connectivity. Right away, we saw that by connecting the user’s phone to Chris, we were taking over the phone’s Bluetooth audio connection, the same one it would otherwise use for the car.
This was bad. We had users who were accustomed to connecting their phones to their cars. Now we were adding something in the middle and saying: please connect to Chris, don’t connect to your car anymore, don’t use your fancy steering wheel controls, don’t answer calls with your car, everything is different now: use Chris.
To compound matters, we had specific technical challenges with Chris. When you put beamforming microphones and a speaker into a small box, the speaker’s output easily bleeds back into the microphones and creates a world of echoes, especially during calls.
How did this all stack up? We were asking too much from our users. They were used to doing something one way, and then we tried to force them to do it our way. Back to the drawing board.
It takes two to tango.
Over the last three years, we had a lot of issues with classic Bluetooth profiles, above all with switching between HFP (Hands-Free Profile) and A2DP (Advanced Audio Distribution Profile). HFP is the profile used for calls and carries the microphone (recording) channel, while A2DP offers higher audio quality but only works in one direction, which is why it is most commonly used for streaming music.
Switching between the two profiles always comes with a 1–2 second delay, depending on the phone and speaker models involved. Because the switch could never be instantaneous, we had to introduce a buffer on the Chris side. By buffering the user’s voice, we could replay it to the speech recognition module from the very beginning once the profile switch had finished, so that no part of the user’s request was lost and we understood their intent correctly.
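The buffer-and-replay idea can be sketched as a bounded queue of audio frames that keeps filling while the link is busy and is drained, oldest first, once it is ready. This is a minimal Python sketch, not the Chris firmware: `ReplayBuffer`, the 20 ms frame size and the 5 s cap are hypothetical values chosen for illustration.

```python
from collections import deque

FRAME_MS = 20          # assumed audio frame duration
MAX_BUFFER_MS = 5_000  # assumed upper bound on one buffered utterance


class ReplayBuffer:
    """Hypothetical helper: keeps the most recent audio frames so an
    utterance can be replayed from its first word once the link is ready."""

    def __init__(self, max_frames: int) -> None:
        # A bounded deque silently drops the oldest frame when it is full.
        self._frames = deque(maxlen=max_frames)

    def push(self, frame: bytes) -> None:
        self._frames.append(frame)

    def drain(self) -> list[bytes]:
        # Return every buffered frame, oldest first, and start over empty.
        frames = list(self._frames)
        self._frames.clear()
        return frames


# Simulate a 1.5 s profile switch: recording continues the whole time.
buf = ReplayBuffer(max_frames=MAX_BUFFER_MS // FRAME_MS)
for i in range(1_500 // FRAME_MS):
    buf.push(i.to_bytes(2, "big"))  # stand-in for real PCM frames
replayed = buf.drain()  # hand these to the speech recognizer, in order
```

Bounding the buffer keeps memory fixed on the device; the trade-off is that anything older than the cap is dropped, so the cap has to cover the longest expected switch plus the start of the utterance.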
TTS (Text To Speech) suffered from the profile switch too. The user would say something to Chris, and even though Chris could decide very quickly how best to respond, we still had to wait a second or more for the Bluetooth profile to switch before the answer could be played.
Dialogue is a two-sided event where any delay in between means that you’ll lose your audience. This is what was happening to us, and this was why we realized that we had to change.
When we talk about Bluetooth Low Energy, the Low Energy part means that less data is being transferred, less frequently. Take a smartwatch, for example: the phone sends a signal that there’s a new notification, and the watch vibrates. That’s it: simple, low energy.
We already used BLE for simple jobs like these, for example transferring dynamic images and gesture commands between Chris and the phone. But what if BLE could do more than just grunt work? Would its bandwidth be enough to carry good-quality compressed voice?
To turn voice into packets small enough to send over BLE, you need a codec to compress the audio. We went with the Opus codec, which WhatsApp and PlayStation also use to send voice, and picked it for its quality at low bitrates. In blind listening tests at low bitrates, Opus has outperformed even big-name codecs like MP3 and AAC.
Once we combined Opus on the Chris side with BLE, we had an entirely new and much more efficient way to deliver voice:
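Whether BLE has enough bandwidth for compressed voice comes down to simple arithmetic. The sketch below assumes a constant 16 kbit/s Opus stream with 20 ms frames (Opus supports both, but these are illustrative settings, not the values Chris uses) and a hypothetical 3-byte per-chunk application header. A BLE notification spends 3 bytes of the ATT_MTU on its opcode and attribute handle, and the default LE ATT_MTU is 23 bytes.

```python
import math

ATT_NOTIFY_OVERHEAD = 3  # Handle Value Notification: 1-byte opcode + 2-byte handle


def opus_frame_bytes(bitrate_bps: int, frame_ms: int) -> int:
    """Size of one frame at a constant bitrate: bits per frame / 8."""
    return bitrate_bps * frame_ms // 1000 // 8


def notifications_per_second(bitrate_bps: int, frame_ms: int,
                             att_mtu: int, app_header: int = 0) -> int:
    """Notifications needed per second when each frame is split across as
    many notifications as it takes (app_header is a hypothetical header)."""
    payload = att_mtu - ATT_NOTIFY_OVERHEAD - app_header
    chunks_per_frame = math.ceil(opus_frame_bytes(bitrate_bps, frame_ms) / payload)
    frames_per_second = 1000 // frame_ms
    return chunks_per_frame * frames_per_second


# Assumed settings: 16 kbit/s Opus, 20 ms frames, 3-byte app header.
print(opus_frame_bytes(16_000, 20))                  # → 40
print(notifications_per_second(16_000, 20, 23, 3))   # → 150 (default ATT_MTU)
print(notifications_per_second(16_000, 20, 185, 3))  # → 50 (example larger MTU)
```

At these settings the voice stream is only about 2 kB/s, so the practical question is less raw throughput than packet count: with the default MTU each frame needs three notifications, while a larger negotiated MTU (185 here is only an example; the achievable value depends on both devices) lets one frame fit in one notification.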
- The user says “Hey Chris…”
- Chris starts recording
- Chris compresses the recording with Opus into small packets
- Packets are transferred to the phone
- Phone decodes the packets back into raw audio
- Phone stitches the decoded audio back together in order
- The complete recording is delivered to the voice recognition module
- Eureka! We understand what the user asked for, right from the first word.
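The transfer steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not our production code: Opus encoding and decoding are left out (frames are treated as opaque bytes), and the 3-byte header (`seq`, `index`, `count`) and the `packetize`/`reassemble` helpers are hypothetical. It covers the case where an encoded frame is larger than one BLE payload, in which case the phone has to rejoin the chunks into a complete Opus frame before that frame can be decoded.

```python
import struct

# Hypothetical 3-byte header: frame sequence number, chunk index, chunk count.
HEADER = struct.Struct(">BBB")


def packetize(frames: list[bytes], max_payload: int) -> list[bytes]:
    """Split each encoded frame into chunks that fit one BLE payload."""
    body = max_payload - HEADER.size
    packets = []
    for seq, frame in enumerate(frames):
        chunks = [frame[i:i + body] for i in range(0, len(frame), body)] or [b""]
        if len(chunks) > 255:
            raise ValueError("frame too large for a one-byte chunk count")
        for index, chunk in enumerate(chunks):
            packets.append(HEADER.pack(seq % 256, index, len(chunks)) + chunk)
    return packets


def reassemble(packets: list[bytes]) -> list[bytes]:
    """Rejoin chunks into whole frames; assumes in-order delivery and
    raises if a chunk is missing."""
    frames, parts = [], []
    for packet in packets:
        _seq, index, count = HEADER.unpack_from(packet)
        if index != len(parts):
            raise ValueError("missing chunk")
        parts.append(packet[HEADER.size:])
        if index == count - 1:
            frames.append(b"".join(parts))
            parts = []
    return frames


frames = [bytes(range(40)), b"hi", bytes(100)]  # stand-ins for Opus frames
packets = packetize(frames, max_payload=20)     # 20 = default ATT_MTU 23 minus 3
restored = reassemble(packets)                  # each frame is ready to decode
```

Keeping the header to a few bytes matters here, because every header byte comes out of an already small payload; the sequence number is carried but unused in this sketch, and a fuller version could use it to detect dropped frames.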
As new technologies continue to evolve, we’re always looking for the optimal way to implement those technologies for the best user experience. And sometimes, it’s not exactly the way you expect.