Snapshot: Secrets of an NLP professional

This is Snapshot, where members of German Autolabs discuss unique working challenges. This time round, Duygu Altinok, our senior Natural Language Processing engineer, explains how she makes sense out of a noisy in-vehicle environment and why she just can’t get enough of the Portuguese language.

I really like playing with speech data. Like many Natural Language Processing developers out there, my first job was in Information Retrieval. Then I moved to neural network-based methods until I got a chance to work in an Automatic Speech Recognition (ASR) and Text-To-Speech (TTS) development team. This was when I started to make friends with speech data.

Though speech signals are very exotic, the raw signal may look just like a “usual” acoustic signal at first glance:

Figure 1: Utterance “Satz” from a German native speaker, audio taken from Duden Dictionary

One can distinguish the vowel by its rising acoustic energy. Different vowels leave their marks in different ways. Compare the “a” above with the “e” below, a back vowel against a front vowel, and notice how frontness and backness show up in the frequency bands:

Figure 2: Utterance “Gesetz” from a German native speaker, audio taken from Duden Dictionary

Very beautiful, isn’t it? I usually use statistical methods for processing speech, but I still sometimes search the signal for clues as to why my implementation failed at certain points.
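For readers who want to see how such frequency-band pictures come about, here is a minimal sketch of a spectrogram computation. It is not the tooling behind the figures above: it uses only the Python standard library, a naive DFT, and two synthetic tones (500 Hz and 2000 Hz, both arbitrary choices) as a stand-in for two different sounds. In practice one would reach for a numerical library instead.

```python
import cmath
import math

SAMPLE_RATE = 8000  # assumed sample rate for this toy example
FRAME_LEN = 64      # samples per analysis frame
HOP = 32            # step between consecutive frames


def hann(n):
    """Hann window of length n, used to soften the frame edges."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..n/2 of one frame (naive O(n^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, frame_len=FRAME_LEN, hop=HOP):
    """Cut the signal into overlapping windowed frames, one spectrum per frame."""
    window = hann(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(signal[start:start + frame_len], window)]
        spectra.append(magnitude_spectrum(frame))
    return spectra


def tone(freq_hz, n_samples):
    """Pure sine tone, a crude placeholder for a voiced sound."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n_samples)]


def peak_hz(spectrum):
    """Frequency of the strongest bin in one spectrum."""
    peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    return peak_bin * SAMPLE_RATE / FRAME_LEN


# A low tone followed by a higher one: the energy moves to a different band.
signal = tone(500, 256) + tone(2000, 256)
spec = spectrogram(signal)
first_peak = peak_hz(spec[0])
last_peak = peak_hz(spec[-1])
print(first_peak, last_peak)
```

Each row of `spec` is one time slice and each column one frequency band. Plotting the magnitudes as colors gives the kind of image shown in the figures, where different sounds light up different bands.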

True beauty comes with imperfections. I’ll never fall out of love with speech processing.

Engine noise and a myriad of other noises occur in the car. Radio crosstalk, driver-passenger crosstalk and children’s crosstalk are all extrinsic factors. Non-native accents, words of foreign origin, address recognition, contact name recognition, recognition of entities such as artist, song and album names, and short word recognition (ja, naja, yes, no, for instance) are all intrinsic factors.

Speech recognition is inherently challenging in both acoustic terms (different accents, foreign named entities) and linguistic terms (entity name recognition), but German Autolabs’ voice assistant also has to recognize speech in a very noisy environment. For example, this audio sample is pure noise and does not include any speech at all. You’ll notice how different this spectrogram looks:

Figure 3: Random noise occurring in the car. It does not resemble a speech signal at all.

The following two audio samples also differ from the “clean” audio of the previous section, but in a new way: they include radio crosstalk and user speech at the same time. One can clearly see where the speech energy is concentrated, but a continuous band of music also sweeps across the spectrogram. The raw signal looks more “uniform” due to the music noise, but the speech signal is still inside it. Processing this kind of mixed signal is challenging, as the acoustic signals are meshed:

Figure 4: Radio crosstalk
Figure 5: Radio crosstalk
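To make “meshed” concrete: a microphone records the sum of everything it hears. The sketch below imitates that by adding a noise source to a voice at a chosen signal-to-noise ratio (SNR). Everything in it is an illustrative assumption rather than our pipeline: the 300 Hz tone standing in for the voice, the white noise standing in for the radio, the 5 dB target, and the helper names `mix_at_snr` and `measure_snr_db`.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise to reach the target speech-to-noise ratio, then add it."""
    target_noise_power = power(speech) / (10 ** (target_snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measure_snr_db(speech, noise):
    """Speech-to-noise power ratio in decibels."""
    return 10 * math.log10(power(speech) / power(noise))


random.seed(0)  # fixed seed so the example is repeatable
SAMPLE_RATE = 8000

# Hypothetical stand-ins: a tone for the voice, white noise for the radio.
speech = [math.sin(2 * math.pi * 300 * i / SAMPLE_RATE) for i in range(SAMPLE_RATE)]
radio = [random.uniform(-1.0, 1.0) for _ in range(SAMPLE_RATE)]

mixture, scaled_radio = mix_at_snr(speech, radio, target_snr_db=5.0)
measured = measure_snr_db(speech, scaled_radio)
print(round(measured, 6))
```

A recognizer only ever sees `mixture`. The two components are known separately here only because the example built them, which is exactly what makes real recordings hard.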

Noise is always a challenge in a car environment, but it can be handled with source separation methods. So how about the intrinsic acoustic problems of speech?

Think of the word “WhatsApp”. It occurs frequently in our voice assistant domain, as we support the sending and reading of WhatsApp messages. The word is English in origin and entered German as a proper noun. In the original sound system, the second vowel is /æ/, a sound that sits between “a” and “e”. Compare how differently a native American English speaker and a German speaker pronounce the same word:

Figure 6: WhatsApp pronunciation by an American speaker, audio taken from Duden Dictionary
Figure 7: WhatsApp pronunciation by a German speaker colleague

Acoustic signatures are just… different. The phonetic systems of English and German differ, so the speech signals for the same word that you see above have ended up very different from each other. How does it affect our recognition? It can cause some headaches for German speech recognizers, for sure. Putting a peach into a basket of apples needs special care. Can we handle it? Yep. We play some extra tricks for foreign words.

After obtaining text from the speech signal, we want to attach meaning to the text. I like to use modern tools including word embeddings, neural networks and my own secret recipes. Word embeddings for our driver assistance domain appear compact and well distributed:

Figure 8: German Autolabs Voice Assistant word embeddings
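For anyone new to embeddings, the idea behind such a plot is that words used in similar ways end up close together. Here is a toy sketch of how closeness is usually measured, using cosine similarity. The three-dimensional vectors are made up for illustration and are not our trained embeddings, which have many more dimensions.

```python
import math

# Hypothetical toy vectors, invented for this example only.
EMBEDDINGS = {
    "nachricht": [0.9, 0.1, 0.0],
    "whatsapp": [0.8, 0.2, 0.1],
    "navigation": [0.1, 0.9, 0.2],
    "route": [0.0, 0.8, 0.3],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, embeddings=EMBEDDINGS):
    """Return the other word whose vector is most similar to the given word's."""
    query = embeddings[word]
    candidates = (w for w in embeddings if w != word)
    return max(candidates, key=lambda w: cosine_similarity(query, embeddings[w]))


print(nearest("whatsapp"))
print(nearest("route"))
```

With these toy values the messaging words pair up and the navigation words pair up, which is the kind of grouping a well-distributed embedding space shows.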

Everyone can enjoy neural networks, from sequence labeling to classification: one only needs to pick a flavor. Here’s an example of an architecture that I really enjoy implementing. Simple but very powerful:

Figure 9: A neural network architecture for text classification

Portuguese is my favorite language, definitely. The language has an incredible prosody, both soft and passionate at the same time. It sounds simultaneously like a love ballad and an attack march. I enjoy overhearing my lovely Portuguese colleagues, which is especially handy since we sit next to each other in the office.

Speaking of the office and future plans for German Autolabs: it’s a secret! But I can assure you of one thing: both as a team and through our products, we’re always full of surprises. We will continue surprising the conversational AI world with new ideas and implementations, and changing the world with new, smarter ways to understand what you want to say.

Duygu Altinok is a senior Natural Language Processing engineer at German Autolabs. Follow her on LinkedIn and GitHub.

German Autolabs builds voice assistance solutions for professional drivers. Thanks for reading.
