The Formula 1 (F1) live streaming service, F1 TV, has live automated closed captions in three different languages: English, Spanish, and French.
For the 2021 season, FORMULA 1 has achieved another technological breakthrough, building a fully automated workflow to create closed captions in three languages and broadcasting to 85 territories using Amazon Transcribe. Amazon Transcribe is an automatic speech recognition (ASR) service that converts audio into text transcriptions.
In this post, we share how Formula 1 joined forces with the AWS Professional Services team to make it happen. We discuss how they used Amazon Transcribe and its custom vocabulary feature as well as custom-built postprocessing logic to improve their live transcription accuracy in three languages.
For F1, everything is about extreme speed: pit stops as short as 2 seconds, speeds of up to 375 KPH (233 MPH), and 5g forces on drivers under braking and through corners. In this fast-paced and dynamic environment, milliseconds dictate the difference between pole position and second on the grid. The role of the race commentators is to weave the multitude of parallel events and information into a single exciting narrative. This form of commentary greatly increases the engagement and excitement of viewers.
F1 has a strong affinity for cutting-edge technology, and partnered with AWS to build a scalable and sustainable closed caption solution for F1 TV, their over-the-top (OTT) platform, that can support a growing calendar and language portfolio. F1 now provides real-time live captions in three languages across four series: F1 in British English, US Spanish, and French; and F2, F3, and Porsche Supercup in British English and US Spanish. This was achieved using Amazon Transcribe to automatically convert the commentary into subtitles.
This task presents many unique challenges. With the excitement of an F1 race, it’s common to have commentators with differing accents move quickly from one topic to another as the race unfolds. Because the sport is steeped in technology, commentators often refer to F1 domain-specific terminology such as DRS (Drag Reduction System), aerodynamics, downforce, or halo (a safety device). Moreover, F1 is a global sport, traveling across the world and drawing drivers from many different countries. Looking only at the 2021 season, 16/20 drivers had non-English names and 17/20 had non-Spanish or non-French names. With the advanced customization features available in Amazon Transcribe, we tailored the underlying language models to recognize domain-specific terms that are rare in general language use, which boosted transcription accuracy.
In the following sections, we take a deep dive into how AWS Professional Services partnered with F1 to build a robust, state-of-the-art, real-time race commentary captioning system by enhancing Amazon Transcribe to understand the particularities of the F1 world. You will learn how to utilize Amazon Transcribe in real-time broadcasts and supercharge live captioning for your use case with custom vocabularies, postprocessing steps, and a human-in-the-loop validation layer.
The solution works as a proxy to Amazon Transcribe. Custom vocabularies are passed as parameters to Amazon Transcribe, and the resulting captions are postprocessed. The postprocessed text is then moderated by an F1 moderator before being transformed to captions that are displayed to the viewers. The following diagram shows the sequential process.
Live transcriptions: Understanding use case specific terminology and context
The output of Automatic Speech Recognition (ASR) systems is highly context-dependent. ASR language models benefit from utilizing the words across a fully spoken sentence. For example, in the following sentence, the system uses the words ‘WORLD CHAMPIONSHIP’ towards the end of the sentence to inform context and allow ‘FORMER ONE’ to be correctly transcribed as ‘FORMULA 1’.
GOOD AFTERNOON EVERYBODY WELCOME ALONG TO ROUND 4 OF THE FORMER ONE
GOOD AFTERNOON EVERYBODY WELCOME ALONG TO ROUND 4 OF THE FORMULA 1 WORLD CHAMPIONSHIP IN 2019
Amazon Transcribe supports both batch and streaming transcription models. In batch transcription, the model issues a transcription using the full context provided in the audio segment. Amazon Transcribe streaming transcription enables you to send an audio stream and receive a transcription stream in real time. Generating subtitles for a live broadcast requires a streaming model because transcriptions should appear on screen shortly after the commentary is spoken. This real-time need presents unique challenges compared to batch transcriptions and often affects the quality of the results because the language model has limited knowledge of the future context.
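To make the effect of limited future context concrete, here is a minimal sketch of how a consumer of a streaming transcription might skip partial results and keep only finalized segments. The event shape (`Transcript`, `Results`, `IsPartial`, `Alternatives`) follows our understanding of the streaming response format and should be checked against the current API reference. The sample events are hand-written for illustration and are not real service output.

```python
from typing import Iterable, Iterator


def final_transcripts(events: Iterable[dict]) -> Iterator[str]:
    """Yield the finalized text of each segment, skipping partial results."""
    for event in events:
        for result in event.get("Transcript", {}).get("Results", []):
            if result.get("IsPartial", True):
                continue  # still being revised as more audio arrives
            alternatives = result.get("Alternatives", [])
            if alternatives:
                yield alternatives[0]["Transcript"]


# Hand-written events imitating the two stages of the example sentence.
sample_events = [
    {"Transcript": {"Results": [{
        "IsPartial": True,
        "Alternatives": [{"Transcript":
            "GOOD AFTERNOON EVERYBODY WELCOME ALONG TO ROUND 4 OF THE FORMER ONE"}],
    }]}},
    {"Transcript": {"Results": [{
        "IsPartial": False,
        "Alternatives": [{"Transcript":
            "GOOD AFTERNOON EVERYBODY WELCOME ALONG TO ROUND 4 OF THE "
            "FORMULA 1 WORLD CHAMPIONSHIP IN 2019"}],
    }]}},
]

for text in final_transcripts(sample_events):
    print(text)
```

Waiting for finalized segments trades latency for accuracy, which is the tension a live captioning system has to balance.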
Amazon Transcribe is pre-trained to capture a wide range of use cases. However, F1 domain-specific terminology, names, and locations aren’t present in the Amazon Transcribe general language model. Getting those words correct is nevertheless crucial for the understanding of the narrative, such as who is leading the race, circuit corners, and technical details.
Amazon Transcribe allows you to develop with custom vocabularies and custom language models to improve transcription accuracy. You can use them separately for streaming transcriptions or together for batch transcriptions.
Custom vocabularies consist of a list of specific words that you want Amazon Transcribe to recognize in the audio input. These are generally domain-specific words and phrases, such as proper nouns. You can inform Amazon Transcribe how to pronounce these terms with information such as SoundsLike (in regular orthography) or the IPA (International Phonetic Alphabet) description of the term. Custom vocabularies are available for all languages supported by Amazon Transcribe. Custom vocabularies improve the ability of Amazon Transcribe to recognize terms without using the context in which they’re spoken.
The following table shows some examples of a custom vocabulary.
ʃ ɑ ɹ l l ə k l ɛ ɹ
f ɝ ɹ ɑ ɹ ɪ
m ɛ ɹ s eɪ d i z
The custom vocabulary includes the following details:
Phrase – The term that should be recognized.
DisplayAs – How the word or phrase looks when it’s output. If not declared, the output would be the phrase.
SoundsLike – The term broken into small pieces with the respective pronunciations in the specified language using standard orthography.
IPA – The International Phonetic Alphabet representation for the term.
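As an illustration of how these four fields fit together, the following sketch renders vocabulary entries as a tab-separated table. The column order, the helper name `build_vocabulary_table`, and the pairing of each phrase with an IPA string are assumptions made for this example. Check the Amazon Transcribe documentation for the exact file format the service expects.

```python
import csv
import io

# Column order assumed for this sketch; verify against the service documentation.
COLUMNS = ["Phrase", "IPA", "SoundsLike", "DisplayAs"]


def build_vocabulary_table(entries):
    """Render vocabulary entries as tab-separated text, one term per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n", restval=""
    )
    writer.writeheader()
    for entry in entries:
        row = dict(entry)
        # Multi-word phrases are joined with hyphens in the Phrase column.
        row["Phrase"] = row["Phrase"].replace(" ", "-")
        writer.writerow(row)
    return buffer.getvalue()


# Illustrative entries; the phrase-to-IPA pairing is assumed, not taken from F1's vocabulary.
entries = [
    {"Phrase": "Charles Leclerc", "IPA": "ʃ ɑ ɹ l l ə k l ɛ ɹ", "DisplayAs": "Charles Leclerc"},
    {"Phrase": "Ferrari", "IPA": "f ɝ ɹ ɑ ɹ ɪ"},
    {"Phrase": "Mercedes", "IPA": "m ɛ ɹ s eɪ d i z"},
]

print(build_vocabulary_table(entries))
```

Fields that an entry omits are written as empty cells, so each row keeps the same number of columns.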
Custom language models are valuable when there are larger corpora of text data that can be used to train models. With the additional data, the models learn to predict the probabilities of sequences of words in the domain-specific context. For this project, F1 chose to use custom vocabularies given the many words and phrases that are unique to F1 racing.
Postprocessing: the final layer of performance boosting
Due to the fast-paced nature of F1 commentary, with rapidly changing context and varied commentator accents, inaccurate transcriptions may still occur. However, recurring mistakes can be easily fixed using text replacement. For example, “Kvyat and Albon” may be misunderstood as “create an album” by the British English language model. Because “create an album” is an unlikely phrase to occur in F1 commentary, we can safely replace it with its assumed real meaning in a postprocessing routine. In addition, postprocessing terms can be defined as general, or restricted by location and race series filters. These filters allow for more specific term replacement, which reduces the chance of erroneous replacements.
For this project, we gathered thousands of replacements for each language using hours of real-life F1 audio commentary that was analyzed by F1 domain specialists. On top of that, during every live event, F1 runs a transcribed commentary through a human-in-the-loop tool (described in the next section), which allows sentence rejection before the subtitles appear on screen. This data is used later to continuously improve the custom vocabulary and postprocessing rules. The following table shows examples of postprocessing rules for English captions. The location filter is a replacement filter based on race location, and the race series filter is based on the race series.
Race Series Filter
CREATE AN ALBUM
KVYAT AND ALBON
CURVE A PARABOLIC
CIRCUIT THE CATALONIA
CIRCUIT DE CATALUNYA
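A replacement routine of this kind can be sketched in a few lines. This is not F1's implementation: the `Rule` class, the `apply_rules` function, and the filter values (`"F1"`, `"Barcelona"`) are hypothetical names chosen for illustration. Only the original and replacement phrases come from the examples in this post.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    original: str
    replacement: str
    location: Optional[str] = None     # None: applies at every circuit
    race_series: Optional[str] = None  # None: applies to every series


def apply_rules(text, rules, location, race_series):
    """Apply each rule whose filters match the current event."""
    for rule in rules:
        if rule.location not in (None, location):
            continue
        if rule.race_series not in (None, race_series):
            continue
        pattern = r"\b" + re.escape(rule.original) + r"\b"
        replacement = rule.replacement
        # A function replacement avoids backslash handling in the replacement text.
        text = re.sub(pattern, lambda _match: replacement, text, flags=re.IGNORECASE)
    return text


# Filter values below are assumed for illustration.
rules = [
    Rule("CREATE AN ALBUM", "KVYAT AND ALBON", race_series="F1"),
    Rule("CIRCUIT THE CATALONIA", "CIRCUIT DE CATALUNYA", location="Barcelona"),
]

print(apply_rules("CREATE AN ALBUM AT THE CIRCUIT THE CATALONIA", rules, "Barcelona", "F1"))
```

Matching on whole-word boundaries and scoping rules by location or series are two simple ways to limit accidental replacements.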
Another important function of postprocessing is the standardization and formatting of numbers. When generating transcriptions for live broadcasts such as television, it’s a best practice to use digits when displaying numbers because they’re faster to read and occupy less space on screen. In English, Amazon Transcribe automatically displays numbers bigger than 10 as digits, and numbers between 0–10 are converted to digits under specific conditions, such as when more than one appears in a row. For example, “three four five” converts to 345. To standardize number transcriptions, we digitize all numbers.
As of August 8, 2021, transcriptions only output numbers as digits instead of words for a defined list of languages in both batch and streaming (for more information, see Transcribing numbers and punctuation). Notably, this list doesn’t include Spanish (es-US and es-ES) or French (fr-FR and fr-CA). With the postprocessing routine, numbers were also formatted to handle integers, decimals, and ordinals, as well as F1-specific lap time formatting.
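The lap time case can be sketched as follows for English. This is a simplified illustration under stated assumptions, not the routine built for F1: the word map covers only zero to ten, the function names are hypothetical, and the pattern assumes a lap time appears as one digit, two digits, and three digits in sequence. Spanish and French number words would need their own mappings.

```python
import re

# Small English word-to-digit map; a production routine would cover far more forms.
WORD_TO_DIGIT = {
    "ZERO": "0", "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4", "FIVE": "5",
    "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9", "TEN": "10",
}

# Heuristic: minutes (1 digit), seconds (2 digits), milliseconds (3 digits).
LAP_TIME = re.compile(r"\b(\d) (\d{2}) (\d{3})\b")


def digitize(text):
    """Replace standalone number words with digits (whitespace is normalized)."""
    return " ".join(WORD_TO_DIGIT.get(token, token) for token in text.split())


def format_lap_times(text):
    """Rewrite 'M SS mmm' digit groups as 'M:SS.mmm'."""
    return LAP_TIME.sub(r"\1:\2.\3", text)


def normalize_numbers(text):
    return format_lap_times(digitize(text))


print(normalize_numbers("COMPLETED THE LAST LAP IN ONE 15 632"))
# COMPLETED THE LAST LAP IN 1:15.632
```

Because the lap time pattern is a heuristic, it could also match unrelated digit groups, so a real system would likely combine it with context checks.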
The following shows an example of the number postprocessing that was built for F1 in different languages.
Human in the loop: Continuous improvement and adaptation
Amazon Transcribe custom vocabularies and postprocessing boost the service’s real-time performance significantly. However, the fast-paced and quickly changing environment remains a challenge for automated transcriptions. It’s better for a person reliant on closed captions to miss out on a phase of commentary, rather than see an incorrect transcription that may be misleading. To this end, F1 employs a human in the loop as a final validation, where a moderator has a number of seconds to verify if a word or an entire sentence should be removed before it’s included in the video stream. Any removed sentences are then used to improve the custom vocabularies and postprocessing step for the next races.
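One way to picture this validation window is a buffer that holds each caption for a fixed delay and releases it only if the moderator has not rejected it in the meantime. The sketch below is a simplified model, not F1's tool: the class name, the 5-second delay, and the explicit `now` timestamps are assumptions for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PendingCaption:
    caption_id: int
    text: str
    release_at: float  # time after which the caption is shown unless rejected
    rejected: bool = False


class ModerationBuffer:
    def __init__(self, delay_seconds):
        self.delay_seconds = delay_seconds
        self._pending = deque()
        self._next_id = 0
        self.rejected_log = []  # feedback for vocabulary and rule updates

    def submit(self, text, now):
        """Queue a caption for review and return its identifier."""
        caption = PendingCaption(self._next_id, text, now + self.delay_seconds)
        self._next_id += 1
        self._pending.append(caption)
        return caption.caption_id

    def reject(self, caption_id):
        """Mark a still-pending caption as removed by the moderator."""
        for caption in self._pending:
            if caption.caption_id == caption_id and not caption.rejected:
                caption.rejected = True
                self.rejected_log.append(caption.text)

    def release(self, now):
        """Return captions whose review window has elapsed, dropping rejected ones."""
        released = []
        while self._pending and self._pending[0].release_at <= now:
            caption = self._pending.popleft()
            if not caption.rejected:
                released.append(caption.text)
        return released


buffer = ModerationBuffer(delay_seconds=5.0)
buffer.submit("KVYAT AND ALBON SIDE BY SIDE", now=0.0)
bad_id = buffer.submit("CREATE AN ALBUM INTO TURN 1", now=1.0)
buffer.reject(bad_id)
print(buffer.release(now=6.0))
print(buffer.rejected_log)
```

Keeping the rejected sentences in a log mirrors the feedback loop described above, where removed text informs later updates to the vocabularies and rules.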
Minor grammatical errors don’t greatly decrease the understandability of a sentence. However, using the wrong F1 terminology breaks a sentence. Usually ASR systems are evaluated on word error rate (WER), which quantifies how many insertions, deletions, and substitutions are required to change the predicted sentence to the correct one.
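For reference, WER is commonly computed as the word-level edit distance between the correct sentence and the predicted one, divided by the number of words in the correct sentence: WER = (S + D + I) / N. The following is a minimal sketch of that standard definition. It splits on whitespace only and applies no text normalization, which an evaluation pipeline would usually add.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j]: edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "THE GERMAN DRIVER SEBASTIAN VETTEL COMPLETED THE LAST LAP"
hypothesis = "THE GERMAN DRIVER THE BASTION BETTER COMPLETED THE LAST LAP"
print(f"{word_error_rate(reference, hypothesis):.2f}")  # 0.33
```

In this example, two substitutions and one insertion against nine reference words give a WER of 3/9.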
Although WER is important, F1-specific terms are even more crucial. For this, we created an accuracy score that measures the accuracy of people’s names (such as Charles Leclerc), teams (McLaren), locations (Hungaroring), and other F1 terms (DRS) transcribed in a commentary. These scores allow us to evaluate how understandable the transcriptions are to F1 fans and, combined with WER, allow us to maintain high-quality transcriptions and improvements in Amazon Transcribe.
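The post does not give the formula behind this accuracy score. One plausible reading, shown here purely as an assumption, is the share of domain term occurrences in the correct transcript that also appear in the predicted one. The function names, the term list, and the example sentences are hypothetical.

```python
import re


def count_term(term, text):
    """Count whole-word, case-insensitive occurrences of a term."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def term_accuracy(reference, hypothesis, terms):
    """Fraction of reference term occurrences that the hypothesis also contains."""
    expected = 0
    recognized = 0
    for term in terms:
        in_reference = count_term(term, reference)
        in_hypothesis = count_term(term, hypothesis)
        expected += in_reference
        recognized += min(in_reference, in_hypothesis)
    if expected == 0:
        raise ValueError("none of the terms occur in the reference")
    return recognized / expected


terms = ["SEBASTIAN VETTEL", "DRS"]
reference = "SEBASTIAN VETTEL HAS DRS"
hypothesis = "THE BASTION BETTER HAS DRS"
print(term_accuracy(reference, hypothesis, terms))  # 0.5
```

A score like this can be computed separately per category (names, teams, locations, other terms) by passing a different term list for each.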
The F1 TV enhanced live transcriptions system was released on March 26, 2021, during the Formula 1 Gulf Air Bahrain Grand Prix. By the first race, the solution had already achieved a strong reduction in WER and F1-specific accuracy improvements for all three languages, compared to the Amazon Transcribe standard model. In the following tables, we highlight the WER and F1-specific accuracy improvements for the different languages. The numbers compare the developed solution, which uses Amazon Transcribe with custom vocabularies and postprocessing, against the generic Amazon Transcribe model. The lower the WER, the better.
[Tables: per-language WER (Standard Amazon Transcribe WER; Amazon Transcribe with CV and Postprocessing WER) and F1-specific accuracy (Standard Amazon Transcribe Accuracy; Amazon Transcribe with CV and Postprocessing Accuracy), with a row for other F1 terms in each language.]
Although the approach significantly improves the WER measures, its main influence is seen on F1 names, teams, and locations. Because the F1-specific terms are often in local languages, custom vocabularies and postprocessing steps can quickly teach Amazon Transcribe to consider those terms and correctly transcribe them. The postprocessing step then further adapts the output transcriptions to F1’s domain to provide highly accurate automated transcriptions. In the following examples, we present phrases in English, Spanish, and French where Amazon Transcribe custom vocabularies, postprocessing, and number handling techniques successfully improved the transcription accuracy.
For Spanish, we have the original Amazon Transcribe output “EL PILOTO BRITÁNICO LORIS JAMIL TODOS ESTÁ A DOS SEGUNDOS PUNTO TRES DEL LIDER. COMPLETÓ SU ÚLTIMA VUELTA EN UNO VEINTINUEVE DOSCIENTOS TREINTA Y CUATRO” compared to the final transcription “EL PILOTO BRITÁNICO LEWIS HAMILTON ESTÁ A 2.3 s DEL LIDER. COMPLETÓ SU ÚLTIMA VUELTA EN 1:29.234.”
The custom vocabulary and postprocessing combination converted “LORIS JAMIL TODOS” to “LEWIS HAMILTON,” and the number handling routine converted the lap time to digits and added the appropriate punctuation (1:29.234).
For English, compare the original output “THE GERMAN DRIVER THE BASTION BETTER COMPLETED THE LAST LAP IN ONE 15 632” to the final transcription “THE GERMAN DRIVER SEBASTIAN VETTEL COMPLETED THE LAST LAP IN 1:15.632.”
The custom vocabulary and postprocessing combination converted “THE BASTION BETTER” to “SEBASTIAN VETTEL.”
In French, we can compare the original output “VICTOIRE POUR LES MISS MILLE TONNE DIX-HUIT POLE CENT TROIS PODIUM QUATRE VICTOIRES ICI” to the final output “VICTOIRE POUR LEWIS HAMILTON 18 POLE 103 PODIUM 4 VICTOIRES ICI.”
The custom vocabulary and postprocessing combination converted “LES MISS MILLE TONNE” to “LEWIS HAMILTON,” and the number handling routine converted the numbers to digits.
The following short video shows live captions in action during the Formula 1 Gulf Air Bahrain Grand Prix 2021.
In this post, we explained how F1 is now able to provide live closed captions on their OTT (Over-The-Top) platform to benefit viewers with accessibility needs and those who want to ensure they do not miss any live commentary.
In collaboration with AWS Professional Services, F1 has set up live transcriptions in English, Spanish, and French by using Amazon Transcribe and applying enhancements to capture domain-specific terminology.
Whether for sport broadcasting, streaming educational content, or conferences and webinars, AWS Professional Services is ready to help your team develop a real-time captioning system that is accurate and customizable by making full use of your domain-specific knowledge and the advanced features of Amazon Transcribe. For more information, see AWS Professional Services or reach out through your account manager to get in touch.
About the Authors
Beibit Baktygaliyev is a Senior Data Scientist with AWS Professional Services. As a technical lead, he helps customers to attain their business goals through innovative technology. In his spare time, Beibit enjoys sports and spending time with his family and friends.
Maira Ladeira Tanke is a Data Scientist at AWS Professional Services. She works with customers across industries to help them achieve business outcomes with AI and ML technologies. In her spare time, Maira likes to play with her cat Smila. She also loves to travel and spend time with her family and friends.
Sara Kazdagli is a Professional Services consultant specialized in Data Analytics and Machine Learning. She helps customers across different industries to build innovative solutions and make data-driven decisions. Sara holds an MSc in Software Engineering and an MSc in Data Science. In her spare time, she likes to go on hikes and walks with her Australian shepherd dog Kiba.
Pablo Hermoso Moreno is a Data Scientist in the AWS Professional Services Team. He works with clients across industries using Machine Learning to tell stories with data and reach more informed engineering decisions faster. Pablo’s background is in Aerospace Engineering and, having worked in the motorsport industry, he has an interest in bridging physics and domain expertise with ML. In his spare time, he enjoys rowing and playing guitar.