
Using FFmpeg with Google Cloud Speech-to-Text

Google Cloud Speech-to-Text is a fully managed service that converts speech to text in real time. It can be used to transcribe audio and video files, create subtitles for videos, and build voice-activated applications.

The service supports a wide range of audio formats, including WAV, MP3, and AAC. It can also transcribe audio in a variety of languages, including English, Spanish, French, German, Japanese, and many more.

Google Cloud Speech-to-Text is easy to use. You can simply upload your audio file to the service, and it will automatically transcribe it into text. You can also use the service to transcribe live audio, such as a phone call or a meeting. Speech-to-Text samples are given here for V1 and V2.
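As a quick illustration, a minimal transcription request with the V1 Python client library might look like the following sketch. The GCS URI, sample rate, and language code are placeholders, not values prescribed by this post:

from google.cloud import speech

client = speech.SpeechClient()

# Placeholder GCS URI; point this at your own audio file
audio = speech.RecognitionAudio(uri="gs://my-bucket/audio.wav")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

# Synchronous recognition; suitable for short audio clips
response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)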

Problem

However, what if the input audio encoding is not supported by the STT API?

Supported audio encodings are listed at https://cloud.google.com/speech-to-text/docs/encoding.

Fortunately, there are a number of third-party tools available to assist with encoding conversion.

FFmpeg is a popular multimedia framework for handling audio and video data. It can be used to encode, decode, transcode, and stream audio and video content. In this blog, we will demonstrate how to use FFmpeg in various scenarios to obtain the correct encoding for calling an STT API.

Running it locally from the command line
Invoking it through a Python program
Building a container image with GCP buildpacks and FFmpeg
Running from Vertex AI Workbench

Running it locally from the command line

Download the ffmpeg software from https://www.ffmpeg.org/download.html and install it.

Take a sample input audio source that is encoded in “acelp.kelvin”, an encoding not supported by the STT API.

To determine how the audio source is encoded, run ffmpeg -i input.wav; the output will show the encoding:

Stream #0:0: Audio: acelp.kelvin (5[1][0][0] / 0x0135), 8000 Hz, 2 channels, s16p, 17 kb/s

Run the command below to change the encoding to “pcm_s16le”:

ffmpeg -i "input.wav" -f wav -bitexact -acodec pcm_s16le -ac 2 "output.wav"

The output of the ffmpeg -i output.wav command indicates that the encoding of the file has been changed and that the file is now ready to be passed through the STT API.

Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 22050 Hz, 2 channels, s16, 705 kb/s

For a full list of ffmpeg options, see http://ffmpeg.org/documentation.html.

Invoking it through a Python program

Add the ffmpeg-python package to your requirements.txt file and run pip install -r requirements.txt:

ffmpeg-python==0.2.0

The following Python code snippet takes an input file and produces output audio encoded in pcm_s16le.

Input files can be either local to the machine or stored in a GCS Bucket. If files are stored in a GCS Bucket, they must first be downloaded to the machine where the ffmpeg software is running, and then re-uploaded after the encoding is modified (a sketch of this appears after the snippet below).

import logging

import ffmpeg


def convert_using_ffmpeg(input_file, output_file):
    """Re-encode the input audio as a pcm_s16le stereo WAV file."""
    try:
        (
            ffmpeg
            .input(input_file)
            .output(output_file, format='wav', acodec='pcm_s16le', ac=2)
            .run(overwrite_output=True)
        )
    except Exception as e:
        logging.error(e)
        return False

    return True
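If the files live in a GCS Bucket, a thin wrapper around convert_using_ffmpeg can handle the download and re-upload described above. This is a minimal sketch assuming the google-cloud-storage client library; the bucket, blob names, and scratch paths are placeholders:

from google.cloud import storage


def convert_gcs_file(bucket_name, input_blob_name, output_blob_name):
    """Download audio from GCS, re-encode it locally, and upload the result."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    local_in = "/tmp/input.wav"    # scratch paths on the local machine
    local_out = "/tmp/output.wav"

    # ffmpeg needs the file on local disk, so download it first
    bucket.blob(input_blob_name).download_to_filename(local_in)

    # Re-encode using the convert_using_ffmpeg function above
    if convert_using_ffmpeg(local_in, local_out):
        # Upload the pcm_s16le-encoded file back to the bucket
        bucket.blob(output_blob_name).upload_from_filename(local_out)
        return True
    return False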

Building a container image with GCP buildpacks and FFmpeg

Google Cloud’s buildpacks transform your application source code into container images that are ready for production. Buildpacks use a default builder, but you can customize the run image to add the packages that your service requires.

1.1. Create a Dockerfile

The first step is to create a Dockerfile (builder.Dockerfile). This file will describe how to build a base ffmpeg image.

FROM gcr.io/buildpacks/builder
USER root
RUN apt-get -y update
RUN apt-get install -y ffmpeg

This Dockerfile will use the default “gcr.io/buildpacks/builder” image as a base image. It will then install the FFmpeg package using the apt-get command.

1.2. Build the Container Image

Once you have created your Dockerfile, you can build the container image using the following commands:

docker build -t ffmpeg-run-image:1 -f builder.Dockerfile .

pack build ffmpeg-service-image --builder gcr.io/buildpacks/builder:latest --run-image ffmpeg-run-image:1

1.3. Push the Container Image

Once you have built the service image, you can tag and push it using the following commands:

docker tag ffmpeg-service-image:latest gcr.io/<PROJECT_ID>/ffmpeg-service-image:latest

docker push gcr.io/<PROJECT_ID>/ffmpeg-service-image:latest

Your image is now ready with the ffmpeg software. Instead of buildpacks, you can use any base image and follow the standard container build process to include ffmpeg.
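For example, a plain Dockerfile built on a public Python base image could look like this sketch. The base image tag and the entrypoint (main.py) are illustrative, not prescribed by this post:

FROM python:3.11-slim

# Install ffmpeg from the distribution packages
RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

# main.py is a hypothetical entrypoint for your conversion service
CMD ["python", "main.py"]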

Running from Vertex AI Workbench (Ubuntu)

Running on a Vertex AI Workbench Jupyter notebook is similar to running a Python program locally. From a Jupyter notebook, FFmpeg can be installed using sudo, and the Python code snippet given above can be executed as is. The installation commands for ffmpeg may vary depending on the environment of your notebook.
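For example, on a Debian/Ubuntu-based Workbench image, installation from a notebook cell typically looks like the following; these commands are an assumption and may need adjusting for other images:

# Install ffmpeg on the notebook instance (Debian/Ubuntu-based image assumed)
!sudo apt-get update -y
!sudo apt-get install -y ffmpeg

# Install the Python bindings used in the snippet above
!pip install ffmpeg-python==0.2.0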

Conclusion

Encoding conversion is a straightforward process; the platform of choice is determined by the number and size of the audio files and the amount of conversion needed. Converting a small number of files can be accomplished from the command line or by running a Python program locally. If this is a recurring task, building an image and using a serverless platform like Cloud Run is ideal. Vertex AI Workbench is better suited to larger audio files, particularly when parallel processing is needed.

FFmpeg requires that audio files be local to the machine where it runs, which can lead to performance issues when processing large audio files, especially when those files are stored in a GCS Bucket. To avoid this, it is recommended to choose a processing platform on GCP that is in the same region as the GCS Bucket where the audio files are stored. This reduces processing time, as the data does not need to be transferred between regions.

Head over to the detailed documentation and try the walkthroughs to get started with the Speech-to-Text API.
