Custom Audio Input Bytes To Azure Cognitive Speech Translation Service In Python

I need to be able to translate custom audio bytes, which I can get from any source, into the language I need (currently Hindi). I have been trying to pass

Solution 1:

I found the solution myself. I think it also works with PullAudioInputStream, but I got it working with PushAudioInputStream. You don't need to create custom classes; it works like the following:

import azure.cognitiveservices.speech as speechsdk
from azure.cognitiveservices.speech.audio import AudioStreamFormat, PullAudioInputStream, PullAudioInputStreamCallback, AudioConfig, PushAudioInputStream

from threading import Thread, Event

speech_key, service_region = "key", "region"

channels = 1
bitsPerSample = 16
samplesPerSecond = 16000
audioFormat = AudioStreamFormat(samplesPerSecond, bitsPerSample, channels)

translation_config = speechsdk.translation.SpeechTranslationConfig(subscription=speech_key, region=service_region)

fromLanguage = 'en-US'
toLanguage = 'hi'
translation_config.speech_recognition_language = fromLanguage

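# The target language must be registered, otherwise no translation is produced
translation_config.add_target_language(toLanguage)
translation_config.voice_name = "hi-IN-Kalpana-Apollo"

# Remove custom classes as they are not needed.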

custom_push_stream = PushAudioInputStream(stream_format=audioFormat)

audio_config = AudioConfig(stream=custom_push_stream)

recognizer = speechsdk.translation.TranslationRecognizer(translation_config=translation_config, audio_config=audio_config)

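# Create an event
synthesis_done = Event()

def synthesis_callback(evt):
    # evt.result.audio holds the synthesized translation audio as bytes
    size = len(evt.result.audio)
    print('AUDIO SYNTHESIZED: {} byte(s) {}'.format(size, '(COMPLETED)' if size == 0 else ''))
    if size > 0:
        with open("translated_output.wav", "wb+") as t_sound_file:
            t_sound_file.write(evt.result.audio)
        # Setting the event
        synthesis_done.set()

recognizer.synthesizing.connect(synthesis_callback)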

def result_callback(evt):
    result = evt.result
    if result.reason == speechsdk.ResultReason.TranslatedSpeech:
        print("RECOGNIZED '{}': {}".format(fromLanguage, result.text))
        print("TRANSLATED into {}: {}".format(toLanguage, result.translations[toLanguage]))
    elif result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print("RECOGNIZED: {} (text could not be translated)".format(result.text))
    elif result.reason == speechsdk.ResultReason.NoMatch:
        print("NOMATCH: Speech could not be recognized: {}".format(result.no_match_details))
    elif result.reason == speechsdk.ResultReason.Canceled:
        print("CANCELED: Reason={}".format(result.cancellation_details.reason))
        if result.cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("CANCELED: ErrorDetails={}".format(result.cancellation_details.error_details))

recognizer.recognized.connect(result_callback)
recognizer.canceled.connect(result_callback)


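# Read and get data from an audio file
open_audio_file = open("speech_wav_audio.wav", 'rb')
file_bytes = open_audio_file.read()
open_audio_file.close()

# Write the bytes to the stream
custom_push_stream.write(file_bytes)

# Start the recognition
recognizer.start_continuous_recognition()

# Waiting for the event to complete
synthesis_done.wait()

# Once the event gets completed you can call Stop recognition
recognizer.stop_continuous_recognition()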

I used Event from threading because start_continuous_recognition returns immediately and runs the recognition on a different thread; if the main thread does not wait, you won't get any data from the callback events. synthesis_done.wait() solves this by blocking until the event is set, and only then is stop_continuous_recognition called. Once you have the audio bytes, you can do whatever you wish with them in synthesis_callback. I have simplified the example and taken the bytes from a WAV file.
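
To show the waiting pattern on its own, here is a minimal sketch that needs no Azure SDK. The function run_in_background is a hypothetical stand-in for a recognizer that fires its callback from another thread, and the 5-second timeout is my own addition so the script cannot hang forever if the callback never fires:

```python
import time
from threading import Event, Thread


def run_in_background(on_result, done):
    """Hypothetical stand-in for a recognizer: calls on_result from another thread."""
    def worker():
        time.sleep(0.1)          # pretend the service is processing audio
        on_result(b"\x00\x01")   # deliver a fake audio payload to the callback
        done.set()               # signal the main thread that data has arrived
    Thread(target=worker, daemon=True).start()


received = []
done = Event()

# Returns immediately, much like start_continuous_recognition does
run_in_background(received.append, done)

# Block the main thread until the event is set, or give up after 5 seconds
finished = done.wait(timeout=5)
print("callback fired:", finished, "| payload bytes:", len(received[0]) if received else 0)
```

Event.wait returns True when the event was set and False when the timeout expired, so the same check could decide whether to log an error before calling stop_continuous_recognition.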

Solution 2:

The example code provided passes a callback as the stream parameter to AudioConfig, which doesn't seem to be allowed; the callback has to be wrapped in a PullAudioInputStream first.

This code should work without throwing an error:

pull_audio_input_stream_callback = CustomPullAudioInputStreamCallback()
pull_audio_input_stream = PullAudioInputStream(pull_stream_callback=pull_audio_input_stream_callback, stream_format=audioFormat)

audio_config = AudioConfig(use_default_microphone=False, stream=pull_audio_input_stream)
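
CustomPullAudioInputStreamCallback comes from the question and is not shown here. As a rough sketch of what such a class could look like, the hypothetical BytesPullCallback below serves raw PCM audio from an in-memory bytes object by overriding read and close. The class name and the import fallback are my own assumptions; the fallback only exists so the sketch can be tried without the SDK installed:

```python
try:
    from azure.cognitiveservices.speech.audio import PullAudioInputStreamCallback
except ImportError:
    # Assumption: fall back to a plain base class so the sketch runs without the SDK
    PullAudioInputStreamCallback = object


class BytesPullCallback(PullAudioInputStreamCallback):
    """Hypothetical callback that serves audio from an in-memory bytes object."""

    def __init__(self, audio_bytes: bytes):
        super().__init__()
        self._data = audio_bytes
        self._pos = 0

    def read(self, buffer: memoryview) -> int:
        # Copy at most buffer.nbytes bytes; returning 0 signals the end of the stream
        chunk = self._data[self._pos:self._pos + buffer.nbytes]
        buffer[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)

    def close(self) -> None:
        # Release the audio data once the stream is closed
        self._data = b""
        self._pos = 0


# Exercise the callback by hand with a 5-byte buffer
callback = BytesPullCallback(b"abcdefgh")
buf = memoryview(bytearray(5))
first = callback.read(buf)
second = callback.read(buf)
third = callback.read(buf)
print(first, second, third)
```

An instance of this class would take the place of CustomPullAudioInputStreamCallback() in the snippet above, with the audio bytes matching the format described by audioFormat.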

