After working with text, images and video, I had to take a look at the audio side. That is the goal of this article, which is above all a practical one. The idea of this post is very simple: how do you capture speech and transcribe it into text with your computer?
If you want to create your own digital assistant, this is clearly where to start anyway …
Python offers a rich environment for this voice-to-text transcription; we just need a few libraries:
- pyAudio (https://pypi.org/project/PyAudio/)
pip install PyAudio
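Be careful: if, like me, you run Ubuntu, the installation via pip may not work. In that case, install the system packages instead (on recent Ubuntu releases only the python3-pyaudio package may be available):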
sudo apt-get install python-pyaudio python3-pyaudio
- speech_recognition
pip install SpeechRecognition
This library performs the voice-to-text transcription. To do so, it can rely on several engines:
- CMU Sphinx (offline)
- Google Speech Recognition (the one we’re going to use here)
- Google Cloud Speech API
- Wit.ai
- Microsoft Bing Voice Recognition
- Houndify API
- IBM Speech to Text
- Snowboy Hotword Detection (offline)
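Each transcription engine is reached through a recognize_* method of the library's Recognizer object. As a rough sketch, the helper below looks that method up from a short engine name. ENGINE_METHODS and transcribe_with are hypothetical names introduced here for illustration, not part of the library. The table is deliberately incomplete, the method names may differ between library versions, and most engines other than Google Speech Recognition and CMU Sphinx also need credentials passed as keyword arguments.

```python
# Hypothetical lookup table: short engine name -> Recognizer method name.
# Which methods actually exist depends on the installed library version.
ENGINE_METHODS = {
    "sphinx": "recognize_sphinx",              # CMU Sphinx (offline)
    "google": "recognize_google",              # Google Speech Recognition
    "google_cloud": "recognize_google_cloud",  # Google Cloud Speech API
    "wit": "recognize_wit",                    # Wit.ai
    "houndify": "recognize_houndify",          # Houndify API
    "ibm": "recognize_ibm",                    # IBM Speech to Text
}


def transcribe_with(recognizer, audio, engine="google", **options):
    """Call the recognizer method matching `engine` and return its result."""
    try:
        method_name = ENGINE_METHODS[engine]
    except KeyError:
        raise ValueError(f"Unknown engine: {engine!r}") from None
    method = getattr(recognizer, method_name, None)
    if method is None:
        # The installed version does not provide this engine.
        raise NotImplementedError(f"{method_name} is not available")
    # Extra keyword arguments (language, credentials, ...) are passed through.
    return method(audio, **options)
```

With a recognizer r and a recording audio_data like the ones created later in this article, the call would look like transcribe_with(r, audio_data, engine="google", language="en-US").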
Microphone management
First of all, import and initialize speech_recognition:
import speech_recognition as sr
r = sr.Recognizer()
Then we can list the microphones available on the computer:
sr.Microphone.list_microphone_names()
['HDA Intel HDMI: 0 (hw:0,3)',
'HDA Intel HDMI: 1 (hw:0,7)',
'HDA Intel HDMI: 2 (hw:0,8)',
'HDA Intel HDMI: 3 (hw:0,9)',
'HDA Intel HDMI: 4 (hw:0,10)',
'HDA Intel PCH: ALC3232 Analog (hw:1,0)',
'HDA NVidia: HDMI 0 (hw:2,3)',
'HDA NVidia: HDMI 1 (hw:2,7)',
'HDA NVidia: HDMI 2 (hw:2,8)',
'HDA NVidia: HDMI 3 (hw:2,9)',
'hdmi',
'pulse',
'default']
At this point you choose the right microphone by passing its position in the list as the device_index parameter, as below:
micro = sr.Microphone(device_index=5)
Or simply use the default one:
micro = sr.Microphone()
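Device indexes can change from one machine to another, so hard-coding 5 is fragile. Here is a minimal sketch that searches the list of names for a keyword instead. find_device_index is a hypothetical helper, not a library function, and the keyword "analog" is only an assumption that fits the listing shown above.

```python
def find_device_index(names, keyword):
    """Return the index of the first device whose name contains `keyword`
    (case-insensitive), or None when nothing matches."""
    keyword = keyword.lower()
    for index, name in enumerate(names):
        if keyword in name.lower():
            return index
    return None


# Device names copied from the listing above.
names = [
    "HDA Intel HDMI: 0 (hw:0,3)",
    "HDA Intel HDMI: 1 (hw:0,7)",
    "HDA Intel HDMI: 2 (hw:0,8)",
    "HDA Intel HDMI: 3 (hw:0,9)",
    "HDA Intel HDMI: 4 (hw:0,10)",
    "HDA Intel PCH: ALC3232 Analog (hw:1,0)",
    "HDA NVidia: HDMI 0 (hw:2,3)",
    "HDA NVidia: HDMI 1 (hw:2,7)",
    "HDA NVidia: HDMI 2 (hw:2,8)",
    "HDA NVidia: HDMI 3 (hw:2,9)",
    "hdmi",
    "pulse",
    "default",
]

print(find_device_index(names, "analog"))  # prints 5
```

On a real machine you would pass sr.Microphone.list_microphone_names() instead of the hard-coded list. If the helper returns None, sr.Microphone(device_index=None) falls back to the default microphone.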
First live recording
Your first speech recognition is extremely easy and takes only a few lines of Python. We open the microphone channel (line 1) and we listen …
with micro as source:
    print("Speak!")
    audio_data = r.listen(source)
    print("End!")
result = r.recognize_google(audio_data)
print(">", result)
Note in line 5 the use of the function recognize_google(), which sends your audio to the Google service and returns the transcribed text. If you said “good morning”, the result should be:
Speak!
End!
> good morning
How does it work?
- Between the display of Speak! and End!, say “Good morning” (we’ll see later how to handle other languages such as French).
- Once End! is displayed, you will notice that the execution pauses. The program is calling the Google service and waiting for the text transcription of the recording.
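The program therefore waits twice: first in listen(), until you stop speaking, then in recognize_google(), until the service answers. The first wait can be bounded. The sketch below calibrates the recognizer against background noise and then limits how long listen() waits and records. capture_phrase and its default values are assumptions chosen for illustration; adjust_for_ambient_noise() and the timeout and phrase_time_limit arguments of listen() are provided by the library.

```python
def capture_phrase(recognizer, source, calibrate_seconds=1.0,
                   timeout=5, phrase_time_limit=10):
    """Calibrate against background noise, then record a single phrase.

    `source` must already be open, i.e. used inside a `with` block.
    """
    # Sample the ambient noise so the silence threshold fits the room.
    recognizer.adjust_for_ambient_noise(source, duration=calibrate_seconds)
    # Wait at most `timeout` seconds for speech to start and keep at most
    # `phrase_time_limit` seconds of it. If nobody speaks in time, listen()
    # raises speech_recognition.WaitTimeoutError.
    return recognizer.listen(source, timeout=timeout,
                             phrase_time_limit=phrase_time_limit)


# Usage with the objects created earlier in the article:
#
#     with micro as source:
#         audio_data = capture_phrase(r, source)
#     print(">", r.recognize_google(audio_data))
```

Keeping the calibration and the limits in one small function makes it easy to tune the values for a noisy room without touching the rest of the script.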
Saving a wav file
It can be useful to record your voice in a wav file so that you can transcribe it later.
For that we will use pyAudio like this:
import pyaudio
import wave
chunk = 1024 # Record in chunks of 1024 samples
sample_format = pyaudio.paInt16 # 16 bits per sample
channels = 2
fs = 44100 # Record at 44100 samples per second
seconds = 10
filename = "output.wav"
p = pyaudio.PyAudio() # Create an interface to PortAudio
print('Start Recording ...')
stream = p.open(format=sample_format,
                channels=channels,
                rate=fs,
                frames_per_buffer=chunk,
                input=True)
frames = [] # Initialize array to store frames
# Store data in chunks for `seconds` seconds (10 here)
for i in range(0, int(fs / chunk * seconds)):
    data = stream.read(chunk)
    frames.append(data)
# Stop and close the stream
stream.stop_stream()
stream.close()
# Terminate the PortAudio interface
p.terminate()
print('... Finished recording')
# Save the recorded data as a WAV file
wf = wave.open(filename, 'wb')
wf.setnchannels(channels)
wf.setsampwidth(p.get_sample_size(sample_format))
wf.setframerate(fs)
wf.writeframes(b''.join(frames))
wf.close()
This portion of code records your microphone for 10 seconds and stores the result in the output.wav file.
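One detail is worth checking: the loop runs int(fs / chunk * seconds) times, so the recording is cut to a whole number of chunks and ends up slightly shorter than 10 seconds. The sketch below does that arithmetic, then reads a WAV header back with the standard wave module only. wav_info is a hypothetical helper name, and silence.wav is a stand-in file written by the example itself so that it runs without a microphone.

```python
import os
import tempfile
import wave

chunk, fs, seconds, channels = 1024, 44100, 10, 2

reads = int(fs / chunk * seconds)   # number of stream.read() calls
frames_recorded = reads * chunk     # frames actually captured
print(reads, frames_recorded, round(frames_recorded / fs, 3))
# prints: 430 440320 9.985


def wav_info(path):
    """Return the basic header fields of a WAV file plus its duration."""
    with wave.open(path, "rb") as wf:
        n_frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_width": wf.getsampwidth(),  # bytes per sample
            "rate": rate,
            "seconds": n_frames / rate,
        }


# Stand-in for output.wav: the same number of frames, filled with silence.
path = os.path.join(tempfile.gettempdir(), "silence.wav")
with wave.open(path, "wb") as wf:
    wf.setnchannels(channels)
    wf.setsampwidth(2)              # 2 bytes = 16 bits, like paInt16
    wf.setframerate(fs)
    wf.writeframes(b"\x00" * (frames_recorded * channels * 2))

print(wav_info(path))
```

Calling wav_info("output.wav") after a real recording is a quick way to confirm the channel count and sample rate before sending the file to a recognizer.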
Voice recognition in French with Google
Imagine that you recorded your voice saying the following words:
"L'histoire commence un beau matin tout le monde va bien les élèves sont heureux"
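r = sr.Recognizer()
with sr.AudioFile(filename) as source:
    audio = r.record(source)
try:
    data = r.recognize_google(audio, language="fr-FR")
    print(data)
except (sr.UnknownValueError, sr.RequestError):
    # The speech was unintelligible, or the Google service could not be reached
    print("Please try again")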
Histoire commence un beau matin tout le monde va bien les élèves sont heureux
Note in line 5 the use of the option language="fr-FR", which selects a French speech recognition model.
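recognize_google() can also hand back more than the single best transcript: called with show_all=True, it returns the raw response. The sketch below assumes that this response is a dictionary with an "alternative" list whose items carry a "transcript" and, sometimes, a "confidence". That layout is an assumption about the web service and may change. best_transcript is a hypothetical helper, and the sample response is made up for illustration, not real output.

```python
def best_transcript(response):
    """Pick the most confident transcript from a raw response, or None."""
    if not isinstance(response, dict):
        return None  # anything that is not a dict, e.g. an empty result
    alternatives = response.get("alternative", [])
    if not alternatives:
        return None
    # Alternatives without a confidence score count as 0.0; on a tie,
    # max() keeps the first one in the list.
    best = max(alternatives, key=lambda alt: alt.get("confidence", 0.0))
    return best.get("transcript")


# Made-up response shaped like the assumed layout (not real service output).
sample = {
    "alternative": [
        {"transcript": "histoire commence un beau matin", "confidence": 0.92},
        {"transcript": "l'histoire commence un beau matin"},
    ],
    "final": True,
}
print(best_transcript(sample))  # prints: histoire commence un beau matin

# With the objects from the article, the raw response would come from:
#     raw = r.recognize_google(audio, language="fr-FR", show_all=True)
```

Looking at the alternatives is handy when the top transcript drops a word, as happened with the leading "L'" in the French example.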
And there you have it: in this article we saw how to transcribe voice to text with Python and Google Speech Recognition. In a future article we may add a touch of NLP in order to start a simple voice assistant, much like we did for the analysis of movie reviews.