During a recent meeting a client asked about converting a large volume of voice recordings into text for analysis. The client’s hypothesis is that when merged with other data sources the recordings will provide additional insight into their customers and improve the performance of several predictive models.
There are many great off-the-shelf speech-to-text options available today. Popular choices include:
- Amazon Transcribe
- Google Speech Recognition
- Google Cloud Speech API
- IBM Speech to Text
- Microsoft Azure Speech
Our client's data is highly sensitive and cannot be sent to external APIs for processing. This constraint removed most of these options from further consideration. To avoid sending data externally, we are experimenting with CMUSphinx, an open-source, state-of-the-art speech recognition tool that can be used offline. To speed up the process, we installed CMUSphinx across a Greenplum cluster, which allows us to distribute the conversion work over multiple servers. The client has hundreds of gigabytes of .wav audio files to convert. When we deploy this code to a larger cluster, we see significant performance gains over a single CMUSphinx instance.
In this blog we will provide an overview of setting up and working with CMUSphinx within Greenplum for in-database speech-to-text processing.
Setting up the Greenplum environment
For testing, we are using a single-node Greenplum 5.10.2 instance running in a Docker container. For more information on the container we used for experimentation and to download the files follow this link.
To set up the Docker container for this experiment, we installed CMUSphinx, its Python interface package pocketsphinx, and the required dependencies.

    # bash into Docker container
    docker exec -it gpdb-ds /bin/bash

    # install dependencies while root user
    yum -y install wget pulseaudio-libs-devel alsa-lib-devel
    yum -y groupinstall "Development Tools"

    # change to gpadmin user (Python 2.7 in search path)
    su - gpadmin

    # download and install sphinxbase and pocketsphinx
    wget https://sourceforge.net/projects/cmusphinx/files/sphinxbase/5prealpha/sphinxbase-5prealpha.tar.gz
    tar -xvf sphinxbase-5prealpha.tar.gz
    cd sphinxbase-5prealpha
    ./configure
    make
    make install
    cd ..
    wget https://sourceforge.net/projects/cmusphinx/files/pocketsphinx/5prealpha/pocketsphinx-5prealpha.tar.gz
    tar -xvf pocketsphinx-5prealpha.tar.gz
    cd pocketsphinx-5prealpha
    ./configure
    make
    make install

    # add python dependencies
    pip install --upgrade pip
    pip install SpeechRecognition pydub pocketsphinx

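Before moving on, it can be worth confirming that the Python packages installed above are importable by the gpadmin user. The snippet below is a minimal sketch of such a check; the helper name `missing_modules` is our own invention for illustration, and the module names are the import names of the three pip packages.

```python
# Hypothetical helper: report which of the required modules cannot be imported.
REQUIRED = ["speech_recognition", "pydub", "pocketsphinx"]


def missing_modules(names):
    """Return the subset of module names that fail to import."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All speech-to-text dependencies are importable.")
```

Running this as gpadmin inside the container should print either the list of missing modules or a confirmation that everything imports.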
Loading audio files into Greenplum
We stored the raw audio in a table that holds the original file name along with the pickled version of the audio file in a bytea type column.
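To make the storage format concrete, here is a small sketch of the pickle round trip using only the Python standard library. It builds a one-second silent WAV file in memory (a stand-in for a real recording, purely an assumption for illustration), pickles the bytes the way they would be stored in the bytea column, and confirms that the unpickled bytes are still a readable WAV file. The helper name `make_silent_wav` is hypothetical.

```python
import io
import pickle
import wave


def make_silent_wav(seconds=1, rate=16000):
    """Build a small mono 16-bit WAV file in memory (silence only)."""
    buf = io.BytesIO()
    wav = wave.open(buf, "wb")
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
    wav.setframerate(rate)
    wav.writeframes(b"\x00\x00" * rate * seconds)
    wav.close()
    return buf.getvalue()


raw = make_silent_wav()
pkl = pickle.dumps(raw)       # the value that goes into the bytea column
restored = pickle.loads(pkl)  # the value recovered inside the database

# The restored bytes are identical and still parse as a WAV file.
reader = wave.open(io.BytesIO(restored), "rb")
frame_count = reader.getnframes()
reader.close()
```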
The following code is an example of loading a single WAV file. To run it, create a Python file and paste in the following code. Note that the Greenplum connection details are imported from a separate config file.

    #!/usr/bin/python
    import pickle

    import psycopg2
    from config import config

    filename = '/test.wav'

    # connect using the details from the separate config file
    params = config()
    conn = psycopg2.connect(**params)
    cur = conn.cursor()

    # create table to insert data into
    cur.execute('CREATE TABLE public.waves (filename text, pickles bytea);')

    with open(filename, 'rb') as fd:
        # pickle wave file contents
        pkl = pickle.dumps(fd.read())

    # insert pickle into db table
    try:
        cur.execute("INSERT INTO public.waves VALUES (%s, %s)",
                    (filename, psycopg2.Binary(pkl)))
        # psycopg2 does not autocommit by default, so persist the changes
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)

    cur.close()
    conn.close()

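The loader above imports a `config` function from a separate file that is not shown in this post. One possible shape for that file is sketched below. The file name `database.ini`, the section name `postgresql`, and the key names are all assumptions on our part; any helper that returns a dict of keyword arguments accepted by `psycopg2.connect` would work equally well.

```python
# config.py: hypothetical sketch of the connection helper imported by the loader.
try:
    from configparser import ConfigParser  # Python 3
except ImportError:
    from ConfigParser import ConfigParser  # Python 2.7


def config(filename="database.ini", section="postgresql"):
    """Read connection parameters from an INI file and return them as a dict."""
    parser = ConfigParser()
    # read() returns the list of files it managed to parse
    if not parser.read(filename):
        raise IOError("Could not read config file: " + filename)
    if not parser.has_section(section):
        raise KeyError("Section [%s] not found in %s" % (section, filename))
    return dict(parser.items(section))
```

With this sketch, a `database.ini` containing a `[postgresql]` section with entries such as `host`, `dbname`, `user`, and `password` would be passed straight through to `psycopg2.connect(**params)`.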
Converting speech to text in-database
The next step is to create a PL/Python user-defined function (UDF) that converts the pickled WAV file of speech into text. This is done using CMUSphinx, which we installed earlier. To pass Greenplum data to CMUSphinx, we can either export the data from the database for external processing or call Python from within Greenplum via PL/Python.
The advantage of using Python within Greenplum via PL/Python is that we can utilize the distributed architecture of Greenplum to parallelize the workload. Put another way, within Greenplum we can launch multiple CMUSphinx instances and process the data faster. We do all this without having to write additional code to distribute the workload (e.g. MapReduce), which is a huge time savings. An added bonus with Greenplum is that we can utilize its NLP and machine learning functionality to analyze the text data once it is ready, without moving the data again.
The PL/Python UDF below takes a pickled WAV file as input and returns text.
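How much parallelism we actually get depends on how the audio rows are spread across Greenplum segments. As a rough sketch, the query below uses Greenplum's `gp_segment_id` system column to count rows per segment of the `public.waves` table, and a small helper summarizes the result. The helper name `skew_ratio` and the idea of using max over mean as a skew measure are our own illustrative choices, not part of Greenplum.

```python
# Count rows per segment. Segments holding no rows do not appear in the
# result, so also compare the number of rows returned with the segment count.
DISTRIBUTION_SQL = (
    "SELECT gp_segment_id, count(*) "
    "FROM public.waves "
    "GROUP BY gp_segment_id "
    "ORDER BY gp_segment_id;"
)


def skew_ratio(rows):
    """Return max row count divided by mean row count.

    rows is a list of (segment_id, row_count) tuples, for example the
    result of cursor.fetchall() after executing DISTRIBUTION_SQL.
    A value close to 1.0 means the rows are spread evenly.
    """
    counts = [count for _, count in rows]
    if not counts or sum(counts) == 0:
        raise ValueError("no rows to summarize")
    mean = sum(counts) / float(len(counts))
    return max(counts) / mean
```

Row counts are only a proxy here: audio files vary in length, so an even row count does not guarantee an even amount of transcription work per segment.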

    -- create user defined function
    DROP FUNCTION IF EXISTS public.speach_to_text(bytea);
    CREATE OR REPLACE FUNCTION public.speach_to_text(wav bytea)
    RETURNS text AS
    $$
    import speech_recognition as sr
    import pickle
    import io

    rec = sr.Recognizer()
    audio_data = pickle.loads(wav)
    with sr.AudioFile(io.BytesIO(audio_data)) as source:
        y = rec.record(source)
    results = rec.recognize_sphinx(y)
    return results
    $$
    LANGUAGE 'plpythonu';

    -- run a test
    SELECT filename
          ,public.speach_to_text(pickles)
    FROM public.waves
    LIMIT 1;

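The test query transcribes a single row. To process the whole table and keep the results, one option is to materialize the transcripts into a new table so that the UDF runs where the rows live. The sketch below builds such a statement and runs it over an open database connection. The target table name `public.transcripts`, the `transcript` column name, and both function names are hypothetical; the table names are interpolated directly into the SQL, so they should only ever come from trusted constants.

```python
def build_transcription_sql(source="public.waves", target="public.transcripts"):
    """Return a CREATE TABLE AS statement that transcribes every row of source."""
    return (
        "CREATE TABLE {target} AS "
        "SELECT filename, public.speach_to_text(pickles) AS transcript "
        "FROM {source} "
        "DISTRIBUTED BY (filename);"
    ).format(source=source, target=target)


def transcribe_all(conn, source="public.waves", target="public.transcripts"):
    """Run the statement on an open DB-API connection (for example psycopg2)."""
    sql = build_transcription_sql(source, target)
    cur = conn.cursor()
    try:
        cur.execute(sql)
        conn.commit()  # psycopg2 does not autocommit by default
    finally:
        cur.close()
    return sql
```

A typical call would be `transcribe_all(psycopg2.connect(**config()))`, after which the text can be queried from the new table without touching the audio again.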
As of the release of this blog we are still in the process of testing the accuracy of results. Initial spot checks show CMUSphinx is not as accurate as Google's API but still performs more than adequately for a minimum viable product with the data we have. CMUSphinx offers several tuning parameters and also allows for model updates. We plan to continue iterating on this solution to improve results in the future.
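One common way to put a number on transcription accuracy is word error rate (WER): the word-level edit distance between a hand-checked reference transcript and the recognizer's output, divided by the number of words in the reference. The sketch below is a generic implementation offered as an illustration; it is not the evaluation procedure we have settled on, and the function name is our own.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / float(len(ref))
```

For example, comparing the reference "the quick brown fox" with the output "the quick fox" gives one deletion over four reference words, a WER of 0.25. Lower is better, and values above 1.0 are possible when the output contains many extra words.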
Thank you for reading this blog. For more information on Greenplum or any of the other content on the A42 Labs blog do not hesitate to contact us: firstname.lastname@example.org.