During a recent meeting a client asked about converting a large volume of voice recordings into text for analysis. The client’s hypothesis is that when merged with other data sources the recordings will provide additional insight into their customers and improve the performance of several predictive models.

There are many great off-the-shelf speech-to-text options available today.

Our client's data is highly sensitive and cannot be sent to external APIs for processing. This constraint removed the majority of options from further consideration. To avoid sending data externally, we are experimenting with CMUSphinx, an open-source, state-of-the-art speech recognition tool that can be used offline. The client has hundreds of gigabytes of .wav audio files to convert, so to speed up the process we have installed CMUSphinx across a Greenplum cluster, which allows us to distribute the conversion work over multiple servers. When we deploy this code to a larger cluster we see significant performance gains over a single CMUSphinx instance.

In this post we provide an overview of setting up and working with CMUSphinx within Greenplum for in-database speech-to-text processing.




Setting up the Greenplum environment

For testing we are using a single-node Greenplum 5.10.2 instance running in a Docker container. For more information on the container we used for experimentation and to download the files follow this link.

To set up the Docker container for this experiment we installed CMUSphinx, the Python interface package pocketsphinx, and the required dependencies.

                  
#bash into Docker container
docker exec -it gpdb-ds /bin/bash

#install dependencies while root user
yum -y install wget pulseaudio-libs-devel alsa-lib-devel
yum -y groupinstall "Development Tools"

#change to gpadmin user (Python 2.7 in search path)
su - gpadmin

#download and install sphinxbase and pocketsphinx
wget https://sourceforge.net/projects/cmusphinx/files/sphinxbase/5prealpha/sphinxbase-5prealpha.tar.gz
tar -xvf sphinxbase-5prealpha.tar.gz
cd sphinxbase-5prealpha
./configure
make
make install

cd ..
wget https://sourceforge.net/projects/cmusphinx/files/pocketsphinx/5prealpha/pocketsphinx-5prealpha.tar.gz
tar -xvf pocketsphinx-5prealpha.tar.gz
cd pocketsphinx-5prealpha
./configure
make
make install

# add python dependencies
pip install --upgrade pip
pip install SpeechRecognition pydub pocketsphinx
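
Before moving on, it can help to confirm that the Python packages are importable by the interpreter that gpadmin uses. The following is a minimal sketch and not part of our original setup: the helper name `check_modules` is hypothetical, and the module names are the import names we would expect for the packages installed above.

```python
import importlib


def check_modules(names):
    """Return a dict mapping each module name to True if it imports cleanly."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    # Import names assumed for the pip packages SpeechRecognition, pydub and pocketsphinx.
    wanted = ["speech_recognition", "pydub", "pocketsphinx"]
    for name, ok in sorted(check_modules(wanted).items()):
        print("%s: %s" % (name, "ok" if ok else "MISSING"))
```

If any module is reported as missing, PL/Python inside Greenplum will not be able to import it either, so it is worth fixing before creating the UDF.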
                  
                
Loading audio files into Greenplum

We stored the raw audio in a table that holds the original file name along with the pickled version of the audio file in a bytea type column.
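
To make the storage format concrete, here is a minimal sketch of what goes into the bytea column and how it is recovered later. The helper names `make_silent_wav`, `pack_wav` and `unpack_wav` are hypothetical, and the short silent clip is generated only so the example runs without a real audio file.

```python
import io
import pickle
import wave


def make_silent_wav(seconds=1, rate=16000):
    """Build a small mono 16-bit WAV file in memory (illustration only)."""
    buf = io.BytesIO()
    out = wave.open(buf, "wb")
    out.setnchannels(1)
    out.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
    out.setframerate(rate)
    out.writeframes(b"\x00\x00" * rate * seconds)
    out.close()
    return buf.getvalue()


def pack_wav(raw_bytes):
    """Pickle the raw file contents; this is the value stored in the bytea column."""
    return pickle.dumps(raw_bytes)


def unpack_wav(pickled):
    """Reverse pack_wav and return the original WAV bytes."""
    return pickle.loads(pickled)


raw = make_silent_wav()
restored = unpack_wav(pack_wav(raw))
assert restored == raw  # the round trip is lossless

# The restored bytes are still a readable WAV file.
clip = wave.open(io.BytesIO(restored), "rb")
print("%d channel(s), %d Hz, %d frames"
      % (clip.getnchannels(), clip.getframerate(), clip.getnframes()))
clip.close()
```

One caveat: unpickling executes whatever the pickle describes, so this pattern is only appropriate when the table is populated by trusted code.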

The following code is an example of loading a single wav file. To run it, create a Python file and paste in the following code. Note that the Greenplum connection details are imported through a separate config file.

                  
#!/usr/bin/python

import psycopg2
from config import config
import pickle

filename = '/test.wav'

params = config()
conn = psycopg2.connect(**params)
cur = conn.cursor()

# create table to insert data into
cur.execute('CREATE TABLE public.waves (filename text, pickles bytea);')

with open(filename, 'rb') as fd:

    # pickle wave file contents
    pkl = pickle.dumps(fd.read())

    # insert pickle into db table
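    try:
        cur.execute("INSERT INTO public.waves VALUES (%s, %s)", (filename, psycopg2.Binary(pkl),))
        # psycopg2 does not autocommit by default; commit so the table and row persist
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        # undo the failed transaction and report the error
        conn.rollback()
        print(error)

cur.close()
conn.close()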
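
The `config` module imported by the load script is not shown in this post. A minimal sketch, assuming the connection details live in an INI file, might look like the following. The file name `database.ini`, the section name `greenplum` and the sample values in the comment are all assumptions for illustration.

```python
try:
    from configparser import ConfigParser  # Python 3
except ImportError:
    from ConfigParser import ConfigParser  # Python 2.7, as used in the container

# Example database.ini (hypothetical values):
#   [greenplum]
#   host=localhost
#   port=5432
#   dbname=gpadmin
#   user=gpadmin


def config(filename="database.ini", section="greenplum"):
    """Read one section of an INI file into a dict of psycopg2.connect() keyword arguments."""
    parser = ConfigParser()
    # read() returns the list of files it managed to parse
    if not parser.read(filename):
        raise IOError("Config file not found: %s" % filename)
    if not parser.has_section(section):
        raise KeyError("Section [%s] not found in %s" % (section, filename))
    return dict(parser.items(section))
```

Keeping credentials in a separate file like this means the load script can be shared without exposing connection details.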
                  
                
Converting speech to text in-database

The next step is to create a PL/Python user-defined function (UDF) that converts the pickled wav file of speech into text. This is done using CMUSphinx, which we installed earlier. To pass Greenplum data to CMUSphinx we can either move the data out of the database for external processing or call Python from within Greenplum via PL/Python.

The advantage of using Python within Greenplum via PL/Python is that we can utilize the distributed architecture of Greenplum to parallelize the workload. Put another way, within Greenplum we can launch multiple CMUSphinx instances and process the data faster. We do all this without having to write additional code to distribute the workload (e.g. MapReduce), which is a huge time saver. An added bonus with Greenplum is that we can use its NLP and machine learning functionality to analyze the text once it is ready, without moving the data again.

The PL/Python UDF below takes the pickled wav files as input and returns text.

                  
-- create user defined function
DROP FUNCTION IF EXISTS public.speach_to_text(bytea);
CREATE OR REPLACE FUNCTION public.speach_to_text(wav bytea)
RETURNS text AS
$$
  import speech_recognition as sr
  import pickle
  import io

  rec = sr.Recognizer()
  audio_data = pickle.loads(wav)

  with sr.AudioFile(io.BytesIO(audio_data)) as source:

      y = rec.record(source)
      results = rec.recognize_sphinx(y)

      return(results)

$$ LANGUAGE 'plpythonu';


-- run a test
SELECT filename
      ,public.speach_to_text(pickles)
FROM public.waves
LIMIT 1;
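
The test query above transcribes a single row. To process the whole table, one option is to materialize the results with CREATE TABLE AS, so that each Greenplum segment transcribes the rows it holds. The sketch below builds and runs that statement from Python. The output table `public.transcripts`, the `transcript` column and both helper names are hypothetical, and `conn` is assumed to be an open psycopg2 connection like the one in the load script.

```python
def build_transcribe_sql(source="public.waves", target="public.transcripts"):
    """Return a CREATE TABLE AS statement that runs the UDF over every row.

    The table names are interpolated directly, so pass only trusted identifiers.
    """
    return (
        "CREATE TABLE {target} AS "
        "SELECT filename, public.speach_to_text(pickles) AS transcript "
        "FROM {source} "
        "DISTRIBUTED BY (filename);"
    ).format(source=source, target=target)


def transcribe_all(conn, source="public.waves", target="public.transcripts"):
    """Execute the statement on an open connection and commit the result."""
    cur = conn.cursor()
    try:
        cur.execute(build_transcribe_sql(source, target))
        conn.commit()
    finally:
        cur.close()
```

The work is spread only as evenly as the rows of the source table are spread across segments, so it is worth checking how that table is distributed before running this on a large batch.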
                  
                

Next steps

As of the release of this post we are still in the process of testing the accuracy of the results. Initial spot checks show CMUSphinx is not as accurate as Google's API but still performs more than adequately for a minimum viable product with the data we have. CMUSphinx offers several tuning parameters and also allows for model updates. We plan to continue iterating on this solution to improve results in the future.
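
One common way to put a number on accuracy is word error rate (WER): the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is an illustration of that metric, not a description of our evaluation. The function name is hypothetical, and the reference text would have to come from hand-transcribed samples, which this post does not include.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / float(len(ref))
```

A score of 0.0 means the transcript matches the reference word for word. Values can exceed 1.0 when the recognizer inserts many extra words.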

Thank you for reading this post. For more information on Greenplum or any of the other content on the A42 Labs blog, do not hesitate to contact us at info@a42labs.io.



Jarrod Vawdrey

Chief Technology Officer @ A42 Labs

Jarrod is an innovative hands-on technology leader with experience hiring, leading and mentoring high performing data science and engineering teams.

He is passionate about solving problems and building data driven solutions that help startups, enterprises and government agencies realize the value of their data assets.

His interests include distributed computing, machine learning, information extraction from unstructured data and the integration of analytics into applications and data services.