When you start a new data science project, you often begin with CSVs or other flat files, usually without type information for the columns. Most database systems ship a utility that imports an external flat file and guesses the column types for you, but Postgres and its forks like Greenplum Database do not: COPY loads the data quickly, yet it needs a table whose column types are already defined.

To simplify this process for ourselves, we wrote our own utility, and we're sharing how you can do the same. You can find the source on our GitHub.

Best-Guess Inference for Data Types

The script makes its best guess at the data type of each column based on the format of the data in the first n rows of that column (n is passed to the function as a parameter). Regex comparisons decide whether a column's values fit the format of an Integer, a Float or Double-Precision Float, a Date, or a Timestamp; if a column matches none of these, it defaults to Text. For now the function only distinguishes these six types, but we hope to expand to more of Postgres's data types in the future.
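
To make that concrete, here is a minimal sketch of this sample-and-match approach in Python. The patterns and names are illustrative assumptions, not the UDF's actual regexes (which clearly differ in places; note that col_date still ends up as text in the demo below):

import re

# Illustrative patterns only; order matters (integer is tried before float).
TYPE_PATTERNS = [
    ("integer", re.compile(r"^[+-]?\d+$")),
    ("double precision", re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")),
    ("date", re.compile(r"^\d{4}-\d{1,2}-\d{1,2}$")),
    ("timestamp", re.compile(r"^\d{4}-\d{1,2}-\d{1,2}[ T]\d{1,2}:\d{2}:\d{2}$")),
]

def infer_column_type(sample_values):
    """Return the first type whose pattern matches every sampled value, else 'text'.

    sample_values holds one column's values from the first n rows of the file.
    """
    for type_name, pattern in TYPE_PATTERNS:
        if all(pattern.match(v.strip()) for v in sample_values):
            return type_name
    return "text"

# infer_column_type(["1234", "001", "1", "2"])              -> "integer"
# infer_column_type(["0.234", "1.234", "1e3", "2.0"])       -> "double precision"
# infer_column_type(["1", "1.0", "1989-12-20", "1900-1-1"]) -> "text"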


Setup

Initializing the UDF

Copy and paste the UDF code into your favorite SQL client, or run the script via psql as follows:

                  
[gpadmin@mdw ~]$ psql -U gpadmin -d gpadmin -a -f infer_types.sql
                  
                
Running the code

We start with the CSV file `infer_type_test.csv` in the current working directory. It has six columns: one for each data type we're checking, plus one for testing exceptions.

                  
[gpadmin@mdw ~]$ cat infer_type_test.csv

col_int,col_float,col_string,col_date,col_tstmp,col_string2
1234,0.234,1a.2,1800-1-1,1989-12-20,1
001,1.234,12-34,1900-1-1,1989-1-1 12:03:04,1.0
1,1e3,asdf,1989-12-20,1900-1-1,1989-12-20
2,2.0,qwer,1989-12-20,2000-12-31,1900-1-1
                  
                

Run the `import_csv.py` script to create a temporary table in Greenplum with all TEXT columns and copy the CSV into that table:

                  
[gpadmin@mdw ~]$ python import_csv.py infer_type_test.csv infer_type_test
[gpadmin@mdw ~]$ psql

...
gpadmin=# select * from infer_type_test;
col_int | col_float | col_string |  col_date  |     col_tstmp     | col_string2
---------+-----------+------------+------------+-------------------+-------------
001     | 1.234     | 12-34      | 1900-1-1   | 1989-1-1 12:03:04 | 1.0
1       | 1e3       | asdf       | 1989-12-20 | 1900-1-1          | 1989-12-20
1234    | 0.234     | 1a.2       | 1800-1-1   | 1989-12-20        | 1
2       | 2.0       | qwer       | 1989-12-20 | 2000-12-31        | 1900-1-1
(4 rows)
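
For context, the heart of an import step like this is just two operations: create a staging table with one TEXT column per CSV header field, then COPY the file into it. The sketch below shows that idea using psycopg2; the function name, connection details, and exact COPY options are assumptions for illustration, not the actual import_csv.py.

import csv
import psycopg2  # assumed driver; the real script's dependencies may differ

def import_csv_as_text(csv_path, table_name, conn):
    # Read the header row to get the column names.
    with open(csv_path, newline="") as f:
        columns = next(csv.reader(f))

    # Build an all-TEXT staging table, then COPY the file into it.
    # (Identifiers are interpolated directly for brevity.)
    col_defs = ", ".join('"{}" TEXT'.format(c) for c in columns)
    with conn.cursor() as cur, open(csv_path, newline="") as f:
        cur.execute('CREATE TABLE "{}" ({})'.format(table_name, col_defs))
        cur.copy_expert('COPY "{}" FROM STDIN WITH CSV HEADER'.format(table_name), f)
    conn.commit()

# Usage (connection parameters are placeholders):
# conn = psycopg2.connect(dbname="gpadmin", user="gpadmin")
# import_csv_as_text("infer_type_test.csv", "infer_type_test", conn)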
                  
                

The function `infer_table_types('[table name]')` creates a new table named "[table name]_infer_types" with the inferred data types (using the logic above) for each column.

                  
gpadmin=# select infer_table_types('infer_type_test');
...
(1 row)
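
Under the hood, once each column has an inferred type, building the new table boils down to a CREATE TABLE ... AS SELECT with one cast per column. The UDF itself lives in SQL; the Python helper below is only a rough illustration of the kind of statement it might generate (the helper name and argument shape are assumptions):

def build_typed_table_sql(table_name, inferred_types):
    """inferred_types: ordered dict of column name -> inferred Postgres type."""
    casts = ", ".join(
        '"{col}"::{typ} AS "{col}"'.format(col=col, typ=typ)
        for col, typ in inferred_types.items()
    )
    return 'CREATE TABLE "{0}_infer_types" AS SELECT {1} FROM "{0}";'.format(
        table_name, casts
    )

# build_typed_table_sql("infer_type_test",
#                       {"col_int": "integer", "col_float": "double precision"})
# -> CREATE TABLE "infer_type_test_infer_types" AS
#    SELECT "col_int"::integer AS "col_int",
#           "col_float"::double precision AS "col_float"
#    FROM "infer_type_test";   (wrapped here for readability)

A CREATE TABLE ... AS SELECT without a DISTRIBUTED BY clause would also be consistent with the "Distributed randomly" note in the \d output further down.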
                  
                

Now run a SELECT to check the data in the new table with the inferred column types:

                  
gpadmin=# select * from infer_type_test_infer_types;
 col_int | col_float | col_string |  col_date  |      col_tstmp      | col_string2
---------+-----------+------------+------------+---------------------+-------------
       1 |      1000 | asdf       | 1989-12-20 | 1900-01-01 00:00:00 | 1989-12-20
       2 |         2 | qwer       | 1989-12-20 | 2000-12-31 00:00:00 | 1900-1-1
       1 |     1.234 | 12-34      | 1900-1-1   | 1989-01-01 12:03:04 | 1.0
    1234 |     0.234 | 1a.2       | 1800-1-1   | 1989-12-20 00:00:00 | 1
(4 rows)
                  
                

You can check the inferred type of each column in this table with the \d command in psql:

                  
gpadmin=# \d infer_type_test_infer_types
      Table "public.infer_type_test_infer_types"
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 col_int     | integer                     |
 col_float   | double precision            |
 col_string  | text                        |
 col_date    | text                        |
 col_tstmp   | timestamp without time zone |
 col_string2 | text                        |
Distributed randomly
                
              

We hope you find this tool as useful as we do!

If your organization is looking for support with data science projects that are using Postgres, Greenplum Database or other tools, please contact us at info@A42Labs.io.



Michiel Shortt

Data Scientist @ A42 Labs

Michiel is a Data Scientist with a strong background in optimization and applications of advanced mathematical algorithms. Fascinated by the limits of computing since he was a teenager, he has had a lifelong interest in the mathematics behind computer science, in particular AI and machine learning.

His professional experience includes software development and the application of machine learning and optimization algorithms to diverse business problems, especially in the fields of industrial engineering and logistics.