Rails 4 and PostgreSQL Arrays

Rails 4 supports arrays fields for PostgreSQL in a nice way, although it is not a very known feature. In order to demonstrate its usage it’s useful to explain the context where this was used.

PostgreSQL Arrays and Rails Migrations

Suppose we have a Product model with the following fields: name, category_id and tags. The name field will be a simple string, category_id will be the foreign key of a record in the Category model and tags will be created by inputting a string of comma-separated words, so: “one, two, forty two” will become the tags: “one”, “two” and “forty two” respectively.

Creating these tables via migrations is nothing new, except for the column tags which will have the Array type in this case. To create this kind of column we use the following syntax in our migration:

create_table :categories do |t|
  t.string :name, null: false
end

create_table :products do |t|
  t.string :name, null: false
  t.references :category, null: false
  t.text :tags, array: true, default: []
end

Let’s explore what we can do with this kind of field using the postgres console:

$ rails db
> insert into products(name, category_id, tags) values('T-Shirt', 3, '{clothing, summer}');
> insert into products(name, category_id, tags) values('Sweater', 3, ARRAY['clothing', 'winter']);
> select * from products;
1  |  T-Shirt  |  3  | {clothing, summer}
2  |  Sweater  |  3  | {clothing, winter}

As we can see we need to specify each tag following this syntax:

‘{ val1, val2, … }’ or ARRAY['val1', 'val2', ...]

Let’s play a little more to understand how this column behaves when queried:

> select * from products where tags = '{clothing, summer}';
1  |  T-Shirt  |  3  | {clothing, summer}

> select * from products where tags = '{summer, clothing}';
(0 rows)

> select * from products where 'winter' = ANY(tags);
2  |  Sweater  |  3  |  {clothing, winter}

As this example demonstrates, searching for records by an array with its values in the order they were inserted works, but with the same values in a different order does not. We were also able to find a record searching for a specific tag using the ANY function.

There’s a lot more to talk about arrays in PostgreSQL, but for our example this is enough. You can find more information at the PostgreSQL official documentation about arrays and its functions.

How Rails treats PostgreSQL arrays

It’s also valuable to see how to use the array field within Rails, let’s try:

$ rails c

Product.create(name: 'Shoes', category: Category.first, tags: ['a', 'b', 'c'])
#> 

Product.find(26).tags
#> ["a", "b", "c"]

So Rails treats an array column in PostgreSQL as an Array in Ruby, pretty reasonable!

Validations

We want each product to be unique, let’s see some examples to clarify this concept.

Given we have the following product:

Product.create(name: 'Shoes', category: Category.first, tags: ['a', 'b', 'c'])

We can easily create another one if we change the name attribute:

Product.create(name: 'Slippers', category: Category.first, tags: ['a', 'b', 'c'])

We can also create another product with different tags:

Product.create(name: 'Shoes', category: Category.first, tags: ['a', 'b'])

But we don’t want to create a product with the same attributes, even if the tags are in a different order:

Product.create(name: 'Shoes', category: Category.first, tags: ['a', 'c', 'b'])
#> false

As PostgreSQL only finds records by tags given the exact order in which they were inserted, then how can we ensure the uniqueness of a product with tags in an order-independent way?

After much thought we decided that a good approach would involve creating an unique index with all the columns in the products table but with tags sorted when a row is inserted in the database. Something like:

CREATE UNIQUE INDEX index_products_on_category_id_and_name_and_tags
ON products USING btree (category_id, name, sort_array(tags));

And sort_array is our custom function responsible for sorting the array, since PostgreSQL does not have a built in function like this.

Creating a custom function in PostgreSQL using PL/pgSQL

To create a custom function we used the PL/pgSQL language, and since we are adding database specific code like this we can’t use the default schema.rb anymore. Let’s change this in config/application.rb:

# Use SQL instead of AR schema dumper when creating the database
config.active_record.schema_format = :sql

With this configuration set, our schema.rb file will be replaced by a structure.sql file without side effects, our current migrations don’t need to be changed at all. Now we can create a migration with our sort_array code:

def up
  execute <<-SQL
    CREATE FUNCTION sort_array(unsorted_array anyarray) RETURNS anyarray AS $$
      BEGIN
        RETURN (SELECT ARRAY_AGG(val) AS sorted_array
        FROM (SELECT UNNEST(unsorted_array) AS val ORDER BY val) AS sorted_vals);
      END;
    $$ LANGUAGE plpgsql IMMUTABLE STRICT;

    CREATE UNIQUE INDEX index_products_on_category_id_and_name_and_tags ON products USING btree (category_id, name, sort_array(tags));
  SQL
end

def down
  execute <<-SQL
    DROP INDEX IF EXISTS index_products_on_category_id_and_name_and_tags;
    DROP FUNCTION IF EXISTS sort_array(unsorted_array anyarray);
  SQL
end

Now, let's take it slow and understand step by step

CREATE FUNCTION sort_array(unsorted_array anyarray) RETURNS anyarray

The line above tells that we are creating a function named sort_array and that it receives a parameter named unsorted_array of type anyarray and returns something of this same type. This anyarray, in fact, is a pseudo-type that indicates that a function accepts any array data type.

RETURN (SELECT ARRAY_AGG(val) AS sorted_array
FROM (SELECT UNNEST(unsorted_array) AS val ORDER BY val) AS sorted_vals);

The trick here is the use of the function unnest that expands an Array to a set of rows. Now we can order these rows and after that we use another function called array_agg that concatenates the input into a new Array.

$$ LANGUAGE plpgsql IMMUTABLE STRICT;

The last trick is the use of the keywords IMMUTABLE  and STRICT. With the first one we guarantee that our function will always return the same output given the same input, we can't use it in our index if we don't specify so. The other one tells that our function will always return null if some of the parameters are not specified.

And that's it! With this we can check for uniqueness in a performant way with some method like:

def duplicate_product_exists?
  relation = self.class.
    where(category_id: category_id).
    where('lower(name) = lower(?)', name).
    where('sort_array(tags) = sort_array(ARRAY[?])', tags)

  relation = relation.where.not(id: id) if persisted?

  relation.exists?
end

Case insensitive arrays

There is still a problem with our code though, the index is not case insensitive!  What if a user inserts a product with tags ['a', 'b'] and another one inserts the same product but with tags ['A', 'b']? Now we have duplication in our database! We have to deal with this, but unfortunately this will increase the complexity of our sort_array function a little bit. To fix this problem we only need to change one single line:

From this:

FROM (SELECT UNNEST(unsorted_array) AS val ORDER BY val) AS sorted_vals);

To:

FROM
(SELECT
  UNNEST(string_to_array(lower(array_to_string(unsorted_array, ',')), ','))
  AS val ORDER BY val)
AS sorted_vals);

The difference is that instead of passing unsorted_array directly to the function unnest we are transforming it in an String, calling lower on it and transforming it back to an Array before passing it on. With this change it doesn't matter if the user inserts ['a'] or ['A'], every tag will be saved in lowercase in the index. Problem solved!

As we can see, it's not an easy task to deal with uniqueness and arrays in the database, but the overall result was great.

Would you solve this problem in a different way? Share with us!

  • Petteri Räty

    Since the end of the blog post asks how the problem can be solved in a different way, I will note that, if you make Tag a separate model, there’s less need for specialized code. However, I considered the point of the post to be to introduce PostgresSQL Arrays and not necessarily to provide to best solution. The article would of course benefit from some thoughts on pros/cons. Thanks for introducing this.

  • Bernardo Chaves

    The suggestion about the pros/cons is a great idea. In this case we didn’t create a Tag model because our uniqueness validation problem would still remain. Also, we don’t use tags for anything else besides product categorization. But I agree with you that if we need something else from tags it might be a good idea to use a proper model. Thanks for the feedback!

  • Petteri Räty

    For the uniqueness a standard unique index over the foreign key and tag name will suffice, if it can be case sensitive. For the case insensitivity with a tag model I would use the schema_plus gem https://github.com/lomba/schema_plus to allow expressions for the index. With this gem the schema format can be kept ruby.

  • Bernardo Chaves

    Cool gem, didn’t know about it. That’s an interesting approach to the problem, thanks for sharing!

  • Jim Nasby

    Yeah, it really sucks that Postgres doesn’t have the concept of a class. I suspect that normally when you do a separate model then you’d end up with a single tags table in the database and a many-many. That’s going to suck performance-wise.

    What we’d really want is the ability to tell Postgres in one place “Hey, here’s how you handle tags” and then you could just instantiate that in whataver table you want.

  • Jim Nasby

    If you want case-insensitive, just use citext: http://www.postgresql.org/docs/9.4/static/citext.html

    You can do the same thing in plSQL, which will probably be faster than plpgsql:

    CREATE OR REPLACE FUNCTION array_sort(anyarray) RETURNS TABLE( sorted_array anyarray ) LANGUAGE sql AS $body$
    SELECT ARRAY_AGG(val) AS sorted_array
    FROM (
    SELECT UNNEST( $1 ) AS val
    ORDER BY val
    ) AS sorted_vals
    $body$;

  • Bernardo Chaves

    That’s a really cool idea and maybe it’s possible to use a custom type for this, but it’s something I still need to explore. Here is the doc: http://www.postgresql.org/docs/current/static/sql-createtype.html

  • Bernardo Chaves

    Great tip about citext! And it’s cool to use only plSQL in this case, I’m not sure about the speed difference though but might be worth a benchmark. Thanks for sharing!