Many to many and upserts

Note: This is a sample chapter of the upcoming beta version of our “What’s new in Ecto 2.0” free book. Download your copy now if you want to receive the next beta and be notified of future versions.

In the previous chapter we learned about many_to_many associations and how to map external data to associated entries with the help of Ecto.Changeset.cast_assoc/3. While we were able to follow the rules imposed by cast_assoc/3 there, doing so is not always possible or desirable.

In this chapter we are going to look at Ecto.Changeset.put_assoc/4 in contrast to cast_assoc/3 and explore some examples. We will also peek at the upsert features coming in Ecto 2.1.

put_assoc vs cast_assoc

Imagine we are building an application that has blog posts and such posts may have many tags. Not only that, a given tag may also belong to many posts. This is a classic scenario where we would use many_to_many associations. Our migrations would look like:

create table(:posts) do
  add :title, :string
  add :body, :text
  timestamps()
end

create table(:tags) do
  add :name, :string
  timestamps()
end

create unique_index(:tags, [:name])

create table(:posts_tags, primary_key: false) do
  add :post_id, references(:posts)
  add :tag_id, references(:tags)
end

Note we added a unique index to the tag name because we don’t want to have duplicated tags in our database. It is important to add an index at the database level instead of using a validation since there is always a chance two tags with the same name would be validated and inserted simultaneously, passing the validation and leading to duplicated entries.

Now let’s also imagine we want the user to input such tags as a list of words separated by commas, such as: “elixir, erlang, ecto”. Once this data is received by the server, we will break it apart into multiple tags and associate them with the post, creating any tag that does not yet exist in the database.

While the constraints above sound reasonable, that’s exactly what puts us in trouble with cast_assoc/3. Remember the cast_assoc/3 changeset function was designed to receive external parameters and compare them with the associated data in our structs. To do so correctly, Ecto requires tags to be sent as a list of maps. However, here we expect tags to be sent as a single comma-separated string.

Furthermore cast_assoc/3 relies on the primary key field for each tag sent in order to decide if it should be inserted, updated or deleted. Again, because the user is simply passing a string, we don’t have the ID information at hand.
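To make the mismatch concrete, compare the two parameter shapes. This is only a sketch: the keys and the id value are made up for illustration and are not taken from a real request.

```elixir
# Shape cast_assoc/3 can work with: a list of maps, where existing tags
# carry their primary key (the "id" value here is hypothetical).
params_for_cast_assoc = %{
  "title" => "Hello",
  "tags" => [
    %{"id" => "1", "name" => "elixir"},
    %{"name" => "ecto"}
  ]
}

# Shape our form actually sends: one comma-separated string, no ids.
params_from_form = %{
  "title" => "Hello",
  "tags" => "elixir, erlang, ecto"
}
```

With the second shape there is nothing for cast_assoc/3 to compare against the associated structs, which is why the tags have to be resolved by hand.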

When we can’t cope with cast_assoc/3, it is time to use put_assoc/4. In put_assoc/4, we give Ecto structs or changesets instead of parameters, giving us the ability to manipulate the data as we want. Let’s define the schema and the changeset function for a post which may receive tags as a string:

defmodule MyApp.Post do
  use Ecto.Schema

  # Assumption: the application's repository module is MyApp.Repo.
  alias MyApp.Repo

  schema "posts" do
    field :title
    field :body
    many_to_many :tags, MyApp.Tag,
      join_through: "posts_tags",
      on_replace: :delete
    timestamps()
  end

  def changeset(struct, params \\ %{}) do
    struct
    |> Ecto.Changeset.cast(params, [:title, :body])
    |> Ecto.Changeset.put_assoc(:tags, parse_tags(params))
  end

  defp parse_tags(params) do
    (params["tags"] || "")
    |> String.split(",")
    |> Enum.map(&String.trim/1)
    |> Enum.reject(& &1 == "")
    |> Enum.map(&get_or_insert_tag/1)
  end

  defp get_or_insert_tag(name) do
    Repo.get_by(MyApp.Tag, name: name) ||
      Repo.insert!(%MyApp.Tag{name: name})
  end
end

In the changeset function above, we moved all the handling of tags to a separate function, called parse_tags/1, which checks for the parameter, breaks its entries apart via String.split/2, removes any leftover whitespace with String.trim/1, rejects any empty string and finally checks whether each tag exists in the database, creating it if it does not.
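The string handling can be tried on its own, without a database. The sketch below uses a hypothetical TagParsingSketch module that runs the same pipeline as parse_tags/1 but stops before the lookup step:

```elixir
defmodule TagParsingSketch do
  # Hypothetical helper: the parse_tags/1 pipeline minus the database step.
  def tag_names(params) do
    (params["tags"] || "")
    |> String.split(",")
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == ""))
  end
end

TagParsingSketch.tag_names(%{"tags" => "elixir, erlang, ,ecto"})
# => ["elixir", "erlang", "ecto"]

TagParsingSketch.tag_names(%{})
# => []
```

Note that a missing "tags" key and an empty entry between commas both collapse to nothing, so the lookup step only ever sees non-empty names.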

The parse_tags/1 function is going to return a list of MyApp.Tag structs which are then passed to put_assoc/4. By calling put_assoc/4, we are telling Ecto those should be the tags associated with the post from now on. In case a previous tag was associated with the post and not given in put_assoc/4, Ecto will also take care of removing the association between the post and the removed tag from the database.

And that’s all we need to use many_to_many associations with put_assoc/4. put_assoc/4 works with has_many, belongs_to and all other association types. However, our code is not yet ready for production. Let’s see why.

Constraints and race conditions

Remember we added a unique index to the tag :name column when creating the tags table. We did so to protect us from having duplicate tags in the database.

By adding the unique index and then using get_by with an insert! to get or insert a tag, we introduced a potential error in our application. If two posts are submitted at the same time with the same tag, there is a chance both will check whether the tag exists at the same moment, leading both submissions to believe there is no such tag in the database. When that happens, only one of the submissions will succeed while the other one will fail. That’s a race condition: your code will error from time to time, only when certain conditions are met. And those conditions are time sensitive.
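The problematic interleaving can be reproduced without a database. The sketch below is only an illustration: an Agent holding a MapSet stands in for the tags table and its unique index, and lookup and insert are hypothetical stand-ins for Repo.get_by and Repo.insert!.

```elixir
# In-memory stand-in for the tags table with a unique index on the name.
{:ok, table} = Agent.start_link(fn -> MapSet.new() end)

# Stand-in for Repo.get_by/2: is there a tag with this name?
lookup = fn name -> Agent.get(table, &MapSet.member?(&1, name)) end

# Stand-in for an insert that hits the unique index.
insert = fn name ->
  Agent.get_and_update(table, fn names ->
    if MapSet.member?(names, name) do
      {{:error, :unique_violation}, names}
    else
      {{:ok, name}, MapSet.put(names, name)}
    end
  end)
end

# Both requests check before either one inserts...
a_sees = lookup.("elixir")
b_sees = lookup.("elixir")

# ...so both decide to insert, and the second one is rejected.
a_result = insert.("elixir")
b_result = insert.("elixir")
```

Here a_sees and b_sees are both false, a_result is {:ok, "elixir"} and b_result is {:error, :unique_violation}. With a real database and insert!, that last step is where the second request would raise.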

Many developers have a tendency to think such errors won’t happen in practice or, if they did, they would be irrelevant. But in practice they often lead to very frustrating user experiences. I have heard a first-hand example coming from a mobile game company. In the game, a player is able to play quests and on every quest you have to choose a guest character from another player out of a short list to go on the quest with you. At the end of the quest, you have the option to add the guest character as a friend.

Originally the whole guest list was random but, as time passed, players started to complain that old accounts, often inactive, were sometimes being shown in the guest options list. To improve the situation, the game developers started to sort the guest list by most recently active. This means that, if you have played recently, there is a higher chance of you being on someone’s guest list.

However, when they made this change, many errors started to show up and users were suddenly furious in the game forum. That’s because when they sorted players by activity, as soon as two players logged in, their characters would likely appear on each other’s guest lists. If those players picked each other’s characters, the first to add the other as a friend at the end of a quest would succeed, but an error would appear when the second player tried to add that character as a friend, since the relationship already existed in the database! Not only that, all the progress made in the quest would be lost, because the server was unable to properly persist the quest results to the database. Understandably, players started to file complaints.

Long story short: we must address the race condition.

Luckily Ecto gives us a mechanism to handle constraint errors from the database.

Checking for constraint errors

Since our get_or_insert_tag(name) function fails when a tag already exists in the database, we need to handle such scenarios accordingly. Let’s rewrite it taking race conditions into account:

defp get_or_insert_tag(name) do
  %MyApp.Tag{}
  |> Ecto.Changeset.change(name: name)
  |> Ecto.Changeset.unique_constraint(:name)
  |> Repo.insert()
  |> case do
    {:ok, tag} -> tag
    {:error, _} -> Repo.get_by!(MyApp.Tag, name: name)
  end
end

Instead of inserting the tag directly, we now build a changeset, which allows us to use the unique_constraint annotation. Now if the Repo.insert operation fails because the unique index for :name is violated, Ecto won’t raise, but will return an {:error, changeset} tuple. Therefore, if the Repo.insert succeeds, it is because the tag was saved; otherwise the tag already exists, and we fetch it with Repo.get_by!.

While the mechanism above fixes the race condition, it is quite an expensive one: we need to perform two queries for every tag that already exists in the database, the (failed) insert and then the repository lookup. Given that’s the most common scenario, we may want to rewrite it to the following:

defp get_or_insert_tag(name) do
  Repo.get_by(MyApp.Tag, name: name) || maybe_insert_tag(name)
end

defp maybe_insert_tag(name) do
  %MyApp.Tag{}
  |> Ecto.Changeset.change(name: name)
  |> Ecto.Changeset.unique_constraint(:name)
  |> Repo.insert()
  |> case do
    {:ok, tag} -> tag
    {:error, _} -> Repo.get_by!(MyApp.Tag, name: name)
  end
end

The above performs 1 query for every tag that already exists, 2 queries for every new tag and possibly 3 queries in the case of race conditions. While the above would perform slightly better on average, Ecto 2.1 has a better option in store.

Upserts

Ecto 2.1 supports the so-called “upsert” command which is an abbreviation for “update or insert”. The idea is that we try to insert a record and in case it conflicts with an existing entry, for example due to a unique index, we can choose how we want the database to act by either raising an error (the default behaviour), ignoring the insert (no error) or by updating the conflicting database entries.

“upsert” in Ecto 2.1 is done with the :on_conflict option. Let’s rewrite get_or_insert_tag(name) once more but this time using the :on_conflict option. Remember that “upsert” is a new feature in PostgreSQL 9.5, so make sure you are up to date.

Your first attempt at using :on_conflict may be to set it to :nothing, as below:

defp get_or_insert_tag(name) do
  Repo.insert!(%MyApp.Tag{name: name}, on_conflict: :nothing)
end

While the above won’t raise an error in case of conflicts, it also won’t update the struct given, so it will return a tag without an ID. One solution is to force an update to happen in case of conflicts, even if the update only sets the tag name to its current name. In such cases, PostgreSQL also requires the :conflict_target option to be given, which is the column (or list of columns) on which we expect the conflict to happen:

defp get_or_insert_tag(name) do
  Repo.insert!(%MyApp.Tag{name: name},
               on_conflict: [set: [name: name]], conflict_target: :name)
end

And that’s it! We try to insert a tag with the given name and if such tag already exists, we tell Ecto to update its name to the current value, updating the tag and fetching its id. While the above is certainly a step up from all solutions so far, it still performs one query per tag. If 10 tags are sent, we will perform 10 queries. Can we further improve this?

Upserts and insert_all

Ecto 2.1 added the :on_conflict option not only to Repo.insert/2 but also to the Repo.insert_all/3 function introduced in Ecto 2.0. This means we can build one query that attempts to insert all missing tags and then another query that fetches all of them at once. Let’s see what our Post schema will look like after those changes:

defmodule MyApp.Post do
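  use Ecto.Schema
  import Ecto.Query, only: [from: 2]

  # Assumption: the application's repository module is MyApp.Repo.
  alias MyApp.Repo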

  # Schema is the same
  schema "posts" do
    field :title
    field :body
    many_to_many :tags, MyApp.Tag,
      join_through: "posts_tags",
      on_replace: :delete
    timestamps()
  end

  # Changeset is the same
  def changeset(struct, params \\ %{}) do
    struct
    |> Ecto.Changeset.cast(params, [:title, :body])
    |> Ecto.Changeset.put_assoc(:tags, parse_tags(params))
  end

  # Parse tags has slightly changed
  defp parse_tags(params) do
    (params["tags"] || "")
    |> String.split(",")
    |> Enum.map(&String.trim/1)
    |> Enum.reject(& &1 == "")
    |> insert_and_get_all()
  end

  defp insert_and_get_all([]) do
    []
  end
  defp insert_and_get_all(names) do
    maps = Enum.map(names, &%{name: &1})
    Repo.insert_all(MyApp.Tag, maps, on_conflict: :nothing)
    Repo.all(from t in MyApp.Tag, where: t.name in ^names)
  end
end

Instead of attempting to get and insert each tag individually, the code above works on all tags at once, first by building a list of maps which is given to insert_all and then by looking up all tags with the given names. Therefore, regardless of how many tags are sent, we will perform only 2 queries (unless no tag is sent, in which case we promptly return an empty list). This solution is only possible in Ecto 2.1 thanks to the :on_conflict option, which guarantees insert_all won’t fail in case a given tag name already exists.
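One caveat worth checking in your own setup: Repo.insert_all does not autogenerate values such as timestamps, so if the tags table keeps the timestamps() columns from the migration, each map likely needs them set explicitly. A minimal sketch of building such maps, assuming the columns hold naive datetimes with second precision (the names list is a stand-in for the parsed input):

```elixir
# Stand-in for the list of names produced by the parsing step.
names = ["elixir", "ecto"]

# Assumption: inserted_at/updated_at are naive datetimes truncated to seconds.
now = NaiveDateTime.utc_now() |> NaiveDateTime.truncate(:second)

# One map per tag, with both timestamp columns filled in by hand.
maps = Enum.map(names, &%{name: &1, inserted_at: now, updated_at: now})
```

The resulting maps list would then be passed to insert_all in place of the name-only maps.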

Finally, keep in mind that we haven’t used transactions in any of the examples so far. That decision was deliberate: getting or inserting tags is an idempotent operation, i.e. we can repeat it many times and it will always give us the same result back. Therefore, even if we fail to insert the post into the database due to a validation error, the user will be free to resubmit the form and we will just attempt to get or insert the same tags once again. The downside of this approach is that tags will be created even if creating the post fails, which means some tags may not have posts associated with them. In case that’s not desired, the whole operation could be wrapped in a transaction or modeled with the Ecto.Multi abstraction we will learn about in future chapters.




13 responses to “Many to many and upserts”

  1. wdiechmann says:

    A wonderful write!

    Am I correct in assuming that

    maps = Enum.map(names, &%{name: &1})

    should/could be written like

    _ = Enum.map(names, &%{name: &1})

    as the return value is not used?

  2. josevalim says:

    Oh, we should use it on `insert_all`. I will update the code snippet, thank you!

  3. Smith Aitufe says:

    I totally enjoyed your teaching.

    Thanks so much for the wonderful piece. I am looking forward to future posts.

    I love elixir, phoenix and ecto. Just a request, I will love more posts on supervisors, web workers and real life examples.

  4. Jeremy Miranda says:

    shouldn’t you be passing params instead of struct on
    this:
    def changeset(struct, params \\ %{}) do
    struct
    |> Ecto.Changeset.cast(params, [:title, :body])
    |> Ecto.Changeset.put_assoc(:tags, parse_tags(params))
    end

    instead of:
    def changeset(struct, params \\ %{}) do
    struct
    |> Ecto.Changeset.cast(struct, [:title, :body])
    |> Ecto.Changeset.put_assoc(:tags, parse_tags(params))
    end

    if not could you explain please. Thanks.

  5. josevalim says:

    You are right. Fixed!

  6. victoriawagman says:

    Hi

    Why does it say:
    add :title
    add :body

    instead of
    field :title
    field :body

    Thank you for writing & sharing!

  7. josevalim says:

    That’s a mistake, thank you! I have fixed it.

  8. evuez says:

    Being able to do an upsert is pretty cool, but with binary ids if I do an upsert on an existing row, the returned struct contains a newly generated id instead of the existing one (which kinda makes sense but is not that useful). Is there a way to get the actual id back in the same query?

    Thanks for the article anyway 🙂

  9. Hassan Zaki says:

    Thanks for the great post.

    There’s a missing “(” in insert_and_get_all/1

    Repo.all(from t in MyApp.Tag, where: t.name in ^names)

  10. josevalim says:

    If the database is not returning it, then there is nothing Ecto can do. :S

  11. josevalim says:

    Fixed, thank you!

  12. Jesper Christiansen says:

    I’m a bit confused.. So for the creation of a new record, I provide a list of tags as a string “tag1, tag2” etc. But when I go to edit the post I just created, do I need to do something special for the tags? I get an error the second I load the /edit route.

    def edit(conn, %{"id" => id}) do
    post = Repo.get!(Post, id)
    |> Repo.preload([:tags])

    changeset = Post.changeset(post)
    render(conn, "edit.html", post: post, changeset: changeset)
    end

    That gives me the following error on load:

    you are attempting to change relation :tags of App.Post, but there is missing data.
    If you are attempting to update an existing entry, please make sure
    you include the entry primary key (ID) alongside the data. etc etc etc

    What am I doing wrong? :/

  13. josevalim says:

    I believe we forgot this:

    many_to_many :tags, MyApp.Tag, join_through: "posts_tags", on_replace: :delete

    I.e. if you follow the instructions in your error message, it will mention about the `:on_replace` option and in this case we want to delete tags we no longer pass.

    Thanks for the comment and let us know how it goes!