{"id":8581,"date":"2019-02-11T10:58:26","date_gmt":"2019-02-11T12:58:26","guid":{"rendered":"http:\/\/blog.plataformatec.com.br\/?p=8581"},"modified":"2019-02-12T09:59:04","modified_gmt":"2019-02-12T11:59:04","slug":"migrations-in-databases-with-large-amount-of-data","status":"publish","type":"post","link":"https:\/\/blog.plataformatec.com.br\/2019\/02\/migrations-in-databases-with-large-amount-of-data\/","title":{"rendered":"Migrations in databases with large amount of data"},"content":{"rendered":"

There is a discussion that always comes up when dealing with database migrations.

Should I use the migrations to also migrate data? I mean, I've already altered the structure, so it would be easy to change the data by including some SQL as well, and this would guarantee that everything is working after the deploy. Right?

It could work, but in most cases, it could also cause a database lock and a major production problem.

In general, the guideline is to move the commands responsible for migrating the data into a separate task, and then execute it after the migrations are up to date.

In some cases, even this can take an eternity when you are dealing with a database with millions of records. `UPDATE` statements are expensive for the database, and sometimes it is preferable to create a new table with the right info and, once everything is OK, rename it to take the place of the old one. But sometimes we don't want to, or simply can't, rename the table, for any number of reasons.
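As a minimal, hypothetical sketch of that table-swap approach (all names here are placeholders, not part of this post's example):

```sql
-- Hypothetical sketch: build a corrected copy of the table, then
-- swap names. Note that CREATE TABLE AS does not copy constraints,
-- indexes, or defaults, so those must be recreated before the swap.
CREATE TABLE items_new AS
SELECT id,
       payload,
       lower(payload) AS normalized_payload -- the "migrated" data
FROM items;

ALTER TABLE items RENAME TO items_old;
ALTER TABLE items_new RENAME TO items;
```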

When you are dealing with millions of records and need to migrate data, one thing you can do is create a SQL script that migrates the data in batches. This is faster and consumes less memory, because you won't open one single transaction in the database to migrate all the records.

One thing to consider when using migration scripts is to disable the nonessential indexes on the table being written. Indexes are meant to improve read performance, but they can slow down writes significantly, because every time you write a record the database also has to update the index structures. Now imagine this in a scenario of millions of records: it could take way longer than it should.
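As a hypothetical illustration (the index, table, and column names are placeholders; keep any index the migration itself relies on, such as the primary key):

```sql
-- Hypothetical sketch: drop a nonessential index before the bulk
-- write and recreate it once the migration is done.
DROP INDEX IF EXISTS items_status_idx;

-- ... run the data migration here ...

CREATE INDEX items_status_idx ON items (status);
```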

Every database has its own characteristics, but most of what you can do in one you can also do in another, thanks to the SQL specification that every database implements. So when you are writing these scripts, it is important to always look at the documentation. I'll be using PostgreSQL in this example, but the idea can be applied to most databases.

### Let's take a look at an example

Suppose we are dealing with an e-commerce application and notice slowness on the orders page. By analyzing the query, we see that it can be improved by denormalizing one of its tables. Let's set up some data to show how this could be done.

CREATE TABLE \"users\" (\nid serial PRIMARY KEY,\naccount_id int not NULL,\nname varchar(10)\n);\n\nCREATE TABLE orders (\nid SERIAL PRIMARY KEY,\ndata text not NULL,\nuser_id int not NULL\n);\n\n-- Generates 200_000_000 orders\nINSERT INTO \"users\" (account_id)\nSELECT generate_series(1,50000000);\n\nINSERT INTO orders (data, user_id)\nSELECT 't-shirt' AS data,\n       generate_series(1,50000000);\n\nINSERT INTO orders (data, user_id)\nSELECT 'backpack' AS data,\n       generate_series(1,50000000);\n\nINSERT INTO orders (data, user_id)\nSELECT 'sunglass' AS data,\n       generate_series(1,50000000);\n\nINSERT INTO orders (data, user_id)\nSELECT 'sketchbook' AS data,\n       generate_series(1,50000000);\n\nCREATE index ON \"users\" (\"account_id\");\nCREATE index ON \"orders\" (\"user_id\");\n<\/code><\/pre>\n
SELECT\n  \"orders\".\"id\",\n  \"orders\".\"data\"\nFROM \"orders\"\nINNER JOIN \"users\" ON (\"orders\".\"user_id\" = \"users\".\"id\")\nWHERE \"users\".account_id = 4500000;\n<\/code><\/pre>\n

The results from this query take about **45s** to return. If we run `EXPLAIN ANALYZE` on this query, we will see that the `join` is what takes the longest, even though it is a simple query.
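For reference, the plan shown below can be obtained by prefixing the query with `EXPLAIN ANALYZE` (standard PostgreSQL syntax):

```sql
-- Produces the execution plan plus actual run times for the query
EXPLAIN ANALYZE
SELECT
  "orders"."id",
  "orders"."data"
FROM "orders"
INNER JOIN "users" ON ("orders"."user_id" = "users"."id")
WHERE "users".account_id = 4500000;
```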

\"\"<\/p>\n

One of the things we can do to improve this query is to denormalize the `orders` table and create another column, `user_account_id`, that will be a copy of the `account_id` column from the `users` table. This way we can remove the `join` and read the info directly.

ALTER TABLE \"orders\" ADD COLUMN \"user_account_id\" integer;\n<\/code><\/pre>\n

If we weren't dealing with such a large dataset, the easiest way to do it would be to write a simple `UPDATE FROM` and go on with life, but with this much data it could take too long to finish.

```sql
UPDATE orders
SET user_account_id = users.account_id
FROM users
WHERE orders.user_id = users.id;
```

### Updating records in batches

The approach we will explore in this blog post is to migrate this amount of data using a script that performs the update in batches.

We will need to control which items are updated. If your table has a sequential id column, this is easy; otherwise, you will need to find another way to iterate through the records. One way to control this is to create another table, or a temp table, to store the data that needs to be changed, using a `ROW_NUMBER` function to generate a sequential ID or just creating a sequential column (see the sketch after the link below). The only limitation with a temp table is that the database hardware needs to be able to handle this many records in memory.

PostgreSQL: Documentation: 9.3: CREATE TABLE
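A minimal sketch of that temp-table approach, assuming the schema from this example (the `orders_to_update` name and `seq` column are illustrative, not from the original post):

```sql
-- Illustrative sketch: stage the rows that need updating in a temp
-- table with a sequential number, so batches can be picked by range.
CREATE TEMP TABLE orders_to_update AS
SELECT ROW_NUMBER() OVER (ORDER BY o.id) AS seq,
       o.id AS order_id,
       u.account_id
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.user_account_id IS NULL;

-- A batch update would then filter by the seq range:
-- ... WHERE seq BETWEEN from_number AND until_number
```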

Luckily for us, we have a sequential column in our table that we can use to control the items. To iterate through the records in PostgreSQL, you can use control structures such as `FOR` or `WHILE`.

PostgreSQL: Documentation: 9.2: Control Structures

You can also print some messages during the process to provide feedback while the queries are running; chances are it will take a while to finish if you are dealing with a large dataset.

https://www.postgresql.org/docs/9.6/plpgsql-errors-and-messages.html

```sql
DO $$
DECLARE
   row_count integer := 1;
   batch_size integer := 50000; -- HOW MANY ITEMS WILL BE UPDATED AT A TIME
   from_number integer := 1;
   until_number integer := batch_size;
   affected integer;
BEGIN

row_count := (SELECT count(*) FROM orders WHERE user_account_id IS NULL);

RAISE NOTICE '% items to be updated', row_count;

-- ITERATES THROUGH THE ITEMS UNTIL THERE ARE NO MORE RECORDS TO BE UPDATED
WHILE row_count > 0 LOOP
  UPDATE orders
  SET user_account_id = users.account_id
  FROM users
  WHERE orders.user_id = users.id
  AND orders.id BETWEEN from_number AND until_number;

  -- OBTAINING THE NUMBER OF AFFECTED ROWS
  GET DIAGNOSTICS affected = ROW_COUNT;
  RAISE NOTICE '-> % records updated!', affected;

  -- UPDATES THE COUNTERS SO IT DOESN'T TURN INTO AN INFINITE LOOP
  from_number := from_number + batch_size;
  until_number := until_number + batch_size;
  row_count := row_count - batch_size;

  RAISE NOTICE '% items to be updated', row_count;
END LOOP;

END $$;
```

This gives us message output like the following until the script finishes:

```
NOTICE:  200000000 items to be updated
CONTEXT:  PL/pgSQL function inline_code_block line 12 at RAISE
NOTICE:  -> 50000 records updated!
CONTEXT:  PL/pgSQL function inline_code_block line 23 at RAISE
NOTICE:  199950000 items to be updated
CONTEXT:  PL/pgSQL function inline_code_block line 30 at RAISE
NOTICE:  -> 50000 records updated!
```

After the script finishes, we can create an index on the new column, since it will be used for reading purposes.

CREATE index ON \"orders\" (\"user_account_id\");<\/code><\/p>\n

If we run the `EXPLAIN ANALYZE` command again, we can see the performance improvements.

\"\"<\/p>\n

We can see that, before, the `join` alone was taking a little over `7s` of the query time, approximately 15% of the total. Looking closer, we can also notice that the next three lines of the plan were related to the `join`, and after we denormalized the table they are gone as well.

You can follow the `EXPLAIN ANALYZE` evolution here.

Hope it helps!
