{"id":7690,"date":"2018-07-18T16:29:39","date_gmt":"2018-07-18T19:29:39","guid":{"rendered":"http:\/\/blog.plataformatec.com.br\/?p=7690"},"modified":"2018-10-17T16:47:20","modified_gmt":"2018-10-17T19:47:20","slug":"whats-new-in-flow-v0-14","status":"publish","type":"post","link":"http:\/\/blog.plataformatec.com.br\/2018\/07\/whats-new-in-flow-v0-14\/","title":{"rendered":"What&#8217;s new in Flow v0.14"},"content":{"rendered":"<p>Flow v0.14 has been recently released with more fine-grained control over data emission and tighter integration with GenStage.<\/p>\n<p>In this blog post we will start with a brief recap of Flow and go over the new changes. We end the post with a description of the new Elixir Development Subscription service by Plataformatec and how it has helped us bring those improvements to Flow.<\/p>\n<h2>Quick introduction to Flow<\/h2>\n<p><a href=\"https:\/\/github.com\/elixir-lang\/flow\">Flow<\/a> is a library for computational parallel flows in Elixir. It is built <a href=\"https:\/\/github.com\/elixir-lang\/gen_stage\">on top of GenStage<\/a> which specifies how Elixir processes should communicate with back-pressure.<\/p>\n<p>Flow is inspired by the MapReduce and Apache Spark models but <a href=\"http:\/\/blog.plataformatec.com.br\/2017\/03\/small-data-with-elixir\/\">focuses on single node performance<\/a>. It aims to use all cores of your machines efficiently.<\/p>\n<p>The &#8220;hello world&#8221; of data processing is a word counter. 
Here is how we would count the words in a file with <code>Flow<\/code>:<\/p>\n<pre><code class=\"elixir\">File.stream!(\"path\/to\/some\/file\")\n|&gt; Flow.from_enumerable()\n|&gt; Flow.flat_map(&amp;String.split(&amp;1, \" \"))\n|&gt; Flow.partition()\n|&gt; Flow.reduce(fn -&gt; %{} end, fn word, acc -&gt;\n  Map.update(acc, word, 1, &amp; &amp;1 + 1)\nend)\n|&gt; Enum.to_list()\n<\/code><\/pre>\n<p>If you have a machine with 4 cores, the example above will create 9 light-weight Elixir processes that run concurrently:<\/p>\n<ul>\n<li>1 process for reading from the file (<code>Flow.from_enumerable\/1<\/code>)<\/li>\n<li>4 processes for performing map operations (everything before <code>Flow.partition\/2<\/code>)<\/li>\n<li>4 processes for performing reduce operations (everything after <code>Flow.partition\/2<\/code>)<\/li>\n<\/ul>\n<p>The key operation in the example above is precisely the <code>partition\/2<\/code> call. Since we want to count words, we need to make sure that we always route the same word to the same partition, so all occurrences of a word end up in a single place and are not scattered around.<\/p>\n<p>The other insight here is that map operations can always stream the data, as they simply transform it. The <code>reduce<\/code> operation, on the other hand, needs to accumulate the data until all input data is fully processed. If the Flow is unbounded (i.e. 
it never finishes), then you need to specify windows and triggers to checkpoint the data (for example, checkpoint the data every minute or after 100_000 entries or on some condition specified by business rules).<\/p>\n<p>My ElixirConf 2016 keynote also provides <a href=\"https:\/\/www.youtube.com\/watch?v=srtMWzyqdp8&amp;feature=youtu.be&amp;t=244\">an introduction to Flow<\/a> (<a href=\"https:\/\/elixirconf.com\/\">tickets to ElixirConf 2018 are also available<\/a>!).<\/p>\n<p>With this in mind, let&#8217;s see what Flow v0.14 brings.<\/p>\n<h2>Explicit control over reducing stages<\/h2>\n<p>Flow v0.14 gives more explicit control over how the reducing stage works. Let&#8217;s see a practical example. Imagine you want to connect to Twitter&#8217;s firehose and count the number of mentions of all users on Twitter. Let&#8217;s start by adapting our word counter example:<\/p>\n<pre><code class=\"elixir\">SomeTwitterClient.stream_tweets!()\n|&gt; Flow.from_enumerable()\n|&gt; Flow.flat_map(fn tweet -&gt; tweet[\"mentions\"] end)\n|&gt; Flow.partition()\n|&gt; Flow.reduce(fn -&gt; %{} end, fn mention, acc -&gt;\n  Map.update(acc, mention, 1, &amp; &amp;1 + 1)\nend)\n|&gt; Enum.to_list()\n<\/code><\/pre>\n<p>We changed our code to use some fictional Twitter client that streams tweets and then proceeded to retrieve the mentions in each tweet. The mentions are routed to partitions, which count them. If we attempted to run the code above, it would run until the machine eventually ran out of memory, as the Twitter firehose never finishes.<\/p>\n<p>A possible solution is to use a window that controls the data accumulation. We will say that we want to accumulate the data for a minute. 
When the minute is over, the &#8220;reduce&#8221; operation will emit its accumulator, which we will persist to some storage:<\/p>\n<pre><code class=\"elixir\">window = Flow.Window.periodic(1, :minute, :discard)\n\nSomeTwitterClient.stream_tweets!()\n|&gt; Flow.from_enumerable()\n|&gt; Flow.flat_map(fn tweet -&gt; tweet[\"mentions\"] end)\n|&gt; Flow.partition(window: window)\n|&gt; Flow.reduce(fn -&gt; %{} end, fn mention, acc -&gt;\n  Map.update(acc, mention, 1, &amp; &amp;1 + 1)\nend)\n|&gt; Flow.each_state(fn acc -&gt; MyDb.persist_count_so_far(acc) end)\n|&gt; Flow.start_link()\n<\/code><\/pre>\n<p>The first change is in the first line. We create a window that lasts 1 minute and discards any accumulated state before starting the next window. We pass the window as an option to <code>Flow.partition\/2<\/code>.<\/p>\n<p>The remaining changes come after the <code>Flow.reduce\/3<\/code> call. Whenever the current window terminates, a trigger is emitted. This trigger means that the <code>reduce\/3<\/code> stage will stop accumulating data and invoke the next functions in the Flow. One of these functions is <code>each_state\/2<\/code>, which receives the state accumulated so far and persists it to a database.<\/p>\n<p>Finally, since the flow is infinite, we are no longer calling <code>Enum.to_list\/1<\/code> at the end of the flow, but rather <code>Flow.start_link\/1<\/code>, allowing it to run permanently as part of a supervision tree.<\/p>\n<p>While the solution above is fine, it unfortunately has two implicit decisions in it:<\/p>\n<ul>\n<li><code>each_state<\/code> only runs when the window finishes (i.e. 
a trigger is emitted) but this relationship is not clear in the code<\/p>\n<\/li>\n<li>\n<p>The control of the accumulator is kept in multiple places: the window definition says the accumulator must be discarded after <code>each_state<\/code> while <code>reduce<\/code> controls its initial value<\/p>\n<\/li>\n<\/ul>\n<p>Flow v0.14 introduces a new function named <code>on_trigger\/2<\/code> to make these relationships clearer. As the name implies, <code>on_trigger\/2<\/code> is invoked with the reduced state whenever there is a trigger. The callback given to <code>on_trigger\/2<\/code> must return a tuple with a list of the events to emit and the new accumulator. Let&#8217;s rewrite our example:<\/p>\n<pre><code class=\"elixir\">window = Flow.Window.periodic(1, :minute)\n\nSomeTwitterClient.stream_tweets!()\n|&gt; Flow.from_enumerable()\n|&gt; Flow.flat_map(fn tweet -&gt; tweet[\"mentions\"] end)\n|&gt; Flow.partition(window: window)\n|&gt; Flow.reduce(fn -&gt; %{} end, fn mention, acc -&gt;\n  Map.update(acc, mention, 1, &amp; &amp;1 + 1)\nend)\n|&gt; Flow.on_trigger(fn acc -&gt;\n  MyDb.persist_count_so_far(acc)\n  {[], %{}} # Nothing to emit, reset the accumulator\nend)\n|&gt; Flow.start_link()\n<\/code><\/pre>\n<p>As you can see, the window no longer controls when data is discarded. <code>on_trigger\/2<\/code> gives developers full control over how to change the accumulator and which events to emit. For example, you may choose to keep part of the accumulator for the next window. Or you could process the accumulator to pick only the most mentioned users to emit to the next step in the flow.<\/p>\n<p>Flow v0.14 also introduces an <code>emit_and_reduce\/3<\/code> function that allows you to emit data while reducing. 
Let&#8217;s say we want to track popular users in two ways:<\/p>\n<ol>\n<li>\n<p>whenever a user reaches 100 mentions, we immediately send it to the next stage for processing and reset its counter<\/p>\n<\/li>\n<li>\n<p>for the remaining users, we will get the top 10 most mentioned per partition and send them to the next stage<\/p>\n<\/li>\n<\/ol>\n<p>We can perform this as:<\/p>\n<pre><code class=\"elixir\">window = Flow.Window.periodic(1, :minute)\n\nSomeTwitterClient.stream_tweets!()\n|&gt; Flow.from_enumerable()\n|&gt; Flow.flat_map(fn tweet -&gt; tweet[\"mentions\"] end)\n|&gt; Flow.partition(window: window)\n|&gt; Flow.emit_and_reduce(fn -&gt; %{} end, fn mention, acc -&gt;\n  counter = Map.get(acc, mention, 0) + 1\n\n  if counter == 100 do\n    {[mention], Map.delete(acc, mention)}\n  else\n    {[], Map.put(acc, mention, counter)}\n  end\nend)\n|&gt; Flow.on_trigger(fn acc -&gt;\n  most_mentioned =\n    acc\n    |&gt; Enum.sort(fn {_, count1}, {_, count2} -&gt; count1 &gt;= count2 end)\n    |&gt; Enum.take(10)\n    # Emit only the mention, matching the events emitted by emit_and_reduce\/3\n    |&gt; Enum.map(fn {mention, _count} -&gt; mention end)\n\n  {most_mentioned, %{}}\nend)\n|&gt; Flow.shuffle()\n|&gt; Flow.map(fn mention -&gt; IO.puts(mention) end)\n|&gt; Flow.start_link()\n<\/code><\/pre>\n<p>In the example above, we changed <code>reduce\/3<\/code> to <code>emit_and_reduce\/3<\/code>, so we can emit events as we process them. Then we changed <code>Flow.on_trigger\/2<\/code> to also emit the most mentioned users.<\/p>\n<p>Finally, we have added a call to <code>Flow.shuffle\/1<\/code>, which will receive all of the events emitted by <code>emit_and_reduce\/3<\/code> and <code>on_trigger\/2<\/code> and shuffle them into a series of new stages for further parallel processing.<\/p>\n<p>If you are familiar with data processing pipelines, you may be aware of two pitfalls in the solution above: 1. we are using processing time for handling events and 2. instead of a periodic window, it would probably be best to process events on sliding windows. 
For the former, you can learn more about <a href=\"https:\/\/hexdocs.pm\/flow\/Flow.Window.html#content\">the pitfalls of processing time vs event time in Flow&#8217;s documentation<\/a>. For the latter, we note that Flow does not support sliding windows out of the box but they are straightforward to implement on top of <code>reduce\/3<\/code> and <code>on_trigger\/2<\/code> above.<\/p>\n<p>At the end of the day, the new functionality in Flow v0.14 gives developers more control over their flows while also making the code clearer.<\/p>\n<h2>Tighter integration with GenStage<\/h2>\n<p>Flow v0.14 also introduces new functions to make integration with <a href=\"http:\/\/github.com\/elixir-lang\/gen_stage\">GenStage<\/a> easier. One of these functions is <code>through_stages\/3<\/code>, which complements <code>from_stages\/2<\/code> and <code>into_stages\/3<\/code>, allowing developers to pipe a flow through already running, hand-written stages:<\/p>\n<pre><code class=\"elixir\">Flow.from_stages([MyProducer])\n|&gt; Flow.map(...)\n|&gt; Flow.partition(...)\n|&gt; Flow.reduce(...)\n|&gt; Flow.through_stages([MyProducerConsumer])\n|&gt; Flow.map(...)\n|&gt; Flow.into_stages([MyConsumer])\n<\/code><\/pre>\n<p>While the above is handy, it is a little bit awkward. Since the <code>*_stages<\/code> functions expect already running stages, you need to start those stages in a separate part of your application and then integrate them into the Flow.<\/p>\n<p>For this reason, Flow v0.14 also introduces <code>from_specs\/2<\/code>, <code>through_specs\/3<\/code> and <code>into_specs\/3<\/code>, which receive <a href=\"https:\/\/hexdocs.pm\/elixir\/Supervisor.html#module-child-specification\">child specifications<\/a> that control how the stages are started. 
In this case, the Flow takes care of starting those stages and passing the data through them.<\/p>\n<h2>The Elixir Development Subscription<\/h2>\n<p>Some of the improvements done to Flow in version v0.14 were motivated by the feedback we have received from companies participating in our new service called <strong>Elixir Development Subscription<\/strong>.<\/p>\n<p>The Elixir Development Subscription service helps companies build Elixir applications with speed and confidence, by leveraging Plataformatec&#8217;s engineering team for support and assistance.<\/p>\n<p>If you are adopting Elixir or any of the tools in its ecosystem, such as GenStage, Flow, Phoenix, Ecto and others, and you would like to learn more about the service, please fill in the form below, and we will reach out to you as new spots become available!<\/p>\n<div id=\"elixir-development-subscription-ce5dbf074cb453546755\" role=\"main\"><\/div>\n<p><script type=\"text\/javascript\" src=\"https:\/\/d335luupugsy2.cloudfront.net\/js\/rdstation-forms\/stable\/rdstation-forms.min.js\"><\/script><\/p>\n<p><script type=\"text\/javascript\">\njQuery(document).ready(function() {\n  var css = '#conversion-elixir-development-subscription-ce5dbf074cb453546755 div.field > label { font-weight: bold !important; padding: 10px 0 !important; };';\n  var style = document.createElement('style');\n  style.type = 'text\/css';\n  style.appendChild(document.createTextNode(css));\n  document.head.appendChild(style);\n  new RDStationForms('elixir-development-subscription-ce5dbf074cb453546755-html', 'UA-8268430-1').createForm();\n});\n<\/script><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Flow v0.14 has been recently released with more fine grained control on data emission and tighter integration with GenStage. In this blog post we will start with a brief recap of Flow and go over the new changes. 
We end the post with a description of the new Elixir Development Subscription service by Plataformatec and &#8230; <a class=\"read-more-link\" href=\"http:\/\/blog.plataformatec.com.br\/2018\/07\/whats-new-in-flow-v0-14\/\">\u00bb<\/a><\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[1],"tags":[143],"aioseo_notices":[],"jetpack_sharing_enabled":true,"jetpack_featured_media_url":"","_links":{"self":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts\/7690"}],"collection":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/comments?post=7690"}],"version-history":[{"count":11,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts\/7690\/revisions"}],"predecessor-version":[{"id":7861,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts\/7690\/revisions\/7861"}],"wp:attachment":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/media?parent=7690"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/categories?post=7690"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/tags?post=7690"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}