{"id":4628,"date":"2015-05-21T09:00:34","date_gmt":"2015-05-21T12:00:34","guid":{"rendered":"http:\/\/blog.plataformatec.com.br\/?p=4628"},"modified":"2016-07-13T12:26:29","modified_gmt":"2016-07-13T15:26:29","slug":"introducing-reducees","status":"publish","type":"post","link":"http:\/\/blog.plataformatec.com.br\/2015\/05\/introducing-reducees\/","title":{"rendered":"Introducing reducees"},"content":{"rendered":"<p>Elixir provides the concept of collections, which may be in-memory data structures, as well as events, I\/O resources and more. Those collections are supported by the Enumerable protocol, which is an implementation of an abstraction we call &#8220;reducees&#8221;.<\/p>\n<p>In this article, we will outline the design decisions behind such abstraction, often exploring ideas from Haskell, Clojure and Scala that eventually led us to develop this new abstraction called reducees, focusing specially on the constraints and performance characteristics of the Erlang Virtual Machine.<\/p>\n<h2>Recursion and Elixir<\/h2>\n<p><a href=\"http:\/\/elixir-lang.org\">Elixir is a functional programming language<\/a> that runs on the Erlang VM. All the examples on this article will be written in Elixir although we will introduce the concepts bit by bit.<\/p>\n<p>Elixir provides linked-lists. Lists can hold many items and, with pattern matching, it is easy to extract the head (the first item) and the tail (the rest) of a list:<\/p>\n<pre><code class=\"iex\">iex&gt; [h|t] = [1, 2, 3]\niex&gt; h\n1\niex&gt; t\n[2, 3]\n<\/code><\/pre>\n<p>An empty list won&#8217;t match the pattern <code>[h|t]<\/code>:<\/p>\n<pre><code class=\"iex\">[h|t] = []\n** (MatchError) no match of right hand side value: []\n<\/code><\/pre>\n<p>Suppose we want to recurse every element in the list, multiplying each element by 2. Let&#8217;s write a double function:<\/p>\n<pre><code class=\"elixir\">defmodule Recursion do\n  def double([h|t]) do\n    [h*2|double(t)]\n  end\n\n  def double([]) do\n    []\n  end\nend\n<\/code><\/pre>\n<p>The function above recursively traverses the list, doubling the head at each step and invoking itself with the tail. We could define a similar function if we wanted to triple every element in the list but it makes more sense to abstract our current implementation. Let&#8217;s define a function called <code>map<\/code> that applies a given function to each element in the list:<\/p>\n<pre><code class=\"elixir\">defmodule Recursion do\n  def map([h|t], fun) do\n    [fun.(h)|map(t, fun)]\n  end\n\n  def map([], _fun) do\n    []\n  end\nend\n<\/code><\/pre>\n<p><code>double<\/code> could now be defined in terms of <code>map<\/code> as follows:<\/p>\n<pre><code class=\"elixir\">def double(list) do\n  map(list, fn x -&gt; x * 2 end)\nend\n<\/code><\/pre>\n<p>Manually recursing the list is straight-forward but it doesn&#8217;t really compose. Imagine we would like to implement other functional operations like <code>filter<\/code>, <code>reduce<\/code>, <code>take<\/code> and so on for lists. Then we introduce sets, dictionaries, and queues into the language and we would like to provide the same operations for all of them.<\/p>\n<p>Instead of manually implementing all of those operations for each data structure, it is better to provide an abstraction that allows us to define those operations only once, and they will work with different data structures.<\/p>\n<p>That&#8217;s our next step.<\/p>\n<h2>Introducing Iterators<\/h2>\n<p>The idea behind iterators is that we ask the data structure what is the next item until the data structure no longer has items to emit.<\/p>\n<p>Let&#8217;s implement iterators for lists. This time, we will be using Elixir documentation and doctests to detail how we expect iterators to work:<\/p>\n<pre><code class=\"elixir\">defmodule Iterator do\n  @doc \"\"\"\n  Each step needs to return a tuple containing\n  the next element and a payload that will be\n  invoked the next time around.\n\n      iex&gt; next([1, 2, 3])\n      {1, [2, 3]}\n      iex&gt; next([2, 3])\n      {2, [3]}\n      iex&gt; next([3])\n      {3, []}\n      iex&gt; next([])\n      :done\n  \"\"\"\n  def next([h|t]) do\n    {h, t}\n  end\n\n  def next([]) do\n    :done\n  end\nend\n<\/code><\/pre>\n<p>We can implement <code>map<\/code> on top of <code>next<\/code>:<\/p>\n<pre><code class=\"elixir\">def map(collection, fun) do\n  map_next(next(collection), fun)\nend\n\ndefp map_next({h, t}, fun) do\n  [fun.(h)|map_next(next(t), fun)]\nend\n\ndefp map_next(:done, _fun) do\n  []\nend\n<\/code><\/pre>\n<p>Since <code>map<\/code> uses the <code>next<\/code> function, as long as we implement <code>next<\/code> for a new data structure, <code>map<\/code> (and all future functions) should work out of the box. This brings the polymorphism we desired but it has some downsides.<\/p>\n<p>Besides not having ideal performance, it is quite hard to make iterators work with resources (events, I\/O, etc), leading to messy and error-prone code.<\/p>\n<p>The trouble with resources is that, if something goes wrong, we need to tell the resource that it should be closed. After all, we don&#8217;t want to leave file descriptors or database connections open. This means we need to extend our <code>next<\/code> contract to introduce at least one other function called <code>halt<\/code>.<\/p>\n<p><code>halt<\/code> should be called if the iteration is interrupted suddenly, either because we are no longer interested in the next items (for example, if someone calls <code>take(collection, 5)<\/code> to retrieve only the first five items) or because an error happened. Let&#8217;s start with take:<\/p>\n<pre><code class=\"elixir\">def take(collection, n) do\n  take_next(next(collection), n)\nend\n\n# Invoked on every step\ndefp take_next({h, t}, n) when n &gt; 0 do\n  [h|take_next(next(t), n - 1)]\nend\n\n# If we reach this, the collection finished\ndefp take_next(:done, _n) do\n  []\nend\n\n# If we reach this, we took all we cared about before finishing\ndefp take_next(value, 0) do\n  halt(value) # Invoke halt as a \"side-effect\" for resources\n  []\nend\n<\/code><\/pre>\n<p>Implementing <code>take<\/code> is somewhat straight-forward. However we also need to modify <code>map<\/code> since every step in the user supplied function can fail. Therefore we need to make sure we call <code>halt<\/code> on every possible step in case of failures:<\/p>\n<pre><code class=\"elixir\">def map(collection, fun) do\n  map_next(next(collection), fun)\nend\n\ndefp map_next({h, t}, fun) do\n  [try do\n     fun.(h)\n   rescue\n     e -&gt;\n       # Invoke halt as a \"side-effect\" for resources\n       # in case of failures and then re-raise\n       halt(t)\n       raise(e)\n   end|map_next(next(t), fun)]\nend\n\ndefp map_next(:done, _fun) do\n  []\nend\n<\/code><\/pre>\n<p>This is not elegant nor performant. Furthermore, it is very error prone. If we forget to call <code>halt<\/code> at some particular point, we can end-up with a dangling resource that may never be closed.<\/p>\n<h2>Introducing reducers<\/h2>\n<p>Not long ago, Clojure introduced the concept of <a href=\"http:\/\/clojure.com\/blog\/2012\/05\/08\/reducers-a-library-and-model-for-collection-processing.html\">reducers<\/a>.<\/p>\n<p>Since Elixir protocols were heavily inspired on Clojure protocols, I was very excited to see their take on collection processing. Instead of imposing a particular mechanism for traversing collections as in iterators, reducers are about sending computations to the collection so the collection applies the computation on itself. From the announcement: &#8220;the only thing that knows how to apply a function to a collection is the collection itself&#8221;.<\/p>\n<p>Instead of using a <code>next<\/code> function, reducers expect a <code>reduce<\/code> implementation. Let&#8217;s implement this <code>reduce<\/code> function for lists:<\/p>\n<pre><code class=\"elixir\">defmodule Reducer do\n  def reduce([h|t], acc, fun) do\n    reduce(t, fun.(h, acc), fun)\n  end\n\n  def reduce([], acc, _fun) do\n    acc\n  end\nend\n<\/code><\/pre>\n<p>With reduce, we can easily calculate the sum of a collection:<\/p>\n<pre><code class=\"elixir\">def sum(collection) do\n  reduce(collection, 0, fn x, acc -&gt; x + acc end)\nend\n<\/code><\/pre>\n<p>We can also implement map in terms of reduce. The list, however, will be reversed at the end, requiring us to <code>reverse<\/code> it back:<\/p>\n<pre><code class=\"elixir\">def map(collection, fun) do\n  reversed = reduce(collection, [], fn x, acc -&gt; [fun.(x)|acc] end)\n  # Call Erlang reverse (implemented in C for performance)\n  :lists.reverse(reversed)\nend\n<\/code><\/pre>\n<p>Reducers provide many advantages:<\/p>\n<ul>\n<li>They are conceptually simpler and faster<\/li>\n<li>Operations like <code>map<\/code>, <code>filter<\/code>, etc are easier to implement than the iterators one since the recursion is pushed to the collection instead of being part of every operation<\/li>\n<li>It opens the door to parallelism as its operations are no longer serial (in contrast to iterators)<\/li>\n<li>No conceptual changes are required to support resources as collections<\/li>\n<\/ul>\n<p>The last bullet is the most important for us. Because the collection is the one applying the function, we don&#8217;t need to change <code>map<\/code> to support resources, all we need to do is to implement <code>reduce<\/code> itself. Here is a pseudo-implementation of reducing a file line by line:<\/p>\n<pre><code class=\"elixir\">def reduce(file, acc, fun) do\n  descriptor = File.open(file)\n\n  try do\n    reduce_next(IO.readline(descriptor), acc, fun)\n  after\n    File.close(descriptor)\n  end\nend\n\ndefp reduce_next({line, descriptor}, acc, fun) do\n  reduce_next(IO.readline(descriptor), fun.(line, acc), fun)\nend\n\ndefp reduce_next(:done, acc, _fun) do\n  acc\nend\n<\/code><\/pre>\n<p>Even though our file reducer uses something that looks like an iterator, because that&#8217;s the best way to traverse the file, from the <code>map<\/code> function perspective we don&#8217;t care which operation is used internally. Furthermore, it is guaranteed the file is closed after reducing, regardless of success or failure.<\/p>\n<p>There are, however, two issues when implementing reducers as proposed in Clojure into Elixir.<\/p>\n<p>First of all, some operations like <code>take<\/code> cannot be implemented in a purely functional way. For example, Clojure relies <a href=\"http:\/\/clojure.org\/atoms\">on reference types<\/a> on its <a href=\"https:\/\/github.com\/clojure\/clojure\/blob\/4964194bae036dc2cddd3f9c03479657aeab6e8c\/src\/clj\/clojure\/core\/reducers.clj#L234-L246\">take implementation<\/a>. This may not be an issue depending on the language\/platform (it certainly isn&#8217;t in Clojure) but it is an issue in Elixir as side-effects would require us to spawn new processes every time take is invoked.<\/p>\n<p>Another drawback of reducers is, because the collection is the one controlling the reducing, we cannot implement operations like <code>zip<\/code> that requires taking one item from a collection, then suspending the reduction, then taking an item from another collection, suspending it, and starting again by resuming the first one and so on. Again, at least not in a purely functional way.<\/p>\n<p>With reducers, we achieve the goal of a single abstraction that works efficiently with in-memory data structures and resources. However, it is limited on the amount of operations we can support efficiently, in a purely functional way, so we had to continue looking.<\/p>\n<h2>Introducing iteratees<\/h2>\n<p>It was at Code Mesh 2013 that I first heard about iteratees. <a href=\"http:\/\/www.infoq.com\/presentations\/io-code-data\">I attended a talk by Jessica Kerr<\/a> and, in the first minutes, she described exactly where my mind was at the moment: iterators and reducers indeed have their limitations, but they have been solved in <a href=\"https:\/\/github.com\/scalaz\/scalaz-stream\">scalaz-stream<\/a>.<\/p>\n<p>After the talk, Jessica and I started to explore how scalaz-stream solves those problems, eventually leading us to <a href=\"https:\/\/themonadreader.files.wordpress.com\/2010\/05\/issue16.pdf\">the Monad.Reader issue that introduces iteratees<\/a>. After some experiments, we had a prototype of iteratees working in Elixir.<\/p>\n<p>With iteratees, we have &#8220;instructions&#8221; going &#8220;up and down&#8221; between the source and the reducing function telling what is the next step in the collection processing:<\/p>\n<pre><code class=\"elixir\">defmodule Iteratee do\n  @doc \"\"\"\n  Enumerates the collection with the given instruction.\n\n  If the instruction is a `{:cont, fun}` tuple, the given\n  function will be invoked with `{:some, h}` if there is\n  an entry in the collection, otherwise `:done` will be\n  given.\n\n  If the instruction is `{:halt, acc}`, it means there is\n  nothing to process and the collection should halt.\n  \"\"\"\n  def enumerate([h|t], {:cont, fun}) do\n    enumerate(t, fun.({:some, h})) \n  end\n\n  def enumerate([], {:cont, fun}) do\n    fun.(:done)\n  end\n\n  def enumerate(_, {:halt, acc}) do\n    {:halted, acc}\n  end\nend\n<\/code><\/pre>\n<p>With <code>enumerate<\/code> defined, we can write <code>map<\/code>:<\/p>\n<pre><code class=\"elixir\">def map(collection, fun) do\n  {:done, acc} = enumerate(collection, {:cont, mapper([], fun)})\n  :lists.reverse(acc)\nend\n\ndefp mapper(acc, fun) do\n  fn\n    {:some, h} -&gt; {:cont, mapper([fun.(h)|acc], fun)}\n    :done      -&gt; {:done, acc}\n  end\nend\n<\/code><\/pre>\n<p><code>enumerate<\/code> is called with <code>{:cont, mapper}<\/code> where <code>mapper<\/code> will receive <code>{:some, h}<\/code> or <code>:done<\/code>, as defined by <code>enumerate<\/code>. The <code>mapper<\/code> function then either returns <code>{:cont, mapper}<\/code>, with a new <code>mapper<\/code> function, or <code>{:done, acc}<\/code> when the collection has told no new items will be emitted.<\/p>\n<p>The Monad.Reader publication defines iteratees as teaching fold (reduce) new tricks. This is precisely what we have done here. For example, while <code>map<\/code> only returns <code>{:cont, mapper}<\/code>, it could have returned <code>{:halt, acc}<\/code> and that would have told the collection to halt. That&#8217;s how <code>take<\/code> could be implemented with iteratees, we would send <code>cont<\/code> instructions until we are no longer interested in new elements, finally returning <code>halt<\/code>.<\/p>\n<p>So while iteratees allow us to teach reduce new tricks, they are much harder to grasp conceptually. Not only that, functions implemented with iteratees were from 6 to 8 times slower in Elixir when compared to their reducer counterpart.<\/p>\n<p>In fact, it is even harder to see how iteratees are actually based on reduce since it hides the accumulator inside a closure (the <code>mapper<\/code> function, in this case). This is also the cause of the performance issues in Elixir: for each mapped element in the collection, we need to generate a new closure, which becomes very expensive when mapping, filtering or taking items multiple times.<\/p>\n<p>That&#8217;s when we asked: what if we could keep what we have learned with iteratees while maintaining the simplicity and performance characteristics of reduce?<\/p>\n<h2>Introducing reducees<\/h2>\n<p>Reducees are similar to iteratees. The difference is that they clearly map to a reduce operation and do not create closures as we traverse the collection. Let&#8217;s implement reducee for a list:<\/p>\n<pre><code class=\"elixir\">defmodule Reducee do\n  @doc \"\"\"\n  Reduces the collection with the given instruction,\n  accumulator and function.\n\n  If the instruction is a `{:cont, acc}` tuple, the given\n  function will be invoked with the next item and the\n  accumulator.\n\n  If the instruction is `{:halt, acc}`, it means there is\n  nothing to process and the collection should halt.\n  \"\"\"\n  def reduce([h|t], {:cont, acc}, fun) do\n    reduce(t, fun.(h, acc), fun) \n  end\n\n  def reduce([], {:cont, acc}, _fun) do\n    {:done, acc}\n  end\n\n  def reduce(_, {:halt, acc}, _fun) do\n    {:halted, acc}\n  end\nend\n<\/code><\/pre>\n<p>Our reducee implementations maps cleanly to the original reduce implementation. The only difference is that the accumulator is always wrapped in a tuple containing the next instruction as well as the addition of a <code>halt<\/code> checking clause.<\/p>\n<p>Implementing <code>map<\/code> only requires us to send those instructions as we reduce:<\/p>\n<pre><code class=\"elixir\">def map(collection, fun) do\n  {:done, acc} =\n    reduce(collection, {:cont, []}, fn x, acc -&gt;\n      {:cont, [fun.(x)|acc]}\n    end)\n :lists.reverse(acc)\nend\n<\/code><\/pre>\n<p>Compared to the original reduce implementation:<\/p>\n<pre><code class=\"elixir\">def map(collection, fun) do\n  reversed = reduce(collection, [], fn x, acc -&gt; [fun.(x)|acc] end)\n  :lists.reverse(reversed)\nend\n<\/code><\/pre>\n<p>The only difference between both implementations is the accumulator wrapped in tuples. We have effectively replaced the closures in iteratees by two-item tuples in reducees, which provides a considerably speed up in terms of performance.<\/p>\n<p>The tuple approach allows us to teach new tricks to reducees too. For example, our initial implementation already supports passing <code>{:halt, acc}<\/code> instead of <code>{:cont, acc}<\/code>, which we can use to implement <code>take<\/code> on top of reducees:<\/p>\n<pre><code class=\"elixir\">def take(collection, n) when n &gt; 0 do\n  {_, {acc, _}} =\n    reduce(collection, {:cont, {[], n}}, fn\n      x, {acc, count} -&gt; {take_instruction(count), {[x|acc], n-1}}\n    end)\n :lists.reverse(acc)\nend\n\ndefp take_instruction(1), do: :halt\ndefp take_instruction(n), do: :cont\n<\/code><\/pre>\n<p>The accumulator in given to <code>reduce<\/code> now holds a list, to collect results, as well as the number of elements we still need to take from the collection. Once we have taken the last item (count == 1), we <code>halt<\/code> the collection.<\/p>\n<p>At the end of the day, this is the abstraction that ships with Elixir. It solves all requirements outlined so far: it is simple, fast, works with both in-memory data structures and resources as collections, and it supports both <code>take<\/code> and <code>zip<\/code> operations in a purely functional way.<\/p>\n<h2>Eager vs Lazy<\/h2>\n<p>Elixir developers mostly do not need to worry about the underlying reducees abstraction. Developers work directly with the module <a href=\"http:\/\/elixir-lang.org\/docs\/master\/elixir\/#!Enum.html\">Enum<\/a> which provides a series of operations that work with any collection. For example:<\/p>\n<pre><code class=\"iex\">iex&gt; Enum.map([1, 2, 3], fn x -&gt; x * 2 end)\n[2, 4, 6]\n<\/code><\/pre>\n<p>All functions in Enum are eager. The <code>map<\/code> operation above receives a list and immediately returns a list. None the less, it didn&#8217;t take long for us to add lazy variants of those operations:<\/p>\n<pre><code class=\"iex\">iex&gt; Stream.map([1, 2, 3], fn x -&gt; x * 2 end)\n#Stream&lt;...&gt;\n<\/code><\/pre>\n<p>All the functions in <a href=\"http:\/\/elixir-lang.org\/docs\/master\/elixir\/#!Stream.html\">Stream<\/a> are lazy: they only store the computation to be performed, traversing the collection just once after all desired computations have been expressed.<\/p>\n<p>In addition, the <code>Stream<\/code> module provides a series of functions for abstracting resources, generating infinite collections and more.<\/p>\n<p>In other words, in Elixir we use the same abstraction to provide both eager and lazy operations, that accepts both in-memory data structures or resources as collections, all conveniently encapsulated in both <a href=\"http:\/\/elixir-lang.org\/docs\/stable\/elixir\/Enum.html\">Enum<\/a> and <a href=\"http:\/\/elixir-lang.org\/docs\/stable\/elixir\/Stream.html\">Stream<\/a> modules. This allows developers to migrate from one mode of operation to the other as needed.<\/p>\n<p><em>P.S.: An enormous thank you to Jessica Kerr for introducing me to iteratees and pairing with me at Code Mesh. Also, thanks to Jafar Husein for the conversations at Code Mesh and the team behind Rx which we are exploring next. Finally, thank you to James Fish, Pater Hamilton, Eric Meadows-J\u00f6nsson and Alexei Sholik for the countless reviews, feedback and prototypes regarding Elixir&#8217;s future.<\/em><\/p>\n<p><a href=\"http:\/\/plataformatec.com.br\/elixir-radar\"><br \/>\n  <img decoding=\"async\" src=\"http:\/\/blog.plataformatec.com.br\/wp-content\/uploads\/2015\/05\/elixir-radar-subscribe.png\" alt=\"Subscribe to Elixir Radar\" style=\"border:0\"><br \/>\n<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Elixir provides the concept of collections, which may be in-memory data structures, as well as events, I\/O resources and more. Those collections are supported by the Enumerable protocol, which is an implementation of an abstraction we call &#8220;reducees&#8221;. In this article, we will outline the design decisions behind such abstraction, often exploring ideas from Haskell, &#8230; <a class=\"read-more-link\" href=\"http:\/\/blog.plataformatec.com.br\/2015\/05\/introducing-reducees\/\">\u00bb<\/a><\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[1],"tags":[143],"aioseo_notices":[],"jetpack_sharing_enabled":true,"jetpack_featured_media_url":"","_links":{"self":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts\/4628"}],"collection":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/comments?post=4628"}],"version-history":[{"count":10,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts\/4628\/revisions"}],"predecessor-version":[{"id":9592,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/posts\/4628\/revisions\/9592"}],"wp:attachment":[{"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/media?parent=4628"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/categories?post=4628"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/blog.plataformatec.com.br\/wp-json\/wp\/v2\/tags?post=4628"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}