Replacing GenEvent by a Supervisor + GenServer

The downsides of GenEvent have been extensively documented, and for those reasons the Elixir team has a long-term plan to deprecate it. Meanwhile, we are introducing tools, such as Registry (upcoming in Elixir v1.4) and GenStage, which better address the domains developers would consider using GenEvent for.

However, a very minimal replacement for GenEvent can be achieved today in Elixir with a Supervisor and multiple GenServers. We have recently used this technique in ExUnit, Elixir’s built-in test framework, as we prepare for an eventual deprecation of GenEvent.

Let’s explore this solution.

The old event manager

ExUnit ships with an event manager that emits notifications whenever tests, test cases, and the test suite start and finish. For example, to implement a custom ExUnit formatter, which controls how ExUnit prints output as your test suite runs, you implement a GenEvent handler and add it to the event manager.

The implementation of the event manager with GenEvent is quite straightforward:

defmodule ExUnit.EventManager do
  def start_link() do
    GenEvent.start_link()
  end

  def stop(pid) do
    GenEvent.stop(pid)
  end

  def add_handler(pid, handler, opts) do
    GenEvent.add_handler(pid, handler, opts)
  end

  def suite_started(pid, opts) do
    notify(pid, {:suite_started, opts})
  end

  def suite_finished(pid, run_us, load_us) do
    notify(pid, {:suite_finished, run_us, load_us})
  end

  def case_started(pid, test_case) do
    notify(pid, {:case_started, test_case})
  end

  def case_finished(pid, test_case) do
    notify(pid, {:case_finished, test_case})
  end

  def test_started(pid, test) do
    notify(pid, {:test_started, test})
  end

  def test_finished(pid, test) do
    notify(pid, {:test_finished, test})
  end

  defp notify(pid, msg) do
    GenEvent.notify(pid, msg)
  end
end

The semantics in this case are dictated by GenEvent:

  1. If there is an error in any of the handlers, such as a custom formatter, that handler is automatically removed from the GenEvent. A custom formatter won’t be added back or restarted until the test suite runs again

  2. Events are dispatched asynchronously, with the GenEvent.notify/2 function

  3. Multiple handlers are processed serially; GenEvent is unable to exploit concurrency out of the box
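For reference, a handler under this old scheme implements the GenEvent behaviour. A minimal sketch follows; MyFormatter is hypothetical and simply forwards events to an interested process (note that GenEvent has since been deprecated in Elixir):

```elixir
defmodule MyFormatter do
  # Hypothetical formatter handler for the GenEvent-based manager.
  use GenEvent

  # The third argument to add_handler/3 becomes the initial state;
  # here we pass the pid of a process interested in the events.
  def init(parent) do
    {:ok, parent}
  end

  # All events are handled serially, inside the manager's own process.
  def handle_event({:test_finished, test}, parent) do
    send(parent, {:formatted_test, test})
    {:ok, parent}
  end

  # Ignore events this formatter does not care about.
  def handle_event(_event, parent) do
    {:ok, parent}
  end
end
```

Such a handler would be installed with `ExUnit.EventManager.add_handler(pid, MyFormatter, self())`.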

ExUnit’s event manager is a very simple, low-profile use case of GenEvent. Even so, we decided it would be better to move ExUnit away from GenEvent to promote good patterns.

The new event manager

Given the semantics above, we have decided to replace GenEvent with a simple one-for-one Supervisor, where each handler is a separate GenServer added as a child of the supervisor, and each event is dispatched asynchronously to each handler using GenServer.cast/2. Let’s see the new code.

defmodule ExUnit.EventManager do
  @timeout 30_000

  def start_link() do
    import Supervisor.Spec
    child = worker(GenServer, [], restart: :temporary)
    Supervisor.start_link([child], strategy: :simple_one_for_one)
  end

  def stop(sup) do
    for {_, pid, _, _} <- Supervisor.which_children(sup) do
      GenServer.stop(pid, :normal, @timeout)
    end
    Supervisor.stop(sup)
  end

  def add_handler(sup, handler, opts) do
    Supervisor.start_child(sup, [handler, opts])
  end

  def suite_started(sup, opts) do
    notify(sup, {:suite_started, opts})
  end

  def suite_finished(sup, run_us, load_us) do
    notify(sup, {:suite_finished, run_us, load_us})
  end

  def case_started(sup, test_case) do
    notify(sup, {:case_started, test_case})
  end

  def case_finished(sup, test_case) do
    notify(sup, {:case_finished, test_case})
  end

  def test_started(sup, test) do
    notify(sup, {:test_started, test})
  end

  def test_finished(sup, test) do
    notify(sup, {:test_finished, test})
  end

  defp notify(sup, msg) do
    for {_, pid, _, _} <- Supervisor.which_children(sup) do
      GenServer.cast(pid, msg)
    end
    :ok
  end
end

The changes to the codebase are minimal. The semantics now are:

  1. If there is an error in any of the handlers, such as a custom formatter, that handler is automatically removed by the Supervisor and is not restarted, as the :restart option was set to :temporary. A custom formatter will be started again only when the test suite runs again

  2. Events are dispatched asynchronously, with the GenServer.cast/2 function

  3. Multiple handlers are now processed concurrently

On the handler side, the changes are also minimal. When using GenEvent, a handler had to implement a callback such as:

def handle_event({:test_finished, %ExUnit.Test{}}, state) do
  ...
  {:ok, new_state}
end

Now with a GenServer:

def handle_cast({:test_finished, %ExUnit.Test{}}, state) do
  ...
  {:noreply, new_state}
end
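Putting the pieces together: because the child spec is worker(GenServer, []), the extra arguments given to Supervisor.start_child/2 are appended to the empty list, so add_handler(sup, handler, opts) boils down to GenServer.start_link(handler, opts). A complete handler is then just a regular GenServer module. A minimal sketch, where MyFormatter and the printed output are illustrative:

```elixir
defmodule MyFormatter do
  use GenServer

  # opts is the third argument given to add_handler/3;
  # we keep it as-is for the handler state.
  def init(opts) do
    {:ok, opts}
  end

  # Events now arrive as casts, each handler running in its own process.
  def handle_cast({:test_finished, test}, state) do
    IO.puts("finished: #{inspect(test)}")
    {:noreply, state}
  end

  # Ignore events this formatter does not care about.
  def handle_cast(_event, state) do
    {:noreply, state}
  end
end
```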

Overall, using GenServer is a plus, since developers are more likely to be acquainted with its API and callbacks. Furthermore, we also gained concurrency between handlers.

Watch out!

The replacement above is straightforward because the original code was a simple and low-profile usage of GenEvent. For example, both the old and new implementations can afford to use asynchronous communication with handlers because we can reasonably assume most time is spent in the test suite and not in the handlers themselves.

In other words, neither the old nor the new implementation provides back-pressure. So if you expect any of your handlers to perform tons of work, it will accumulate an ever-growing queue of messages to process. If desired, you can provide back-pressure by replacing GenServer.cast/2 with GenServer.call/3. But then execution will be serial unless you call each handler inside a task:

sup
|> Supervisor.which_children()
|> Enum.map(fn {_, pid, _, _} -> Task.async(GenServer, :call, [pid, msg]) end)
|> Enum.map(&Task.await/1)

Another decision we made was to use GenServer.stop/3 to synchronously terminate handlers. This only works because we set :restart to :temporary; otherwise directly shutting down handlers would cause the supervisor to restart them. Alternatively, you could skip GenServer.stop/3 altogether and simply let Supervisor.stop/1 do the work of shutting down all children with exit signals. Then, if a particular child needs synchronous termination, it can trap exits. We avoided this on purpose because we expect all handlers to require synchronous termination. Your mileage may vary.
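To illustrate the alternative, here is a sketch of a handler that traps exits so its terminate/2 callback runs when the supervisor shuts it down; SyncHandler and its cleanup are hypothetical:

```elixir
defmodule SyncHandler do
  use GenServer

  def init(parent) do
    # Trapping exits turns the supervisor's shutdown exit signal into
    # an orderly termination, so terminate/2 runs before the process dies.
    Process.flag(:trap_exit, true)
    {:ok, parent}
  end

  def handle_cast(_event, parent) do
    {:noreply, parent}
  end

  def terminate(_reason, parent) do
    # Synchronous cleanup goes here (flush buffers, close files, ...).
    # For illustration we just notify an interested process.
    send(parent, :cleaned_up)
    :ok
  end
end
```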

In any case, there you go! A short example of how to replace a GenEvent by a Supervisor and GenServer and the design decisions we took along the way.

  • Petri Kero

    Sounds awesome! Do these GenEvent-related problems also apply to Logger, and are there plans to fix it as well? There seem to be plenty of Logger backends that perform network access, which could take a long time to finish in the face of network problems. Is it possible that such a backend could block Logger completely, leading to timeouts in processes calling it?

  • Logger is one of the few use cases where a GenEvent still seems like a reasonable solution. Because many different clients send iodata to a single process, you don’t want that iodata copied to multiple handler processes over and over again, as would happen in the example above.

    We are discussing solutions that move most of the work to the client, which would better exploit concurrency, but until then we don’t have plans to drop :gen_event from Logger.

  • Petri Kero

    That’s a fair point. Moving work to the caller also sounds sensible.

    Are there recommended patterns for building backends that log over the network or can otherwise block for a long time? I’ve had my whole system come down because Logger got choked to death by too much logging, which timed out pretty much every process in the whole system. That was obviously my fault, but it sounds like a 10s latency spike in some HTTP-based backend could lead to the same result. Or perhaps there are better approaches to the whole logging problem in a cluster environment?

    Another potentially very useful feature for Logger would be the ability to shed load under massive pressure (be it from the user’s own mistake, some backend stalling, or a cascade of errors). That would result in lost log messages, but at least it would not bring down a lot of processes with it.

  • Shedding load is the responsibility of each Logger backend at the moment. I would recommend any logger that needs to go over the network to batch those requests in the handler and then send them to a separate process that will do the “upload”. The separate process can do the shedding if necessary, since it is extremely important that Logger handlers do not block.