Small data with Elixir

José Valim — Fri, 17 Mar 2017 16:25:30 +0000

This is the first article in a series of articles about “small data” (in contrast to “big data”) in Elixir. We will start by defining what is “small data”, why it matters and then briefly describe the Flow tool and what to expect in the next articles of the series.

How small is small?

We define small data as data that can be processed by a single machine in a desirable time interval. Such work may be done in batches, where all the data is known upfront, or on streaming data, where one or more machines can keep up with the incoming rate of events without a need for synchronization.

Yanpei Chen, Sara Alspaugh, and Randy Katz, from University of California, have characterized different MapReduce workloads, and concluded that:

For job-level scheduling and execution planning all workloads contain a range of job types, with the most common being small jobs. These jobs are small in all dimensions compared with other jobs in the same workload. They involve 10s of KB to GB of data, exhibit a range of data patterns between the map and reduce stages, and have durations of 10s of seconds to a few minutes.

Ionel Gog, Malte Schwarzkopf, Natacha Crooks, Matthew P. Grosvenor, Allen Clement, and Steven Hand, from University of Cambridge and Max Planck Institute for Software Systems, when developing Musketeer compared different solutions and found that

For small inputs (≤ 0.5GB), the Metis single-machine MapReduce system performs best. This matters, as small inputs are common in practice: 40–80% of Cloudera customers’ MapReduce jobs and 70% of jobs in a Facebook trace have ≤ 1GB of input.

Often the computation was not the bottleneck but reading the data from external sources. Being able to stream from and to external sources in parallel is paramount for the performance of such systems.

Finally, Frank McSherry, Michael Isardm, and Derek G. Murray published “Scalability! But at what COST?”. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation.

The cluster computing environment is different from the environment of a laptop. The former often values high capacity and throughput over latency, with slower cores, storage, and memory. The laptop now embodies the personal computer, with lower capacity but faster cores, storage, and memory. While scalable systems are often a good match to cluster resources, it is important to consider alternative hardware for peak performance.

In other words, there is a large set of problems that are more efficiently solved on a single machine, as it avoids the costs in complexity, network communication and data checkpointing common to big data systems.

What exactly constitutes small data depends on the problem, the data size (or its incoming rate) and the expected processing times. In this series of articles, we will explore solutions to different problems with the Flow library. Flow leverages concurrency on single-machines and may be a suitable option for small workloads, saving teams the need to resort to fully fledged big data solutions.

GenStage and Flow

Last year we have introduced GenStage, an abstraction for exchanging data between Elixir processes. GenStage was designed with back-pressure in mind so Elixir developers are able to consume data from external systems, such as Apache Kafka, RabbitMQ, databases, files and so on without overloading the system processing the data.

Stages may be producers and/or consumers of data. A single producer stage may have multiple consumers, which will receive events according to a chosen strategy. This means developers can create arbitrarily stage pipelines as a way to leverage concurrency.

However, if developers are the ones responsible for building those pipelines, they may end-up with suboptimal workflows. That’s why we developed a tool called Flow, built on top of GenStage. Flow allows developers to express their data computations using functional operations, such as map, reduce, filter, and friends. Flow also provides conveniences for data partioning and windowing. Once such parameters are specified, Flow takes care of building a network of connected stages where the data flows through. Here is the classic (and cliché) example of using Flow for counting words on a file:

File.stream!("path/to/file")
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map ->
Map.update(map, word, 1, & &1 + 1)
end)
|> Enum.into(%{})

Don’t worry about the details of the example above for now. We will revisit it in future posts.

Next steps

In the next article, we will talk about lazy computations and async streams, which provide some useful background before jumping into Flow. If you would like to get a head start, watch my keynote about GenStage & Flow at ElixirConf and read the excellent documentation of the Flow project.

The post Small data with Elixir first appeared on Plataformatec Blog.

small data « Plataformatec Blog

Small data with Elixir

How small is small?

GenStage and Flow

Next steps