Small data with Elixir « Plataformatec Blog

This is the first article in a series of articles about “small data” (in contrast to “big data”) in Elixir. We will start by defining what is “small data”, why it matters and then briefly describe the Flow tool and what to expect in the next articles of the series.

How small is small?

We define small data as data that can be processed by a single machine in a desirable time interval. Such work may be done in batches, where all the data is known upfront, or on streaming data, where one or more machines can keep up with the incoming rate of events without a need for synchronization.

Yanpei Chen, Sara Alspaugh, and Randy Katz, from University of California, have characterized different MapReduce workloads, and concluded that:

For job-level scheduling and execution planning all workloads contain a range of job types, with the most common being small jobs. These jobs are small in all dimensions compared with other jobs in the same workload. They involve 10s of KB to GB of data, exhibit a range of data patterns between the map and reduce stages, and have durations of 10s of seconds to a few minutes.

Ionel Gog, Malte Schwarzkopf, Natacha Crooks, Matthew P. Grosvenor, Allen Clement, and Steven Hand, from University of Cambridge and Max Planck Institute for Software Systems, when developing Musketeer compared different solutions and found that

For small inputs (≤ 0.5GB), the Metis single-machine MapReduce system performs best. This matters, as small inputs are common in practice: 40–80% of Cloudera customers’ MapReduce jobs and 70% of jobs in a Facebook trace have ≤ 1GB of input.

Often the computation was not the bottleneck but reading the data from external sources. Being able to stream from and to external sources in parallel is paramount for the performance of such systems.

Finally, Frank McSherry, Michael Isardm, and Derek G. Murray published “Scalability! But at what COST?”. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation.

The cluster computing environment is different from the environment of a laptop. The former often values high capacity and throughput over latency, with slower cores, storage, and memory. The laptop now embodies the personal computer, with lower capacity but faster cores, storage, and memory. While scalable systems are often a good match to cluster resources, it is important to consider alternative hardware for peak performance.

In other words, there is a large set of problems that are more efficiently solved on a single machine, as it avoids the costs in complexity, network communication and data checkpointing common to big data systems.

What exactly constitutes small data depends on the problem, the data size (or its incoming rate) and the expected processing times. In this series of articles, we will explore solutions to different problems with the Flow library. Flow leverages concurrency on single-machines and may be a suitable option for small workloads, saving teams the need to resort to fully fledged big data solutions.

GenStage and Flow

Last year we have introduced GenStage, an abstraction for exchanging data between Elixir processes. GenStage was designed with back-pressure in mind so Elixir developers are able to consume data from external systems, such as Apache Kafka, RabbitMQ, databases, files and so on without overloading the system processing the data.

Stages may be producers and/or consumers of data. A single producer stage may have multiple consumers, which will receive events according to a chosen strategy. This means developers can create arbitrarily stage pipelines as a way to leverage concurrency.

However, if developers are the ones responsible for building those pipelines, they may end-up with suboptimal workflows. That’s why we developed a tool called Flow, built on top of GenStage. Flow allows developers to express their data computations using functional operations, such as map, reduce, filter, and friends. Flow also provides conveniences for data partioning and windowing. Once such parameters are specified, Flow takes care of building a network of connected stages where the data flows through. Here is the classic (and cliché) example of using Flow for counting words on a file:

File.stream!("path/to/file")
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map ->
Map.update(map, word, 1, & &1 + 1)
end)
|> Enum.into(%{})

Don’t worry about the details of the example above for now. We will revisit it in future posts.

Next steps

In the next article, we will talk about lazy computations and async streams, which provide some useful background before jumping into Flow. If you would like to get a head start, watch my keynote about GenStage & Flow at ElixirConf and read the excellent documentation of the Flow project.

6 responses to “Small data with Elixir”

Boshan Sun says:

March 23, 2017 at 4:14 pm

Looking forward to the coming series!
Tamori Hirofumi says:

April 14, 2017 at 1:41 pm

Hello, it is a quite suggestive post, thanks!
I found the bug in the sample code, File.stream!(“path/to/file”, :line) should be File.stream!(“path/to/file”, [], :line) or File.stream!(“path/to/file”) as stream!/3’s signature is stream!(path, modes \ [], line_or_bytes \ :line).
josevalim says:

April 14, 2017 at 3:11 pm

Thank you! I have fixed the example!
Tamori Hirofumi says:

April 26, 2017 at 1:17 am

As I told to you at Elixirconf Japan, I translated this blog into Japanese.
http://qiita.com/HirofumiTamori/items/8e076d28ef40d98d99f1
Sorry for late notice. I’m looking forward to your next post.
Matías Reyes says:

June 9, 2017 at 4:44 pm

This is a great topic! and Flow is amazing!
I have read that immutability has bad performance for data & number crunching. I don’t know if It’s proven, but if it’s true, we should be using ETS tables all over?
josevalim says:

June 9, 2017 at 5:26 pm

It may or may not be a performance issue. While writing Flow and GenStage itself, I ran only once into a situation which required speeding up and I used the process dictionary in those cases. ETS tables would also have worked though.