This is the first article in a series about “small data” (in contrast to “big data”) in Elixir. We will start by defining what “small data” is and why it matters, then briefly describe the Flow tool and what to expect in the next articles of the series.
How small is small?

We define small data as data that can be processed by a single machine in a desirable time interval. Such work may be done in batches, where all the data is known upfront, or on streaming data, where one or more machines can keep up with the incoming rate of events without a need for synchronization.
Yanpei Chen, Sara Alspaugh, and Randy Katz, from the University of California, have characterized different MapReduce workloads, and concluded that:
Ionel Gog, Malte Schwarzkopf, Natacha Crooks, Matthew P. Grosvenor, Allen Clement, and Steven Hand, from the University of Cambridge and the Max Planck Institute for Software Systems, when developing Musketeer, compared different solutions and found that:
Finally, Frank McSherry, Michael Isard, and Derek G. Murray published “Scalability! But at what COST?”. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation.
What exactly constitutes small data depends on the problem, the data size (or its incoming rate) and the expected processing times. In this series of articles, we will explore solutions to different problems with the Flow library. Flow leverages concurrency on a single machine and may be a suitable option for small workloads, saving teams from having to resort to full-fledged big data solutions.
GenStage and Flow
Last year we introduced GenStage, an abstraction for exchanging data between Elixir processes. GenStage was designed with back-pressure in mind, so Elixir developers can consume data from external systems, such as Apache Kafka, RabbitMQ, databases, files and so on, without overloading the system processing the data.
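The demand-driven idea behind back-pressure can be sketched with plain Elixir processes. The snippet below is only an illustration under stated assumptions: the `DemandSketch` module and its `{:ask, ...}` and `{:events, ...}` messages are hypothetical names made up for this sketch and are not the GenStage API. The point it tries to show is that the producer never emits more events than the consumer has asked for.

```elixir
defmodule DemandSketch do
  # Hypothetical illustration of demand-driven back-pressure.
  # This is NOT the GenStage API, only plain processes and messages.

  # Producer with nothing left: answer the next request with an empty
  # batch and stop.
  def producer([]) do
    receive do
      {:ask, consumer, _demand} -> send(consumer, {:events, []})
    end
  end

  # Producer with pending events: wait for a request and emit at most
  # `demand` events, never more.
  def producer(events) do
    receive do
      {:ask, consumer, demand} ->
        {batch, rest} = Enum.split(events, demand)
        send(consumer, {:events, batch})
        producer(rest)
    end
  end

  # Consumer: asks for `demand` events, handles them, then asks again.
  # An empty batch signals that the producer is done.
  def consume(producer, demand, handled \\ []) do
    send(producer, {:ask, self(), demand})

    receive do
      {:events, []} -> handled
      {:events, batch} -> consume(producer, demand, handled ++ batch)
    end
  end
end

producer = spawn(fn -> DemandSketch.producer(Enum.to_list(1..10)) end)
result = DemandSketch.consume(producer, 3)
IO.inspect(result)
```

Because the consumer controls how many events are in flight, a slow consumer simply asks less often, and the producer waits instead of flooding it.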
```elixir
File.stream!("path/to/file")
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map ->
  Map.update(map, word, 1, & &1 + 1)
end)
|> Enum.into(%{})
```

Don’t worry about the details of the example above for now. We will revisit it in future posts.
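For comparison, a single-process version of the same word count can be sketched with only the standard library, in the spirit of the single-threaded baselines discussed earlier. The `WordCountBaseline` module name and the sample lines are assumptions made for illustration; they are not part of Flow.

```elixir
defmodule WordCountBaseline do
  # Hypothetical module, shown only as a sequential point of comparison.

  # Counts word occurrences across an enumerable of lines in one process.
  def count(lines) do
    lines
    |> Stream.flat_map(&String.split/1)
    |> Enum.reduce(%{}, fn word, acc ->
      Map.update(acc, word, 1, &(&1 + 1))
    end)
  end
end

# Sample input made up for this sketch.
lines = ["roses are red", "violets are blue"]
counts = WordCountBaseline.count(lines)
IO.inspect(counts)
```

The same function should also accept a file stream, for example `File.stream!("path/to/file") |> WordCountBaseline.count()`. Measuring a baseline like this first gives a reference point for judging whether a concurrent version pays off for a given workload.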
Next steps
In the next article, we will talk about lazy computations and async streams, which provide some useful background before jumping into Flow. If you would like to get a head start, watch my keynote about GenStage & Flow at ElixirConf and read the excellent documentation of the Flow project.
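As a small preview of those two topics, here is a minimal sketch that uses only the standard library's `Stream` and `Task.async_stream/3`. The numbers and variable names are arbitrary choices for illustration and are not taken from the upcoming articles.

```elixir
# Lazy: building the stream does no work. Values are computed only when
# the stream is consumed, and Enum.take/2 stops after three matches.
lazy =
  1..1_000_000
  |> Stream.map(&(&1 * 2))
  |> Stream.filter(&(rem(&1, 3) == 0))

first_three = Enum.take(lazy, 3)
IO.inspect(first_three)

# Async: Task.async_stream/3 runs the function in concurrent tasks,
# bounded by :max_concurrency, and yields {:ok, value} tuples in input
# order by default.
squares =
  1..5
  |> Task.async_stream(fn n -> n * n end,
    max_concurrency: System.schedulers_online()
  )
  |> Enum.map(fn {:ok, value} -> value end)

IO.inspect(squares)
```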