<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>small data « Plataformatec Blog</title>
	<atom:link href="/tag/small-data/feed/" rel="self" type="application/rss+xml" />
	<link>/</link>
	<description>Plataformatec&#039;s place to talk about Ruby, Ruby on Rails, Elixir, and software engineering</description>
	<lastBuildDate>Fri, 14 Apr 2017 18:11:22 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.4.2</generator>
	<item>
		<title>Small data with Elixir</title>
		<link>/2017/03/small-data-with-elixir/</link>
					<comments>/2017/03/small-data-with-elixir/#comments</comments>
		
		<dc:creator><![CDATA[José Valim]]></dc:creator>
		<pubDate>Fri, 17 Mar 2017 16:25:30 +0000</pubDate>
				<category><![CDATA[English]]></category>
		<category><![CDATA[elixir]]></category>
		<category><![CDATA[small data]]></category>
		<guid isPermaLink="false">/?p=6185</guid>

					<description><![CDATA[<p>This is the first article in a series of articles about &#8220;small data&#8221; (in contrast to &#8220;big data&#8221;) in Elixir. We will start by defining what is &#8220;small data&#8221;, why it matters and then briefly describe the Flow tool and what to expect in the next articles of the series. How small is small? We ... <a class="read-more-link" href="/2017/03/small-data-with-elixir/">»</a></p>
<p>The post <a href="/2017/03/small-data-with-elixir/">Small data with Elixir</a> first appeared on <a href="/">Plataformatec Blog</a>.</p>]]></description>
										<content:encoded><![CDATA[<p>This is the first article in a series of articles about &#8220;small data&#8221; (in contrast to &#8220;big data&#8221;) in Elixir. We will start by defining what is &#8220;small data&#8221;, why it matters and then briefly describe the <code>Flow</code> tool and what to expect in the next articles of the series.</p>
<h2>How small is small?</h2>
<p>We define small data as data that can be processed by a single machine in a desirable time interval. Such work may be done in batches, where all the data is known upfront, or on streaming data, where one or more machines can keep up with the incoming rate of events without a need for synchronization.</p>
<p>Yanpei Chen, Sara Alspaugh, and Randy Katz, from University of California, have <a href="https://people.eecs.berkeley.edu/~alspaugh/papers/mapred_workloads_vldb_2012.pdf">characterized different MapReduce workloads</a>, and concluded that:</p>
<blockquote style="border-left: solid 4px #bbb; padding-left: 15px; color: #888;"><p>
For job-level scheduling and execution planning all workloads contain a range of job types, with the most common being small jobs. These jobs are small in all dimensions compared with other jobs in the same workload. They involve 10s of KB to GB of data, exhibit a range of data patterns between the map and reduce stages, and have durations of 10s of seconds to a few minutes.
</p></blockquote>
<p>Ionel Gog, Malte Schwarzkopf, Natacha Crooks, Matthew P. Grosvenor, Allen Clement, and Steven Hand, from University of Cambridge and Max Planck Institute for Software Systems, when <a href="http://www.cl.cam.ac.uk/research/srg/netos/camsas/pubs/eurosys15-musketeer.pdf">developing Musketeer</a> compared different solutions and found that</p>
<blockquote style="border-left: solid 4px #bbb; padding-left: 15px; color: #888;"><p>
For small inputs (≤ 0.5GB), the Metis single-machine MapReduce system performs best. This matters, as small inputs are common in practice: 40–80% of Cloudera customers’ MapReduce jobs and 70% of jobs in a Facebook trace have ≤ 1GB of input.
</p></blockquote>
<p>Often the computation was not the bottleneck but reading the data from external sources. Being able to stream from and to external sources in parallel is paramount for the performance of such systems.</p>
<p>Finally, Frank McSherry, Michael Isardm, and Derek G. Murray published <a href="http://www.frankmcsherry.org/assets/COST.pdf">&#8220;Scalability! But at what COST?&#8221;</a>. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent <strong>single-threaded implementation</strong>.</p>
<blockquote style="border-left: solid 4px #bbb; padding-left: 15px; color: #888;"><p>
The cluster computing environment is different from the environment of a laptop. The former often values high capacity and throughput over latency, with slower cores, storage, and memory. The laptop now embodies the personal computer, with lower capacity but faster cores, storage, and memory. While scalable systems are often a good match to cluster resources, it is important to consider alternative hardware for peak performance.
</p></blockquote>
<p>In other words, there is a large set of problems that are more efficiently solved on a single machine, as it avoids the costs in complexity, network communication and data checkpointing common to big data systems.</p>
<p>What exactly constitutes small data depends on the problem, the data size (or its incoming rate) and the expected processing times. In this series of articles, we will explore solutions to different problems with <a href="https://github.com/elixir-lang/flow">the <code>Flow</code> library</a>. Flow leverages concurrency on single-machines and may be a suitable option for small workloads, saving teams the need to resort to fully fledged big data solutions.</p>
<h2>GenStage and Flow</h2>
<p>Last year <a href="http://elixir-lang.org/blog/2016/07/14/announcing-genstage/">we have introduced GenStage</a>, an abstraction for exchanging data between Elixir processes. GenStage was designed with back-pressure in mind so Elixir developers are able to consume data from external systems, such as Apache Kafka, RabbitMQ, databases, files and so on without overloading the system processing the data.</p>
<p>Stages may be producers and/or consumers of data. A single producer stage may have multiple consumers, which will receive events according to a chosen strategy. This means developers can create arbitrarily stage pipelines as a way to leverage concurrency.</p>
<p>However, if developers are the ones responsible for building those pipelines, they may end-up with suboptimal workflows. That&#8217;s why we developed a tool called <code>Flow</code>, built on top of <code>GenStage</code>. Flow allows developers to express their data computations using functional operations, such as <code>map</code>, <code>reduce</code>, <code>filter</code>, and friends. Flow also provides conveniences for data partioning and windowing. Once such parameters are specified, <code>Flow</code> takes care of building a network of connected stages where the data flows through. Here is the classic (and cliché) example of using Flow for counting words on a file:</p>
<pre><code class="elixir">File.stream!("path/to/file")
|&gt; Flow.from_enumerable()
|&gt; Flow.flat_map(&amp;String.split/1)
|&gt; Flow.partition()
|&gt; Flow.reduce(fn -&gt; %{} end, fn word, map -&gt;
Map.update(map, word, 1, &amp; &amp;1 + 1)
end)
|&gt; Enum.into(%{})
</code></pre>
<p>Don&#8217;t worry about the details of the example above for now. We will revisit it in future posts.</p>
<h2>Next steps</h2>
<p>In the next article, we will talk about lazy computations and async streams, which provide some useful background before jumping into <code>Flow</code>. If you would like to get a head start, <a href="https://www.youtube.com/watch?v=srtMWzyqdp8">watch my keynote about GenStage &amp; Flow at ElixirConf</a> and <a href="https://hexdocs.pm/flow/">read the excellent documentation of the Flow project</a>.</p>
<p><a href="http://plataformatec.com.br/elixir-radar?utm_source=our-blog&amp;utm_medium=referral&amp;utm_campaign=elixir-radar&amp;utm_content=elixir-radar-cta-blog-post-bottom"><br />
<img decoding="async" style="border: 0;" src="/wp-content/uploads/2015/05/elixir-radar-subscribe.png" alt="Subscribe to Elixir Radar"><br />
</a></p><p>The post <a href="/2017/03/small-data-with-elixir/">Small data with Elixir</a> first appeared on <a href="/">Plataformatec Blog</a>.</p>]]></content:encoded>
					
					<wfw:commentRss>/2017/03/small-data-with-elixir/feed/</wfw:commentRss>
			<slash:comments>6</slash:comments>
		
		
			</item>
	</channel>
</rss>
