1) Spark & Factorio
As I searched for an interesting way to discuss data engineering, I stumbled upon the impressive game Factorio. After playing it for a while, I noticed that many of the game’s concepts resemble those used in Apache Spark.
My goal is to introduce you to the essential concepts of Spark while making the learning process enjoyable. Specifically, I will cover Spark’s computing capabilities, as well as one of the most common issues I have faced when working with Spark: data distribution.
2) What makes Factorio brilliant for engineers?
Why choose Factorio? Well, Factorio is a unique strategy and survival game where the main objective is to build a factory. In my opinion, it is probably the second-best thing that the Czech Republic has given us, coming in just behind Kofola. Some would argue that fried cheese should be put into contention too.
What I particularly appreciate about Factorio is its conceptual similarity to data engineering. The process of discovering and extracting ores, smelting them into plates, and using those plates to create gears mirrors the building blocks of a data pipeline.
The best part of Factorio, in contrast to real-world data engineering work, is that everything just works. There are no ‘outside circumstances’ that slow you down, aside from the original inhabitants of the game world – the bugs. You won’t have to deal with services that are unavailable or be told that you need to sit through five meetings before gaining access to an API. Factorio scratches all the right places in an engineer’s mind.
Apache Spark is a fast and flexible open-source data processing engine that can handle large-scale data processing tasks. It provides high-level APIs in Java, Scala, Python, and R, and supports batch processing, streaming, machine learning, and graph processing.
Without further ado, let’s start with the basics and work our way forward.
3) The basics: data, partitions, cores, nodes
We will be using four essential concepts down the road, so let’s start by introducing the first and most basic one: the data. In the real world this could be just about anything, but for the purposes of Factorio, data will be obfuscated as … Iron Plates.
How are iron plates made in Factorio? It’s really simple: you mine the iron ore and smelt it in furnaces. As easy as it sounds.
Since Apache Spark is suited for big data workloads, we need to partition our data into smaller parts and then process them in parallel.
As partitions, we will be using two different figures: red and yellow transport belts. They have different capacities, which makes them perfect for showcasing the speed of data.
We’ve got the data ‘transportation’ covered; now, how do we process it? For this we will be using two different types of objects: assembling machines. Say hello to Cores. First, the basic assembling machine:
And the advanced assembling machine: a faster core.
Big data = a lot of computing = a lot of cores. Nodes are a way of organizing computing devices to work together in a coordinated way. In this particular case, we will be using nodes containing 4 cores each, whether built from basic or advanced assembling machines.
4) The problem: uneven iron distribution in the piping department
Uneven distribution of data in Apache Spark refers to a common scenario where the data is not spread evenly among partitions (transport belts, in this case). This can occur for various reasons, such as a non-uniform key distribution, data skew, or non-uniform record sizes.
When the data is unevenly distributed, it can lead to performance issues: some partitions become heavily loaded while others remain underutilized. This can result in slow processing times, increased resource usage, and even out-of-memory errors.
Let’s see how this problem can be showcased in Factorio, using previously mentioned elements:
Let’s suppose we are tasked with creating a highly functioning, big data workflow. This time I will show you a simple example: how to effectively turn iron plates into pipes, or in other words, how to compute the data in an efficient way.
Here is an example of how in an ideal scenario this might look, with data evenly distributed across all the partitions.
As you can clearly see, due to the fact that workload is evenly distributed, even the basic architecture of 4 nodes, with basic cores can handle it without any overhead, and without much idle time.
Let’s see how the same workload would be doing, but this time we will be looking at uneven distribution of data.
This time, our architecture is simply not fit for the task. Most of the cores in the top node are idle, while the bottom node is greatly overworked and the materials stack up at the end of the belt. This gives us poor performance, an overwhelming waste of resources, and vastly prolonged execution time. Not a good thing to see!
How could we resolve the issue? Below you will find some common ideas that tackle the problem:
A) Upgrading the cores:
B) Using faster means of transportation, to mitigate the problems with the topmost nodes:
C) Using faster means of transportation & combining that with faster cores
As you can clearly see, none of the above works. In real-life big data workflows, this would be like applying a band-aid to an open wound. Of course, we could just upgrade the AWS machine, see a decrease in execution time, and call it a day, but that would only be painting the grass green; in other words, a temporary solution.
5) The solution: partition, repartition
Small changes, such as modifying the partitioning column, can significantly improve the distribution of data. It’s alright if the result isn’t perfect, because partitioning rarely is. Sometimes good enough and quick is sufficient.
Repartition: this function can be used to shuffle and redistribute the data evenly across partitions, based on a specific key or column.
Thank you for taking the time to read my article. Your interest means a lot to me, and I really appreciate it. I hope you had as much fun reading this as I had writing it. Thanks again for your time and attention! And always remember, THE FACTORY MUST GROW!