[EN] Spark my fire – an introduction to data engineering in Apache Spark

In this post I’ll recap my lecture at the 17th SFI IT Academic Festival in Cracow, given on 15/03/2022.

On stage.


Once upon a time, I worked with a very clever man. He used to tell the same joke every time someone in the office mentioned the term ‘big data’, a little too often for my liking. It went: ‘Hadoop? Hadoop is a joke in this space’. You know what is not a joke? Apache Spark is not a joke.

What is Big Data & Data Engineering all about?

I’d like to start this chapter with a question for all of you: ‘What is Big Data?’

There is no perfect answer and, I assure you, no wrong answer either.
Some may say:
– it’s data that can’t be processed in an acceptable time
– it’s data whose size exceeds our machine’s HDD or SSD
– it’s data that changes too frequently

And so on, and so forth.
What’s the most common answer out there in the business world?
It’s the data that cannot fit into Excel.
Jokes aside, big data is currently described by six characteristics. Their number has grown over time: the concept started with three, then came four, five was the standard a couple of years ago, and six is the state of the art today. In the future there might be more.

– Volume – the quantity of data, usually in the range from a couple of terabytes to petabytes
– Variety – the type and nature of the data, be it structured, semi-structured or unstructured
– Velocity – the speed at which the data is generated and processed
– Veracity – the truthfulness or reliability of the data, or in other words the quality of the data
– Value – the worth of the information we process and store
– Variability – changing formats, changing structure, changing sources

That’s a lot of V’s, and no, that is not a Cyberpunk reference.

With the first big question of the chapter behind us, let’s focus on the ‘bright side of life’, pardon me, on the right side of the ampersand: ‘What is Data Engineering all about?’

This time the answer might be a little more precise, and far easier. Engineers of all specialisations generally design and build things, be they civil engineers, chemical engineers or software engineers. What matters is that the general principle stays the same, no matter the field.

Translating that principle to the field of data: Data Engineers design and build pipelines. They transform and transport data from various sources to end users. Sounds simple, right? In fact, it is a bit more complicated, but my goal today is to keep things as introductory and as simple as they can possibly get.

Apache Spark

Apache Spark is an open source analytics engine for big data processing. In my understanding and experience, Spark works as an interface between us, our data and computing power.
It was originally developed at UC Berkeley (University of California, Berkeley). Some time later the codebase was donated to the Apache Software Foundation, and it has remained under Apache ever since, which is part of why it is so popular in the open source world. It is written in Scala.

I’d like to give you a quick overview of its general concepts and how they are used in real life. There are multiple ways in which you can work with data in Spark. Let me start with the first major one, the ‘RDD’, a feature I cannot imagine Spark without.

Resilient Distributed Datasets (RDDs) are the primary way of working with data in Spark and its fundamental abstraction. An RDD is a distributed, immutable collection of elements, partitioned across the nodes in your cluster, which can then be operated on in parallel via a low-level API. Its most important features (not all of them) are:

– Immutable, Read-Only
– Lazily Evaluated
– Fault Tolerance & Lineage
– Partitioning
– Parallelism
– Location stickiness
– In-memory computation

What are RDDs useful for?
– they offer a lot of control & flexibility: an RDD gives you a lot of control over what actually happens to the data
– it’s the lowest-level API
– it’s a ‘how to do it’ approach: you are given control, and you tell Spark how to do something yourself
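
Three of the features listed above (immutability, lazy evaluation and lineage) are easier to grasp with something runnable. What follows is only a toy model in plain Python, running on one machine: `ToyRDD` and its methods are hypothetical names made up for this post, not Spark's API, and nothing here is distributed. The point it tries to show is that transformations only record steps, and the work happens when an action such as `collect` is called.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class ToyRDD:
    """Toy single-machine stand-in for an RDD (hypothetical, not Spark's API)."""
    source: Tuple[int, ...]                          # the original input data
    lineage: Tuple[Tuple[str, Callable], ...] = ()   # recorded steps, not yet executed

    def map(self, fn: Callable) -> "ToyRDD":
        # Transformation: returns a NEW object that only remembers the step.
        return ToyRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, fn: Callable) -> "ToyRDD":
        # Transformation: same idea, nothing is computed here.
        return ToyRDD(self.source, self.lineage + (("filter", fn),))

    def collect(self) -> list:
        # Action: only now is the recorded lineage replayed over the source.
        data: Iterable = self.source
        for kind, fn in self.lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []  # records every element the map function actually touches


def double(x):
    calls.append(x)
    return x * 2


numbers = ToyRDD((1, 2, 3, 4))
doubled = numbers.map(double)
big = doubled.filter(lambda x: x > 4)

calls_before_action = list(calls)  # still empty: nothing has been computed yet
result = big.collect()             # the action triggers the whole chain
print(calls_before_action, result)
```

Because `big` keeps its full lineage, calling `big.collect()` again simply recomputes the result from the source. That is, in miniature, the idea behind lineage-based fault tolerance: a lost piece of data can be rebuilt by replaying the recorded steps.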

With RDDs behind us, let’s talk about the second major way of interacting with data in Spark: DataFrames. DataFrames are a sort of extension of RDDs; they share multiple similarities, but are not the same thing. Please think of a DataFrame as a regular table with a defined schema, so every column has a name and a type associated with it. The main advantage of writing DataFrames as opposed to RDDs is the clarity of the queries. It’s really easy to write the code there. You are telling Spark what to do, just not exactly how to do it. Spark will figure that out on its own.

DataFrames are backed by the Spark SQL engine, which is why your query gets well optimised, but you lose the control. As opposed to the RDD’s ‘how to do it’, this approach is more ‘what to do’. Another nice thing: you can write a SQL query against a DataFrame instead of chaining functional methods.

What are they useful for?

– great for structured data
– high level APIs
– typed columns thanks to the schema (full compile-time type safety comes with the typed Dataset API in Scala and Java)
– easy to use, easy to write, easy to read
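
The ‘how to do it’ versus ‘what to do’ contrast is not specific to Spark, so here is a minimal sketch of it using only Python's standard library. Please note the assumptions: `sqlite3` merely stands in for a SQL engine, the `sales` table and its rows are made up, and none of this is Spark code. The first half spells out every step of an aggregation by hand; the second half only describes the result and leaves the plan to the engine.

```python
import sqlite3

# Hypothetical sample data: (city, amount) rows with a fixed, named schema.
rows = [("Cracow", 10), ("Warsaw", 5), ("Cracow", 7), ("Gdansk", 3), ("Warsaw", 1)]

# "How to do it": we write out every step of the aggregation ourselves.
totals_how = {}
for city, amount in rows:
    totals_how[city] = totals_how.get(city, 0) + amount

# "What to do": we describe the result and let the engine choose the plan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
totals_what = dict(
    conn.execute("SELECT city, SUM(amount) FROM sales GROUP BY city").fetchall()
)
conn.close()

print(totals_how == totals_what)
```

In PySpark the declarative half would typically be a DataFrame registered with `createOrReplaceTempView` and queried through `spark.sql`, with the Spark SQL engine doing the planning.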

Comparison to Hadoop & Myth busting

For those of you who do not know what Hadoop is: it is an open source framework for distributed processing of large data sets using the MapReduce model, and the previous hottest framework for big data processing. In order to grasp the differences between the two, we first need to distinguish two terms: parallel computing and distributed computing. Parallel computing is the simultaneous use of more than one processor to solve a problem, while distributed computing is the simultaneous use of more than one computer to solve a problem.
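
To make the MapReduce model a little more concrete, here is a minimal word count sketch in plain Python. It is an illustration under loud assumptions: the function names and the sample `chunks` are made up, and everything runs in a single process, whereas in real Hadoop the map and reduce tasks would be spread over many machines.

```python
from collections import defaultdict

# Hypothetical input: each string stands for a chunk of a file stored on a different machine.
chunks = ["spark is fast", "hadoop is a framework", "spark is not a joke"]


def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in one chunk.
    return [(word, 1) for word in chunk.split()]


def shuffle_phase(mapped_chunks):
    # Shuffle: bring all values that share a key together.
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    # Reduce: collapse the values of each key into a single result.
    return {word: sum(counts) for word, counts in grouped.items()}


word_counts = reduce_phase(shuffle_phase(map_phase(chunk) for chunk in chunks))
print(word_counts)
```

Each `map_phase` call only needs its own chunk, which is what lets the model spread across computers; the shuffle is the step where data has to travel between them.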

The most notable and most important difference between Hadoop and Spark is … speed. The general principle is that operations in RAM are way faster than reading from and writing to a disk. Spark keeps intermediate data in memory where it can, while Hadoop MapReduce writes it to disk between steps, which is why Spark is much faster. The commonly quoted figure is up to about 100x, which I think all of you would agree is significant.

The general assumption would be that, compared with Hadoop, Spark is faster, less secure, more expensive (as memory is considerably more expensive than disk), more challenging to set up and way more challenging to set up correctly, better suited for machine learning, and more challenging to scale. I’ve been in situations in which simply scaling the AWS clusters up did little to nothing for what it cost. All of the above is overshadowed, though, by the sheer speed. You may ask: what if our dataset won’t fit in memory? Well, Spark is still faster, with around 10x being the usually quoted figure.

Performance tuning, quirks

Partition tuning is important. There are two main issues with partitioning your data. As in most cases in IT, you can end up with either too many or too few.
Let’s take a look at the first main issue. With too few partitions:

  • you have less concurrency
  • you are more prone to data skew
  • operations like groupBy or reduceByKey put more pressure on memory

The second main issue, with too many partitions:

  • you have great concurrency, but it takes way longer to schedule a task than to execute it.

What you need is a reasonable number of partitions. The general assumption is that you need somewhere between 100 and 10 000 partitions. As a lower bound, please use at least 2x the number of cores in the cluster. As an upper bound, please take a look at your Spark UI and check that tasks run for at least 100-200 ms; if they are shorter, you probably have too many partitions.
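
The two rules of thumb above can be written down as a small helper. This is only a sketch in plain Python: the function names are hypothetical, the task durations are assumed to be copied by hand from the Spark UI, and judging by the median task (rather than, say, the shortest one) is my own simplification.

```python
from statistics import median


def suggest_min_partitions(total_cores: int, factor: int = 2) -> int:
    """Lower bound from the rule of thumb: at least 2x the cores in the cluster."""
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return factor * total_cores


def tasks_too_short(task_durations_ms, threshold_ms: float = 100.0) -> bool:
    """Upper-bound check: True if the median task runs for less than the threshold."""
    durations = list(task_durations_ms)
    if not durations:
        raise ValueError("need at least one task duration")
    return median(durations) < threshold_ms


# Hypothetical cluster: 4 executors with 4 cores each.
min_partitions = suggest_min_partitions(4 * 4)

# Hypothetical task durations (in ms) read off the Spark UI for two different jobs.
too_fine = tasks_too_short([20, 35, 40, 30])  # tiny tasks: scheduling dominates
healthy = tasks_too_short([150, 220, 180])    # tasks are long enough
print(min_partitions, too_fine, healthy)
```

If the check reports tasks that are too short, lowering the number of partitions is the first thing I would try.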

Three key points when tuning Spark performance:

  • ensure enough partitions for concurrency, e.g. with .repartition()
  • minimize memory consumption
  • minimize data shuffle
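
As for the last point, one common way to shuffle less is to combine values per key inside each partition before anything crosses the network. This is what Spark's reduceByKey does and groupByKey does not. The sketch below is a plain Python toy, not Spark code: the two `partitions` and all function names are made up, and "shuffling" is simulated by just counting the records that would have to be sent.

```python
from collections import Counter

# Hypothetical data: two partitions of (key, value) records living on two different nodes.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1), ("c", 1)],
]


def shuffle_everything(parts):
    # groupByKey-style: every single record is sent across the network.
    return [record for part in parts for record in part]


def combine_then_shuffle(parts):
    # reduceByKey-style: sum per key inside each partition first,
    # then send only one record per key per partition.
    shuffled = []
    for part in parts:
        local = Counter()
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    return shuffled


def final_sums(records):
    # The reduce side: add up whatever arrived for each key.
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return dict(totals)


naive = shuffle_everything(partitions)
combined = combine_then_shuffle(partitions)
print(len(naive), len(combined), final_sums(naive) == final_sums(combined))
```

Both variants end with the same sums per key, but the pre-combined one moves fewer records, and the gap grows with the number of repeated keys per partition.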