count(), foreach, or save() with noop -- which one should you use? There are three common ways to benchmark Spark queries, and today I'll explain why noop is the best.
Spark is the default processing engine when we use Databricks, and it can be used from both Scala and Python. Some behaviors are the same in every language, while others differ slightly.
What we have to take into account is that Spark is lazily evaluated: no real computation happens until an action is triggered.
In the example below:
df = spark.table('my_db.my_table')
Nothing really happens at this point. You need to trigger an action like:
df = spark.table('my_db.my_table')
df.count()
Three actions are commonly used when benchmarking Spark queries:
count()
foreach() with an empty lambda
save() with the "noop" format option
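In code, assuming df already holds the DataFrame whose query you want to benchmark (the variable name is just for illustration), the three options look roughly like this:

# 1. Count the rows
df.count()

# 2. Apply an empty lambda to every row
df.foreach(lambda row: None)

# 3. "Write" the DataFrame using the no-op data source
df.write.format("noop").mode("overwrite").save()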
Which one is the best? That’s what we'll try to answer while exploring the inner workings of Spark in Databricks. To compare the three, I will be using the Databricks Spark UI Simulator, a collection of experiment snapshots provided by Databricks for a number of use cases.
This notebook includes experiments for all three of our benchmark actions. Let’s review the pertinent highlights of each test and explore why one stands out as better than the others.
Go to the Spark UI tab in the notebook and click the first job.
You'll see this image:
The red bar is the environment setup that happens the first time you run code. So when benchmarking on a Databricks environment, it is important to do a dry run before starting the actual benchmark.
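In practice, the dry run can be as simple as triggering a small throwaway action before you time anything. A minimal sketch, assuming a table named my_db.my_table and plain wall-clock timing (both are my own illustration, not part of the simulator):

import time

# Dry run: pay the one-off environment setup cost before measuring anything
spark.table('my_db.my_table').limit(1).count()

# Only time the runs that come after the warm-up
start = time.time()
spark.table('my_db.my_table').count()
print(f"count() took {time.time() - start:.1f} seconds")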
Now, let’s look at cells 4 and 5, which show a count action performed in both Python and Scala:
The Spark job reads a parquet file from the path stored in the trxPath variable using read.parquet and performs a count() action right after. The command takes 13 seconds in both Python and Scala.
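In PySpark, that test boils down to something like this (trxPath is the notebook’s variable holding the parquet path; its exact value isn’t reproduced here):

# Read the transactions parquet file and force evaluation with count()
df = spark.read.parquet(trxPath)
df.count()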
This is quite fast, but does it meet our needs? If we examine the Spark UI, we can see how much data was read:
The amount of data read is tiny. The reason is that count() can be answered from the parquet metadata alone, without scanning the rows. But we want Spark to read every single row, so we can discard count() as a proper action for benchmarking a query.
Now, let’s move forward and see how foreach performs. In both languages, the code applies an empty lambda to every row to establish a benchmark.
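In PySpark the test looks roughly like the sketch below, reusing the trxPath variable from before; the Scala version is analogous, with df.foreach(_ => ()) as the empty function:

# Force evaluation by applying a do-nothing function to every row
df = spark.read.parquet(trxPath)
df.foreach(lambda row: None)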
This time the Spark UI shows that we are reading the entire data:
However, there is a huge difference in execution time between Scala and Python: 13 minutes for Scala versus two and a half hours for Python.
Why is this happening? In PySpark, every row has to be serialized and shipped from the JVM to a Python worker process so the lambda can be evaluated there, which consumes time and additional resources; in Scala, the empty closure runs natively inside the JVM.
If you are using Scala, the empty foreach is a good solution, but avoid it if you are using PySpark.
But we want a solution that can be used in both languages, without any significant differences.
Let’s try our last option, the noop save. We use it by specifying “noop” as the format when writing our Spark DataFrame. This option won’t save any records, but it forces Spark to perform all the computations it would do when writing a real file. Remember that since Spark is lazy, it only performs computations when an action is triggered.
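Assuming the same DataFrame as in the previous tests, the noop benchmark looks like this:

# Run the full query plan without writing any output anywhere
# mode("overwrite") is typically required, even though nothing is actually overwritten
df = spark.read.parquet(trxPath)
df.write.format("noop").mode("overwrite").save()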
This time, the action has the same performance for both languages.
And the Spark UI shows us that the whole file was read:
Given the test scenarios above, we can conclude that writing with the "noop" format is the best trigger action for benchmarking a Spark query consistently across both Scala and Python.
So next time you need to benchmark a Spark query, use noop!
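If you benchmark queries often, it can help to wrap the pattern in a small helper. Here is a minimal sketch; the function name and the timing approach are my own illustration, not part of the Databricks notebook:

import time

def benchmark(df, label="query"):
    # Force full evaluation of df with a noop write and report the wall-clock time
    start = time.time()
    df.write.format("noop").mode("overwrite").save()
    print(f"{label}: {time.time() - start:.1f} seconds")

# Example usage (the table name is illustrative)
benchmark(spark.table('my_db.my_table'), label="full table scan")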