Brian Schlining
1 min read · Mar 19, 2019

Regarding PySpark vs. Scala Spark performance: they can perform the same in some, but not all, cases. I was just curious whether you would see a performance difference if you ran your code using Scala Spark. In general, programmers just have to be aware of some performance gotchas when using a language other than Scala with Spark. Here’s a link to a few benchmarks of different flavors of Spark programs.
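To make one of those gotchas concrete, here’s a minimal, hypothetical PySpark sketch (the column name, row count, and app name are made up for illustration): a plain Python UDF ships every row out to a Python worker process and back, while the equivalent built-in column expression stays entirely on the JVM and is usually much faster.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-gotcha").getOrCreate()

# A made-up numeric DataFrame; spark.range gives a single long "id" column.
df = spark.range(10_000_000).withColumnRenamed("id", "value")

# Slower path: the logic lives in a Python UDF, so every row is serialized
# out to a Python worker and the result serialized back to the JVM.
square_udf = F.udf(lambda x: float(x) * float(x), DoubleType())
slow = df.select(F.sum(square_udf("value")).alias("total"))

# Faster path: the same logic as a built-in column expression, which is
# planned and executed entirely on the JVM.
fast = df.select(F.sum(F.col("value") * F.col("value")).alias("total"))

slow.show()
fast.show()
```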

Regarding my data strategy, the answer is … it depends. For example, you’re working with CSV files, which is a very common, easy-to-use format. When that data is mostly numeric, simply transforming the files to a more efficient storage format, like NetCDF or Parquet, can provide huge memory savings. That alone could turn what, at first glance, appears to be multi-GB data into MBs of data.
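For the CSV-to-Parquet idea, here’s a rough sketch using pandas (the file names are hypothetical, and writing Parquet assumes pyarrow or fastparquet is installed); Parquet’s compressed, columnar layout is why mostly numeric data tends to shrink so much on disk.

```python
import os

import pandas as pd

csv_path = "measurements.csv"        # hypothetical input file
parquet_path = "measurements.parquet"

# Read the CSV and rewrite it as compressed, columnar Parquet.
df = pd.read_csv(csv_path)
df.to_parquet(parquet_path, compression="snappy")

# Compare the on-disk footprint of the two formats.
print(f"CSV size:     {os.path.getsize(csv_path) / 1e6:.1f} MB")
print(f"Parquet size: {os.path.getsize(parquet_path) / 1e6:.1f} MB")
```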

> The point I am trying to make is, for one-off aggregation and analysis like this on bigger data sets which can sit on a laptop comfortably, it’s faster to write simple iterative code than to wait for hours.

Yes, that’s a great summary of your article! I totally agree with your point.

> But I noticed it [Scala] to be orders of magnitude slower than Rust(around 3X).

Sorry to be pedantic, but one order of magnitude = 10¹ (i.e. 10x).

3x isn’t orders of magnitude :-)

Anyway, I enjoyed your article. Thanks for sharing it!
