Regarding PySpark vs Scala Spark performance: they can perform the same in some, but not all, cases. I was just curious whether you'd see a performance difference if you ran your code with Scala Spark. In general, programmers just have to be aware of a few performance gotchas when using a language other than Scala with Spark. Here’s a link to a few benchmarks of different flavors of Spark programs.
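To make the classic gotcha concrete, here's a rough sketch (the toy DataFrame and column name are made up for illustration): a Python UDF ships rows out of the JVM to Python workers and back, while the equivalent built-in expression stays inside the engine, where PySpark and Scala Spark perform essentially the same.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-gotcha").getOrCreate()

# Hypothetical toy data, just to show the two code paths.
df = spark.createDataFrame([(1.0,), (2.0,), (4.0,)], ["value"])

# Gotcha: a Python UDF serializes every row to a Python worker and back,
# which can dominate runtime on large data.
slow_double = F.udf(lambda x: x * 2.0, DoubleType())
df.withColumn("doubled", slow_double("value")).show()

# The equivalent built-in expression is evaluated entirely in the JVM,
# so there's no Python round-trip penalty.
df.withColumn("doubled", F.col("value") * 2.0).show()
```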
Regarding my data strategy, the answer is … it depends. For example, you’re working with CSV files, which is a very common, easy-to-use file type. In a case where that data is mostly numeric, simply transforming the files to a more efficient storage type, like NetCDF or Parquet, can provide huge memory savings. That alone could transform what, at first glance, appears to be multi-GB data into MB of data.
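As a minimal sketch of that conversion in PySpark (the file names here are made up for illustration), it's just a read followed by a write:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the CSV with schema inference so numeric columns get proper types.
df = spark.read.csv("measurements.csv", header=True, inferSchema=True)

# Write as Parquet: columnar, typed, and compressed, so mostly-numeric
# data usually shrinks dramatically on disk.
df.write.mode("overwrite").parquet("measurements.parquet")

spark.stop()
```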
> The point I am trying to make is, for one-off aggregation and analysis like this on bigger data sets which can sit on a laptop comfortably, it’s faster to write simple iterative code than to wait for hours.
Yes, that’s a great summary of your article! I totally agree with your point.
> But I noticed it [Scala] to be orders of magnitude slower than Rust(around 3X).
Sorry to be pedantic, but one order of magnitude = 10¹ (i.e. 10x).
3x isn’t even one order of magnitude, let alone several :-)
Anyway, I enjoyed your article. Thanks for sharing it!