It is very cool that you can write data analysis relatively concisely in Rust. I do want to point out that you’re really comparing apples and oranges here with your emphasis on how much faster your Rust code is. Spark is for ‘Big Data’, that is, data that is too large to work with effectively on one machine. Spark carries extra overhead for distributed processing that your code doesn’t have to deal with. I realize you’re just using it for illustrative purposes here, but normally I wouldn’t use Spark when crunching data on a single node.
Also, I’d be curious to see how fast a Scala version of your Spark code runs. I’m not sure if you have the time and/or bandwidth to throw one together, but it would be interesting, since it would indicate how much overhead PySpark adds on top of Spark itself.
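For reference, a Scala version might look something like the sketch below. This is only a rough outline, since your PySpark code isn’t reproduced in this comment: the input path, column names, and the group-by aggregation are placeholders standing in for whatever your version actually computes, and `local[*]` just mirrors running everything on a single machine.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ScalaSparkTiming {
  def main(args: Array[String]): Unit = {
    // Single-machine Spark, using all local cores — the same setup
    // a single-node PySpark run would use.
    val spark = SparkSession.builder()
      .appName("scala-spark-comparison")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input; substitute the dataset from the post.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/input.csv")

    val t0 = System.nanoTime()
    // A representative group-by/aggregate, standing in for the
    // actual analysis in the PySpark version.
    df.groupBy("some_key")
      .agg(avg("some_value").as("avg_value"))
      .collect()
    val elapsedMs = (System.nanoTime() - t0) / 1e6
    println(f"aggregation took $elapsedMs%.1f ms")

    spark.stop()
  }
}
```

Comparing that timing against the PySpark number (on the same data, same machine) would isolate the Python-to-JVM overhead from Spark’s own distributed-processing overhead.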