Scala · Spark

A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets

I explore three sets of APIs (RDDs, DataFrames, and Datasets) available in a pre-release preview of Apache Spark 2.0: why and when you should use each set, what performance and optimization benefits each offers, and in which scenarios you should use DataFrames and Datasets instead of RDDs. Mostly, I will focus on DataFrames and Datasets, because in Apache…