Deriving Spark Encoders and Schemas Using Implicits

Since Apache Spark 2.0 the Dataset API is the preferred way of programming over low level RDDs. However when migrating complex business entities from RDDs to Datasets, a handful of problems arise. One is the lack of support for user defined types, confining the developer to a predefined set of types and severely hindering the usefulness of Datasets. The other one is the inferior type safety of DataFrame operations compared to RDDs.

In this blogpost I describe how our team at Ekata resolved these problems while migrating our ETL pipeline from RDDs to DataFrames. We did this without changing our object-oriented domain entities, writing schema description, breaking binary compatibility with our existing serialized format, or needlessly degrading the optimizations coming with DataFrames. With the help of the code generation capabilities of the Scala compiler and a set open source type level libraries, we derived schema for our business entities, generated Dataset encoders and had fun along the way.

Continue reading »