Encoding in Spark vs. Pandas
I recently came across several interoperability issues between Spark and Pandas data frames. In this case, it was a problem with One Hot Encoding of a categorical feature vector. The data was large enough to dictate the use of Spark for data manipulation and feature engineering but the model expected pandas data frames. However, the one hot encoding implementation in Spark creates a dense representation of the one hot encoded vector whereas the model expected it the way Pandas implements it, i.e. sparse representation. I found a round-about way of solving this problem and if others have a more elegant solution, I would be happy to hear about it. I use the Kaggle Data Science survey data to illustrate the problem and the solution.
Sparse representation
This is quite straightforward in Pandas as shown below and it produces a set of one hot encoded columns with one element set to 1 and the others as 0.
[One Hot Encoding in Pandas — full code on GitHub]
Dense representation
The sparse representation is known to be a wasteful way of storing information and Spark takes a different approach. The way to use one hot encoding in Spark is to first use a StringIndexer that converts the categorical variable into numeric fields and then OneHotEncoder uses these numeric values to create dense representations of the category. I made use of the Pipeline component that Spark provides in order to chain these two operations.
[One Hot Encoding in Spark — full code on GitHub]
Now I was faced with the task of converting this Dense representation into the Sparse representation produced by Pandas so that the model continues to function without any changes. That means that the single column in the Sparse data frame needs to be converted into multiple columns of 1 and 0 values.
I could not find an in-built function in Spark or Pandas to support this and had to resort to converting the dense representation into a list of values and using them to create a new pandas data frame.
[Converting Spark Dense representation to Pandas Sparse representation — full code on GitHub]
This creates the resultant pandas data frame with the one hot encoded vector in the format that Pandas natively produces and can be used by the model. You can find the full code listing on my GitHub repo.
I also noticed some issues with how Spark and Pandas treat NaN and null values which caused some headaches for us when moving back and forth between them. By creating some guidelines on the use of Spark data frames and Pandas data frames, identifying specific use cases and how to inter-operate between them would help to avoid complex data issues.