Spark provides many connectors for loading data from a variety of formats. Whether it is CSV, JSON, or Parquet, you can use the magic of “spark.read”.
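For instance, reading a JSON file is a one-liner (the file path here is just a placeholder, and an active SparkSession named spark is assumed):
df = spark.read.json("path/to/data.json")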
However, there are times when you would like to create a DataFrame dynamically in code. One use case I was presented with was creating a DataFrame out of a very twisted incoming JSON payload from an API. So, I decided to parse the JSON manually and build the DataFrame myself.
The approach we are going to use is to build a list of structured Row objects, and we are using PySpark for the task. The steps are as follows:
1. Define the custom Row class
from pyspark.sql import Row
personRow = Row("name", "age")
2. Create an empty list to populate later
community = []
3. Create Row objects with the specific data in them. In my case, the data comes from the response returned by the API call.
qr = personRow(name, age)  # name and age are extracted from the API response
4. Append the Row objects to the list. In our program, a loop appends multiple Row objects to the list, as sketched below.
community.append(qr)
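For example, assuming the API response has already been parsed into a list of dicts (the response variable below is illustrative), the loop could look like this:
for item in response:  # each item is one record from the parsed API response
    community.append(personRow(item["name"], item["age"]))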
5. Define the schema using StructType
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

person_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
6. Create the DataFrame using createDataFrame. The two required parameters are the data and the schema to be applied.
communityDF = spark.createDataFrame(community, person_schema)
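Putting it all together, here is a minimal, self-contained sketch. The sample records stand in for a real API response, and the SparkSession setup is the usual boilerplate:
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("dynamic-dataframe").getOrCreate()

# Hypothetical records standing in for a parsed API response
response = [{"name": "Asha", "age": 34}, {"name": "Ravi", "age": 28}]

personRow = Row("name", "age")  # custom Row class
community = []                  # list to collect Row objects
for item in response:
    community.append(personRow(item["name"], item["age"]))

person_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

communityDF = spark.createDataFrame(community, person_schema)
communityDF.show()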
Using the steps above, we can dynamically create a DataFrame for use in Spark applications.