Constructing a PySpark DataFrame Dynamically

Spark provides a lot of connectors to load data from various formats. Whether it is a CSV or JSON or Parquet you can use the magic of “spark.read”.

However, there are times when you would like to create a DataFrame dynamically using code. The one use case that I was presented with was to create a dataframe out of a very twisted incoming JSON from an API. So, I decided to parse the JSON manually and create a dataframe.

The approach we are going to use is to create a list of structured Row types and we are using PySpark for the task. The steps are as follows:

  1. Define the custom row class
personRow = Row("name","age")

2. Create an empty list to populate later

community = []

3. Create row objects with the specific data in them. In my case, this data is coming from the response that we get from calling the API.

qr = personRow(name, age)

4. Append the row objects to the list. In our program, we are using a loop to append multiple Row objects to the list.

community.append(qr)

5. Define the schema using StructType

person_schema = StructType([ \
                              StructField("name", StringType(), True), \
                              StructField("age", IntegerType(), True), \

                    ])

6. Create the dataframe using createDataFrame. The two parameters required is the data and schema to applied.

communityDF = spark.createDataFrame(community, person_schema)

Using the steps above, we are able to create a dataframe for use in Spark applications dynamically.

Copy Data in Azure Databricks Table from one region to another

One of our customers had a requirement of copying data that was locked in an Azure Databricks Table in a specific region (let’s say this is eastus region). The tables were NOT configured as Delta tables in the originating region and a subset of personnel had access to both the regions.

However, the analysts were using another region (let’s say this is westus region) as it was properly configured with appropriate permissions. The requirement was to copy the Azure Databricks Table from eastus region to westus region. After a little exploration, we couldn’t find a direct/ simple solution to copy data from one Databricks region to another.

One of the first thoughts that we had was to use Azure Data Factory with the Databricks Delta connector. This would be the simplest as we would simply need a Copy Data activity in the pipeline with two linked services. The source would be a Delta Lake linked service to eastus tables and the sink would be another Delta Lake linked service to westus table. This solution faced two practical issues:

  1. The source table was not a Delta table. This prevented the use of Delta Lake linked service as source.
  2. When sink for copy activity is a not a blob or ADLS, it requires us to use a staging storage blob. While we were able to link a staging storage blob, the connection could not be established due to authentication errors during execution. The pipeline error looked like the following:
Operation on target moveBlobToADB failed: ErrorCode=AzureDatabricksCommandError,Hit an error when running the command in Azure Databricks. Error details: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Unable to access container adf-staging in account xxx.blob.core.windows.net using anonymous credentials, and no credentials found for them in the configuration. Caused by: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Unable to access container adf-staging in account xxx.blob.core.windows.net using anonymous credentials, and no credentials found for them in the configuration. Caused by: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: Public access is not permitted on this storage account..

On digging deeper, this requirement is documented as a prerequisite. We wanted to use the Access Key method but it requires the keys to be added into Azure Databricks cluster configuration. We didn’t have access to modify the ADB cluster configuration.

To work with this, we found the following alternatives:

  1. Use Databricks notebook to read data from non-Delta Tables in Databricks. The data can be stored in a staging Blob Storage.
  2. To upload the data into the destination table, we will again need to use a Databricks notebook as we are not able to modify the cluster configuration.

Here is the solution we came up with:

In this architecture, the ADB 1 notebook is reading data from Databricks Table A.1 and storing it in the staging blob storage in the parquet format. Only specific users are allowed to access eastus data tables, so the notebook has to be run in their account. The linked service configuration of the Azure Databricks notebook requires us to manually specify: workspace URL, cluster ID and personal access token. All the data transfer is in the same region so no bandwidth charges accrue.

Next the Databricks ADB 2 notebook is accesses the parquet file in the blob storage and loads the data in the Databricks Delta Table A.2.

The above sequence is managed by the Azure Data Factory and we are using Run ID as filenames (declared as parameters) on the storage account. This pipeline is configured to be run daily.

The daily run of the pipeline would lead to a lot of data in the Azure Storage blob as we don’t have any step that cleans up the staging files. We have used the Azure Storage Blob lifecycle management to delete all files not modified for 15 days to be deleted automatically.

Handling a peculiar DATETIME issue in Telerik Reporting

We got a request from our customer wanting to use a CSV file as Data Source in Telerik Report Designer to generate the report. They were unable to parse DateTime column to DateTime data type in Telerik Report Designer because the CSV returns the column with double quotes for example (“4/14/2021 12:42:25PM”).

Values can be easily cast to DateTime in Telerik Reporting during the datasource definition. The issue here was the quotes that came in the datasource.

In this blog post, we will explore how we can work with this scenario using Expressions provided by Telerik Report Designer..

First, we need to Add Data Source so click on Data in Menu then select the CSV Data Source and select and then click on Next button as below :

check the checkbox if CSV file has headers then click on NEXT button.

Now we the screen appear for Map columns to type and you can see the StartTime column below with double quotes:

Now we try to change the type of StartTime column string to DateTime. Now need to provide Date format in our case like yyyy-mm-dd hh:ss but in our case it will not work and the column will show blank.

Cause of blank again we need the change DateTime to String and Click on NEXT Button and Finish.

Now to the Data Source is connected with the report. the customer want the report with the parameter of data. So we need to convert StartTime(string) to StartTime(DateTime) using expression.

First we need to remove double quotes from StartTime using Replace function.

//Syntax of Replace function
=Replace(text, old substirng, new substring)

In a below code we take a text from Fields.StartTime and provide old substring that to be remove is double quotes in single quotes. and new substring is blank. 

=Replace(Fields.StartTime,'"',"")

We got a StartTime in string without double quotes. so now we need to convert String to DateTime using CDate(value) Funciton. In a CDate function we provide the above Replace function that returning the StartTime without double qoutes.

= CDate(Replace(Fields.StartTime,'"',""))

Now we got the StartTime with the type of DateTime. and the above Expression we can use any where we want like filter the data and making parameter range with StartTime.

How to create Data Model of SQL Azure database using Telerik Open Access ORM

In this post we will take a look on creating data model using Open Access ORM from a data base in SQL Azure.

image

To create data model add new item to project and select Telerik Open Access Domain Model from Data tab.

clip_image002

Next select option Populate from database and from drop down choose Microsoft SQL Azure. If you want you can change the model name and then click on the Next.

clip_image004

Telerik Open Access ORM does not provide you to create a connection in case of SQL Azure. You will have to manually copy and paste connection string of SQL Azure database to create Domain Model. You can find connection string of SQL Azure database in Windows Azure portal.

clip_image006

In the quick glance click on the Show Connection strings to see the connection string. Choose ADO.NET connection string. Do not forget to change the password with the real password.

clip_image007

Copy this connection string of ADO.NET and paste it to Set Connection Manually section and click on the Next.

clip_image009

Next you need to choose Schemas, Tables, Views, Stored Procedure to create data model. Let us say we are choosing only one table to create data model. After choosing items click on the next.

clip_image011

Now you can define naming rules of Classes, Fields and Properties if required. Let us leave to default and click on next to proceed further. In last step you can configure Code Generation Settings. Let us leave settings to default and click on Finish to generate data model.

Once Data Model is created you will find *.rlinq file added in the project. Click on *.rlinq file to view the data model.

clip_image012

In this way you can create a data model using Telerik Open Access ORM from a SQL Azure database. I hope you find this post useful. Thanks for reading.

Step by Step Creating WCF Data Service using Telerik Open Access ORM

In this post we will learn step by step to create WCF Data Service using Telerik Open Access ORM. We will create data model of data from SQL Server and further exposed that as WCF Data Service using Telerik Open Access ORM.

Learn more about Open Access ORM here

We are going to follow step by step approach. To create WCF Data Service follow the steps given below,

Step 1: Create Telerik Web Access Web Application project

After installation of Telerik Open Access ORM you will get Telerik Web Access Web Application project template while creating a new project. Go ahead and create a project selecting Telerik Web Access Web Application project template from Web group

image

Step 2: Create Domain Model

Next you will be prompted to create Domain Model. Open Access ORM allows you to create domain model from any of the following type of database. If required you can create Empty Domain Model as well.

image

Let us go ahead and create data model from Microsoft SQL Server. Choose Microsoft SQL Server from drop down and click on the Next.

Step 3: Setup Database connection

In this step you need to set up Database connection. Either you can provide Connection String manually or can create a New Connection. To create new connection click on Add New Connection. You will get window to add new connection. In this window provide database server name and choose database from the dropdown. To test the connection click on Test Connection and after successful testing click on Ok to add a new connection.

image

Step 4: Setup Database Connection

In this step you need to setup database connection. If you want you can change connection string name. Let us leave it to default and click on the Next.

Step 5: Choose Database Items

In this step you need to choose

  • Schemas
  • Tables
  • Views
  • Stored Procedure to create data model.

Let us say we are choosing only one table to create data model. After choosing items click on the next .

image

 

Step 6: Define Naming Rules

In this step you can define naming rules of Classes, Fields and Properties if required. Let us leave to default and click on next to proceed further

Step 7: Code Generation Settings

In this step you can configure Code Generation Settings. Let us leave settings to default and click on Finish to generate data model.

Once Data Model is created you will find *.rlinq file added in the project. Click on *.rlinq file to view the data model.

image

Step 8: Add Open Access Service

By step 7 you have created data model. Now let us go ahead and add WCF Data Service on the created Data Model. For this right click on the project and select Add Open Acccess Service. You will be prompted with Add Open Access Service Wizard. Before adding Open Access Service make sure that once you have built the project.

You need to choose

  • Context : Select Context from the drop down
  • Project : Use the existing project

image

Click on the Next after selecting Context and Project.

Step 9: Select Service Type

In this step you need to choose type of Service you want to create. Since we want to create WCF Data Service, let us choose that from the option and click on the next.

image

Step 10: Choose Entity to expose a part of the Service

In this step we need to choose entity to expose as part of service. Since there is only one entity choose that and click on the Next

image

In last step you can preview various References being added and changes to config files. Click on Finish to create Service. After WCF Data Service being created you can test that in browser. Run the application and browse to *.svc to test the service in browser.

You should get data of People in browser as below after successful creation of the service.

image

Conclusion

In this post we learnt to create WCF Data Service using Open Access ORM. I hope you find this post useful. Thanks for reading.