Constructing a PySpark DataFrame Dynamically

Spark provides a lot of connectors to load data from various formats. Whether it is a CSV or JSON or Parquet you can use the magic of “spark.read”.

However, there are times when you would like to create a DataFrame dynamically using code. The one use case that I was presented with was to create a dataframe out of a very twisted incoming JSON from an API. So, I decided to parse the JSON manually and create a dataframe.

The approach we are going to use is to create a list of structured Row types and we are using PySpark for the task. The steps are as follows:

  1. Define the custom row class
personRow = Row("name","age")

2. Create an empty list to populate later

community = []

3. Create row objects with the specific data in them. In my case, this data is coming from the response that we get from calling the API.

qr = personRow(name, age)

4. Append the row objects to the list. In our program, we are using a loop to append multiple Row objects to the list.

community.append(qr)

5. Define the schema using StructType

person_schema = StructType([ \
                              StructField("name", StringType(), True), \
                              StructField("age", IntegerType(), True), \

                    ])

6. Create the dataframe using createDataFrame. The two parameters required is the data and schema to applied.

communityDF = spark.createDataFrame(community, person_schema)

Using the steps above, we are able to create a dataframe for use in Spark applications dynamically.

Azure AD Login for your SPA using React Hooks

In this tutorial we are implementing Microsoft Azure Ad Login in React single-page app and retrieve user information using @azure/msal-browser.

The MSAL library for JavaScript enables client-side JavaScript applications to authenticate users using Azure AD work and school accounts (AAD), Microsoft personal accounts (MSA) and social identity providers like Facebook, Google, LinkedIn, Microsoft accounts, etc. through Azure AD B2C service. It also enables your app to get tokens to access Microsoft Cloud services such as Microsoft Graph.

Azure application registration

You can register you web app using  Microsoft tutorial.

In the below App.js file code we are using Context and Reducer for managing the login state. useReducer allows functional components in React access to reducer functions from your state management

First we initialize the initial values for reducer,
Reducers are functions that take the current state and an action as arguments, and return a new state result. In other words, (state, action) => newState

You can see the code form 17 line where we created reducer with using switch case “LOGIN” and “LOGOUT”. and from Line 58 we use the context that defined in line 6 and passing the reducer values to it.

And In a line 66 : we are checking the state.isAuthenticated or not , if it’s not the page redirect to the home page and if it’s Authenticated it will be redirected to the admin dashboard home page.

App.Js

import React, { useReducer } from "react";
import { createBrowserHistory } from "history";
import { Router, Route, Switch, Redirect } from "react-router-dom";
import AdminLayout from "layouts/Admin.js";
import HomeLayout from "layouts/Home.js";
export const AuthContext = React.createContext();
const hist = createBrowserHistory();


const initialState = {
    isAuthenticated: false,
    user: null,
    token: null,
    
};

export const reducer = (state, action) => {
    switch (action.type) {
        case "LOGIN":
            localStorage.setItem("user", action.payload.user);
            localStorage.setItem("token", action.payload.token);
            return {
                ...state,
                isAuthenticated: true,
                user: action.payload.user,
                token: action.payload.token
            };
        case "LOGOUT":
            localStorage.clear();
            return {
                ...state,
                isAuthenticated: false,
                user: null
            };
        default:
            return state;
    }
};

function App() {
    const [state, dispatch] = useReducer(reducer, initialState);

    React.useEffect(() => {
        const user = localStorage.getItem('user') || null;
        const token = localStorage.getItem('token') || null;

        if (user && token) {
            dispatch({
                type: 'LOGIN',
                payload: {
                    user,
                    token
                }
            })
        }
    }, [])
    return (
        <AuthContext.Provider
            value={{
                state,
                dispatch
            }}
        >
            <Router history={hist}>
                <Switch>
                    <div className="App">{!state.isAuthenticated ?
                        <>
                            <Route path="/home" render={(props) => <HomeLayout {...props} />} />
                            <Redirect to="/home" /> </> :
                        <><Route path="/admin" render={(props) => <AdminLayout {...props} />} />
                            <Redirect to="/admin/dashboard" /></>}</div>
                </Switch>
            </Router>
        </AuthContext.Provider>
    );
}

export default App;

Now we are coming the the Login Component with Azure Ad using with @azure/msal-browser

In a below code we are importing AuthContext from App.js and PublicClientApplication from @azure/msal-browser.

and In a 7th line we created constant variable “dispatch” with providing the reference of AuthContext then we created the msalInstandce variable using the PublicClientApplication and providing the clientId, authority and redirectUri in auth block.

In a line 22 we created the startlogin function then we created the loginRequest variable and providing the scope for login.

and In a line 31 we use the msalInstance.loginPopup(loginRequest) for getting pop up window for login and in a “loginRespone” variable save the login response. then we are fetching the username and token form loginRespone. dispatch to the AuthContext (see line 41).

And In a line 56 we created a button it’s call the startLogin function on click event.

Login.cs

import React from "react";

import { AuthContext } from "App";
import {  PublicClientApplication } from '@azure/msal-browser';

export const Login = () => {
  const { dispatch } = React.useContext(AuthContext);

  const msalInstance = new PublicClientApplication({

    auth: {
      clientId: "65395645-90dd-4436-84a4-47d7c0e961fc",
      authority: "https://login.microsoftonline.com/common",
      redirectUri: "https://localhost:3000",
    },
    cache: {
      cacheLocation: "sessionStorage",
      storeAuthStateInCookie: true,
           }
  });

  const startLogin = async (e) => {
    e.preventDefault();
   
    var loginRequest = {
      scopes: ["65395645-90dd-4436-84a4-47d7c0e961fc/.default"] // optional Array<string>
    };
   
    try {
      
      const loginResponse = msalInstance.loginPopup(loginRequest);
      
      const username = (await loginResponse).account.username;
      const token = (await loginResponse).accessToken;

      const resJson = {
        token: token,
        user: username
      };

      dispatch({
        type: "LOGIN",
        payload: resJson
      })
      console.log("Login successfully");
    } catch (err) {

      console.log("Authentication Failed.....");
      console.log(err);
    }
  }
 
  return (
    <div>

      <button className="btn btn-sm btn-outline-success my-2 my-sm-0" data-toggle="button" onClick={(e) => { startLogin(e) }}>Login</button>

    </div>

  );
};

export default Login;

Above you see we used the reducer and context in App.js file and crated a Login Component.

Conclusion

Now we can import the Login component anywhere in a project where we want to login button that uses the Azure AD for Authentication.

Copy Data in Azure Databricks Table from one region to another

One of our customers had a requirement of copying data that was locked in an Azure Databricks Table in a specific region (let’s say this is eastus region). The tables were NOT configured as Delta tables in the originating region and a subset of personnel had access to both the regions.

However, the analysts were using another region (let’s say this is westus region) as it was properly configured with appropriate permissions. The requirement was to copy the Azure Databricks Table from eastus region to westus region. After a little exploration, we couldn’t find a direct/ simple solution to copy data from one Databricks region to another.

One of the first thoughts that we had was to use Azure Data Factory with the Databricks Delta connector. This would be the simplest as we would simply need a Copy Data activity in the pipeline with two linked services. The source would be a Delta Lake linked service to eastus tables and the sink would be another Delta Lake linked service to westus table. This solution faced two practical issues:

  1. The source table was not a Delta table. This prevented the use of Delta Lake linked service as source.
  2. When sink for copy activity is a not a blob or ADLS, it requires us to use a staging storage blob. While we were able to link a staging storage blob, the connection could not be established due to authentication errors during execution. The pipeline error looked like the following:
Operation on target moveBlobToADB failed: ErrorCode=AzureDatabricksCommandError,Hit an error when running the command in Azure Databricks. Error details: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Unable to access container adf-staging in account xxx.blob.core.windows.net using anonymous credentials, and no credentials found for them in the configuration. Caused by: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Unable to access container adf-staging in account xxx.blob.core.windows.net using anonymous credentials, and no credentials found for them in the configuration. Caused by: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: Public access is not permitted on this storage account..

On digging deeper, this requirement is documented as a prerequisite. We wanted to use the Access Key method but it requires the keys to be added into Azure Databricks cluster configuration. We didn’t have access to modify the ADB cluster configuration.

To work with this, we found the following alternatives:

  1. Use Databricks notebook to read data from non-Delta Tables in Databricks. The data can be stored in a staging Blob Storage.
  2. To upload the data into the destination table, we will again need to use a Databricks notebook as we are not able to modify the cluster configuration.

Here is the solution we came up with:

In this architecture, the ADB 1 notebook is reading data from Databricks Table A.1 and storing it in the staging blob storage in the parquet format. Only specific users are allowed to access eastus data tables, so the notebook has to be run in their account. The linked service configuration of the Azure Databricks notebook requires us to manually specify: workspace URL, cluster ID and personal access token. All the data transfer is in the same region so no bandwidth charges accrue.

Next the Databricks ADB 2 notebook is accesses the parquet file in the blob storage and loads the data in the Databricks Delta Table A.2.

The above sequence is managed by the Azure Data Factory and we are using Run ID as filenames (declared as parameters) on the storage account. This pipeline is configured to be run daily.

The daily run of the pipeline would lead to a lot of data in the Azure Storage blob as we don’t have any step that cleans up the staging files. We have used the Azure Storage Blob lifecycle management to delete all files not modified for 15 days to be deleted automatically.

Azure AD Authentication in ASP.NET Core Web API

The next step to building an API is to protect it from anonymous users. Azure Active Directory (AD) serves as an identity platform that can be used to secure our APIs from anonymous users. After the authentication is enabled, users will need to provide a OAuth 2.0/ JST token to gain access to our API.

Let us begin to implement Azure AD Authentication in ASP.NET Core 5.0 Web API.

I will be creating ASP.NET Core 5.0 project and show you step by step how to enable authentication on it using Azure AD Authentication. We will be doing it using the MSAL package from nuget.

Prerequisites

Before you start to follow steps given in this article, you will need an Azure Account, and Visual Studio 2019 with .NET 5.0 development environment step.

Creating ASP.NET Core 5.0 web application

Open visual studio and click on Create a new project in the right and select “Asp.net core web app” as shown in below image and click next.

In the configure your new project section enter name and location of your project as shown in below image and click next

In the additional information step, select .NET 5.0 in the target framework, Authentication Type to none and check Configure HTTPS checkbox and click on create.

Configuring ASP.NET Core 5.0 App for Azure AD Authentication

Open appsettings.json of your web api and add following lines of code.

"AzureAd": {
    "Instance": "https://login.microsoftonline.com/",
    "Domain": "gtmcatalyst.com",  "qualified.domain.name",
    "ClientId": "your-client-id",
    "TenantId": "your-tenant-id"
  }

Replace your-client-id and your-tenant-id with the actual values that you copied while doing app registration in azure ad

Next, add package manager console and add following two package references to your web application.

using Microsoft.AspNetCore.Authentication.JwtBearer;
using Microsoft.Identity.Web;

Next, open startup.cs in your project and paste following code in the ConfigureServices method

   services.AddAuthentication(JwtBearerDefaults.AuthenticationScheme)
               .AddMicrosoftIdentityWebApi(Configuration.GetSection("AzureAd"));

Next, in the Startup.cs, go to Configure method and add app.UseAuthentication(); line before app.UseAuthorization(); line.

Next, open any Controller and add [Authorize] attribute:

    [Authorize]
    [Route("[controller]")]
    [ApiController]
    public class SupportController : ControllerBase
    {

    }

Save all files and run your project.

You will notice that once you run the project, and try to access any method in support controller from the browser you will get return the HTTP ERROR 401 ( Unauthorized client error).

Conclusion

Our API is no longer available for anonymous access. It is now protected by Azure AD Authentication.