Navigating Data Privacy Risks in Generative AI

Generative AI (GenAI) tools — from ChatGPT and Gemini to Claude — are no longer just innovation experiments. They’re embedded in workflows, customer journeys, and enterprise applications.

But with that opportunity comes a sharp reality: GenAI introduces new data privacy risks that most corporate systems were never designed to handle.

This article breaks down:

  1. The top privacy challenges organisations face when adopting GenAI.
  2. What enterprise-grade solutions offer to mitigate these risks.
  3. Practical steps for enterprises and project managers to operationalise GenAI safely.

Why Privacy Matters More Than Ever

GenAI thrives on data — but that’s exactly where its risk lies. Every prompt, document, or snippet entered could become a data-governance concern.

The Core Privacy Risks

RiskWhat It MeansWhy It Matters
Data Leakage & Unintended ExposureEmployees may paste sensitive (PII) or proprietary data into public GenAI tools, losing control once it leaves the org boundary.Studies (Deloitte) show 75% of tech professionals rank privacy as a top GenAI concern.
Unintended Model TrainingNon-enterprise versions may use user inputs for model improvement.Proprietary IP could be absorbed into shared models.
Ownership & Retention AmbiguityConsumer tools often lack clarity on who owns prompts and outputs or how long data is stored.Creates legal uncertainty around IP and auditability.
Regulatory Compliance GapsGenAI usage must align with GDPR, CCPA, and emerging AI Acts.Non-compliance risks penalties and reputation loss.
Shadow IT & Unapproved UsageEmployees may use personal GenAI accounts.Creates blind spots for data exposure and audit gaps.

Bottom Line: GenAI amplifies traditional data risks — because the data flows are richer, models more opaque, and control boundaries blur faster.


The Solution: What It Looks Like

The solution lies in ensuring that GenAI tools respect the data privacy of the documents uploaded. The way GenAI tools have struck a balance is to offer “privacy” as a feature in their enterprise editions.

When enterprises upgrade to enterprise editions of GenAI tools, they gain visibility, ownership, and control. Let’s examine OpenAI’s enterprise commitments as an example blueprint.

1. Ownership & Control of Data

“You own and control your data. We do not train our models on your business data by default.” — OpenAI Enterprise Privacy

✅ Inputs and outputs remain your property.
✅ You control data retention.
✅ No auto-training on business data.

Why it matters:
Enterprises retain IP rights, reduce legal ambiguity, and align with data-minimisation and right-to-erasure principles.


2. Fine-Grained Access & Authentication

  • SAML-based SSO integration
  • Admin dashboards to control who can access what
  • Connector governance to approve or restrict data sources

Why it matters:
Access governance limits who can interact with sensitive data — and how. Permissions reduce accidental or malicious data exfiltration.


3. Security, Compliance & Certifications

  • AES-256 encryption at rest; TLS 1.2+ in transit
  • SOC 2 Type II certification
  • BAA & DPA support for regulated sectors (e.g., healthcare, finance)

Why it matters:
Aligns GenAI use with enterprise IT standards and compliance requirements.


4. Data Retention & Deletion Controls

“Admins control retention. Deleted conversations are removed within 30 days.” — OpenAI

  • Zero-Data-Retention (ZDR) for API endpoints
  • Custom deletion timelines

Why it matters:
Enables compliance with right-to-erasure laws and reduces long-term data exposure risk.


5. Model Training & Fine-Tuning Controls

  • No business data used for training without explicit opt-in.
  • Fine-tuned models remain exclusive to the enterprise.

Why it matters:
Prevents proprietary data from bleeding into shared models. Protects confidential business logic and datasets.


Translating Commitments into Enterprise Practice

Here’s how to operationalise privacy-by-design in your GenAI strategy.

StepActionWhy It Matters
1Create a “What Goes into GenAI” PolicyBan sensitive data (PII, source code, contracts) unless approved.
2Use Enterprise LicensesEnsure tools provide encryption, retention control, and no auto-training.
3Govern ConnectorsLimit which internal systems feed into GenAI tools.
4Define Retention RulesConfigure retention periods and deletion workflows.
5Monitor UsageUse compliance APIs to track prompts, access logs, and connectors.
6Train EmployeesReinforce responsible usage and red-flag categories (finance, HR, IP).
7Align with Legal & Governance PoliciesMap GenAI practices to your data-governance framework and DPIAs.
8Use ZDR Endpoints for Sensitive DataRequired for regulated or confidential workloads.
9Review RegularlyRe-audit tools and contracts as regulations evolve.

Applying It to Your Context: ERP & Project Management

For ERP implementations — where financial, HR, and vendor data converge — the stakes are higher.

  • Data boundaries: Never use GenAI with live financial or payroll data unless the endpoint is enterprise-secured.
  • Vendor contracts: Ensure client or third-party NDAs don’t prohibit AI-based processing. Include a section that mentions that contracts may be reviewed by GenAI tools.
  • Governance embedding: Add GenAI checkpoints in your project governance map — who approves prompts, what is logged, how outputs are validated.
  • Audit readiness: Maintain GenAI usage logs — prompt, purpose, output, approver.
  • Model isolation: When fine-tuning internal GenAI workflows, use isolated models to prevent cross-project exposure.

The Takeaway

Generative AI unlocks speed and scale, but data privacy is the cost of entry for responsible adoption.

Enterprise GenAI tools — like OpenAI’s ChatGPT Enterprise — now offer the controls and transparency needed for compliant, secure innovation. Yet, technology alone isn’t enough.

Enterprises must also embed:

  • Governance (policies & audits)
  • Awareness (training & culture)
  • Alignment (legal & regulatory frameworks)

For project managers and IT leaders, the mission is clear:
👉 Innovate boldly, govern responsibly, and ensure that data privacy remains the cornerstone of your GenAI strategy.

Secure Databricks Notebook Run in Azure Data Factory with Service Principal

We recently had a case where a customer was running Azure Databricks notebook from Azure Data Factory. They mentioned that the configuration used to run the notebook was not secure as it was associated with a user account (probably a service account). The username and password was known to multitude of users and it had already caused some trouble. The concept of a “service account” has been quite prevalent since the times of on-premise applications. The service account was a user account that was used to run various applications and nothing else. In the modern world of cloud services, service accounts are obsolete and should never be used.

He wondered if there was a better way to run such notebooks in a more secure way?

In a single line, the solution to this problem is running the notebook as a service principal (and not a service account).

Here is a detailed step by step guide on how to make this work for Azure Databricks notebooks running from Azure Data Factory:

  1. Generate a new Service Principal (SP) in Azure Entra ID and create a client secret
  2. Add this Service Principal to the Databricks workspace and grant “Cluster Creation” rights
  3. Grant this Service Principal User Access to the Databricks workspace as a user
  4. Generate a Databricks PAT for this Service Principal
  5. Create a new Linked Service in ADF using Access Token credentials of the Service Principal
  6. Create a new workflow and execute it

If you followed the above instructions, your Databricks notebook will be run as a Service Principal.

One thing to understand before going over the detailed steps is the fact that when you add a Databricks notebook as an activity in Azure Data Factory, you are creating an “ephemeral job” in Databricks in the background when the ADF pipeline is run. You can control what kind of job cluster gets created and the user credentials under which the job will be run based on the linked service settings. Let’s get started:

  1. Generate a new SP with client secret
  1. Go to portal.azure.com and search for “Microsoft Entra ID” in the search tab and click it
  2. Click on Add and select “App registration”
  3. Give a meaningful name and leave the rest as-is and click “register”
  4. You should now see something like below
  5. Now click on “Certificates & secrets” and under the “Client secrets” tab click “New client secret”
  6. Under the “Add a client secret” pop-up, give a meaningful description and choose the appropriate expiration window and click “create”. Once done you should see something like this under the “Client secrets” tab
  7. Now open a notepad and have all the information noted there.
    Host: Databricks workspace URL
    Azure_tenant_id and Azure_client_id will be what is shown in the image under point 4
    Azure_client_secret will be the value in the image under point 6

2. Add this Service Principal to the Databricks workspace and grant “Cluster Creation” rights

  1. Now go to your target workspace and click on your Name to expand the menu, click open the settings( this requires workspace admin privilege)
  2. Now click on” identity and access” and under service principals click on manage

Click on “add service principal” button
Now click on Add new
Select the “Microsoft Entra ID managed” radio button and populate with the Application ID(client id) and give the same name as shown in point 4 and click Add

Now provide “Allow cluster creation” permission to the service principal to ensure it can create a new job cluster when we run through an ADF and click “update”

Generate a Databricks PAT for this Service Principal

  1. If you are on Mac open the terminal and use the command “nano .databrickscfg” to open the databricks config file and paste the information from the notepad from point 8.
    If you are on Windows press windows + R and type “notepad
    %APPDATA%\databricks\databricks.cfg” and press enter. Now paste the information from the notepad from point 8 into the Databricks config file. You will also have to give a meaningful profile name to identify the workspace in the config file.

  2. The config file should look like below once you are done. [hdfc-dbx] is the profile name here.Save and exit.
  3. Now go to Databricks CLI and authenticate the service principal’s profile using a PAT token by typing “databricks configure”. It will prompt for the host workspace’s URL and PAT token, provide it, and press enter.
  4. Now goto databricks cli and generate a PAT token for the Service Principal using the command “databricks tokens create –lifetime-seconds 86400 -p <profile name>”. You should see a proper response like below:
  5. Now grab the “token value” and keep it in a notepad safely

Create a new Linked Service in ADF using Access Token credentials of the Service Principal

  1. Now go to ADF, click on Manage -> Linked services -> New
    Under New Linked Service popup click on “compute” and select “Azure Databricks” click continue
  2. Now provide a meaningful name to the linked service, select the proper Azure subscription, and your target workspace. Under Authentication Type leave the option as Access Token and under the access token field update the access token we generated in point 13.(This is being done as an interim solution, the best practice is to use an Azure Key Vault and pass the token through secret).
  3. Now select the cluster version as 13.3 LTS, Node type, Python version, and Worker options as relevant to current cluster configuration. You can select the UC Catalog Access Mode as “Assigned” for single user and “Shared” for shared access mode. Click on test connection to ensure everything is working as intended.

Create a new workflow and execute it

  1. Now create a pipeline select the linked service that you created and point the ADF to a notebook.
  2. Click “Debug” and the job should run successfully

If you open the job execution link, you can see the job has been run as a Service Principal under task details

This is indicated by the value against “Run as” which is that of the Service Principal.

Indian GDPR (DPDP) affects every marketer in India

The Indian government has recently passed the Digital Personal Data Protection Bill (DPDP) in 2023. This is a significant step towards establishing a framework for managing citizens’ data in India.

Previously, data protection was governed by the Information Technology Act of 2000 (Section 21). With this law, customers have been granted specific rights over their data including correction, erasure, grievance redressal, and withdrawing consent.

For every marketer in India, it’s essential to understand and follow the provisions of the DPDP law to avoid severe penalties. Whether you do digital marketing, events marketing or are an organisation collecting data, this bill affects you. Even companies collecting data from their employees are within ambit of this law.

Penalties for Data Fiduciaries/ data collectors breaching customer data security can reach up to Rs 250 crore or USD 30 mn. Penalties are influenced by severity, repetition, and actions taken by the fiduciary.

In this blog post, we’ll highlight the key points of this law that are relevant for marketers:

To grasp the DPDP, it’s important to know the main entities involved:

  1. Data Fiduciaries: These are the parties primarily responsible for handling data. As a website owner, you are a Data Fiduciary.
  2. Data Principals: These are your customers or individuals whose data you are handling.
  3. Data Processors: These are entities that process data on behalf of Data Fiduciaries. Processing includes various operations like collecting, storing, transmitting, and more.

Whether your website is hosted in India or abroad, if it deals with data of Indian citizens, the DPDP law applies (Section 3(b)). As website owners, you can appoint a different Data Processor (Section 8(2)), but you are responsible for data handling and ensuring the processor complies by implementing appropriate measures. This means that you can use external service providers e.g. for emails, SMSs, Whatsapp, Social Media but are responsible for their adherence to these laws.

Obligations of Data Fiduciaries are as follows:

  • Processing for Lawful Purposes (Section 4): Data can only be used for lawful purposes for which the data principal has given consent. You need to notify individuals about the purpose of data collection at the time of data collection. If you acquired customers before this law was enacted, you must provide this notification as soon as possible. The burden of proof for consent lies with the Data Fiduciary.
  • Consent and Withdrawal (Section 6): Individuals can withdraw consent at any time, and you must stop processing their data within a reasonable time (not defined). This includes deleting data from both the processor and fiduciary.
  • Data Protection Officer (Section 8(9)): You must appoint this officer to address customer data queries.
  • Exceptions for Processing without Consent (Section 7): Certain exceptions exist where prior consent isn’t needed, such as government processing, medical emergencies, financial assessments, compliance with legal judgments, and natural disasters.
  • Breach Notification (Section 8(6)): If there’s a data breach, you must notify affected parties.
  • Data of Children (Section 9): Consent from parents is required for individuals under 18. Advertising targeting children is prohibited.

DPDP isn’t as strict as GDPR in terms of processing data within national boundaries, but this can be restricted by government notification. It clarifies that if other laws limit data residency, DPDP doesn’t relax those restrictions. The Government can override data protection laws for maintaining state security and sovereignty (Section 17(2)).

Foreign citizens’ data can be processed in India through valid contracts (Section 17).

DPDP provides an additional advantage to registering yourself as a startup. A lot of exceptions around data compliance apply to startups (Section 17(3)).

As a final clarification, if there are any other laws that are in conflict with this law, then the provisions of this law will prevail to the extent of this conflict.

Disclaimer: This is an overview, consult your legal representative for specific advice.

Securing Azure Databricks ML Model Serving Endpoint with Service Principals: Step by Step

A recent customer of mine had deployed a Machine Learning Model developed using Databricks. They had correctly used MLFlow to track the experiments and deployed the final model using Serverless ML Model Serving. This was deemed extremely useful as the serverless capability allowed the compute resources to scale to zero when not in use shaving off precious costs.

Once deployed, the inferencing/ scoring can be done via an HTTP POST request that requires a Bearer authentication and payload in a specific JSON format.

curl -X POST -u token:$DATABRICKS_PAT_TOKEN $MODEL_VERSION_URI \
  -H 'Content-Type: application/json' \
  -d '{"dataframe_records": [
    {
      "sepal_length": 5.1,
      "sepal_width": 3.5,
      "petal_length": 1.4,
      "petal_width": 0.2
    }
  ]}'

This Bearer token can be generated using the logged in user dropdown -> User Settings -> Generate New Token as shown below:

Databricks Personal Access Token

In Databricks parlance, this is the Personal Access Token (PAT) and is used to authenticate any API calls that are made to Databricks. Also, note that lifetime of the token (defaults to 90 days) and is easily configured.

The customer had the Data Scientist generate the PAT and use it for calling the endpoint.

However, this is an anti-pattern for security. Since the user is a Data Scientist on the Databricks workspace, the PAT would get access to the workspace and allow a nefarious actor to achieve higher privileges. If a user’s account is decommissioned, the user loses access to all Databricks objects. This user based PAT was flagged by the security team and they were requested to replace it with a service account as is a standard practice in any in-premise system.

Since this is an Azure deployment, the customer’s first action was to create a new user account in the Azure Active Directory, add the user to the Databricks workspace and generate the PAT for this “service account”. However, this is still a security hazard given the user has interactive access to Databricks workspace.

I suggested the use of Azure Service Principal that would secure the API access with the lowest privileges. The service principal acts as a client role and uses the OAuth 2.0 client credentials flow to authorize access to Azure Databricks resources. Also, the service principal is not tied to a specific user in the workspace.

We will be using two separate portals and Azure CLI to complete these steps:

  1. Azure Portal
  2. Databricks Workspace

In this post, I am detailing the steps required to do the same (official documentation here):

  1. Generate an Azure Service Principal: Now this step is completely done in the Azure Portal (not Databricks workspace). Go to Azure Active Directory -> App Registrations -> Enter some Display Name -> Select “Accounts in this organisation only” -> Register. After the service principal is generated, you will need its Application (client) ID and Tenant ID as shown below:
  2. Generate Client Secret and Grant access to Databricks API: We will need to create Client Secret and Register Databricks access.
    To generate client secret, remain on Service Principal’s blade -> Certificates & Secrets -> New Client secret -> Enter Description and Expiry -> Add. Copy and store the Client Secret’s Value (not ID) separately. Note: The client secret Value is only displayed once.

    Further, In the API permissions -> Add a permission -> APIs my organization users tab -> AzureDatabricks -> Permissions.
  3. Add SP to the workspace: Now, Databricks has the ability to sync all identities created to the workspace if “Identity Federation” is enabled. Unfortunately, for this customer this was not enabled. Next, they needed to assign permission for this SP.
    This step in done in your Databricks workspace. Go to Your Username -> Admin Settings -> Service Principals tab -> Add service principal -> Enter Application ID from the Step 1 and a name -> No other permissions required -> Add.
  4. Get Service Principal Contributor Access: Once the service principal has been generated, you will need to get Contributor access to the subscription. You will need help of your Azure Admin for the same.
  5. Authenticate as service principal: First login into Azure CLI as the Service Principal with the command:
az login \
--service-principal \
--tenant <Tenant-ID>
--username <Client-ID> \
--password <Client-secret> \
--output table

Use the values from Step 1 (tenant ID & Client ID) and Step 2 (Client Secret) here:

  1. Generate Azure AD tokens for service principals: To generate Azure AD token you will need to use the following command:
az account get-access-token \
--resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d \
--query "accessToken" \
--output tsv

The resource ID should be the same as shown above as it represents Azure Databricks.

This will create a Bearer token that you should preserve for the final step.

  1. Assign permissions to SP to access Model Serving: The Model Serving endpoint has its own access control that provides for following permissions: Can ViewCan Query, and Can Manage. Since we require to query the endpoint, we need to grant “Can Query” permission to the SP. This is done using the Databricks API endpoints as follows:
$ curl -n -v -X GET -H "Authorization: Bearer $ACCESS_TOKEN" https://$WORKSPACE_URL../api/2.0/serving-endpoints/$ENDPOINT_NAME 

// $ENDPOINT_ID = the uuid field from the returned object 

Replace @ACCESS_TOKEN with the bearer token generated in Step 6. The @WORKSPACE_URL should be replaced by the Databricks Workspace URL. Finally, the $ENDPOINT_NAME should be replaced by the name of the endpoint created.

This will give you the ID of the Endpoint as shown:

Next, you will need to issue the following command:

$ curl -n -v -X PATCH -H "Authorization: Bearer $ACCESS_TOKEN" -d '{"access_control_list": [{"user_name": "$SP_NAME", "permission_level": "CAN_QUERY"}]}' https://$WORKSPACE_URL../api/2.0/permissions/serving-endpoints/$ENDPOINT_ID

$ACCESS_TOKEN: From Step 6
$SP_NAME: Name of the service principal as created in Azure Portal
$WORKSPACE_URL: Workspace URL
@ENDPOINT_ID: Generated in the command above

  1. Generate Databricks PAT from Azure AAD Token: While you have the Azure AAD token now available and it can used for calling the Model Serving endpoint, the Azure AAD tokens are shortlived.
    To create a Databricks PAT, you will need to make a POST request call to the following endpoint (I am using Postman for the request): https://$WORKSPAE_URL/api/2.0/token/create
    The Azure AAD Bearer token should be passed in the Authorization header. In the body of the request, just pass in a JSON with “comment” attribute.
    The response of the POST request will contain an attribute – “token_value” that can be used as Databricks PAT.

The token_value, starting with “dapi” is the PAT that can be used for calling the Model Serving Endpoint that is in a secure configuration using the Service Principal.

Introduction to Machine Learning with Spark ML – III

In the last post, we saw how we can use Pipelines to streamline our machine learning workflow.

We will start off with a fantastic feature that will blow the lid off your mind. All the effort done till now, in post I and II, can be done in 2 lines of code! Ah yes, this magic is possible with use of a feature called AutoML. Not only will it perform preprocessing steps automatically, it will also select the best algorithm from a multitude of them including XGBoost, LightGBM, Prophet etc. The hyper parameter search comes for free 🙂 If all this was not enough it will share the entire auto-generated code for your use and modification. So, the two magic lines are:

from databricks import automl
summary = automl.classify(train, target_col="category", timeout_minutes=20)

The AutoML requires you to specify the following:

  1. Type of machine learning problem – Classification/ Regression/ Forecasting
  2. Specify the training set, target column
  3. Timeout after which the autoML will stop looking for better models.

While all this is very exciting, the real world use case of AutoML tends to be to create a baseline for model performance and give us a model to start with. In our case, the AutoML gave an RoC score of 91.8% (actually better than our manual work so far!!) for the best model.

Looking at the auto-generated Python notebook, here are the broad steps it took:

  1. Load Data
  2. Preprocessing
    • Impute values for missing numerical columns
    • Convert each categorical column into multiple binary columns through one-hot encoding
  3. Train – Validation – Test Split
  4. Train classification model
  5. Inference
  6. Determine Accuracy – Confusion matrix, ROC and Precision-Recall curves for validation data

This is broadly in line with what we did in our manual setup.

Apart from the coded approach we saw, you can create AutoML experiment using the UI from the Experiments -> Create AutoML Experiment button. If you can’t find the experiments tab, make sure that you have the Machine Learning persona selected in the Databricks workspace.

In an enterprise/ real world scenario, we will build many alternate models with different parameters and use the one with higest accuracy. Next, we deploy the model and use it for predictions with new data. Finally, the selected model will be updated over time as new data becomes available. Until now, we didn’t talk about how to handle these enterprise requirements or were doing it manually on best effort.

These requirements are covered under what we know as MLOps. Spark supports MLOps using an open source framework called MLFlow. MLFlow supports the following objects:

  1. Projects: Provides a mechanism for storing machine learning code in a reusable and reproducible format.
  2. Tracking: Allows for tracking your experiments and metrics. You can see the history of your model and its accuracy evolve over time. There is also a tracking UI available.
  3. Model Registry: Allows for storing various models that you develop in a registry with an UI to explore the same. It also provides for model lineage (which MLflow experiment and run produced the model), stage transitions (for example from staging to production)
  4. Model Serving: We can use serving to provide inference endpoints for either batch or inline processing. Mostly this will be made available as REST endopoints.

There is a very easy way to get started with MLFlow where we allow MLFlow to log automatically the metrics and models. This can be done using a single line:

import mlflow
mlflow.autolog()

This will log the parameters, metrics, models and the environment. The core concept to get with MLOps is the concept of runs. Each run is a unique combination of parameters and algorithm that you have executed. Many runs can be part of the same experiment.

To get started we can set name of the experiment with the command: mlflow_set_experiment(“name_of_the_experiment”)

To start tracking the experiments manually, we can setup the context as follows:

with mlflow.start_run(run_name="") as run:
   <pseudo code for running an experiment>
   mlflow.log_param("key","value")
   mlflow.log_metrics("key","value")
   mlflow.spark.log_model(model,"model name")

You can track parameters and metrics using log_param and log_metrics functions of mlflow object. The model can now be registered with the Model Registry using the function: mlflow.register_model(model_uri=””, name=””)

What is important here is the model_uri. The model_uri takes the form: runs:/<runid>/model. The runid identifies the specific run in the experiment and each model is stored at the model_uri location mentioned above.

You can now load the model and perform inference using the following code:

import mlflow

# Load model
loaded_model = mlflow.spark.load_model(model_uri)

# Perform inference via model.transform()
loaded_model.transform(data)

While we have seen how to track experiments explicitly, Databricks Workspaces also track the experiments automatically (from Databricks Runtime 10.3 ML and above). You can view the expriments and their runs in the UI via the Experiments sidebar of the Machine Learning persona.

You will need to click on the specific experiment (in our case Adult Dataset), that will show all the runs of the experiment. Click on the specific run to get more details about the run. The run will show the metrics recorded which is in our case was areaUnderRoC of 91.4%.

Under Artifacts, if you click on the model, you can see the URI of the model run. This URI can be used to register the model with the Model Registry and use it for predictions at any point in time.

MLFlow also supports indicating the state of the model for production. Different states supported by MLFlow are:

  1. None
  2. Staging
  3. Production
  4. Archived

Once your model is registered with the Model registry, you can change the state of the model to any other state with the function transition_model_version_stage() function.

From the model registry you are able to create model serving endpoint using the Serverless Real-Time Inference service that uses managed Databricks compute service to provide a REST endpoint.

Introduction to Machine Learning with Spark ML – II

In the earlier post, we went over some concepts regarding Machine Learning done with Spark ML. Here are primarily 2 types of objects relating to machine learning we saw:

  1. Transformers: Objects that took a DataFrame, changed something in it and returned a DataFrame. The method used here was “transform”.
  2. Estimator: Objects that are passed in a DataFrame and would apply an algorithm on it to return a transformer. E.g. GBTClassifier. We used the “fit” function to apply the algorithm on the Dataframe.

In our last example of predicting income level using Adult dataset, we had to change our input dataset to a format that is suitable for machine learning. There was a sequence of changes we had done e.g. converting categorical variables to numeric, One Hot Encoding & Assembling the columns in a single column. Everytime there is additional data available (which will be numerous times), we will need to do these steps again and again.

In this post, we will introduce a new Object that organises these steps in sequence that can be run as many times as needed and it is called the Pipeline. The Pipeline chains together various transformers and estimators in sequence. While we could do the machine learning without the Pipeline, it is a standard practice to put the sequence of steps in a Pipeline. Before we get there, let’s try to add an additional step in fixing our pipeline and that is to identify and remove Null data. This is indicated in our dataset as ‘?’.

To know how many null values exist let’s run this command:

from pyspark.sql.functions import isnull, when, count, col
adultDF.select([count(when(isnull(c), c)).alias(c) for c in adultDF.columns]).show()

The result shows that there are no null values. Inspecting the data, we see that null values have been replaced with “?”. We would need to remove these rows from our dataset. We can replace the ? with null values as follows:

adultDF = adultDF.replace('?', None)

Surprisingly this doesn’t change the ? values. It appeared that the ? is padded with some spaces. So we will use the when and trim function as follows:

from pyspark.sql.functions import isnull, when, count, col,trim
adultDF = adultDF.select([when(trim(col(c))=='?',None).otherwise(col(c)).alias(c) for c in adultDF.columns])

This replaces ? will null that we can now drop from our dataframe using dropna() function. The number of rows remaining are now 30,162.

Now let’s organise these steps in a Pipeline as follows:

from pyspark.ml import Pipeline

adultPipeline = Pipeline(stages = [wcindexer,eduindexer,maritalindexer,occupationindexer,relindexer,raceindexer,sexindexer,nativecountryindexer,categoryindexer,ohencoder,colvectors])

The stages list contains all the transformers we used to convert raw data into dataset ready for machine learning. This includes all the StringIndexers, OneHotEncoder and VectorAssembler. Next, the process of defining the GBTClassifier and BinaryClassificationEvaluator remains the same as in the earlier post. You can now include the GBTClassfier in the pipeline as well and run the fit() on this pipeline with train dataset as follows:

adultMLTrainingPipeline = Pipeline(stages = [adultPipeline,gbtclassifier])
gbmodel  = adultMLTrainingPipeline.fit(train)

However, we can perform another optimization at this point. The model currently trained is based of a random split of values from the dataset. Cross Validation can help generalise the model even better by determining best parameters from a list of parameters and do it by creating more than one train and test datasets (called as folds). The list of parameters are supplied as ParamGrid as follows:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

paramGrid = ParamGridBuilder()\
  .addGrid(gbtclassifier.maxDepth, [2, 5])\
  .addGrid(gbtclassifier.maxIter, [10, 100])\
  .build()

# Declare the CrossValidator, which performs the model tuning.
cv = CrossValidator(estimator=gbtclassifier, evaluator=eval, estimatorParamMaps=paramGrid)

The cross validator object takes the estimator, evaluator and the paramGrid objects. The pipeline will need to be modified to use this cross validator instead of the classifier object we used earlier as follows:

adultMLTrainingPipeline = Pipeline(stages = [adultPipeline,gbtclassifier])


adultMLTrainingPipeline = Pipeline(stages = [adultPipeline,cv])

With these settings, the experiment ran for 22 mins and the evalution result came out to be 91.37% area under RoC.

Introduction to Machine Learning with Spark ML – I

Machine Learning is most widely done using Python and scikit-learn toolkit. The biggest disadvantage of using this combination is the single machine limit that Python imposes on training the model. This limits the amount of data that can be used for training to the maximum memory on the computer.

Industrial/ enterprise datasets tend to be in terabytes and hence the need for a parallel processing framework that could handle enormous datasets was felt. This is where Spark comes in. Spark comes with a machine learning framework that can be executed in parallel during training using a framework called Spark ML. Spark ML is based on the same Dataframe API that is widely used within the Spark ecosystem. This requires minimal additional learning for preprocessing of raw data.

In this post, we will cover how to train a model using Spark ML.

In the next post, we will introduce the concept of Spark ML pipelines that allow us to process the data in a defined sequence.

The final post will cover MLOps capabilities that MLFlow framework provides for operationalising our machine learning models.

We are going to be Adult Dataset from UCI Machine Learning Repository. Go ahead and download the dataset from the “Data Folder” link on the page. The file you are interested to download is named “adult.data” and contains the actual data. Since the format of this dataset is CSV, I saved it on my local machine as adult.data.csv. The schema of this dataset is available in another file titled – adult.names. The schema of the dataset is as follows:

age: continuous.
workclass: categorical.
fnlwgt: continuous.
education: categorical.
education-num: continuous.
marital-status: categorical.
race: categorical.
sex: categorical.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: categorical.
summary: categorical

The prediction task is to determine whether a person makes over 50K in a year which is contained in the summary field. This field contains value of <50K or >=50K and is our target variable. The machine learning task is that of binary classification.

Upload file in DBFS

The first step was to upload the dataset from where it is accessible. I chose DBFS for ease of use and uploaded the file at the following location: /dbfs/FileStore/Abhishek-kant/adult_dataset.csv

Once loaded in DBFS, we need to access the same as a Dataframe. We will apply a schema while reading the data since the data doesn’t come with header values as indicated below:

adultSchema = "age int,workclass string,fnlwgt float,education string,educationnum float,maritalstatus string,occupation string,relationship string,race string,sex string,capitalgain double,capitalloss double,hoursperweek double,nativecountry string,category string"

adultDF = spark.read.csv("/FileStore/Abhishek-kant/adult_dataset.csv", inferSchema = True, header = False, schema = adultSchema)

A sample of the data is shown below:

We need to move towards making this dataset machine learning ready. Spark ML only works with numeric data. We have many text values in the dataframe that will need to be converted to numeric values.

One of the key changes is to convert categorical variables expressed as string into labels expressed as string. This can be done using StringIndexer object (available in pyspark.ml.feature namespace) as illustrated below:
eduindexer = StringIndexer(inputCol=”education”, outputCol =”edu”)

The inputCol indicates the column to be transformed and outputCol is the name of the column that will get added to the dataframe after converting to the categorical label. The result of the StringIndexer is shown to the right e.g. Private is converted to 0 while State-gov is converted to 4.

This conversion will need to be done for every column:

#Convert Categorical variables to numeric
from pyspark.ml.feature import StringIndexer

wcindexer = StringIndexer(inputCol="workclass", outputCol ="wc")
eduindexer = StringIndexer(inputCol="education", outputCol ="edu")
maritalindexer = StringIndexer(inputCol="maritalstatus", outputCol ="marital")
occupationindexer = StringIndexer(inputCol="occupation", outputCol ="occ")
relindexer = StringIndexer(inputCol="relationship", outputCol ="relation")
raceindexer = StringIndexer(inputCol="race", outputCol ="racecolor")
sexindexer = StringIndexer(inputCol="sex", outputCol ="gender")
nativecountryindexer = StringIndexer(inputCol="nativecountry", outputCol ="country")
categoryindexer = StringIndexer(inputCol="category", outputCol ="catlabel")

This creates what is called a “dense” matrix where a single column contains all the values. Further, we will need to convert this to “sparse” matrix where we have multiple columns for each value for a category and for each column we have a 0 or 1. This conversion can be done using the OneHotEncoder object (available in pyspark.ml.feature namespace) as shown below:
ohencoder = OneHotEncoder(inputCols=[“wc”], outputCols=[“v_wc”])

The inputCols is a list of columns that need to be “sparsed” and outputCols is the new column name. The confusion sometimes is around fitting sparse matrix in a single column. OneHotEncoder uses a schema based approach to fit this in a single column as shown to the left.

Note that we will not sparse the target variable i.e. “summary”.

The final step for preparing our data for machine learning is to “vectorise” it. Unlike most machine learning frameworks that take a matrix for training, Spark ML requires all feature columns to be passed in as a single vector of columns. This is achieved using VectorAssembler object (available in pyspark.ml.feature namespace) as shown below:

colvectors = VectorAssembler(inputCols=["age","v_wc","fnlwgt","educationnum","capitalgain","capitalloss","v_edu","v_marital","v_occ","v_relation","v_racecolor","v_gender","v_country","hoursperweek"],
outputCol="features")

As you can see above, we are adding all columns in a vector called as “features”. With this our dataframe is ready for machine learning task.

We will proceed to split the dataframe in training and test data set using randomSplit function of dataframe as shown:

(train, test) = adultMLDF.randomSplit([0.7,0.3])

This will split our dataframe into train and test dataframe in 70:30 ratio.
The classifier used will be Gradient Boosting classifier available as GBTClassifier object and initialised as follows:

from pyspark.ml.classification import GBTClassifier

classifier = GBTClassifier(labelCol="catlabel", featuresCol="features")

The target variable and features vector column is passed as attributes to the object. Once the classifier object is initialised we can use it to train our model using the “fit” method and passing the training dataset as an attribute:

gbmodel = classifier.fit(train)

Once the training is done, you can get predictions on the test dataset using the “transform” method of the model with test dataset passed in as attribute:

adultTestDF = gbmodel.transform(test)

The result of this function is addition of three columns to the dataset as shown below:

A very important task in machine learning is to determine the efficacy of the model. To evaluate how the model performed, we can use the BinaryClassificationEvaluator object as follows:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

eval = BinaryClassificationEvaluator(labelCol = "catlabel", rawPredictionCol="rawPrediction")

eval.evaluate(adultTestDF)

In the initialisation of the BinaryClassificationEvaluator, the labelCol attribute specifies the actual value and rawPredictionCol represents the predicted value stored in the column – rawPrediction. The evaluate function will give the accuracy of the prediction in the test dataset represented as AreaUnderROC metric for classification tasks.

You would definitely want to save the trained model for use later by simply saving the model as follows:

gbmodel.save(path)

You can later retrieve this model using “load” function of the specific classifier:

from pyspark.ml.classification import GBTClassificationModel
classifierModel = GBTClassificationModel.load(path)

You can now use this classification model for inferencing as required.

Azure Databricks – What is it costing me?

A customer wanted to know the total cost of running Azure Databricks for them. They couldn’t understand what DBU (Databricks Units) was, given that is the pricing unit for Azure Databricks.

I will attempt to clarify DBU in this post.

DBU is just really an abstraction not related to any amount of compute or compute metrics. At the same time, it changes with the following factors:

  1. When there is a change in the kind of machine you are running the cluster on
  2. When there is change in the kind of workload (e.g. Jobs, All Purpose)
  3. The tier / capabilities of the workload (Standard / Premium)
  4. The kind of runtime (e.g. with or without Photon)

So, the DBUs really measure the consumption of resources. It is billed on a per second basis. The monetary value of these DBUs is called $DBU and is determined by a $ rate charged per DBU. Pricing rate per DBU is available here.

You are paying DBUs for the software that Databricks has made available for processing your Big Data workload. The hardware component needed to run the cluster is charged directly by Azure.

When you visit the Azure Pricing Calculator and look at the pricing for Azure Databricks, you would see that there are two distinct sections – one for compute and another for Databricks DBUs that specifies this rate.

The total price of running an Azure Databricks cluster is a combination of the above two sections.

Let us now understand if we are running a cluster of 6 machines what is the total cost likely to be:

For the All Purpose Compute running in West US in Standard tier with D3 v2, the total cost for compute (in PAYG model) is likely to be USD 1,222/ month.

From the calculator you can see that DBU consumption is 0.75 DBU with a rate of USD 0.4/ hr. So, the total cost of running 6 D3 v2 machines in the cluster will be $ 219 * 6 = $ 1,314.

The total cost hence would be USD 2,536.

From an optimization perspective, you can reduce your hardware compute by making reservations e.g. 1 yr or 3 yrs in Azure. DBUs from the Azure calculator doesn’t have any upfront commitment discounts available. However, they are available if you contact Databricks and request for discount on a fixed commitment.

If you are interested in saving costs while running Databricks clusters, here are two blog posts that you may find interesting:
https://medium.com/similarweb-engineering/how-we-cut-our-databricks-costs-by-50-7c60d6b6c069
https://www.databricks.com/blog/2022/10/18/best-practices-cost-management-databricks.html

Constructing a PySpark DataFrame Dynamically

Spark provides a lot of connectors to load data from various formats. Whether it is a CSV or JSON or Parquet you can use the magic of “spark.read”.

However, there are times when you would like to create a DataFrame dynamically using code. The one use case that I was presented with was to create a dataframe out of a very twisted incoming JSON from an API. So, I decided to parse the JSON manually and create a dataframe.

The approach we are going to use is to create a list of structured Row types and we are using PySpark for the task. The steps are as follows:

  1. Define the custom row class
personRow = Row("name","age")

2. Create an empty list to populate later

community = []

3. Create row objects with the specific data in them. In my case, this data is coming from the response that we get from calling the API.

qr = personRow(name, age)

4. Append the row objects to the list. In our program, we are using a loop to append multiple Row objects to the list.

community.append(qr)

5. Define the schema using StructType

person_schema = StructType([ \
                              StructField("name", StringType(), True), \
                              StructField("age", IntegerType(), True), \

                    ])

6. Create the dataframe using createDataFrame. The two parameters required is the data and schema to applied.

communityDF = spark.createDataFrame(community, person_schema)

Using the steps above, we are able to create a dataframe for use in Spark applications dynamically.

Copy Data in Azure Databricks Table from one region to another

One of our customers had a requirement of copying data that was locked in an Azure Databricks Table in a specific region (let’s say this is eastus region). The tables were NOT configured as Delta tables in the originating region and a subset of personnel had access to both the regions.

However, the analysts were using another region (let’s say this is westus region) as it was properly configured with appropriate permissions. The requirement was to copy the Azure Databricks Table from eastus region to westus region. After a little exploration, we couldn’t find a direct/ simple solution to copy data from one Databricks region to another.

One of the first thoughts that we had was to use Azure Data Factory with the Databricks Delta connector. This would be the simplest as we would simply need a Copy Data activity in the pipeline with two linked services. The source would be a Delta Lake linked service to eastus tables and the sink would be another Delta Lake linked service to westus table. This solution faced two practical issues:

  1. The source table was not a Delta table. This prevented the use of Delta Lake linked service as source.
  2. When sink for copy activity is a not a blob or ADLS, it requires us to use a staging storage blob. While we were able to link a staging storage blob, the connection could not be established due to authentication errors during execution. The pipeline error looked like the following:
Operation on target moveBlobToADB failed: ErrorCode=AzureDatabricksCommandError,Hit an error when running the command in Azure Databricks. Error details: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Unable to access container adf-staging in account xxx.blob.core.windows.net using anonymous credentials, and no credentials found for them in the configuration. Caused by: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Unable to access container adf-staging in account xxx.blob.core.windows.net using anonymous credentials, and no credentials found for them in the configuration. Caused by: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: Public access is not permitted on this storage account..

On digging deeper, this requirement is documented as a prerequisite. We wanted to use the Access Key method but it requires the keys to be added into Azure Databricks cluster configuration. We didn’t have access to modify the ADB cluster configuration.

To work with this, we found the following alternatives:

  1. Use Databricks notebook to read data from non-Delta Tables in Databricks. The data can be stored in a staging Blob Storage.
  2. To upload the data into the destination table, we will again need to use a Databricks notebook as we are not able to modify the cluster configuration.

Here is the solution we came up with:

In this architecture, the ADB 1 notebook is reading data from Databricks Table A.1 and storing it in the staging blob storage in the parquet format. Only specific users are allowed to access eastus data tables, so the notebook has to be run in their account. The linked service configuration of the Azure Databricks notebook requires us to manually specify: workspace URL, cluster ID and personal access token. All the data transfer is in the same region so no bandwidth charges accrue.

Next the Databricks ADB 2 notebook is accesses the parquet file in the blob storage and loads the data in the Databricks Delta Table A.2.

The above sequence is managed by the Azure Data Factory and we are using Run ID as filenames (declared as parameters) on the storage account. This pipeline is configured to be run daily.

The daily run of the pipeline would lead to a lot of data in the Azure Storage blob as we don’t have any step that cleans up the staging files. We have used the Azure Storage Blob lifecycle management to delete all files not modified for 15 days to be deleted automatically.

Handling a peculiar DATETIME issue in Telerik Reporting

We got a request from our customer wanting to use a CSV file as Data Source in Telerik Report Designer to generate the report. They were unable to parse DateTime column to DateTime data type in Telerik Report Designer because the CSV returns the column with double quotes for example (“4/14/2021 12:42:25PM”).

Values can be easily cast to DateTime in Telerik Reporting during the datasource definition. The issue here was the quotes that came in the datasource.

In this blog post, we will explore how we can work with this scenario using Expressions provided by Telerik Report Designer..

First, we need to Add Data Source so click on Data in Menu then select the CSV Data Source and select and then click on Next button as below :

check the checkbox if CSV file has headers then click on NEXT button.

Now we the screen appear for Map columns to type and you can see the StartTime column below with double quotes:

Now we try to change the type of StartTime column string to DateTime. Now need to provide Date format in our case like yyyy-mm-dd hh:ss but in our case it will not work and the column will show blank.

Cause of blank again we need the change DateTime to String and Click on NEXT Button and Finish.

Now to the Data Source is connected with the report. the customer want the report with the parameter of data. So we need to convert StartTime(string) to StartTime(DateTime) using expression.

First we need to remove double quotes from StartTime using Replace function.

//Syntax of Replace function
=Replace(text, old substirng, new substring)

In a below code we take a text from Fields.StartTime and provide old substring that to be remove is double quotes in single quotes. and new substring is blank. 

=Replace(Fields.StartTime,'"',"")

We got a StartTime in string without double quotes. so now we need to convert String to DateTime using CDate(value) Funciton. In a CDate function we provide the above Replace function that returning the StartTime without double qoutes.

= CDate(Replace(Fields.StartTime,'"',""))

Now we got the StartTime with the type of DateTime. and the above Expression we can use any where we want like filter the data and making parameter range with StartTime.

How to create Data Model of SQL Azure database using Telerik Open Access ORM

In this post we will take a look on creating data model using Open Access ORM from a data base in SQL Azure.

image

To create data model add new item to project and select Telerik Open Access Domain Model from Data tab.

clip_image002

Next select option Populate from database and from drop down choose Microsoft SQL Azure. If you want you can change the model name and then click on the Next.

clip_image004

Telerik Open Access ORM does not provide you to create a connection in case of SQL Azure. You will have to manually copy and paste connection string of SQL Azure database to create Domain Model. You can find connection string of SQL Azure database in Windows Azure portal.

clip_image006

In the quick glance click on the Show Connection strings to see the connection string. Choose ADO.NET connection string. Do not forget to change the password with the real password.

clip_image007

Copy this connection string of ADO.NET and paste it to Set Connection Manually section and click on the Next.

clip_image009

Next you need to choose Schemas, Tables, Views, Stored Procedure to create data model. Let us say we are choosing only one table to create data model. After choosing items click on the next.

clip_image011

Now you can define naming rules of Classes, Fields and Properties if required. Let us leave to default and click on next to proceed further. In last step you can configure Code Generation Settings. Let us leave settings to default and click on Finish to generate data model.

Once Data Model is created you will find *.rlinq file added in the project. Click on *.rlinq file to view the data model.

clip_image012

In this way you can create a data model using Telerik Open Access ORM from a SQL Azure database. I hope you find this post useful. Thanks for reading.

Step by Step Creating WCF Data Service using Telerik Open Access ORM

In this post we will learn step by step to create WCF Data Service using Telerik Open Access ORM. We will create data model of data from SQL Server and further exposed that as WCF Data Service using Telerik Open Access ORM.

Learn more about Open Access ORM here

We are going to follow step by step approach. To create WCF Data Service follow the steps given below,

Step 1: Create Telerik Web Access Web Application project

After installation of Telerik Open Access ORM you will get Telerik Web Access Web Application project template while creating a new project. Go ahead and create a project selecting Telerik Web Access Web Application project template from Web group

image

Step 2: Create Domain Model

Next you will be prompted to create Domain Model. Open Access ORM allows you to create domain model from any of the following type of database. If required you can create Empty Domain Model as well.

image

Let us go ahead and create data model from Microsoft SQL Server. Choose Microsoft SQL Server from drop down and click on the Next.

Step 3: Setup Database connection

In this step you need to set up Database connection. Either you can provide Connection String manually or can create a New Connection. To create new connection click on Add New Connection. You will get window to add new connection. In this window provide database server name and choose database from the dropdown. To test the connection click on Test Connection and after successful testing click on Ok to add a new connection.

image

Step 4: Setup Database Connection

In this step you need to setup database connection. If you want you can change connection string name. Let us leave it to default and click on the Next.

Step 5: Choose Database Items

In this step you need to choose

  • Schemas
  • Tables
  • Views
  • Stored Procedure to create data model.

Let us say we are choosing only one table to create data model. After choosing items click on the next .

image

 

Step 6: Define Naming Rules

In this step you can define naming rules of Classes, Fields and Properties if required. Let us leave to default and click on next to proceed further

Step 7: Code Generation Settings

In this step you can configure Code Generation Settings. Let us leave settings to default and click on Finish to generate data model.

Once Data Model is created you will find *.rlinq file added in the project. Click on *.rlinq file to view the data model.

image

Step 8: Add Open Access Service

By step 7 you have created data model. Now let us go ahead and add WCF Data Service on the created Data Model. For this right click on the project and select Add Open Acccess Service. You will be prompted with Add Open Access Service Wizard. Before adding Open Access Service make sure that once you have built the project.

You need to choose

  • Context : Select Context from the drop down
  • Project : Use the existing project

image

Click on the Next after selecting Context and Project.

Step 9: Select Service Type

In this step you need to choose type of Service you want to create. Since we want to create WCF Data Service, let us choose that from the option and click on the next.

image

Step 10: Choose Entity to expose a part of the Service

In this step we need to choose entity to expose as part of service. Since there is only one entity choose that and click on the Next

image

In last step you can preview various References being added and changes to config files. Click on Finish to create Service. After WCF Data Service being created you can test that in browser. Run the application and browse to *.svc to test the service in browser.

You should get data of People in browser as below after successful creation of the service.

image

Conclusion

In this post we learnt to create WCF Data Service using Open Access ORM. I hope you find this post useful. Thanks for reading.