Why connect Excel with Databricks
Platforms like Databricks are rapidly growing in popularity because they can process extremely large datasets and crunch numbers quickly. Databricks processes raw data (e.g. insurance claims) into enriched data (e.g. profit forecasts). While this enriched data can be analysed within Databricks itself, it is usually analysed outside it using tools such as Power BI or Tableau, which have standard connectors to Databricks.
Excel remains the tool of choice for many professionals. Its versatility and widespread usage make it a valuable instrument for end-user computing (EUC) and analysis. It therefore makes perfect sense for Excel users to be able to connect to Databricks. At the time of writing, there is no standard connector between Excel and Databricks. You can install an ODBC connector on your machine, but installing and configuring such a connector may not be a viable option in many corporate environments.
In this post, we will walk through a less explored way to connect Excel with Databricks: using Power Query to call the Databricks REST API.
Two ways to get data from Databricks into Excel
We can use the Databricks REST API in two ways:
- Invoke the Databricks SQL API to submit SQL queries through a Databricks SQL endpoint (a brief sketch of this option follows the list)
- Invoke the Databricks Jobs API to execute a notebook on Databricks compute – this is the approach covered in the rest of this post
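If you already have access to a Databricks SQL warehouse, the first option is the more direct one. Purely as a point of comparison, below is a minimal Python sketch of that route, based on the Databricks SQL Statement Execution API; the workspace URL, token, and warehouse ID are placeholders you will need to replace. The rest of this post covers the second option.
import requests

ROOT_URL = "https://<databricks-instance>.cloud.databricks.com"  # your root Databricks URL
HEADERS = {"Authorization": "Bearer <your-access-token>"}

# Submit a SQL statement to a SQL warehouse and wait (up to 30 seconds) for an inline result
response = requests.post(
    f"{ROOT_URL}/api/2.0/sql/statements",
    headers=HEADERS,
    json={
        "warehouse_id": "<your-sql-warehouse-id>",
        "statement": "SELECT * FROM your_table LIMIT 10",
        "wait_timeout": "30s",
    },
).json()

if response["status"]["state"] == "SUCCEEDED":
    rows = response["result"]["data_array"]  # rows are returned as lists of strings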
Overview of the solution – Databricks Jobs API
Limitations of Power Query in this context
- No automatic polling: Power Query doesn’t have a mechanism to wait or loop until a job is completed.
- Default output format: By default, Databricks returns the output in JSON format, which requires complex transformation before it can be loaded into Excel.
- Output size limitations: The output returned through the API is truncated if it exceeds a certain size (about 1MB).
Workflow
This post proposes a sequence of implementation steps which overcomes the first two of these limitations. It is possible to overcome the third limitation as well, but we will not discuss that in this post. The solution presented here therefore only works if the data to be returned is small enough to stay within the output size limit mentioned above.
1. Create a PySpark notebook on Databricks – this notebook will process the SQL query and return the output as the response to the API call.
2. Trigger notebook execution – a Power Query query submits a job to Databricks, retrieves the jobRunID, and writes it to an Excel sheet.
3. Check job status – another Power Query query takes the jobRunID from the Excel sheet, checks the job status at specified intervals, and writes the current status (SUCCESS, FAILED, or IN_PROGRESS) back to Excel.
4. Retrieve output – once the job status is returned as SUCCESS, a final Power Query query fetches the output data from Databricks and loads it into a table in Excel.
5. Orchestrate using VBA – VBA automates the sequence: triggering the job, checking the status, and retrieving the output without requiring manual intervention.
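Before building this in Power Query, it helps to see the whole sequence of API calls in one place. The following is a minimal Python sketch of the same flow (submit the run, poll its status, fetch the notebook output, read the CSV from DBFS), intended only as a reference for the steps that follow; the workspace URL, token, cluster ID, and notebook path are placeholders, and error handling is kept to a bare minimum.
import base64
import time

import requests

ROOT_URL = "https://<databricks-instance>.cloud.databricks.com"  # your root Databricks URL
HEADERS = {"Authorization": "Bearer <your-access-token>"}


def run_query(sql_query, cluster_id, notebook_path):
    # Step 2: submit the notebook as a one-time job run
    submit = requests.post(
        f"{ROOT_URL}/api/2.1/jobs/runs/submit",
        headers=HEADERS,
        json={
            "run_name": "Excel refresh job",
            "existing_cluster_id": cluster_id,
            "notebook_task": {
                "notebook_path": notebook_path,
                "base_parameters": {"sql_query": sql_query},
            },
        },
    ).json()
    run_id = submit["run_id"]

    # Step 3: poll every 10 seconds until the run reaches a terminal state
    while True:
        state = requests.get(
            f"{ROOT_URL}/api/2.1/jobs/runs/get",
            headers=HEADERS,
            params={"run_id": run_id},
        ).json()["state"]
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            break
        time.sleep(10)
    if state.get("result_state") != "SUCCESS":
        raise RuntimeError(f"Run {run_id} did not succeed: {state}")

    # Step 4: read the CSV path returned by dbutils.notebook.exit(), then fetch the file from DBFS
    output = requests.get(
        f"{ROOT_URL}/api/2.1/jobs/runs/get-output",
        headers=HEADERS,
        params={"run_id": run_id},
    ).json()
    csv_path = output["notebook_output"]["result"]
    dbfs = requests.get(
        f"{ROOT_URL}/api/2.0/dbfs/read",
        headers=HEADERS,
        params={"path": csv_path},
    ).json()
    return base64.b64decode(dbfs["data"]).decode("utf-8")  # CSV text (up to 1MB per read)


# Example (replace the placeholders):
# csv_text = run_query("SELECT * FROM your_table LIMIT 10",
#                      "<your-cluster-id>",
#                      "/Users/your.email@your_domain.com/your_notebook")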
Implementing the solution
Before getting started, make sure you have the following information to hand (a quick way to sanity-check these values is sketched after the list):
- Root Databricks URL – it will look like https://<databricks-instance>.cloud.databricks.com, or https://<databricks-instance>.azuredatabricks.net if you are using Azure Databricks.
- Personal Access Token – you can generate this from the ‘Settings’ page of your account.
- Cluster ID of the cluster you wish to run the notebook on – you can find it under ‘Compute’ in the left navigation menu.
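As an optional sanity check before wiring anything into Excel, you can confirm that the URL, token, and cluster ID are all valid with a single call to the Clusters API, for example from Python; the placeholders below are assumptions you need to replace.
import requests

ROOT_URL = "https://<databricks-instance>.cloud.databricks.com"
HEADERS = {"Authorization": "Bearer <your-access-token>"}

# A successful response with a cluster state confirms the URL, token, and cluster ID are valid
cluster = requests.get(
    f"{ROOT_URL}/api/2.0/clusters/get",
    headers=HEADERS,
    params={"cluster_id": "<your-cluster-id>"},
).json()
print(cluster.get("state"))  # e.g. RUNNING or TERMINATED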
Refer to my GitHub repository for this project for the most up-to-date versions of the sample code provided here, as well as code for other, more complex use cases.
Step 1: Create a PySpark notebook on Databricks
You need to have a notebook ready which contains the SQL query you wish to execute. Optionally, you can have the SQL query string sent to the notebook as a parameter.
The output could be returned directly in JSON format within the response to the API call, but in this scenario we want the output in CSV format. To achieve this, we need to ensure our notebook does the following:
- The notebook writes the output to DBFS (Databricks’ internal file storage)
- The data is written to a single CSV file
- The notebook returns the location of the CSV file, which we can then fetch using Power Query
Sample code for such a notebook is shown below:
# Read the SQL query from the notebook parameter; fall back to a default query
dbutils.widgets.text("sql_query", "")
sql_query = dbutils.widgets.get("sql_query")
if not sql_query:
    sql_query = "SELECT * FROM your_table LIMIT 10"  # change table name
df = spark.sql(sql_query)

# -----
# Write the result to DBFS as a single CSV file with a unique name,
# then return the file's path as the notebook's exit value
import hashlib
import time

timestamp = str(int(time.time() * 10000))
short_hash = hashlib.md5(timestamp.encode()).hexdigest()[:6]
csv_path = f"dbfs:/tmp/api_responses/result_{short_hash}.csv"

df.coalesce(1).write.mode("overwrite").option("header", "true").csv(csv_path)

# Spark writes a folder at csv_path; pick out the single part file inside it
files = dbutils.fs.ls(csv_path)
csv_file = [file.path for file in files if file.path.endswith(".csv")][0]
dbutils.notebook.exit(csv_file)
This code writes all output to a single CSV file. If the output is large, the API response will be truncated to 1MB. If you want to fetch larger datasets, you can write the query output to multiple files in DBFS (instead of just one) and then read each of them separately, as sketched below. The parts can then be combined in Power Query before loading to Excel.
It is not recommended to use this approach to fetch very large datasets.
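For illustration, a rough sketch of that multi-file variant is shown below. It reuses the df, short_hash, and dbutils objects from the notebook above; the number of partitions and the output directory are arbitrary choices.
import json

# Write the output as several part files instead of a single CSV
output_dir = f"dbfs:/tmp/api_responses/result_{short_hash}"
df.repartition(8).write.mode("overwrite").option("header", "true").csv(output_dir)

# Return the list of part-file paths; each part can be fetched via the DBFS API and
# combined in Power Query (note that every part file carries its own header row)
part_files = [f.path for f in dbutils.fs.ls(output_dir) if f.path.endswith(".csv")]
dbutils.notebook.exit(json.dumps(part_files))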
Step 2: Trigger notebook execution
Next, create a Power Query query that calls the /jobs/runs/submit API to submit the notebook as a job run, and writes the returned jobRunID to a specific cell in Excel. A sample code to achieve this is as follows:
let
    // rootUrl should hold your root Databricks URL, e.g. defined as a Power Query parameter
    // API to submit the notebook job
    url = rootUrl & "/api/2.1/jobs/runs/submit",
    headers = [
        Authorization = "Bearer <your-access-token>",
        #"Content-Type" = "application/json"
    ],
    // update the parameters below as required
    body = Text.ToBinary(
        "{
            ""run_name"": ""Power Query Job"",
            ""existing_cluster_id"": ""<your-cluster-id>"",
            ""notebook_task"": {
                ""notebook_path"": ""/Users/your.email@your_domain.com/your_notebook"",
                ""base_parameters"": {
                    ""sql_query"": ""SELECT * FROM your_table LIMIT 10""
                }
            }
        }"
    ),
    response = Json.Document(Web.Contents(url, [Headers = headers, Content = body])),
    jobRunId = Text.From(response[run_id])
in
    Table.FromRecords({[jobRunID = jobRunId]})
Let’s say you named this query JobRunIDTable. It will be loaded into an Excel table with the same name, which the queries in the next steps read from.
Step 3: Check the job status
Once the job is submitted, it will take some time for the execution to complete, so every few seconds we need to check whether the job has finished. For this, we write another query which reads the jobRunID from Excel, checks the job status via the /jobs/runs/get API, and writes the status back to Excel.
let
    // Retrieve jobRunID from the Excel table loaded in Step 2
    Source = Excel.CurrentWorkbook(){[Name="JobRunIDTable"]}[Content],
    jobRunID = Source{0}[jobRunID],

    // API to check job status
    url = rootUrl & "/api/2.1/jobs/runs/get?run_id=" & jobRunID,
    headers = [
        Authorization = "Bearer <your-access-token>"
    ],
    response = Json.Document(Web.Contents(url, [Headers = headers])),

    // result_state is nested inside the state record and only appears once the run has finished
    state = response[state],
    resultStatus = if Record.HasFields(state, "result_state") then state[result_state] else "IN_PROGRESS",
    // end_time is returned as milliseconds since the Unix epoch
    endTime = if Record.HasFields(response, "end_time") and response[end_time] <> 0
        then #datetime(1970, 1, 1, 0, 0, 0) + #duration(0, 0, 0, response[end_time] / 1000)
        else null,

    // Output the status and timestamp
    statusTable = Table.FromRecords({
        [jobRunID = jobRunID, status = resultStatus, completion_time = endTime]
    })
in
    statusTable
Load the output of the above query into a table in Excel (on a sheet the VBA in Step 5 can read from, e.g. ‘StatusSheet’). This query needs to be refreshed repeatedly until we get a SUCCESS status.
Step 4: Retrieve output data from Databricks
Once the job is marked as "SUCCESS", use the jobRunID to retrieve the output data via the /jobs/runs/get-output API.
let
    // Retrieve jobRunID from the Excel table loaded in Step 2
    Source = Excel.CurrentWorkbook(){[Name="JobRunIDTable"]}[Content],
    jobRunID = Source{0}[jobRunID],

    // API to get the job output (the notebook's exit value, i.e. the CSV path)
    outputUrl = rootUrl & "/api/2.1/jobs/runs/get-output?run_id=" & jobRunID,
    headers = [
        Authorization = "Bearer <your-access-token>"
    ],
    outputResponse = Json.Document(Web.Contents(outputUrl, [Headers = headers])),

    // Extract the CSV path returned by dbutils.notebook.exit()
    csvPath = outputResponse[notebook_output][result],

    // API to download the CSV file from DBFS (the content is returned base64-encoded)
    dbfsReadUrl = rootUrl & "/api/2.0/dbfs/read?path=" & csvPath,
    csvResponse = Web.Contents(dbfsReadUrl, [Headers = headers]),
    csvContent = Binary.FromText(Json.Document(csvResponse)[data], BinaryEncoding.Base64),
    csvTable = Csv.Document(csvContent, [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv]),

    // Promote the header row written by the notebook (option "header", "true")
    result = Table.PromoteHeaders(csvTable, [PromoteAllScalars = true])
in
    result
This query will load your data as a table in Excel, which you can then analyse or build further computations on, depending on your use case.
Step 5: Orchestrate with VBA
We managed to get the data from Databricks by following the steps above. However, you will appreciate that this involves several manual touchpoints, so the risk of errors is high. It therefore makes sense to automate the whole process end-to-end, so the data is refreshed automatically with one click. Thankfully, you can achieve this easily using VBA. A sample code to achieve this is shown below:
Sub OrchestrateDatabricksJob()
    ' Refresh the query that triggers the notebook and writes the jobRunID
    ThisWorkbook.RefreshAll
    Application.Wait (Now + TimeValue("0:00:10")) ' Wait 10 seconds

    ' Loop until the job reaches a terminal status
    Dim status As String
    Do
        ThisWorkbook.Connections("CheckJobStatus").Refresh ' Adjust connection name
        Application.Wait (Now + TimeValue("0:00:10")) ' Wait 10 seconds between polls
        status = ThisWorkbook.Sheets("StatusSheet").Range("B2").Value ' Adjust cell reference
    Loop Until status = "SUCCESS" Or status = "FAILED"

    ' Refresh the query that loads the final output, but only if the job succeeded
    If status = "SUCCESS" Then
        ThisWorkbook.Connections("LoadOutputData").Refresh ' Adjust connection name
    Else
        MsgBox "Databricks job did not succeed. Status: " & status
    End If
End Sub
You will need to adjust the code above for your specific use case. Change the wait times depending on how long you expect your queries to take on Databricks, to avoid making unnecessary API calls. Ensure that the jobRunID is consistent across all parts of this workflow, and put a sufficient number of checks in place to minimize the possibility of errors.
Feel free to leave any comments, and I will do my best to respond. If you wish to contribute as a developer, the code can be found on this GitHub repository.