Why connect Excel with Databricks
Platforms like Databricks are rapidly growing in popularity because they can process extremely large datasets and crunch numbers quickly. Databricks processes raw data (e.g. insurance claims) into enriched data (e.g. profit forecasts). While this enriched data can be analysed within Databricks itself, it is usually analysed outside it using tools such as Power BI or Tableau, which have standard connectors to Databricks.
Excel remains the tool of choice for many professionals. Its versatility and widespread usage make it a valuable instrument for end-user computing (EUC) and analysis. It therefore makes perfect sense for Excel users to be able to connect to Databricks. At the time of writing, there is no standard connector between Excel and Databricks. You can install an ODBC connector on your machine, but installing and configuring such a connector may not be a viable option in many corporate environments.
In this post, we will walk through a less explored way to connect Excel with Databricks: using Power Query to call the Databricks REST API.
Two ways to get data from Databricks into Excel
We can use the Databricks REST API in two ways:
- Invoke the Databricks SQL API to submit SQL queries through a Databricks SQL endpoint (a brief sketch of this option follows the list)
- Invoke the Databricks Jobs API to execute a notebook on Databricks compute – this is the approach covered in the rest of this post
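If you already have access to a Databricks SQL warehouse, the first option is the more direct one. Purely as a point of comparison, below is a minimal Python sketch of that route, based on the Databricks SQL Statement Execution API; the workspace URL, token, and warehouse ID are placeholders you will need to replace. The rest of this post covers the second option.
import requests

ROOT_URL = "https://<databricks-instance>.cloud.databricks.com"  # your root Databricks URL
HEADERS = {"Authorization": "Bearer <your-access-token>"}

# Submit a SQL statement to a SQL warehouse and wait (up to 30 seconds) for an inline result
response = requests.post(
    f"{ROOT_URL}/api/2.0/sql/statements",
    headers=HEADERS,
    json={
        "warehouse_id": "<your-sql-warehouse-id>",
        "statement": "SELECT * FROM your_table LIMIT 10",
        "wait_timeout": "30s",
    },
).json()

if response["status"]["state"] == "SUCCEEDED":
    rows = response["result"]["data_array"]  # rows are returned as lists of strings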
Overview of the solution – Databricks Jobs API
Limitations of Power Query in this context
- No automatic polling: Power Query doesn’t have a mechanism to wait or loop until a job is completed.
- Default output format: By default, Databricks returns the output in JSON format, which requires complex transformation before it can be loaded into Excel.
- Output size limitations: The output returned through the API is truncated if it exceeds a certain size (about 1MB).
Workflow
This post proposes a sequence of implementation steps which overcomes the first two of these limitations. It is possible to overcome the third limitation as well, but we will not discuss that in this post. The solution presented here therefore only works if the data to be returned is small enough to stay within the output size limit mentioned above.
1. Create a PySpark notebook on Databricks – this notebook will process the SQL query and return the output as the response to the API call.
2. Trigger notebook execution – a Power Query query submits a job to Databricks, retrieves the jobRunID, and writes it to an Excel sheet.
3. Check job status – another Power Query query takes the jobRunID from the Excel sheet, checks the job status at specified intervals, and writes the current status (SUCCESS, FAILED, or IN_PROGRESS) back to Excel.
4. Retrieve output – once the job status is returned as SUCCESS, a final Power Query query fetches the output data from Databricks and loads it into a table in Excel.
5. Orchestrate using VBA – VBA automates the sequence: triggering the job, checking the status, and retrieving the output without requiring manual intervention.
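Before building this in Power Query, it helps to see the whole sequence of API calls in one place. The following is a minimal Python sketch of the same flow (submit the run, poll its status, fetch the notebook output, read the CSV from DBFS), intended only as a reference for the steps that follow; the workspace URL, token, cluster ID, and notebook path are placeholders, and error handling is kept to a bare minimum.
import base64
import time

import requests

ROOT_URL = "https://<databricks-instance>.cloud.databricks.com"  # your root Databricks URL
HEADERS = {"Authorization": "Bearer <your-access-token>"}


def run_query(sql_query, cluster_id, notebook_path):
    # Step 2: submit the notebook as a one-time job run
    submit = requests.post(
        f"{ROOT_URL}/api/2.1/jobs/runs/submit",
        headers=HEADERS,
        json={
            "run_name": "Excel refresh job",
            "existing_cluster_id": cluster_id,
            "notebook_task": {
                "notebook_path": notebook_path,
                "base_parameters": {"sql_query": sql_query},
            },
        },
    ).json()
    run_id = submit["run_id"]

    # Step 3: poll every 10 seconds until the run reaches a terminal state
    while True:
        state = requests.get(
            f"{ROOT_URL}/api/2.1/jobs/runs/get",
            headers=HEADERS,
            params={"run_id": run_id},
        ).json()["state"]
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            break
        time.sleep(10)
    if state.get("result_state") != "SUCCESS":
        raise RuntimeError(f"Run {run_id} did not succeed: {state}")

    # Step 4: read the CSV path returned by dbutils.notebook.exit(), then fetch the file from DBFS
    output = requests.get(
        f"{ROOT_URL}/api/2.1/jobs/runs/get-output",
        headers=HEADERS,
        params={"run_id": run_id},
    ).json()
    csv_path = output["notebook_output"]["result"]
    dbfs = requests.get(
        f"{ROOT_URL}/api/2.0/dbfs/read",
        headers=HEADERS,
        params={"path": csv_path},
    ).json()
    return base64.b64decode(dbfs["data"]).decode("utf-8")  # CSV text (up to 1MB per read)


# Example (replace the placeholders):
# csv_text = run_query("SELECT * FROM your_table LIMIT 10",
#                      "<your-cluster-id>",
#                      "/Users/your.email@your_domain.com/your_notebook")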
Implementing the solution
Before getting started, make sure you have the following information to hand (a quick way to sanity-check these values is sketched after the list):
- Root Databricks URL – it will look like https://<databricks-instance>.cloud.databricks.com, or https://<databricks-instance>.azuredatabricks.net if you are using Azure Databricks.
- Personal Access Token – you can generate this from the ‘Settings’ page of your account.
- Cluster ID of the cluster you wish to run the notebook on – you can find it under ‘Compute’ in the left navigation menu.
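As an optional sanity check before wiring anything into Excel, you can confirm that the URL, token, and cluster ID are all valid with a single call to the Clusters API, for example from Python; the placeholders below are assumptions you need to replace.
import requests

ROOT_URL = "https://<databricks-instance>.cloud.databricks.com"
HEADERS = {"Authorization": "Bearer <your-access-token>"}

# A successful response with a cluster state confirms the URL, token, and cluster ID are valid
cluster = requests.get(
    f"{ROOT_URL}/api/2.0/clusters/get",
    headers=HEADERS,
    params={"cluster_id": "<your-cluster-id>"},
).json()
print(cluster.get("state"))  # e.g. RUNNING or TERMINATED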
Refer to my GitHub repository for this project for the most up-to-date versions of the sample code provided here, as well as code for other, more complex use cases.
Step 1: Create a PySpark notebook on Databricks
You need to have a notebook ready which contains the SQL query you wish to execute. Optionally, you can have the SQL query string sent to the notebook as a parameter.
The output could be returned directly in JSON format within the response to the API call, but in this scenario we want the output in CSV format. To achieve this, we need to ensure our notebook does the following:
- The notebook writes the output to DBFS (Databricks’ internal file storage)
- The data is written to a single CSV file
- The notebook returns the location of the CSV file, which we can then fetch using Power Query
Sample code for such a notebook is shown below:
# Read the SQL query from the notebook parameter; fall back to a default query
dbutils.widgets.text("sql_query", "")
sql_query = dbutils.widgets.get("sql_query")
if not sql_query:
    sql_query = "SELECT * FROM your_table LIMIT 10"  # change table name
df = spark.sql(sql_query)

# -----
# Write the result to DBFS as a single CSV file with a unique name,
# then return the file's path as the notebook's exit value
import hashlib
import time

timestamp = str(int(time.time() * 10000))
short_hash = hashlib.md5(timestamp.encode()).hexdigest()[:6]
csv_path = f"dbfs:/tmp/api_responses/result_{short_hash}.csv"

df.coalesce(1).write.mode("overwrite").option("header", "true").csv(csv_path)

# Spark writes a folder at csv_path; pick out the single part file inside it
files = dbutils.fs.ls(csv_path)
csv_file = [file.path for file in files if file.path.endswith(".csv")][0]
dbutils.notebook.exit(csv_file)
This code writes all output to a single CSV file. If the output is large, the API response will be truncated to 1MB. If you want to fetch larger datasets, you can write the query output to multiple files in DBFS (instead of just one) and then read each of them separately, as sketched below. The parts can then be combined in Power Query before loading to Excel.
It is not recommended to use this approach to fetch very large datasets.
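For illustration, a rough sketch of that multi-file variant is shown below. It reuses the df, short_hash, and dbutils objects from the notebook above; the number of partitions and the output directory are arbitrary choices.
import json

# Write the output as several part files instead of a single CSV
output_dir = f"dbfs:/tmp/api_responses/result_{short_hash}"
df.repartition(8).write.mode("overwrite").option("header", "true").csv(output_dir)

# Return the list of part-file paths; each part can be fetched via the DBFS API and
# combined in Power Query (note that every part file carries its own header row)
part_files = [f.path for f in dbutils.fs.ls(output_dir) if f.path.endswith(".csv")]
dbutils.notebook.exit(json.dumps(part_files))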
Step 2: Trigger notebook execution
Next, create a Power Query query that calls the /jobs/runs/submit API to submit the notebook as a job run, and writes the returned jobRunID to a specific cell in Excel. A sample code to achieve this is as follows:
let
    // rootUrl should hold your root Databricks URL, e.g. defined as a Power Query parameter
    // API to submit the notebook job
    url = rootUrl & "/api/2.1/jobs/runs/submit",
    headers = [
        Authorization = "Bearer <your-access-token>",
        #"Content-Type" = "application/json"
    ],
    // update the parameters below as required
    body = Text.ToBinary(
        "{
            ""run_name"": ""Power Query Job"",
            ""existing_cluster_id"": ""<your-cluster-id>"",
            ""notebook_task"": {
                ""notebook_path"": ""/Users/your.email@your_domain.com/your_notebook"",
                ""base_parameters"": {
                    ""sql_query"": ""SELECT * FROM your_table LIMIT 10""
                }
            }
        }"
    ),
    response = Json.Document(Web.Contents(url, [Headers = headers, Content = body])),
    jobRunId = Text.From(response[run_id])
in
    Table.FromRecords({[jobRunID = jobRunId]})
Let’s say you named this query JobRunIDTable. It will be loaded into an Excel table with the same name, which the queries in the next steps read from.
Step 3: Check the job status
Once the job is submitted, it will take some time for the execution to complete, so every few seconds we need to check whether the job has finished. For this, we write another query which reads the jobRunID from Excel, checks the job status via the /jobs/runs/get API, and writes the status back to Excel.
let
    // Retrieve jobRunID from the Excel table loaded in Step 2
    Source = Excel.CurrentWorkbook(){[Name="JobRunIDTable"]}[Content],
    jobRunID = Source{0}[jobRunID],

    // API to check job status
    url = rootUrl & "/api/2.1/jobs/runs/get?run_id=" & jobRunID,
    headers = [
        Authorization = "Bearer <your-access-token>"
    ],
    response = Json.Document(Web.Contents(url, [Headers = headers])),

    // result_state is nested inside the state record and only appears once the run has finished
    state = response[state],
    resultStatus = if Record.HasFields(state, "result_state") then state[result_state] else "IN_PROGRESS",
    // end_time is returned as milliseconds since the Unix epoch
    endTime = if Record.HasFields(response, "end_time") and response[end_time] <> 0
        then #datetime(1970, 1, 1, 0, 0, 0) + #duration(0, 0, 0, response[end_time] / 1000)
        else null,

    // Output the status and timestamp
    statusTable = Table.FromRecords({
        [jobRunID = jobRunID, status = resultStatus, completion_time = endTime]
    })
in
    statusTable
Load the output of the above query into a table in Excel (on a sheet the VBA in Step 5 can read from, e.g. ‘StatusSheet’). This query needs to be refreshed repeatedly until we get a SUCCESS status.
Step 4: Retrieve output data from Databricks
Once the job is marked as "SUCCESS", use the jobRunID to retrieve the output data via the /jobs/runs/get-output API.
let
    // Retrieve jobRunID from the Excel table loaded in Step 2
    Source = Excel.CurrentWorkbook(){[Name="JobRunIDTable"]}[Content],
    jobRunID = Source{0}[jobRunID],

    // API to get the job output (the notebook's exit value, i.e. the CSV path)
    outputUrl = rootUrl & "/api/2.1/jobs/runs/get-output?run_id=" & jobRunID,
    headers = [
        Authorization = "Bearer <your-access-token>"
    ],
    outputResponse = Json.Document(Web.Contents(outputUrl, [Headers = headers])),

    // Extract the CSV path returned by dbutils.notebook.exit()
    csvPath = outputResponse[notebook_output][result],

    // API to download the CSV file from DBFS (the content is returned base64-encoded)
    dbfsReadUrl = rootUrl & "/api/2.0/dbfs/read?path=" & csvPath,
    csvResponse = Web.Contents(dbfsReadUrl, [Headers = headers]),
    csvContent = Binary.FromText(Json.Document(csvResponse)[data], BinaryEncoding.Base64),
    csvTable = Csv.Document(csvContent, [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv]),

    // Promote the header row written by the notebook (option "header", "true")
    result = Table.PromoteHeaders(csvTable, [PromoteAllScalars = true])
in
    result
This query will load your data as a table in Excel, which you can then analyse or build further computations on, depending on your use case.
Step 5: Orchestrate with VBA
We managed to get the data from Databricks by following the steps above. However, you will appreciate that this involves several manual touchpoints, so the risk of errors is high. It therefore makes sense to automate the whole process end-to-end, so the data is refreshed automatically with one click. Thankfully, you can achieve this easily using VBA. A sample code to achieve this is shown below:
Sub OrchestrateDatabricksJob()
    ' Refresh the query that triggers the notebook and writes the jobRunID
    ThisWorkbook.RefreshAll
    Application.Wait (Now + TimeValue("0:00:10")) ' Wait 10 seconds

    ' Loop until the job reaches a terminal status
    Dim status As String
    Do
        ThisWorkbook.Connections("CheckJobStatus").Refresh ' Adjust connection name
        Application.Wait (Now + TimeValue("0:00:10")) ' Wait 10 seconds between polls
        status = ThisWorkbook.Sheets("StatusSheet").Range("B2").Value ' Adjust cell reference
    Loop Until status = "SUCCESS" Or status = "FAILED"

    ' Refresh the query that loads the final output, but only if the job succeeded
    If status = "SUCCESS" Then
        ThisWorkbook.Connections("LoadOutputData").Refresh ' Adjust connection name
    Else
        MsgBox "Databricks job did not succeed. Status: " & status
    End If
End Sub
You will need to adjust the code above for your specific use case. Change the wait times depending on how long you expect your queries to take on Databricks, to avoid making unnecessary API calls. Ensure that the jobRunID is consistent across all parts of this workflow, and put a sufficient number of checks in place to minimize the possibility of errors.
Feel free to leave any comments, and I will do my best to respond. If you wish to contribute as a developer, the code can be found on this GitHub repository.