How to get data from ZIP files using Power Query

Power Query use cases
Share the post:
Did you ever have to process data from Excel spreadsheets or CSV files locked inside a ZIP archive? With Power Query, you can seamlessly extract and analyse these files without manually unzipping the archive. This post walks you through extracting files from a ZIP archive, handling folder structures, and processing multiple file formats like .xlsx and .csv.

The ZIP file problem

Dealing with data stored in ZIP files can be challenging. If you are end user of this data, trying to analyse it in Excel, you will have no other option but to navigate to the location of the zip file, and manually unzip its contents. However, with Power Query, you can do this task in an automated way, taking only seconds to get to the data you need from the zip archive. You can even connect to, and extract data from multiple zip files seamlessly, saving you hours of manual work.

In this tutorial, you’ll learn how to connect to a ZIP archive and extract the data from underlying Excel or flat (CSV or TXT) files.

The Power Query solution

The solution goes as follows:

  1. Define a function to unzip contents from a given ZIP file
  2. Connect and load your ZIP file in Power Query
  3. Execute the function in defined in step 1 to get the unzipped files
  4. Extract the data and attributes from unzipped files and perform your intended transformations
 

Let’s go through each of these steps in detail over the next few sections.

Step 1: Define the unzipContents function

The unzipContents function extracts all files (including their folder paths and binary content) from a ZIP archive. Create a blank query in Power Query editor, and paste the following code in Advanced Editor. Name it unzipContents or something else as needed.

				
					(ZIPFile) =>
let
    // Define the ZIP file header format
    Header = BinaryFormat.Record([
        MiscHeader = BinaryFormat.Binary(14),
        BinarySize = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
        FileSize   = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
        FileNameLen= BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
        ExtrasLen  = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian)    
    ]),

    // Define how to process each header entry
    HeaderChoice = BinaryFormat.Choice(
        BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
        each if _ <> 67324752 // Check for valid header signature
            then BinaryFormat.Record([IsValid = false, Filename = null, Content = null])
            else BinaryFormat.Choice(
                BinaryFormat.Binary(26),
                each BinaryFormat.Record([
                    IsValid  = true,
                    Filename = BinaryFormat.Text(Header(_)[FileNameLen]),
                    Extras   = BinaryFormat.Text(Header(_)[ExtrasLen]),
                    Content  = BinaryFormat.Transform(
                        BinaryFormat.Binary(Header(_)[BinarySize]),
                        (x) => try Binary.Buffer(Binary.Decompress(x, Compression.Deflate)) otherwise null
                    )
                ]),
                type binary
            )
    ),

    // Define the full ZIP file format
    ZipFormat = BinaryFormat.List(HeaderChoice, each _[IsValid] = true),

    // Extract entries from the ZIP file
    Entries = List.Transform(
        List.RemoveLastN(ZipFormat(ZIPFile), 1),
        (e) => [FileName = e[Filename], Content = e[Content]]
    ),

    // Convert to a table
    FilesTable = Table.FromRecords(Entries)
in
    FilesTable

				
			

The code above has not been created by me, it has been taken from another source and full credit to the author. 

The function above takes the loaded ZIP file binary as the input, and returns a table containing the unzipped files within the ZIP binary. 

Step 2: Load the source ZIP archive

Start writing your main query in Power Query editor. The query should first be to provide the location of the ZIP archive so the file can be loaded. Power Query doesn’t have an in-built connection for ZIP files, so again you will need to use the Power Query Advanced Editor to connect. Create another blank query in Power Query editor, and paste the following code:
				
					let
        // Load ZIP file from a specified location
        FilePath = "C:\path\to\your\zipfile.zip", // Replace with your ZIP file's location
        ZIPFileBinary = File.Contents(FilePath)

				
			

This will load the ZIP archive in binary format for further downstream use.

Note that: 

  • the command File.Contents(FilePath) will only work if the file is located on your computer, or a network drive. If you are trying to access a file located on an online source, such as SharePoint, or a Teams site, then use Web.Contents(FilePath) instead. 
  • FilePath needs to be the full path to the file, including the filename with extension.  
 

Step 3: Extract files from the ZIP archive

Your next code statement is simply to call and execute the function defined in step 1, by providing the ZIP file binary created in step 2. Then you filter and load contents of the extracted files. 

Add the following commands to your query from step 2:

				
					// continue from previous step
    // Call the function created in step 1
        ExtractedFiles = unzipContents(ZIPFileBinary)
    
    // Filter the extracted files to include only the ones you require
        ProcessedFiles = Table.AddColumn(
        ExtractedFiles,
        "FileData",
        each
    // Modify the code below for the specific types of files you are dealing with
            if Text.EndsWith([FileName], ".xlsx") then
                try Excel.Workbook([Content], null, true) otherwise error "Invalid Excel file"
            else if Text.EndsWith([FileName], ".csv") then
                try Csv.Document([Content]) otherwise error "Invalid CSV file"
            else
                null
    )

				
			

This first statement ExtractedFiles will create a table containing:

  • FileName: The full path of each file within the ZIP.
  • Content: The binary content of each file.
 
The next part of the code filters out the unnecessary files from the archive and retains only the ones you require. For each of the extracted files, a structured table is also created, which will be expanded into columns in the next step.
 

Note that: 

  • The command to load a file, such as Excel.Workbook is specific to the type of file you are trying to load. Because of this, filtering and segregation of extracted files becomes important, because you will receive an error if you try to load a .csv is file with Excel.Workbook method.
  • Even if the ZIP archive contains all files of the same format, I still advise to apply the filtering step in the code above.
 

Step 4: Extract data from unzipped files

Expand the parsed data to view file contents:

				
					// continue from previous step    
    ExpandedData = Table.ExpandTableColumn(
        Table.RemoveColumns(ProcessedFiles, {"Content"}),
        "FileData",
        {"Name", "Data", "Kind"}
    ),
    // Include further transformations below
    FurtherTransformedData = .....
in

// Final output
    FurtherTransformedData
				
			

This will display:

  • Name: Sheet or table names (for Excel files).
  • Data: The content of the files as tables.
  • Kind: The type of content (Sheet, Table, or Delimited).
 
Note that: 
  • The binary content of each of the files extracted from the ZIP archive has been removed
  • For each extracted .xlsx file, “Data” column would contain the data in each sheet. This step is where you will see the data you wanted to extract.
  • Once you have the data you wanted, you can then apply further transformations as needed by your process.
  • If files are nested in folders inside the ZIP archive, then you will need a couple of more lines of code to handle the folder structure, before ExpandedData step.
 
The complete code (excluding the function), is provided below for completeness.
				
					let
    // Load ZIP file
    FilePath = "C:\path\to\your\zipfile.zip",
    ZIPFileBinary = Binary.Load(FilePath),

    // Extract files from the ZIP
    ExtractedFiles = unzipContents(ZIPFileBinary),

    // Filter and process files
    ProcessedFiles = Table.AddColumn(
        ExtractedFiles,
        "FileData",
        each
            if Text.EndsWith([FileName], ".xlsx") then
                try Excel.Workbook([Content], null, true) otherwise error "Invalid Excel file"
            else if Text.EndsWith([FileName], ".csv") then
                try Csv.Document([Content]) otherwise error "Invalid CSV file"
            else
                null
    ),

    // Expand the columns
    ExpandedData = Table.ExpandTableColumn(
        Table.RemoveColumns(ProcessedFiles, {"Content"}),
        "FileData",
        {"Name", "Data", "Kind"}
    )
    
    // Add further transformations and modify the code below accordingly
in
    ExpandedData

				
			

Find out about our Data Engineering solutions

Share the post:

Related posts

Databricks to Excel using Power Query

How to get data from Databricks in Excel using Power Query

Databricks is a unified data engineering platform renowned for it’s ability to process large datasets. Excel remains the tool of choice for analysis and other end-use-computation. In this post, we will explore how we can use Databricks API to get data into Excel via Power Query.

Read More
Azure Virtual Desktop

Why small businesses should consider using Azure Virtual Desktop

For startups and small enterprises looking to future-proof their operations and empower a dynamic, remote workforce, Azure Virtual Desktop is an oustanding choice. By leveraging Azure, small businesses can enhance productivity, improve security, and position themselves to adapt to the ever-evolving business landscape, all while keeping costs under control and maximizing their potential for growth.

Read More

Leave a Reply

Your email address will not be published. Required fields are marked *