The ZIP file problem
Dealing with data stored in ZIP files can be challenging. If you are end user of this data, trying to analyse it in Excel, you will have no other option but to navigate to the location of the zip file, and manually unzip its contents. However, with Power Query, you can do this task in an automated way, taking only seconds to get to the data you need from the zip archive. You can even connect to, and extract data from multiple zip files seamlessly, saving you hours of manual work.
In this tutorial, you’ll learn how to connect to a ZIP archive and extract the data from underlying Excel or flat (CSV or TXT) files.
The Power Query solution
The solution goes as follows:
- Define a function to unzip contents from a given ZIP file
- Connect and load your ZIP file in Power Query
- Execute the function in defined in step 1 to get the unzipped files
- Extract the data and attributes from unzipped files and perform your intended transformations
Let’s go through each of these steps in detail over the next few sections.
Step 1: Define the unzipContents
function
The unzipContents
function extracts all files (including their folder paths and binary content) from a ZIP archive. Create a blank query in Power Query editor, and paste the following code in Advanced Editor. Name it unzipContents
or something else as needed.
(ZIPFile) =>
let
// Define the ZIP file header format
Header = BinaryFormat.Record([
MiscHeader = BinaryFormat.Binary(14),
BinarySize = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
FileSize = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
FileNameLen= BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
ExtrasLen = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian)
]),
// Define how to process each header entry
HeaderChoice = BinaryFormat.Choice(
BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
each if _ <> 67324752 // Check for valid header signature
then BinaryFormat.Record([IsValid = false, Filename = null, Content = null])
else BinaryFormat.Choice(
BinaryFormat.Binary(26),
each BinaryFormat.Record([
IsValid = true,
Filename = BinaryFormat.Text(Header(_)[FileNameLen]),
Extras = BinaryFormat.Text(Header(_)[ExtrasLen]),
Content = BinaryFormat.Transform(
BinaryFormat.Binary(Header(_)[BinarySize]),
(x) => try Binary.Buffer(Binary.Decompress(x, Compression.Deflate)) otherwise null
)
]),
type binary
)
),
// Define the full ZIP file format
ZipFormat = BinaryFormat.List(HeaderChoice, each _[IsValid] = true),
// Extract entries from the ZIP file
Entries = List.Transform(
List.RemoveLastN(ZipFormat(ZIPFile), 1),
(e) => [FileName = e[Filename], Content = e[Content]]
),
// Convert to a table
FilesTable = Table.FromRecords(Entries)
in
FilesTable
The code above has not been created by me, it has been taken from another source and full credit to the author.
The function above takes the loaded ZIP file binary as the input, and returns a table containing the unzipped files within the ZIP binary.
Step 2: Load the source ZIP archive
let
// Load ZIP file from a specified location
FilePath = "C:\path\to\your\zipfile.zip", // Replace with your ZIP file's location
ZIPFileBinary = File.Contents(FilePath)
This will load the ZIP archive in binary format for further downstream use.
Note that:
- the command File.Contents(FilePath) will only work if the file is located on your computer, or a network drive. If you are trying to access a file located on an online source, such as SharePoint, or a Teams site, then use Web.Contents(FilePath) instead.
- FilePath needs to be the full path to the file, including the filename with extension.
Step 3: Extract files from the ZIP archive
Your next code statement is simply to call and execute the function defined in step 1, by providing the ZIP file binary created in step 2. Then you filter and load contents of the extracted files.
Add the following commands to your query from step 2:
// continue from previous step
// Call the function created in step 1
ExtractedFiles = unzipContents(ZIPFileBinary)
// Filter the extracted files to include only the ones you require
ProcessedFiles = Table.AddColumn(
ExtractedFiles,
"FileData",
each
// Modify the code below for the specific types of files you are dealing with
if Text.EndsWith([FileName], ".xlsx") then
try Excel.Workbook([Content], null, true) otherwise error "Invalid Excel file"
else if Text.EndsWith([FileName], ".csv") then
try Csv.Document([Content]) otherwise error "Invalid CSV file"
else
null
)
This first statement ExtractedFiles will create a table containing:
FileName
: The full path of each file within the ZIP.Content
: The binary content of each file.
Note that:
- The command to load a file, such as Excel.Workbook is specific to the type of file you are trying to load. Because of this, filtering and segregation of extracted files becomes important, because you will receive an error if you try to load a .csv is file with Excel.Workbook method.
- Even if the ZIP archive contains all files of the same format, I still advise to apply the filtering step in the code above.
Step 4: Extract data from unzipped files
Expand the parsed data to view file contents:
// continue from previous step
ExpandedData = Table.ExpandTableColumn(
Table.RemoveColumns(ProcessedFiles, {"Content"}),
"FileData",
{"Name", "Data", "Kind"}
),
// Include further transformations below
FurtherTransformedData = .....
in
// Final output
FurtherTransformedData
This will display:
Name
: Sheet or table names (for Excel files).Data
: The content of the files as tables.Kind
: The type of content (Sheet
,Table
, orDelimited
).
- The binary content of each of the files extracted from the ZIP archive has been removed
- For each extracted
.xlsx
file, “Data” column would contain the data in each sheet. This step is where you will see the data you wanted to extract. - Once you have the data you wanted, you can then apply further transformations as needed by your process.
- If files are nested in folders inside the ZIP archive, then you will need a couple of more lines of code to handle the folder structure, before
ExpandedData
step.
The complete code (excluding the function), is provided below for completeness.
let
// Load ZIP file
FilePath = "C:\path\to\your\zipfile.zip",
ZIPFileBinary = Binary.Load(FilePath),
// Extract files from the ZIP
ExtractedFiles = unzipContents(ZIPFileBinary),
// Filter and process files
ProcessedFiles = Table.AddColumn(
ExtractedFiles,
"FileData",
each
if Text.EndsWith([FileName], ".xlsx") then
try Excel.Workbook([Content], null, true) otherwise error "Invalid Excel file"
else if Text.EndsWith([FileName], ".csv") then
try Csv.Document([Content]) otherwise error "Invalid CSV file"
else
null
),
// Expand the columns
ExpandedData = Table.ExpandTableColumn(
Table.RemoveColumns(ProcessedFiles, {"Content"}),
"FileData",
{"Name", "Data", "Kind"}
)
// Add further transformations and modify the code below accordingly
in
ExpandedData