Automate data extraction from Excel using Python

Optimize My Day Job
4 min readApr 12, 2021

This week at work, I automated the process of extracting historic data from thousands of excel files using Python. Here is how.

You might be in the same situation as me, where you need to get some data from a daily excel report stored in one or multiple folders on some drive. The report could be orders received, or whatever your colleagues or you have decided to save in excel day after day, year after year.

https://unsplash.com/photos/Wpnoqo2plFA

First of. This is important. The success of this task heavily depends on the current structure or lack of in the folder and naming convention of your excel report savings. With a lot of structure in the storage and naming of the current files, this problem is fairly simple to crack.

A scenario with good structure, could be a case where all the excel reports were stored alone in a single folder, with a tight naming convention; such as “daily_sales_report_2021–04–12”. A scenario with bad structure, would be a case where the files were stored in several different folders, with none of them following a naming convention.

This guide assumes a good structure, however if you are in the unfortunate situation of having bad or no structure I believe that you can still use bits and pieces from this small guide.

Example of a case with good structure

First of we are going to create some fake excel reports, as in the picture above. The code below creates twenty-five fake reports with the naming convention “Daily_Sales” + a date. The report have four columns each having one hundred random integers as observations.

import numpy as np
import pandas as pd
import openpyxl
from datetime import datetime as dt
from pathlib import Path


path_save = Path('PATH WHERE YOU ARE SAVING THE FAKE REPORTS')
start_date = pd.to_datetime('2021-03-01', format='%Y-%m-%d')
date_range = pd.date_range(start_date, periods=25, freq='d')
rng = np.random.default_rng()

for i in date_range:
df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list('ABCD'))
date = dt.strptime(str(i), '%Y-%m-%d %H:%M:%S').date()
df.to_excel(path_save / f'Daily_Sales_{date}.xlsx')

The above data is of course not 1:1 matching the data in you specific situation, however with a little bit of fiddling it will be easy to adjust this tutorial to fit your needs.

Now we pretend that we have not just created this data ourselves but instead it was created by one of our colleagues, who know asks you to gather the sales data from the last 25 days and make it into a graph. We definitely do not want to extract this data manually from each file, so we create a loop with python instead.

import pandas as pd
import os
import re
from pathlib import Path

After imports, you need to setup the path where you have stored the excel files. This is used for the loop later on.

Path to Excel reports

Next up, you need to inspect the naming conventions of the files in question, and specify some regex that matches your file names. As I forget regex as fast as I learn it, I always use one of these online regex testers to help me write something that works: https://regex101.com/.

The regex below is very simple, it looks for “Daily_Sales_” + four digits + “-” + two digits + “-” + two digits + “.” + “xlsx”.

Regex compiler

Next up, you need to setup an empty dictionary and an empty DataFrame. The dictionary is used to easily store filenames and dates, and the DataFrame is used to store the final dataset with the combined extract from the excel files.

Empty dictionary and DataFrame

The loop below does as follows:

  1. LOOP — Looks in the path (folder) and lists the elements (files) in the folder.
  2. IF — Compares the filename (file) with the regex pattern specified before.
  3. Extracts the date from the filename using two str.split() operations.
  4. Adds filename and date to the file_dict dictionary.
  5. Read the excel files and saves the data as the DataFrame data.
  6. Adds a Date column to the DataFrame data.
  7. Appends the DataFrame data to the DataFrame dataset.
  8. END IF
  9. NEXT LOOP ITERATION
Loop to extract and save data from Excel reports

This concludes this small tutorial. The most common places you need to custom fit this code are: regex, date extraction using str.split() and maybe an additional loop if you are looking into several folders.

Code below:

import pandas as pd
import os
import re
from pathlib import Path

path_save = Path('C:/Users/kisum/PycharmProjects/excel_automation/daily_report/')

regex_file = re.compile('Daily_Sales_[0-9]{4}-[0-9]{2}-[0-9]{2}\.xlsx')

file_dict = {}
dataset = pd.DataFrame(columns=['Date','A','B','C','D'])

for file in os.listdir(path_save):
if re.fullmatch(regex_file, file):
date = str(file).split(sep='Daily_Sales_')[1].split(sep='.xlsx')[0]
file_dict[file] = date

data = pd.read_excel(path_save / file, header=0, index_col=0, engine='openpyxl')
data['Date'] = date

dataset = dataset.append(data)

dataset.to_csv('dataset.csv', sep=';')

--

--

Optimize My Day Job

An programming amateur from Denmark, who tries to make his and your life easier with code.