[4/5] Power Query's automated data cleaning and referencing in a nutshell (The Power Series Way, Part 1)


  • Challenge level: 2 out of 5

  • Estimated reading time: 14 min

  • Power Query functions covered: Change Data Type, Remove Duplicates, Auto-Generate Calendar Table (using M language)

  • Excel attachment reference sheets: template_Power Series


BEFORE YOU START: Get your FREE Excel template that accompanies this five-part tutorial by subscribing here!


Once you have both Power Query and Power Pivot up and running, we can start using these powerful tools to calculate all of the above figures a lot quicker than it originally took with “normal” Excel functions.

We have already built a template that allows you to just copy and paste your own data into it and get monthly, YOY, L12M etc. figures. All you need to do is  hit the “Refresh All” button under the Data tab and see the magic happen!

It’s pretty simple to build this from scratch too. No macros. No VBA. It is, amazingly possible. Here’s how:  

Power Query in action

 

Step 1. Set-up your first query: Get your primary data table in the right format

Step 1.1. Select “Get & Transform Data” > From Table/Range

Using the same set of raw data you’ve input in sheet “raw data”, select any cell within the range of raw data, go to the "Data" tab. Under the "Get & Transform Data" section, select the "From Table/Range" button.

Step 1.2. Tick “My table has headers”

When you select this the first time, you will be prompted to convert your data into Table form. Make sure that you tick the little box that says “My table has headers” so that your original headers do not get overwritten. Check also that the selected range is correct.

 Power Query create table from table

Step 1.3. Activate Power Query editor window

Once you click on the “OK” button, Power Query Editor window will pop up immediately.

If you do not see the formula bar, click on the “View” tab. Under the “Layout” section, tick the box for "Formula Bar".

 Power Query editor Formula Bar

Step 1.4 Remove unnecessary column

Right-click on the YYYYMM header, and select “Remove Columns”. With Power Query and Power Pivot, we no longer need this helper column to aggregate by month.

 Power Query remove unnecessary column

Step 1.5. Set data types

Make sure you set your data to the correct types by clicking on the small icons next to the header words. In this case, we want the dates to be in date format, category to be Text, and value to be Whole Numbers.

Step 1.6. Name query

Before closing and loading this query, make sure to also name (choose a relevant name) the query (aka the table created). Here, we’ve named it “Data”. The importance of this will be become apparent as we proceed further into the template set-up.

 Power Query set Date data type and name query for easy reference

Step 1.7. Select “Home” > Close and Load To…

Finally, select the “Close & Load” button on the top left-hand corner on the “Home” tab. Click on the words portion of the button and a dropdown list will appear. You want to choose the “Close & Load To…” option. A separate window will pop up when you choose this option.

Step 1.8. Choose “Only Create Connection” and check “Add this data to the Data Model”

Make sure you select the correct options. First, choose to “Only Create Connection” as the option on how to view the data in Excel. Secondly, check the box at the bottom that says “Add this data to the Data Model”. We will explain in awhile what this “Data Model” is about.

power-query-import-data-only-create-connection-add-this-to-data-model.png

Step 1.9. Load data

Once the above is done, hit the “OK” button. You now have your primary data table set-up!

 

Step 2. Set-up your second query: Create a sub-Category table

Creating a separate sub-Category table has its benefits, especially if you are intending to use slicers later to filter your data via these categories. The filtering will be done off this Category table which has significantly lesser rows (i.e. 21 rows) than the primary data table (i.e. 700+ rows in this case but imagine if it has over 10,000 or 200,000 rows etc.). It's a much faster process.  

Step 2.1. Create reference query

With the first query created, a Query list should appear on the right-hand side of the worksheet. (If you don’t see it, click on “Show Queries” under the Data tab’s Get & Transform.) Right-click on the “Data” query and select “Reference”. A new query will be set up based off this first query.

 Power Query create reference query

Step 2.2. Multi-select unnecessary columns

Once the Power Query Editor window has been loaded, hold on to the ctrl key while clicking on the date, value and YYYYMM columns.

Step 2.3. Remove unnecessary columns

When all three columns have been selected, release the ctrl key and right-click on any of the three headers. Select “Remove Columns” from the dropdown list.

 Power Query editor Remove Columns

Step 2.4. Remove duplicate categories

Right-click on the remaining “category” column header and choose “Remove Duplicates” from the dropdown list.

 Power Query editor Remove Duplicates

Step 2.5. Add new categories

If you would like or need to add on further categorisation to the cleaned-up category list, you can do so by adding additional columns through the Add Column tab. In our template, we inserted a larger umbrella grouping of Cellphone, Camera, Computer, Home system and Others. To do this, click on the “Column from Examples” button and  select “From All Columns” option. Start keying in your additional categorisation in the column that appears on the right. Once you are done, click on the “OK” button.

 Power Query editor Column From Examples

Step 2.6. Name query

To change the name of the header, double-click on it directly and make the edit. Finally, remember to assign a relevant name to the query. Here, we’ve named it “Category”. Follow steps 1.6 and 1.7 to close and load the Category table.

 Power Query editor Close & Load To

 

Step 3. Set-up your third query: Build a sub-calendar table

In essence, the separate calendar table here exists for the same purpose as the reference date table does for our Day 1 to 3 Excel template. It enables Power Pivot formulas to carry out dynamic calculations that factor in time period sensitivity. As with the separate Category table, it also allows the calculations to run at a much faster speed with less rows to filter through.  

Step 3.1. Create another reference query

Repeat Step 2.1. to create a new query based off the first query (titled “Data”).

Step 3.2. Activate Power Query editor’s Advanced Editor window

With the Power Query Editor window opened, click on the “Advanced Editor” button that sits under the “Home” tab.

 Power Query activate Advanced Query Editing

Step 3.3. Copy and paste M language code that auto-generates a list of dates

Once the “Advanced Editor” window is loaded, copy and paste the below M-language code in:


let
    Source = Data,
    MaxDate = Record.Field( Table.Max( Source, "date"),"date"),
    MinDate = Record.Field( Table.Min( Source, "date"),"date"),
    DaysElapsed = Number.From( MaxDate-MinDate),
    DatesList = List.Dates( MinDate, DaysElapsed + 1, Duration.From(1)),
    RawDatesTable = Table.FromList( DatesList, Splitter.SplitByNothing(), {"date"}, null, ExtraValues.Error),
    ChangedType2 = Table.TransformColumnTypes( RawDatesTable,{{"date", type date}}),
    InsertedDay = Table.AddColumn( ChangedType2, "Day", each Date.Day([date]), type number),
    InsertedMonth = Table.AddColumn( InsertedDay, "Month", each Date.Month([date]), type number),
    InsertedYear = Table.AddColumn( InsertedMonth, "Year", each Date.Year([date]), type number),
    #"Changed Type" = Table.TransformColumnTypes(InsertedYear,{{"Day", Int64.Type}, {"Month", Int64.Type}, {"Year", Int64.Type}})
in
    #"Changed Type"

Click on the “Done” button to load the code.

In short, what the M language code does is to create a list of dates based on the earliest and latest date you have in your data.

If (for whatever reason) you want to reference a different original source data to build the calendar table from, you need to change the line “Source = Data”. Replace the word “Data” with the name of the reference query you want to base the calendar table off.

Step 3.4. Name query

Remember to give the query a relevant name (we’ve named it “Calendar”) before repeating steps 1.5 and 1.6 to close and load the query.

Remember, once these queries are written, they will clean and set up your data the exact same way each time your raw data set gets updated. All you need to do is just hit the “Refresh All” button (under the Data tab)!