Loading...
Data VisualizationTips & TricksUse Cases

Validating & Analyzing Data in Tableau

Photo by Michael Burrows from Pexels

Introduction

Tableau is a tool which prides itself on being easy to use and support rapid development of different types of analyses for data. For this reason, I have frequently used it in different projects where ad-hoc data analysis and validation were needed. In this article, I plan to look at an example of this usage by giving some examples of how I used it to validate a data pipeline instead of relying on querying alone.

The scenario we will look at is how to validate different versions of a single table of a pipeline: there is the version of the table before the pipeline was changed, then after the new pipeline is deployed we have the new version of the table. Ideally, these tables will be the same.

The Data

In reality, the data for my example is much larger and more complex, but we can illustrate the different uses of Tableau with a simpler dataset. In this dataset, we have a series of Order IDs which have a Quantity sold of various Product IDs.

V1 of the data, before pipeline changes

We will additionally have the dataset which contains the data after the pipeline has changed. If you look closely you will see the information for Order ID 4 was duplicated. This is something that will become apparent in our validations. Additionally, we have an extra column added named TestColumn where every value is ‘A’.

V2 of the data, after pipeline changes

How can we validate these in Tableau?

There is no one way to handle things like this in Tableau, so I will simply show an approach I could follow and how some of the features of Tableau support that approach.

“Soft” unions

Tableau has the feature to allow you to union different tables together without worrying about if they have the same number of columns, or if the columns are in the same order: two requirements you often run into when trying to union data in SQL. I have no idea if there is another term for this, but I will refer to it as a “soft union”.


I created my validation dataset by unioning the two tables together in the data source page. This gives me a Tableau data source where I have both versions of the datasets available.

When you union tables together, Tableau automatically adds a column “Table Name” which contains the name of the sheet or table as its value. This will be useful for comparing the datasets.

Fixed calculations

Tableau has a wonderful feature which allows you to create calculations at varying levels of detail within a single dataset. They have different types of level of detail calculations, but the ones I will focus on are FIXED calculations.

Let’s start with a question to try to understand them: what total Quantity sold is associated with each Order ID?

We can visualize this in Tableau by adding Order ID & Quantity to a view. In this view, you can see the total quantity sold for each Order ID, this amount includes the Quantity from both the V1 and V2 datasets as they are unioned together, so for Order ID 1 you see a Quantity of 10.

If we add table name to this, you will see the actual Quantity associated with each Order ID in each dataset (V1 and V2). You can see that Order ID 1 which had 10 in the previous view, is made up of 5 for V1 and 5 for V2.

What fixed calculations allow us to do is ignore the level of detail in a given view and specify our own. As an example, we will create a fixed calculation called OrderQty with the formula of: { FIXED [Order ID] : SUM([Quantity]) } and add it to the view.

You can see that the new column ignores the fact that Table Name was added to the view, and shows the total Quantity across both datasets for each Order ID.

With this understanding, we will create a separate FIXED calculation to allow us to check if the Quantity per Order ID changed from V1 to V2. We will name this OrderQty Check with the formula below:

When we add this to the view, we will see which Order IDs have different Quantity values from the V1 dataset to the V2 dataset.

You can see that this returned false for Order ID 4 because in the V1 dataset the quantity was 15, but in the V2 dataset it was 30.

This is one quick way to use FIXED level of detail calculations to try to evaluate how data has changed across different versions. Hopefully you can see that there are many applications for this type of functionality!

Visual Validation

Tableau’s dashboarding capabilities allow you to create a series of checks and combine them into one easy to use dashboard to validate your pipeline.

In this case, for every table in the pipeline I added a series of checks: row count, distinct values, totals of aggregable columns (like sum of weight, sum of price), etc. This gave us a quick and easy way to check the status of the new pipeline version without having to write and run a long series of queries. Additionally, this made it easy to convey to team members what tables or columns needed investigation.

Drilling Down to Investigate

Once you have found a problem in your tables, you’ll want to look further into where specifically it is happening. In our example let’s say we see that the total Quantity per data source is different.

We want to find the source of this problem, so we start adding fields to investigate where this difference comes from, starting with Order ID.

From this, we can see that Order ID 4 is the source of the difference, let’s drill further by adding Product ID and filtering on Order ID 4.

We see that there is only 1 product showing up, and they have different values. We know we added an extra column (TestColumn) to the dataset, perhaps that is causing some duplication. Let’s check the number of rows.

Now we see that V2 has 2 rows for Product ID 1, if we are expecting that this dataset should have 1 row per Product ID per Order ID then this likely means some duplication has occurred when we joined in our new TestColumn and we should likely check the query we created for that.

Conclusion

Tableau is not necessarily the tool EVERYONE should use ANYTIME they need to validate anything, but it is certainly a tool which can be speed up validation because of its very flexible features and easy to use interface. In this article, we looked at a couple of ways you can use Tableau to execute some different checks and explore data more quickly than continually writing and re-writing SQL queries to perform similar checks. Hopefully you’ve gotten something out of this that can help make your life easier while trying to make sure your data is complete and of high quality!