Data Blending

An Introduction to Data Blending – Part 3 (Benefits of Blending Data)

Readers:

In Part 2 of this series on data blending, we delved deeper into understanding what data blending is. We also examined how data blending is used in Hans Rosling’s well-known Gapminder application.

Today, in Part 3 of this series, we will dig even deeper by examining the benefits of blending data.

Again, much of Parts 1, 2 and 3 are based on a research paper written by Kristi Morton from The University of Washington (and others) [1].

You can learn more about Ms. Morton’s research as well as other resources used to create this blog post by referring to the References at the end of the blog post.

Best Regards,

Michael

Benefits of Blending Data

In this section, we will examine the advantages of using the data blending feature for integrating datasets. Additionally, we will review another illustrative example of data blending using Tableau.

Integrating Data Using Tableau

In Ms. Morton’s research, Tableau was equipped with two ways of integrating data. First, in the case where the data sets are collocated (or can be collocated), Tableau formulates a query that joins them to produce a visualization. However, in the case where the data sets are not collocated (or cannot be collocated), Tableau federates queries to each data source, and creates a dynamic, blended view that consists of the joined result sets of the queries. For the purpose of exploratory visual analytics, Ms. Morton (et al) found that data blending is a complementary technology to the standard collocated approach with the following benefits:

  • Resolves many data granularity problems
  • Resolves collocation problems
  • Adapts to needs of exploratory visual analytics

Figure 1 - Company Tables

Image: Kristi Morton, Ross Bunker, Jock Mackinlay, Robert Morton, and Chris Stolte, Dynamic Workload Driven Data Integration in Tableau. [1]

Resolving Data Granularity Problems

Often times a user wants to combine data that may not be at the same granularity (i.e. they have different primary keys). For example, let’s say that an employee at company A wants to compare the yearly growth of sales to a competitor company B. The dataset for company B (see Figure 1 above) contains a detailed quarterly growth of sales for B (quarter, year is the primary key), while company A’s dataset only includes the yearly sales (year is the primary key). If the employee simply joins these two datasets on yearly earnings, then each row from A will be duplicated for each quarter in B for a given year resulting in an inaccurate overestimate of A’s yearly earnings.

This duplication problem can be avoided if for example, company B’s sales dataset were first aggregated to the level of year, then joined with company A’s dataset. In this case, data blending detects that the data sets are at different granularities by examining their primary keys and notes that in order to join them, the common field is year. In order to join them on year, an aggregation query is issued to company B’s dataset, which returns the sales aggregated up to the yearly level as shown in Figure 1. This result is blended with company A’s dataset to produce the desired visualization of yearly sales for companies A and B.

The blending feature does all of this on-the-fly without user-intervention.

Resolves Collocation Problems

As mentioned in Part 1, managed repository is expensive and untenable. In other cases, the data repository may have rigid structure, as with cubes, to ensure performance, support security or protect data quality. Furthermore, it is often unclear if it is worth the effort of integrating an external data set that has uncertain value. The user may not know until she has started exploring the data if it has enough value to justify spending the time to integrate and load it into her repository.

Thus, one of the paramount benefits of data blending is that it allows the user to quickly start exploring their data, and as they explore the integration happens automatically as a natural part of the analysis cycle.

An interesting final benefit of the blending approach is that it enables users to seamlessly integrate across different types of data (which usually exist in separate repositories) such as relational, cubes, text files, spreadsheets, etc.

Adapts to Needs of Exploratory Visual Analytics

A key benefit of data blending is its flexibility; it gives the user the freedom to view their blended data at different granularities and control how data is integrated on-the-fly. The blended views are dynamically created as the user is visually exploring the datasets. For example, the user can drill-down, roll-up, pivot, or filter any blended view as needed during her exploratory analysis. This feature is useful for data exploration and what-if analysis.

Another Illustrative Example of Data Blending

Figure 2 (below) illustrates the possible outcomes of an election for District 2 Supervisor of San Francisco. With this type of visualization, the user can select different election styles and see how their choice affects the outcome of the election.

What’s interesting from a blending standpoint is that this is an example of a many-to-one relationship between the primary and secondary datasets. This means that the fields being left-joined in by the secondary data sources match multiple rows from the primary dataset and results in these values being duplicated. Thus any subsequent aggregation operations would reflect this duplicate data, resulting in overestimates. The blending feature, however, prevents this scenario from occurring by performing all aggregation prior to duplicating data during the left-join.

Figure 2 - San Francisco Election

 Image: Kristi Morton, Ross Bunker, Jock Mackinlay, Robert Morton, and Chris Stolte, Dynamic Workload Driven Data Integration in Tableau. [1]

Next: Data Blending Design Principles

——————————————————————————————————–

References:

[1] Kristi Morton, Ross Bunker, Jock Mackinlay, Robert Morton, and Chris Stolte, Dynamic Workload Driven Data Integration in Tableau, University of Washington and Tableau Software, Seattle, Washington, March 2012, http://homes.cs.washington.edu/~kmorton/modi221-mortonA.pdf.

[2] Hans Rosling, Wealth & Health of Nations, Gapminder.org, http://www.gapminder.org/world/.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s