Readers:
Today I am beginning a multi-part series on data blending.
- Parts 1, 2 and 3 will be an introduction and overview of what data blending is.
- Part 4 will review an illustrative example of how to do data blending in Tableau.
- Part 5 will review an illustrative example of how to do data blending in MicroStrategy.
I may also include a Part 6, but I have to see how my research on this topic continues to progress over the next week.
Much of Parts 1, 2 and 3 is based on a research paper written by Kristi Morton of the University of Washington (and her co-authors) [1].
Please review the references at the end of each blog post in this series for links to the source material and additional information.
I hope you find this series helpful for your data visualization needs.
Best Regards,
Michael
Introduction
Tableau and MicroStrategy’s new Analytics Platform are commercial business intelligence (BI) software tools that support interactive, visual analysis of data. [1]
With a visual interface to data and a focus on usability, these tools enable a wide audience of business partners (IT's end users) to gain insight into their datasets. The user experience is a fluid process of interaction in which exploring and visualizing data takes just a few simple drag-and-drop operations (no programming skills or database experience required). In this context of exploratory, ad hoc visual analysis, we will explore a feature originally introduced in Tableau in 2006, and in MicroStrategy's new Analytics Platform v9.4.1 late last year (2013).
We will examine how we can integrate large, heterogeneous data sources. This feature, called data blending, gives users the ability to create data visualization mashups from structured, heterogeneous data sources dynamically, without any upfront integration effort. Users can author visualizations that automatically integrate data from a variety of sources, including data warehouses, data marts, text files, spreadsheets, and data cubes. Because data blending is workload driven, we are able to bypass many of the pain points and much of the uncertainty involved in creating mediated schemas and schema mappings in current pay-as-you-go integration systems.
The Cycle of Visual Analysis
Unlike databases, our human brains have limited capacity for managing and making sense of large collections of data. In database terms, gaining insight into big data is often accomplished by issuing aggregation and filter queries that produce manageable subsets of the data.
However, this approach can be time-consuming. The user is forced to complete the following tasks (sketched in code after the list):
- Figure out what queries to write.
- Write the queries.
- Wait for the results to be returned in textual format. And then, finally,
- Read through these textual summaries (often containing thousands of rows) to search for interesting patterns or anomalies.
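To make the contrast concrete, here is a minimal sketch of that manual workflow using pandas. The table, column names, and threshold are illustrative assumptions, not taken from the paper.

```python
import pandas as pd

# Hypothetical sales data standing in for a database table.
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "South"],
    "product": ["A", "B", "A", "B", "A"],
    "revenue": [1200, 800, 950, 400, 300],
})

# Steps 1-2: figure out and write the query -- aggregate revenue by
# region, then filter to the regions above some threshold.
summary = (
    sales.groupby("region", as_index=False)["revenue"]
         .sum()
         .query("revenue > 500")
)

# Steps 3-4: wait for the textual result and read through it,
# hunting for patterns or anomalies by eye.
print(summary)
```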
Tools like Tableau and MicroStrategy help bridge this gap by providing a visual interface to the data. This approach removes the burden of having to write queries. The user can ask their questions through visual drag-and-drop operations (again, no queries or programming experience required). Additionally, answers are displayed visually, where patterns and outliers can quickly be identified.
Visualizations leverage the powerful human visual system to help us effectively digest large amounts of information and disseminate it more quickly.
Figure 1 image: Kristi Morton, Ross Bunker, Jock Mackinlay, Robert Morton, and Chris Stolte, Dynamic Workload Driven Data Integration in Tableau [1].
Figure 1, above, illustrates how visualization is a key component in turning information into knowledge and knowledge into wisdom.
Ms. Morton discusses the process as follows:
The process starts with some task or question about which a knowledge worker (shown at the center) seeks to gain understanding. In the first stage, the user forages for data that may contain relevant information for their analysis task. Next, they search for a visual structure that is appropriate for the data and instantiate that structure. At this point, the user interacts with the resulting visualization (e.g., drilling down to details or rolling up to summarize) to develop further insight.
Once the necessary insight is obtained, the user can then make an informed decision and take action. This cycle is centered around and driven by the user and requires that the visualization system be flexible enough to support user feedback and allow alternative paths based on the needs of the user’s exploratory tasks. Most visualization tools, however, treat this cycle as a single, directed pipeline, and offer limited interaction with the user. Moreover, users often want to ask their analytical questions over multiple data sources. However, the task of setting up data for integration is orthogonal to the analysis task at hand, requiring a context switch that interrupts the natural flow of the analysis cycle. We extend the visual analysis cycle with a new feature called data blending that allows the user to seamlessly combine and visualize data from multiple different data sources on-the-fly. Our blending system issues live queries to each data source to extract the minimum information necessary to accomplish the visual analysis task.
Often, the visualization's level of detail is coarser than that of the underlying data sets. Aggregation queries, therefore, are issued to each data source before the results are copied over and joined in Tableau's local in-memory view. We refer to this type of join as a post-aggregate join and find it a natural fit for exploratory analysis, since less data is moved from the sources for each analytical task, resulting in a more responsive system.
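As a rough illustration of the post-aggregate join idea (a pandas sketch, not Tableau's implementation; the sources, columns, and grain are hypothetical): each source is first aggregated to the visualization's level of detail, and only those small results are joined in memory.

```python
import pandas as pd

# Two hypothetical sources at a finer grain than the visualization needs.
sales = pd.DataFrame({
    "state":  ["WA", "WA", "OR", "CA"],
    "amount": [100.0, 250.0, 80.0, 400.0],
})
population = pd.DataFrame({
    "state":        ["WA", "OR", "CA"],
    "pop_millions": [7.7, 4.2, 39.0],
})

# Aggregation queries issued per source, at the visualization's level
# of detail (state), so only small result sets leave each source.
sales_agg = sales.groupby("state", as_index=False)["amount"].sum()
pop_agg = population.groupby("state", as_index=False)["pop_millions"].max()

# Post-aggregate join: the small, pre-aggregated results are combined
# in the local in-memory view.
blended = sales_agg.merge(pop_agg, on="state", how="left")
blended["amount_per_million"] = blended["amount"] / blended["pop_millions"]
print(blended)
```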
Finally, Tableau’s data blending feature automatically infers how to integrate the datasets on-the-fly, involving the user only in resolving conflicts. This system also addresses a few other key data integration challenges, including combining datasets with mismatched domains or different levels of detail and dirty or missing data values. One interesting property of blending data in the context of a visualization is that the user can immediately observe any anomalies or problems through the resulting visualization.
These design decisions were grounded in the needs of Tableau's typical BI user base. Thanks to the availability of a wide variety of rich public datasets from sites like data.gov, many of Tableau's users integrate data from external sources such as the Web, or corporate data such as internally curated Excel spreadsheets, into their enterprise data warehouses to do predictive, what-if analysis.
However, the task of integrating external data sources into their enterprise systems is complicated. First, such repositories are under strict management by IT departments, and often IT does not have the bandwidth to incorporate and maintain each additional data source. Second, users often have restricted permissions and cannot add external data sources themselves. Such users cannot integrate their external and enterprise sources without having them collocated.
An alternative approach is to move the data sets to a data repository that the user has access to, but moving large data is expensive and often untenable. We therefore architected data blending with the following principles in mind: 1) move as little data as possible, 2) push the computations to the data, and 3) automate the integration challenges as much as possible, involving the user only in resolving conflicts.
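As a rough sketch of the first two principles (assuming an in-memory SQLite database standing in for an enterprise warehouse; the table and column names are made up for illustration), the aggregation is pushed down to the source so that only the small summary travels to the client:

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database stands in for an enterprise warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (state TEXT, amount REAL);
    INSERT INTO orders VALUES ('WA', 100), ('WA', 250), ('OR', 80), ('CA', 400);
""")

# Principles 1 and 2: the GROUP BY runs at the source, so only the
# three summary rows (not every order) are moved to the client.
summary = pd.read_sql_query(
    "SELECT state, SUM(amount) AS total_amount FROM orders GROUP BY state",
    conn,
)
print(summary)
conn.close()
```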
Next: Data Blending Overview
——————————————————————————————————–
References:
[1] Kristi Morton, Ross Bunker, Jock Mackinlay, Robert Morton, and Chris Stolte, Dynamic Workload Driven Data Integration in Tableau, University of Washington and Tableau Software, Seattle, Washington, March 2012, http://homes.cs.washington.edu/~kmorton/modi221-mortonA.pdf.