The Whisky Dataset
Our dataset, courtesy of the University of Strathclyde in Glasgow, Scotland, consists of tasting notes for 86 different Scotch whiskies. Each whisky is scored in 12 different categories on a scale of 0 to 4. These ratings serve as the features by which we organize the whiskies. The features we have are:
Dimensionality Reduction with t-SNE
In this example, we have 86 whiskies from different distilleries with 12 features each, for a total of 1,032 individual category ratings. Directly visualizing these values can be quite a challenge; imagine a bar chart with over 1,000 bars on it. To address this issue, we use a dimensionality reduction algorithm to visualize the relationships between datapoints. One primary use of dimensionality reduction is to project a high-dimensional dataset (12 dimensions in our case) into two dimensions so that it can be plotted and inspected directly. The goal is to preserve the structure of the original high-dimensional data as accurately as possible in a smaller number of dimensions. Dimensionality reduction is often used to build low-dimensional views of face images, handwritten characters, country statistics, or any other source of high-dimensional data.
One such algorithm is t-distributed Stochastic Neighbor Embedding (t-SNE), introduced by Laurens van der Maaten and Geoffrey Hinton in 2008. The algorithm measures pairwise distances between datapoints in the original high-dimensional space and converts them into conditional probabilities that represent pairwise similarities. These probabilities are calculated using a Gaussian distribution centered on each datapoint, which determines how quickly our perception of similarity falls off as a function of distance. t-SNE then looks for an arrangement of points in two dimensions whose pairwise similarities resemble those of the high-dimensional data, with one exception: the low-dimensional similarities are calculated using the heavier-tailed Student's t-distribution, which is the secret sauce from which t-SNE derives part of its name.
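The two kinds of similarity described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the full algorithm: the function names are made up for this example, the Gaussian width sigma is fixed here (t-SNE itself tunes it per datapoint through a setting called perplexity), and the three rating vectors are invented rather than taken from the whisky dataset.

```python
import math

def gaussian_conditional(points, i, sigma=1.0):
    """Similarity of every point j to point i in the original space, p(j|i)."""
    weights = []
    for j, other in enumerate(points):
        if j == i:
            weights.append(0.0)  # a point is not counted as its own neighbor
        else:
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], other))
            weights.append(math.exp(-d2 / (2 * sigma ** 2)))
    total = sum(weights)
    return [w / total for w in weights]

def student_t_joint(embedding):
    """Similarities in the low-dimensional map, q(i,j), using the
    heavy-tailed Student's t kernel with one degree of freedom."""
    n = len(embedding)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(embedding[i], embedding[j]))
                weights[i][j] = 1.0 / (1.0 + d2)
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

# Invented rating vectors, for illustration only.
ratings = [[2, 2, 2], [2, 2, 3], [4, 0, 0]]
p = gaussian_conditional(ratings, 0)

# Invented 2D positions for the same three points.
q = student_t_joint([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
```

In this toy example the second vector is much closer to the first than the third is, so it receives nearly all of the probability in p. t-SNE moves the 2D points around until the q values match the high-dimensional similarities as closely as it can.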
Interestingly enough, the t-distribution was first described in 1908 by William Sealy Gosset, who developed it for analyzing small samples while working at the Guinness Brewery in Dublin, Ireland. The pursuit of better drinks has been pushing the envelope of data science for over 100 years!
A visualization of the whisky dataset is shown below. Move your mouse over the datapoints to trigger a radar chart showing the tasting notes for that particular whisky!
The t-SNE visualization above represents our entire whisky dataset in a simple two-dimensional map. Datapoints near each other should have similar tasting profiles, while datapoints far apart should differ more. This can be verified by inspecting some of the features using the interactive visualization. For example, Laphroaig in the lower right corner has high smoky, body, and medicinal features, and I assure you that the smokiness of Laphroaig cannot be overstated. Near Laphroaig is a cluster consisting of Caol Ila, Clynelish, Ardbeg, Lagavulin, and Talisker, which all have similar features, especially smoky.
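For readers who want to build a similar map, here is a minimal sketch using scikit-learn's TSNE class. It is not necessarily how the visualization above was produced: the ratings matrix below is random placeholder data with the same shape as the whisky dataset (86 rows, 12 columns, scores from 0 to 4), and the perplexity and random seed are arbitrary choices.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder data, NOT the real tasting notes:
# 86 whiskies x 12 categories, integer scores from 0 to 4.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 5, size=(86, 12)).astype(float)

# Project the 12-dimensional ratings down to 2 dimensions.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(ratings)

print(embedding.shape)  # (86, 2): one x, y position per whisky
```

Each row of the result is the position of one whisky on the map. Because t-SNE is stochastic, a different random seed or perplexity will give a different layout, although whiskies with similar ratings should still tend to land near each other.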
This map can also help with purchasing decisions. Hosting a Scotch tasting? You may want to try whisky from a wider variety of distilleries to provide a full exploration of the different tastes of Scotch. Simply pick a handful of distilleries that are as separated as possible on the visualization. As for me, my next purchase will be a smoky bottle of 18-year-old Caol Ila.
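If you would rather not pick the distilleries by eye, one simple approach is a greedy "farthest point" selection: start with any whisky, then repeatedly add the one that is farthest from everything chosen so far. The sketch below assumes the map is available as a list of (x, y) positions; the function name and the coordinates are invented for illustration.

```python
import math

def pick_spread_out(coords, k, start=0):
    """Greedily choose up to k indices whose points are far apart."""
    chosen = [start]
    # Distance from each point to its nearest already-chosen point.
    nearest = [math.dist(c, coords[start]) for c in coords]
    while len(chosen) < min(k, len(coords)):
        # Take the point that is farthest from everything chosen so far.
        nxt = max(range(len(coords)), key=lambda i: nearest[i])
        chosen.append(nxt)
        for i, c in enumerate(coords):
            nearest[i] = min(nearest[i], math.dist(c, coords[nxt]))
    return chosen

# Invented map positions, for illustration only.
map_points = [(0.0, 0.0), (0.5, 0.2), (10.0, 0.0), (0.0, 10.0), (9.5, 0.5)]
picks = pick_spread_out(map_points, 3)
print(picks)  # [0, 2, 3]
```

This heuristic does not guarantee the most spread-out set possible, but for a handful of bottles it should give a reasonable tour of the map.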
Image Credit: John Haslam on Flickr