Better Whisky Drinking Through Data Science

Nanigans is known for its hardworking, fun-loving culture, so our love of Whisky Data Science should come as no surprise. This post discusses and visualizes the tasting profiles of Scotch whisky from 86 distilleries. We use a form of dimensionality reduction to create a tasting map that shows the similarity between different whiskies. As a big fan of the amazingly smoky Laphroaig from Islay, I’d also like to identify some good candidates for my next Scotch to try.

The Whisky Dataset

Our dataset, courtesy of the University of Strathclyde in Glasgow, Scotland, consists of tasting notes for 86 different Scotch whiskies. Each whisky is scored in 12 categories on a scale of 0 to 4. These ratings provide distinct features by which to organize the different whiskies. The features we have are:

  • Body
  • Sweetness
  • Smoky
  • Medicinal
  • Spicy
  • Winey
  • Nutty
  • Floral
  • Tobacco
  • Honey
  • Malty
  • Fruity
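
As a rough sketch of how one tasting profile could be held in code: the feature names come from the list above, while the `TastingProfile` class and the example scores are hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass

# The 12 tasting categories listed above.
FEATURES = [
    "Body", "Sweetness", "Smoky", "Medicinal", "Spicy", "Winey",
    "Nutty", "Floral", "Tobacco", "Honey", "Malty", "Fruity",
]


@dataclass
class TastingProfile:
    """One distillery's scores, ordered like FEATURES (hypothetical container)."""
    distillery: str
    scores: list

    def __post_init__(self):
        # Every profile needs exactly one score per category, each on the 0-4 scale.
        if len(self.scores) != len(FEATURES):
            raise ValueError(f"expected {len(FEATURES)} scores, got {len(self.scores)}")
        if any(not 0 <= s <= 4 for s in self.scores):
            raise ValueError("each score must be between 0 and 4")

    def as_dict(self):
        return dict(zip(FEATURES, self.scores))


# Made-up scores for an imaginary smoky whisky, NOT a row from the dataset.
example = TastingProfile("Example Distillery", [4, 2, 4, 4, 1, 0, 1, 0, 1, 0, 1, 1])
print(example.as_dict()["Smoky"])
```

With 86 such profiles, the full dataset is simply an 86 × 12 table of small integers.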

Dimensionality Reduction with t-SNE

In this example, we have 86 whiskies from different distilleries with 12 features each, for a total of 1,032 individual category ratings. Directly visualizing these values can be quite a challenge; imagine a bar chart with over 1,000 bars on it, for example. To address this issue, we use a dimensionality reduction algorithm to visualize the relationships between datapoints. One primary use of dimensionality reduction algorithms is to project a high-dimensional dataset (12 dimensions in our case) into two dimensions so that it can be plotted directly. The goal is to represent the original structure of the high-dimensional dataset as accurately as possible in a smaller number of dimensions. Dimensionality reduction is often used to build low-dimensional visualizations of face images, handwritten characters, country statistics, or any other source of high-dimensional data.

One such state-of-the-art algorithm is t-distributed Stochastic Neighbor Embedding (t-SNE), originally described by Laurens van der Maaten and Geoffrey Hinton in 2008. This algorithm measures pairwise distances between datapoints in the original high-dimensional space and converts them into conditional probabilities that represent pairwise similarities. These probabilities are calculated using a Gaussian distribution, which determines how quickly our perception of similarity falls off as a function of the distance between datapoints. t-SNE then creates a mapping of points in two dimensions whose pairwise probabilities resemble those of the high-dimensional space, with one exception: the low-dimensional similarities are calculated using the t-distribution, which is the secret sauce from which t-SNE derives part of its name.
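
The two kinds of similarity described above can be sketched in a few lines of plain Python. This is a simplified illustration, not a full t-SNE implementation: it uses one fixed Gaussian width `sigma` for every point (real t-SNE tunes a width per point from a perplexity setting) and it only scores a given 2D layout rather than optimizing one. The function names and the toy points are hypothetical.

```python
import math


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gaussian_joint(points, sigma=1.0):
    """High-dimensional similarities p_ij built from Gaussian conditionals p(j|i)."""
    n = len(points)
    cond = []
    for i in range(n):
        row = [
            0.0 if i == j
            else math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(row)
        cond.append([w / total for w in row])
    # Symmetrize: p_ij = (p(j|i) + p(i|j)) / (2n)
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)] for i in range(n)]


def student_t_joint(points):
    """Low-dimensional similarities q_ij from a t-distribution with one degree of freedom."""
    n = len(points)
    w = [
        [0.0 if i == j else 1.0 / (1.0 + sq_dist(points[i], points[j])) for j in range(n)]
        for i in range(n)
    ]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]


def kl_divergence(p, q):
    """Mismatch between the two sets of similarities; t-SNE tries to make this small."""
    return sum(
        pij * math.log(pij / qij)
        for prow, qrow in zip(p, q)
        for pij, qij in zip(prow, qrow)
        if pij > 0
    )


# Toy data (made up): points 0 and 1 are similar, point 2 is very different.
high = [[4, 4, 4], [4, 3, 4], [0, 1, 0]]
good_layout = [(0.0, 0.0), (0.5, 0.0), (5.0, 0.0)]  # keeps 0 and 1 together
bad_layout = [(0.0, 0.0), (5.0, 0.0), (0.5, 0.0)]   # wrongly puts 2 next to 0

p = gaussian_joint(high)
kl_good = kl_divergence(p, student_t_joint(good_layout))
kl_bad = kl_divergence(p, student_t_joint(bad_layout))
print(kl_good, kl_bad)
```

The layout that keeps similar points together gets the lower score, which is the quantity the full algorithm reduces step by step when it moves points around the map.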

Interestingly enough, the t-distribution was first described by William Sealy Gosset in 1908, while he was working at the Guinness Brewery in Dublin, Ireland, as a tool for analyzing small samples. The pursuit of better drinks has been pushing the envelope of data science for over 100 years!

A visualization of the whisky dataset is shown below. Move your mouse over the datapoints to trigger a radar chart showing the tasting notes for that particular whisky!

[Interactive visualization: t-SNE map of the whisky dataset]

The t-SNE visualization above represents our entire whisky dataset in a simple two-dimensional map. Datapoints near each other should have similar tasting profiles, while datapoints far apart should differ more. This can be verified by inspecting some of the features using the interactive visualization. For example, Laphroaig in the lower right corner has high smoky, body, and medicinal features, and I assure you that the smokiness of Laphroaig cannot be overstated. Near Laphroaig is a cluster consisting of Caol Ila, Clynelish, Ardbeg, Lagavulin, and Talisker, which all have similar features, especially smoky.
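
One way to sanity-check a neighborhood on the map is to look up nearest neighbors directly in the original 12-feature space. The sketch below does that with plain Euclidean distance; it is a separate check, not part of t-SNE. The `nearest` helper, the placeholder names, and all scores are hypothetical and do not come from the dataset.

```python
import math

# Made-up 12-score profiles (Body, Sweetness, Smoky, Medicinal, ...), NOT real dataset rows.
profiles = {
    "Peaty A": [4, 2, 4, 4, 1, 0, 1, 0, 1, 0, 1, 1],
    "Peaty B": [3, 1, 4, 3, 1, 0, 1, 0, 1, 0, 2, 1],
    "Sweet C": [2, 3, 0, 0, 1, 2, 2, 2, 0, 3, 2, 3],
    "Light D": [1, 3, 0, 0, 0, 0, 1, 3, 0, 1, 2, 2],
}


def nearest(name, profiles, k=2):
    """Return the k whiskies whose score vectors are closest to the named one."""
    target = profiles[name]
    ranked = sorted(
        (math.dist(target, scores), other)
        for other, scores in profiles.items()
        if other != name
    )
    return [other for _, other in ranked[:k]]


print(nearest("Peaty A", profiles, k=1))
```

If the map is faithful, whiskies returned by a lookup like this should also sit close to the chosen whisky in the two-dimensional plot.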

This map could be used to guide purchasing decisions. Hosting a Scotch tasting? You may want to try whisky from a wider variety of distilleries to provide a full exploration of the different tastes of Scotch. Simply pick a handful of distilleries that are as separated as possible on the visualization. As for me, my next purchase will be a smoky bottle of 18-year-old Caol Ila.
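
Picking distilleries that are "as separated as possible" can be approximated with a greedy farthest-point rule: start from one whisky, then keep adding whichever remaining whisky is farthest from everything chosen so far. This is a heuristic sketch and is not guaranteed to find the best possible spread; the `pick_spread_out` function, the labels, and the map coordinates below are all hypothetical.

```python
import math


def pick_spread_out(coords, count, start):
    """Greedy farthest-point selection over 2D map positions."""
    chosen = [start]
    while len(chosen) < min(count, len(coords)):
        # For each candidate, measure the distance to its closest already-chosen
        # point, then take the candidate for which that distance is largest.
        best = max(
            (name for name in coords if name not in chosen),
            key=lambda name: min(math.dist(coords[name], coords[c]) for c in chosen),
        )
        chosen.append(best)
    return chosen


# Made-up map positions, NOT coordinates from the actual visualization.
coords = {
    "W1": (0.0, 0.0),
    "W2": (1.0, 0.0),
    "W3": (10.0, 0.0),
    "W4": (5.0, 8.0),
    "W5": (9.0, 1.0),
}

print(pick_spread_out(coords, 3, start="W1"))
```

Starting from a favorite bottle keeps it in the lineup while the remaining picks cover the rest of the map.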

Image Credit: John Haslam on Flickr

One Response

  1. Interesting distribution. I find the Speyside product to be on the smokier side of the Scotch set, based on the peat. It is usually described as fuller and sweeter too. However, something like the Aberlour, as compared to the Speyburn, is in a different quadrant. Speysides are generally classified as light and grassy (e.g. Glenlivet) or rich and sweet (e.g. Macallan).

    My personal favorite is the Oban 14, for the money it is really smooth and delightful.

    Nice Job!
