What is Seaborn?

Seaborn is a Python data visualization library built on Matplotlib. I prefer it to plain Matplotlib and Pandas plotting because its defaults produce more attractive statistical graphics. Seaborn is designed to work with Pandas data frames and makes visualizing complex data easier. The Seaborn documentation is a great resource for learning the details of how to create visualizations in Seaborn.

Getting Started with Seaborn

The first step is to install Seaborn. You can do this by running the following command in your terminal or command prompt:

pip install seaborn

Next, you will need to import Seaborn into your Python environment. You can do this by using the code below. The alias “sns” is a standard abbreviation for “Seaborn.”

import seaborn as sns

Loading data into Seaborn

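The following code loads the iris dataset, a data frame of 150 flowers from three iris species, with the sepal length, sepal width, petal length, and petal width of each flower. Seaborn downloads the example datasets from its online repository, so an internet connection is needed the first time you load one.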

iris = sns.load_dataset("iris")
iris

Figure
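
Because sns.load_dataset returns a Pandas data frame, the usual Pandas methods are available for a first look at the data. The sketch below is only an illustration: it uses a small hand-made data frame (the name "sample" is hypothetical and the values are invented) with the same column names as iris, so that it runs without an internet connection. With the real data, the same calls should work on "iris".

```python
import pandas as pd

# Hand-made stand-in for the iris data frame (values invented for illustration).
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 6.4, 6.0, 6.9, 7.1],
        "sepal_width": [3.5, 3.0, 3.2, 2.7, 3.1, 3.0],
        "petal_length": [1.4, 1.5, 4.5, 4.1, 5.4, 5.9],
        "petal_width": [0.2, 0.2, 1.5, 1.3, 2.1, 2.1],
        "species": [
            "setosa", "setosa",
            "versicolor", "versicolor",
            "virginica", "virginica",
        ],
    }
)

# Number of rows and columns in the data frame.
n_rows, n_cols = sample.shape

# Number of rows recorded for each species.
counts = sample["species"].value_counts()

# Average petal length within each species.
means = sample.groupby("species")["petal_length"].mean()

print(n_rows, n_cols)
print(counts)
print(means)
```

A quick summary like this helps you decide which columns are quantitative (good candidates for the x- and y-axes) and which are categorical (good candidates for "hue=").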

Plotting with Different Visualizations

In this tutorial, we will go over the basics of how to create some common types of plots. I will cover only the basic arguments; the full list of arguments can be found in the documentation for each type of plot.

The Scatter Plot

The scatter plot graphs two quantitative variables against each other, one on the x-axis and the other on the y-axis. More information can be found in the Seaborn documentation for sns.scatterplot.

From the code below, the following arguments are used:

sns.scatterplot(x="petal_length", y="petal_width", data=iris)

  • “x=”: the variable plotted on the x-axis. Petal length is plotted on the x-axis.
  • “y=”: the variable plotted on the y-axis. Petal width is plotted on the y-axis.
  • “data=”: specifies which data frame to use. In this case, it is the “iris” variable.

Figure

  • “hue=”: colors the points by the levels of a variable and is most useful if the variable is categorical. In the code below, “species” has three different levels, and the legend shows which color belongs to which species.

Analysis: The graph below is more descriptive than the graph above. It shows that the setosa species has the smallest petal lengths and widths, whereas the virginica species generally has the largest.

sns.scatterplot(x="petal_length", y="petal_width", hue="species", data=iris)

Figure
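
Seaborn's sns.scatterplot returns a Matplotlib Axes object, so titles and axis labels can be adjusted with ordinary Matplotlib methods. The following is a minimal sketch rather than a definitive recipe. It assumes a small hand-made data frame (the name "flowers" is hypothetical and the values are invented) so that it runs offline, and it selects a non-interactive Matplotlib backend so that it runs without a display. In a notebook you would leave the backend line out and pass "iris" instead.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook

import pandas as pd
import seaborn as sns

# Hand-made stand-in for iris (values invented for illustration).
flowers = pd.DataFrame(
    {
        "petal_length": [1.4, 1.5, 4.5, 4.1, 5.4, 5.9],
        "petal_width": [0.2, 0.2, 1.5, 1.3, 2.1, 2.1],
        "species": [
            "setosa", "setosa",
            "versicolor", "versicolor",
            "virginica", "virginica",
        ],
    }
)

# scatterplot returns the Matplotlib Axes it drew on.
ax = sns.scatterplot(x="petal_length", y="petal_width", hue="species", data=flowers)

# Replace the default column-name labels with readable text.
ax.set_title("Petal width vs. petal length")
ax.set_xlabel("Petal length (cm)")
ax.set_ylabel("Petal width (cm)")
```

The same pattern applies to the other plotting functions in this tutorial that draw on a single set of axes, such as sns.boxplot and sns.kdeplot.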

The Boxplot

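Boxplots are useful for comparing the center and spread of the data across groups. The bottom of the “box” is the 25th percentile, the middle line is the median, and the top of the “box” is the 75th percentile. By default, the whiskers extend to the most extreme points within 1.5 times the interquartile range (the height of the box) from the box, and any points beyond the whiskers are drawn individually as outliers. More information can be found in the Seaborn documentation for sns.boxplot.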

From the code below, the following arguments are used:

sns.boxplot(x="species", y="sepal_length", data=iris)

  • “x=”: the variable plotted on the x-axis. “Species” is plotted on the x-axis.
  • “y=”: the variable plotted on the y-axis. Sepal length is plotted on the y-axis.
  • “data=”: specifies which data frame to use. In this case, it is the “iris” variable.

Analysis: We can see that setosa has the smallest median sepal length of the three species. We can also observe one outlier in the virginica boxplot, drawn as a separate point beyond the whisker.

Figure
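
To make the parts of a boxplot concrete, the quartiles and the outlier rule can be worked out by hand. The sketch below uses only the Python standard library and a short list of made-up sepal lengths (not taken from the iris data). It assumes the common rule that a point is an outlier if it lies more than 1.5 times the interquartile range beyond the box, which is Seaborn's default whisker setting.

```python
import statistics

# Made-up sepal lengths for one species (not taken from the iris data).
lengths = [4.6, 4.8, 5.0, 5.0, 5.2, 5.4, 5.6, 5.8, 7.4]

# The three quartile cut points: bottom of the box, median line, top of the box.
# "inclusive" interpolates linearly between neighboring data points.
q1, median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")

# Interquartile range: the height of the box.
iqr = q3 - q1

# Points outside these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in lengths if v < lower_fence or v > upper_fence]

print("Q1:", q1, "median:", median, "Q3:", q3)
print("outliers:", outliers)
```

Here only the value 7.4 falls outside the fences, so a boxplot of these numbers would show it as a single point above the upper whisker.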

The Kernel Density Estimation (KDE) Plot

The KDE plot shows the distribution of a quantitative variable as a smooth curve, much like a smoothed histogram. More information can be found in the Seaborn documentation for sns.kdeplot.

From the code below, the following arguments are used:

sns.kdeplot(x="sepal_width", hue="species", data=iris)

  • “x=”: the variable plotted on the x-axis. Sepal width is plotted on the x-axis.
  • “hue=”: draws a separate, differently colored curve for each level of a variable and is most useful if the variable is categorical. In the code above, “species” has three different levels, and the legend shows which color belongs to which species.
  • “data=”: specifies which data frame to use. In this case, it is the “iris” variable.

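Analysis: To interpret this graph, we see that setosa overall has the widest sepals. The peak of each curve shows where most of that species' sepal width values lie: around 3.5 for setosa and around 3.0 for versicolor and virginica. Notice that the y-axis automatically plots the density of the KDE. Density is not a percentage of the data: the share of values that fall in a range is the area under the curve over that range, not the height of the curve. Also note that when “hue=” is used, Seaborn by default scales the curves so that their combined area is 1 (the “common_norm” argument), so each species' curve here covers about one-third of the total area.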

Figure
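
The meaning of the density axis is easier to see with a small hand-written KDE. The sketch below is not Seaborn's implementation: it is a plain Gaussian KDE written with the standard library, using made-up sepal widths and a hand-picked bandwidth of 0.2 (both are assumptions for illustration). It shows that the total area under the curve is about 1, that the share of values in a range is the area over that range, and that the height of the curve can exceed 1.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point x."""
    scale = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, each centered on that sample.
        return scale * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up sepal widths (not taken from the iris data).
widths = [3.0, 3.2, 3.4, 3.4, 3.5, 3.6, 3.8, 4.0]
density = gaussian_kde(widths, bandwidth=0.2)

# Evaluate the curve on a fine grid from 1.5 to 5.5.
step = 0.01
grid = [1.5 + i * step for i in range(401)]

# Approximate areas with a simple Riemann sum (height times step width).
total_area = sum(density(x) for x in grid) * step
share_near_peak = sum(density(x) for x in grid if 3.25 <= x <= 3.75) * step
peak_height = max(density(x) for x in grid)

print("total area:", round(total_area, 3))
print("share between 3.25 and 3.75:", round(share_near_peak, 3))
print("peak height:", round(peak_height, 3))
```

The peak height is larger than 1 for this data, which would be impossible for a proportion. Only the areas behave like shares of the data.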

The Pairplot

Pairplots graph every pairwise combination of the quantitative variables in a data frame. More information can be found in the Seaborn documentation for sns.pairplot.

From the code below, the following arguments are used:

sns.pairplot(data=iris, hue="species")

  • “data=”: specifies which data frame to use. In this case, it is the “iris” variable.
  • “hue=”: differentiates between levels of a variable by color and is most useful if the variable is categorical. The “species” column is used to color the different species of iris.

Analysis: The pairplot draws a scatter plot for each pair of variables and a KDE plot for each variable along the diagonal, including the scatter plot and KDE plot from the examples above. This type of graph is useful for a comprehensive overview of the quantitative variables in the dataset.

Figure
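
To see how many panels to expect in a pairplot, the variable pairs can be enumerated directly. This short sketch uses only the standard library and the four quantitative column names of the iris data frame. It assumes the default full grid, where each pair appears twice (once on each side of the diagonal, with the axes swapped).

```python
from itertools import combinations

# The four quantitative columns of the iris data frame.
numeric_columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Every unordered pair of distinct columns.
pairs = list(combinations(numeric_columns, 2))

# The default grid has one row and one column per variable.
n_panels = len(numeric_columns) ** 2

# Diagonal panels show one variable each; the rest are scatter plots.
n_scatter_panels = n_panels - len(numeric_columns)

for x_name, y_name in pairs:
    print(x_name, "vs", y_name)
print("panels:", n_panels, "scatter panels:", n_scatter_panels)
```

With four variables there are six distinct pairs, so the 4-by-4 grid holds twelve scatter plots plus four diagonal panels.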

A Call to Action

Now that we’ve graphed a few visualizations with Seaborn, we can apply what we’ve learned and more with other datasets. The “titanic” dataset is also well-known and has data about the passengers of the Titanic. What visualizations can you create? What trends do you find from exploring the data? The code below loads and shows the “titanic” data frame.

titanic = sns.load_dataset("titanic")
titanic