This page gives a few hints for visualizing a dataset composed of several numeric variables. The famous mtcars dataset is used as an example. It provides several features, such as the number of cylinders, the gross horsepower and the weight, for 32 car models.

# Libraries
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(DT)
library(plotly)
library(dendextend)
library(car)
library(FactoMineR)
library(kableExtra)
options(knitr.table.format = "html")

# This dataset is available in R by default, and on the datatoviz github repo
data <- read.csv("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/6_SeveralNum.csv", header=TRUE)
rownames(data) <- data[,1]
data <- data[,-1]

# Save it as .csv for the github repo
#write.csv(mtcars, file="../Example_dataset/6_SeveralNum.csv", quote=F)

# Show the first 6 rows of the data
data %>% head(6) %>% kable() %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225	105	2.76	3.460	20.22	1	0	3	1

Check distribution

In my opinion, the first thing to do when you have several numeric variables is to look at their distributions one by one. This can be done with a violin plot, a boxplot or a ridgeline plot if your variables are all on the same scale. In the mtcars dataset the variables are on completely different scales, so it makes more sense to draw a histogram for each of them:

# Keep the numeric variables of the mtcars dataset
data <- mtcars %>% select( disp, drat, hp, mpg, qsec, wt)

# Show the histogram of these variables:
# reshape to long format (one row per variable/value pair), then one panel per variable
data %>%
  as_tibble() %>%
  gather(variable, value) %>%
  ggplot( aes(x=value) ) +
    geom_histogram( fill= "#69b3a2", bins=30) +
    facet_wrap(~variable, scales="free") +
    theme_ipsum()
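
Once the variables are brought to a common scale, the boxplot option mentioned above becomes usable too. The sketch below is one possible approach and not part of the original analysis: it uses base R only, and the names `num_vars` and `scaled` are introduced here for illustration. Each variable is centred and scaled to unit variance with `scale()` before being drawn.

```r
# Illustrative sketch (assumption: z-score standardisation is an acceptable
# way to put the variables on a common scale for a visual comparison)
num_vars <- c("disp", "drat", "hp", "mpg", "qsec", "wt")

# scale() centres each column on 0 and divides it by its standard deviation
scaled <- as.data.frame(scale(mtcars[, num_vars]))

# One boxplot per variable, now comparable on a single axis
boxplot(scaled, col = "#69b3a2", ylab = "Standardised value")
```

Standardising hides the original units, so this view is only useful for comparing the shape of the distributions (skewness, outliers), not their magnitude.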

Correlogram

A correlogram, or correlation matrix, lets you analyse the relationship between each pair of numeric variables in a dataset. The relationship between each pair of variables is visualised through a scatterplot, or a symbol that represents the correlation (bubble, line, number...). The diagonal often represents the distribution of each variable, using a histogram or a density plot.

# Note: with car >= 3.0 the regression line is switched off with regLine (older versions used reg.line and id.col)
scatterplotMatrix(~mpg+disp+drat+hp+qsec+wt, data=data , regLine=FALSE, smooth=FALSE, col="#69b3a2", cex=1.5 , pch=20 )

It is a powerful method that gives a good overview of the dataset in a single graphic. For instance, displacement (disp) and gross horsepower (hp) are clearly strongly correlated.
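
The visual impression can be backed up with numbers. As a minimal sketch (base R only; `num_vars` and `cor_mat` are names chosen here for illustration), the Pearson correlation matrix of the same six variables can be computed with `cor()`:

```r
# Illustrative sketch: numeric counterpart of the scatterplot matrix
num_vars <- c("mpg", "disp", "drat", "hp", "qsec", "wt")

# cor() computes Pearson correlations between columns by default
cor_mat <- cor(mtcars[, num_vars])

# Rounded matrix for reading, and the disp / hp pair discussed in the text
print(round(cor_mat, 2))
print(cor_mat["disp", "hp"])
```

Pearson correlation only captures linear relationships, so it is worth reading the matrix alongside the scatterplots rather than instead of them.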

Dendrogram

A dendrogram can be used to check the result of a clustering algorithm on the dataset. Basically, the steps are:

compute the distance between each pair of samples, using correlation or Euclidean distance.
perform clustering on this distance matrix: it builds a hierarchy of clusters by grouping samples that are close to one another.
show the result as a dendrogram:

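# Hierarchical clustering on the numeric variables kept above
# (dist() and hclust() defaults: Euclidean distance, complete linkage)
data %>% dist() %>% hclust() %>% as.dendrogram() -> dend

# Color labels and branches according to the cluster (3 clusters)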
par(mar=c(1,1,1,7))
dend %>%
  set("labels_col", value = c("#69b3a2", "#404080", "orange"), k=3) %>%
  set("branches_k_color", value = c("#69b3a2", "#404080", "orange"), k = 3) %>%
  plot(horiz=TRUE, axes=FALSE)
abline(v = 350, lty = 2)

Here, the dendrogram tells us that the Mercedes 280 and the Mercedes 280C have similar features, which makes sense. Basically, it gives an idea of which groups of cars are similar to one another.
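
To read the groups without relying on the picture, the tree can be cut into a fixed number of clusters. The following is a minimal sketch using base R only; the names `hc` and `groups` are introduced here, and the choice of 3 clusters simply mirrors the colouring used in the plot.

```r
# Illustrative sketch: extract cluster membership from the hierarchy
num_vars <- c("disp", "drat", "hp", "mpg", "qsec", "wt")

# Same defaults as in the pipeline: Euclidean distance, complete linkage
hc <- hclust(dist(mtcars[, num_vars]))

# Cut the tree into 3 groups (assumption: 3 matches the colouring of the plot)
groups <- cutree(hc, k = 3)

# Size of each cluster, and the cluster of the two Mercedes models
print(table(groups))
print(groups[c("Merc 280", "Merc 280C")])

# Full list of cars per cluster
print(split(names(groups), groups))
```

Because the variables are not standardised here, those with the largest range (disp and hp) weigh the most in the Euclidean distance. Calling `scale()` on the data first would give each variable the same weight, and may change the groups.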

Heatmap

The heatmap is often used as a complement to a dendrogram. It is a graphical representation of data where the individual values contained in a matrix are represented as colors. It is a bit like looking at a data table from above.

Combined with a dendrogram, it helps to understand why samples or features are grouped together.

library(d3heatmap)
d3heatmap(data, k_row = 4, k_col = 2, scale = "column")

The heatmap above helps to understand why cars are split into 2 main clusters. For instance, the weight (wt) is much higher for the group on top than for the other.
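
In case the d3heatmap package is not available on your setup, a static version can be sketched with the `heatmap()` function from base R. This is a swapped-in alternative, not the interactive chart above; the names `mat` and `hm` are introduced here for illustration.

```r
# Illustrative sketch: static heatmap with row and column dendrograms (base R)
num_vars <- c("disp", "drat", "hp", "mpg", "qsec", "wt")
mat <- as.matrix(mtcars[, num_vars])

# scale = "column" standardises each variable so that no single variable
# dominates the colour scale; rows and columns are reordered by clustering
hm <- heatmap(mat, scale = "column", col = heat.colors(50), margins = c(5, 8))

# The returned list holds the row and column order used in the plot
print(rownames(mat)[hm$rowInd])
```

Note that the scaling only affects the colours: the dendrograms are still computed on the unscaled matrix, so the row order follows the raw distances.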

PCA

Principal Component Analysis (PCA) is a statistical procedure that aims to summarize all the available numeric variables in a set of principal components: new, uncorrelated variables ordered by the share of variance they explain.

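# Run the PCA on standardized variables, without drawing the default plots
myPCA <- PCA(data, scale.unit=TRUE, graph=FALSE)

# Plot the cars on the first two principal components
myPCA$ind %>%
  as.data.frame() %>%
  mutate(name=rownames(.)) %>%
  ggplot( aes(x=coord.Dim.1, y=coord.Dim.2, label=name)) +
    geom_point( color="#69b3a2") +
    theme_ipsum() +
    geom_label(color="#69b3a2")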

#plot.PCA(myPCA, axes=c(1, 2), choix="var")
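
To judge how well the first two components summarize the data, it helps to look at the share of variance each component explains. The sketch below uses `prcomp()` from base R rather than FactoMineR, so the coordinates can differ from those above in sign and slightly in scale; `pca` and `var_explained` are names chosen here for illustration.

```r
# Illustrative sketch: variance explained by each principal component
num_vars <- c("disp", "drat", "hp", "mpg", "qsec", "wt")

# scale. = TRUE standardises the variables, like scale.unit = TRUE in PCA()
pca <- prcomp(mtcars[, num_vars], scale. = TRUE)

# Squared standard deviations are the component variances;
# dividing by their sum gives the proportion of variance explained
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Coordinates of the first cars on the first two components
print(head(pca$x[, 1:2]))
```

If the first two proportions add up to a large part of the total, the scatterplot of the first two components is a fair summary of the six variables.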

Note: this section needs improvement

Getting a correlation matrix

It is important to note that this kind of dataset can be converted to a correlation matrix, which is a kind of adjacency matrix. Indeed, we can compute the correlation between each pair of variables, or each pair of entities, and try to visualize this new dataset. But this is another story: how to visualize an adjacency matrix.
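
As a minimal sketch of this conversion (base R only), the correlation can be computed between variables or, after transposing, between cars. The names `cor_vars`, `cor_cars` and `adj`, as well as the 0.7 threshold, are assumptions made for illustration.

```r
# Illustrative sketch: from a numeric table to an adjacency matrix
num_vars <- c("disp", "drat", "hp", "mpg", "qsec", "wt")

# Standardise first so that every variable has the same weight
x <- scale(mtcars[, num_vars])

# Correlation between each pair of variables (6 x 6)
cor_vars <- cor(x)

# Correlation between each pair of cars (32 x 32): transpose, then correlate
cor_cars <- cor(t(x))

# Binary adjacency matrix: link two variables when |correlation| > 0.7
# (the 0.7 cut-off is an arbitrary choice for this example)
adj <- abs(cor_vars) > 0.7
diag(adj) <- FALSE
print(adj)
```

Either matrix can then be fed to a network or heatmap visualization, with the correlation used as the edge weight.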

Going further

You can learn more about each type of graphic presented in this story in the dedicated sections.

A work by Yan Holtz for data-to-viz.com