PLSC on Wines

R code
## Load the packages (dplyr provides recode(), select(), and the pipe %>%)
library(dplyr)
## Load the course library
library(R4SPISE2022)
## Load the data
data("winesOf3Colors", package = "data4PCCAR")
## Define colors...
### of the wines...
wineColors <- list(
  oc = cbind(as.character(recode(
    winesOf3Colors$winesDescriptors$color,
    red = 'indianred4', 
    white = 'gold',
    rose = 'lightpink2'))),
  gc = cbind(c(red = 'indianred4', 
                white = 'gold', rose = 'lightpink2'))
)
### ... and of the variables
varColors <- list(
  oc = list(
    cbind(rep("darkorange1", 4)),
    cbind(rep("olivedrab3", 8))
  )
)

Two data tables, descriptors, and supplementary information

R code
descr <- winesOf3Colors$winesDescriptors %>% 
    select(origin, color, varietal)
suppl <- winesOf3Colors$winesDescriptors %>% 
    select(Price)
chemi <- winesOf3Colors$winesDescriptors %>% 
    select(Acidity, Alcohol, Sugar, Tannin)
senso <- winesOf3Colors$winesDescriptors %>% 
    select(fruity, floral, vegetal, 
           spicy, woody, sweet, astringent, 
           hedonic)

Two data tables

  • First table: chemical data (chemi) that includes acidity, alcohol, sugar, and tannin.
  • Second table: sensory data (senso) that includes fruity, floral, vegetal, spicy, woody, sweet, astringent, and hedonic.

Descriptors and supplementary variable

  • Descriptors: origin, color, and varietal.
  • Supplementary variable: Price.
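
A quick structural check of these objects (a small sketch; the numbers of variables follow from the `select()` calls above):

R code
dim(chemi)        # number of wines x 4 chemical variables
dim(senso)        # number of wines x 8 sensory variables
colnames(descr)   # origin, color, varietal
colnames(suppl)   # Price
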
R code
## Run Partial Least Squares Correlation (PLS-C) on the two tables

res.pls <- tepPLS(chemi, senso, DESIGN = descr$color, graphs = FALSE)
### By default, this function centers (means equal to 0) and scales (*sums of squares*
### equal to 1) all variables in both data tables. The argument `DESIGN` indicates the
### groups of the observations and changes how the observations are colored in the
### figures (when `graphs = TRUE`), but it does not change the results of PLS-C.
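
### A quick illustration of this preprocessing (a sketch, not the internal code of
### tepPLS): centering and scaling to a sum of squares of 1 amounts to dividing the
### centered variable by the square root of its sum of squares.
x <- chemi$Alcohol
x.scaled <- (x - mean(x)) / sqrt(sum((x - mean(x))^2))
c(mean = mean(x.scaled), SS = sum(x.scaled^2)) # approximately 0 and exactly 1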

## Generate the figures

res.pls.plot <- TTAplot(
    res = res.pls, # Output of tepPLS
    color.obs = wineColors, # <optional> colors of wines
    color.tab = varColors, # <optional> colors of the two tables
    tab1.name = "Chemical data", # <optional> Name of Table 1 (for printing)
    tab2.name = "Sensory data", # <optional> Name of Table 2 (for printing)
    DESIGN = descr$color, # design for the wines
    tab1 = chemi,  # First data table
    tab2 = senso)  # Second data table

In `TTAplot`, if `DESIGN` is specified, the latent variables are colored according to the groups of the observations, with the group means and their 95% bootstrap confidence intervals.

Correlation between the two tables

We can first check the data by plotting the correlation matrix between the two data tables. This correlation matrix is the matrix from which the dimensions are extracted.

R code
res.pls.plot$results.graphs$heatmap.rxy

Scree plot

The scree plot shows the eigenvalue of each dimension. These eigenvalues give the squared covariance of each pair of latent variables; in other words, the singular values, which are the square roots of the eigenvalues, give the covariances of these pairs of latent variables. The sum of the eigenvalues is equal to the sum of the squared covariances between each variable of the first table and each variable of the second table.

R code
res.pls.plot$results.graphs$scree.eig
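
As a quick check of this property, the sum of the eigenvalues should match the sum of the squared correlations between the chemical and the sensory variables (a sketch assuming the standard `TExPosition` output structure, where the eigenvalues are stored in `res.pls$TExPosition.Data$eigs`; with the default preprocessing, the covariances of the scaled variables equal their correlations).

R code
## Sum of the eigenvalues (assumed to be stored in $TExPosition.Data$eigs)
sum(res.pls$TExPosition.Data$eigs)
## Sum of the squared correlations between the two tables
sum(cor(chemi, senso)^2)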

Latent variables

Here, we plot the first latent variable of the first table against the first latent variable of the second table, with the observations colored according to their groups. This plot shows how the observations are distributed along the dimension and how the chosen pair of latent variables relate to each other. When plotting the first pair of latent variables, we expect the observations to lie along the bottom-left-to-top-right diagonal (which would illustrate a perfect association), because PLS-C maximizes the covariance of the latent variables.

To examine the stability of these groups, we plot the group means with their 95% bootstrap confidence intervals (or ellipsoids). If the ellipses do not overlap, the groups are reliably different from each other. However, it is worth noting that the distribution of the individual observations does not indicate how the groups (represented by their means) are distributed or whether the groups are reliably different from each other.

Note: The grouping information is independent of PLS-C and is only used to help provide a summary description of the observations.
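
As an illustration of what these group means represent, they correspond to averaging the latent-variable scores within each wine color (a sketch assuming the latent variables are stored in `res.pls$TExPosition.Data$lx`, as in the standard `TExPosition` output).

R code
## Latent variable 1 of the chemical table, averaged by wine color
lv1.chemi <- res.pls$TExPosition.Data$lx[, 1]
tapply(lv1.chemi, descr$color, mean)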

R code
res.pls.plot$results.graphs$lv.plot

The results from Dimension 1 show that the association between the chemical and the sensory data reliably separates the red wines from rose and white wines.

Contributions

These bar plots illustrate the signed contributions of the variables from the two data tables. From these figures, we use the direction and the magnitude of the signed contributions to interpret the dimension.

The direction of the signed contribution is the direction of the loadings, and it shows how the variables contribute to the dimension. The variables that contribute in a similar way have the same sign, and those that contribute in an opposite way will have different signs.

The magnitudes of the contributions are computed as the squared loadings, and they quantify the amount of variance contributed by each variable. Therefore, a contribution is similar to the idea of an effect size. To identify the important variables, we find the variables that contribute more than average (i.e., with a big enough effect size). When the variables are centered and scaled to have their sums of squares equal to 1, each variable contributes one unit of variance; therefore, the average contribution is 1/(number of variables in the table).
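
The sketch below shows this computation (assuming the loadings are stored in `res.pls$TExPosition.Data$pdq$p` and `$pdq$q`, as in the standard `TExPosition` output): the contributions are the squared loadings, and a variable is flagged as important when its contribution exceeds the average contribution for its table.

R code
## Contributions of Dimension 1 as squared loadings
ctr.chemi <- res.pls$TExPosition.Data$pdq$p[, 1]^2
ctr.senso <- res.pls$TExPosition.Data$pdq$q[, 1]^2
## Important variables: contribution larger than the average contribution
colnames(chemi)[ctr.chemi > 1 / ncol(chemi)]
colnames(senso)[ctr.senso > 1 / ncol(senso)]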

R code
res.pls.plot$results.graphs$ctrX.plot

R code
res.pls.plot$results.graphs$ctrY.plot

From these two bar plots, the first dimension is characterized by (1) the positive association between Alcohol and Tannin from the Chemical data and Woody and Astringent from the Sensory data, and (2) the negative association between these variables and Hedonic from the Sensory data.

Together with the latent variable plot, we find that, compared to the rose and the white wines in the sample, the red wines are lower on Hedonic and stronger in Alcohol, Tannin, Woody, and Astringent.

Circles of correlations

The circle of correlations illustrates how the variables correlate with each other and with the dimensions. In this figure, the length of an arrow indicates how well the variable is explained by the two plotted dimensions. The cosine of the angle between any two arrows gives their correlation, and the cosine of the angle between a variable and an axis gives the correlation between that variable and the corresponding dimension.

In this figure, an angle close to 0° indicates a correlation close to 1; an angle close to 180° indicates a correlation close to -1; and a 90° angle indicates a correlation of 0. However, it is worth noting that this interpretation of the correlation holds only within the plotted dimensions. When a variable is far from the circle, it is not fully explained by these dimensions, and other dimensions might be characterized by other patterns of relationships between this variable and the others.
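
The coordinates underlying these plots can be recovered as plain correlations between the variables and the latent variables (a sketch assuming the latent variables are stored in `res.pls$TExPosition.Data$lx` and `$ly`, as in the standard `TExPosition` output).

R code
## Correlations between the chemical variables and latent variables 1 and 2
round(cor(chemi, res.pls$TExPosition.Data$lx[, 1:2]), 2)
## Correlations between the sensory variables and latent variables 1 and 2
round(cor(senso, res.pls$TExPosition.Data$ly[, 1:2]), 2)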

R code
res.pls.plot$results.graphs$cirCorX.plot

R code
res.pls.plot$results.graphs$cirCorY.plot

These circles of correlations show that Alcohol, Tannin, Woody, Astringent, and Hedonic are strongly correlated with Dimension 1, with Hedonic inversely correlated with all the other variables. These variables are mostly explained by the first dimension and have close-to-zero correlations with the second dimension (which is not discussed in the previous sections).

Inference plots and results

The inference analysis of PLS-C (performed by `TTAplotInference`) includes the bootstrap test of the proportion of variance explained, the permutation test of the eigenvalues, and the bootstrap tests of the loadings (i.e., the left and the right singular vectors).

R code
res.plot.plsc.inference <- TTAplotInference(
    res = res.pls,
    tab1 = chemi, 
    tab2 = senso, 
    DESIGN = descr$color,
    tab1.name = "Chemical data", # <optional> Name of Table 1 (for printing)
    tab2.name = "Sensory data" # <optional> Name of Table 2 (for printing)    
    )

Bootstrap and permutation tests on the eigenvalues

The inference scree plot illustrates the 95% bootstrap confidence intervals of the eigenvalues. If an interval excludes 0, the eigenvalue is reliably larger than 0.

R code
res.plot.plsc.inference$results.graphs$scree

The bootstrap test identifies two dimensions with eigenvalues reliably larger than 0, and the permutation test also identifies these two dimensions as significant.
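
A minimal sketch of what these two tests do (not the implementation used by `TTAplotInference`): resampling the wines with replacement gives bootstrap confidence intervals for the eigenvalues, and permuting the rows of one table breaks the association between the tables and gives a null distribution for the first eigenvalue.

R code
set.seed(42)
## Bootstrap: resample the wines and recompute the eigenvalues
eig.boot <- replicate(1000, {
  idx <- sample(nrow(chemi), replace = TRUE)
  svd(cor(chemi[idx, ], senso[idx, ]))$d^2
})
apply(eig.boot, 1, quantile, probs = c(.025, .975)) # 95% CIs, one column per dimension
## Permutation: shuffle the rows of one table and recompute the first eigenvalue
eig1.obs  <- svd(cor(chemi, senso))$d[1]^2
eig1.perm <- replicate(1000, svd(cor(chemi[sample(nrow(chemi)), ], senso))$d[1]^2)
mean(eig1.perm >= eig1.obs) # permutation p-value for Dimension 1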

Bootstrap test on the loadings

The bar plot illustrates the bootstrap ratios, which equal \[\frac{M_{p_{j,\mathrm{boot}}}}{SD_{p_{j,\mathrm{boot}}}},\] where \(M_{p_{j,\mathrm{boot}}}\) is the mean of the bootstrapped samples of the jth loading and \(SD_{p_{j,\mathrm{boot}}}\) is their standard deviation. A bootstrap ratio is equivalent to a t-statistic for the loading with \(\mathrm{H}_0: p_j = 0\). The threshold is set to 2 to approximate the critical t-value of 1.96 at \(\alpha\) = .05.
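
A minimal sketch of how such a bootstrap ratio could be computed for the loadings of the chemical table on Dimension 1 (not the implementation used by `TTAplotInference`).

R code
set.seed(42)
p1.obs <- svd(cor(chemi, senso))$u[, 1]              # observed first loadings of Table 1
p1.boot <- replicate(1000, {
  idx <- sample(nrow(chemi), replace = TRUE)         # resample the wines
  p1  <- svd(cor(chemi[idx, ], senso[idx, ]))$u[, 1] # recompute the first loadings
  p1 * sign(sum(p1 * p1.obs))                        # align the arbitrary sign of the SVD
})
## Bootstrap ratio: mean of the bootstrapped loadings divided by their standard deviation
boot.ratios <- rowMeans(p1.boot) / apply(p1.boot, 1, sd)
setNames(round(boot.ratios, 2), colnames(chemi))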

R code
res.plot.plsc.inference$results.graphs$BR.X

The bootstrap test identifies Alcohol, Sugar, and Tannin as chemical variables with loadings significantly different from 0.

R code
res.plot.plsc.inference$results.graphs$BR.Y

The bootstrap test identifies Floral, Spicy, Woody, Sweet, Astringent, and Hedonic as sensory variables with loadings significantly different from 0.

It is worth noting that the bootstrap ratios give different information than the contributions: contributions describe the size of the effect, whereas bootstrap ratios describe the reliability of the loadings. Therefore, when we interpret the results, we combine both to draw a conclusion.

Interpreting loadings with contributions and bootstrap ratios

The results from the Chemical data show that Alcohol, Sugar, and Tannin all have loadings stably (or significantly) different from 0, but only Alcohol and Tannin contribute a significant amount of variance to this dimension. Although the loading of Sugar is stable and significantly different from 0, its effect size is small; in other words, it is not considered important.

Similarly, for the Sensory data, the results show that Floral, Spicy, Woody, Sweet, Astringent, and Hedonic all have loadings stably (or significantly) different from 0, but only Woody, Astringent, and Hedonic contribute a significant amount of variance to this dimension. In other words, the loadings of Floral, Spicy, Woody, Sweet, Astringent, and Hedonic are stable and reliably different from 0, but the important variables that define this dimension are Woody, Astringent, and Hedonic.