关于R语言的笔记 - 续 | Underlake Reading Room

接着上一回的笔记.
上次收录了些基础语法,这次是case study.

R和python中的pandas有不少相似之处.

Tutorial 2: Pokémon Analysis

In this document we begin to take a look at the pokemon.csv dataset introduced in the first lecture. This exercise is intended to provide additional exposure to R and Markdown.

Load in the data and summarize it

1 2	setwd("/Users/nstevens/Dropbox/Teaching/STAT_341/Winter_2021/Tutorials/Tutorial 2/") poke <- read.csv(file = "pokemon.csv", header = TRUE)

Now that we’ve loaded the data, let’s look at a summary of it:

1	summary(poke)

For purposes of the tutorial, let’s examine the base_happiness, speed, height_m and weight_kg variates more closely. Here are the values of these variates for the first 15 Pokémon:

1	head(x = poke[,c(4, 8, 9, 10)], n = 15)

Let’s construct a $4 \times 4$ matrix of plots that visually summarize these:

poke_df_for_plot <- poke[,c(4, 8, 9, 10)]
header <- names(poke_df_for_plot)
par(mfrow = c(4,4))
for(i in 1:4){
  for(j in 1:4){
    if(i == j){
      hist(poke_df_for_plot[,i], main = "", xlab = header[i], col = "dodgerblue3")
    }else{
      plot(x = poke_df_for_plot[,i], y = poke_df_for_plot[,j],
           xlab = header[i], ylab = header[j],
           pch = 16, col = adjustcolor(col = "dodgerblue4", alpha.f = 0.3))
    }
  }
}

Since base_happiness takes on just a handful of integer values, let’s jitter this variate to try and make the plots more informative.

poke_df_for_plot$base_happiness <- jitter(poke_df_for_plot$base_happiness, factor = 3)
par(mfrow = c(4,4))
for(i in 1:4){
  for(j in 1:4){
    if(i == j){
      hist(poke_df_for_plot[,i], main = "", xlab = header[i], col = "dodgerblue3")
    }else{
      plot(x = poke_df_for_plot[,i], y = poke_df_for_plot[,j],
           xlab = header[i], ylab = header[j],
           pch = 16, col = adjustcolor(col = "dodgerblue4", alpha.f = 0.3))
    }
  }
}

Let’s also calculate the correlations between these variables:

1	cor(poke[,c(4, 8, 9, 10)])

Questions:

Do faster Pokémon tend to be happier?
Do taller Pokémon tend to be happier?
Do heavier Pokémon tend to be happier?

Answers:

Who is the tallest Pokémon?

# this is the maximum height:
max(poke$height_m)
# this is the row in the dataset in which the maximum height is found:
which(poke$height_m == max(poke$height_m))
# this is the corresponding Pokémon:
poke$name[which(poke$height_m == max(poke$height_m))]

Note that in the code above, rather than the call

1
2
3

which(poke$height_m == max(poke$height_m))
# we could have equivalently used
which.max(poke$height_m)

Which are the 10 heaviest Pokémon?

# these are the weights sorted from largest to smallest
sort(x = poke$weight_kg, decreasing = TRUE)
# these are the 10 largest recorded weights
sort(x = poke$weight_kg, decreasing = TRUE)[1:10]
# these are the rows containing the 10 largest recorded weights
which(poke$weight_kg >= sort(x = poke$weight_kg, decreasing = TRUE)[10])
# these are the corresponding Pokémon
poke$name[which(poke$weight_kg >= sort(x = poke$weight_kg, decreasing = TRUE)[10])]

Construct a histogram of hp with the mean identified

hist(x = poke$hp, col = "mistyrose", main = "Histogram of HP", xlab = "HP")
abline(v = mean(poke$hp), col = "navyblue", lwd = 2)
text(x = 70, y = 200, labels = paste("Mean = ", round(mean(poke$hp), 2)),
     col = "navyblue", pos = 4)

Plot the influence ( $\Delta$ ) of each Pokémon on the average hp

y <- poke$hp
delta <- (y-mean(y))/(length(y)-1)
par(mfrow = c(1,2))
plot(delta,
     main = "Influence for Average HP", ylab = bquote(Delta),
     pch = 16, col = adjustcolor("mediumvioletred", 0.3))
plot(y, delta,
     main = "Influence vs. HP", ylab = bquote(Delta), xlab = "HP",
     pch = 16, col = adjustcolor("mediumvioletred", 0.3))

Which two Pokémon have the largest influence?

1	poke$name[which(delta > 0.2)]

Which two Pokémon have the highest HP?

1	poke$name[order(poke$hp, decreasing = TRUE)][1:2]

What are the unique types of Pokémon?

1	unique(poke$type1)

How many Pokémon of each type1 are their?

1	table(poke$type1)

Calculate the average happiness for each type1 of Pokémon

1
2
3

aggregate(x = poke$base_happiness,
          by = list(Type = poke$type1),
          FUN = mean)

Tutorial 3: Pokémon Analysis - continue

1
2
3

library(knitr)
opts_chunk$set(tidy.opts=list(width.cutoff=70),tidy=TRUE)
poke <- read.csv("/Users/nstevens/Dropbox/Teaching/STAT_341/Winter_2021/Tutorials/Tutorial 3/pokemon.csv", header = TRUE)

1

Create a $1 \times 3$ plot which includes:

A histogram of speed
A histogram of weight_kg
A scatterplot of speed vs. weight_kg

par(mfrow = c(1,3))
hist(x = poke$speed, main = "Histogram of Speed", xlab = "Speed",
     col = adjustcolor(col = "darkorchid", alpha = 0.4))
hist(x = poke$weight_kg, main = "Histogram of Weight", xlab = "Weight (kg)",
     col = adjustcolor(col = "navyblue", alpha = 0.4))
plot(x = poke$weight_kg, y = poke$speed,
     main = "Scatterplot of Speed vs. Weight", xlab = "Weight (kg)", ylab = "Speed",
     pch = 16, col = adjustcolor(col = "firebrick", alpha = 0.4))

2

Find a range of $\alpha$ values that would help to “symmetrize” the speed distribution. Plot the histogram of $T_{\alpha}(\texttt{speed})$ using the value of $\alpha$ that seems most appropriate.

(也就是第一个作业里的Power Transformations)
第一张图是right-skewed, bump is towards the lower value, consider alpha value less than 1.

powerfun <- function(x, alpha) {
  if(sum(x <= 0) > 0) stop("x must be positive")
  if (alpha == 0)
    log(x)
  else if (alpha > 0) {
    x^alpha
  } else -x^alpha
}

par(mfrow = c(1,9), mar = c(1,1,1,1))
alpha <- seq(from = 1, to = -1, by = -0.25)
for(i in 1:length(alpha)){
  hist(x = powerfun(x = poke$speed, alpha = alpha[i]),
       main = bquote(alpha == .(alpha[i])), xlab = "",
       col = adjustcolor(col = "darkorchid", alpha = 0.4))
}

Based on the figure above, $\alpha = 0.5$ seems to do the best job. Here is a plot of the square-root-transformed speed data:

par(mfrow = c(1,1))
hist(x = powerfun(x = poke$speed, alpha = 0.5),
     main = "Histogram of Square-Root-Transformed Speed Data",
     xlab = expression(sqrt(Speed)), col = adjustcolor(col = "darkorchid", alpha = 0.4))

3

Find a range of $\alpha$ values that would help to “symmetrize” the weight distribution. Plot the histogram of $T_{\alpha}(\texttt{weight})$ using the value of $\alpha$ that seems most appropriate.

par(mfrow = c(1,9), mar = c(1,1,1,1))
alpha <- seq(from = 1, to = -1, by = -0.25)
for(i in 1:length(alpha)){
  hist(x = powerfun(x = poke$weight_kg, alpha = alpha[i]),
       main = bquote(alpha == .(alpha[i])), xlab = "",
       col = adjustcolor(col = "navyblue", alpha = 0.4))
}

Based on the figure above, $\alpha \in [0,0.25]$ looks good. Let’s refine our search:

par(mfrow = c(1,6), mar = c(1,1,1,1))
alpha <- seq(from = 0.25, to = 0, by = -0.05)
for(i in 1:length(alpha)){
  hist(x = powerfun(x = poke$weight_kg, alpha = alpha[i]),
       main = bquote(alpha == .(alpha[i])), xlab = "",
       col = adjustcolor(col = "navyblue", alpha = 0.4))
}

Based on the plots above, choosing $\alpha = 0.1$ seems like it should work well. Here is a plot of this power-transformed weight data:

par(mfrow = c(1,1))
hist(x = powerfun(x = poke$weight_kg, alpha = 0.1),
     main = "Histogram of Power-Transformed Weight Data",
     xlab = bquote(Weight^0.1), col = adjustcolor(col = "navyblue", alpha = 0.4))

4

Cast the ‘optimal power transformation’ problem in the univariate setting as an implicitly defined attribute problem and determine the optimal $\alpha$ values for both the speed and weight data. Construct histograms of these optimally-power-transformed variates.

Let $\alpha$ be the attribute of interest, and let $t_u = T_{\alpha}(y_u)$ be the power-transformed observation for unit $u$ . Let us also define the objective function to be minimized as the absolute value of Pearson’s moment coefficient of skewness for the transformed data: $\rho(\alpha;\mathcal{P}) = \left|\frac{\frac{1}{N}\sum_{u \in \mathcal{P}}\left(t_u-\bar{t}\right)^3}{[SD(t)]^3}\right|$

Thus the problem of finding the optimal $\alpha$ reduces to the following optimization problem: $\widehat{\alpha} = \operatorname*{arg min}_{\alpha \in \mathbb{R}}\rho(\alpha;\mathcal{P})$

In R let’s define $\rho(\alpha;\mathcal{P})$ and then use the nlminb function to minimize it.

rho_skew <- function(theta, y){
  t <- powerfun(y, alpha = theta)
  skew <- mean((t - mean(t))^3) / sd(t)^3
  return(abs(skew))

Let’s find the optimal $\alpha$ for the speed data and plot the resulting histogram:

min1 <- nlminb(start = 1, objective = rho_skew, y = poke$speed)
print(min1)
hist(x = powerfun(x = poke$speed, alpha = min1$par),
     main = "Histogram of Optimally Power-Transformed Speed Data",
     xlab = "T(Speed)", col = adjustcolor(col = "darkorchid", alpha = 0.4))
mtext(text = bquote("(" ~ alpha == .(min1$par) ~ ")"), side = 3)

Let’s find the optimal $\alpha$ for the weight data and plot the resulting histogram:

min2 <- nlminb(start = 1, objective = rho_skew, y = poke$weight_kg)
print(min2)
hist(x = powerfun(x = poke$weight_kg, alpha = min2$par),
     main = "Histogram of Optimally Power-Transformed Weight Data",
     xlab = "T(Weight)", col = adjustcolor(col = "navyblue", alpha = 0.4))
mtext(text = bquote("("~ alpha == .(min2$par) ~ ")"), side = 3)

Notice that in both cases the “optimal $\alpha$ “ was not far from the values we determined using Bump Rule #1. In general, although we can always find an optimal power transformation algorithmically (like we did above), the ladder heuristc tends to work pretty well.

5

Find a range of $\alpha_x$ and $\alpha_y$ values that would help to “linearize” the relationship between $T_{\alpha_y}(\texttt{speed})$ and $T_{\alpha_x}(\texttt{weight})$ . Plot the scatterplot of $T_{\alpha_y}(\texttt{speed})$ vs. $T_{\alpha_x}(\texttt{weight})$ using the values of $\alpha_y$ and $\alpha_x$ that seem most appropriate.

alpha_x <- seq(from = 1, to = -1, length.out = 5)
alpha_y <- seq(from = 1, to = -1, length.out = 5)

corMat <- matrix(0, nrow = length(alpha_x), ncol = length(alpha_y))

par(mfrow = c(length(alpha_x),length(alpha_y)), mar = c(1,1,1.5,1))
for(i in  1:length(alpha_x)){
  for(j in 1:length(alpha_y)){
    plot(x = powerfun(poke$weight_kg, alpha = alpha_x[i]), y = powerfun(poke$speed, alpha = alpha_y[j]),
         main = bquote(alpha[x] == .(alpha_x[i]) ~ ", " ~ alpha[y] == .(alpha_y[j])),
         xlab = "", ylab = "", xaxt = "n", yaxt = "n",
         pch = 16, col = adjustcolor(col = "firebrick", alpha = 0.4))
    corMat[i,j] <- cor(powerfun(poke$weight_kg, alpha = alpha_x[i]), powerfun(poke$speed, alpha = alpha_y[j]))
  }
}
print(corMat)

The plot and matrix of correlations above suggest that $\alpha_x=0$ and $\alpha_y=0$ provide power transformations such that the relationship between $T_{\alpha_y}(\texttt{speed})$ and $T_{\alpha_x}(\texttt{weight})$ is most linear. Further refinement of the $\alpha$ ‘s is definitely possible, but a log-transformation is nice and easy to interpret so let’s go ahead with it:

par(mfrow = c(1,1))
plot(x = powerfun(poke$weight_kg, alpha = 0), y = powerfun(poke$speed, alpha = 0),
         main = "Scatterplot of Log-Transformed Variates",
         xlab = "log(Weight)", ylab = "log(Speed)",
         pch = 16, col = adjustcolor(col = "firebrick", alpha = 0.4))

6

Cast the ‘optimal power transformation’ problem in the bivariate setting as an implicitly defined attribute problem and determine the optimal $\boldsymbol\alpha$ value to “straighten” the scatterplot of speed versus weight. Construct a scatterplot using the optimally-power-transformed variates.

Let $\boldsymbol\alpha = (\alpha_x, \alpha_y)$ be the attribute of interest, and let $t_u = T_{\alpha_y}(y_u)$ and $v_u = T_{\alpha_x}(x_u)$. Let us also define the objective function to be minimized as the negative of the absolute value of the correlation coefficient between the transformed variates (since we actually want to maximize this): $\rho(\boldsymbol\alpha;\mathcal{P}) = -\left|\frac{\sum_{u \in \mathcal{P}}(v_u-\bar{v})(t_u-\bar{t})}{\sqrt{\sum_{u \in \mathcal{P}}(v_u-\bar{v})^2}\sqrt{\sum_{u \in \mathcal{P}}(t_u-\bar{t})^2}}\right|$

Thus the problem of finding the optimal $\boldsymbol\alpha$ reduces to the following optimization problem: $\widehat{\boldsymbol\alpha} = \operatorname*{arg min}_{\boldsymbol\alpha \in \mathbb{R}^2}\rho(\boldsymbol\alpha;\mathcal{P})$

In R let’s define $\rho(\boldsymbol\alpha;\mathcal{P})$ and then use the nlminb function to minimize it.

rho_cor <- function(alpha, x, y){
  alphax <- alpha[1]
  alphay <- alpha[2]
  return(-abs(cor(x = powerfun(x, alpha = alphax), y = powerfun(y, alpha = alphay))))
}

Let’s find the optimal $\boldsymbol\alpha$ for the scatterplot of speed versus weight and plot the resulting optimally transformed data:

min3 <- nlminb(start = c(1,1), objective = rho_cor, x = poke$weight_kg, y = poke$speed)
print(min3)
plot(x = powerfun(poke$weight_kg, alpha = min3$par[1]), y = powerfun(poke$speed, alpha = min3$par[2]),
         main = "Scatterplot of Optimally Power-Transformed Data",
         xlab = "T(Weight)", ylab = "T(Speed)",
         pch = 16, col = adjustcolor(col = "firebrick", alpha = 0.4))
mtext(text = bquote("(" ~ alpha[x] == .(min3$par)[1] ~ " , " ~ alpha[y] == .(min3$par[2]) ~ ")"), side = 3)

Notice the “optimal correlation” is 0.1431949, which is not so different from the one we found (0.14163987) using Bump Rule #2
– and our transformations are much more easily interpreted. In general, although we can always find an optimal power transformation algorithmically (like we did above), the ladder heuristc tends to work pretty well.