接着上一回的笔记.
上次收录了些基础语法,这次是case study.
R和python中的pandas有不少相似之处.
Tutorial 2: Pokémon Analysis
In this document we begin to take a look at the pokemon.csv
dataset introduced in the first lecture. This exercise is intended to provide additional exposure to R and Markdown.
Load in the data and summarize it
1 | setwd("/Users/nstevens/Dropbox/Teaching/STAT_341/Winter_2021/Tutorials/Tutorial 2/") |
Now that we’ve loaded the data, let’s look at a summary of it:
1 | summary(poke) |
For purposes of the tutorial, let’s examine the base_happiness
, speed
, height_m
and weight_kg
variates more closely. Here are the values of these variates for the first 15 Pokémon:
1 | head(x = poke[,c(4, 8, 9, 10)], n = 15) |
Let’s construct a $4 \times 4$
matrix of plots that visually summarize these:
1 | poke_df_for_plot <- poke[,c(4, 8, 9, 10)] |
Since base_happiness
takes on just a handful of integer values, let’s jitter this variate to try and make the plots more informative.
1 | poke_df_for_plot$base_happiness <- jitter(poke_df_for_plot$base_happiness, factor = 3) |
Let’s also calculate the correlations between these variables:
1 | cor(poke[,c(4, 8, 9, 10)]) |
Questions:
- Do faster Pokémon tend to be happier?
- Do taller Pokémon tend to be happier?
- Do heavier Pokémon tend to be happier?
Answers:
- No
- No
- No
Who is the tallest Pokémon?
1 | # this is the maximum height: |
Note that in the code above, rather than the call
1 | which(poke$height_m == max(poke$height_m)) |
Which are the 10 heaviest Pokémon?
1 | # these are the weights sorted from largest to smallest |
Construct a histogram of hp
with the mean identified
1 | hist(x = poke$hp, col = "mistyrose", main = "Histogram of HP", xlab = "HP") |
Plot the influence ($\Delta$
) of each Pokémon on the average hp
1 | y <- poke$hp |
Which two Pokémon have the largest influence?
1 | poke$name[which(delta > 0.2)] |
Which two Pokémon have the highest HP?
1 | poke$name[order(poke$hp, decreasing = TRUE)][1:2] |
What are the unique types of Pokémon?
1 | unique(poke$type1) |
How many Pokémon of each type1
are their?
1 | table(poke$type1) |
Calculate the average happiness for each type1
of Pokémon
1 | aggregate(x = poke$base_happiness, |
Tutorial 3: Pokémon Analysis - continue
1 | library(knitr) |
1
Create a $1 \times 3$
plot which includes:
- A histogram of
speed
- A histogram of
weight_kg
- A scatterplot of
speed
vs.weight_kg
1 | par(mfrow = c(1,3)) |
2
Find a range of $\alpha$
values that would help to “symmetrize” the speed
distribution. Plot the histogram of $T_{\alpha}(\texttt{speed})$
using the value of $\alpha$
that seems most appropriate.
(也就是第一个作业里的Power Transformations)
第一张图是right-skewed, bump is towards the lower value, consider alpha value less than 1.
1 | powerfun <- function(x, alpha) { |
1 | par(mfrow = c(1,9), mar = c(1,1,1,1)) |
Based on the figure above, $\alpha = 0.5$
seems to do the best job. Here is a plot of the square-root-transformed speed
data:
1 | par(mfrow = c(1,1)) |
3
Find a range of $\alpha$
values that would help to “symmetrize” the weight
distribution. Plot the histogram of $T_{\alpha}(\texttt{weight})$
using the value of $\alpha$
that seems most appropriate.
1 | par(mfrow = c(1,9), mar = c(1,1,1,1)) |
Based on the figure above, $\alpha \in [0,0.25]$
looks good. Let’s refine our search:
1 | par(mfrow = c(1,6), mar = c(1,1,1,1)) |
Based on the plots above, choosing $\alpha = 0.1$
seems like it should work well. Here is a plot of this power-transformed weight
data:
1 | par(mfrow = c(1,1)) |
4
Cast the ‘optimal power transformation’ problem in the univariate setting as an implicitly defined attribute problem and determine the optimal $\alpha$
values for both the speed
and weight
data. Construct histograms of these optimally-power-transformed variates.
Let $\alpha$
be the attribute of interest, and let $t_u = T_{\alpha}(y_u)$
be the power-transformed observation for unit $u$
. Let us also define the objective function to be minimized as the absolute value of Pearson’s moment coefficient of skewness for the transformed data: $\rho(\alpha;\mathcal{P}) = \left|\frac{\frac{1}{N}\sum_{u \in \mathcal{P}}\left(t_u-\bar{t}\right)^3}{[SD(t)]^3}\right|$
Thus the problem of finding the optimal $\alpha$
reduces to the following optimization problem: $\widehat{\alpha} = \operatorname*{arg min}_{\alpha \in \mathbb{R}}\rho(\alpha;\mathcal{P})$
In R
let’s define $\rho(\alpha;\mathcal{P})$
and then use the nlminb
function to minimize it.
1 | rho_skew <- function(theta, y){ |
Let’s find the optimal $\alpha$
for the speed
data and plot the resulting histogram:
1 | min1 <- nlminb(start = 1, objective = rho_skew, y = poke$speed) |
Let’s find the optimal $\alpha$
for the weight
data and plot the resulting histogram:
1 | min2 <- nlminb(start = 1, objective = rho_skew, y = poke$weight_kg) |
Notice that in both cases the “optimal $\alpha$
“ was not far from the values we determined using Bump Rule #1. In general, although we can always find an optimal power transformation algorithmically (like we did above), the ladder heuristc tends to work pretty well.
5
Find a range of $\alpha_x$
and $\alpha_y$
values that would help to “linearize” the relationship between $T_{\alpha_y}(\texttt{speed})$
and $T_{\alpha_x}(\texttt{weight})$
. Plot the scatterplot of $T_{\alpha_y}(\texttt{speed})$
vs. $T_{\alpha_x}(\texttt{weight})$
using the values of $\alpha_y$
and $\alpha_x$
that seem most appropriate.
1 | alpha_x <- seq(from = 1, to = -1, length.out = 5) |
The plot and matrix of correlations above suggest that $\alpha_x=0$
and $\alpha_y=0$
provide power transformations such that the relationship between $T_{\alpha_y}(\texttt{speed})$
and $T_{\alpha_x}(\texttt{weight})$
is most linear. Further refinement of the $\alpha$
‘s is definitely possible, but a log-transformation is nice and easy to interpret so let’s go ahead with it:
1 | par(mfrow = c(1,1)) |
6
Cast the ‘optimal power transformation’ problem in the bivariate setting as an implicitly defined attribute problem and determine the optimal $\boldsymbol\alpha$
value to “straighten” the scatterplot of speed
versus weight
. Construct a scatterplot using the optimally-power-transformed variates.
Let $\boldsymbol\alpha = (\alpha_x, \alpha_y)$
be the attribute of interest, and let $t_u = T_{\alpha_y}(y_u)$ and $v_u = T_{\alpha_x}(x_u)$
. Let us also define the objective function to be minimized as the negative of the absolute value of the correlation coefficient between the transformed variates (since we actually want to maximize this): $\rho(\boldsymbol\alpha;\mathcal{P}) = -\left|\frac{\sum_{u \in \mathcal{P}}(v_u-\bar{v})(t_u-\bar{t})}{\sqrt{\sum_{u \in \mathcal{P}}(v_u-\bar{v})^2}\sqrt{\sum_{u \in \mathcal{P}}(t_u-\bar{t})^2}}\right|$
Thus the problem of finding the optimal $\boldsymbol\alpha$
reduces to the following optimization problem: $\widehat{\boldsymbol\alpha} = \operatorname*{arg min}_{\boldsymbol\alpha \in \mathbb{R}^2}\rho(\boldsymbol\alpha;\mathcal{P})$
In R
let’s define $\rho(\boldsymbol\alpha;\mathcal{P})$
and then use the nlminb
function to minimize it.
1 | rho_cor <- function(alpha, x, y){ |
Let’s find the optimal $\boldsymbol\alpha$
for the scatterplot of speed
versus weight
and plot the resulting optimally transformed data:
1 | min3 <- nlminb(start = c(1,1), objective = rho_cor, x = poke$weight_kg, y = poke$speed) |
Notice the “optimal correlation” is 0.1431949, which is not so different from the one we found (0.14163987) using Bump Rule #2
– and our transformations are much more easily interpreted. In general, although we can always find an optimal power transformation algorithmically (like we did above), the ladder heuristc tends to work pretty well.