Chapter 24 Re-expression with R

Welcome back to Quantitative Reasoning! In chapter 4, our textbook introduces re-expression as a method to make skewed distributions more symmetric. Let’s use the data set from the book to see how we can use R to generate re-expressed data. You can find a link to the spreadsheet forbes500_ceo_comp.csv below this video (http://michaelgastner.com/data_for_QR/forbes500_ceo_comp.csv). Please import the data into R. I’m shortening the name of the data frame to forbes so that later on I don’t need to type so much.

forbes <- read.csv("forbes500_ceo_comp.csv")

The data frame contains information about CEOs of Forbes 500 companies in 2012.

head(forbes)
##   rank                 name         company   compens
## 1    1    John H Hammergren        McKesson 131190000
## 2    2         Ralph Lauren    Ralph Lauren  66650000
## 3    3 Michael D Fascitelli  Vornado Realty  64400000
## 4    4     Richard D Kinder   Kinder Morgan  60940000
## 5    5         David M Cote       Honeywell  55790000
## 6    6           George Paz Express Scripts  51520000

The first column shows the CEOs’ ranks in terms of their compensation. The second column contains the CEOs’ names. The third column stores the company names. The fourth column shows the CEOs’ compensation in US dollars.

In Figure 4.7, our textbook shows a histogram of the compensation. We can reconstruct this histogram with the R commands shown here. For this histogram, the textbook expresses the compensation in millions of dollars (i.e. it divides each value by 1 million) to make the numbers on the x-axis more legible.

hist(forbes$compens / 1000000,  # Divide by 1 million to make numbers smaller.
     breaks = seq(0, 150, 2.5),
     col = "lightgoldenrod1",
     main = "CEOs' compensation for Forbes 500 companies (Year 2012)",
     xlab = "Compensation (million $)",
     ylab = "# of CEOs")
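
If the breaks argument looks cryptic, it may help to inspect the vector that seq() returns. The following lines are only a small sketch; the variable name bin_edges is my own choice and does not appear in the textbook.

```r
# seq(from, to, by) returns equally spaced numbers; here they serve as bin edges.
bin_edges <- seq(0, 150, 2.5)
head(bin_edges)         # the first few edges: 0, 2.5, 5, ...
length(bin_edges) - 1   # number of bins (one fewer than the number of edges)
```

The edges must span the whole range of the data. Otherwise, hist() stops with an error instead of drawing the plot.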

The distribution is strongly right-skewed, which makes it difficult to choose a single number that captures the centre of the distribution. For example, the mean is much larger than the median.

mean(forbes$compens)
## [1] 10473580
median(forbes$compens)
## [1] 6965000
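
To see why a long right tail pulls the mean away from the median, here is a small sketch with made-up numbers. The vector toy is hypothetical and has nothing to do with the Forbes data; only the last line uses the two values printed above.

```r
# Hypothetical toy data: six modest values and one very large value.
toy <- c(1, 2, 2, 3, 3, 4, 50)
mean(toy)     # pulled upwards by the single large value
median(toy)   # stays in the middle of the bulk of the data

# For the CEO data, the mean is roughly 1.5 times the median.
10473580 / 6965000
```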

It’s often more convenient, and for some statistical analyses even required, to work with symmetric distributions. Re-expression is a technique that often makes skewed distributions more symmetric. We re-express data by applying a mathematical function to each value before we proceed with the data analysis. For right-skewed data such as CEO compensation, common re-expressions are the square root and the logarithm. In R, we take the square root of elements in a vector with the sqrt() function.

sqrt(c(1, 4, 9, 16))
## [1] 1 2 3 4

The distribution of the square root of the CEOs’ compensation is still moderately right-skewed.

hist(sqrt(forbes$compens))
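
One way to understand what the square root does to the right tail is to compare a very large value with a typical value before and after the re-expression. The sketch below uses two numbers that were printed earlier (the largest compensation and the median); the variable names top and middle are my own.

```r
top <- 131190000    # largest compensation shown in head(forbes)
middle <- 6965000   # median compensation printed earlier

top / middle               # on the original scale, the top value is almost 19 times the median
sqrt(top) / sqrt(middle)   # after taking square roots, the ratio shrinks to a little over 4
```

The square root pulls large values closer to the rest of the data, but a ratio of more than 4 still leaves a noticeable right tail.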

Let’s try taking the logarithm instead of the square root. You may remember from high school that there’s a large family of logarithms whose members differ by their base. For data analysis, the most important logarithms are

  • the natural logarithm, which uses Euler’s number (≈ 2.71828) as its base, and
  • the decadic logarithm, whose base is 10.

The natural logarithm is mathematically more convenient, but the decadic logarithm is easier to interpret. Because the natural and decadic logarithms only differ by a constant factor, it doesn’t matter which one we choose. If a distribution is symmetric after transforming all numeric values with the natural logarithm, then the distribution is also symmetric after applying the decadic logarithm to the same numbers, and vice versa.

In R, we take the natural logarithm of a vector with the log() function and the decadic logarithm with the log10() function.

log(c(1, 2.71828, 10))
## [1] 0.0000000 0.9999993 2.3025851
log10(c(1, 2.71828, 10))
## [1] 0.0000000 0.4342942 1.0000000
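
The constant factor mentioned earlier is 1 / log(10): dividing a natural logarithm by log(10) gives the decadic logarithm. Here is a short sketch that checks this relationship for a few example values (the vector x is made up for illustration).

```r
x <- c(1, 2.71828, 10, 1000)

log(x) / log(10)                         # natural logarithms, rescaled by 1 / log(10)
all.equal(log10(x), log(x) / log(10))    # TRUE: both routes agree up to rounding error

# log() also accepts a base argument, so this is another way to write log10(x).
log(x, base = 10)
```

Because the two logarithms differ only by this factor, switching between them stretches or compresses the x-axis of a histogram without changing its shape.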

In Figure 4.8, the authors of our textbook use the decadic logarithm. Let’s follow their example (i.e. we use log10() instead of log()). Here is R code that reproduces their histogram.

hist(log10(forbes$compens),
     breaks = seq(5, 8.25, 0.25),
     col = "lightgreen",
     main = "Re-expressed CEO compensation",
     xlab = "log10 of compensation",
     ylab = "# of CEOs")

Now the distribution looks nearly symmetric, so we expect the mean and median on a log-scale to be almost equal. Looking at the histogram, we estimate both values to be between 6.75 and 7. When we apply R’s median() function to the re-expressed data, the return value is consistent with our estimate.

median(log10(forbes$compens))
## [1] 6.842921

However, the mean() function returns “minus infinity”.

mean(log10(forbes$compens))
## [1] -Inf

The reason for this result is that some CEOs nominally received no payment, …

forbes[forbes$compens == 0, ]
##     rank               name                 company compens
## 498  498       Malon Wilkus American Capital Agency       0
## 499  498 Matthew J Lambiase      Chimera Investment       0
## 500  498         Larry Page                  Google       0

… and R evaluates the logarithm of 0 as minus infinity, because the logarithm decreases without bound as its argument approaches 0.

log10(0)
## [1] -Inf
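
A tiny made-up example shows why a single zero is enough to break the mean while the median survives. The vectors vals and logs are hypothetical.

```r
vals <- c(100, 1000, 0)   # hypothetical data with one zero
logs <- log10(vals)
logs           # 2 3 -Inf

mean(logs)     # -Inf: the sum of the values is already -Inf
median(logs)   # 2: the median only depends on the order of the values
```

This is why median() returned a sensible number for the re-expressed CEO data even though mean() did not.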

The situation would have been even worse if any of the compensations had been negative. Mathematically, the logarithm of a negative number is undefined, so R issues a warning and returns the special value NaN, which stands for “Not a Number”.

log10(-1)
## Warning: NaNs produced
## [1] NaN
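
Before we summarise re-expressed data, it can be useful to check whether any values turned out infinite or undefined. Base R provides is.finite() and is.nan() for this purpose. The sketch below uses a made-up vector; suppressWarnings() is only there to hide the warning we have just seen.

```r
# Hypothetical input with one positive, one zero and one negative value.
re_expressed <- suppressWarnings(log10(c(100, 0, -1)))
re_expressed                    # 2 -Inf NaN

is.finite(re_expressed)         # TRUE FALSE FALSE
is.nan(re_expressed)            # FALSE FALSE TRUE
sum(!is.finite(re_expressed))   # 2 values need our attention
```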

The lesson to learn is that we must be careful when re-expressing data with a logarithm. We can remove all non-positive numbers from the data to proceed with our analysis, but this technique may leave us with only a few data points. An alternative is to add a positive constant to all data values, but this strategy can feel artificial unless there’s a natural choice for the added constant. In our case, only 3 out of 500 numbers are non-positive, so it is sensible to alert the reader to this fact, remove these values and continue with the analysis. Our textbook applies this strategy according to footnote 4 on page 111. After removing the non-positive values, the mean of the log-transformed data is only 0.015% larger than the median of the log-transformed data.

forbes_positive_compens <- forbes$compens[forbes$compens > 0]
median(log10(forbes_positive_compens))
## [1] 6.843233
mean(log10(forbes_positive_compens))
## [1] 6.84427
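
The figure of 0.015% can be checked with a line of arithmetic. In this sketch, I simply copy the two printed values into variables whose names (log_median and log_mean) are my own.

```r
log_median <- 6.843233   # median of the log-transformed data, as printed above
log_mean <- 6.84427      # mean of the log-transformed data, as printed above

# Relative difference in percent; the result is approximately 0.015.
100 * (log_mean - log_median) / log_median
```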

This result is evidence that the log-transformation has made the distribution nearly symmetric.
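
For comparison, here is a sketch of the alternative strategy mentioned earlier: adding a positive constant before taking the logarithm. The values in comp are made up, and the constant shift = 1 is purely an assumption for illustration; neither the textbook nor the data suggest a natural choice.

```r
comp <- c(0, 50000, 2000000, 6965000, 131190000)   # hypothetical compensations
shift <- 1                                         # assumed constant, chosen arbitrarily

shifted_logs <- log10(comp + shift)
shifted_logs                   # the zero becomes log10(1) = 0
all(is.finite(shifted_logs))   # TRUE: no -Inf any more
```

All values are now finite, but the former zero sits far away from the other values on the log scale, and its position depends entirely on the constant we picked. This is the sense in which the strategy can feel artificial.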

In summary, we learned how to re-express data with R.

  • We take the square root of a numeric vector with the function sqrt().
  • We make log-transformations with the log() or log10() functions. log() returns the natural logarithms, log10() the decadic logarithms.
  • When we re-express data, the result can become infinite or undefined (NaN). Some R functions handle infinite or undefined numbers gracefully (e.g. hist() silently removes infinite values). Other R functions (e.g. mean()) only produce interpretable results if the input consists of finite numbers. Consequently, we may have to remove some data values to proceed with the analysis.
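
If we often need the mean of re-expressed data, we could wrap the removal step in a small function that also reports how many values it dropped. The function finite_mean() below is a hypothetical helper of my own, not part of base R, and the input data are made up.

```r
# finite_mean() is a hypothetical helper, not a base R function.
finite_mean <- function(x) {
  keep <- is.finite(x)   # FALSE for Inf, -Inf, NaN and NA
  message(sum(!keep), " value(s) removed before averaging")
  mean(x[keep])
}

y <- log10(c(0, 10, 100, 1000))   # made-up data; the first element becomes -Inf
mean(y)          # -Inf
finite_mean(y)   # 2, with a message that 1 value was removed
```

Printing the number of removed values makes it harder to forget that we should alert the reader to them.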

Next time we talk about a different kind of data transformation: the z-score.

See you soon.