The sum of random variables should not be confused with the sum of their distributions. If the variables are independent and normally distributed, the former is also normally distributed, with mean equal to the sum of the means and variance equal to the sum of the variances. The latter is what is called a Gaussian mixture. This piece will illustrate both sums, beginning with two normal distributions with identical standard deviations yet different means.
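In symbols, the two operations can be written as below. This is only a notational sketch: the independence of X and Y and the mixture weight w are assumptions introduced here for illustration, and φ denotes the normal probability density function.

```latex
% Sum of independent normal random variables: means add, variances add
X \sim \mathcal{N}(\mu_1, \sigma_1^2), \quad
Y \sim \mathcal{N}(\mu_2, \sigma_2^2), \quad
X \perp Y
\;\Longrightarrow\;
X + Y \sim \mathcal{N}\left(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2\right)

% Gaussian mixture: a weighted sum of the two densities (not of the variables)
f_{\mathrm{mix}}(x) = w\,\varphi(x;\mu_1,\sigma_1) + (1 - w)\,\varphi(x;\mu_2,\sigma_2),
\qquad 0 \le w \le 1
```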
First, we load necessary packages and define our distributions:
pacman::p_load("ggplot2",
               "RColorBrewer",
               "extrafont")
set.seed(1234)
### Define normal random variables (means 50 and 65, common sd of 5)
sample_size = 10^5
dist1 = rnorm(sample_size, mean = 50, sd = 5)
dist2 = rnorm(sample_size, mean = 65, sd = 5)
dist3 = dist1 + dist2 # arithmetic sum of random variables
Then, we graph dist1 and dist2, both of which are stored in the data frame df1:
### Separate distributions: designated by position = "identity"
# Label each draw by its source distribution ("H" = dist1, "T" = dist2)
df1 <- data.frame(flip = factor(rep(c("H", "T"), each = sample_size)),
                  value = round(c(dist1, dist2)))
ind_dist <- ggplot(df1, aes(x = value, color = flip, fill = flip)) +
  # after_stat(density) replaces the deprecated ..density.. (ggplot2 >= 3.3.0)
  geom_histogram(aes(y = after_stat(density)), position = "identity", alpha = 0.8, binwidth = 1) +
  scale_color_manual(values = c("#000000", "#000000", "#999999")) +
  scale_fill_manual(values = c("#2C3E50", "#0CB1B9", "#000000")) +
  labs(title = "Individual Distributions", x = "", y = "") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 10, family = "Montserrat")) +
  theme(legend.position = "none") +
  # Overlay the normal curve fitted to each sample
  stat_function(fun = dnorm, args = list(mean = mean(dist1), sd = sd(dist1))) +
  stat_function(fun = dnorm, args = list(mean = mean(dist2), sd = sd(dist2)))
ind_dist
The result is the following figure:
Fig. 1: Two normal distributions with different means and the same standard deviation. Note the overlap between the two, which is dealt with in Fig. 3.
Next, we graph the sum of the random variables, designated in the first code block as dist3:
### Summation of Random Variables with color interpolation
# dist3 has sample_size elements, so the label vector must match that length
df2 <- data.frame(flip = factor(rep("HT", sample_size)), value = round(dist3))
# Palette interpolating between the two fill colors used in Fig. 1
blend <- colorRampPalette(c("#2C3E50", "#0CB1B9"), interpolate = "spline")
arithmetic_sum <- ggplot(df2, aes(x = value, color = flip, fill = flip)) +
  # after_stat(density) replaces the deprecated ..density.. (ggplot2 >= 3.3.0)
  geom_histogram(aes(y = after_stat(density)), position = "stack", alpha = 0.8, binwidth = 1) +
  scale_color_manual(values = c("#000000", "#000000", "#999999")) +
  scale_fill_manual(values = blend(17)[[7]]) +
  labs(title = "Arithmetic Sum of Random Variables", x = "", y = "") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text = element_text(size = 10, family = "Montserrat")) +
  theme(legend.position = "none") +
  # Overlay the normal curve fitted to the summed sample
  stat_function(fun = dnorm, args = list(mean = mean(df2$value), sd = sd(df2$value)))
arithmetic_sum
The result is as follows:
Fig. 2: The arithmetic sum of two independent normal random variables is itself normally distributed. Here, we sum the values of each random variable, drawn from its respective probability distribution, and then take the frequency of observations, bin by bin. In other words, in the context of this graph, we sum horizontally, adding the x-axis components of the two distributions together.
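As a quick numerical check of the claim in Fig. 2, the sketch below compares the sample mean and standard deviation of a simulated sum against the closed-form values for independent normals (means add, variances add). The variable names x, y, s, theory_mean and theory_sd are illustrative and not part of the code above; the parameters simply mirror those used for dist1 and dist2.

```r
# Sketch: empirical vs. theoretical parameters of a sum of independent normals
set.seed(1234)
n <- 10^5
x <- rnorm(n, mean = 50, sd = 5)
y <- rnorm(n, mean = 65, sd = 5)
s <- x + y                        # sum of the random variables, draw by draw

theory_mean <- 50 + 65            # means add
theory_sd   <- sqrt(5^2 + 5^2)    # variances add, so the sd is sqrt(50)

print(c(empirical_mean = mean(s), theoretical_mean = theory_mean))
print(c(empirical_sd = sd(s), theoretical_sd = theory_sd))
```

With 10^5 draws, the empirical values should sit close to the theoretical ones; note that the standard deviation of the sum is sqrt(50), not 5 + 5.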
In both Figures 1 and 2, the normal curves were drawn using R’s dnorm function. For the third figure, however, we want to graph the equation of our Gaussian mixture, which requires writing a custom analogue to dnorm as the sum of two normal probability density functions:
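### custom function returning dnorm equivalent of Gaussian Mixture
# Note: this unweighted sum integrates to 2, which matches the two stacked
# per-group densities plotted below; a proper mixture density would weight
# each term (e.g. by 0.5) so that the total integrates to 1.
gaussian_mixture <- function(x, mean1, mean2, sd1, sd2){
  coeff1 = 1/sqrt(2*pi * sd1^2)
  coeff2 = 1/sqrt(2*pi * sd2^2)
  freq = coeff1 * exp(-(x - mean1)^2 / (2*sd1^2)) + coeff2 * exp(-(x - mean2)^2 / (2*sd2^2))
  return(freq)
}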
### Add the distributions as opposed to adding the random variables
gauss_mix <- ggplot(df1, aes(x = value, color = flip, fill = flip)) +
  # position = "stack" piles the two per-group densities on top of each other;
  # after_stat(density) replaces the deprecated ..density.. (ggplot2 >= 3.3.0)
  geom_histogram(aes(y = after_stat(density)), position = "stack", alpha = 0.8, binwidth = 1) +
  scale_color_manual(values = c("#000000", "#000000", "#999999")) +
  scale_fill_manual(values = c("#2C3E50", "#0CB1B9", "#000000")) +
  labs(title = "Gaussian Mixture: Sum of Distributions", x = "", y = "") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(legend.position = "none") +
  theme(text = element_text(size = 10, family = "Montserrat")) +
  stat_function(fun = gaussian_mixture,
                args = list(mean1 = mean(dist1),
                            sd1 = sd(dist1),
                            mean2 = mean(dist2),
                            sd2 = sd(dist2)))
gauss_mix
The customized function, gaussian_mixture, is superimposed as follows:
Fig. 3: The sum of the two normal distributions is bimodal here, because the means (15 apart) are separated by more than twice the common standard deviation (5). We sum the frequencies of the probability distributions, leaving the underlying values of each random variable untouched. The result can be thought of as a stacked bar graph wherever the overlap of the distributions is large enough to be visible.
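To contrast this with the sum of random variables, the sketch below treats the mixture as a probability density in its own right. It assumes equal weights of 0.5 on each component, which is a choice made here for illustration. The names mixture_pdf, w, component and draws are hypothetical, and the integration limits 0 and 120 are picked only because they cover at least ten standard deviations on either side of both means.

```r
# Sketch: an equal-weight Gaussian mixture as a proper density
# (mixture_pdf and w are illustrative names, not from the post's code)
mixture_pdf <- function(x, mean1, mean2, sd1, sd2, w = 0.5) {
  w * dnorm(x, mean1, sd1) + (1 - w) * dnorm(x, mean2, sd2)
}

# The weighted density integrates to 1; the unweighted sum integrates to 2
area_weighted <- integrate(mixture_pdf, lower = 0, upper = 120,
                           mean1 = 50, mean2 = 65, sd1 = 5, sd2 = 5)$value
area_unweighted <- integrate(function(x) dnorm(x, 50, 5) + dnorm(x, 65, 5),
                             lower = 0, upper = 120)$value

# Sampling from the mixture: pick a component first, then draw from it.
# No values are added together, unlike the sum of random variables.
set.seed(1234)
n <- 10^5
component <- rbinom(n, size = 1, prob = 0.5)   # 1 = first component
draws <- ifelse(component == 1, rnorm(n, 50, 5), rnorm(n, 65, 5))

print(c(area_weighted = area_weighted, area_unweighted = area_unweighted))
print(c(mixture_mean = mean(draws), weighted_mean_of_means = 0.5 * 50 + 0.5 * 65))
```

Each draw comes from one component or the other, so the mixture mean lands between the two component means (near 57.5) instead of at their sum (115).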
Conclusion
Since fitting a unimodal distribution to multimodal data results in a poor fit, it is important to be able to deal constructively with Gaussian mixtures. Inference in Ordinary Least Squares (OLS) regression commonly rests on an assumption of normally distributed errors, not multimodal ones. It is therefore important, when faced with a Gaussian mixture, to be able to efficiently decompose the distribution into its Gaussian components. In the next post, we begin with the algorithmic decomposition of a given multimodal distribution into those components. Once the theoretical framework has been built, we will study applications.
Packages Used
Carson Sievert (2018) plotly for R. https://plotly-r.com
Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. https://CRAN.R-project.org/package=RColorBrewer
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.