Posts List

Yes, It's Normal!

In a prior blog post, Is it Normal?, we began with two normal distributions and summed their frequencies to obtain a Gaussian mixture. In this post, we begin with a Gaussian mixture and deploy the Expectation-Maximization (EM) algorithm to decompose a given Gaussian mixture into its component distributions. Example code is included, and the results of these examples are contrasted with that of an R package mixtools, a professional software release which is based upon based upon work supported by the National Science Foundation under Grant No.

Correlation, Sample Size, and Significance

The correlation coefficient, r, measures the strength of a linear relationship between variables, but not its significance. The null hypothesis of zero correlation between variables, r = 0, can be refuted by a statistical test where the associated p-value is a function both of the magnitude of correlation as well as the sample size. In general, larger sample sizes with larger |r| values are more significant. But how often do p-values and sample size simultaneously increase?

Predicting Kilowatt Consumption

Whenever past performance is indicative of future results, predictive modeling is prescient. Such is the case with electrical bills. Twenty-two months worth of electrical bills for a four bedroom, two bath apartment of a 1500 square foot duplex in the Lincoln, Nebraska area were submitted by residents. The following billing-period statistics were abstracted from each electrical bill: kWh, total kilowatt hour usage, avg_kWh_per_day, average kilowatt hour usage per day, avg_high, average high temperature, and avg_low, average low temperature.

The Curse of Dimensionality: A Visual Approach

The Curse of Dimensionality refers to the phenomenon by which all observations become extrema as the number of free parameters, also called dimensions, grows. In other words, hyper-dimensional cubes are almost all corners. Corners, in this context, refer to the volume contained in cubes outside of the volume contained by inscribed spheres regardless of dimension. The Curse of Dimensionality is of central importance to machine learning datasets which are often high dimensional.

Random Forests Outperform Simple Regression on Solubility Data

Solubility can be defined as the propensity of a solid, liquid, or gaseous quantity (solute) to dissolve in another substance (solvent). Among many factors, temperature, pH, and pressure, and entropy of mixing all impact solubility. (Loudon, Parise 2016) Solvents can be classified as either protic or aprotic, polar or apolar, and donor or nondonor. (Loudon, Parise 2016) The specifics, illustrated by the solubility data set from Applied Predictive Modeling, are beyond the scope of this paper.