A client replaced their conventional thermostat with a smart one, expecting efficiency gains to lower their energy bills. When no significant reduction in energy expenditure materialized, the client engaged Midnight Mechanism for a first-principles analysis. An overview of the client's system follows.
Overview
An acceptable indoor temperature range of 62°F to 77°F was established, with an optimum setpoint of 68°F.

In a prior blog post, Is it Normal?, we began with two normal distributions and summed their frequencies to obtain a Gaussian mixture. In this post, we begin with a Gaussian mixture and deploy the Expectation-Maximization (EM) algorithm to decompose it into its component distributions. Example code is included, and its results are contrasted with those of the R package mixtools, a professional software release based upon work supported by the National Science Foundation under Grant No.
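As a minimal sketch of how EM can decompose a two-component, one-dimensional mixture, the loop below alternates an E-step (each point's posterior responsibility under component 1) with an M-step (re-estimating the weight, means, and standard deviations). Both this Python version and its min/max initialization are illustrative assumptions of the sketch; the post's comparisons use the R package mixtools, whose internals differ.

```python
import math
import random

def norm_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, iters=100):
    """Fit a two-component 1-D Gaussian mixture by Expectation-Maximization."""
    # Crude initialization (an assumption of this sketch): anchor the two
    # means at the extremes of the data and guess a common spread.
    mu1, mu2 = min(data), max(data)
    s1 = s2 = (max(data) - min(data)) / 4 or 1.0
    w1 = 0.5  # mixing weight of component 1
    for _ in range(iters):
        # E-step: posterior probability that each point came from component 1
        r = []
        for x in data:
            p1 = w1 * norm_pdf(x, mu1, s1)
            p2 = (1 - w1) * norm_pdf(x, mu2, s2)
            r.append(p1 / (p1 + p2))
        # M-step: re-estimate weight, means, and standard deviations
        n1 = sum(r)
        n2 = len(data) - n1
        w1 = n1 / len(data)
        mu1 = sum(ri * x for ri, x in zip(r, data)) / n1
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / n2
        s1 = math.sqrt(sum(ri * (x - mu1) ** 2 for ri, x in zip(r, data)) / n1)
        s2 = math.sqrt(sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, data)) / n2)
    return w1, (mu1, s1), (mu2, s2)

# Demo on synthetic data: two well-separated components
random.seed(1)
data = [random.gauss(0, 1) for _ in range(300)] + [random.gauss(5, 1) for _ in range(300)]
w1, (m1, sd1), (m2, sd2) = em_two_gaussians(data)
```

With well-separated components like these, the recovered means land near the true values of 0 and 5 and the mixing weight near 0.5.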

The sum of random variables should not be confused with the sum of their distributions. If both variables are normal and independent, the former is again normally distributed, with mean and variance given by the respective sums. The latter is called a Gaussian mixture. This piece will illustrate both sums, beginning with two normal distributions with identical standard deviations yet different means.
First, we load necessary packages and define our distributions:
pacman::p_load("ggplot2", "RColorBrewer", "extrafont")
set.
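To make the distinction concrete, here is a minimal Python simulation drawing from both constructions (a sketch only; the parameter values are illustrative, and the post's own walkthrough is in R):

```python
import random

random.seed(42)
N = 50_000
mu1, mu2, sigma = -2.0, 2.0, 1.0   # same sd, different means (illustrative)

# Sum of random variables: X + Y is itself normal, with mean mu1 + mu2
# and variance sigma^2 + sigma^2 (assuming independence).
sums = [random.gauss(mu1, sigma) + random.gauss(mu2, sigma) for _ in range(N)]

# Gaussian mixture: each draw comes from exactly ONE component,
# chosen here by a fair coin flip; the resulting density is bimodal.
mixture = [random.gauss(mu1 if random.random() < 0.5 else mu2, sigma)
           for _ in range(N)]

mean_sum = sum(sums) / N
mean_mix = sum(mixture) / N
var_sum = sum((x - mean_sum) ** 2 for x in sums) / N
var_mix = sum((x - mean_mix) ** 2 for x in mixture) / N
```

Both samples have mean zero, but the sum is unimodal with variance 2, while the mixture is bimodal with modes near each component mean and variance 5 (within-component variance 1 plus between-means variance 4).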

In this post, we consider the annualized cost of hiring a data scientist as an employee, and offer our services on a contract basis as a viable alternative.
Problem
The hidden costs of hiring begin long before an offer is made and continue long after. Newly posted jobs require an average of 42 days to fill. (1) Professional onboarding costs include the inevitable orientation period during which new hires earn full salary but have yet to attain full productivity.
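A back-of-the-envelope tally shows how these pieces annualize. Every figure below is a hypothetical round number chosen purely for illustration; none are drawn from this post or from citation (1).

```python
# Hypothetical inputs for illustration only
base_salary = 120_000        # annual salary
benefits_rate = 0.30         # benefits as a fraction of salary
recruiting_cost = 20_000     # sourcing, interviews, agency fees
ramp_months = 3              # orientation period at reduced productivity
ramp_productivity = 0.5      # fraction of full productivity while ramping

benefits = base_salary * benefits_rate
# Salary paid during ramp-up that does not yet yield full output:
ramp_loss = (base_salary / 12) * ramp_months * (1 - ramp_productivity)

annualized_cost = base_salary + benefits + recruiting_cost + ramp_loss
```

Even with conservative assumptions, the first-year cost lands well above the headline salary.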

Cash may be king, but cash flow is god, at least for small businesses, which can be simultaneously profitable yet bankrupt. A customer, perhaps with cash-flow problems of its own, need only delay payment on its outstanding invoices long enough that the small business in question lacks the resources to meet its short-term obligations.
Unfortunately, even fractional CFOs may be uninterested in a small business due to its size. The responsibility for cash flow management therefore frequently falls on the shoulders of those who feel ill-equipped to manage what amounts to the differencing of statistical distributions over time.
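That "differencing of distributions" can be made tangible with a small Monte Carlo sketch: treat monthly receipts and disbursements as random draws and watch when their running difference drives the balance below zero. The distributions and dollar amounts below are illustrative assumptions, not client figures.

```python
import random

random.seed(7)

def months_until_shortfall(opening_cash, n_months=24):
    """Simulate monthly cash flow as receipts minus disbursements, each a
    random draw; return the first month the balance goes negative, or None.
    The distributions here are illustrative assumptions, not client data."""
    cash = opening_cash
    for month in range(1, n_months + 1):
        receipts = random.gauss(50_000, 15_000)      # collections vary widely
        disbursements = random.gauss(48_000, 5_000)  # payroll/rent are steadier
        cash += receipts - disbursements
        if cash < 0:
            return month
    return None

# Probability of a shortfall within two years, across many simulated futures
runs = [months_until_shortfall(opening_cash=20_000) for _ in range(5_000)]
risk = sum(r is not None for r in runs) / len(runs)
```

Even a business that is profitable on average (mean net inflow is positive here) carries a material probability of a cash crunch when collections are volatile and the buffer is thin.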

The correlation coefficient, r, measures the strength of a linear relationship between variables, but not its significance. The null hypothesis of zero correlation, r = 0, can be rejected by a statistical test whose p-value is a function of both the magnitude of the correlation and the sample size. In general, larger samples and larger |r| values yield smaller p-values. But how often do p-values increase even as sample size grows?
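The joint dependence on |r| and n runs through the t statistic for a correlation, t = r·sqrt((n − 2)/(1 − r²)). The sketch below approximates the t distribution with a standard normal for simplicity, an assumption that is only reasonable for larger n; an exact test would use the t distribution with n − 2 degrees of freedom.

```python
import math
from statistics import NormalDist

def corr_p_value(r, n):
    """Two-sided p-value for H0: rho = 0, from t = r * sqrt((n - 2) / (1 - r^2)).
    Normal approximation to the t distribution (a simplifying assumption)."""
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return 2 * (1 - NormalDist().cdf(abs(t)))

# The same modest correlation becomes significant as the sample grows:
for n in (10, 50, 200):
    print(n, corr_p_value(0.3, n))
```

A fixed r = 0.3 is far from significant at n = 10 but comfortably below the conventional 0.05 threshold by n = 200.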

Whenever past performance is indicative of future results, predictive modeling is prescient. Such is the case with electrical bills. Twenty-two months' worth of electrical bills for a four-bedroom, two-bath apartment in a 1,500-square-foot duplex in the Lincoln, Nebraska area were submitted by the residents. The following billing-period statistics were extracted from each electrical bill:
kWh (total kilowatt-hour usage), avg_kWh_per_day (average kilowatt-hour usage per day), avg_high (average high temperature), and avg_low (average low temperature).
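A natural first model regresses daily usage on temperature. The single-predictor ordinary-least-squares helper below is a generic sketch, and the billing periods shown are hypothetical stand-ins for illustration, not the residents' actual data.

```python
def ols(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical billing periods (NOT the residents' actual data):
avg_high = [30, 45, 60, 75, 90]           # average high temperature, °F
avg_kwh_per_day = [38, 30, 24, 28, 36]    # heating and A/C both raise usage

intercept, slope = ols(avg_high, avg_kwh_per_day)
```

Because usage rises at both temperature extremes, a straight line through such data fits poorly (the slope here is nearly flat); a predictor like distance from a comfort setpoint, or separate heating and cooling regimes, captures the U shape better.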

The Curse of Dimensionality refers to the phenomenon by which all observations become extrema as the number of free parameters, also called dimensions, grows. In other words, hyper-dimensional cubes are almost all corners. Corners, in this context, refer to the volume of a cube lying outside its inscribed sphere, in any number of dimensions. The Curse of Dimensionality is of central importance to machine learning, whose datasets are often high-dimensional.
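The corner claim can be checked directly. For a cube of side 2, the inscribed unit sphere has volume π^(d/2)/Γ(d/2 + 1), so the sphere's share of the cube is that quantity divided by 2^d; a short sketch:

```python
import math

def inscribed_sphere_fraction(d):
    """Fraction of a d-cube's volume occupied by its inscribed d-sphere.
    For a cube of side 2 the inscribed sphere has radius 1, so the fraction
    is V_d(1) / 2^d, where V_d(r) = pi^(d/2) r^d / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / (math.gamma(d / 2 + 1) * 2 ** d)

for d in (1, 2, 3, 10, 20):
    print(d, inscribed_sphere_fraction(d))
```

The familiar low-dimensional values (π/4 in 2-D, π/6 in 3-D) collapse toward zero almost immediately: by twenty dimensions the inscribed sphere occupies well under a millionth of the cube, so essentially all of the volume, and hence any uniformly scattered data, sits in the corners.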

George Moody of MIT-BIH noted back in 1996 that “neither first-order statistics nor frequency-domain analyses of HR (heart rate) time series reveal all of the information hidden in heart rate variations.” (Moody 1996) This post will evaluate that claim by contrasting a time-series similarity metric with classical statistical tools on heart-rate data first made available at the website listed in the Works Cited section.
The Dynamic Time Warping algorithm (DTW) can detect similarities between time series missed by other statistical tests.
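To show what DTW computes, the function below fills the standard cumulative-cost matrix with an absolute-difference local cost. This is a minimal textbook sketch, not the implementation used in the post's analysis.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-time-warping distance between two
    numeric sequences, using |a_i - b_j| as the local cost."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

Two series tracing the same bump shifted in time, such as [0, 0, 1, 2, 1, 0, 0] and [0, 0, 0, 1, 2, 1, 0], get a DTW distance of zero even though their pointwise difference is large; that is exactly the kind of similarity a point-by-point comparison misses.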

Solubility can be defined as the propensity of a solid, liquid, or gaseous substance (the solute) to dissolve in another substance (the solvent). Among many factors, temperature, pH, pressure, and entropy of mixing all impact solubility. (Loudon, Parise 2016) Solvents can be classified as either protic or aprotic, polar or apolar, and donor or nondonor. (Loudon, Parise 2016) The specifics, illustrated by the solubility data set from Applied Predictive Modeling, are beyond the scope of this post.