Midnight Mechanism

All Posts

Correlation, Sample Size, and Significance

David Pratt · Sep 6, 2018 · 3 min read

The correlation coefficient, r, measures the strength of a linear relationship between variables, but not its significance. The null hypothesis of zero correlation between variables, r = 0, can be rejected by a statistical test whose p-value is a function of both the magnitude of the correlation and the sample size. In general, larger |r| values observed in larger samples are more significant. But how often do p-values and sample size simultaneously increase?
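To make the dependence on both |r| and sample size concrete, here is a minimal sketch. It uses Fisher's z-transform with a normal approximation, which is an assumption chosen to keep the example in the standard library and not necessarily the test used in the post. The function name `correlation_p_value` and the sample sizes are illustrative.

```python
import math


def correlation_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for the null hypothesis of zero correlation.

    Uses Fisher's z-transform: atanh(r) * sqrt(n - 3) is roughly standard
    normal under the null. This is an approximation, not an exact t-test.
    """
    if n <= 3:
        raise ValueError("the approximation needs n > 3")
    if abs(r) >= 1.0:
        return 0.0  # perfect correlation: treat as maximally significant
    z = math.atanh(r) * math.sqrt(n - 3)
    # erfc(|z| / sqrt(2)) equals the two-sided tail area of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# Same correlation, growing sample size: the p-value shrinks.
for n in (10, 30, 100, 1000):
    print(n, correlation_p_value(0.3, n))
```

Holding r fixed at 0.3, the printed p-values fall as n grows, which is the sense in which the same correlation becomes more significant in a larger sample.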

Predicting Kilowatt Consumption

David Pratt · Aug 22, 2018 · 5 min read

Whenever past performance is indicative of future results, predictive modeling is prescient. Such is the case with electrical bills. Twenty-two months' worth of electrical bills for a four-bedroom, two-bath apartment of a 1,500-square-foot duplex in the Lincoln, Nebraska area were submitted by residents. The following billing-period statistics were abstracted from each electrical bill: kWh (total kilowatt-hour usage), avg_kWh_per_day (average kilowatt-hour usage per day), avg_high (average high temperature), and avg_low (average low temperature).

The Curse of Dimensionality: A Visual Approach

David Pratt · Mar 21, 2018 · 6 min read

The Curse of Dimensionality refers to the phenomenon by which all observations become extrema as the number of free parameters, also called dimensions, grows. In other words, hyper-dimensional cubes are almost all corners. Corners, in this context, refer to the volume of a cube that lies outside its inscribed sphere, in any number of dimensions. The Curse of Dimensionality is of central importance to machine learning datasets, which are often high-dimensional.
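The "almost all corners" claim can be checked numerically. The sketch below assumes a unit hypercube with an inscribed ball of radius 1/2 and uses the standard formula for the volume of a d-dimensional ball. The function name `corner_fraction` is illustrative.

```python
import math


def corner_fraction(d: int) -> float:
    """Fraction of a unit hypercube's volume lying outside its inscribed ball.

    The d-ball of radius r has volume pi^(d/2) / Gamma(d/2 + 1) * r^d.
    The cube has side 1 (volume 1), so the inscribed ball has r = 1/2.
    """
    ball_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * 0.5 ** d
    return 1.0 - ball_volume


# The corner share of the volume climbs toward 1 as dimensions are added.
for d in (2, 3, 5, 10, 20):
    print(d, corner_fraction(d))
```

In two dimensions the corners hold 1 - pi/4 of the square's area, a little over a fifth. By ten dimensions they hold more than 99 percent of the cube's volume.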

Dynamic Time Warping of Heart-Rate Time Series

David Pratt · Mar 17, 2018 · 6 min read

George Moody of MIT-BIH noted back in 1996 that “neither first-order statistics nor frequency-domain analyses of HR (heart rate) time series reveal all of the information hidden in heart rate variations” (Moody 1996). This post evaluates that claim by contrasting a time-series similarity metric with classical statistical tools, using heart-rate data first made available at the website listed in the Works Cited section of this paper. The Dynamic Time Warping algorithm (DTW) can detect similarities between time series missed by other statistical tests.
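For readers new to DTW, here is a minimal sketch of the textbook dynamic-programming formulation, assuming absolute difference as the local cost and no warping-window constraint. It is not necessarily the variant used in the post, and the two short series are made-up numbers, not the MIT-BIH data.

```python
import math


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences.

    cost[i][j] holds the cheapest cumulative cost of aligning the first i
    points of `a` with the first j points of `b`.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, or both.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# Hypothetical heart-rate-like series: `delayed` is `series` with its first
# reading held for one extra sample.
series = [60, 62, 70, 80, 70, 62]
delayed = [60, 60, 62, 70, 80, 70, 62]
print(dtw_distance(series, delayed))
```

Because DTW may align one point with several neighbors, the delayed copy sits at distance zero from the original, whereas a point-by-point comparison would report a mismatch at nearly every index.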

Random Forests Outperform Simple Regression on Solubility Data

David Pratt · Mar 12, 2018 · 3 min read

Solubility can be defined as the propensity of a solid, liquid, or gaseous quantity (solute) to dissolve in another substance (solvent). Among many factors, temperature, pH, pressure, and entropy of mixing all affect solubility. (Loudon, Parise 2016) Solvents can be classified as either protic or aprotic, polar or apolar, and donor or nondonor. (Loudon, Parise 2016) The specifics, illustrated by the solubility data set from Applied Predictive Modeling, are beyond the scope of this paper.

Speeding Up Postgres Queries with Indices

Justin Sleep · Nov 10, 2017 · 1 min read

Newcomers to SQL often find themselves asking: Why are my queries so slow? Sure, they have half a million rows in their table, but they are only fetching a handful using a WHERE clause. How can it take so long just to return a few rows? The simple answer is: databases don’t know how you’re going to filter the column in your queries. They’re smart, but they can’t read your mind.
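The effect of an index shows up directly in a query plan. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in so it runs without a database server, and the `users` table, `email` column, and index name are hypothetical. In Postgres the analogous statements are CREATE INDEX and EXPLAIN, and the plan text will look different.

```python
import sqlite3

# In-memory SQLite database standing in for Postgres (an assumption made
# so the example is self-contained).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    ((f"user{i}@example.com",) for i in range(10_000)),
)


def query_plan(sql, params=()):
    """Return SQLite's plan for a query as one string (detail is the last column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


lookup = "SELECT id FROM users WHERE email = ?"
args = ("user42@example.com",)

plan_before = query_plan(lookup, args)  # no index: every row is scanned
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(lookup, args)   # the planner can now use the index

print(plan_before)
print(plan_after)
```

Before the index exists, the plan reports a scan of the whole table. Afterward it names idx_users_email, meaning the engine can jump straight to the matching rows instead of reading all of them.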