The correlation coefficient, r, measures the strength of a linear relationship between two variables, but not its significance. The null hypothesis of zero correlation between the variables can be rejected by a statistical test whose p-value is a function of both the magnitude of the correlation and the sample size. In general, larger |r| values paired with larger samples are more significant. But how often do p-values and sample size increase simultaneously?
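As a quick illustration (not part of the original analysis), the sketch below draws two synthetic samples with the same underlying relationship and runs scipy.stats.pearsonr on each; the reported p-value reflects both the observed |r| and the sample size n.

```python
# A minimal sketch with synthetic data: the same underlying relationship
# tested at two sample sizes, to show that the p-value depends on both
# the magnitude of r and the number of observations.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

for n in (20, 200):
    x = rng.normal(size=n)
    y = 0.3 * x + rng.normal(size=n)   # weak linear relationship plus noise
    r, p = pearsonr(x, y)              # tests H0: population correlation is zero
    print(f"n = {n:3d}  r = {r:+.3f}  p = {p:.4f}")
```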
Whenever past performance is indicative of future results, predictive modeling is prescient. Such is the case with electrical bills. Twenty-two months' worth of electrical bills for a four-bedroom, two-bath apartment in a 1,500-square-foot duplex in the Lincoln, Nebraska area were submitted by the residents. The following billing-period statistics were extracted from each electrical bill:
kWh: total kilowatt-hour usage; avg_kWh_per_day: average kilowatt-hour usage per day; avg_high: average high temperature; and avg_low: average low temperature.
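A hypothetical sketch of how those billing-period statistics might be loaded and summarized is shown below; the file name electric_bills.csv is an assumption, and only the column names come from the description above.

```python
# Hypothetical loading of the billing-period statistics described above.
# The CSV file name is assumed; the column names come from the text.
import pandas as pd

bills = pd.read_csv("electric_bills.csv")   # 22 monthly billing periods
print(bills[["kWh", "avg_kWh_per_day", "avg_high", "avg_low"]].describe())

# One simple predictive angle: how daily usage tracks temperature.
print(bills[["avg_kWh_per_day", "avg_high", "avg_low"]].corr())
```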
The Curse of Dimensionality refers to the phenomenon by which nearly all observations become extrema as the number of free parameters, also called dimensions, grows. In other words, high-dimensional cubes are almost all corners. Corners, in this context, refer to the volume of a cube lying outside its inscribed sphere, in any number of dimensions; that fraction approaches one as the dimension increases. The Curse of Dimensionality is of central importance to machine learning, where datasets are often high dimensional.
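To make the "almost all corners" claim concrete, the short sketch below (an illustration, not part of the original) computes the fraction of a unit hypercube occupied by its inscribed ball; the fraction falls toward zero as the dimension grows, so nearly all of the volume sits in the corners.

```python
# Fraction of a unit d-cube occupied by its inscribed ball of radius 1/2.
# The remainder is the "corner" volume, which dominates in high dimensions.
from math import gamma, pi

def inscribed_ball_fraction(d: int) -> float:
    """Volume of the radius-1/2 ball in d dimensions divided by the
    volume of the unit d-cube (which is 1)."""
    return (pi ** (d / 2)) / (gamma(d / 2 + 1) * 2 ** d)

for d in (1, 2, 3, 5, 10, 20):
    print(f"d = {d:2d}  ball / cube = {inscribed_ball_fraction(d):.6f}")
```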
George Moody of MIT-BIH noted back in 1996 that “neither first-order statistics nor frequency-domain analyses of HR (heart rate) time series reveal all of the information hidden in heart rate variations” (Moody 1996). This post will evaluate that claim by contrasting a time series similarity metric with classical statistical tools on heart-rate data first made available at the website listed in the Works Cited section of this post.
The Dynamic Time Warping (DTW) algorithm can detect similarities between time series that other statistical tests miss.
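For reference, here is a minimal sketch of the classic dynamic-programming formulation of DTW; it is written for illustration and is not necessarily the implementation used in the post. The toy comparison at the end shows how a pure time shift inflates Euclidean distance while DTW largely absorbs it.

```python
# A minimal Dynamic Time Warping sketch: the classic O(n*m) dynamic program.
import numpy as np

def dtw_distance(a, b):
    """Return the DTW distance between two 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three allowed warping steps
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# Two series with the same shape but shifted in time: Euclidean distance
# penalizes the lag point by point, while DTW aligns the shapes first.
t = np.linspace(0, 2 * np.pi, 100)
s1, s2 = np.sin(t), np.sin(t - 0.5)
print("Euclidean:", np.linalg.norm(s1 - s2))
print("DTW:      ", dtw_distance(s1, s2))
```

The quadratic cost matrix is the textbook formulation; windowed or lower-bounded variants trade a little flexibility for much better scaling on long heart-rate records.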
Solubility can be defined as the propensity of a solid, liquid, or gaseous substance (the solute) to dissolve in another substance (the solvent). Among many factors, temperature, pH, pressure, and entropy of mixing all affect solubility (Loudon, Parise 2016). Solvents can be classified as protic or aprotic, polar or apolar, and donor or nondonor (Loudon, Parise 2016). The specifics, illustrated by the solubility data set from Applied Predictive Modeling, are beyond the scope of this paper.
Newcomers to SQL often find themselves asking: Why are my queries so slow? Sure, they have half a million rows in their table, but they are only fetching a handful using a WHERE clause. How can it take so long just to return a few rows?
The simple answer is: databases don’t know ahead of time which columns your queries will filter on. They’re smart, but they can’t read your mind.
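The usual remedy is an index on the filtered column. The sketch below uses SQLite from Python purely as an illustration (the orders table, its columns, and the row count are invented); EXPLAIN QUERY PLAN reports a full table scan before the index exists and an index search afterward.

```python
# Illustrative only: a half-million-row table queried with a selective
# WHERE clause, before and after adding an index on the filtered column.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 5000, i * 0.01) for i in range(500_000)],
)

query = "SELECT * FROM orders WHERE customer_id = 42"

# Before: the planner has no choice but to scan every row.
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# After: the planner searches the index and touches only the matching rows.
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```

The index is what tells the engine, in advance, which column you intend to filter on; without it, even a query that returns a handful of rows has to examine all of them.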