Recently I was talking, on the internet no less, with someone who was trying to take some data presumably drawn at random and find the parameters of the corresponding pdf (some weird Weibull something-er-other for doing risk modelling but that's beside the point for what follows). In simple cases the common man would look up the maximum likelihood estimators for the parameters of the given pdf, or maybe numerically find the parameters that minimize the Kolmogorov-Smirnov statistic or something. This guy's plan was to do a non-linear least squares fit of the pdf.

At first blush: *what?* Typically least squares fitting is done when you have some relation $y = f(x) + \epsilon$ where $\epsilon$ is some normally distributed measurement error. In this case you would be fitting a frequency (I guess) and with the measured frequency as an estimate of the "real frequency" plus some "error term" but it's not clear to me how you account for how that error is distributed. I doubt it will be normally distributed in general. This is important as the error needs some important properties to ensure that the parameters estimated by least-squares are unbiased etc. As I understand it, it is the properties of the error that determine if least-squares provides a *good fit* in accordance with some definition of *good*.

Secondly fitting a histogram to the pdf is going to have issues of just deciding how to bin the data, the same data can generate very different histograms depending on how you chop it up, and these will all generate very different fits of the pdf (a similar problem comes up with kde with the bandwidth, which you can solve with cross-validation so this isn't necessarily a fatal problem). Below I sample 50 numbers from a standard Birnbaum-Saunders distribution (commonly used in reliability engineering which was the context for the original conversation) and show 3 histograms with different bin sizes. As you can see the features seem to change as the binsize changes, I can get a wide array of estimates for the distributions parameters and it isn't immediately obvious which one is the *best*.

But at the end of the day, for all that I can spin objections for why I don't think it is "valid" -- i.e. "it isn't how I was taught to solve this particular problem" -- I just don't know whether it works or not. The only thing left to do is try it out and see.

The second problem can be dealt with somewhat by fitting to the cdf as opposed to the pdf, this is shown below. The cumulative histogram eye-balls better to the cdf for the distribution at least.

But at this point I haven't addressed whether or not least-squares *works*. I am not a mathemagician and I am sitting in front of a computer so why not make the computer do some experiments and see how well it performs? For simplicity I'm going to just try the Birnbaum-Saunders distribution with only a shape parameter *c*. I'm picking this distribution partly because it is adjacent to the field we were discussing (reliability engineering) and also because it is not normal. If I used a normal distribution and got normally distributed errors, what would that tell me? I'm not sure.

I want to compare a few methods, the standard maximum likelihood estimation, non-linear least squares, and minimizing the KS statistic (hey why not?).

```
%%px --local
import numpy as np
import scipy.stats as spst
from scipy.optimize import curve_fit, minimize_scalar
def bscdf(x, *c):
return spst.fatiguelife.cdf(x, c, loc=0.0, scale=1.0)
def mlfit(data):
res = spst.fatiguelife.fit(data)
return res[0]
def nlfit(data):
xdata = np.sort(data)
ydata = np.cumsum(np.ones(xdata.shape)/len(xdata))
p0 = 1.0
popt, pcov = curve_fit(bscdf, xdata, ydata, p0)
return float(popt)
def ksfit(data):
def fun(c, xdata):
return spst.kstest(xdata, bscdf, args=(c,)).statistic
res = minimize_scalar(fun, args=(data,), method='Brent')
return float(res.x)
def simulation(run, c=27):
dataset = spst.fatiguelife.rvs(c, size=50)
mlres = mlfit(dataset)
nlres = nlfit(dataset)
ksres = ksfit(dataset)
return [mlres, nlres, ksres]
```

The simulation generates a random sample of 50 points from a standard Birnbaum-Saunders distribution with shape factor $c=27$, then tries the built-in MLE fit method (yay scipy!), a non-linear fit, and a fit by minimizing the Kolmogorov-Smirnov statistic. So lets run this 1000 times and see what happens.

I am actually rather surprised, it looks like nonlinear fit outperformed the MLE, at least in terms of having less variance and actually being closer to the real value of the shape parameter, the KS fit appears to be the worst in terms of variance. All of this could just be a random fluke so I wouldn't read too much into it.

It's great that in *this particular instance* using least-squares turned out to be a great idea, but does it work generally? Let me wave my hands and pretend I know something about probability here (I am not a mathematician, I'm an engineer (and a daft one at that)).

Consider what I was doing: I draw some number $n$ observations from the distribution then I count up how many observations $X \le x$, then divide by $n$ to get a frequency.

Let $Y$ be the fraction of observations $X \le x$ out of a lot of $n$ total observations, i.e. if there are $k$ observations, then $Y = k/n$. Note the probability that $X \le x$ is given by the cdf for the distribution of $X$, that is: $F(x) = Pr(X \le x)$.

It should be pretty clear that the probability that $Y = k/n$ at a given $x$ is $$Pr(Y = k/n) = \frac{1}{n} { n \choose k } \left( F(x) \right)^{k} \left( 1 - F(x) \right)^{n-k} $$

With expecation value $$E(Y) = F(x)$$

and variance $$ Var(Y) = { F(x)\cdot (1-F(x)) \over n }$$

By de Moivre-Laplace theorem as $n \to \infty$ the $ Pr(Y=k/n) $ is approximated by a normal distribution. Which is a round-about way of saying that as the number of data points in the original sampling increase the distribution of *Y*s is better approximated by a normal distribution, which is encouraging at least.

Anyways's, let's consider the error, *e*, used in the least squares estimate:
$$ e_{i} = Y_{i} - F(x_{i}) $$

The expectation value for the error is: $$ E(e_{i}) = E\left( Y_{i} - F(x_{i}) \right) $$ $$ E(e_{i}) = E(Y_{i}) - F(x_{i}) $$ $$ E(e_{i}) = F(x_{i}) - F(x_{i}) = 0 $$

And the variance for the error is: $$ Var(e_{i}) = Var \left( Y - F(x_{i}) \right) $$ $$ Var(e_{i}) = Var(Y) = { F(x_{i}) \cdot \left( 1 - F(x_{i}) \right) \over n } $$

This is encouraging, the error is centered at zero, but it is heteroscedastic so some weighting matrix will be needed. For least-squares estimators to be efficient and unbiased, as far as I understand, the error has to be uncorrelated, homoscedastic, and centered at zero. With appropriate weighting and thought, least squares fitting of a cdf may not be a dumb idea afterall.

As usual the ipython notebook is on github