a fun little exercise

Posted in math on Thursday, May 18 2017

I had a conversation today about the curse of dimensionality and ended up pulling out ESL as a resource for some more talk on this. In there I found a great little derivation that both seems ridiculously complicated on its face (it sounds complicated), has deep practical implications, and can be done entirely with facts you learn in an intro probability and statisitcs course.

The problem is this: Suppose you are sampling data points from a p-dimensional unit-ball centered at the origin, and points are uniformly distributed within this ball. The question is, for successive samplings of N points from this ball, what is the median minimum radial distance of the points? That is, if you draw N points one of them is nearest to the origin. If you do this draw many times, what is the median value for these nearest points.

If I was a textbook writer I would tackle this problem linearly but I'm not, so I'm going to trace my circuitous route to the answer. I'm playing a little fast and loose with notation, but hopefully this is a pretty clear sketch of how to get from A to B.

First off this median-minimum thing. That is finding the median value of the 1st order statistic for the N radial distances. Ignore for now how I'm going to get those radial distances. What is the median value of any 1st order statistic?

Straight out of my undergrad stats textbook, the pdf of the 1st order statistic for some sample of N random variables $X_{i}$ (with pdf f(x) and cdf F(x)) is:

$$ f_{Xmin}(x) = N \left(1 - F(x) \right)^{N-1} f(x) $$

We know that the median is defined as the point x s.t. $ P(X \le x) = \frac{1}{2} $, so:

$$ \frac{1}{2} = \int \limits_0^{x} f_{Xmin}(x') dx' = \int \limits_0^{x} N \left(1 - F(x') \right)^{N-1} f(x') dx' = - \left( 1 - F(x) \right)^{N} $$

Or in terms of the radii R, I just need to solve the following for r and I'm done. $$ 1 - \left( \frac{1}{2} \right)^{\frac{1}{N}} = F_{R}(r) $$

Which means I need to know what the is cdf for R, the radial distance to a random point in the ball (I don't care at all how to get the angular distribution in this case). This is pretty easy it turns out, since we want the probability of a point being within any differential hypercube to be the same throughout the ball (uniformly distributed) things conveniently cancel and we end up with..

$$ R^{p} \sim U(0,1) $$ $$ F_{R}(r) = r^{p} $$

So plugging in and solving for r, we end up with: $$ r = \left( 1 - \left( \frac{1}{2} \right)^{\frac{1}{N}} \right)^{\frac{1}{p}} $$

This is exciting, as p gets large r, the median distance to the nearest point, goes to 1 or the edge of the ball. In high dimensional space most points end up being near and edge. Like I said you find a deep and interesting result after tossing together some intro level prob.