Curse of Dimensionality
Introduction
The Curse of Dimensionality refers to the difficulties that arise when modeling high-dimensional data. A dimension is a feature, or variable: one of the things we are measuring. Traditionally, the Curse of Dimensionality wasn't an issue because n, the number of observations (data points), was far greater than p, the number of dimensions. Imagine we are measuring the statistical distribution of Body Mass Index (BMI) across hundreds of thousands of people. We determine that the features, or dimensions, we really need are age, height, weight, and sex; that is, p = 4. Let's say we run a big study of 200,000 people, that is, 200,000 observations. In this case n is much, much larger than p.
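As a quick illustration, here is what such an n >> p dataset might look like in code; a minimal sketch using pandas and NumPy, with made-up values, since the study above is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical BMI-style study: n = 200,000 observations, p = 4 dimensions.
rng = np.random.default_rng(1)
study = pd.DataFrame({
    "age":    rng.integers(18, 90, size=200_000),
    "height": rng.normal(170, 10, size=200_000),   # cm
    "weight": rng.normal(75, 15, size=200_000),    # kg
    "sex":    rng.choice(["F", "M"], size=200_000),
})

n, p = study.shape
print(n, p)   # 200000 4  -- n is much, much larger than p
```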
Why Does n Being Larger Than p Matter?
Imagine that, instead of 200,000 observations, we have 200,000 dimensions, or features, to measure, and only 4 observations, making p far larger than n. Imagine these 200,000 dimensions are questions about someone’s life: how much ice cream they’ve eaten in total over their whole life in gallons, how many miles they’ve run in their life, how many hours they’ve spent watching TV, and so on. And it turns out that, upon asking one of our subjects, she only knows the answers to 4 of our questions (because who is really keeping track of this stuff?). We can’t record 0 for the rest, because she doesn’t know the answers at all. We can’t just guess, because that wouldn’t be accurate. So we have to record a null response for 199,996 dimensions. But if we are going to predict a few of these dimensions based on the responses to all the others, how can we do so if we don’t have any input at all for the vast majority of them?
Sometimes, you will see the Curse of Dimensionality described as the result of having "sparse" data. Sparsity is a concept from linear algebra: a sparse matrix is one whose entries are mostly zero, such as this one:
\begin{pmatrix} 0 & 0 & 0 & 0 & 4 \\ 0 & 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 3 & 0 & 0 & 0 & 11 \\ 0 & 0 & 0 & 0 & 0 \\ \end{pmatrix}
Although sparsity strictly refers to matrices with many zeroes, in data science the Curse of Dimensionality often involves data with null (missing) values. The difference between null and zero is important: a zero is a measured value, while a null means no measurement was made at all.
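To make the zero-versus-null distinction concrete, here is a minimal sketch in Python (the use of NumPy and pandas here is my choice of illustration, not something required by the concept):

```python
import numpy as np
import pandas as pd

# A sparse matrix: most entries are exactly zero (a measured value of 0).
sparse = np.array([
    [0, 0, 0, 0, 4],
    [0, 0, 2, 0, 0],
    [0, 0, 0, 0, 0],
    [3, 0, 0, 0, 11],
    [0, 0, 0, 0, 0],
])
print(f"{np.count_nonzero(sparse == 0) / sparse.size:.0%} of entries are zero")

# "Null" is different: it means no measurement was made at all.
# pandas represents this with NaN and treats it differently from 0.
answers = pd.Series([2.5, np.nan, 0.0, np.nan])
print(answers.mean())        # 1.25 -- NaN values are ignored, not treated as 0
print(answers.isna().sum())  # 2 missing responses
```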
How Can We Deal With The Curse of Dimensionality In Data Science?
There are numerous methods for dealing with the Curse.
The Obvious: Reduce The Dimensions
If, in our example above, only 4 of the dimensions can be answered, then maybe it is best to stick to just those 4 and remove the 199,996 others. Simple, but sometimes this doesn’t work: for instance, what if we discover that, although people answer 4 dimensions on average, they typically don’t answer the same 4? We would still have very sparse data.
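One simple way to do this kind of trimming in practice is to drop any dimension that too few respondents answered. A sketch with pandas follows; the column names, the missingness rate, and the 50% threshold are all illustrative choices, not prescriptions:

```python
import numpy as np
import pandas as pd

# Hypothetical survey: 6 respondents, 8 of the 200,000 "life questions".
# np.nan marks a question the respondent could not answer.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.uniform(0, 100, size=(6, 8)),
                    columns=[f"q{i}" for i in range(8)])
data = data.mask(rng.uniform(size=data.shape) < 0.7)  # ~70% of answers missing

# Keep only the dimensions (columns) that at least half the respondents answered.
answered_fraction = data.notna().mean()
reduced = data.loc[:, answered_fraction >= 0.5]
print(f"kept {reduced.shape[1]} of {data.shape[1]} dimensions")
```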
Dimensionality Reduction
Dimensionality reduction is a family of techniques that reduce the number of dimensions while trying to preserve the statistical signal embedded in them. It is somewhat lossy, in that the statistical signal is not guaranteed to be preserved; however, dimensionality reduction techniques have proven incredibly useful in recent years. Here is a list of some of the most common techniques in data science (a short PCA sketch follows the list):
- Principal Component Analysis (PCA)
- Kernel PCA
- Linear Discriminant Analysis (LDA)
- Autoencoders
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
- Uniform Manifold Approximation and Projection (UMAP)
- Generalized Discriminant Analysis (GDA)
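As promised above, here is a minimal PCA sketch using scikit-learn; the toy dataset, the 95% variance target, and the library choice are all assumptions made for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy dataset: 200 observations that secretly live near a 2-dimensional
# plane embedded in 50 dimensions, plus a little noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 50))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 50))

# Keep just enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)   # (200, 50) -> (200, 2) or thereabouts
print(pca.explained_variance_ratio_)    # how much signal each component keeps
```

PCA finds new axes (linear combinations of the original dimensions) ordered by how much variance they capture, so dropping the low-variance axes discards mostly noise rather than signal.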
Thinking About Your Problem
Sometimes our data presents ways to trim out dimensions if we think carefully about the problem. An example can be found in analyzing healthcare x-ray images. We are given a set of a few hundred thousand x-rays of human skulls and tasked with identifying anomalies in them. This is incredibly high-dimensional data: given that our pixel (x, y) dimensions are 2000 × 3000, our total dimensions are
$\underbrace{2000 \times 3000}_{x \times y} \times \underbrace{3}_{\text{RGB}} \times \underbrace{16}_{\text{BPC}} = 288{,}000{,}000$
Above, x and y are the pixel dimensions of the image. RGB is the Red, Green, and Blue color channels at each pixel. BPC is "Bits Per Channel", the number of bits used to store each channel's intensity value: with 8 BPC each channel ranges from 0 to 255, and with 16 BPC from 0 to 65,535. PNG images, for instance, support up to 16 BPC.
Maybe, after exploratory data analysis (EDA), we discover that the vast majority of the space around the edges is black; we notice that if we crop away 25% of the width and height, keeping the centered region, we can still capture all or almost all of the skull without all that black border. Now our pixel dimensions are 1500 × 2250.
It’s an x-ray, so grayscale is all that matters by the nature of the imaging. Somehow these files are stored as RGB; by converting RGB to grayscale, we divide the dimensions by 3 (because the 3 RGB color channels collapse into 1 grayscale channel).
Finally, we could use 8 bits per channel instead of 16. This method is somewhat lossy, and it amounts to a form of compression.
Now our dimensions would be:
$\underbrace{1500 \times 2250}_{x \times y} \times \underbrace{1}_{\text{Grayscale}} \times \underbrace{8}_{\text{BPC}} = 27{,}000{,}000$
Now, 27 million dimensions is still far too many; clearly other techniques are needed here, possibly autoencoders. But we reduced the total number of dimensions by a factor of roughly 10 without doing much!
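Here is a rough sketch of those three reductions using Pillow and NumPy; the function name shrink_xray, the crop fractions, and the assumption that the file opens as a standard RGB image are all illustrative, not prescriptive:

```python
import numpy as np
from PIL import Image

def shrink_xray(path: str) -> np.ndarray:
    """Apply the three reductions from the worked example to one x-ray image."""
    img = Image.open(path)
    width, height = img.size                     # e.g. 2000 x 3000

    # 1. Crop away 25% of the width and height, keeping the centered region.
    left, top = int(width * 0.125), int(height * 0.125)
    right, bottom = int(width * 0.875), int(height * 0.875)
    img = img.crop((left, top, right, bottom))   # e.g. 1500 x 2250

    # 2. Collapse the 3 RGB channels into a single grayscale channel.
    #    Pillow's "L" mode stores 8 bits per pixel, so this step also covers
    #    the 16 -> 8 bits-per-channel reduction from the worked example.
    img = img.convert("L")

    pixels = np.asarray(img)
    print(pixels.shape, pixels.dtype)            # e.g. (2250, 1500) uint8
    return pixels
```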
Thinking about the number of dimensions in your problem, where they come from, and whether there are simple things you can do to reduce them is often a great first step.