Machine Learning for Engineering¶

High-Dimensional Systems: Introduction¶

Instructor: Daning Huang¶

$$ \newcommand{\bR}{\mathbb{R}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cD}{\mathcal{D}} $$

TODAY: High-Dimensional Systems - I¶

  • Introduction
  • "Feel" high dimensionality

Introduction¶

So far we have introduced a few regression models:

  • Useful for establishing the functional relation between several input and output variables.
  • Possible to handle dozens of variables
    • but not millions of variables
    • and more variables require a larger dataset, which is not always available
  • But in practical engineering systems, the number of variables can easily run into the millions
    • In terms of outputs, inputs, and system/design parameters
    • In engineering analysis (computational fluid dynamics, finite element analysis, etc.)
    • In control and dynamics (trajectory planning, uncertainty quantification, etc.)

Curse of Dimensionality¶

  • We have encountered one case - data sampling as number of inputs/parameters increases
    • With $N$ points in 1-D space, $N^m$ points are needed in $m$-D space to achieve the same density
    • We introduced the low-discrepancy sequence to tackle the problem.
  • There are certainly more, e.g., in general function approximation theory
    • To approximate a function in an $m$-D space with error less than $\epsilon$, on the order of $(1/\epsilon)^m$ evaluations are needed
  • But there are also blessings - some will be discussed today
    • A classical paper: Donoho2000, High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality

Overall, this module will provide a few ways to help tackle high-dimensional problems:

  • Dimensionality reduction
    • Obtain a compressed representation of data with significantly fewer dimensions
    • Conventional regression model will then become applicable.
  • Dynamical system modeling
    • If the data arrive ordered in time, how to exploit this sequential structure
    • In preparation for the next module (reduced-order models)

Getting a Feel for High Dimensionality¶

Example 1: Volumes¶

  • Consider a cube $C$ of side length 2 and a unit ball $B$ in a $\bR^n$ space. What is the ratio $r$ between the volumes of $B$ and $C$?
    • So $B$ is inside and tangent to $C$.

In general, for a cube of side length $2R$ and a ball of radius $R$ in $\bR^n$, the volumes are, respectively, $$ \mathrm{Vol}(C) = (2R)^n,\quad \mathrm{Vol}(B) = \frac{\pi^{n/2}}{\Gamma(n/2+1)}R^n $$ where $\Gamma$ is the Gamma function ($\Gamma(n)=(n-1)!$ for integer $n$). The ratio then decays faster than exponentially: $$ r = \frac{\pi^{n/2}}{2^n\Gamma(n/2+1)} $$

In [2]:
import numpy as np                   # imports (presumably set up in an earlier cell)
import matplotlib.pyplot as plt
from scipy.special import gamma

ns = np.arange(25)
rs = np.pi**(ns/2) / gamma(ns/2+1) / 2**ns   # volume ratio r as a function of dimension n
f = plt.figure(figsize=(6,4))
plt.semilogy(ns, rs, 'bo')
plt.xlabel('Dimension')
_=plt.ylabel('Ratio')

Example 2: Distribution in unit ball¶

  • Now we continue to look at this unit ball in $\bR^n$. Suppose we randomly and uniformly sample some points $\{x_i\}_{i=1}^N$ in this ball, will they be
    1. Concentrated at the center, i.e. $||x_i||\approx 0$?
    2. Mostly near the boundary, i.e. $||x_i||\approx 1$?
    3. Uniformly distributed, i.e. distribution of $||x_i||$ is uniform?

Thinking in a "polar" coordinate system, the probability of $||x_i||<d$ should be equal to $$ \frac{\mbox{Volume of Ball of Radius }d}{\mbox{Volume of Unit Ball}} $$ so $P(||x_i||<d)=d^n$

Then, what is the probability $P(d)$ that all the points are at least of distance $d$ away from the origin (i.e., center of ball)?

This is equivalent to $$ P(d) = P(\forall i,\ ||x_i|| \geq d) = \prod_{i=1}^N \left[ 1-P(||x_i||< d) \right] = (1-d^n)^N $$ Since $d<1$ and $n$ is large, $d^n\approx 0$ and hence $P(d)\approx 1$.

For example, when $n=20$, $N=500$, $P(0.72)\approx 0.5$, so

For 500 points uniformly distributed in a unit ball in $\bR^{20}$, there is a 50% chance that all the points are at least 0.72 (!) away from the origin.
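Below is a minimal numerical sketch (not part of the original notebook) that checks the formula $P(d)=(1-d^n)^N$ against a Monte-Carlo estimate; the uniform-in-ball sampling (uniform direction from normalized Gaussians, radius drawn as $U^{1/n}$) and the number of trials are choices made here for illustration.

# Minimal sketch: check P(d) = (1 - d^n)^N for n=20, N=500, d=0.72
n, N, d = 20, 500, 0.72
print('Analytical :', (1 - d**n)**N)                    # ~0.5

# Monte-Carlo estimate: repeatedly draw 500 points uniformly in the unit ball of R^20
trials = 2000
hits = 0
for _ in range(trials):
    g = np.random.randn(N, n)                           # random directions
    r = np.random.rand(N) ** (1.0 / n)                  # radii for uniform-in-ball sampling
    x = g / np.linalg.norm(g, axis=1, keepdims=True) * r[:, None]
    hits += np.all(np.linalg.norm(x, axis=1) >= d)      # all points at least d from the origin?
print('Monte-Carlo:', hits / trials)                    # should also be ~0.5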

Example 3: Gaussian distribution¶

  • Consider $N$ points $\{x_i\}_{i=1}^N$ sampled from a unit Gaussian distribution in $\bR^n$. How are the lengths of these sample points distributed?
    1. Something close to 0
    2. A non-trivial nonzero value

Even though intuitively Gaussians tend to concentrate at the mean (or the origin if mean=0) ...

First, the squared length of one sample follows the well-known chi-squared distribution, i.e. $$ \chi_n^2 \equiv \sum_{i=1}^n x_i^2 = ||x||^2 $$ And $\mathbb{E}(\chi_n^2)=n$ - so on average the length is about $\sqrt{n}$.

Second, there is a concentration theorem for our case - $$ P\left(\left| ||x||-\sqrt{n} \right| \geq t\right) \leq 2\exp(-ct^2) $$ where $c>0$ is a constant. So the probability that $||x||$ deviates from $\sqrt{n}$ by more than $t$ decays exponentially fast in $t$.

  • In other words, high-dimensional Gaussian variables, e.g. in $\bR^n$, mostly reside on a spherical shell of radius $\sqrt{n}$.
  • New intuition
    • Near the center, the PDF value is high; however, in high dimensions the volume near the center is small, so the actual probability of finding a point near the center is low
    • Far from the center, the volume is indeed large, but the Gaussian PDF decays exponentially, so the probability of finding a very distant point is also low
    • As a result, the points concentrate in a band (in terms of their length/radius)
  • Some more in-depth derivations here
  • This is called concentration of measure, see [Donoho2000] for more details
In [3]:
Ns = 1000     # Numerical verification over 1000 samples of unit Gaussian from R^2 and R^20
S2 = np.random.randn(Ns,2)
Sh = np.random.randn(Ns,20)
d2 = np.linalg.norm(S2, axis=1) / np.sqrt(2)   # Normalize the lengths, so they should be around 1
dh = np.linalg.norm(Sh, axis=1) / np.sqrt(20)

f = plt.figure(figsize=(6,4))
plt.hist(d2, density=True, label='R^2')
plt.hist(dh, density=True, label='R^20')
plt.legend()
plt.xlabel('Normalized length')
_=plt.ylabel('PDF')

Example 4: Orthogonality¶

  • If we randomly sample two vectors $x$ and $y$ from a $\bR^n$ space, how likely will the two be orthogonal to each other?
    • For simplicity assume $||x||=||y||=1$; and we can just fix one vector, say $y=[1,0,\cdots,0]$
    • In 2D and 3D, it is very unlikely

After some derivation (some tedious integral on $\bR^n$), the PDF of the angle between $x$ and $y$ is $$ p(\theta) = \frac{\Gamma(n/2)}{\Gamma((n-1)/2)\sqrt{\pi}}\sin^{n-2}(\theta) $$ where $\theta=\cos^{-1}(x^Ty)$ and $$ P(|x^T y|\geq \epsilon) \leq \frac{1}{n\epsilon^2} $$

  • i.e., $x^Ty$ is likely to be 0
  • i.e., randomly chosen high-dimensional vectors are likely to be orthogonal
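As a quick check (a sketch added here, not part of the original notebook), we can estimate $P(|x^Ty|\geq \epsilon)$ empirically for random unit vectors and compare it against the bound $\frac{1}{n\epsilon^2}$; the values of $n$, $\epsilon$, and the sample size below are arbitrary choices.

# Minimal sketch: empirical check of P(|x^T y| >= eps) <= 1/(n*eps^2)
n, eps, trials = 100, 0.3, 20000
X = np.random.randn(trials, n)
X /= np.linalg.norm(X, axis=1, keepdims=True)       # uniform random points on the unit sphere
y = np.zeros(n); y[0] = 1.0                         # fix the second vector as e_1
print('Empirical probability  :', np.mean(np.abs(X @ y) >= eps))
print('Upper bound 1/(n*eps^2):', 1 / (n * eps**2))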
In [4]:
tt = np.linspace(0,np.pi,100)
f, ax = plt.subplots(figsize=(6,4))
for _i in [2, 4, 8, 16, 32, 64]:
    pt = gamma(_i/2) / gamma((_i-1)/2) / np.sqrt(np.pi) * np.sin(tt)**(_i-2)
    ax.plot(tt, pt, label=f'R^{_i}')
ax.legend()
ax.set_xticks([0,np.pi/2,np.pi])
ax.set_xticklabels(['0', r'$\pi/2$', r'$\pi$'])
ax.set_xlabel(r'$\theta$')
_=ax.set_ylabel(r'$p(\theta)$')

We can also verify numerically. Randomly generate 100 vectors $w_i\in\bR^{100}$ and form a matrix $W=[w_1,w_2,\cdots,w_{100}]$. If the vectors are orthogonal to each other, then $$ W^TW\approx I $$ To make the vectors approximately unit length, we sample each entry from $N(0,1/n)$, so that $\mathbb{E}||w_i||^2=1$.

In [5]:
n = 100
W = np.random.randn(n, n) / np.sqrt(n)
X = np.dot(W.T, W)
f = plt.figure(figsize=(3,3))
plt.imshow(X)
_=plt.title(f'Error {np.mean(np.abs(X - np.eye(n))):5.4f}')
  • Connect back to the weight initialization in neural networks
  • We suggested, for example, drawing the initial weights from a uniform distribution $U[-s,s]$ (and setting the biases to 0)
    • Xavier (tanh/sigmoid): $s=\sqrt{3/n}$
    • Kaiming (ReLU): $s=\sqrt{6/n}$
  • We argued that in this way, for the linear layer $$ h=Wx+b $$ the output $h$ remains small if the input $x$ is small, so that the subsequent activation $\sigma(h)$ does not saturate.
  • Since $W$ is a high-dimensional random matrix, and for simplicity assume $W$ is square, then $$ ||h|| = ||Wx|| = \sqrt{x^TW^TWx} \approx \sqrt{x^Tx} = ||x|| $$ i.e. the norm is unchanged before and after the linear layer.

    This is a geometric way to make sense of the weight initialization; a quick numerical check is sketched below.
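A minimal sketch (added here for illustration, with an arbitrary dimension) of the norm-preservation argument above:

# Minimal sketch: ||Wx|| ~ ||x|| for a square W with i.i.d. N(0, 1/n) entries and b = 0
n = 512
W = np.random.randn(n, n) / np.sqrt(n)             # same scaling as in Example 4
x = np.random.randn(n)
print(np.linalg.norm(W @ x) / np.linalg.norm(x))   # close to 1: the norm is roughly preserved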

Example 5: Data compression¶

  • How much storage do we need to store (approximately) $N$ vectors in $\bR^m$?
    1. Proportional to $N$ and $m$
    2. Proportional to only $m$
    3. Proportional to only $N$
  • By the Johnson–Lindenstrauss (J-L) lemma,

    Each vector can be stored with only $O(\log(N))$ numbers, regardless of the dimension $m$

  • This example actually connects most of the conclusions from the previous examples

Mathematically, suppose we want to store $N$ vectors $v_1,v_2,\cdots,v_N$, $v_i\in\bR^{m}$

The J-L lemma says, one can

  • Choose $n>\frac{24\log N}{\epsilon^2}$, where $\epsilon$ is an error tolerance
  • Randomly generate a matrix $A\in\bR^{n\times m}$, where each entry is sampled i.i.d. from $N(0,1/n)$
    • Note that $A$ would be approximately orthogonal
  • Then the vectors $v_i$ can be stored as $w_i=Av_i\in\bR^n$, for $i=1,2,\cdots,N$
    • So this is like a projection on to a subspace
  • Then, with probability at least $\frac{N-1}{N}$, all pairwise distances are preserved within a relative error of $\epsilon$: $$ (1-\epsilon)||v_i-v_j||^2 \leq ||w_i-w_j||^2 \leq (1+\epsilon)||v_i-v_j||^2 $$
  • The above result is independent of the dimension of the original vectors, and depends only on the number of vectors.
  • And the guarantee is more likely to hold when there are more vectors, since the success probability $\frac{N-1}{N}$ increases with $N$.
  • The J-L lemma is a basis of dimensionality reduction; the factor $\frac{24\log N}{\epsilon^2}$ might be large, but the important part is the $\log N$ scaling

Let's numerically verify the lemma. Consider $N=1000$ vectors that represent sinusoidal curves sampled over a grid, i.e. $$ v_i = [\sin(i\pi x_1), \sin(i\pi x_2),\cdots, \sin(i\pi x_m)],\quad i=1,2,\cdots,N $$ where the $x_j$ form a uniform grid of $m$ points over $[0,1]$, i.e., $x_j=\frac{j-1}{m-1}$, $j=1,2,\cdots,m$.

For an error $\epsilon=0.1$, by J-L lemma, we need $n>\frac{24\log N}{\epsilon^2}\approx 16578.6$.

But, let's just do $n=200$.

For simplicity, we will check the error just for the first vector, against the remaining vectors, $$ \mbox{Error}_i = \left| \frac{||w_1-w_i||}{||v_1-v_i||} - 1 \right|,\quad i=2,3,\cdots,N $$

In [6]:
N, n = 1000, 200   # Number of vectors, Reduced space of w
f = plt.figure(figsize=(6,4))
# Loop over a range of dimensions for v_i
for _g in [101, 501, 1001, 2001, 5001]:
    xx = np.linspace(0,1,_g)
    V  = np.array([np.sin(i*np.pi*xx) for i in range(1,N+1)]).T
    A  = np.random.randn(n,_g) / np.sqrt(n)  # Random projection matrix
    W  = A.dot(V)                            # The projection
    Wt, Vt = W.T, V.T
    Er = np.abs( (np.linalg.norm(Wt-Wt[0], axis=1) / np.linalg.norm(Vt-Vt[0], axis=1))-1 )
    plt.plot(Er, label=f'm={_g}')
plt.legend()
plt.xlabel('No. of vector')     # See how the errors are constant on average,
_=plt.ylabel('Relative Error')  # regardless of dimension of v_i