Machine Learning for Engineering¶

High-Dimensional Systems: Introduction¶

Instructor: Daning Huang¶

$$ \newcommand{\bR}{\mathbb{R}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cD}{\mathcal{D}} $$

TODAY: High-Dimensional Systems - I¶

  • Introduction
  • "Feel" high dimensionality

Introduction¶

So far we have introduced a few regression models:

  • Useful for establishing the functional relation between several input and output variables.
  • Possible to handle dozens of variables
    • but not millions of variables
    • and more variables require a larger dataset, which is not always available
  • But in practical engineering systems, the number of variables can easily run into the millions
    • In terms of outputs, inputs, and system/design parameters
    • In engineering analysis (computational fluid dynamics, finite element analysis, etc.)
    • In control and dynamics (trajectory planning, uncertainty quantification, etc.)

Curse of Dimensionality¶

  • We have encountered one case - data sampling as number of inputs/parameters increases
    • With $N$ points in 1-D space, $N^m$ points are needed in $m$-D space to achieve the same density
    • We introduced the low-discrepancy sequence to tackle the problem.
  • There are certainly more, e.g., in general function approximation theory
    • To approximate a function in an $m$-D space with error less than $\epsilon$, on the order of $(1/\epsilon)^m$ evaluations are needed
  • But there are also blessings - some will be discussed today
    • A classical paper: Donoho2000, High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality

Overall, this module will provide a few ways to help tackle high-dimensional problems:

  • Dimensionality reduction
    • Obtain a compressed representation of data with significantly fewer dimensions
    • Conventional regression model will then become applicable.
  • Dynamical system modeling
    • If the data arrive ordered in time, how to exploit this sequential structure
    • In preparation for the next module (reduced-order models)

Getting a Feel for High Dimensionality¶

Example 1: Volumes¶

  • Consider a cube $C$ of side length 2 and a unit ball $B$ in a $\bR^n$ space. What is the ratio $r$ between the volumes of $B$ and $C$?
    • So $B$ is inside and tangent to $C$.

In general, for a cube of side length $2R$ and a ball of radius $R$ in $\bR^n$, the volumes are, respectively, $$ \mathrm{Vol}(C) = (2R)^n,\quad \mathrm{Vol}(B) = \frac{\pi^{n/2}}{\Gamma(n/2+1)}R^n $$ where $\Gamma$ is the Gamma function ($\Gamma(n)=(n-1)!$ for integer $n$). The ratio then decays faster than exponentially: $$ r = \frac{\pi^{n/2}}{2^n\Gamma(n/2+1)} $$

In [2]:
import numpy as np                   # imports (presumably set up in an earlier cell)
import matplotlib.pyplot as plt
from scipy.special import gamma

ns = np.arange(25)
rs = np.pi**(ns/2) / gamma(ns/2+1) / 2**ns   # volume ratio r as a function of dimension n
f = plt.figure(figsize=(6,4))
plt.semilogy(ns, rs, 'bo')
plt.xlabel('Dimension')
_=plt.ylabel('Ratio')

Example 2: Distribution in unit ball¶

  • Now we continue to look at this unit ball in $\bR^n$. Suppose we randomly and uniformly sample some points $\{x_i\}_{i=1}^N$ in this ball, will they be
    1. Concentrated at the center, i.e. $||x_i||\approx 0$?
    2. Mostly near the boundary, i.e. $||x_i||\approx 1$?
    3. Uniformly distributed, i.e. distribution of $||x_i||$ is uniform?

Thinking in a "polar" coordinate system, the probability of $||x_i||<d$ should be equal to $$ \frac{\mbox{Volume of Ball of Radius }d}{\mbox{Volume of Unit Ball}} $$ so $P(||x_i||<d)=d^n$

Then, what is the probability $P(d)$ that all the points are at least of distance $d$ away from the origin (i.e., center of ball)?

This is equivalent to $$ P(d) = P(\forall i,\ ||x_i|| \geq d) = \prod_{i=1}^N \left[ 1-P(||x_i||< d) \right] = (1-d^n)^N $$ Since $d<1$ and $n$ is large, $d^n\approx 0$ and hence $P(d)\approx 1$.

For example, when $n=20$, $N=500$, $P(0.72)\approx 0.5$, so

For 500 points uniformly distributed in a unit ball in $\bR^{20}$, there is a 50% chance that all the points are at least 0.72 (!) away from the origin.
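Below is a minimal numerical sketch (not part of the original notebook) that checks the formula $P(d)=(1-d^n)^N$ against a Monte-Carlo estimate; the uniform-in-ball sampling (uniform direction from normalized Gaussians, radius drawn as $U^{1/n}$) and the number of trials are choices made here for illustration.

# Minimal sketch: check P(d) = (1 - d^n)^N for n=20, N=500, d=0.72
n, N, d = 20, 500, 0.72
print('Analytical :', (1 - d**n)**N)                    # ~0.5

# Monte-Carlo estimate: repeatedly draw 500 points uniformly in the unit ball of R^20
trials = 2000
hits = 0
for _ in range(trials):
    g = np.random.randn(N, n)                           # random directions
    r = np.random.rand(N) ** (1.0 / n)                  # radii for uniform-in-ball sampling
    x = g / np.linalg.norm(g, axis=1, keepdims=True) * r[:, None]
    hits += np.all(np.linalg.norm(x, axis=1) >= d)      # all points at least d from the origin?
print('Monte-Carlo:', hits / trials)                    # should also be ~0.5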

Example 3: Gaussian distribution¶

  • Consider $N$ points $\{x_i\}_{i=1}^N$ sampled from a unit Gaussian distribution in $\bR^n$. How are the lengths of these sample points distributed?
    1. Something close to 0
    2. A non-trivial nonzero value

Even though intuitively Gaussians tend to concentrate at the mean (or the origin if mean=0) ...

First, the squared length of one sample follows the well-known chi-squared distribution, i.e. $$ \chi_n^2 \equiv \sum_{i=1}^n x_i^2 = ||x||^2 $$ And $\mathbb{E}(\chi_n^2)=n$ - so on average the length is about $\sqrt{n}$.

Second, there is a concentration theorem for our case - $$ P\left(\left| ||x||-\sqrt{n} \right| \geq t\right) \leq 2\exp(-ct^2) $$ where $c>0$ is a constant. So the probability that $||x||$ deviates from $\sqrt{n}$ by more than $t$ decays exponentially fast in $t$.

  • In other words, high-dimensional Gaussian variables, e.g. in $\bR^n$, mostly reside on a spherical shell of radius $\sqrt{n}$.
  • New intuition
    • Near the center, the PDF value is high; however, in high dimensions the volume near the center is small, so the actual probability of finding a point near the center is low
    • Far from the center, the volume is indeed large, but the Gaussian PDF decays exponentially, so the probability of finding a very distant point is also low
    • As a result, the points concentrate in a band (in terms of their length/radius)
  • Some more in-depth derivations here
  • This is called concentration of measure, see [Donoho2000] for more details
In [3]:
Ns = 1000     # Numerical verification over 1000 samples of unit Gaussian from R^2 and R^20
S2 = np.random.randn(Ns,2)
Sh = np.random.randn(Ns,20)
d2 = np.linalg.norm(S2, axis=1) / np.sqrt(2)   # Normalize the lengths, so they should be around 1
dh = np.linalg.norm(Sh, axis=1) / np.sqrt(20)

f = plt.figure(figsize=(6,4))
plt.hist(d2, density=True, label='R^2')
plt.hist(dh, density=True, label='R^20')
plt.legend()
plt.xlabel('Normalized length')
_=plt.ylabel('PDF')

Example 4: Orthogonality¶

  • If we randomly sample two vectors $x$ and $y$ from a $\bR^n$ space, how likely will the two be orthogonal to each other?
    • For simplicity assume $||x||=||y||=1$; and we can just fix one vector, say $y=[1,0,\cdots,0]$
    • In 2D and 3D, it is very unlikely

After some derivation (some tedious integral on $\bR^n$), the PDF of the angle between $x$ and $y$ is $$ p(\theta) = \frac{\Gamma(n/2)}{\Gamma((n-1)/2)\sqrt{\pi}}\sin^{n-2}(\theta) $$ where $\theta=\cos^{-1}(x^Ty)$ and $$ P(|x^T y|\geq \epsilon) \leq \frac{1}{n\epsilon^2} $$

  • i.e., $x^Ty$ is likely to be 0
  • i.e., randomly chosen high-dimensional vectors are likely to be orthogonal
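As a quick check (a sketch added here, not part of the original notebook), we can estimate $P(|x^Ty|\geq \epsilon)$ empirically for random unit vectors and compare it against the bound $\frac{1}{n\epsilon^2}$; the values of $n$, $\epsilon$, and the sample size below are arbitrary choices.

# Minimal sketch: empirical check of P(|x^T y| >= eps) <= 1/(n*eps^2)
n, eps, trials = 100, 0.3, 20000
X = np.random.randn(trials, n)
X /= np.linalg.norm(X, axis=1, keepdims=True)       # uniform random points on the unit sphere
y = np.zeros(n); y[0] = 1.0                         # fix the second vector as e_1
print('Empirical probability  :', np.mean(np.abs(X @ y) >= eps))
print('Upper bound 1/(n*eps^2):', 1 / (n * eps**2))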
In [4]:
tt = np.linspace(0,np.pi,100)
f, ax = plt.subplots(figsize=(6,4))
for _i in [2, 4, 8, 16, 32, 64]:
    pt = gamma(_i/2) / gamma((_i-1)/2) / np.sqrt(np.pi) * np.sin(tt)**(_i-2)
    ax.plot(tt, pt, label=f'R^{_i}')
ax.legend()
ax.set_xticks([0,np.pi/2,np.pi])
ax.set_xticklabels(['0', r'$\pi/2$', r'$\pi$'])
ax.set_xlabel(r'$\theta$')
_=ax.set_ylabel(r'$p(\theta)$')

We can also verify numerically. Randomly generate 100 vectors $w_i\in\bR^{100}$ and form a matrix $W=[w_1,w_2,\cdots,w_{100}]$. If the vectors are orthogonal to each other, then $$ W^TW\approx I $$ To make the vectors approximately unit length, we sample each entry from $N(0,1/n)$, so that $\mathbb{E}||w_i||^2=1$.

In [5]:
n = 100
W = np.random.randn(n, n) / np.sqrt(n)
X = np.dot(W.T, W)
f = plt.figure(figsize=(3,3))
plt.imshow(X)
_=plt.title(f'Error {np.mean(np.abs(X - np.eye(n))):5.4f}')
  • Connect back to the weight initialization in neural networks
  • We suggested, for example, drawing the initial weights from a uniform distribution $U[-s,s]$ (and setting the biases to 0)
    • Xavier (tanh/sigmoid): $s=\sqrt{3/n}$
    • Kaiming (ReLU): $s=\sqrt{6/n}$
  • We argued that in this way, for the linear layer $$ h=Wx+b $$ the output $h$ remains small if the input $x$ is small, so that the subsequent activation $\sigma(h)$ does not saturate.
  • Since $W$ is a high-dimensional random matrix, and for simplicity assume $W$ is square, then $$ ||h|| = ||Wx|| = \sqrt{x^TW^TWx} \approx \sqrt{x^Tx} = ||x|| $$ i.e. the norm is unchanged before and after the linear layer.

    This is a geometric way to make sense of the weight initialization; a quick numerical check is sketched below.
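A minimal sketch (added here for illustration, with an arbitrary dimension) of the norm-preservation argument above:

# Minimal sketch: ||Wx|| ~ ||x|| for a square W with i.i.d. N(0, 1/n) entries and b = 0
n = 512
W = np.random.randn(n, n) / np.sqrt(n)             # same scaling as in Example 4
x = np.random.randn(n)
print(np.linalg.norm(W @ x) / np.linalg.norm(x))   # close to 1: the norm is roughly preserved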

Example 5: Data compression¶

  • How much storage do we need to store (approximately) $N$ vectors in $\bR^m$?
    1. Proportional to $N$ and $m$
    2. Proportional to only $m$
    3. Proportional to only $N$
  • By the Johnson–Lindenstrauss (J-L) lemma,

    Each vector can be stored with only $O(\log(N))$ numbers, regardless of the dimension $m$

  • This example actually connects most of the conclusions from the previous examples

Mathematically, suppose we want to store $N$ vectors $v_1,v_2,\cdots,v_N$, $v_i\in\bR^{m}$

The J-L lemma says, one can

  • Choose $n>\frac{24\log N}{\epsilon^2}$, where $\epsilon$ is an error tolerance
  • Randomly generate a matrix $A\in\bR^{n\times m}$, where each entry is sampled i.i.d. from $N(0,1/n)$
    • Note that $A$ would be approximately orthogonal
  • Then the vectors $v_i$ can be stored as $w_i=Av_i\in\bR^n$, for $i=1,2,\cdots,N$
    • So this is like a projection on to a subspace
  • Then, with probability at least $\frac{N-1}{N}$, all pairwise distances are preserved within a relative error of $\epsilon$: $$ (1-\epsilon)||v_i-v_j||^2 \leq ||w_i-w_j||^2 \leq (1+\epsilon)||v_i-v_j||^2 $$
  • The above result is independent of the dimension of the original vectors, and depends only on the number of vectors.
  • And the guarantee is more likely to hold when there are more vectors, since the success probability $\frac{N-1}{N}$ increases with $N$.
  • The J-L lemma is a basis of dimensionality reduction; the factor $\frac{24\log N}{\epsilon^2}$ might be large, but the important part is the $\log N$ scaling

Let's numerically verify the lemma. Consider $N=1000$ vectors that represent sinusoidal curves sampled over a grid, i.e. $$ v_i = [\sin(i\pi x_1), \sin(i\pi x_2),\cdots, \sin(i\pi x_m)],\quad i=1,2,\cdots,N $$ where the $x_j$ form a uniform grid of $m$ points over $[0,1]$, i.e., $x_j=\frac{j-1}{m-1}$, $j=1,2,\cdots,m$.

For an error $\epsilon=0.1$, by J-L lemma, we need $n>\frac{24\log N}{\epsilon^2}\approx 16578.6$.

But, let's just do $n=200$.

For simplicity, we will check the error just for the first vector, against the remaining vectors, $$ \mbox{Error}_i = \left| \frac{||w_1-w_i||}{||v_1-v_i||} - 1 \right|,\quad i=2,3,\cdots,N $$

In [6]:
N, n = 1000, 200   # Number of vectors, Reduced space of w
f = plt.figure(figsize=(6,4))
# Loop over a range of dimensions for v_i
for _g in [101, 501, 1001, 2001, 5001]:
    xx = np.linspace(0,1,_g)
    V  = np.array([np.sin(i*np.pi*xx) for i in range(1,N+1)]).T
    A  = np.random.randn(n,_g) / np.sqrt(n)  # Random projection matrix
    W  = A.dot(V)                            # The projection
    Wt, Vt = W.T, V.T
    Er = np.abs( (np.linalg.norm(Wt-Wt[0], axis=1) / np.linalg.norm(Vt-Vt[0], axis=1))-1 )
    plt.plot(Er, label=f'm={_g}')
plt.legend()
plt.xlabel('No. of vector')     # See how the errors are constant on average,
_=plt.ylabel('Relative Error')  # regardless of dimension of v_i