So far we have introduced a few regression models:
Overall, this module will provide a few ways to help tackle high-dimensional problems:
In general, the volumes of a cube of side $2R$ and of the inscribed ball of radius $R$ in $\bR^n$ are, respectively, $$ \mathrm{Vol}(C) = (2R)^n,\quad \mathrm{Vol}(B) = \frac{\pi^{n/2}}{\Gamma(n/2+1)}R^n $$ where $\Gamma$ is the Gamma function (for positive integers, $\Gamma(n)=(n-1)!$). The ratio of the ball's volume to the cube's volume then decays to zero at least exponentially fast in $n$: $$ r = \frac{\mathrm{Vol}(B)}{\mathrm{Vol}(C)} = \frac{\pi^{n/2}}{2^n\Gamma(n/2+1)} $$
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import gamma

# Ratio of ball volume to cube volume as a function of the dimension n
ns = np.arange(25)
rs = np.pi**(ns/2) / gamma(ns/2+1) / 2**ns
f = plt.figure(figsize=(6,4))
plt.semilogy(ns, rs, 'bo')
plt.xlabel('Dimension')
_=plt.ylabel('Ratio')
Thinking in a "polar" coordinate system, for a point $x_i$ uniformly distributed in the unit ball, the probability of $||x_i||<d$ should be equal to $$ \frac{\mbox{Volume of Ball of Radius }d}{\mbox{Volume of Unit Ball}} $$ so $P(||x_i||<d)=d^n$.
Then, what is the probability $P(d)$ that all the points are at least of distance $d$ away from the origin (i.e., center of ball)?
This is equivalent to $$ P(d) = P(\forall ||x_i|| \geq d) = \prod_{i=1}^N \left[ 1-P(||x_i||< d) \right] = (1-d^n)^N $$ Since $d<1$ and $n$ is large, $d^n\approx 0$, so $P(d)\approx 1$ even for fairly large $N$.
For example, when $n=20$ and $N=500$, $P(0.72)\approx 0.5$, so:
For 500 points uniformly distributed in the unit ball in $\bR^{20}$, there is a 50% chance that all of them are at least 0.72 (!) away from the origin.
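As a quick sanity check, here is a small script comparing the closed-form probability $(1-d^n)^N$ with a Monte Carlo estimate; the seed, the number of trials, and the trick of sampling uniform points in a ball (random direction times radius $U^{1/n}$) are implementation choices, not part of the text above.
import numpy as np

rng = np.random.default_rng(0)
n, N, d = 20, 500, 0.72
# Closed form from above: P(all N points at least d from the origin) = (1 - d^n)^N
print('formula  :', (1 - d**n)**N)
# Monte Carlo: uniform points in the unit ball = random direction * radius U^(1/n)
trials = 2000
hits = 0
for _ in range(trials):
    X = rng.standard_normal((N, n))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # random directions
    X *= rng.random((N, 1))**(1/n)                  # radii ~ U^(1/n)
    hits += np.linalg.norm(X, axis=1).min() >= d
print('simulated:', hits / trials)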
Even though intuitively Gaussians tend to concentrate at the mean (or the origin if mean $=0$), in high dimensions the samples actually concentrate near a sphere of radius $\sqrt{n}$ around the mean.
First, the squared length of one sample follows the well-known chi-squared distribution with $n$ degrees of freedom, i.e. $$ \chi_n^2 \equiv \sum_{i=1}^n x_i^2 = ||x||^2 $$ And $\mathbb{E}(\chi_n^2)=n$ - so on average the length is about $\sqrt{n}$.
Second, there is a concentration theorem for our case - $$ P\left(\big|\,||x||-\sqrt{n}\,\big| \geq t\right) \leq 2\exp(-ct^2) $$ where $c>0$ is a constant. So the probability that $||x||$ deviates from its typical value $\sqrt{n}$ by more than $t$ decays exponentially fast in $t$.
Ns = 1000 # Numerical verification over 1000 samples of unit Gaussian from R^2 and R^20
S2 = np.random.randn(Ns,2)
Sh = np.random.randn(Ns,20)
d2 = np.linalg.norm(S2, axis=1) / np.sqrt(2) # Normalize the lengths, so they should be around 1
dh = np.linalg.norm(Sh, axis=1) / np.sqrt(20)
f = plt.figure(figsize=(6,4))
plt.hist(d2, bins=30, density=True, alpha=0.6, label='R^2')
plt.hist(dh, bins=30, density=True, alpha=0.6, label='R^20')
plt.legend()
plt.xlabel('Normalized length')
_=plt.ylabel('PDF')
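To put numbers on this concentration, we can also compare the spread of the normalized lengths in the two cases (a small follow-up reusing the d2 and dh arrays from the cell above):
# The spread of the normalized length shrinks as the dimension grows
print(f'R^2 : mean={d2.mean():.3f}, std={d2.std():.3f}')
print(f'R^20: mean={dh.mean():.3f}, std={dh.std():.3f}')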
For two independent random unit vectors $x,y\in\bR^n$ (uniformly distributed on the sphere), after some derivation (a tedious integral on $\bR^n$), the PDF of the angle between $x$ and $y$ is $$ p(\theta) = \frac{\Gamma(n/2)}{\Gamma((n-1)/2)\sqrt{\pi}}\sin^{n-2}(\theta) $$ where $\theta=\cos^{-1}(x^Ty)$. Moreover, since $\mathbb{E}[(x^Ty)^2]=1/n$, Chebyshev's inequality gives $$ P(|x^T y|\geq \epsilon) \leq \frac{1}{n\epsilon^2} $$
tt = np.linspace(0,np.pi,100)
f, ax = plt.subplots(figsize=(6,4))
for _i in [2, 4, 8, 16, 32, 64]:
    pt = gamma(_i/2) / gamma((_i-1)/2) / np.sqrt(np.pi) * np.sin(tt)**(_i-2)
    ax.plot(tt, pt, label=f'R^{_i}')
ax.legend()
ax.set_xticks([0,np.pi/2,np.pi])
ax.set_xticklabels(['0', r'$\pi/2$', r'$\pi$'])
ax.set_xlabel(r'$\theta$')
_=ax.set_ylabel(r'$p(\theta)$')
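The tail bound $P(|x^Ty|\geq\epsilon)\leq\frac{1}{n\epsilon^2}$ can be checked by simulation as well; below is a rough sketch, where the choices $n=100$, $\epsilon=0.2$, the seed, and the sample size are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
n, eps, trials = 100, 0.2, 20000
X = rng.standard_normal((trials, n))
Y = rng.standard_normal((trials, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # random unit vectors
Y /= np.linalg.norm(Y, axis=1, keepdims=True)
dots = np.abs(np.sum(X * Y, axis=1))            # |x^T y| for each pair
print('empirical P(|x^T y| >= eps):', np.mean(dots >= eps))
print('bound 1/(n eps^2)          :', 1 / (n * eps**2))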
We can also verify numerically. Randomly generate 100 vectors $w_i\in\bR^{100}$ and form a matrix $W=[w_1,w_2,\cdots,w_{100}]$. If the vectors are (nearly) orthogonal to each other, then $$ W^TW\approx I $$ To make the vectors approximately unit length, we draw each entry from $N(0,1/n)$.
n = 100
W = np.random.randn(n, n) / np.sqrt(n)  # entries ~ N(0, 1/n), columns are the w_i
X = np.dot(W.T, W)                      # should be close to the identity
f = plt.figure(figsize=(3,3))
plt.imshow(X)
_=plt.title(f'Error {np.mean(np.abs(X - np.eye(n))):5.4f}')
This is a geometric way to make sense of the weight initialization: weight vectors drawn from $N(0,1/n)$ are nearly orthonormal.
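As a small illustration (a sketch with arbitrary sizes and seed), a matrix with $N(0,1/n)$ entries approximately preserves the length of a random input, which is exactly what the near-identity $W^TW$ above suggests:
import numpy as np

rng = np.random.default_rng(2)
n = 100
W = rng.standard_normal((n, n)) / np.sqrt(n)   # entries ~ N(0, 1/n)
x = rng.standard_normal(n)                     # a random input vector
print('||x||  =', np.linalg.norm(x))
print('||Wx|| =', np.linalg.norm(W @ x))       # should be close to ||x||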
By the Johnson–Lindenstrauss (J-L) lemma,
we only need $O(\log N)$ dimensions to store $N$ vectors (up to a small distortion of pairwise distances), regardless of the original dimension $m$
This example actually connects most of the conclusions from the previous examples
Mathematically, suppose we want to store $N$ vectors $v_1,v_2,\cdots,v_N$, $v_i\in\bR^{m}$
The J-L lemma says: one can find a linear map $f:\bR^m\to\bR^n$ with $n=O(\log N/\epsilon^2)$ such that $$ (1-\epsilon)||v_i-v_j|| \leq ||f(v_i)-f(v_j)|| \leq (1+\epsilon)||v_i-v_j|| $$ for all pairs $i,j$ - regardless of the original dimension $m$
Let's numerically verify the lemma. Consider $N=1000$ vectors that represent sinusoidal curves sampled over a grid, i.e. $$ v_i = [\sin(i\pi x_1), \sin(i\pi x_2),\cdots, \sin(i\pi x_m)],\quad i=1,2,\cdots,N $$ where the $x_j$ form a uniform grid of $m$ points over $[0,1]$, i.e., $x_j=\frac{j-1}{m-1}$, $j=1,2,\cdots,m$.
For an error $\epsilon=0.1$, the J-L lemma requires $n>\frac{24\log N}{\epsilon^2}\approx 16578.6$.
But, let's just do $n=200$.
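For reference, the quoted bound is easy to compute directly (natural log, as in the statement above):
import numpy as np
print(24 * np.log(1000) / 0.1**2)   # ~ 16578.6, far larger than n = 200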
For simplicity, we will check the error just for the first vector, against the rest of the vectors, $$ \mbox{Error}_i = \left| \frac{||w_1-w_i||}{||v_1-v_i||} - 1 \right|,\quad i=2,3,\cdots,N $$
N, n = 1000, 200 # Number of vectors, Reduced space of w
f = plt.figure(figsize=(6,4))
# Loop over a range of dimensions for v_i
for _g in [101, 501, 1001, 2001, 5001]:
    xx = np.linspace(0, 1, _g)
    V = np.array([np.sin(i*np.pi*xx) for i in range(1,N+1)]).T
    A = np.random.randn(n, _g) / np.sqrt(n)  # Random projection matrix
    W = A.dot(V)                             # The projection
    Wt, Vt = W.T, V.T
    # Compare vector 1 against vectors 2..N (skip the zero self-distance)
    Er = np.abs( (np.linalg.norm(Wt[1:]-Wt[0], axis=1) / np.linalg.norm(Vt[1:]-Vt[0], axis=1)) - 1 )
    plt.plot(Er, label=f'm={_g}')
plt.legend()
plt.xlabel('No. of vector') # See how the errors are constant on average,
_=plt.ylabel('Relative Error') # regardless of dimension of v_i