Machine Learning for Engineering¶

Gaussian Process Regression: Sampling¶

Instructor: Daning Huang¶

TODAY: Gaussian Process Regression - IV¶

  • Design of computer experiments
  • Latin hypercube sampling
  • Low-discrepancy sequence

Motivation¶

  • Previously, we assumed the training dataset was given to us, i.e., we did not control the data generation. The sampled inputs may not be representative of the parameter space of interest.
  • But in many cases, we want a surrogate of a computational model that is expensive to evaluate.
    • For example, a CFD or FEM model
    • In this case, however, we can specify the set of inputs that populates the parameter space.
  • The question then becomes how to choose inputs that best represent the parameter space, and then build an ML model that approximates the input-output relation of interest.

Differences between this lecture and the previous one:

  • Previously, the sampling focused on approximating a probability distribution.
  • This time, the sampling is about filling a space.

Design of Experiments¶

  • Full-factorial
    • For $d$ dimensions, divide each into $N$ intervals, and generate an $N^d$ grid (see the sketch after this list).
    • Good for low dimensions
  • Variants
    • Central composite design: Factorial + Center + Axial
    • Box-Behnken, Plackett-Burman, etc.
    • See Wikipedia
  • Issues
    • Curse of dimensionality
    • Insufficient for complex functions
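
As a reference point, a full-factorial design is easy to construct; the cell below is a minimal sketch (not from the original notebook) that builds an $N^d$ grid over the unit hypercube with NumPy and makes the $N^d$ growth explicit.

In [ ]:
import numpy as np

def full_factorial(N, d):
    """N^d grid over the unit hypercube [0, 1]^d, with N levels per dimension."""
    levels = np.linspace(0.0, 1.0, N)                    # N levels in each dimension
    grids = np.meshgrid(*([levels] * d), indexing='ij')  # d-dimensional tensor grid
    return np.stack([g.ravel() for g in grids], axis=1)  # shape (N**d, d)

X = full_factorial(5, 2)   # 25 points on a 5x5 grid
print(X.shape)             # (25, 2) -- the point count grows as N**d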

Latin Hypercube Sampling¶

  • For $d$ dimensions, divide each into $N$ intervals, and generate $N$ points (see the sketch after this list).
    • No two points share the same interval in any dimension.
    • Like placing $N$ rooks on a chessboard so that no two threaten each other.
  • Typical criteria
    • Center the points within the sampling intervals.
    • Maximize the minimum distance between points and place the point in a randomized location within its interval.
    • Maximize the minimum distance between points and center the point within its interval.
    • Minimize the maximum correlation coefficient.
  • Issues
    • The samples are sometimes not orthogonal, i.e. not sufficiently uniform.
    • Multiple runs may be needed to obtain a satisfactory sample distribution.
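
A minimal sketch of Latin hypercube sampling, assuming SciPy >= 1.7 is available for its scipy.stats.qmc module (this is not the notebook's own sampler):

In [ ]:
import numpy as np
from scipy.stats import qmc   # assumes SciPy >= 1.7

# N = 10 points in d = 2 dimensions: exactly one point per interval in each dimension.
sampler = qmc.LatinHypercube(d=2, seed=1)
X = sampler.random(n=10)      # shape (10, 2), values in [0, 1)

# Check the "rook" property: every one of the 10 intervals is hit exactly once
# in each dimension.
print(np.sort((X[:, 0] * 10).astype(int)))   # [0 1 2 3 4 5 6 7 8 9]
print(np.sort((X[:, 1] * 10).astype(int)))   # [0 1 2 3 4 5 6 7 8 9]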

Thinking about Sampling¶

What we want:

  • Uniformly cover the parameter space
  • Cost that does not explode with dimensionality
  • Randomness to avoid systematic bias
  • Easy to generate (hopefully)

Low-Discrepancy Sequence¶

A low-discrepancy sequence (LDS) generates quasirandom numbers.

Four levels of randomness¶

  • Truly random (e.g. a quantum random number generator)
    • Unpredictable
    • Uncorrelated (independent)
    • Uniform (unbiased)
  • Pseudorandom (e.g. a conventional random number generator)
    • Looks random, but is predictable (reproducible from a seed)
    • Uncorrelated (independent)
    • Uniform (unbiased)
  • Quasirandom
    • Does not look random, but is effectively random.
    • Uniform (unbiased)

  • Nonrandom (e.g. full-factorial design)

In [3]:
np.random.seed(1)  # Pseudorandom
np.random.rand()   # Always 0.417022004702574
Out[3]:
0.417022004702574

Discrepancy as Uniformity¶

  • Definition for an interval $[a,b)^2$ in the unit square $[0,1)^2$
    • Difference between the fraction of samples inside the interval and the fraction of area it covers.
    • A larger area should contain proportionally more samples. $$ D(a,b) = \left| \frac{\mathbf{Count}[a,b)^2}{\mathbf{Count}[0,1)^2} - \frac{\mathbf{Area}[a,b)^2}{\mathbf{Area}[0,1)^2} \right| $$
  • This provides an index for the uniformity of the samples (over a unit space) $$ D=\max_{a,b} D(a,b) $$
  • Simplified index, the star discrepancy (see the sketch after this list) $$ D^*=\max_b D(0,b) $$
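
In one dimension the star discrepancy has a simple closed form for sorted points, which makes the definition concrete. The sketch below (not the notebook's helper) implements that formula and, for $d>1$, falls back on SciPy's L2-star discrepancy, assuming SciPy >= 1.7.

In [ ]:
import numpy as np
from scipy.stats import qmc   # assumes SciPy >= 1.7

def star_discrepancy_1d(x):
    """Exact 1D star discrepancy: D* = max_i max(i/N - x_(i), x_(i) - (i-1)/N)."""
    x = np.sort(np.asarray(x))
    N = len(x)
    i = np.arange(1, N + 1)
    return max(np.max(i / N - x), np.max(x - (i - 1) / N))

print(star_discrepancy_1d(np.random.rand(100)))           # random points: typically ~0.1
print(star_discrepancy_1d((np.arange(100) + 0.5) / 100))  # centered grid: 1/(2N) = 0.005

# For d > 1, SciPy provides L2-norm variants of the discrepancy:
X = np.random.rand(100, 2)
print(qmc.discrepancy(X, method='L2-star'))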

One-dimension example¶

  • Van der Corput sequence

    0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.01, 0.11, 0.21, 0.31, 0.41, 0.51, 0.61, 0.71, 0.81, 0.91, 0.02, 0.12, 0.22, 0.32, ...

  • Works for other number bases, too (see the sketch after this list). $$ n=\sum_{k=0}^{L-1} d_k(n)\,b^k \quad\Rightarrow\quad g_b(n)=\sum_{k=0}^{L-1} d_k(n)\,b^{-k-1} $$
  • Extends to $d$ dimensions
    • Halton: The $n$th sample: $(g_{b_1}(n),\cdots,g_{b_d}(n))$
    • Hammersley: The $n$th sample: $(g_{b_1}(n),\cdots,g_{b_{d-1}}(n), n/N)$
    • where $b_1,\cdots,b_d$ are pairwise co-prime (typically the first $d$ primes)
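
A minimal sketch (not the notebook's implementation) of the digit-reversal map $g_b$ and the resulting Halton points; SciPy's scipy.stats.qmc.Halton provides the same construction as a library routine.

In [ ]:
import numpy as np

def van_der_corput(n, base=2):
    """g_b(n): reverse the base-b digits of n across the radix point."""
    g, denom = 0.0, 1.0
    while n > 0:
        n, d = divmod(n, base)   # peel off the least significant digit...
        denom *= base
        g += d / denom           # ...and push it behind the radix point
    return g

# Base-10 example from above: n = 1, ..., 12 -> 0.1, 0.2, ..., 0.9, 0.01, 0.11, 0.21
print([van_der_corput(n, 10) for n in range(1, 13)])

def halton(N, bases=(2, 3)):
    """First N Halton points, using pairwise co-prime bases (one per dimension)."""
    return np.array([[van_der_corput(n, b) for b in bases] for n in range(1, N + 1)])

print(halton(5))   # 5 quasirandom points in the unit square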

Comparison of Sampling Methods¶

Over a unit square:

  • Full factorial
  • Uniformly random
  • Latin hypercube
  • LDS: Halton
  • LDS: Hammersley
  • LDS: Sobol
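
The samplers Grid, URand, LHS, Halton, Hmsl, and Sobol, and the helpers runDisc/pltDisc, are defined in earlier cells of the notebook (not shown); the printed Mean/Std appear to be statistics of a discrepancy measure over the repeated runs. As a rough, self-contained stand-in (not the notebook's own code), the sketch below compares a subset of the methods using SciPy's scrambled samplers and L2-star discrepancy, assuming SciPy >= 1.7; it will not reproduce the exact figures printed below.

In [ ]:
import numpy as np
from scipy.stats import qmc   # assumes SciPy >= 1.7

N, d, reps = 2500, 2, 10
samplers = {
    'URand':  lambda rng: rng.random((N, d)),
    'LHS':    lambda rng: qmc.LatinHypercube(d=d, seed=rng).random(N),
    'Halton': lambda rng: qmc.Halton(d=d, seed=rng).random(N),
    'Sobol':  lambda rng: qmc.Sobol(d=d, seed=rng).random(N),  # warns: 2500 is not a power of 2
}
for name, draw in samplers.items():
    D = [qmc.discrepancy(draw(np.random.default_rng(r)), method='L2-star')
         for r in range(reps)]
    print(f'{name:8s} mean {np.mean(D):.3e}  std {np.std(D):.3e}')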
In [6]:
s = runDisc(Grid, 2500, 21, 1)  # Deterministic, so running only once
pltDisc(s, 21)
Mean: 1.890e-02, Std: 0.000e+00
In [7]:
s = runDisc(URand, 2500, 21, 10)
pltDisc(s, 21)
Mean: 1.894e-02, Std: 5.344e-03
In [8]:
s = runDisc(LHS, 2500, 21, 10)
pltDisc(s, 21)
Mean: 9.690e-03, Std: 1.947e-03
In [9]:
s = runDisc(Halton, 2500, 21, 1)  # Deterministic, so running only once
pltDisc(s, 21)
Mean: 2.000e-03, Std: 0.000e+00
In [10]:
s = runDisc(Hmsl, 2500, 21, 1)  # Deterministic, so running only once
pltDisc(s, 21)
Mean: 1.200e-03, Std: 0.000e+00
In [11]:
s = runDisc(lambda N: Sobol(N,sk=10), 2500, 21, 1)  # Deterministic, so running only once
pltDisc(s, 21)
Mean: 1.200e-03, Std: 0.000e+00

Additional Benefits using LDS¶

Leads to quasi Monte Carlo:

  • Monte Carlo using LDS samples
  • Error in a $d$-dimensional space $$ \epsilon \sim \frac{(\ln{N})^d}{N} $$
  • In low dimensions, this beats the rate of regular Monte Carlo (compared numerically below) $$ \epsilon \sim \frac{1}{\sqrt{N}} = \frac{\sqrt{N}}{N} $$
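
As an illustration (not from the original notebook) of this error behavior, the sketch below integrates a toy function with a known answer using pseudorandom points versus a Sobol sequence, assuming SciPy >= 1.7:

In [ ]:
import numpy as np
from scipy.stats import qmc   # assumes SciPy >= 1.7

# Toy integrand on the unit square with a known integral:
# f(x, y) = x * y, so the exact value is 1/4.
f = lambda X: X[:, 0] * X[:, 1]
exact = 0.25

m = 12                                               # 2**12 = 4096 points
rng = np.random.default_rng(0)
X_mc  = rng.random((2**m, 2))                        # plain Monte Carlo
X_qmc = qmc.Sobol(d=2, seed=0).random_base2(m=m)     # quasi Monte Carlo (Sobol)

print('MC  error:', abs(f(X_mc).mean()  - exact))    # ~ 1/sqrt(N)
print('QMC error:', abs(f(X_qmc).mean() - exact))    # ~ (ln N)^d / N, typically much smaller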
In [13]:
# N (max sample count) and R (per-method error histories) come from an earlier cell, not shown here
lbls = ['URand', 'LHS', 'Halton', 'Sobol']
stys = ['r-', 'b-', 'g-', 'm-']
Ns = np.arange(11.0,N+1.0)
f = plt.figure(figsize=(6,4))
for _i in range(4):
    plt.loglog(Ns, R[_i], stys[_i], label=lbls[_i])
plt.loglog(Ns, 1/np.sqrt(Ns), 'k--', linewidth=2, label='N^1/2/N')
plt.loglog(Ns, 0.01*np.log(Ns)**2/Ns, 'r--', linewidth=2, label='logN^2/N')
plt.ylabel('Error')
plt.xlabel('No. of samples')
_=plt.legend()