Optimisation¶
Convexity¶
Convex Set¶
Note
We consider a set \(A\subseteq\mathcal{X}\) where \(\mathcal{X}\) is a vector space.
If we connect any two points, \(\mathbf{x},\mathbf{y}\in A\), by a line segment, then all the points in that line should also be in the set \(A\).
More formally, for any \(0\leq\alpha\leq 1\),
\[\alpha\mathbf{x}+(1-\alpha)\mathbf{y}\in A\]
Convex Function¶
Note
We consider of a function \(f:\mathcal{X}\mapsto\mathcal{Y}\).
If \(A\subseteq\mathcal{X}\) is a convex subset, then its image, \(f(A)\) also forms a convex subset in \(\mathcal{Y}\).
Formally, for any \(0\leq\alpha\leq 1\),
\[f(\alpha\mathbf{x}+(1-\alpha)\mathbf{y})=\alpha f(\mathbf{x})+(1-\alpha)f(\mathbf{y})\]
Convex Optimisation¶
Note
The functions we’re dealing with are of the form \(f(\mathbf{x}):\mathbb{R}^d\mapsto\mathbb{R}\)
Taylor expansion of the function around a fixed point \(\mathbf{x}_0\in\mathbb{R}^d\) if the function is infinitely differentiable
\[f(\mathbf{x})=f(\mathbf{x}_0)+\langle\nabla_f(\mathbf{x}_0), (\mathbf{x}-\mathbf{x}_0)\rangle+\frac{1}{2!}(\mathbf{x}-\mathbf{x}_0)^T\nabla^2_f(\mathbf{x}_0)(\mathbf{x}-\mathbf{x}_0)+\cdots\]We use the notation \(\mathbf{g}_0:=\nabla_f(\mathbf{x}_0)\in\mathbb{R}^d\) for the gradient vector evaluated at \(\mathbf{x}_0\) and \(\mathbf{H}_0:=\nabla^2_f(\mathbf{x}_0)\in:\mathbb{R}^{d\times d}\) for the Hessian matrix evaluated at \(\mathbf{x}_0\).
The Taylor expansion is then written as
\[f(\mathbf{x})=f(\mathbf{x}_0)+\mathbf{g}_0^T(\mathbf{x}-\mathbf{x}_0)+\frac{1}{2!}(\mathbf{x}-\mathbf{x}_0)^T\mathbf{H}_0(\mathbf{x}-\mathbf{x}_0)+\cdots\]
Unconstrained Optimisation¶
Note
We’re interested in finding the minimum \(\min_{\mathbf{x}\in\mathbb{R}^d}f(\mathbf{x})\).
First-order Methods¶
Note
First-order methods use the first-order Taylor’s approximation
It is assumed that the function \(f(\mathbf{x})\) behaves locally linear around a point \(\mathbf{x}\in\mathbb{R}^d\) and therefore can be approximated by
\[f(\mathbf{x})\approx f(\mathbf{x}_0)+\mathbf{g}_0^T(\mathbf{x}-\mathbf{x}_0)\]
Gradient Descent¶
Note
[TODO: proof] The direction of the gradient gives the direction of steepest ascent (i.e. the opposite of gradient provides the direction of steepest descent).
We start with an abritrary starting point \(\mathbf{x}_t=\mathbf{x}_0\in\mathbb{R}^d\).
We take a small step along the direction of the gradient as
\[\mathbf{x}_{t+1}=\mathbf{x}_t-\eta\mathbf{g}_t\]Here, the scalar \(\eta>0\) is a parameter that determines the stepsize.
Code example for Linear Regression¶
import numpy as np
import matplotlib.pyplot as plot
import pandas as pd
import seaborn as seaborn
# create the function as linear with random normal noise
def define_function(d):
return np.random.randn(d)
def create_dataset(w, noise_sigma, N=1000):
d = w.shape[0]
X = [np.random.rand(d).tolist() for i in np.arange(N)] # N rows and d columns
return pd.DataFrame([(*x, w.dot(x) + np.random.randn() * noise_sigma) for x in X])
def compute_loss(X, y, wt):
return np.linalg.norm(y-X*wt)
def compute_gradient(X, y, wt):
return -2*X.T*(y-X*wt)
def gradient_descent(X, y, lr=0.0001, eps=1e-5, max_iter=100):
wt = np.matrix(np.random.randn(X.shape[1],1))
loss = compute_loss(X, y, wt)
i = 0
print(f'iter={i}')
print(f'wt={wt}')
print(f'loss={loss}')
loss_values = []
while loss > eps and i < max_iter:
print(f'iter={i}')
g = compute_gradient(X, y, wt)
wt = wt - lr*g
loss = compute_loss(X, y, wt)
i = i+1
print(f'wt={wt}')
print(f'loss={loss}')
loss_values.append([loss])
return wt, loss_values
w = define_function(2)
df = create_dataset(w, noise_sigma=0.01, N=1000)
X = np.asarray(df.iloc[:,:2])
y = np.asarray(df.iloc[:,2])
# direct estimator from least square
w_hat = (np.linalg.inv(X.T * X)) * X.T * y
X = np.asmatrix(X)
y = np.asmatrix(y).T
w_gd, loss_values = gradient_descent(X, y, lr=0.001, eps=1e-5, max_iter=50)
plot.plot(np.arange(len(loss_values)), loss_values)
plot.show()
Accelerated Gradient Descent¶
Second-order Methods¶
Newton’s Method¶
Note
Originally developed for finding roots of equations \(f(x)=0\).
We start with an abritrary starting point \(x_t=x_0\in\mathbb{R}\).
We compute the gradient and obtain the point where the tangent line of \(f\) at \(x_t\) equals 0.
We use this point as the next iteration.
\[0=f(x_t)+g(x_t)(x_{t+1}-x_t)\implies x_{t+1}=x_t-\frac{f(x_t)}{g(x_t)}\]This can be used for minimizing a function \(f\) as well by finding roots of \(\nabla_f(x)=0\).
For a function \(f:\mathbb{R}^d\mapsto\mathbb{R}\), the iteration rule becomes
\[\mathbf{0}=\mathbf{g}_t+\mathbf{H}_t(\mathbf{x}_{t+1}-\mathbf{x}_t)\implies \mathbf{x}_{t+1}=\mathbf{x}_t-\mathbf{H}_t^{-1}\mathbf{g}_t\]It approximates the functional locally (around \(\mathbf{x}_t\)) by a quadratic function.
Tip
Here the learning rate is not required. The rate is implied automatically by the geometric behaviour of \(\mathbf{H}_t\) at every \(\mathbf{x}_t\).
If \(\mathbf{H}_t\) is symmetric positive definite, the inverse always exists and we can investigate the eigenvalues to find out the step-size across each dimension
\[\mathbf{H}_t=\mathbf{Q}^T\boldsymbol{\Lambda}\mathbf{Q}\implies \mathbf{x}_{t+1}=\mathbf{x}_t-\mathbf{Q}^T\boldsymbol{\Lambda}^{-1}\mathbf{Q}\mathbf{g}_t\]If the original function is quadratic, this method finds the minima in 1 step (TODO: prove)
Code example for Linear Regression¶
Note
For Linear Regression, since the function is quadratic in its parameter, Newton’s method finds the minima in exactly 1 step.
TODO: prove why?
def compute_loss(X, y, wt):
return np.linalg.norm(y-X*wt)
def compute_gradient(X, y, wt):
return -2*X.T*(y-X*wt)
def compute_hessian(X):
return 2*X.T*X
def newton_method(X, y, eps=1e-5, max_iter=5):
wt = np.matrix(np.random.randn(X.shape[1],1))
loss = compute_loss(X, y, wt)
i = 0
print(f'iter={i}')
print(f'wt={wt}')
print(f'loss={loss}')
loss_values = []
while loss > eps and i < max_iter:
print(f'iter={i}')
g = compute_gradient(X, y, wt)
H = compute_hessian(X)
wt = wt - np.linalg.inv(H)*g
loss = compute_loss(X, y, wt)
i = i+1
print(f'wt={wt}')
print(f'loss={loss}')
loss_values.append([loss])
return wt, loss_values
w_newt, loss_values_newt = newton_method(X, y, eps=1e-5, max_iter=2)
plot.plot(np.arange(len(loss_values_newt)), loss_values_newt)
plot.show()
Constrained Optimisation¶
Note
We’re interested in finding the minimum \(\min_{\mathbf{x}\in S\subseteq \mathbb{R}^d}f(\mathbf{x})\).