Optimisation

Convexity

Convex Set

Note

  • We consider a set \(A\subseteq\mathcal{X}\) where \(\mathcal{X}\) is a vector space.

  • If we connect any two points, \(\mathbf{x},\mathbf{y}\in A\), by a line segment, then all the points in that line should also be in the set \(A\).

  • More formally, for any \(0\leq\alpha\leq 1\),

    \[\alpha\mathbf{x}+(1-\alpha)\mathbf{y}\in A\]

Convex Function

Note

  • We consider of a function \(f:\mathcal{X}\mapsto\mathcal{Y}\).

  • If \(A\subseteq\mathcal{X}\) is a convex subset, then its image, \(f(A)\) also forms a convex subset in \(\mathcal{Y}\).

  • Formally, for any \(0\leq\alpha\leq 1\),

    \[f(\alpha\mathbf{x}+(1-\alpha)\mathbf{y})=\alpha f(\mathbf{x})+(1-\alpha)f(\mathbf{y})\]

Convex Optimisation

Note

  • The functions we’re dealing with are of the form \(f(\mathbf{x}):\mathbb{R}^d\mapsto\mathbb{R}\)

  • Taylor expansion of the function around a fixed point \(\mathbf{x}_0\in\mathbb{R}^d\) if the function is infinitely differentiable

    \[f(\mathbf{x})=f(\mathbf{x}_0)+\langle\nabla_f(\mathbf{x}_0), (\mathbf{x}-\mathbf{x}_0)\rangle+\frac{1}{2!}(\mathbf{x}-\mathbf{x}_0)^T\nabla^2_f(\mathbf{x}_0)(\mathbf{x}-\mathbf{x}_0)+\cdots\]
  • We use the notation \(\mathbf{g}_0:=\nabla_f(\mathbf{x}_0)\in\mathbb{R}^d\) for the gradient vector evaluated at \(\mathbf{x}_0\) and \(\mathbf{H}_0:=\nabla^2_f(\mathbf{x}_0)\in:\mathbb{R}^{d\times d}\) for the Hessian matrix evaluated at \(\mathbf{x}_0\).

  • The Taylor expansion is then written as

    \[f(\mathbf{x})=f(\mathbf{x}_0)+\mathbf{g}_0^T(\mathbf{x}-\mathbf{x}_0)+\frac{1}{2!}(\mathbf{x}-\mathbf{x}_0)^T\mathbf{H}_0(\mathbf{x}-\mathbf{x}_0)+\cdots\]

Unconstrained Optimisation

Note

  • We’re interested in finding the minimum \(\min_{\mathbf{x}\in\mathbb{R}^d}f(\mathbf{x})\).

First-order Methods

Note

  • First-order methods use the first-order Taylor’s approximation

  • It is assumed that the function \(f(\mathbf{x})\) behaves locally linear around a point \(\mathbf{x}\in\mathbb{R}^d\) and therefore can be approximated by

    \[f(\mathbf{x})\approx f(\mathbf{x}_0)+\mathbf{g}_0^T(\mathbf{x}-\mathbf{x}_0)\]

Gradient Descent

Note

  • [TODO: proof] The direction of the gradient gives the direction of steepest ascent (i.e. the opposite of gradient provides the direction of steepest descent).

  • We start with an abritrary starting point \(\mathbf{x}_t=\mathbf{x}_0\in\mathbb{R}^d\).

  • We take a small step along the direction of the gradient as

    \[\mathbf{x}_{t+1}=\mathbf{x}_t-\eta\mathbf{g}_t\]
  • Here, the scalar \(\eta>0\) is a parameter that determines the stepsize.

Code example for Linear Regression
import numpy as np
import matplotlib.pyplot as plot
import pandas as pd
import seaborn as seaborn

# create the function as linear with random normal noise
def define_function(d):
  return np.random.randn(d)

def create_dataset(w, noise_sigma, N=1000):
  d = w.shape[0]
  X = [np.random.rand(d).tolist() for i in np.arange(N)] # N rows and d columns
  return pd.DataFrame([(*x, w.dot(x) + np.random.randn() * noise_sigma) for x in X])

def compute_loss(X, y, wt):
  return np.linalg.norm(y-X*wt)

def compute_gradient(X, y, wt):
  return -2*X.T*(y-X*wt)

def gradient_descent(X, y, lr=0.0001, eps=1e-5, max_iter=100):
  wt = np.matrix(np.random.randn(X.shape[1],1))
  loss = compute_loss(X, y, wt)
  i = 0
  print(f'iter={i}')
  print(f'wt={wt}')
  print(f'loss={loss}')
  loss_values = []

  while loss > eps and i < max_iter:
    print(f'iter={i}')
    g = compute_gradient(X, y, wt)
    wt = wt - lr*g
    loss = compute_loss(X, y, wt)
    i = i+1
    print(f'wt={wt}')
    print(f'loss={loss}')
    loss_values.append([loss])

  return wt, loss_values

w = define_function(2)
df = create_dataset(w, noise_sigma=0.01, N=1000)
X = np.asarray(df.iloc[:,:2])
y = np.asarray(df.iloc[:,2])

# direct estimator from least square
w_hat = (np.linalg.inv(X.T * X)) * X.T * y

X = np.asmatrix(X)
y = np.asmatrix(y).T
w_gd, loss_values = gradient_descent(X, y, lr=0.001, eps=1e-5, max_iter=50)

plot.plot(np.arange(len(loss_values)), loss_values)
plot.show()

Accelerated Gradient Descent

Second-order Methods

Newton’s Method

Note

  • Originally developed for finding roots of equations \(f(x)=0\).

  • We start with an abritrary starting point \(x_t=x_0\in\mathbb{R}\).

  • We compute the gradient and obtain the point where the tangent line of \(f\) at \(x_t\) equals 0.

  • We use this point as the next iteration.

    \[0=f(x_t)+g(x_t)(x_{t+1}-x_t)\implies x_{t+1}=x_t-\frac{f(x_t)}{g(x_t)}\]
  • This can be used for minimizing a function \(f\) as well by finding roots of \(\nabla_f(x)=0\).

  • For a function \(f:\mathbb{R}^d\mapsto\mathbb{R}\), the iteration rule becomes

    \[\mathbf{0}=\mathbf{g}_t+\mathbf{H}_t(\mathbf{x}_{t+1}-\mathbf{x}_t)\implies \mathbf{x}_{t+1}=\mathbf{x}_t-\mathbf{H}_t^{-1}\mathbf{g}_t\]
  • It approximates the functional locally (around \(\mathbf{x}_t\)) by a quadratic function.

Tip

  • Here the learning rate is not required. The rate is implied automatically by the geometric behaviour of \(\mathbf{H}_t\) at every \(\mathbf{x}_t\).

  • If \(\mathbf{H}_t\) is symmetric positive definite, the inverse always exists and we can investigate the eigenvalues to find out the step-size across each dimension

    \[\mathbf{H}_t=\mathbf{Q}^T\boldsymbol{\Lambda}\mathbf{Q}\implies \mathbf{x}_{t+1}=\mathbf{x}_t-\mathbf{Q}^T\boldsymbol{\Lambda}^{-1}\mathbf{Q}\mathbf{g}_t\]
  • If the original function is quadratic, this method finds the minima in 1 step (TODO: prove)

Code example for Linear Regression

Note

  • For Linear Regression, since the function is quadratic in its parameter, Newton’s method finds the minima in exactly 1 step.

  • TODO: prove why?

def compute_loss(X, y, wt):
  return np.linalg.norm(y-X*wt)

def compute_gradient(X, y, wt):
  return -2*X.T*(y-X*wt)

def compute_hessian(X):
  return 2*X.T*X

def newton_method(X, y, eps=1e-5, max_iter=5):
  wt = np.matrix(np.random.randn(X.shape[1],1))
  loss = compute_loss(X, y, wt)
  i = 0
  print(f'iter={i}')
  print(f'wt={wt}')
  print(f'loss={loss}')
  loss_values = []

  while loss > eps and i < max_iter:
    print(f'iter={i}')
    g = compute_gradient(X, y, wt)
    H = compute_hessian(X)
    wt = wt - np.linalg.inv(H)*g
    loss = compute_loss(X, y, wt)
    i = i+1
    print(f'wt={wt}')
    print(f'loss={loss}')
    loss_values.append([loss])

  return wt, loss_values

w_newt, loss_values_newt = newton_method(X, y, eps=1e-5, max_iter=2)
plot.plot(np.arange(len(loss_values_newt)), loss_values_newt)
plot.show()

Constrained Optimisation

Note

  • We’re interested in finding the minimum \(\min_{\mathbf{x}\in S\subseteq \mathbb{R}^d}f(\mathbf{x})\).