Learning Objectives

By the end of this lecture, you will be able to:

  - Explain why the negative gradient is a descent direction and derive the basic gradient descent update.
  - Implement gradient descent, momentum, AdaGrad, and Adam in Julia and compare their behavior on the Rosenbrock function.
  - Describe second-order methods (Newton, BFGS, L-BFGS) and explain when their extra cost is justified.
  - Use line search techniques such as golden section search to choose step sizes.
  - Select an appropriate optimizer and diagnose common convergence problems.

Why do we need optimization?

Optimization is at the heart of machine learning, engineering design, and scientific computing. Whether we’re training neural networks, fitting models to data, or designing optimal structures, we need efficient algorithms to find the best solutions.

Gradient-based optimization algorithms leverage the gradient (first-order derivative) of the objective function to iteratively find optimal solutions. These methods are particularly powerful when the objective function is differentiable and can achieve fast convergence rates compared to gradient-free alternatives.

Understanding the Gradient

Intuition First! 🤔

Imagine you’re hiking on a mountain in dense fog. You can’t see far ahead, but you can feel the slope under your feet. The gradient tells you:

  - which direction is uphill the fastest (the direction of steepest ascent), and
  - how steep the terrain is at your current location.

Key Properties:

  - The gradient ∇f(x) points in the direction of steepest increase of f.
  - The negative gradient −∇f(x) points in the direction of steepest decrease.
  - The norm ‖∇f(x)‖ measures how rapidly f changes in that direction.
  - At stationary points (minima, maxima, saddle points), the gradient is zero.

🎯 Interactive Question: What happens at a local minimum? Think about it before reading on…

Answer: At a local minimum x*, we have ∇f(x*) = 0 (the gradient vanishes)!

Gradient Descent: The Foundation

The Big Idea 💡

If we want to minimize a function f(x), we should move in the direction that decreases f most rapidly. That direction is −∇f(x)!

Mathematical Foundation

Gradient descent is based on the first-order Taylor expansion, f(x + Δ) ≈ f(x) + ∇f(x)ᵀΔ. Moving x slightly in the negative gradient direction, Δ = −α ∇f(x) with a small step size α > 0, gives

f(x − α ∇f(x)) ≈ f(x) − α ‖∇f(x)‖²

Since ‖∇f(x)‖² ≥ 0, we have f(x − α ∇f(x)) ≤ f(x) for small α.
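
To make this concrete, here is a minimal numeric check that a small step along the negative gradient decreases the function value. This is only a sketch: the quadratic test function and the step size are chosen arbitrarily for illustration.

using ForwardDiff

# A simple smooth test function (chosen only for illustration)
f(x) = x[1]^2 + 3 * x[2]^2

x = [1.0, 2.0]
g = ForwardDiff.gradient(f, x)
α = 0.01

# f should decrease after a small step in the -∇f direction
println("f(x)          = ", f(x))
println("f(x - α∇f(x)) = ", f(x - α * g))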

The Algorithm

The gradient descent update rule is beautifully simple:

x_{k+1} = x_k − α ∇f(x_k)

where:

  - x_k is the current iterate,
  - α > 0 is the learning rate (step size),
  - ∇f(x_k) is the gradient at the current point.

🎯 Think About It: What happens if α is too large? Too small?

using ForwardDiff

function gradient_descent(f, x; niters::Int, learning_rate::Real)
    history = [x]
    for i = 1:niters
        g = ForwardDiff.gradient(f, x)   # gradient at the current point
        x -= learning_rate * g           # step in the negative gradient direction
        push!(history, x)
    end
    return history
end

🧪 Hands-on Example: The Rosenbrock Function

The Rosenbrock function is the “fruit fly” of optimization research:

f(x₁, x₂) = (1 − x₁)² + 100 (x₂ − x₁²)²

Why is this function special?

  - Its global minimum is at (1, 1), where f = 0.
  - The minimum lies inside a long, narrow, curved valley, so gradients point mostly across the valley rather than along it.
  - It is a standard stress test: easy to write down, yet slow for plain gradient descent.

💻 Live Coding Exercise

Let’s implement and test gradient descent together:

# Step 1: Define the Rosenbrock function
function rosenbrock(x)
    (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
end

# Step 2: Compute the gradient (we'll use ForwardDiff)
using ForwardDiff
∇f = x -> ForwardDiff.gradient(rosenbrock, x)

# Step 3: Test the gradient at the point (0, 0)
println("Gradient at (0,0): ", ∇f([0.0, 0.0]))

# Step 4: Run gradient descent
x0 = [-1.0, -1.0]  # starting point
history = gradient_descent(rosenbrock, x0; niters=10000, learning_rate=0.002)

println("Final point: ", history[end])
println("Final value: ", rosenbrock(history[end]))

🎯 Experiment Time: Try different learning rates: 0.001, 0.01, 0.1. What happens?
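
One way to run this experiment, as a rough sketch reusing the rosenbrock and gradient_descent definitions above (the iteration count is just illustrative):

# Sweep the suggested learning rates and compare the final function values.
# With too large a rate the iterates can diverge (values blow up to Inf/NaN);
# with too small a rate progress after a fixed budget is poor.
for lr in (0.001, 0.01, 0.1)
    hist = gradient_descent(rosenbrock, [-1.0, -1.0]; niters=5000, learning_rate=lr)
    println("learning_rate = ", lr, " → final f = ", rosenbrock(hist[end]))
end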

⚠️ Common Problems with Basic Gradient Descent

🚫 What Can Go Wrong?

  1. Slow convergence in narrow valleys (like Rosenbrock!)
  2. Oscillatory behavior when learning rate is too large
  3. Getting trapped in plateaus with tiny gradients
  4. Very slow progress near saddle points, where the gradient is close to zero

🎯 Debugging Exercise: If your gradient descent is oscillating, what should you do?

Answer: Decrease the learning rate!

Advanced Method 1: Gradient Descent with Momentum

🏃‍♂️ The Momentum Analogy

Imagine a ball rolling down a hill:

  - it picks up speed when the slope keeps pointing the same way, and
  - its inertia carries it through small bumps and flat patches.

Momentum helps the optimizer “remember” where it was going and keep moving in consistent directions.

The Mathematics

The momentum method adds a “velocity” term that accumulates gradients over time:

v_{k+1} = β v_k − α ∇f(x_k)
x_{k+1} = x_k + v_{k+1}

where:

  - v_k is the velocity (an exponentially decaying sum of past gradients),
  - β ∈ [0, 1) is the momentum coefficient,
  - α > 0 is the learning rate.

💻 Implementation

function gradient_descent_momentum(f, x; niters::Int, β::Real, learning_rate::Real)
    history = [x]
    v = zero(x)  # initialize velocity to zero

    for i = 1:niters
        g = ForwardDiff.gradient(f, x)
        v = β .* v .- learning_rate .* g  # update velocity
        x += v                            # move by the velocity
        push!(history, x)
    end
    return history
end
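
A quick usage sketch on the Rosenbrock function defined earlier; the hyperparameter values are illustrative rather than tuned (momentum amplifies the effective step by roughly 1/(1 − β), so a smaller learning rate is used here):

x0 = [-1.0, -1.0]
hist_momentum = gradient_descent_momentum(rosenbrock, x0;
                                          niters=10000, β=0.9, learning_rate=0.0002)
println("Momentum final point: ", hist_momentum[end])
println("Momentum final value: ", rosenbrock(hist_momentum[end]))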

⚖️ Trade-offs

Advantages:

  - Faster progress along narrow valleys, where plain gradient descent zig-zags.
  - Damps oscillations by averaging gradients over time.

Disadvantages:

  - One more hyperparameter (β) to tune.
  - Can overshoot the minimum if the momentum is too large.

🎯 Quick Quiz: What happens if β = 0? What if β is close to 1?

Advanced Method 2: AdaGrad (Adaptive Gradient)

🎯 The Core Insight

Different parameters might need different learning rates!

AdaGrad automatically adapts the learning rate for each parameter.

The Algorithm

AdaGrad keeps track of the sum of squared gradients and uses it to scale the learning rate:

r_t = r_{t−1} + g_t ⊙ g_t
x_{t+1} = x_t − (α / √(r_t + ϵ)) ⊙ g_t

where g_t = ∇f(x_t) and the operations are element-wise.

Key components:

  - r_t: the running sum of squared gradients for each parameter,
  - α: the global learning rate,
  - ϵ: a small constant (e.g. 1e-8) that prevents division by zero.

function adagrad_optimize(f, x; niters, learning_rate, ϵ=1e-8)
    rt = zero(x)   # running sum of squared gradients
    η = zero(x)    # per-parameter learning rate
    history = [x]
    for step in 1:niters
        Δ = ForwardDiff.gradient(f, x)
        @. rt = rt + Δ^2
        @. η = learning_rate / sqrt(rt + ϵ)
        x = x .- Δ .* η
        push!(history, x)
    end
    return history
end

AdaGrad’s main advantage is that it automatically adjusts the learning rate and is less sensitive to the choice of initial learning rate. However, the accumulated squared gradients grow monotonically, causing the effective learning rate to shrink over time, potentially leading to premature convergence.
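
As with the previous methods, here is a short usage sketch on the Rosenbrock function; the learning rate is just a plausible starting value, not a recommendation:

hist_adagrad = adagrad_optimize(rosenbrock, [-1.0, -1.0];
                                niters=10000, learning_rate=0.1)
println("AdaGrad final point: ", hist_adagrad[end])
println("AdaGrad final value: ", rosenbrock(hist_adagrad[end]))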

Advanced Method 3: Adam (The Champion!) 🏆

🌟 Why Adam is Popular

Adam = Momentum + AdaGrad + Bias Correction

It’s like having the best of both worlds:

  - a momentum-style moving average of the gradients (the first moment), and
  - an AdaGrad-style moving average of the squared gradients (the second moment), plus a bias correction for the early iterations.

The Complete Algorithm

Adam maintains two exponential moving averages:

m_t = β₁ m_{t−1} + (1 − β₁) g_t
v_t = β₂ v_{t−1} + (1 − β₂) g_t ⊙ g_t

Then applies bias correction (important for early iterations):

m̂_t = m_t / (1 − β₁^t)
v̂_t = v_t / (1 − β₂^t)

Finally updates parameters:

x_{t+1} = x_t − α m̂_t / (√v̂_t + ϵ)

Default hyperparameters (work well in practice):

  - α = 0.001, β₁ = 0.9, β₂ = 0.999, ϵ = 1e-8

function adam_optimize(f, x; niters, learning_rate, β1=0.9, β2=0.999, ϵ=1e-8)
    mt = zero(x)   # first moment (moving average of gradients)
    vt = zero(x)   # second moment (moving average of squared gradients)
    βp1 = β1       # running power β1^t, used for bias correction
    βp2 = β2       # running power β2^t
    history = [x]
    for step in 1:niters
        Δ = ForwardDiff.gradient(f, x)
        @. mt = β1 * mt + (1 - β1) * Δ
        @. vt = β2 * vt + (1 - β2) * Δ^2
        @. Δ = mt / (1 - βp1) / (√(vt / (1 - βp2)) + ϵ) * learning_rate
        βp1, βp2 = βp1 * β1, βp2 * β2
        x = x .- Δ
        push!(history, x)
    end
    return history
end
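
And a usage sketch mirroring the earlier examples (again, the learning rate and iteration count are only illustrative):

hist_adam = adam_optimize(rosenbrock, [-1.0, -1.0];
                          niters=10000, learning_rate=0.01)
println("Adam final point: ", hist_adam[end])
println("Adam final value: ", rosenbrock(hist_adam[end]))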

Adam has become one of the most popular optimization algorithms in machine learning due to:

  1. Fast convergence in practice
  2. Robustness to hyperparameter choices
  3. Good performance across a wide range of problems
  4. Built-in bias correction for the moment estimates

The Optimisers.jl Package

Julia provides a comprehensive collection of optimization algorithms through the Optimisers.jl package. This package includes implementations of various gradient-based optimizers with a unified interface.

using Optimisers, ForwardDiff

# Available optimizers include:
# Descent, Momentum, Nesterov, RMSProp, Adam, AdaGrad, etc.

function optimize_with_optimisers(f, x0, optimizer_type; niters=1000)
    method = optimizer_type(0.01)          # learning rate
    state = Optimisers.setup(method, x0)
    history = [x0]
    x = copy(x0)

    for i = 1:niters
        grad = ForwardDiff.gradient(f, x)
        state, x = Optimisers.update(state, x, grad)
        push!(history, copy(x))
    end

    return history
end

# Example usage:
# history = optimize_with_optimisers(rosenbrock, [-1.0, -1.0], Optimisers.Adam)

The package provides a clean, composable interface where different optimizers can be easily swapped and combined. It also supports advanced features like gradient clipping, weight decay, and learning rate schedules.
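
For example, one plausible way to combine these features is to chain rules together; this sketch assumes the OptimiserChain, ClipGrad, and WeightDecay rules provided by Optimisers.jl, and the particular constants are arbitrary:

using Optimisers

# Compose gradient clipping, weight decay, and Adam into a single rule
rule = OptimiserChain(ClipGrad(1.0), WeightDecay(1e-4), Adam(0.001))

# The composed rule is used exactly like a plain optimizer:
# state = Optimisers.setup(rule, x0)
# state, x = Optimisers.update(state, x, grad)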

📊 Method Comparison & Selection Guide

🎯 The Big Question: Which Optimizer Should I Use?

Short Answer: Start with Adam. It works well for most problems!

Long Answer: It depends on your problem…

Complete Comparison

| Method | Convergence | Memory | Hyperparameters | Pros | Cons |
| --- | --- | --- | --- | --- | --- |
| Gradient Descent | Linear | O(n) | α | Simple, stable | Slow, sensitive to α |
| Momentum | Linear (improved) | O(n) | α, β | Faster, stable | Can overshoot |
| AdaGrad | Good | O(n) | α | Auto-adapts | Learning rate decay |
| Adam | Fast | O(n) | α, β₁, β₂ | Robust, fast | More complex |
| Newton | Quadratic | O(n²) | None | Very fast | Expensive O(n³) |
| BFGS | Superlinear | O(n²) | None | Fast, no Hessian | O(n²) memory |
| L-BFGS | Superlinear | O(mn) | m | Scalable | Approximation errors |

Table 1: Complete comparison of optimization methods (n = problem size, m = memory parameter)

🛠️ Practical Guidelines & Tips

Method Selection Decision Tree

🌳 Choose Your Optimizer

  1. Problem size < 100 variables? → Try Newton or BFGS
  2. Problem size 100-10,000? → Use Adam or L-BFGS
  3. Problem size > 10,000? → Start with Adam
  4. Sparse gradients/NLP? → Try AdaGrad
  5. Not sure? → Use Adam (good default!)

Hyperparameter Tuning Tips

Learning Rate Selection:

  - Try values on a logarithmic grid (e.g. 0.1, 0.01, 0.001, 0.0001).
  - If the loss explodes or oscillates, decrease it; if progress is very slow, increase it.
  - Plot the objective value versus iteration to compare candidates.

Other Parameters:

  - Momentum: β ≈ 0.9 is a common starting point.
  - Adam: the defaults β₁ = 0.9, β₂ = 0.999, ϵ = 1e-8 rarely need tuning.

🚨 Troubleshooting Common Issues

Problem: Loss is exploding / NaN values.
Solution: Reduce the learning rate; check the gradient computation.

Problem: Very slow convergence.
Solution: Increase the learning rate; try Adam instead of plain gradient descent.

Problem: Oscillating around the minimum.
Solution: Reduce the learning rate; add momentum.

Problem: Stuck in a plateau.
Solution: Check the gradient computation; try a different initialization.

Convergence Monitoring

Monitor these metrics during optimization:

  1. Function value f(x_k) — should decrease.

  2. Gradient norm ‖∇f(x_k)‖ — should approach 0.

  3. Parameter change ‖x_{k+1} − x_k‖ — should get smaller.

Stop when: the gradient norm is below a threshold, OR the parameter change is below a threshold, OR the maximum number of iterations is reached.
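
A minimal sketch of such a stopping test, using a hypothetical converged helper with arbitrary illustrative tolerances:

using LinearAlgebra

# Decide whether to stop, given the current gradient and the last parameter update.
function converged(g, Δx; gtol=1e-6, xtol=1e-10)
    return norm(g) < gtol || norm(Δx) < xtol
end

# Inside an optimization loop one would call, e.g.:
# converged(ForwardDiff.gradient(f, x), x_new .- x_old) && break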

Hessian-based Optimization

Newton’s Method

Newton’s method is a second-order optimization algorithm that uses both first and second derivatives of the objective function. It approximates the function locally as a quadratic and finds the minimum of this quadratic approximation.

The Newton method update rule is obtained by solving a linear system with the Hessian:

H_k d_k = −∇f(x_k)
x_{k+1} = x_k + d_k

where:

  - H_k = ∇²f(x_k) is the Hessian matrix at the current point,
  - ∇f(x_k) is the gradient,
  - d_k is the Newton step (the minimizer of the local quadratic model).

function newton_optimizer(f, x; tol=1e-5)
    k = 0
    history = [x]
    while k < 1000
        k += 1
        gk = ForwardDiff.gradient(f, x)
        hk = ForwardDiff.hessian(f, x)
        dx = -hk \ gk   # solve Hk * dx = -gk
        x += dx
        push!(history, x)
        sum(abs2, dx) < tol && break   # stop when the squared step length is small
    end
    return history
end
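
A usage sketch on the Rosenbrock function, with the starting point chosen to match the earlier examples:

hist_newton = newton_optimizer(rosenbrock, [-1.0, -1.0])
println("Newton iterations: ", length(hist_newton) - 1)
println("Newton final point: ", hist_newton[end])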

Advantages of Newton’s Method:

  - Quadratic convergence near the minimum.
  - No learning rate to tune (the step comes from the local quadratic model).

Disadvantages:

  - Computing and storing the Hessian costs O(n²) memory, and solving the linear system costs O(n³) per iteration.
  - Far from the minimum the Hessian may be singular or not positive definite, so the step may not be a descent direction.

The BFGS Algorithm

The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is a quasi-Newton method that approximates the Hessian matrix using only gradient information. It belongs to the class of quasi-Newton methods, which build up curvature information over iterations.

The BFGS algorithm updates an approximation B_k of the Hessian matrix using only gradient information. With the step and gradient-difference vectors

s_k = x_{k+1} − x_k,    y_k = ∇f(x_{k+1}) − ∇f(x_k),

the Hessian approximation is updated using the BFGS formula:

B_{k+1} = B_k − (B_k s_k s_kᵀ B_k) / (s_kᵀ B_k s_k) + (y_k y_kᵀ) / (y_kᵀ s_k)

This update satisfies the secant equation B_{k+1} s_k = y_k.
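
As a sanity check, the rank-two update above is only a few lines of Julia. This is a sketch with a hypothetical helper name, not part of any library:

using LinearAlgebra

# One BFGS update of the Hessian approximation B, given the step s = x₊ - x
# and the gradient difference y = ∇f(x₊) - ∇f(x).
function bfgs_update(B, s, y)
    Bs = B * s
    return B - (Bs * Bs') / dot(s, Bs) + (y * y') / dot(y, s)
end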

using Optim, ForwardDiff

function optimize_bfgs(f, x0; iterations=1000)
    options = Optim.Options(iterations=iterations, store_trace=true, extended_trace=true)
    # Let Optim build the gradient via forward-mode automatic differentiation
    result = optimize(f, x0, BFGS(), options; autodiff=:forward)
    return result
end

# Example usage
result = optimize_bfgs(rosenbrock, [-1.0, -1.0])

Key Features of BFGS:

  - Superlinear convergence without computing second derivatives.
  - Only gradient evaluations are required; curvature is learned from successive gradients.
  - Stores a dense n × n approximation, so memory is O(n²).

Limited-Memory BFGS (L-BFGS)

For large-scale problems, the O(n²) memory requirement of BFGS becomes prohibitive. L-BFGS addresses this by storing only the last m pairs of vectors {(s_k, y_k)} instead of the full Hessian approximation.

L-BFGS uses a two-loop recursion to compute the search direction implicitly, requiring only O(mn) memory, where m is typically 5-20.
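
The two-loop recursion itself is short enough to sketch. This minimal version assumes vectors ss and ys holding the stored pairs (oldest first, most recent last) and the common initial scaling H₀ = γI; the helper name lbfgs_direction is hypothetical:

using LinearAlgebra

# Given the current gradient g and the m most recent pairs s_i = x_{i+1} - x_i,
# y_i = ∇f_{i+1} - ∇f_i, return an approximate Newton direction -H⁻¹g.
function lbfgs_direction(g, ss::Vector, ys::Vector)
    m = length(ss)
    ρ = [1 / dot(ys[i], ss[i]) for i in 1:m]
    α = zeros(m)
    q = copy(g)
    for i in m:-1:1                    # first loop: newest to oldest
        α[i] = ρ[i] * dot(ss[i], q)
        q -= α[i] * ys[i]
    end
    γ = dot(ss[end], ys[end]) / dot(ys[end], ys[end])  # initial Hessian scaling
    r = γ * q
    for i in 1:m                       # second loop: oldest to newest
        β = ρ[i] * dot(ys[i], r)
        r += (α[i] - β) * ss[i]
    end
    return -r                          # search direction
end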

# L-BFGS is available in Optim.jl
result = optimize(rosenbrock, [-1.0, -1.0], LBFGS())

Line Search Methods

Golden Section Search

The golden section search is a technique for finding the minimum of a unimodal function by successively narrowing the range of values. It’s particularly useful as a line search method within other optimization algorithms.

The method uses the golden ratio conjugate τ = (√5 − 1) / 2 ≈ 0.618 to place evaluation points that maintain the same proportional reduction of the search interval at every iteration.

function golden_section_search(f, a, b; tol=1e-5)
    τ = (√5 - 1) / 2   # golden ratio conjugate ≈ 0.618
    x1 = a + (1 - τ) * (b - a)
    x2 = a + τ * (b - a)
    f1, f2 = f(x1), f(x2)
    k = 0   # iteration counter

    while b - a > tol
        k += 1
        if f1 > f2
            # Minimum lies in [x1, b]: discard the left part
            a = x1
            x1 = x2
            f1 = f2
            x2 = a + τ * (b - a)
            f2 = f(x2)
        else
            # Minimum lies in [a, x2]: discard the right part
            b = x2
            x2 = x1
            f2 = f1
            x1 = a + (1 - τ) * (b - a)
            f1 = f(x1)
        end
    end

    return f1 < f2 ? (x1, f1) : (x2, f2)
end

# Example: Find minimum of (x-4)²
x_min, f_min = golden_section_search(x -> (x-4)^2, -5, 5)

Properties of Golden Section Search:

  - The interval shrinks by the constant factor τ ≈ 0.618 per iteration (linear convergence).
  - Only one new function evaluation is needed per iteration; the other point is reused.
  - No derivatives are required, but the function must be unimodal on [a, b].

Applications in Optimization

Line search methods like golden section search are commonly used within gradient-based algorithms to determine optimal step sizes:

  1. Exact Line Search: Find α* = argmin_{α > 0} f(x_k + α d_k), where d_k is the search direction.

  2. Inexact Line Search: Use conditions like the Armijo or Wolfe conditions.
  3. Backtracking: Start with a large step and reduce it until a sufficient decrease is achieved (see the sketch below).

These techniques ensure that each iteration makes substantial progress toward the minimum while maintaining algorithm stability.
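
A minimal sketch of backtracking with the Armijo (sufficient decrease) condition; the helper name and the constants c and ρ are conventional illustrative choices, not from a particular library:

using LinearAlgebra, ForwardDiff

# f: objective, x: current point, d: descent direction (e.g. -∇f(x)).
function backtracking_linesearch(f, x, d; α0=1.0, c=1e-4, ρ=0.5, maxiter=50)
    g = ForwardDiff.gradient(f, x)
    slope = dot(g, d)          # directional derivative along d (should be < 0)
    α = α0
    for _ in 1:maxiter
        if f(x + α * d) <= f(x) + c * α * slope
            return α           # sufficient decrease achieved
        end
        α *= ρ                 # shrink the step
    end
    return α
end

# Example: one Armijo step along the negative gradient of the Rosenbrock function
# x = [-1.0, -1.0]; d = -ForwardDiff.gradient(rosenbrock, x)
# α = backtracking_linesearch(rosenbrock, x, d)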

Summary and Method Selection

| Method | Order | Convergence | Cost per Iter | Memory | Best Use Case |
| --- | --- | --- | --- | --- | --- |
| Gradient Descent | 1st | Linear | O(n) | O(n) | Large-scale, simple |
| Momentum | 1st | Linear+ | O(n) | O(n) | Consistent gradients |
| Adam | 1st | Fast | O(n) | O(n) | Deep learning, general |
| Newton | 2nd | Quadratic | O(n³) | O(n²) | Small-scale, accurate |
| BFGS | Quasi-2nd | Superlinear | O(n²) | O(n²) | Medium-scale |
| L-BFGS | Quasi-2nd | Superlinear | O(mn) | O(mn) | Large-scale |

Table 2: Comprehensive comparison of optimization methods. Here n is the problem dimension and m is the L-BFGS memory parameter.

🎓 Lecture Summary

🌟 Key Takeaways

  1. Gradient descent is the foundation - simple but can be slow
  2. Momentum adds memory - helps with consistent directions
  3. AdaGrad adapts learning rates - good for sparse problems
  4. Adam combines the best ideas - excellent default choice
  5. Newton/BFGS use curvature - fastest for smooth problems
  6. Line search optimizes step size - improves any method

Method Selection Guidelines

Problem-based Selection:

  - Small, smooth problems (up to a few hundred variables): Newton or BFGS.
  - Medium to large problems: Adam or L-BFGS.
  - Machine learning and noisy gradients: Adam (or momentum variants).
  - Sparse gradients: AdaGrad.

🎯 Practical Advice:

  - Start with Adam and its default hyperparameters, then tune the learning rate on a logarithmic grid.
  - Monitor the function value and gradient norm to catch divergence or stagnation early.

🏋️‍♂️ Exercises & Next Steps

Hands-on Exercises

💻 Programming Exercises

  1. Compare optimizers on the Rosenbrock function: run gradient descent, momentum, AdaGrad, Adam, and BFGS from the same starting point and compare how many iterations each needs to reach a small function value.

  2. Hyperparameter sensitivity: vary the learning rate (and β for momentum) over a logarithmic grid and plot the convergence curves.

  3. Real-world application: apply one of the optimizers to a small least-squares fitting problem of your choice and compare it with a library solver.

🔍 Further Reading

Next lecture preview: Automatic Differentiation - How to compute gradients efficiently! 🚀