Learning Objectives By the end of this lecture, you will be able to:
Why do we need optimization?
Optimization is at the heart of machine learning, engineering design, and scientific computing. Whether we’re training neural networks, fitting models to data, or designing optimal structures, we need efficient algorithms to find the best solutions.
Gradient-based optimization algorithms leverage the gradient (first-order derivative) of the objective function to iteratively find optimal solutions. These methods are particularly powerful when the objective function is differentiable and can achieve fast convergence rates compared to gradient-free alternatives.
Intuition First! 🤔 Imagine you’re hiking on a mountain in dense fog. You can’t see far ahead, but you can feel the slope under your feet. The gradient tells you:
Key Properties:
- The gradient $\nabla f(x)$ points in the direction of steepest ascent
- The negative gradient $-\nabla f(x)$ points in the direction of steepest descent
- The magnitude $\|\nabla f(x)\|$ indicates how steep the slope is
🎯 Interactive Question: What happens at a local minimum? Think about it before reading on…
Answer: At a local minimum, $\nabla f(x^*) = 0$ (the gradient vanishes)!
The Big Idea 💡 If we want to minimize a function $f(x)$, we should move in the direction in which $f$ decreases most rapidly. That direction is $-\nabla f(x)$!
Gradient descent is based on the first-order Taylor expansion. Moving slightly in the negative gradient direction, $x' = x - \alpha \nabla f(x)$:

$$f(x - \alpha \nabla f(x)) \approx f(x) - \alpha \|\nabla f(x)\|^2$$

Since $\|\nabla f(x)\|^2 \geq 0$, we have $f(x') \leq f(x)$ for small $\alpha > 0$.
The gradient descent update rule is beautifully simple:

$$x_{t+1} = x_t - \alpha \nabla f(x_t)$$

where:
- $x_t$ is our current position (parameter vector) at iteration $t$
- $\nabla f(x_t)$ is the gradient at the current position
- $\alpha$ is the learning rate or step size

🎯 Think About It: What happens if $\alpha$ is too large? Too small?
using ForwardDiff

function gradient_descent(f, x; niters::Int, learning_rate::Real)
    history = [x]
    for i = 1:niters
        g = ForwardDiff.gradient(f, x)  # gradient at the current point
        x -= learning_rate * g          # step in the steepest-descent direction
        push!(history, x)
    end
    return history
end
The Rosenbrock function is the “fruit fly” of optimization research:

$$f(x_1, x_2) = (1 - x_1)^2 + 100\,(x_2 - x_1^2)^2$$

Why is this function special?
- Global minimum at $(x_1, x_2) = (1, 1)$ with $f(1, 1) = 0$ ✅
💻 Live Coding Exercise Let’s implement and test gradient descent together:
# Step 1: Define the Rosenbrock function
function rosenbrock(x)
    (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
end

# Step 2: Compute gradient (we'll use ForwardDiff)
using ForwardDiff
∇f = x -> ForwardDiff.gradient(rosenbrock, x)

# Step 3: Test the gradient at point (0, 0)
println("Gradient at (0,0): ", ∇f([0.0, 0.0]))

# Step 4: Run gradient descent
x0 = [-1.0, -1.0]  # Starting point
history = gradient_descent(rosenbrock, x0; niters=10000, learning_rate=0.002)
println("Final point: ", history[end])
println("Final value: ", rosenbrock(history[end]))
🎯 Experiment Time: Try different learning rates: 0.001, 0.01, 0.1. What happens?
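Here is a minimal sketch of that experiment, reusing the `rosenbrock` and `gradient_descent` functions defined above (the starting point and iteration budget are arbitrary choices):

```julia
# Sweep a few learning rates and compare where gradient descent ends up
# after the same iteration budget. Very large rates may diverge (NaN/Inf),
# very small ones barely move — that is the point of the experiment.
for lr in (0.001, 0.01, 0.1)
    hist = gradient_descent(rosenbrock, [-1.0, -1.0]; niters=10_000, learning_rate=lr)
    println("learning_rate = ", lr, " → final point = ", hist[end],
            ", final value = ", rosenbrock(hist[end]))
end
```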
🚫 What Can Go Wrong?
🎯 Debugging Exercise: If your gradient descent is oscillating, what should you do?
Answer: Decrease the learning rate!
🏃‍♂️ The Momentum Analogy Imagine a ball rolling down a hill: momentum helps us “remember” where we were going and keeps us moving in consistent directions.
The momentum method adds a “velocity” term that accumulates gradients over time:

$$v_{t+1} = \beta v_t - \alpha \nabla f(x_t)$$
$$x_{t+1} = x_t + v_{t+1}$$

where:
- $v_t$ is the velocity (momentum) vector
- $\beta$ is the momentum coefficient (typically $\beta = 0.9$)
- $\alpha$ is the learning rate
function gradient_descent_momentum(f, x; niters::Int, β::Real, learning_rate::Real)
    history = [x]
    v = zero(x)  # Initialize velocity to zero
    for i = 1:niters
        g = ForwardDiff.gradient(f, x)
        v = β .* v .- learning_rate .* g  # Update velocity
        x += v                            # Move by velocity
        push!(history, x)
    end
    return history
end
Advantages:
Disadvantages:
- Adds a hyperparameter ($\beta$) to tune

🎯 Quick Quiz: What happens if $\beta = 0$? What if $\beta \to 1$?
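One way to explore the quiz is to rerun `gradient_descent_momentum` with different values of $\beta$ (a sketch; the learning rate and starting point are just the values used earlier):

```julia
# β = 0 reduces to plain gradient descent (no memory); β close to 1 keeps a
# long memory of past gradients and can overshoot the valley.
for β in (0.0, 0.5, 0.9)
    hist = gradient_descent_momentum(rosenbrock, [-1.0, -1.0];
                                     niters=10_000, β=β, learning_rate=0.002)
    println("β = ", β, " → final value = ", rosenbrock(hist[end]))
end
```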
🎯 The Core Insight Different parameters might need different learning rates! AdaGrad automatically adapts the learning rate for each parameter.
AdaGrad keeps track of the sum of squared gradients and uses it to scale the learning rate:

$$r_t = r_{t-1} + \nabla f(x_t) \odot \nabla f(x_t)$$
$$x_{t+1} = x_t - \frac{\alpha}{\sqrt{r_t + \epsilon}} \odot \nabla f(x_t)$$

Key components:
- $r_t$: accumulated squared gradients (never decreases!)
- $\epsilon$: tiny number ($\approx 10^{-8}$) to avoid division by zero
- $\odot$: element-wise multiplication
function adagrad_optimize(f, x; niters, learning_rate, ϵ=1e-8)
    rt = zero(x)  # accumulated squared gradients r_t
    η = zero(x)   # per-parameter effective learning rate
    history = [x]
    for step in 1:niters
        Δ = ForwardDiff.gradient(f, x)
        @. rt = rt + Δ^2                     # accumulate squared gradients
        @. η = learning_rate / sqrt(rt + ϵ)  # scale the learning rate per parameter
        x = x .- Δ .* η
        push!(history, x)
    end
    return history
end
AdaGrad’s main advantage is that it automatically adjusts the learning rate and is less sensitive to the choice of initial learning rate. However, the accumulated squared gradients grow monotonically, causing the effective learning rate to shrink over time, potentially leading to premature convergence.
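The shrinking effective learning rate is easy to see on a toy one-dimensional problem (a sketch, using $f(x) = x^2$ and an arbitrary initial learning rate):

```julia
# The accumulated squared gradient r only grows, so the effective learning
# rate α / √(r + ϵ) keeps shrinking from one step to the next.
let x = 2.0, r = 0.0, α = 0.5, ϵ = 1e-8
    for t in 1:5
        g = 2x                 # gradient of f(x) = x² at the current point
        r += g^2               # accumulate squared gradient (never decreases)
        η = α / sqrt(r + ϵ)    # effective learning rate for this step
        x -= η * g
        println("t = $t: effective learning rate = $η, x = $x")
    end
end
```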
🌟 Why Adam is Popular Adam = Momentum + AdaGrad + Bias Correction It’s like having the best of both worlds:
Adam maintains two exponential moving averages:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\,\nabla f(x_t)$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\,\nabla f(x_t) \odot \nabla f(x_t)$$

Then applies bias correction (important for early iterations):

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Finally updates parameters:

$$x_{t+1} = x_t - \frac{\alpha\,\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Default hyperparameters (work well in practice):
- $\beta_1 = 0.9$ (momentum decay)
- $\beta_2 = 0.999$ (gradient variance decay)
- $\alpha = 0.001$ (learning rate)
- $\epsilon = 10^{-8}$ (numerical stability)
function adam_optimize(f, x; niters, learning_rate, β1=0.9, β2=0.999, ϵ=1e-8)
    mt = zero(x)  # first moment (moving average of gradients)
    vt = zero(x)  # second moment (moving average of squared gradients)
    βp1 = β1      # running powers β1^t and β2^t for bias correction
    βp2 = β2
    history = [x]
    for step in 1:niters
        Δ = ForwardDiff.gradient(f, x)
        @. mt = β1 * mt + (1 - β1) * Δ
        @. vt = β2 * vt + (1 - β2) * Δ^2
        # bias-corrected step: α * m̂t / (√v̂t + ϵ), reusing Δ as the step buffer
        @. Δ = mt / (1 - βp1) / (√(vt / (1 - βp2)) + ϵ) * learning_rate
        βp1, βp2 = βp1 * β1, βp2 * β2
        x = x .- Δ
        push!(history, x)
    end
    return history
end
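A usage sketch, running the implementation above on the Rosenbrock function (the learning rate here is a guess to experiment with, not a tuned value):

```julia
history = adam_optimize(rosenbrock, [-1.0, -1.0]; niters=10_000, learning_rate=0.01)
println("Final point: ", history[end])
println("Final value: ", rosenbrock(history[end]))
```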
Adam has become one of the most popular optimization algorithms in machine learning because it combines momentum's acceleration with AdaGrad-style per-parameter adaptive learning rates and is robust across a wide range of problems.
Julia provides a comprehensive collection of optimization algorithms through the Optimisers.jl package. This package includes implementations of various gradient-based optimizers with a unified interface.
using Optimisers, ForwardDiff

# Available optimizers include:
# Descent, Momentum, Nesterov, RMSProp, Adam, AdaGrad, etc.

function optimize_with_optimisers(f, x0, optimizer_type; niters=1000)
    method = optimizer_type(0.01)  # learning rate
    state = Optimisers.setup(method, x0)
    history = [x0]
    x = copy(x0)
    for i = 1:niters
        grad = ForwardDiff.gradient(f, x)
        state, x = Optimisers.update(state, x, grad)
        push!(history, copy(x))
    end
    return history
end

# Example usage:
# history = optimize_with_optimisers(rosenbrock, [-1.0, -1.0], Optimisers.Adam)
The package provides a clean, composable interface where different optimizers can be easily swapped and combined. It also supports advanced features like gradient clipping, weight decay, and learning rate schedules.
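For example, rules can be composed with `OptimiserChain`; the sketch below applies gradient clipping and weight decay before an Adam update (the specific values are illustrative):

```julia
using Optimisers, ForwardDiff

# Compose gradient clipping, weight decay, and Adam into a single rule.
let rule = OptimiserChain(ClipGrad(1.0), WeightDecay(1e-4), Adam(0.01)),
    x = [-1.0, -1.0]

    state = Optimisers.setup(rule, x)
    for i in 1:1000
        g = ForwardDiff.gradient(rosenbrock, x)
        state, x = Optimisers.update(state, x, g)
    end
    println("Final point: ", x)
end
```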
🎯 The Big Question: Which Optimizer Should I Use? Short Answer: Start with Adam. It works well for most problems! Long Answer: It depends on your problem…
| Method | Convergence | Memory | Hyperparameters | Pros | Cons |
|---|---|---|---|---|---|
| Gradient Descent | Linear | O(n) | α | Simple, stable | Slow, sensitive to α |
| Momentum | Superlinear | O(n) | α, β | Faster, stable | Can overshoot |
| AdaGrad | Good | O(n) | α | Auto-adapts | Learning rate decay |
| Adam | Fast | O(n) | α, β₁, β₂ | Robust, fast | More complex |
| Newton | Quadratic | O(n²) | None | Very fast | Expensive O(n³) |
| BFGS | Superlinear | O(n²) | None | Fast, no Hessian | O(n²) memory |
| L-BFGS | Superlinear | O(mn) | m | Scalable | Approximation errors |
🌳 Choose Your Optimizer
Learning Rate Selection:
- Start with $\alpha = 0.001$ for Adam, $\alpha = 0.01$ for SGD

Other Parameters:
- Adam: Use the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$
- Momentum: Try $\beta = 0.9$ or $\beta = 0.99$
- L-BFGS: Use $m = 5$ to $20$ (memory parameter)
- Problem: Loss is exploding / NaN values. Solution: Reduce the learning rate, check the gradient computation.
- Problem: Very slow convergence. Solution: Increase the learning rate, try Adam instead of SGD.
- Problem: Oscillating around the minimum. Solution: Reduce the learning rate, add momentum.
- Problem: Stuck in a plateau. Solution: Check the gradient computation, try a different initialization.
Monitor these metrics during optimization:
- Function value $f(x_t)$ (should decrease)
- Gradient norm $\|\nabla f(x_t)\|$ (should approach 0)
- Parameter change $\|x_{t+1} - x_t\|$ (should get smaller)
Stop when: gradient norm < threshold OR parameter change < threshold OR max iterations reached.
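A minimal sketch of such a stopping rule wrapped around plain gradient descent (the threshold values are arbitrary illustrations):

```julia
using ForwardDiff, LinearAlgebra

# Gradient descent that tracks the quantities above and stops when the
# gradient norm or the parameter change falls below a threshold.
function gradient_descent_monitored(f, x; learning_rate=0.002, maxiters=100_000,
                                    gtol=1e-6, xtol=1e-10)
    for i in 1:maxiters
        g  = ForwardDiff.gradient(f, x)
        Δx = learning_rate .* g
        x  = x .- Δx
        if norm(g) < gtol || norm(Δx) < xtol
            println("Converged after $i iterations: f(x) = ", f(x))
            return x
        end
    end
    println("Reached the maximum number of iterations.")
    return x
end

# gradient_descent_monitored(rosenbrock, [-1.0, -1.0])
```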
Newton’s method is a second-order optimization algorithm that uses both first and second derivatives of the objective function. It approximates the function locally as a quadratic and finds the minimum of this quadratic approximation.
The Newton method update rules are:

$$H_t\, \Delta x_t = -\nabla f(x_t), \qquad x_{t+1} = x_t + \Delta x_t$$

where:
- $H_t$ is the Hessian matrix (matrix of second derivatives) at iteration $t$
- $\nabla f(x_t)$ is the gradient vector at iteration $t$
- $\Delta x_t$ is the Newton step direction
function newton_optimizer(f, x; tol=1e-5)
    k = 0
    history = [x]
    while k < 1000
        k += 1
        gk = ForwardDiff.gradient(f, x)
        hk = ForwardDiff.hessian(f, x)
        dx = -hk \ gk  # Solve Hk * dx = -gk
        x += dx
        push!(history, x)
        sum(abs2, dx) < tol && break  # stop when the Newton step becomes tiny
    end
    return history
end
Advantages of Newton’s Method:
Disadvantages:
- Requires computation and inversion of the Hessian matrix: $O(n^3)$ cost per iteration
- Hessian computation requires $O(n)$ times more resources than gradient computation
The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is a quasi-Newton method that approximates the Hessian matrix using only gradient information. It belongs to the class of quasi-Newton methods, which build up curvature information over iterations.
The BFGS algorithm updates an approximation $B_k$ of the Hessian matrix using the secant equation:

$$B_{k+1} s_k = y_k, \qquad \text{where } s_k = x_{k+1} - x_k, \quad y_k = \nabla f(x_{k+1}) - \nabla f(x_k).$$

The Hessian approximation is updated using the BFGS formula:

$$B_{k+1} = B_k - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k} + \frac{y_k y_k^\top}{y_k^\top s_k}$$

This update satisfies the secant equation: $B_{k+1} s_k = y_k$.
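To make the update concrete, here is a minimal sketch of a single BFGS update of $B_k$ given $s_k$ and $y_k$ (no line search or curvature-condition safeguards, which a real implementation would need):

```julia
using LinearAlgebra

# One BFGS update of the Hessian approximation B, given the step s and the
# gradient difference y. The result satisfies the secant equation B⁺ s = y.
function bfgs_update(B::AbstractMatrix, s::AbstractVector, y::AbstractVector)
    Bs = B * s
    return B - (Bs * Bs') / dot(s, Bs) + (y * y') / dot(y, s)
end

# Quick check of the secant equation:
# B = Matrix(1.0I, 2, 2); s = randn(2); y = randn(2)
# bfgs_update(B, s, y) * s ≈ y   # should be true
```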
using Optim, ForwardDiff

function optimize_bfgs(f, x0; iterations=1000)
    options = Optim.Options(iterations=iterations, store_trace=true, extended_trace=true)
    # The gradient closure is out-of-place, so pass `inplace = false`.
    g = x -> ForwardDiff.gradient(f, x)
    result = optimize(f, g, x0, BFGS(), options; inplace=false)
    return result
end

# Example usage
result = optimize_bfgs(rosenbrock, [-1.0, -1.0])
Key Features of BFGS:
- Memory requirement: $O(n^2)$ for storing the Hessian approximation
For large-scale problems, the $O(n^2)$ memory requirement of BFGS becomes prohibitive. L-BFGS addresses this by storing only the last $m$ pairs of vectors $(s_k, y_k)$ instead of the full Hessian approximation.

L-BFGS uses a two-loop recursion to compute the search direction implicitly, requiring only $O(mn)$ memory, where $m$ is typically 5-20.
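The two-loop recursion itself is short; below is a sketch, assuming the last $m$ pairs $(s_i, y_i)$ are stored oldest-first in vectors `S` and `Y` (real implementations add safeguards and a line search):

```julia
using LinearAlgebra

# Given the current gradient g and the stored pairs S[i] = sᵢ, Y[i] = yᵢ
# (oldest first), return the L-BFGS search direction ≈ -H⁻¹ g without ever
# forming a Hessian matrix.
function lbfgs_direction(g::AbstractVector, S::Vector, Y::Vector)
    m = length(S)
    q = copy(g)
    α = zeros(m)
    ρ = [1 / dot(Y[i], S[i]) for i in 1:m]
    for i in m:-1:1                   # first loop: newest pair to oldest
        α[i] = ρ[i] * dot(S[i], q)
        q .-= α[i] .* Y[i]
    end
    γ = m > 0 ? dot(S[end], Y[end]) / dot(Y[end], Y[end]) : 1.0
    r = γ .* q                        # initial Hessian approximation H₀ = γI
    for i in 1:m                      # second loop: oldest pair to newest
        β = ρ[i] * dot(Y[i], r)
        r .+= (α[i] - β) .* S[i]
    end
    return -r                         # search direction
end
```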
# L-BFGS is available in Optim.jl
result = optimize(rosenbrock, [-1.0, -1.0], LBFGS())
The golden section search is a technique for finding the minimum of a unimodal function by successively narrowing the range of values. It’s particularly useful as a line search method within other optimization algorithms.
The method uses the golden ratio $\varphi = (1 + \sqrt{5})/2 \approx 1.618$ (the code below uses its conjugate $\tau = 1/\varphi \approx 0.618$) to place evaluation points that maintain the same proportional reduction in the search interval.
function golden_section_search(f, a, b; tol=1e-5)
    τ = (√5 - 1) / 2  # Golden ratio conjugate
    x1 = a + (1 - τ) * (b - a)
    x2 = a + τ * (b - a)
    f1, f2 = f(x1), f(x2)
    k = 0
    while b - a > tol
        k += 1
        if f1 > f2
            a = x1
            x1 = x2
            f1 = f2
            x2 = a + τ * (b - a)
            f2 = f(x2)
        else
            b = x2
            x2 = x1
            f2 = f1
            x1 = a + (1 - τ) * (b - a)
            f1 = f(x1)
        end
    end
    return f1 < f2 ? (x1, f1) : (x2, f2)
end

# Example: Find minimum of (x-4)²
x_min, f_min = golden_section_search(x -> (x-4)^2, -5, 5)
Properties of Golden Section Search:
- Convergence rate: the bracket shrinks by a factor of $\tau \approx 0.618$ per iteration, i.e. its length is $\tau^k (b - a)$, where $k$ is the number of iterations
Line search methods like golden section search are commonly used within gradient-based algorithms to determine optimal step sizes:
Exact Line Search: Find $\alpha^* = \arg\min_{\alpha > 0} f(x_k + \alpha d_k)$ along the current search direction $d_k$.
These techniques ensure that each iteration makes substantial progress toward the minimum while maintaining algorithm stability.
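As a sketch of exact line search, the golden section routine above can supply the step size inside steepest descent (the bracket $[0, 1]$ is an arbitrary choice):

```julia
using ForwardDiff

# Steepest descent where each step size is chosen by golden section search
# along the negative gradient direction (an exact line search).
function gradient_descent_linesearch(f, x; niters=100)
    for i in 1:niters
        g = ForwardDiff.gradient(f, x)
        ϕ = α -> f(x .- α .* g)                     # 1-D slice along -g
        α, _ = golden_section_search(ϕ, 0.0, 1.0)   # bracket [0, 1] is a guess
        x = x .- α .* g
    end
    return x
end

# gradient_descent_linesearch(rosenbrock, [-1.0, -1.0])
```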
| Method | Order | Convergence | Cost per Iter | Memory | Best Use Case |
|---|---|---|---|---|---|
| Gradient Descent | 1st | Linear | O(n) | O(n) | Large-scale, simple |
| Momentum | 1st | Linear+ | O(n) | O(n) | Consistent gradients |
| Adam | 1st | Fast | O(n) | O(n) | Deep learning, general |
| Newton | 2nd | Quadratic | O(n³) | O(n²) | Small-scale, accurate |
| BFGS | Quasi-2nd | Superlinear | O(n²) | O(n²) | Medium-scale |
| L-BFGS | Quasi-2nd | Superlinear | O(mn) | O(mn) | Large-scale |
Table 2: Comprehensive comparison of optimization methods. Here $n$ is the problem dimension and $m$ is the L-BFGS memory parameter.
🌟 Key Takeaways
Problem-based Selection:
🎯 Practical Advice:
💻 Programming Exercises
1. Compare optimizers on the Rosenbrock function.
2. Hyperparameter sensitivity.
3. Real-world application.
Recommended packages: Optim.jl, Optimisers.jl, Flux.jl

Next lecture preview: Automatic Differentiation - how to compute gradients efficiently! 🚀