Convex Optimization: Theory, Algorithms, and Applications
Convex optimization is the study of minimizing convex functions over convex sets. Its power lies in a fundamental guarantee: every local minimum is a global minimum. This makes convex problems tractable and explains why they appear throughout machine learning, engineering, finance, and operations research.
Learning Objectives
- ✓Define convex sets and verify convexity using standard operations
- ✓Characterize convex functions via epigraphs, Jensen's inequality, and second-order conditions
- ✓State and apply the Lagrangian, weak duality, and strong duality
- ✓Derive and verify KKT conditions for constrained problems
- ✓Understand the simplex method and interior point methods for LP
- ✓Analyze convergence of gradient descent and Nesterov acceleration
- ✓Apply proximal gradient and ADMM to composite objectives
- ✓Recognize convex structure in SVMs, LASSO, and portfolio optimization
1. Convex Sets
A set C in R^n is convex if the line segment between any two points in C lies entirely within C. This geometric condition underpins all of convex optimization theory.
Formal Definition
A set C is convex if for all x, y in C and all t in [0, 1]:
t*x + (1-t)*y is in C
The point t*x + (1-t)*y is called a convex combination of x and y. More generally, a convex combination of k points x1, ..., xk uses non-negative weights t1, ..., tk summing to 1.
Examples of Convex Sets
Always Convex
- Affine sets (lines, planes, hyperplanes)
- Halfspaces: (x : a^T x ≤ b)
- Polyhedra: (x : Ax ≤ b, Cx = d)
- Euclidean balls: (x : ||x - c|| ≤ r)
- Ellipsoids: (x : (x-c)^T P^(-1) (x-c) ≤ 1)
- Positive semidefinite cone S^n_+
Not Convex
- Non-convex: the union of two disjoint balls
- Non-convex: the set of rank-1 matrices
- Non-convex: the integer lattice Z^n
- Non-convex: the complement of a convex set
- Non-convex: annuli (rings between two circles)
Operations Preserving Convexity
Convexity is preserved under several fundamental operations, which allows building complex convex sets from simple ones:
- Intersection: If C1 and C2 are convex, so is C1 intersect C2 (and any arbitrary intersection of convex sets).
- Affine image: If C is convex and f(x) = Ax + b is affine, then f(C) is convex. Likewise, the preimage f^(-1)(C) of a convex set under an affine map is convex.
- Sum (Minkowski sum): C1 + C2 = (x + y : x in C1, y in C2) is convex when both sets are convex.
- Perspective image: The perspective map P(z, t) = z/t (t greater than 0) maps convex sets to convex sets.
- Linear fractional map: f(x) = (Ax + b) / (c^T x + d) preserves convexity on domains where c^T x + d is positive.
Convex Hull
The convex hull of a set S, written conv(S), is the smallest convex set containing S. Equivalently, it is the set of all convex combinations of points in S. For a finite set of points v1, ..., vk, the convex hull is the polytope formed by all points t1*v1 + ... + tk*vk where each ti ≥ 0 and the sum of ti equals 1.
Supporting Hyperplanes
A supporting hyperplane of a convex set C at a boundary point x0 is a hyperplane H = (x : a^T x = a^T x0) such that C lies entirely in one halfspace: a^T x ≤ a^T x0 for all x in C. The supporting hyperplane theorem guarantees that every boundary point of a convex set has at least one supporting hyperplane. This result underlies the separation theorem and duality theory.
Separating Hyperplane Theorem
If C and D are disjoint convex sets, there exists a hyperplane that separates them: there exist a and b such that a^T x ≤ b for all x in C and a^T x ≥ b for all x in D. Strong separation (with strict inequalities) holds when both sets are closed and at least one is compact.
2. Convex Functions
A function f: R^n to R is convex if its domain is a convex set and for all x, y in the domain and t in [0, 1]:
f(t*x + (1-t)*y) ≤ t*f(x) + (1-t)*f(y)
Geometrically: the chord between any two points on the graph lies above or on the graph. If the inequality is strict for x ≠ y and t in (0,1), the function is strictly convex.
The Epigraph
The epigraph of f is the set of points on and above its graph: epi(f) = (x, t) : f(x) ≤ t. A function is convex if and only if its epigraph is a convex set. This geometric characterization connects convex functions to convex sets and allows analysis tools from set theory to apply to functions.
Jensen's Inequality
For a convex function f and a random variable X with finite expectation:
f(E[X]) ≤ E[f(X)]
This is one of the most important inequalities in mathematics. Applications include: proving the AM-GM inequality (f(x) = -log x is convex), establishing information-theoretic bounds, and deriving the log-sum inequality used in EM algorithms.
First-Order Condition
For a differentiable function f, convexity is equivalent to the first-order condition: f lies above all its tangent planes. Formally, f is convex if and only if for all x, y in the domain:
f(y) ≥ f(x) + ∇f(x)^T (y - x)
This means the gradient ∇f(x) provides a global underestimate of f. A critical point (where ∇f(x) = 0) is automatically a global minimum for convex functions — this is the key property exploited by optimization algorithms.
Second-Order Condition
For a twice-differentiable function f on R^n, f is convex if and only if its Hessian matrix is positive semidefinite everywhere on its domain:
∇²f(x) is positive semidefinite (PSD) for all x in dom(f)
Meaning: for all x and all directions v, v^T ∇²f(x) v ≥ 0. If the Hessian is strictly positive definite (all eigenvalues positive), then f is strictly convex. Common examples: f(x) = x^2 has Hessian 2 (PSD); f(x, y) = x^2 + y^2 has Hessian 2I (PSD); f(x) = log(x) has Hessian -1/x^2 (negative), so it is concave.
Strong Convexity
A function f is m-strongly convex (m greater than 0) if:
f(y) ≥ f(x) + ∇f(x)^T (y - x) + (m/2) * ||y - x||²
Strong convexity means f curves up at least as fast as a quadratic. Equivalent characterization: ∇²f(x) is at least mI in the PSD order. Strong convexity guarantees a unique global minimum and linear convergence for gradient descent.
Operations Preserving Convexity of Functions
- Non-negative sum: if f and g are convex, so is alpha*f + beta*g for alpha, beta ≥ 0.
- Affine composition: f(Ax + b) is convex when f is convex.
- Pointwise maximum: max(f, g) is convex; more generally, the supremum of any collection of convex functions is convex.
- Partial minimization: g(x) = inf_y f(x, y) is convex in x when f is jointly convex in (x, y).
- Composition rules: h(f(x)) is convex when h is convex non-decreasing and f is convex, or when h is convex non-increasing and f is concave.
3. Optimization Problem Standard Forms
A convex optimization problem has the standard form: minimize f0(x) subject to fi(x) ≤ 0 for i = 1, ..., m and hi(x) = 0 for i = 1, ..., p, where f0 and all fi are convex and all hi are affine.
Feasibility and Optimal Value
Feasible Set
The feasible set D is the set of all points satisfying the constraints. An optimization problem is feasible if D is non-empty; infeasible otherwise. A problem is bounded below if f0 has a finite infimum over D. The optimal value p* is the infimum of f0 over D, which may be -infinity (unbounded) or not attained.
Optimality
A feasible point x* is optimal if f0(x*) = p*. For convex problems, the set of optimal points is convex (and is a singleton when f0 is strictly convex). Local minima are global minima — this is what makes convex problems solvable efficiently.
The Lagrangian
The Lagrangian L(x, lambda, nu) incorporates the constraints into the objective using multipliers:
L(x, lambda, nu) = f0(x) + sum_i lambda_i * fi(x) + sum_j nu_j * hj(x)
where lambda_i ≥ 0 are the dual variables (Lagrange multipliers) for the inequality constraints, and nu_j are the dual variables for the equality constraints (which can be any sign). The Lagrangian provides a lower bound on the optimal value: for lambda ≥ 0 and any feasible x, L(x, lambda, nu) ≤ f0(x).
Optimality Conditions (Unconstrained)
For unconstrained minimization of a differentiable convex f, x* is optimal if and only if ∇f(x*) = 0. For a non-differentiable convex function, x* is optimal if and only if 0 is in the subdifferential of f at x*, where the subdifferential is the set of all subgradients — vectors g satisfying f(y) ≥ f(x) + g^T (y - x) for all y.
4. Lagrangian Duality
Duality theory provides a systematic way to derive lower bounds on the optimal value, verify optimality, and sometimes solve problems more efficiently via their dual formulations.
The Lagrange Dual Function
The dual function g(lambda, nu) is the infimum of the Lagrangian over all x:
g(lambda, nu) = inf_x L(x, lambda, nu)
The dual function is always concave (as the infimum of affine functions of (lambda, nu)), even when the primal problem is non-convex. For any lambda ≥ 0 and any nu, we have g(lambda, nu) ≤ p* — the dual function provides a lower bound on the optimal primal value.
Weak and Strong Duality
Weak Duality
Always holds: d* ≤ p*, where d* is the optimal dual value (the supremum of g over lambda ≥ 0 and all nu). The difference p* - d* ≥ 0 is called the duality gap. Weak duality applies to any optimization problem, convex or not.
Strong Duality
The duality gap is zero: d* = p*. For convex problems, strong duality holds when a constraint qualification is satisfied. The most common is Slater's condition. Strong duality means solving the dual yields the primal optimum.
Slater's Condition
Slater's condition holds if there exists a strictly feasible point: a point x in the interior of the domain with fi(x) less than 0 for all i (strict inequality for all inequality constraints) and hi(x) = 0 for all equality constraints.
When Slater's condition holds for a convex problem, strong duality holds and the dual optimal is attained. This is the most commonly used condition in practice — most well-posed convex programs satisfy it automatically.
Complementary Slackness
If strong duality holds and x*, (lambda*, nu*) are primal and dual optimal, then complementary slackness holds:
lambda_i* * fi(x*) = 0 for all i = 1, ..., m
This means for each inequality constraint: either the constraint is active (fi(x*) = 0) or the dual variable is zero (lambda_i* = 0) — they cannot both be nonzero simultaneously. This condition is used to identify which constraints are active at the optimum and to verify optimality.
5. KKT Conditions
The Karush-Kuhn-Tucker (KKT) conditions are the cornerstone of constrained optimization theory. They unify primal feasibility, dual feasibility, stationarity, and complementary slackness into a single system.
The Four KKT Conditions
1. Primal feasibility
fi(x*) ≤ 0 for all i, hj(x*) = 0 for all j
2. Dual feasibility
lambda_i* ≥ 0 for all i
3. Complementary slackness
lambda_i* * fi(x*) = 0 for all i
4. Stationarity
∇f0(x*) + sum_i lambda_i* ∇fi(x*) + sum_j nu_j* ∇hj(x*) = 0
Necessity and Sufficiency
Necessary Conditions
If x* is a local minimum of a smooth problem and a constraint qualification holds (e.g., LICQ: the gradients of active constraints are linearly independent), then there exist multipliers such that all four KKT conditions hold at x*. Without a constraint qualification, KKT may fail even at an optimum.
Sufficient Conditions
For convex problems, the KKT conditions are both necessary and sufficient (under Slater's condition). If x*, (lambda*, nu*) satisfy all four KKT conditions and the problem is convex, then x* is globally optimal. This makes KKT a complete characterization of optimality for convex programs.
KKT Example: Equality-Constrained QP
Minimize (1/2) x^T P x + q^T x subject to Ax = b (P PSD). The Lagrangian is L(x, nu) = (1/2) x^T P x + q^T x + nu^T (Ax - b). Stationarity gives Px + q + A^T nu = 0. Combining with Ax = b:
[ P A^T ] [ x* ] [ -q ]
[ A 0 ] [ nu* ] = [ b ]
This KKT system is a linear system in x* and nu*. When P is positive definite, the system has a unique solution.
6. Linear Programming
Linear programming (LP) minimizes a linear objective over a polyhedron defined by linear constraints. It is the simplest and most widely applied class of convex optimization.
Standard Form
minimize c^T x
subject to: Ax = b, x ≥ 0
Any LP can be converted to this form. The feasible set is a polytope (bounded) or polyhedron (unbounded). If an optimal solution exists for a bounded LP, it occurs at a vertex.
LP Duality
The dual of the standard-form LP is:
maximize b^T y
subject to: A^T y ≤ c
Strong duality always holds for LP (when both primal and dual are feasible). The optimal primal and dual values are equal. This is proved via the fundamental theorem of linear programming and is a special case of the general duality theory.
The Simplex Method
The simplex method is the classical algorithm for LP. It exploits the fact that the optimal solution (if finite) is attained at a vertex of the feasible polytope. A basic feasible solution corresponds to a vertex — it is determined by n - m basic variables (in an m-by-n constraint matrix) set to specific values while m variables are zero (non-basic). The algorithm:
- 1. Start at an initial basic feasible solution (vertex).
- 2. Compute reduced costs for non-basic variables.
- 3. If all reduced costs are non-negative, current solution is optimal.
- 4. Otherwise, select an entering variable (negative reduced cost).
- 5. Perform a pivot: move along an edge to an adjacent vertex.
- 6. Repeat from step 2.
Simplex Complexity
In the worst case, the simplex method visits exponentially many vertices (Klee-Minty examples). However, in practice it typically terminates in O(m) to O(3m) pivots, making it highly efficient. Degeneracy (multiple optimal vertices) requires anti-cycling rules such as Bland's rule.
Interior Point Methods for LP
Interior point (barrier) methods traverse the interior of the feasible set rather than its boundary. The logarithmic barrier method adds a barrier term to keep iterates strictly feasible:
minimize c^T x - (1/t) * sum_i log(-fi(x))
As t increases, solutions of these barrier problems approach the optimal solution of the original LP along the central path. Interior point methods have polynomial worst-case complexity O(n^3.5 * log(1/epsilon)) and are preferred for large, dense LPs. They also generalize to SOCP and SDP, unlike the simplex method.
7. Quadratic Programming
Quadratic programs (QPs) minimize a quadratic objective over a polyhedron. When the objective is convex (positive semidefinite Hessian), the problem is tractable and widely used in machine learning and control.
Standard Form
minimize (1/2) x^T P x + q^T x + r
subject to: Gx ≤ h, Ax = b
Convex when P is positive semidefinite. Strictly convex (unique optimum) when P is positive definite. LP is a special case (P = 0).
Active Set Methods
Active set methods maintain a working set of constraints treated as equalities. At each step they solve an equality-constrained QP (the KKT system) and update the active set by adding or removing constraints. They are efficient when the number of active constraints at the optimum is small and work well for warm-starting with an initial estimate of the active set.
Applications in Machine Learning
- Support Vector Machines (SVMs): The hard-margin SVM minimizes ||w||^2 / 2 subject to yi(w^T xi + b) ≥ 1. The dual is a QP with non-negativity constraints on the dual variables (alpha_i), which is often more tractable and enables the kernel trick.
- Soft-margin SVM: Adds slack variables xi_i: minimize ||w||^2 / 2 + C * sum xi_i subject to yi(w^T xi + b) ≥ 1 - xi_i and xi_i ≥ 0. The regularization parameter C controls the tradeoff between margin maximization and misclassification.
- Ridge regression: Minimizes ||Ax - b||^2 + lambda * ||x||^2. The closed-form solution (A^T A + lambda I)^(-1) A^T b is a QP with no inequality constraints.
8. SOCP and Semidefinite Programming
Second-Order Cone Programming (SOCP)
SOCP minimizes a linear objective subject to second-order cone constraints:
minimize f^T x
subject to: ||Ai*x + bi|| ≤ ci^T x + di, Fx = g
The second-order cone in R^(n+1) is (x, t) : ||x|| ≤ t. SOCP generalizes LP and QP: every LP and QP with PSD objective can be recast as an SOCP. Applications include robust LP, antenna array weight design, filter design, and portfolio optimization with transaction costs.
Semidefinite Programming (SDP)
SDP minimizes a linear objective subject to linear matrix inequality (LMI) constraints, meaning certain symmetric matrix expressions must remain positive semidefinite:
minimize c^T x
subject to: F(x) = F0 + x1*F1 + ... + xn*Fn is PSD
where F0, ..., Fn are symmetric matrices. SDP generalizes LP (by replacing the non-negativity constraint on a scalar with PSD on a 1x1 matrix). The feasible set is the intersection of an affine subspace with the PSD cone — a convex but not polyhedral set.
Key SDP Applications
- Eigenvalue optimization: Minimizing the largest eigenvalue of a symmetric affine matrix function is an SDP.
- Max-cut relaxation: The Goemans-Williamson SDP relaxation gives a 0.878 approximation guarantee for max-cut, better than any purely combinatorial algorithm known.
- Lyapunov stability: Finding a Lyapunov function V(x) = x^T P x for a linear system requires P to satisfy LMIs, formulated as an SDP.
- Sum-of-squares: Verifying that a polynomial is non-negative can be cast as finding an SOS (sum of squares) decomposition, which is an SDP.
9. Gradient Descent and Convergence Analysis
Gradient-based methods are the workhorses of large-scale convex and non-convex optimization, particularly in machine learning. Their convergence guarantees are well-understood for convex objectives.
Gradient Descent
x(k+1) = x(k) - alpha_k * ∇f(x(k))
Each iteration moves in the direction of steepest descent. The step size alpha_k (learning rate) controls how far to move. A fixed step size alpha = 1/L works when f is L-smooth: the gradient is Lipschitz with constant L, meaning ∇f changes at most L per unit distance.
Convergence Rate Analysis
Convex, L-smooth
With step size alpha = 1/L:
f(x(k)) - f* ≤ L * ||x(0) - x*||^2 / (2k)
O(1/k) convergence in objective value. To reach epsilon accuracy requires O(1/epsilon) iterations.
m-Strongly Convex, L-smooth
With optimal step size alpha = 2/(m+L):
||x(k) - x*||^2 ≤ ((L-m)/(L+m))^(2k) * ||x(0)-x*||^2
Linear (geometric) convergence. Condition number kappa = L/m governs the rate; larger kappa means slower convergence.
Step Size Selection
- Fixed step size: alpha = 1/L works when L is known. Simple but may be conservative.
- Backtracking line search: Start with alpha = 1, reduce by factor beta (e.g., 0.5) until the Armijo condition f(x - alpha*∇f(x)) ≤ f(x) - alpha*(alpha/2)*||∇f(x)||^2 holds. Adapts automatically and avoids needing L.
- Diminishing step sizes: For subgradient methods on non-smooth problems, step sizes alpha_k = c/sqrt(k) guarantee O(1/sqrt(k)) convergence to the optimal value.
- Barzilai-Borwein: Use a step size that approximates the inverse Hessian locally. Cheap, no line search, often dramatically faster in practice.
Nesterov Accelerated Gradient Method
Nesterov's method achieves O(1/k^2) convergence for smooth convex functions — provably optimal for first-order methods. It adds a momentum term that extrapolates beyond the current iterate:
y(k) = x(k) + (tk - 1) / t(k+1) * (x(k) - x(k-1))
x(k+1) = y(k) - (1/L) * ∇f(y(k))
t(k+1) = (1 + sqrt(1 + 4*tk^2)) / 2
The momentum coefficient (tk - 1) / t(k+1) approaches 1 as k grows. The method is sometimes called FISTA (Fast ISTA) in the proximal gradient context. It achieves the optimal O(1/k^2) rate for smooth convex problems — the same rate as conjugate gradient for quadratics.
Stochastic Gradient Descent (SGD)
For problems of the form minimize (1/n) sum_i fi(x), SGD replaces the full gradient with a stochastic gradient using a single sample or mini-batch. The update is x(k+1) = x(k) - alpha_k * ∇fi(k)(x(k)) where i(k) is a randomly chosen index. SGD has O(1/sqrt(k)) convergence for convex problems with diminishing step sizes but can be faster per iteration when n is large. Variance reduction methods (SVRG, SAGA) recover linear convergence for strongly convex problems.
10. Proximal and Splitting Methods
Many practical problems have non-smooth components (e.g., the L1 penalty in LASSO) that gradient descent cannot handle directly. Proximal methods generalize gradient descent to non-smooth objectives by replacing gradient steps with proximal operator evaluations.
The Proximal Operator
prox_(alpha*g)(v) = argmin_x (g(x) + (1/(2*alpha)) * ||x - v||^2)
The proximal operator of a function g finds the point that balances minimizing g with staying close to v. Key examples:
- L1 norm g(x) = lambda*||x||_1: prox is soft thresholding — S_lambda(v)_i = sign(vi) * max(|vi| - lambda, 0).
- Indicator of convex set C: prox is the Euclidean projection onto C: proj_C(v) = argmin_(x in C) ||x-v||.
- Nuclear norm: prox performs soft thresholding of singular values.
Proximal Gradient Method (ISTA)
For composite objectives f(x) = g(x) + h(x) where g is smooth and convex and h is convex but possibly non-smooth (and with a tractable prox), the proximal gradient method alternates a gradient step on g and a proximal step on h:
x(k+1) = prox_(alpha*h)(x(k) - alpha * ∇g(x(k)))
This achieves O(1/k) convergence for convex g, and O(1/k^2) with Nesterov acceleration (FISTA). For LASSO (g = (1/2)||Ax-b||^2, h = lambda*||x||_1), the prox step is soft thresholding and the gradient step is a residual update.
ADMM: Alternating Direction Method of Multipliers
ADMM solves problems of the form: minimize f(x) + g(z) subject to Ax + Bz = c. The augmented Lagrangian is:
L_rho = f(x) + g(z) + y^T (Ax + Bz - c) + (rho/2) * ||Ax+Bz-c||^2
The ADMM iterations are:
x(k+1) = argmin_x L_rho(x, z(k), y(k)) [x-update]
z(k+1) = argmin_z L_rho(x(k+1), z, y(k)) [z-update]
y(k+1) = y(k) + rho * (A*x(k+1) + B*z(k+1) - c) [dual update]
ADMM converges to the optimal solution under mild conditions (f and g convex). Each sub-problem involves only one block of variables, enabling distributed computation. Applications include LASSO, group LASSO, basis pursuit, distributed MPC, and consensus optimization across networks.
Douglas-Rachford Splitting
Douglas-Rachford splitting solves minimize f(x) + g(x) using the resolvent (proximal) operators of both f and g alternately. Unlike ADMM, it does not require the splitting variable z or a coupling constraint. It is particularly useful when both functions have efficiently computable proximal operators (e.g., projection onto two convex sets).
11. Applications in Machine Learning and Engineering
LASSO (L1 Regularized Regression)
minimize (1/2) * ||Ax - b||^2 + lambda * ||x||_1
The L1 penalty promotes sparsity: the solution tends to have many zero entries. This happens because the L1 ball has corners along coordinate axes where the objective gradient is most likely to align. LASSO is widely used for feature selection and compressed sensing. It can be solved via ISTA/FISTA or coordinate descent.
Portfolio Optimization
Markowitz mean-variance portfolio optimization:
minimize w^T * Sigma * w - mu^T w
subject to: 1^T w = 1, w ≥ 0 (long-only)
where Sigma is the return covariance matrix and mu is the vector of expected returns. This is a convex QP. The efficient frontier is the set of optimal (risk, return) pairs as the risk-aversion parameter varies. Extensions include transaction cost penalties (SOCP) and factor model constraints.
Compressed Sensing
Recover a sparse signal x from m measurements y = Ax (m much less than n):
minimize ||x||_1 subject to Ax = b
This is the basis pursuit problem (an LP). Under the restricted isometry property (RIP) on A, basis pursuit recovers the sparsest solution exactly. The Dantzig selector and LASSO are related convex relaxations for noisy measurements.
Control: Model Predictive Control (MPC)
MPC solves a convex QP or SOCP at each time step to compute optimal control inputs over a finite horizon, subject to state and input constraints. The quadratic cost penalizes deviations from the reference trajectory and large inputs. Only the first control action is applied; the problem is re-solved at the next step (receding horizon). ADMM and interior point methods are used for real-time MPC.
Summary Table: Applications and Problem Types
| Application | Problem Type | Typical Algorithm |
|---|---|---|
| Linear SVM | QP | SMO, Interior Point |
| LASSO | Composite Convex | ISTA/FISTA, Coord. Descent |
| Ridge Regression | Unconstrained QP | Closed form, Gradient Descent |
| Portfolio (Markowitz) | QP | Interior Point |
| Compressed Sensing | LP (Basis Pursuit) | ADMM, Interior Point |
| Network Flow | LP | Simplex, Network Simplex |
| Robust Optimization | SOCP | Interior Point |
| Lyapunov Stability | SDP | Interior Point (SDPT3, SCS) |
12. Practice Problems with Solutions
Problem 1: Verifying Convexity
Determine whether each function is convex, concave, or neither: (a) f(x) = e^x on R; (b) f(x) = log(x) on R_++; (c) f(x, y) = x^2 / y on (x, y : y greater than 0); (d) f(x) = ||x||^2 on R^n.
Show Solution
(a) f(x) = e^x: f''(x) = e^x greater than 0 for all x, so the Hessian is positive — f is strictly convex.
(b) f(x) = log(x): f''(x) = -1/x^2 less than 0 for x greater than 0, so f is strictly concave.
(c) f(x, y) = x^2 / y: This is a ratio of a quadratic and a linear function. The Hessian is H = (2/y) * [1, -x/y; -x/y, x^2/y^2]. The eigenvalues are 0 and 2(x^2 + y^2)/y^3 greater than 0, so H is PSD — f is convex. (It can also be seen as a perspective function of g(x) = x^2, which is convex.)
(d) f(x) = ||x||^2: The Hessian is 2I, which is positive definite — f is strictly convex.
Problem 2: KKT Conditions
Solve using KKT conditions: minimize x1^2 + x2^2 subject to x1 + x2 = 1. Find the optimal solution and verify the KKT conditions.
Show Solution
The Lagrangian is L(x1, x2, nu) = x1^2 + x2^2 + nu*(x1 + x2 - 1).
Stationarity: ∂L/∂x1 = 2x1 + nu = 0, so x1 = -nu/2. Similarly ∂L/∂x2 = 2x2 + nu = 0, so x2 = -nu/2. Therefore x1 = x2.
Primal feasibility: x1 + x2 = 1, combined with x1 = x2 gives x1* = x2* = 1/2.
Dual variable: nu* = -2*x1* = -1.
Optimal value: f(1/2, 1/2) = (1/4) + (1/4) = 1/2.
Verification: ∇f + nu*∇h = (1, 1) + (-1)*(1, 1) = (0, 0). KKT stationarity confirmed. The problem is convex (P = 2I is PD), so KKT is sufficient and this is the unique global optimum.
Problem 3: LP Duality
Given the LP: maximize 3x1 + 5x2 subject to x1 ≤ 4, 2x2 ≤ 12, 3x1 + 5x2 ≤ 25, x1, x2 ≥ 0. Write the dual problem and verify complementary slackness at the optimal solution.
Show Solution
Primal optimal: solve graphically. x1 ≤ 4 and 3x1 + 5x2 ≤ 25 intersect at x1 = 4, x2 = 13/5. Check 2x2 = 26/5 ≤ 12. Yes. Objective: 12 + 13 = 25/... Let us re-examine. At (4, 13/5): f = 3(4) + 5(13/5) = 12 + 13 = 25. At x1=0, x2=6: f = 30. Check 3(0)+5(6)=30 ≤ 25? No, infeasible. At x1=5/3, x2=12/2=6: check 3(5/3)+5(6)=5+30=35 greater than 25, infeasible. The binding constraints at optimum are 2x2=12 (x2=6) and 3x1+5(6)=25 gives 3x1=-5, infeasible. Corner: (0,5): f=25. Corner (4,13/5): f=25. Both give f=25. Optimal value p*=25.
Dual (minimization form): minimize 4y1 + 12y2 + 25y3 subject to y1 + 3y3 ≥ 3, 2y2 + 5y3 ≥ 5, y1, y2, y3 ≥ 0. Dual optimal: y1=0, y2=0, y3=1 gives d* = 25 = p*. Strong duality confirmed.
Complementary slackness at (x1=0, x2=5, y1=0, y2=0, y3=1): Constraint 1 slack = 4-0=4, y1=0 (satisfied). Constraint 3 slack = 25-25=0, y3=1 (active, consistent). Dual constraint 1 slack = 0+3-3=0, x1=0 (satisfied). All complementary slackness conditions hold.
Problem 4: Gradient Descent Convergence
Let f(x) = ||Ax - b||^2 / 2 where A is an m-by-n matrix with full column rank. (a) What is the gradient of f? (b) What is the Lipschitz constant L of ∇f? (c) How many iterations of gradient descent with step size 1/L are needed to achieve epsilon accuracy in objective value, given ||x(0) - x*||^2 = R^2?
Show Solution
(a) ∇f(x) = A^T(Ax - b). The Hessian is ∇^2 f(x) = A^T A.
(b) The Lipschitz constant of ∇f is the largest eigenvalue of the Hessian: L = lambda_max(A^T A) = sigma_max(A)^2, the square of the largest singular value of A.
(c) From the convergence bound for convex L-smooth functions: f(x(k)) - f* ≤ L*R^2 / (2k). Setting this ≤ epsilon gives k ≥ L*R^2 / (2*epsilon) iterations. So O(1/epsilon) iterations are needed.
Note: since A has full column rank, A^T A is positive definite, so f is strongly convex with m = lambda_min(A^T A). The condition number is kappa = L/m = lambda_max / lambda_min of A^T A. With the optimal step size, gradient descent converges linearly at rate (1 - 1/kappa).
Problem 5: Soft Thresholding / LASSO Subproblem
Compute the proximal operator of g(x) = lambda*|x| for lambda greater than 0: that is, solve prox_(lambda*|.|)(v) = argmin_x (lambda*|x| + (1/2)*(x-v)^2). Interpret the result geometrically.
Show Solution
We minimize h(x) = lambda*|x| + (1/2)(x-v)^2. Taking the subdifferential and setting 0 in the subdifferential:
Case 1: x greater than 0. Then h'(x) = lambda + (x-v) = 0, so x = v - lambda. This is positive only when v greater than lambda.
Case 2: x less than 0. Then h'(x) = -lambda + (x-v) = 0, so x = v + lambda. This is negative only when v less than -lambda.
Case 3: x = 0. Optimality requires 0 is in the subdifferential: 0 in (lambda*[-1,1] + (0-v)), meaning v in [-lambda, lambda].
Result (soft thresholding): S_lambda(v) = max(v - lambda, 0) if v greater than 0, min(v + lambda, 0) if v less than 0, equivalently sign(v)*max(|v|-lambda, 0).
Geometrically: soft thresholding shifts v toward zero by lambda; values inside [-lambda, lambda] are set to zero (the "dead zone"). This is what produces sparsity in LASSO solutions.
13. Exam Tips and Common Mistakes
High-Yield Exam Topics
- ★KKT conditions: be able to write them out for any given problem and check whether they are satisfied at a candidate point.
- ★Strong duality: know Slater's condition and be able to derive the dual of a given LP or QP.
- ★Checking convexity: use the second-order condition (Hessian PSD) for differentiable functions; use the definition for non-differentiable ones.
- ★Gradient descent convergence: know O(1/k) for convex L-smooth, linear rate for strongly convex, O(1/k^2) for Nesterov.
- ★Soft thresholding: know the formula and be able to derive it from the proximal operator definition.
Common Mistakes to Avoid
- Forgetting dual feasibility: The dual variable lambda must be non-negative for inequality constraints. Failing to impose lambda ≥ 0 is one of the most common errors in KKT problems.
- Assuming strong duality always holds: Strong duality requires a constraint qualification. Non-convex problems may have a positive duality gap even with Slater's condition violated.
- Confusing the condition number with the Lipschitz constant: L is the Lipschitz constant of the gradient; m is the strong convexity parameter; kappa = L/m is the condition number. Slow convergence is governed by kappa, not L alone.
- Applying gradient descent to non-smooth objectives: Gradient descent requires the objective to be differentiable. For L1 penalties, use proximal gradient or subgradient methods.
- Forgetting that the dual function is always concave: g(lambda, nu) is concave regardless of whether the primal is convex. The dual problem is always a concave maximization.
- Misidentifying feasibility problems: If Slater's condition fails, the primal may still be feasible and have an optimal solution — Slater's condition is about strict feasibility, which implies strong duality, not mere feasibility.
Related Topics
Linear Algebra
Eigenvalues, PSD matrices, and the singular value decomposition are central to understanding convexity, SDP, and algorithm convergence.
Real Analysis
Continuity, differentiability, and compactness theorems justify the existence of optima and the validity of convergence arguments.
Math for ML
Convex optimization powers SVMs, logistic regression, neural network training, and regularization methods throughout machine learning.
Key Takeaways
- →A convex set contains all line segments between its points; convex functions lie below their chords. Together they define a tractable optimization landscape where local optima are global.
- →The Lagrangian relaxes constraints into the objective. The dual problem always yields a lower bound (weak duality); Slater's condition closes this gap to zero (strong duality).
- →KKT conditions are the complete characterization of optimality for convex programs under Slater's condition: stationarity + primal feasibility + dual feasibility + complementary slackness.
- →LP, QP, SOCP, and SDP form a hierarchy of increasingly expressive convex problem classes, all solvable in polynomial time by interior point methods.
- →Gradient descent achieves O(1/k) for smooth convex objectives; Nesterov acceleration achieves O(1/k^2) — optimal for first-order methods. Strong convexity yields linear convergence.
- →Proximal gradient and ADMM extend these ideas to composite and distributed problems, enabling large-scale applications in machine learning, signal processing, and control.