What makes an optimization problem convex?

An optimization problem is convex when three conditions hold simultaneously: the objective function is convex, the equality constraints are affine (linear), and the inequality constraints are convex functions. For minimization problems, this means every local minimum is also a global minimum, which is the fundamental property that makes convex problems tractable.

What are the KKT conditions and when do they apply?

The Karush-Kuhn-Tucker (KKT) conditions are necessary optimality conditions for constrained optimization: (1) Stationarity: the gradient of the Lagrangian is zero at the optimal point; (2) Primal feasibility: all constraints are satisfied; (3) Dual feasibility: Lagrange multipliers for inequality constraints are non-negative; (4) Complementary slackness: each inequality multiplier times the corresponding constraint is zero. For convex problems with Slater's condition satisfied, the KKT conditions are also sufficient.

What is strong duality and when does it hold?

Strong duality means the optimal value of the primal problem equals the optimal value of the dual problem, so the duality gap is zero. It holds for convex problems when a constraint qualification is satisfied. The most common qualification is Slater's condition: there exists a strictly feasible point that satisfies all inequality constraints strictly. Strong duality is the foundation for dual decomposition and many optimization algorithms.

What is the convergence rate of gradient descent for convex functions?

For L-smooth convex functions with step size 1/L, gradient descent achieves O(1/k) convergence in objective value after k iterations. For strongly convex functions (with parameter m), it achieves linear (geometric) convergence at rate (1 - m/L)^k. Nesterov's accelerated gradient method improves the smooth convex rate to O(1/k^2), which is optimal for first-order methods on this function class.

How does the simplex method work for linear programming?

The simplex method exploits the fact that for a bounded LP, an optimal solution exists at a vertex of the feasible polytope. Starting from an initial basic feasible solution (vertex), the method moves to adjacent vertices along edges of the polytope, increasing the objective at each step. A pivot operation swaps one variable into and one out of the basis. Although exponential in the worst case, the simplex method is highly efficient in practice.

What is ADMM and where is it used?

ADMM (Alternating Direction Method of Multipliers) is an operator splitting algorithm that solves problems of the form: minimize f(x) + g(z) subject to Ax + Bz = c. It alternates between minimizing over x and z separately, then updates a dual variable. ADMM combines the decomposability of dual ascent with the convergence properties of the method of multipliers. It is widely used in distributed optimization, statistical learning (LASSO, group LASSO), and image processing.

What is semidefinite programming and what problems does it solve?

Semidefinite programming (SDP) optimizes a linear objective over the cone of positive semidefinite matrices subject to linear matrix inequality constraints. SDPs generalize linear programming and can model a wide range of problems including: eigenvalue optimization, maximum cut relaxations, control design (finding Lyapunov functions), combinatorial optimization relaxations, and sum-of-squares polynomial optimization. Interior point methods solve SDPs in polynomial time.

Advanced Mathematics

Convex Optimization: Theory, Algorithms, and Applications

Convex optimization is the study of minimizing convex functions over convex sets. Its power lies in a fundamental guarantee: every local minimum is a global minimum. This makes convex problems tractable and explains why they appear throughout machine learning, engineering, finance, and operations research.

Learning Objectives

✓Define convex sets and verify convexity using standard operations
✓Characterize convex functions via epigraphs, Jensen's inequality, and second-order conditions
✓State and apply the Lagrangian, weak duality, and strong duality
✓Derive and verify KKT conditions for constrained problems
✓Understand the simplex method and interior point methods for LP
✓Analyze convergence of gradient descent and Nesterov acceleration
✓Apply proximal gradient and ADMM to composite objectives
✓Recognize convex structure in SVMs, LASSO, and portfolio optimization

1. Convex Sets

A set C in R^n is convex if the line segment between any two points in C lies entirely within C. This geometric condition underpins all of convex optimization theory.

Formal Definition

A set C is convex if for all x, y in C and all t in [0, 1]:

t*x + (1-t)*y is in C

The point t*x + (1-t)*y is called a convex combination of x and y. More generally, a convex combination of k points x1, ..., xk uses non-negative weights t1, ..., tk summing to 1.

Examples of Convex Sets

Always Convex

Affine sets (lines, planes, hyperplanes)
Halfspaces: (x : a^T x ≤ b)
Polyhedra: (x : Ax ≤ b, Cx = d)
Euclidean balls: (x : ||x - c|| ≤ r)
Ellipsoids: (x : (x-c)^T P^(-1) (x-c) ≤ 1)
Positive semidefinite cone S^n_+

Not Convex

Non-convex: the union of two disjoint balls
Non-convex: the set of rank-1 matrices
Non-convex: the integer lattice Z^n
Non-convex: the complement of a convex set
Non-convex: annuli (rings between two circles)

Operations Preserving Convexity

Convexity is preserved under several fundamental operations, which allows building complex convex sets from simple ones:

Intersection: If C1 and C2 are convex, so is C1 intersect C2 (and any arbitrary intersection of convex sets).
Affine image: If C is convex and f(x) = Ax + b is affine, then f(C) is convex. Likewise, the preimage f^(-1)(C) of a convex set under an affine map is convex.
Sum (Minkowski sum): C1 + C2 = (x + y : x in C1, y in C2) is convex when both sets are convex.
Perspective image: The perspective map P(z, t) = z/t (t greater than 0) maps convex sets to convex sets.
Linear fractional map: f(x) = (Ax + b) / (c^T x + d) preserves convexity on domains where c^T x + d is positive.

Convex Hull

The convex hull of a set S, written conv(S), is the smallest convex set containing S. Equivalently, it is the set of all convex combinations of points in S. For a finite set of points v1, ..., vk, the convex hull is the polytope formed by all points t1*v1 + ... + tk*vk where each ti ≥ 0 and the sum of ti equals 1.

Supporting Hyperplanes

A supporting hyperplane of a convex set C at a boundary point x0 is a hyperplane H = (x : a^T x = a^T x0) such that C lies entirely in one halfspace: a^T x ≤ a^T x0 for all x in C. The supporting hyperplane theorem guarantees that every boundary point of a convex set has at least one supporting hyperplane. This result underlies the separation theorem and duality theory.

Separating Hyperplane Theorem

If C and D are disjoint convex sets, there exists a hyperplane that separates them: there exist a and b such that a^T x ≤ b for all x in C and a^T x ≥ b for all x in D. Strong separation (with strict inequalities) holds when both sets are closed and at least one is compact.

2. Convex Functions

A function f: R^n to R is convex if its domain is a convex set and for all x, y in the domain and t in [0, 1]:

f(t*x + (1-t)*y) ≤ t*f(x) + (1-t)*f(y)

Geometrically: the chord between any two points on the graph lies above or on the graph. If the inequality is strict for x ≠ y and t in (0,1), the function is strictly convex.

The Epigraph

The epigraph of f is the set of points on and above its graph: epi(f) = (x, t) : f(x) ≤ t. A function is convex if and only if its epigraph is a convex set. This geometric characterization connects convex functions to convex sets and allows analysis tools from set theory to apply to functions.

Jensen's Inequality

For a convex function f and a random variable X with finite expectation:

f(E[X]) ≤ E[f(X)]

This is one of the most important inequalities in mathematics. Applications include: proving the AM-GM inequality (f(x) = -log x is convex), establishing information-theoretic bounds, and deriving the log-sum inequality used in EM algorithms.

First-Order Condition

For a differentiable function f, convexity is equivalent to the first-order condition: f lies above all its tangent planes. Formally, f is convex if and only if for all x, y in the domain:

f(y) ≥ f(x) + ∇f(x)^T (y - x)

This means the gradient ∇f(x) provides a global underestimate of f. A critical point (where ∇f(x) = 0) is automatically a global minimum for convex functions — this is the key property exploited by optimization algorithms.

Second-Order Condition

For a twice-differentiable function f on R^n, f is convex if and only if its Hessian matrix is positive semidefinite everywhere on its domain:

∇²f(x) is positive semidefinite (PSD) for all x in dom(f)

Meaning: for all x and all directions v, v^T ∇²f(x) v ≥ 0. If the Hessian is strictly positive definite (all eigenvalues positive), then f is strictly convex. Common examples: f(x) = x^2 has Hessian 2 (PSD); f(x, y) = x^2 + y^2 has Hessian 2I (PSD); f(x) = log(x) has Hessian -1/x^2 (negative), so it is concave.

Strong Convexity

A function f is m-strongly convex (m greater than 0) if:

f(y) ≥ f(x) + ∇f(x)^T (y - x) + (m/2) * ||y - x||²

Strong convexity means f curves up at least as fast as a quadratic. Equivalent characterization: ∇²f(x) is at least mI in the PSD order. Strong convexity guarantees a unique global minimum and linear convergence for gradient descent.

Operations Preserving Convexity of Functions

Non-negative sum: if f and g are convex, so is alpha*f + beta*g for alpha, beta ≥ 0.
Affine composition: f(Ax + b) is convex when f is convex.
Pointwise maximum: max(f, g) is convex; more generally, the supremum of any collection of convex functions is convex.
Partial minimization: g(x) = inf_y f(x, y) is convex in x when f is jointly convex in (x, y).
Composition rules: h(f(x)) is convex when h is convex non-decreasing and f is convex, or when h is convex non-increasing and f is concave.

3. Optimization Problem Standard Forms

A convex optimization problem has the standard form: minimize f0(x) subject to fi(x) ≤ 0 for i = 1, ..., m and hi(x) = 0 for i = 1, ..., p, where f0 and all fi are convex and all hi are affine.

Feasibility and Optimal Value

Feasible Set

The feasible set D is the set of all points satisfying the constraints. An optimization problem is feasible if D is non-empty; infeasible otherwise. A problem is bounded below if f0 has a finite infimum over D. The optimal value p* is the infimum of f0 over D, which may be -infinity (unbounded) or not attained.

Optimality

A feasible point x* is optimal if f0(x*) = p*. For convex problems, the set of optimal points is convex (and is a singleton when f0 is strictly convex). Local minima are global minima — this is what makes convex problems solvable efficiently.

The Lagrangian

The Lagrangian L(x, lambda, nu) incorporates the constraints into the objective using multipliers:

L(x, lambda, nu) = f0(x) + sum_i lambda_i * fi(x) + sum_j nu_j * hj(x)

where lambda_i ≥ 0 are the dual variables (Lagrange multipliers) for the inequality constraints, and nu_j are the dual variables for the equality constraints (which can be any sign). The Lagrangian provides a lower bound on the optimal value: for lambda ≥ 0 and any feasible x, L(x, lambda, nu) ≤ f0(x).

Optimality Conditions (Unconstrained)

For unconstrained minimization of a differentiable convex f, x* is optimal if and only if ∇f(x*) = 0. For a non-differentiable convex function, x* is optimal if and only if 0 is in the subdifferential of f at x*, where the subdifferential is the set of all subgradients — vectors g satisfying f(y) ≥ f(x) + g^T (y - x) for all y.

4. Lagrangian Duality

Duality theory provides a systematic way to derive lower bounds on the optimal value, verify optimality, and sometimes solve problems more efficiently via their dual formulations.

The Lagrange Dual Function

The dual function g(lambda, nu) is the infimum of the Lagrangian over all x:

g(lambda, nu) = inf_x L(x, lambda, nu)

The dual function is always concave (as the infimum of affine functions of (lambda, nu)), even when the primal problem is non-convex. For any lambda ≥ 0 and any nu, we have g(lambda, nu) ≤ p* — the dual function provides a lower bound on the optimal primal value.

Weak and Strong Duality

Weak Duality

Always holds: d* ≤ p*, where d* is the optimal dual value (the supremum of g over lambda ≥ 0 and all nu). The difference p* - d* ≥ 0 is called the duality gap. Weak duality applies to any optimization problem, convex or not.

Strong Duality

The duality gap is zero: d* = p*. For convex problems, strong duality holds when a constraint qualification is satisfied. The most common is Slater's condition. Strong duality means solving the dual yields the primal optimum.

Slater's Condition

Slater's condition holds if there exists a strictly feasible point: a point x in the interior of the domain with fi(x) less than 0 for all i (strict inequality for all inequality constraints) and hi(x) = 0 for all equality constraints.

When Slater's condition holds for a convex problem, strong duality holds and the dual optimal is attained. This is the most commonly used condition in practice — most well-posed convex programs satisfy it automatically.

Complementary Slackness

If strong duality holds and x*, (lambda*, nu*) are primal and dual optimal, then complementary slackness holds:

lambda_i* * fi(x*) = 0 for all i = 1, ..., m

This means for each inequality constraint: either the constraint is active (fi(x*) = 0) or the dual variable is zero (lambda_i* = 0) — they cannot both be nonzero simultaneously. This condition is used to identify which constraints are active at the optimum and to verify optimality.

5. KKT Conditions

The Karush-Kuhn-Tucker (KKT) conditions are the cornerstone of constrained optimization theory. They unify primal feasibility, dual feasibility, stationarity, and complementary slackness into a single system.

The Four KKT Conditions

1. Primal feasibility

fi(x*) ≤ 0 for all i, hj(x*) = 0 for all j

2. Dual feasibility

lambda_i* ≥ 0 for all i

3. Complementary slackness

lambda_i* * fi(x*) = 0 for all i

4. Stationarity

∇f0(x*) + sum_i lambda_i* ∇fi(x*) + sum_j nu_j* ∇hj(x*) = 0

Necessity and Sufficiency

Necessary Conditions

If x* is a local minimum of a smooth problem and a constraint qualification holds (e.g., LICQ: the gradients of active constraints are linearly independent), then there exist multipliers such that all four KKT conditions hold at x*. Without a constraint qualification, KKT may fail even at an optimum.

Sufficient Conditions

For convex problems, the KKT conditions are both necessary and sufficient (under Slater's condition). If x*, (lambda*, nu*) satisfy all four KKT conditions and the problem is convex, then x* is globally optimal. This makes KKT a complete characterization of optimality for convex programs.

KKT Example: Equality-Constrained QP

Minimize (1/2) x^T P x + q^T x subject to Ax = b (P PSD). The Lagrangian is L(x, nu) = (1/2) x^T P x + q^T x + nu^T (Ax - b). Stationarity gives Px + q + A^T nu = 0. Combining with Ax = b:

[ P A^T ] [ x* ] [ -q ]

[ A 0 ] [ nu* ] = [ b ]

This KKT system is a linear system in x* and nu*. When P is positive definite, the system has a unique solution.

6. Linear Programming

Linear programming (LP) minimizes a linear objective over a polyhedron defined by linear constraints. It is the simplest and most widely applied class of convex optimization.

Standard Form

minimize c^T x

subject to: Ax = b, x ≥ 0

Any LP can be converted to this form. The feasible set is a polytope (bounded) or polyhedron (unbounded). If an optimal solution exists for a bounded LP, it occurs at a vertex.

LP Duality

The dual of the standard-form LP is:

maximize b^T y

subject to: A^T y ≤ c

Strong duality always holds for LP (when both primal and dual are feasible). The optimal primal and dual values are equal. This is proved via the fundamental theorem of linear programming and is a special case of the general duality theory.

The Simplex Method

The simplex method is the classical algorithm for LP. It exploits the fact that the optimal solution (if finite) is attained at a vertex of the feasible polytope. A basic feasible solution corresponds to a vertex — it is determined by n - m basic variables (in an m-by-n constraint matrix) set to specific values while m variables are zero (non-basic). The algorithm:

1. Start at an initial basic feasible solution (vertex).
2. Compute reduced costs for non-basic variables.
3. If all reduced costs are non-negative, current solution is optimal.
4. Otherwise, select an entering variable (negative reduced cost).
5. Perform a pivot: move along an edge to an adjacent vertex.
6. Repeat from step 2.

Simplex Complexity

In the worst case, the simplex method visits exponentially many vertices (Klee-Minty examples). However, in practice it typically terminates in O(m) to O(3m) pivots, making it highly efficient. Degeneracy (multiple optimal vertices) requires anti-cycling rules such as Bland's rule.

Interior Point Methods for LP

Interior point (barrier) methods traverse the interior of the feasible set rather than its boundary. The logarithmic barrier method adds a barrier term to keep iterates strictly feasible:

minimize c^T x - (1/t) * sum_i log(-fi(x))

As t increases, solutions of these barrier problems approach the optimal solution of the original LP along the central path. Interior point methods have polynomial worst-case complexity O(n^3.5 * log(1/epsilon)) and are preferred for large, dense LPs. They also generalize to SOCP and SDP, unlike the simplex method.

7. Quadratic Programming

Quadratic programs (QPs) minimize a quadratic objective over a polyhedron. When the objective is convex (positive semidefinite Hessian), the problem is tractable and widely used in machine learning and control.

Standard Form

minimize (1/2) x^T P x + q^T x + r

subject to: Gx ≤ h, Ax = b

Convex when P is positive semidefinite. Strictly convex (unique optimum) when P is positive definite. LP is a special case (P = 0).

Active Set Methods

Active set methods maintain a working set of constraints treated as equalities. At each step they solve an equality-constrained QP (the KKT system) and update the active set by adding or removing constraints. They are efficient when the number of active constraints at the optimum is small and work well for warm-starting with an initial estimate of the active set.

Applications in Machine Learning

Support Vector Machines (SVMs): The hard-margin SVM minimizes ||w||^2 / 2 subject to yi(w^T xi + b) ≥ 1. The dual is a QP with non-negativity constraints on the dual variables (alpha_i), which is often more tractable and enables the kernel trick.
Soft-margin SVM: Adds slack variables xi_i: minimize ||w||^2 / 2 + C * sum xi_i subject to yi(w^T xi + b) ≥ 1 - xi_i and xi_i ≥ 0. The regularization parameter C controls the tradeoff between margin maximization and misclassification.
Ridge regression: Minimizes ||Ax - b||^2 + lambda * ||x||^2. The closed-form solution (A^T A + lambda I)^(-1) A^T b is a QP with no inequality constraints.

8. SOCP and Semidefinite Programming

Second-Order Cone Programming (SOCP)

SOCP minimizes a linear objective subject to second-order cone constraints:

minimize f^T x

subject to: ||Ai*x + bi|| ≤ ci^T x + di, Fx = g

The second-order cone in R^(n+1) is (x, t) : ||x|| ≤ t. SOCP generalizes LP and QP: every LP and QP with PSD objective can be recast as an SOCP. Applications include robust LP, antenna array weight design, filter design, and portfolio optimization with transaction costs.

Semidefinite Programming (SDP)

SDP minimizes a linear objective subject to linear matrix inequality (LMI) constraints, meaning certain symmetric matrix expressions must remain positive semidefinite:

minimize c^T x

subject to: F(x) = F0 + x1*F1 + ... + xn*Fn is PSD

where F0, ..., Fn are symmetric matrices. SDP generalizes LP (by replacing the non-negativity constraint on a scalar with PSD on a 1x1 matrix). The feasible set is the intersection of an affine subspace with the PSD cone — a convex but not polyhedral set.

Key SDP Applications

Eigenvalue optimization: Minimizing the largest eigenvalue of a symmetric affine matrix function is an SDP.
Max-cut relaxation: The Goemans-Williamson SDP relaxation gives a 0.878 approximation guarantee for max-cut, better than any purely combinatorial algorithm known.
Lyapunov stability: Finding a Lyapunov function V(x) = x^T P x for a linear system requires P to satisfy LMIs, formulated as an SDP.
Sum-of-squares: Verifying that a polynomial is non-negative can be cast as finding an SOS (sum of squares) decomposition, which is an SDP.

9. Gradient Descent and Convergence Analysis

Gradient-based methods are the workhorses of large-scale convex and non-convex optimization, particularly in machine learning. Their convergence guarantees are well-understood for convex objectives.

Gradient Descent

x(k+1) = x(k) - alpha_k * ∇f(x(k))

Each iteration moves in the direction of steepest descent. The step size alpha_k (learning rate) controls how far to move. A fixed step size alpha = 1/L works when f is L-smooth: the gradient is Lipschitz with constant L, meaning ∇f changes at most L per unit distance.

Convergence Rate Analysis

Convex, L-smooth

With step size alpha = 1/L:

f(x(k)) - f* ≤ L * ||x(0) - x*||^2 / (2k)

O(1/k) convergence in objective value. To reach epsilon accuracy requires O(1/epsilon) iterations.

m-Strongly Convex, L-smooth

With optimal step size alpha = 2/(m+L):

||x(k) - x*||^2 ≤ ((L-m)/(L+m))^(2k) * ||x(0)-x*||^2

Linear (geometric) convergence. Condition number kappa = L/m governs the rate; larger kappa means slower convergence.

Step Size Selection

Fixed step size: alpha = 1/L works when L is known. Simple but may be conservative.
Backtracking line search: Start with alpha = 1, reduce by factor beta (e.g., 0.5) until the Armijo condition f(x - alpha*∇f(x)) ≤ f(x) - alpha*(alpha/2)*||∇f(x)||^2 holds. Adapts automatically and avoids needing L.
Diminishing step sizes: For subgradient methods on non-smooth problems, step sizes alpha_k = c/sqrt(k) guarantee O(1/sqrt(k)) convergence to the optimal value.
Barzilai-Borwein: Use a step size that approximates the inverse Hessian locally. Cheap, no line search, often dramatically faster in practice.

Nesterov Accelerated Gradient Method

Nesterov's method achieves O(1/k^2) convergence for smooth convex functions — provably optimal for first-order methods. It adds a momentum term that extrapolates beyond the current iterate:

y(k) = x(k) + (tk - 1) / t(k+1) * (x(k) - x(k-1))

x(k+1) = y(k) - (1/L) * ∇f(y(k))

t(k+1) = (1 + sqrt(1 + 4*tk^2)) / 2

The momentum coefficient (tk - 1) / t(k+1) approaches 1 as k grows. The method is sometimes called FISTA (Fast ISTA) in the proximal gradient context. It achieves the optimal O(1/k^2) rate for smooth convex problems — the same rate as conjugate gradient for quadratics.

Stochastic Gradient Descent (SGD)

For problems of the form minimize (1/n) sum_i fi(x), SGD replaces the full gradient with a stochastic gradient using a single sample or mini-batch. The update is x(k+1) = x(k) - alpha_k * ∇fi(k)(x(k)) where i(k) is a randomly chosen index. SGD has O(1/sqrt(k)) convergence for convex problems with diminishing step sizes but can be faster per iteration when n is large. Variance reduction methods (SVRG, SAGA) recover linear convergence for strongly convex problems.

10. Proximal and Splitting Methods

Many practical problems have non-smooth components (e.g., the L1 penalty in LASSO) that gradient descent cannot handle directly. Proximal methods generalize gradient descent to non-smooth objectives by replacing gradient steps with proximal operator evaluations.

The Proximal Operator

prox_(alpha*g)(v) = argmin_x (g(x) + (1/(2*alpha)) * ||x - v||^2)

The proximal operator of a function g finds the point that balances minimizing g with staying close to v. Key examples:

L1 norm g(x) = lambda*||x||_1: prox is soft thresholding — S_lambda(v)_i = sign(vi) * max(|vi| - lambda, 0).
Indicator of convex set C: prox is the Euclidean projection onto C: proj_C(v) = argmin_(x in C) ||x-v||.
Nuclear norm: prox performs soft thresholding of singular values.

Proximal Gradient Method (ISTA)

For composite objectives f(x) = g(x) + h(x) where g is smooth and convex and h is convex but possibly non-smooth (and with a tractable prox), the proximal gradient method alternates a gradient step on g and a proximal step on h:

x(k+1) = prox_(alpha*h)(x(k) - alpha * ∇g(x(k)))

This achieves O(1/k) convergence for convex g, and O(1/k^2) with Nesterov acceleration (FISTA). For LASSO (g = (1/2)||Ax-b||^2, h = lambda*||x||_1), the prox step is soft thresholding and the gradient step is a residual update.

ADMM: Alternating Direction Method of Multipliers

ADMM solves problems of the form: minimize f(x) + g(z) subject to Ax + Bz = c. The augmented Lagrangian is:

L_rho = f(x) + g(z) + y^T (Ax + Bz - c) + (rho/2) * ||Ax+Bz-c||^2

The ADMM iterations are:

x(k+1) = argmin_x L_rho(x, z(k), y(k)) [x-update]

z(k+1) = argmin_z L_rho(x(k+1), z, y(k)) [z-update]

y(k+1) = y(k) + rho * (A*x(k+1) + B*z(k+1) - c) [dual update]

ADMM converges to the optimal solution under mild conditions (f and g convex). Each sub-problem involves only one block of variables, enabling distributed computation. Applications include LASSO, group LASSO, basis pursuit, distributed MPC, and consensus optimization across networks.

Douglas-Rachford Splitting

Douglas-Rachford splitting solves minimize f(x) + g(x) using the resolvent (proximal) operators of both f and g alternately. Unlike ADMM, it does not require the splitting variable z or a coupling constraint. It is particularly useful when both functions have efficiently computable proximal operators (e.g., projection onto two convex sets).

11. Applications in Machine Learning and Engineering

LASSO (L1 Regularized Regression)

minimize (1/2) * ||Ax - b||^2 + lambda * ||x||_1

The L1 penalty promotes sparsity: the solution tends to have many zero entries. This happens because the L1 ball has corners along coordinate axes where the objective gradient is most likely to align. LASSO is widely used for feature selection and compressed sensing. It can be solved via ISTA/FISTA or coordinate descent.

Portfolio Optimization

Markowitz mean-variance portfolio optimization:

minimize w^T * Sigma * w - mu^T w

subject to: 1^T w = 1, w ≥ 0 (long-only)

where Sigma is the return covariance matrix and mu is the vector of expected returns. This is a convex QP. The efficient frontier is the set of optimal (risk, return) pairs as the risk-aversion parameter varies. Extensions include transaction cost penalties (SOCP) and factor model constraints.

Compressed Sensing

Recover a sparse signal x from m measurements y = Ax (m much less than n):

minimize ||x||_1 subject to Ax = b

This is the basis pursuit problem (an LP). Under the restricted isometry property (RIP) on A, basis pursuit recovers the sparsest solution exactly. The Dantzig selector and LASSO are related convex relaxations for noisy measurements.

Control: Model Predictive Control (MPC)

MPC solves a convex QP or SOCP at each time step to compute optimal control inputs over a finite horizon, subject to state and input constraints. The quadratic cost penalizes deviations from the reference trajectory and large inputs. Only the first control action is applied; the problem is re-solved at the next step (receding horizon). ADMM and interior point methods are used for real-time MPC.

Summary Table: Applications and Problem Types

Application	Problem Type	Typical Algorithm
Linear SVM	QP	SMO, Interior Point
LASSO	Composite Convex	ISTA/FISTA, Coord. Descent
Ridge Regression	Unconstrained QP	Closed form, Gradient Descent
Portfolio (Markowitz)	QP	Interior Point
Compressed Sensing	LP (Basis Pursuit)	ADMM, Interior Point
Network Flow	LP	Simplex, Network Simplex
Robust Optimization	SOCP	Interior Point
Lyapunov Stability	SDP	Interior Point (SDPT3, SCS)

12. Practice Problems with Solutions

Problem 1: Verifying Convexity

Determine whether each function is convex, concave, or neither: (a) f(x) = e^x on R; (b) f(x) = log(x) on R_++; (c) f(x, y) = x^2 / y on (x, y : y greater than 0); (d) f(x) = ||x||^2 on R^n.

Show Solution

(a) f(x) = e^x: f''(x) = e^x greater than 0 for all x, so the Hessian is positive — f is strictly convex.

(b) f(x) = log(x): f''(x) = -1/x^2 less than 0 for x greater than 0, so f is strictly concave.

(c) f(x, y) = x^2 / y: This is a ratio of a quadratic and a linear function. The Hessian is H = (2/y) * [1, -x/y; -x/y, x^2/y^2]. The eigenvalues are 0 and 2(x^2 + y^2)/y^3 greater than 0, so H is PSD — f is convex. (It can also be seen as a perspective function of g(x) = x^2, which is convex.)

(d) f(x) = ||x||^2: The Hessian is 2I, which is positive definite — f is strictly convex.

Problem 2: KKT Conditions

Solve using KKT conditions: minimize x1^2 + x2^2 subject to x1 + x2 = 1. Find the optimal solution and verify the KKT conditions.

Show Solution

The Lagrangian is L(x1, x2, nu) = x1^2 + x2^2 + nu*(x1 + x2 - 1).

Stationarity: ∂L/∂x1 = 2x1 + nu = 0, so x1 = -nu/2. Similarly ∂L/∂x2 = 2x2 + nu = 0, so x2 = -nu/2. Therefore x1 = x2.

Primal feasibility: x1 + x2 = 1, combined with x1 = x2 gives x1* = x2* = 1/2.

Dual variable: nu* = -2*x1* = -1.

Optimal value: f(1/2, 1/2) = (1/4) + (1/4) = 1/2.

Verification: ∇f + nu*∇h = (1, 1) + (-1)*(1, 1) = (0, 0). KKT stationarity confirmed. The problem is convex (P = 2I is PD), so KKT is sufficient and this is the unique global optimum.

Problem 3: LP Duality

Given the LP: maximize 3x1 + 5x2 subject to x1 ≤ 4, 2x2 ≤ 12, 3x1 + 5x2 ≤ 25, x1, x2 ≥ 0. Write the dual problem and verify complementary slackness at the optimal solution.

Show Solution

Primal optimal: solve graphically. x1 ≤ 4 and 3x1 + 5x2 ≤ 25 intersect at x1 = 4, x2 = 13/5. Check 2x2 = 26/5 ≤ 12. Yes. Objective: 12 + 13 = 25/... Let us re-examine. At (4, 13/5): f = 3(4) + 5(13/5) = 12 + 13 = 25. At x1=0, x2=6: f = 30. Check 3(0)+5(6)=30 ≤ 25? No, infeasible. At x1=5/3, x2=12/2=6: check 3(5/3)+5(6)=5+30=35 greater than 25, infeasible. The binding constraints at optimum are 2x2=12 (x2=6) and 3x1+5(6)=25 gives 3x1=-5, infeasible. Corner: (0,5): f=25. Corner (4,13/5): f=25. Both give f=25. Optimal value p*=25.

Dual (minimization form): minimize 4y1 + 12y2 + 25y3 subject to y1 + 3y3 ≥ 3, 2y2 + 5y3 ≥ 5, y1, y2, y3 ≥ 0. Dual optimal: y1=0, y2=0, y3=1 gives d* = 25 = p*. Strong duality confirmed.

Complementary slackness at (x1=0, x2=5, y1=0, y2=0, y3=1): Constraint 1 slack = 4-0=4, y1=0 (satisfied). Constraint 3 slack = 25-25=0, y3=1 (active, consistent). Dual constraint 1 slack = 0+3-3=0, x1=0 (satisfied). All complementary slackness conditions hold.

Problem 4: Gradient Descent Convergence

Let f(x) = ||Ax - b||^2 / 2 where A is an m-by-n matrix with full column rank. (a) What is the gradient of f? (b) What is the Lipschitz constant L of ∇f? (c) How many iterations of gradient descent with step size 1/L are needed to achieve epsilon accuracy in objective value, given ||x(0) - x*||^2 = R^2?

Show Solution

(a) ∇f(x) = A^T(Ax - b). The Hessian is ∇^2 f(x) = A^T A.

(b) The Lipschitz constant of ∇f is the largest eigenvalue of the Hessian: L = lambda_max(A^T A) = sigma_max(A)^2, the square of the largest singular value of A.

(c) From the convergence bound for convex L-smooth functions: f(x(k)) - f* ≤ L*R^2 / (2k). Setting this ≤ epsilon gives k ≥ L*R^2 / (2*epsilon) iterations. So O(1/epsilon) iterations are needed.

Note: since A has full column rank, A^T A is positive definite, so f is strongly convex with m = lambda_min(A^T A). The condition number is kappa = L/m = lambda_max / lambda_min of A^T A. With the optimal step size, gradient descent converges linearly at rate (1 - 1/kappa).

Problem 5: Soft Thresholding / LASSO Subproblem

Compute the proximal operator of g(x) = lambda*|x| for lambda greater than 0: that is, solve prox_(lambda*|.|)(v) = argmin_x (lambda*|x| + (1/2)*(x-v)^2). Interpret the result geometrically.

Show Solution

We minimize h(x) = lambda*|x| + (1/2)(x-v)^2. Taking the subdifferential and setting 0 in the subdifferential:

Case 1: x greater than 0. Then h'(x) = lambda + (x-v) = 0, so x = v - lambda. This is positive only when v greater than lambda.

Case 2: x less than 0. Then h'(x) = -lambda + (x-v) = 0, so x = v + lambda. This is negative only when v less than -lambda.

Case 3: x = 0. Optimality requires 0 is in the subdifferential: 0 in (lambda*[-1,1] + (0-v)), meaning v in [-lambda, lambda].

Result (soft thresholding): S_lambda(v) = max(v - lambda, 0) if v greater than 0, min(v + lambda, 0) if v less than 0, equivalently sign(v)*max(|v|-lambda, 0).

Geometrically: soft thresholding shifts v toward zero by lambda; values inside [-lambda, lambda] are set to zero (the "dead zone"). This is what produces sparsity in LASSO solutions.

13. Exam Tips and Common Mistakes

High-Yield Exam Topics

★KKT conditions: be able to write them out for any given problem and check whether they are satisfied at a candidate point.
★Strong duality: know Slater's condition and be able to derive the dual of a given LP or QP.
★Checking convexity: use the second-order condition (Hessian PSD) for differentiable functions; use the definition for non-differentiable ones.
★Gradient descent convergence: know O(1/k) for convex L-smooth, linear rate for strongly convex, O(1/k^2) for Nesterov.
★Soft thresholding: know the formula and be able to derive it from the proximal operator definition.

Common Mistakes to Avoid

Forgetting dual feasibility: The dual variable lambda must be non-negative for inequality constraints. Failing to impose lambda ≥ 0 is one of the most common errors in KKT problems.
Assuming strong duality always holds: Strong duality requires a constraint qualification. Non-convex problems may have a positive duality gap even with Slater's condition violated.
Confusing the condition number with the Lipschitz constant: L is the Lipschitz constant of the gradient; m is the strong convexity parameter; kappa = L/m is the condition number. Slow convergence is governed by kappa, not L alone.
Applying gradient descent to non-smooth objectives: Gradient descent requires the objective to be differentiable. For L1 penalties, use proximal gradient or subgradient methods.
Forgetting that the dual function is always concave: g(lambda, nu) is concave regardless of whether the primal is convex. The dual problem is always a concave maximization.
Misidentifying feasibility problems: If Slater's condition fails, the primal may still be feasible and have an optimal solution — Slater's condition is about strict feasibility, which implies strong duality, not mere feasibility.

Key Takeaways

→A convex set contains all line segments between its points; convex functions lie below their chords. Together they define a tractable optimization landscape where local optima are global.
→The Lagrangian relaxes constraints into the objective. The dual problem always yields a lower bound (weak duality); Slater's condition closes this gap to zero (strong duality).
→KKT conditions are the complete characterization of optimality for convex programs under Slater's condition: stationarity + primal feasibility + dual feasibility + complementary slackness.
→LP, QP, SOCP, and SDP form a hierarchy of increasingly expressive convex problem classes, all solvable in polynomial time by interior point methods.
→Gradient descent achieves O(1/k) for smooth convex objectives; Nesterov acceleration achieves O(1/k^2) — optimal for first-order methods. Strong convexity yields linear convergence.
→Proximal gradient and ADMM extend these ideas to composite and distributed problems, enabling large-scale applications in machine learning, signal processing, and control.

Convex Optimization: Theory, Algorithms, and Applications

Learning Objectives

1. Convex Sets

Formal Definition

Examples of Convex Sets

Operations Preserving Convexity

Convex Hull

Supporting Hyperplanes

2. Convex Functions

The Epigraph

Jensen's Inequality

First-Order Condition

Second-Order Condition

Strong Convexity

Operations Preserving Convexity of Functions

3. Optimization Problem Standard Forms

Feasibility and Optimal Value

The Lagrangian

Optimality Conditions (Unconstrained)

4. Lagrangian Duality

The Lagrange Dual Function

Weak and Strong Duality

Slater's Condition

Complementary Slackness

5. KKT Conditions

The Four KKT Conditions

Necessity and Sufficiency

KKT Example: Equality-Constrained QP

6. Linear Programming

Standard Form

LP Duality

The Simplex Method

Interior Point Methods for LP

7. Quadratic Programming

Standard Form

Active Set Methods

Applications in Machine Learning

8. SOCP and Semidefinite Programming

Second-Order Cone Programming (SOCP)

Semidefinite Programming (SDP)

Key SDP Applications

9. Gradient Descent and Convergence Analysis

Gradient Descent

Convergence Rate Analysis

Step Size Selection

Nesterov Accelerated Gradient Method

Stochastic Gradient Descent (SGD)

10. Proximal and Splitting Methods

The Proximal Operator

Proximal Gradient Method (ISTA)

ADMM: Alternating Direction Method of Multipliers

Douglas-Rachford Splitting

11. Applications in Machine Learning and Engineering

LASSO (L1 Regularized Regression)

Portfolio Optimization

Compressed Sensing

Control: Model Predictive Control (MPC)

Summary Table: Applications and Problem Types

12. Practice Problems with Solutions

13. Exam Tips and Common Mistakes

High-Yield Exam Topics

Common Mistakes to Avoid

Related Topics

Key Takeaways