This is summary of Boyd convex optimization. Steepest descent method is a convex optimization algorithm. The normalized steepest descent direction \(x_{nsd}\) is a vector of unit ball of a norm that extends in the direction \(-\nabla f(x)\). The inner product of \(x_{nsd}\) and \(-\nabla f(x)\) is maximized. The first order Taylor approximation of \(f(x+v) = f(x) + \nabla f(x)^{T} v\) is most efficient when \(v = x_{nsd}\).
The \(x_{nsd}\) is unnormalized into \(x_{sd}\). The normalization is ralated with unit ball of norm. When \(x_{nsd}\) is scaled with dual norm of \(-\nabla f(x)\), the second term of Taylor approximation \(\nabla f(x)^{T} x_{sd}\) becomes convex (squre of dual norm of gradient of \(f(x)\)). The unnormalized \(x_{sd}\) the amount of movement of approximation because the inner product of gradient of \(f(x)\) and unnormalized steepest descent direction is squre of dual norm of gradient.
The dual norm of gradient \(\lVert \nabla f(x) \rVert\) is main subject of this post. The simplest dual is a complement of a set. The \((C^c)^c\) is \(C\). If \(C\) is small, \(C^C\) is large and vice versa. The dual cone is related to inner product and non-negativity. Let \(K\) be a cone, The set \(K^{*} = \{y|x^{T}y \geq 0\) for all \(x \in K\}\). If \(K\) is large, \(K^{*}\) is small and vice versa.
The dual norm \(\left\lVert x \right\rVert _{*}\) is \(\sup \{\, x^{T}y \mid \lVert y \rVert \leq 1 \,\}\). If \(x\) direction is long axis of ellypsoid norm, the norm of \(x\) is small. But the dual norm is large because \(\lVert y \rVert _{2}\) large and vice versa.
The main points are the first order Taylor approximation is most efficient with ellypsoid norm when the linear approximation is \(\sup\{\nabla f(x) ^{T} x_{sd}\}\) which is dual norm of gradient of \(f(x)\).