There have been many claims of “breaking the state of the art” in the generative adversarial networks community lately, and it can be difficult to separate hyperbole from genuine innovation in GANs. In recent weeks, however, an OpenReview preprint caught the attention of Ian Goodfellow, the inventor of GANs. Goodfellow was impressed that the authors’ method managed to generate samples from all 1000 ImageNet classes simultaneously, the first demonstration of this feat. Lack of sample diversity is a serious problem for GANs, so this is a major milestone.

CIFAR-10 samples from my PyTorch implementation of spectral normalization GANs (SN-GAN).

Spectral normalization is a deceptively simple concept, so let’s go through the argument outlined in the paper.

The centrality of Lipschitz continuity in GANs

Definition of Lipschitz continuity

Lipschitz continuity sounds like an arcane term only introduced in a mathematical analysis course, but it is actually a straightforward “nice” property of functions—even simpler to explain, perhaps, than the usual epsilon-delta continuity.

Suppose we have a GAN discriminator \(D : I \rightarrow \mathbb{R}\), where \(I\) is the space of images (e.g., \(\mathbb{R}^{32\times32}\)). Because both the domain and codomain of this function have inner products, we have a natural metric (distance function) in both spaces: the L2 distance.

If our discriminator is \(K\)-Lipschitz continuous, then for all \(x\) and \(y\) in \(I\),

\begin{equation*} ||D(x) - D(y)|| \leq K ||x - y|| \end{equation*}

where \(||\cdot||\) is the L2 norm. If \(K\) is the smallest value for which this inequality holds, it is called the Lipschitz constant of the discriminator.

(It’s easy to see that if \(I=\mathbb{R}\), then Lipschitz continuity implies epsilon-delta continuity.)
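To make the definition concrete in the GAN setting, here is a sketch (the tiny discriminator and image size are hypothetical stand-ins, not a real model) that estimates how large \(K\) must be for a randomly initialized discriminator by taking the largest ratio \(||D(x) - D(y)|| / ||x - y||\) over random image pairs; any such ratio is a lower bound on the Lipschitz constant.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A stand-in discriminator D : R^(32*32) -> R (hypothetical, untrained).
D = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 128), nn.ReLU(), nn.Linear(128, 1))

x = torch.randn(1_000, 1, 32, 32)
y = torch.randn(1_000, 1, 32, 32)

# ||D(x) - D(y)|| / ||x - y|| for random pairs of "images".
ratios = (D(x) - D(y)).norm(dim=1) / (x - y).flatten(1).norm(dim=1)
print(ratios.max().item())  # a lower bound on the Lipschitz constant of D
```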

Let’s look at the 1D case to illustrate Lipschitz continuity geometrically. Suppose \(D(x) = \sin(x)\). If \(D\) is \(K\)-Lipschitz continuous, then at every point on its graph we can draw a double cone with slope \(K\) such that the graph lies entirely outside of this cone.

It’s clear that \(\sin\) is 1-Lipschitz continuous. What about ReLU, our favorite activation function?

ReLU is also clearly 1-Lipschitz continuous: its slope is 0 for negative inputs and 1 for positive inputs.

In fact, Leaky ReLU (with a leak value less than one) is 1-Lipschitz continuous. However, it is easy to come up with a counterexample:

The function \(D(x)=x^2\) is not Lipschitz continuous at all.

One interesting fact to notice from these examples is that if a 1D function is differentiable, then its Lipschitz constant is just the supremum of the absolute value of its derivative. This gives a slightly more rigorous explanation of why \(x^2\) is not Lipschitz continuous: its derivative, \(2x\), is unbounded.
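This observation can be checked numerically. The sketch below (using PyTorch autograd; the grid range is an arbitrary choice of mine) estimates the Lipschitz constant of each 1D example as the maximum absolute derivative over a grid of points: the estimates for \(\sin\) and leaky ReLU stay at 1, while the estimate for \(x^2\) simply grows with the width of the interval.

```python
import torch
import torch.nn.functional as F

def max_abs_derivative(f, lo=-100.0, hi=100.0, n=100_001):
    """Estimate a 1D Lipschitz constant as the maximum |f'(x)| over a grid."""
    x = torch.linspace(lo, hi, n, requires_grad=True)
    y = f(x).sum()                      # summing lets one backward pass give df/dx elementwise
    (grad,) = torch.autograd.grad(y, x)
    return grad.abs().max().item()

print(max_abs_derivative(torch.sin))                       # ~1.0
print(max_abs_derivative(lambda x: F.leaky_relu(x, 0.1)))  # ~1.0
print(max_abs_derivative(lambda x: x ** 2))                # ~200, and it grows with the interval
```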

So, what is the justification for requiring that our discriminator be \(K\)-Lipschitz? If we are training a Wasserstein GAN, then Kantorovich-Rubinstein duality requires it. However, when we’re training with the standard GAN loss, there’s a looser but still intuitive explanation. Just as batch normalization ensures that the signals within a neural net have well-behaved statistics (controlling the exploding gradient problem that plagues RNNs and deep feedforward networks), mandating Lipschitz continuity bounds the gradient of the discriminator with respect to its input.

The multidimensional case

Suppose we have a linear function \(A : \mathbb{R}^{n}\rightarrow\mathbb{R}^m\). This function could be the pre-activation operation of one layer in a multilayer perceptron. We can compute the spectral norm of this function, which is defined as the largest singular value of the matrix \(A\), i.e., the square root of the largest eigenvalue of \(A^T A\).
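As a quick sanity check of this definition (a sketch with an arbitrary random matrix), the largest singular value of \(A\) and the square root of the largest eigenvalue of \(A^T A\) agree:

```python
import torch

torch.manual_seed(0)
A = torch.randn(5, 3)  # a linear map R^3 -> R^5

sigma_svd = torch.linalg.svdvals(A)[0]                 # largest singular value of A
sigma_eig = torch.linalg.eigvalsh(A.T @ A)[-1].sqrt()  # sqrt of largest eigenvalue of A^T A

print(sigma_svd.item(), sigma_eig.item())  # these match up to floating-point error
```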

What is the Lipschitz constant of this function, if it exists? Since \(A\) is linear, \(||A x - A y|| = ||A(x - y)||\), so we can set our point of reference \(y\) to zero (place our “cone” at zero): if \(A\) is \(K\)-Lipschitz at zero, then it is \(K\)-Lipschitz everywhere. This simplifies the Lipschitz continuity requirement to

\begin{equation*} ||A x|| \leq K ||x|| \end{equation*}

for all \(x\in \mathbb{R}^n\). This is equivalent to the statement

\begin{equation*} \langle A x, A x \rangle \leq K^2 \langle x, x \rangle \quad \forall x\in \mathbb{R}^n, \end{equation*} which in turn is equivalent to \begin{equation*} \langle (A^T A - K^2 I) x, x \rangle \leq 0 \quad \forall x\in \mathbb{R}^n. \end{equation*} If we expand \(x\) in the orthonormal basis of eigenvectors of \(A^T A\) (i.e., \(x = \sum_i x_i v_i\)), we can then write out this inner product explicitly:

\begin{align*} \langle (A^T A - K^2 I) x, x \rangle &= \langle (A^T A - K^2 I) \sum_i x_i v_i, \sum_j x_j v_j \rangle\\& = \sum_i \sum_j x_i x_j \langle (A^T A - K^2 I) v_i, v_j \rangle\\& = \sum_i (\lambda_i - K^2) x_i^2 \leq 0\\& \implies \sum_i (K^2 - \lambda_i) x_i^2 \geq 0. \end{align*}

Note that since \(A^T A\) is positive semidefinite, all the \(\lambda_i\)s must be nonnegative. Since the above sum must be nonnegative for every choice of \(x\) (in particular for \(x = v_i\)), each term must be nonnegative, so we have

\begin{equation*} K^2 - \lambda_i \geq 0 \text{ for all } i = 1 \ldots n. \end{equation*}

Since we choose \(K\) to be the minimum value satisfying the above constraints, we immediately see that \(K\) is the square root of the largest eigenvalue of \(A^T A\). Therefore, the Lipschitz constant of a linear function is its largest singular value, or its spectral norm.
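We can also verify this conclusion empirically; in the sketch below (random matrix and test points of my own choosing), the ratio \(||Ax||/||x||\) never exceeds \(\sigma(A)\), and equality is attained at the top right-singular vector:

```python
import torch

torch.manual_seed(0)
A = torch.randn(5, 3)
sigma = torch.linalg.svdvals(A)[0]

# The ratio ||Ax|| / ||x|| for random inputs never exceeds the spectral norm.
x = torch.randn(10_000, 3)
ratios = (x @ A.T).norm(dim=1) / x.norm(dim=1)
print(bool(ratios.max() <= sigma + 1e-5))  # True

# Equality is attained at the top right-singular vector of A.
v1 = torch.linalg.svd(A).Vh[0]
print((A @ v1).norm().item(), sigma.item())  # approximately equal
```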

Composition of functions

Analogously to the 1D case, the Lipschitz constant of a general differentiable function \(f : \mathbb{R}^n \rightarrow \mathbb{R}^m \) is the supremum of the spectral norm (maximum singular value) of its Jacobian \(\nabla f(x)\) over its domain: \begin{equation*} ||f||_\text{Lip} = \sup_{x} \sigma(\nabla f(x)). \end{equation*}
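For intuition, here is a sketch (the tiny two-layer network is a hypothetical stand-in) that evaluates \(\sigma(\nabla f(x))\) at a handful of random points using autograd; the largest value observed is a lower bound on \(||f||_\text{Lip}\):

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jacobian

torch.manual_seed(0)
f = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))  # f : R^4 -> R^2

# sigma(Jacobian) at random points; the maximum is a lower bound on ||f||_Lip.
local_norms = [
    torch.linalg.svdvals(jacobian(f, torch.randn(4)))[0].item()
    for _ in range(100)
]
print(max(local_norms))
```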

Now, let’s introduce a function \(g : \mathbb{R}^m \rightarrow \mathbb{R}^l\). The underlying premise of the spectral normalization paper is that if we can find, or at least bound, the Lipschitz constant of the composition of \(f\) and \(g\), then we can obtain a bound on the Lipschitz constant of an arbitrary multilayer discriminator, which is just a composition of linear maps and componentwise nonlinearities.

According to the chain rule, we have

\begin{equation*} \nabla (g\circ f)(x) = \nabla g(f(x)) \nabla f(x). \end{equation*}

Remember that these gradients are just matrices being multiplied together. This suggests a promising approach: to find the spectral norm of a composition of functions, express it in terms of the spectral norm of the matrix product of its gradients. We can write the spectral norm (maximum singular value) in another convenient form:

\begin{equation} \sigma(\nabla f(x)) = \sup_{||v||\leq 1} ||[\nabla f(x)] v|| \end{equation} \begin{equation} \sigma(\nabla(g\circ f)(x)) = \sup_{||v||\leq 1} ||[\nabla g(f(x))][\nabla f(x)] v|| \end{equation} Since, by (1), \(||[\nabla g(f(x))] w|| \leq \sigma(\nabla g(f(x)))\,||w||\) for any vector \(w\), we can bound the result in (2) as follows:

\begin{equation*} \sup_{||v||\leq 1} ||[\nabla g(f(x))][\nabla f(x)] v|| \leq \sup_{||u||\leq 1} ||[\nabla g(f(x))]u||\sup_{||v||\leq 1} ||[\nabla f(x)] v||. \end{equation*}

In other words, \begin{equation*} ||g\circ f||_\text{Lip} \leq ||g||_\text{Lip} ||f||_\text{Lip}. \end{equation*} This bound provides the theoretical justification for spectral normalization outlined in the paper. The solution goes like this: if each factor in the composition is at most 1-Lipschitz (the linear maps because we normalize their spectral norms to 1, and the componentwise activations such as ReLU and leaky ReLU because they are already 1-Lipschitz), then the whole discriminator is at most 1-Lipschitz. If we are training a Wasserstein GAN, this keeps the discriminator within the 1-Lipschitz function class required by Kantorovich-Rubinstein duality.
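This submultiplicative bound is easy to check numerically for the linear case; in the sketch below (with random matrices standing in for the per-layer Jacobians), the spectral norm of a product never exceeds the product of the spectral norms:

```python
import torch

torch.manual_seed(0)

def sigma(M):
    """Spectral norm: the largest singular value of M."""
    return torch.linalg.svdvals(M)[0]

for _ in range(1_000):
    A = torch.randn(6, 4)  # Jacobian of "f"
    B = torch.randn(3, 6)  # Jacobian of "g"
    assert sigma(B @ A) <= sigma(B) * sigma(A) + 1e-5
print("sigma(BA) <= sigma(B) * sigma(A) held in every trial")
```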

Spectral normalization

Fixing the spectral norm of a layer is as straightforward as it sounds. Spectral normalization, proposed in this paper, simply replaces every weight \(W\) with \(W / \sigma(W)\). But how do we efficiently compute \(\sigma(W)\), the largest singular value of \(W\)? The answer is a cheap and effective technique called power iteration.
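Before deriving power iteration, note that the normalization itself is a one-line operation; the sketch below computes \(\sigma(W)\) with a full SVD (fine for illustration, but too expensive to run at every training step) and confirms that the normalized weight has spectral norm 1:

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 128)

W_sn = W / torch.linalg.svdvals(W)[0]        # replace W with W / sigma(W)
print(torch.linalg.svdvals(W_sn)[0].item())  # 1.0: the normalized layer is 1-Lipschitz
```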

Suppose we have a linear map \(W : \mathbb{R}^n \rightarrow \mathbb{R}^m\), along with a random vector \(v\in \mathbb{R}^n\) in its domain and a vector \(u\in\mathbb{R}^m\) in its codomain.

Let’s first consider \(v\). We can form the square matrix \(W^T W : \mathbb{R}^n \rightarrow \mathbb{R}^n\). Power iteration involves computing the recurrence relation \begin{equation*} v_{t+1} = \frac{W^T W v_t}{||W^T W v_t||}, \qquad v_0 = v. \end{equation*}

We can unroll this recurrence relation: \begin{equation*} v_{t} = \frac{(W^T W)^t v}{||(W^T W)^t v||}. \end{equation*}

By the spectral theorem, we can write \(v\) in an orthonormal basis of eigenvectors of \(W^T W\). Let’s denote by \(\lambda_1 \geq \ldots \geq \lambda_n\) the eigenvalues of \(W^T W\) in descending order and by \(e_1 \ldots e_n\) the corresponding eigenvectors.

Power iteration then consists of computing the following:

\begin{align*} v_{t}& = \frac{(W^T W)^t \sum_i v_i e_i}{||(W^T W)^t \sum_i v_i e_i||}\\& = \frac{\sum_i v_i \lambda_i^t e_i}{||\sum_i v_i \lambda_i^t e_i||}\\& = \frac{v_1 \lambda_1^t \sum_i \frac{v_i}{v_1} \left(\frac{\lambda_i}{\lambda_1}\right)^t e_i}{||v_1 \lambda_1^t \sum_i \frac{v_i}{v_1} \left(\frac{\lambda_i}{\lambda_1}\right)^t e_i||}. \end{align*}

Now note that if \(\lambda_1\) is strictly the largest eigenvalue of \(W^T W\), then \(\lim\limits_{t\rightarrow\infty}\left(\frac{\lambda_i}{\lambda_1}\right)^t = 0\) for \(i>1\). So, after many iterations of the above procedure, \(v_t\) converges to \(e_1\). We call the normalized intermediate computation \(u_t = \frac{W v_t}{||W v_t||}\). The power iteration procedure then becomes:

\begin{align*} u_{t+1} &= \frac{W v_t}{||W v_t||},\\ v_{t+1} &= \frac{W^T u_{t+1}}{||W^T u_{t+1}||}. \end{align*}

Once \(v\) has converged to \(e_1\), the spectral norm is \(\sigma(W) = \sqrt{\lambda_1} = ||W v||\) (recall that the singular values of \(W^T\) and \(W\) are the same). Since \(u = \frac{W v}{||W v||}\) is a unit vector, we can conveniently compute the spectral norm as follows: \begin{equation*} \sigma(W) = ||W v|| = u^T W v. \end{equation*}
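Putting the pieces together, here is a sketch of power iteration for estimating \(\sigma(W)\) (the matrix size and iteration count are arbitrary choices of mine), compared against the exact value from an SVD:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
W = torch.randn(64, 128)  # W : R^128 -> R^64

u = torch.randn(64)
v = torch.randn(128)

# Alternate u <- Wv / ||Wv||, v <- W^T u / ||W^T u||.
for _ in range(100):
    u = F.normalize(W @ v, dim=0)
    v = F.normalize(W.T @ u, dim=0)

sigma_power = u @ W @ v                        # u^T W v
sigma_exact = torch.linalg.svdvals(W)[0]
print(sigma_power.item(), sigma_exact.item())  # approximately equal
```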

Now, the algorithm of spectral normalization should appear simple. For every weight in our network, we randomly initialize vectors \(u\) and \(v\). Because the weights change slowly, we only need to perform a single power iteration on the current version of these vectors for each step of learning; this is why spectral normalization is more computationally efficient than, say, gradient penalty.

I’ve implemented a simple spectral normalization wrapper module in PyTorch. Feel free to try it out and see the effectiveness of spectral normalization GANs yourself.
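For reference, here is a minimal sketch of what such a wrapper might look like (my own simplified version for `Linear` and `Conv2d` layers; it is not necessarily identical to the implementation mentioned above, or to the `torch.nn.utils.spectral_norm` utility that PyTorch now ships). It keeps persistent \(u\) and \(v\) buffers, runs one power iteration per forward pass, and applies the wrapped layer with its weight divided by the estimated \(\sigma(W)\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralNorm(nn.Module):
    """Sketch of a spectral normalization wrapper (simplified, illustrative).

    Keeps persistent u and v vectors, performs one power iteration per
    forward pass, and applies the wrapped layer with weight W / sigma(W).
    """

    def __init__(self, module):
        super().__init__()
        self.module = module
        w = module.weight
        w_mat = w.view(w.size(0), -1)  # flatten conv kernels into a 2D matrix
        self.register_buffer("u", F.normalize(torch.randn(w_mat.size(0)), dim=0))
        self.register_buffer("v", F.normalize(torch.randn(w_mat.size(1)), dim=0))

    def forward(self, x):
        w = self.module.weight
        w_mat = w.view(w.size(0), -1)

        # One power iteration step; u and v are treated as constants for gradients.
        with torch.no_grad():
            self.u = F.normalize(w_mat @ self.v, dim=0)
            self.v = F.normalize(w_mat.t() @ self.u, dim=0)

        sigma = torch.dot(self.u, w_mat @ self.v)  # sigma(W) ~= u^T W v
        w_sn = w / sigma

        # Apply the wrapped layer's operation with the normalized weight.
        if isinstance(self.module, nn.Linear):
            return F.linear(x, w_sn, self.module.bias)
        if isinstance(self.module, nn.Conv2d):
            m = self.module
            return F.conv2d(x, w_sn, m.bias, m.stride, m.padding, m.dilation, m.groups)
        raise NotImplementedError("This sketch only handles Linear and Conv2d.")
```

Usage would look like `SpectralNorm(nn.Conv2d(3, 64, 3, padding=1))` wherever a bare convolution would otherwise appear in the discriminator.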