We can learn a lot about Why Deep Learning Works by studying the properties of the layer weight matrices of pre-trained neural networks. And, hopefully, by doing this, we can get some insight into what a well trained DNN looks like, even without peeking at the training data.

One broad question we can ask is:

How is information concentrated in Deep Neural Networks (DNNs)?

To get a handle on this, we can run 'experiments' on the pre-trained DNNs available in PyTorch.

In a previous post, we formed the Singular Value Decomposition (SVD) of the weight matrices of the linear, or fully connected (FC), layers, and we saw that nearly all the FC layers display Power Law behavior. In fact, this behavior is Universal across both ImageNet and NLP models.

But this is only part of the story. Here, we ask a related question: do the weight matrices of well trained DNNs lose Rank?

Matrix Rank:

Let's say $\mathbf{W}$ is an $N\times M$ matrix. We can form the Singular Value Decomposition (SVD):

$$\mathbf{W}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}$$

$$\nu_{i}=\Sigma_{ii},\;\;i=1\cdots M$$

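The Matrix Rank $\mathcal{R}(\mathbf{W})$, or Hard Rank, is simply the number of non-zero singular values ($\nu_{i}>0$)

$$\mathcal{R}(\mathbf{W})=\sum_{i=1}^{M}\delta(\nu_{i}>0)$$

where $\delta(\cdot)$ is 1 when its argument is true and 0 otherwise. This expresses any decrease from the Full Rank $M$.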


Notice that the Hard Rank of the rectangular matrix $\mathbf{W}$ equals the rank of the square $M\times M$ correlation matrix $\mathbf{X}=\mathbf{W}^{T}\mathbf{W}$, so it can be at most $M$.

In Python, this can be computed with NumPy using

rank = numpy.linalg.matrix_rank(W)

Of course, since this is a numerical computation, we really mean the number of singular values above some tolerance ($\nu_{i}>\text{tol}$), and we can get different results depending on whether we use

  • the default NumPy tolerance
  • the Numerical Recipes tolerance, which is tighter

See the numpy documentation on matrix_rank for details.
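To make the difference concrete, here is a minimal sketch that compares the two tolerances described in the NumPy documentation for `matrix_rank`. The test matrix and the variable names are illustrative assumptions; the two formulas are the ones quoted in that documentation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative matrix: a product of 50x5 and 5x20 factors has rank 5.
W = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 20))

svals = np.linalg.svd(W, compute_uv=False)
eps = np.finfo(W.dtype).eps
n_rows, n_cols = W.shape

# Default NumPy tolerance: S.max() * max(M, N) * eps
tol_default = svals.max() * max(n_rows, n_cols) * eps
# Numerical Recipes tolerance: S.max() * eps / 2 * sqrt(m + n + 1)
tol_nr = svals.max() * eps / 2.0 * np.sqrt(n_rows + n_cols + 1.0)

rank_default = np.linalg.matrix_rank(W)          # default tolerance
rank_nr = np.linalg.matrix_rank(W, tol=tol_nr)   # explicit, tighter tolerance

print(tol_default, tol_nr, rank_default, rank_nr)
```

Because the second tolerance is smaller, it can only count the same number of singular values or more, so the reported rank never goes down when switching to it.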

Here, we will compute the rank ourselves, use an extremely loose bound, and treat any $\nu_{i}<0.001$ as zero. As we shall see, DNNs are so good at concentrating information that it will not matter.
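A minimal sketch of such a hand-rolled rank computation might look like the following. The function name `hard_rank` and the small diagonal test matrix are assumptions made for illustration; only the 0.001 threshold comes from the text.

```python
import numpy as np

def hard_rank(W, tol=1e-3):
    """Count the singular values of W that exceed tol."""
    svals = np.linalg.svd(W, compute_uv=False)
    return int(np.sum(svals > tol))

# Illustrative matrix with singular values 3, 1 and 1e-5.
W_example = np.diag([3.0, 1.0, 1e-5])

rank_loose = hard_rank(W_example)                # 1e-5 is treated as zero -> 2
rank_strict = np.linalg.matrix_rank(W_example)   # default tolerance keeps it -> 3
print(rank_loose, rank_strict)
```

The loose threshold discards the tiny singular value that the default NumPy tolerance would still count.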

Rank and Regularization

If all the singular values are non-zero, we say $\mathbf{W}$ is Full Rank. If one or more $\nu_{i}\sim 0$, then we say $\mathbf{W}$ is Singular. It has lost expressiveness, and the model has undergone Rank Collapse.

When a model undergoes Rank Collapse, it traditionally needs to be regularized. Say we are solving a simple linear system of equations / linear regression


The simple solution is to use a little linear algebra to get the optimal values for the unknown $\mathbf{W}$


But when $\mathbf{W}$ is Singular, we cannot form the matrix inverse. To fix this, we simply add some small constant $\gamma$ to the diagonal of the correlation matrix $\mathbf{X}=\mathbf{W}^{T}\mathbf{W}$



So that all the singular values will now be greater than zero, and we can form a generalized pseudo-inverse, called the Moore-Penrose Inverse


This procedure is also called Tikhonov Regularization. The constant, or Regularizer, $\gamma$ sets the Noise Scale for the model. The information in $\mathbf{W}$ is concentrated in the singular vectors associated with the larger singular values $\nu_{i}^{2}>\gamma$, and the noise is left over in those associated with the smaller singular values $\nu_{i}^{2}<\gamma$:

  • Information: vectors where $\nu_{i}^{2}>\gamma$
  • Noise: vectors where $\nu_{i}^{2}<\gamma$
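The following is a minimal NumPy sketch of this kind of regularization, assuming the standard ridge form in which $\gamma$ is added to the diagonal of $\mathbf{W}^{T}\mathbf{W}$. The system $\mathbf{W}\mathbf{x}=\mathbf{y}$, the vectors `x` and `y`, the matrix sizes, and the value of `gamma` are all illustrative assumptions, not taken from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 5
W = rng.normal(size=(N, M))
W[:, -1] = W[:, 0]            # duplicate a column so that W is singular
y = rng.normal(size=N)

X = W.T @ W                   # M x M correlation matrix
gamma = 1e-2                  # hypothetical regularizer
X_reg = X + gamma * np.eye(M)

eig_min = np.linalg.eigvalsh(X).min()          # ~0: X cannot be inverted reliably
eig_min_reg = np.linalg.eigvalsh(X_reg).min()  # ~gamma: X_reg is invertible

# Regularized solution of W x = y via the normal equations
x_ridge = np.linalg.solve(X_reg, W.T @ y)

# The same solution written with the SVD: each direction is weighted
# by nu_i / (nu_i^2 + gamma), which suppresses directions with nu_i^2 << gamma
U, s, Vt = np.linalg.svd(W, full_matrices=False)
x_svd = Vt.T @ ((s / (s**2 + gamma)) * (U.T @ y))

print(eig_min, eig_min_reg, np.allclose(x_ridge, x_svd))
```

The SVD form makes the role of $\gamma$ as a noise scale visible: directions with $\nu_{i}^{2}\gg\gamma$ pass through almost unchanged, while those with $\nu_{i}^{2}\ll\gamma$ are damped toward zero.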

In cases where $\mathbf{W}$ is Singular, regularization is absolutely necessary. But even when it is not singular, Regularization can be useful in traditional machine learning. (Indeed, VC theory tells us that Regularization is a first class concept.)

But we know that Understanding deep learning requires rethinking generalization, which leads to the question:

Do the weight matrices of well trained DNNs undergo Rank Collapse?

Answer: They DO NOT, as we now see:

Analyzing Pre-Trained PyTorch Models

We can easily examine the numerous pre-trained models available in PyTorch. We simply need to get the layer weight matrices and compute the SVD. We then compute the minimum singular value $\nu_{min}$ and compute a histogram of the minimums across different models.

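import numpy as np
import torch

# `model` is any pre-trained torch.nn.Module
for im, m in enumerate(model.modules()):
  if isinstance(m, torch.nn.Linear):
    W = m.weight.detach().cpu().numpy()
    M, N = np.min(W.shape), np.max(W.shape)
    # singular values only, returned in descending order
    svals = np.linalg.svd(W, compute_uv=False)
    sv_min = svals[-1]  # minimum singular value of this layer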

We do this here for numerous models trained on ImageNet and available in PyTorch, such as AlexNet, VGG16, VGG19, ResNet, DenseNet201, etc., as shown in this Jupyter Notebook.
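The remaining step, collecting the minimum singular value of each layer and histogramming them, can be sketched in plain NumPy. The matrices below are random stand-ins, not real pre-trained weights, and the layer names and helper function are hypothetical.

```python
import numpy as np

def min_singular_value(W):
    """Smallest singular value of a weight matrix."""
    return np.linalg.svd(W, compute_uv=False).min()

rng = np.random.default_rng(0)
# Hypothetical stand-ins for FC layer weight matrices
layer_weights = {
    "fc1": rng.normal(size=(512, 128)),
    "fc2": rng.normal(size=(256, 64)),
    # an outer product has rank 1, so its minimum singular value is ~0
    "fc3_collapsed": np.outer(rng.normal(size=128), rng.normal(size=32)),
}

sv_mins = {name: min_singular_value(W) for name, W in layer_weights.items()}
counts, bin_edges = np.histogram(list(sv_mins.values()), bins=10)

for name, value in sv_mins.items():
    print(f"{name}: nu_min = {value:.3e}")
```

With real models, `layer_weights` would be filled from the loop over `model.modules()`, and the histogram would be taken over all layers of all models.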

We also examine the NLP models available in AllenNLP.   This is a little bit trickier; we have to install AllenNLP from source, then create an analyze.py command class, and rebuild AllenNLP. Then, to analyze, say, the AllenNLP pre-trained NER model, we run

allennlp analyze https://s3-us-west-2.amazonaws.com/allennlp/models/ner-model-2018.04.26.tar.gz

This prints out the ranks (and other information, like power law fits), and then plots the results. The code for all this is here.

Notice that many of the AllenNLP models include Attention matrices, which can be quite large and very rectangular (e.g. $\sim 16,000\times 500$), as compared to the smaller (and less rectangular) weight matrices used in the ImageNet models (e.g. $\sim 4,000\times 500$).

Note: We restrict our analysis to rectangular layer weight matrices with an aspect ratio $Q=N/M>1$, and really larger than 1.1. This is because Marchenko-Pastur (MP) Random Matrix Theory (RMT) tells us that $\nu_{min}>0$ only when $Q>1$. We will review this in a future blog.
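As a quick numerical illustration of the MP statement, consider a purely random Gaussian matrix, which is an assumption made here for the sketch and not one of the trained layers. For i.i.d. entries of unit variance, MP theory places the singular values of $\mathbf{W}/\sqrt{N}$ asymptotically between $1-1/\sqrt{Q}$ and $1+1/\sqrt{Q}$, so the lower edge is bounded away from zero only when $Q>1$.

```python
import numpy as np

rng = np.random.default_rng(0)

def min_sv_normalized(N, M):
    """Minimum singular value of an N x M Gaussian matrix scaled by 1/sqrt(N)."""
    W = rng.normal(size=(N, M))
    return np.linalg.svd(W / np.sqrt(N), compute_uv=False).min()

# Rectangular case, Q = 4
N, M = 1000, 250
Q = N / M
nu_min_rect = min_sv_normalized(N, M)
mp_lower_edge = 1.0 - 1.0 / np.sqrt(Q)   # MP prediction for the lower edge

# Square case, Q = 1: the lower edge sits at zero
nu_min_square = min_sv_normalized(250, 250)

print(nu_min_rect, mp_lower_edge, nu_min_square)
```

The rectangular matrix has a minimum singular value close to the MP lower edge, while the square one has a minimum singular value near zero even though nothing has collapsed.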

Minimum Singular Values of Pre-Trained Models


For the ImageNet models, most fully connected (FC) weight matrices have a large minimum singular value $\nu_{min}\gg 0$. Only 6 of the 24 matrices looked at have $\nu_{min}\sim 0$, and we have not carefully tested the numerical threshold; we are just eyeballing it here.

For the AllenNLP models, none of the FC matrices show any evidence of Rank Collapse.  All of the singular values for every linear weight matrix are non-zero.

It is conjectured that fully optimized DNNs, those with the best generalization accuracy, will not show Rank Collapse in any of their linear weight matrices.

If you are training your own model and you see Rank Collapse, you are probably over-regularizing.

Inducing Rank Collapse is easy: just over-regularize

It is, in fact, very easy to induce Rank Collapse. We can do this in a Mini version of AlexNet, coded in Keras 2, and available here.

Mini AlexNet
A Mini version of AlexNet, trained on CIFAR10, used to explore regularization and rank collapse in DNNs.

To induce rank collapse in our FC weight matrices, we can add large weight norm constraints to the linear layers, using kernel_regularizer=l2(1e-3)

from keras.layers import Dense
from keras.initializers import Constant
from keras.regularizers import l2

model.add(Dense(384, kernel_initializer='glorot_normal',
   bias_initializer=Constant(0.1), activation='relu', kernel_regularizer=l2(1e-3)))
model.add(Dense(192, kernel_initializer='glorot_normal',
   bias_initializer=Constant(0.1), activation='relu', kernel_regularizer=l2(1e-3)))

We train this smaller Mini AlexNet model on CIFAR10 for 20 epochs, save the final weight matrix, and plot a histogram of the eigenvalues of the weight correlation matrix $\mathbf{X}=\mathbf{W}^{T}\mathbf{W}$.

Rank Collapse Induced in a Mini AlexNet model, caused by adding weight norm constraints of 0.001

Recall that the eigenvalues are simply the squares of the singular values. Here, most of them are nearly 0:

$$\lambda_{i}:=\nu_{i}^{2}\sim 0$$
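The relation between the eigenvalues of $\mathbf{X}=\mathbf{W}^{T}\mathbf{W}$ and the singular values of $\mathbf{W}$ is easy to check numerically. The random matrix below is only an illustrative stand-in for a layer weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 20))    # stand-in weight matrix

# Singular values of W, in descending order
svals = np.linalg.svd(W, compute_uv=False)

# Eigenvalues of the symmetric matrix X; eigvalsh returns them in
# ascending order, so reverse to match the singular values
X = W.T @ W
evals = np.linalg.eigvalsh(X)[::-1]

squares_match = np.allclose(evals, svals**2)
print(squares_match)
```

This is why a histogram of the eigenvalues of $\mathbf{X}$ piling up at zero is the same signal as the singular values of $\mathbf{W}$ collapsing.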

Adding too much regularization causes nearly all of the eigenvalues/singular values to collapse to zero.
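The same effect can be seen in a much simpler setting. The sketch below is a toy linear analogue, not the Mini AlexNet experiment: it fits a single linear layer by ridge regression in closed form and compares a mild and an extreme L2 penalty. The data, the sizes, the penalty values and the helper names are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 30))                 # toy inputs
W_true = rng.normal(size=(30, 10))             # toy ground-truth weights
Y = A @ W_true + 0.1 * rng.normal(size=(200, 10))

def ridge_weights(gamma):
    """Closed-form minimizer of ||A W - Y||^2 + gamma * ||W||^2."""
    return np.linalg.solve(A.T @ A + gamma * np.eye(A.shape[1]), A.T @ Y)

def hard_rank(W, tol=1e-3):
    """Number of singular values above the loose 0.001 threshold."""
    return int(np.sum(np.linalg.svd(W, compute_uv=False) > tol))

rank_mild = hard_rank(ridge_weights(1e-3))     # mild penalty: rank is kept
rank_over = hard_rank(ridge_weights(1e8))      # extreme penalty: all nu_i ~ 0

print(rank_mild, rank_over)
```

With the mild penalty the fitted weights keep all 10 singular values, while the extreme penalty shrinks every singular value below the threshold. A deep network trained by SGD is of course a different setting, so this only illustrates the direction of the effect.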

Well trained Deep Neural Networks do not display Rank Collapse


We believe this is a unique property of DNNs, and related to how Regularization works in these models. We will discuss this and more in an upcoming paper.


( https://calculatedcontent.com/2018/09/21/rank-collapse-in-deep-learning/)

