Frequently Asked Questions
In this “appendix” to our little online book, we collect frequently asked questions about machine learning.
Q: Doesn’t the gradient descent get stuck in local minima?
A: Yes, in principle this can happen. However, the pictures we draw are a bit misleading, with a single coordinate axis for the parameters. In truth, we are in a high-dimensional parameter space. In this space, most fixed points (where the gradient vanishes) are actually saddle points, with some directions of positive curvature and some of negative curvature. They can still slow down learning, but at least they are not local minima. Also, in many situations one observes that several training runs end up with very different parameter configurations, but only slightly different values of the loss function, performing approximately equally well.
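To make the saddle-point picture concrete, here is a minimal sketch (plain numpy gradient descent on a made-up function, not code from the book): the function f(x, y) = x² − y² has a saddle at the origin, and the descent slows down near it but then escapes along the negative-curvature direction.

```python
import numpy as np

# f(x, y) = x**2 - y**2 has a saddle point at the origin:
# positive curvature along x, negative curvature along y.
def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

p = np.array([1.0, 1e-3])   # start almost exactly on the "stable" x-axis
eta = 0.1                    # learning rate (hypothetical value)
for step in range(100):
    p = p - eta * grad(p)
    if step % 20 == 0:
        print(step, p)
# x shrinks towards 0, but the tiny y-component keeps growing:
# gradient descent is slowed down near the saddle, yet it does not get stuck there.
```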
Q: Isn’t this all just curve fitting?
A: In principle, yes. However, it is “curve fitting” in an entirely new regime, with anywhere from hundreds up to many millions of trainable parameters. This is very different from usual nonlinear curve fitting, and the behaviour is therefore often quite unexpected. For example, people have found that one can get very good performance even in regimes where the number of parameters exceeds the number of training data points, which would be a very bad regime for usual curve fitting.
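As a toy illustration of this overparameterized regime (a numpy sketch with made-up numbers, not an example from the book), one can fit 10 data points with a polynomial that has 31 coefficients; the minimum-norm least-squares solution still reproduces all training points essentially exactly.

```python
import numpy as np

# 10 "training" points (made-up data for illustration)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1.0, 1.0, size=10))
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=10)

# polynomial with 31 coefficients: more parameters than data points
A = np.vander(x, 31)
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimum-norm solution

print(np.max(np.abs(A @ coeffs - y)))   # tiny residual: all points are reproduced
```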
Q: Are there rules for how to pick the correct number of layers and the correct number of neurons in each layer?
A: There are no precise rules, but empirical rules of thumb work for simple problems. Imagine mapping an n-dimensional input vector to an m-dimensional output vector. Then you might start with a few (2-3) hidden layers, with their neuron numbers on the order of 100 or so. This will likely already perform reasonably well, and you can optimize from there (see the sketch below). The more important message is that ML with neural networks is surprisingly robust and often does not depend very much on the detailed choices you make (unless you work on cutting-edge problems where no one has managed to get good results yet and you are exploring completely new territory).
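For concreteness, here is what such a generic starting point might look like as a sketch in Keras (assuming tensorflow/keras is installed; the dimensions n and m are hypothetical):

```python
from tensorflow import keras

n, m = 5, 2   # hypothetical input and output dimensions

net = keras.Sequential([
    keras.Input(shape=(n,)),
    keras.layers.Dense(100, activation="relu"),   # hidden layer 1
    keras.layers.Dense(100, activation="relu"),   # hidden layer 2
    keras.layers.Dense(m)                         # linear output layer
])
net.compile(optimizer="adam", loss="mean_squared_error")
net.summary()
```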
Q: How important is the activation function?
A: As long as you choose a nonlinear monotonic function, every choice is probably good. People have had very good success even with the simplest possible choice, a piecewise linear function (the “relu”, rectified linear unit).
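For reference, here is the relu, together with one smooth alternative, written out in numpy (a sketch, not code from the book):

```python
import numpy as np

def relu(z):
    # piecewise linear: 0 for z < 0, z for z >= 0
    return np.maximum(0.0, z)

def sigmoid(z):
    # a smooth, monotonic alternative
    return 1.0 / (1.0 + np.exp(-z))

print(relu(np.array([-2.0, 0.0, 3.0])))   # [0. 0. 3.]
```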
Q: How large should my training data set be?
A: That’s a tricky question. A good general rule of thumb for relatively standard problems is that you should have thousands of training samples. More precisely, the answer depends on the variety of data you want to process (if you have N categories of images, you probably want tens or hundreds of images for each of these categories). It also depends on how much the structure of the neural network has been tailored to the type of data: convolutional networks are better for images than fully connected networks and need less training data. On a more advanced level, for special applications, one may inject some physics knowledge into constructing a more elaborate network.
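To illustrate why a convolutional network can get away with less data, here is a small sketch in Keras (hypothetical numbers: 32x32 grayscale images and N categories): the convolutional filters share their weights across the image, so the total parameter count stays far below that of a fully connected network acting on the raw pixels.

```python
from tensorflow import keras

N = 10   # hypothetical number of image categories

cnn = keras.Sequential([
    keras.Input(shape=(32, 32, 1)),                         # 32x32 grayscale images
    keras.layers.Conv2D(16, kernel_size=3, activation="relu"),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Conv2D(32, kernel_size=3, activation="relu"),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Flatten(),
    keras.layers.Dense(N, activation="softmax")
])
cnn.compile(optimizer="adam", loss="categorical_crossentropy")
cnn.summary()   # compare the parameter count to a dense net acting on all 1024 pixels
```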
Q: OK, but I really do not have that much data in my experiment. What can I do?
A: One option is to pretrain on, e.g., simulated training samples. Even if they are a bit different from the experimental data, they will already help the network learn. Afterwards, train on the experimental data. Alternatively, in some cases, one can generate new fake training data from the existing data (e.g. by distorting or rotating images), provided one also knows the correct output for the fake data.
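A minimal sketch of this second idea, generating fake training data by small random rotations (assuming the images come as a numpy array of shape (num_samples, height, width) and that a small rotation does not change the correct output):

```python
import numpy as np
from scipy.ndimage import rotate

def augment(images, labels, copies=4, max_angle=15.0, seed=0):
    """Return slightly rotated copies of each image, keeping the label unchanged."""
    rng = np.random.default_rng(seed)
    new_images, new_labels = [], []
    for img, lab in zip(images, labels):
        for _ in range(copies):
            angle = rng.uniform(-max_angle, max_angle)
            new_images.append(rotate(img, angle, reshape=False, mode="nearest"))
            new_labels.append(lab)   # the correct output for the fake data is known
    return np.array(new_images), np.array(new_labels)
```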
(FAQ to be continued; you are welcome to post questions via the “open issues” link under the GitHub button at the top of this page!)