A technical issue I came across when researching neural networks, especially recurrent neural networks (RNNs) and LSTMs, is a phenomenon known as the 'exploding or vanishing gradient problem' (in fact one of the motivations for inventing the LSTM architecture). When you try to make a network remember information from before, e.g. information from earlier in a text or another sequence, that information is stored as numbers that are repeatedly multiplied into and added to the current state. Because the same weights are applied at every step, what often happens is that the values become either very large or very small. In the first RNNs this meant that older information tended to weigh very little compared to recent information, which is why the LSTM (long short-term memory) network was invented.
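
To make that concrete, here is a small toy sketch of my own (not from any particular source): a scalar gradient signal multiplied by the same recurrent weight at every timestep, the way backpropagation through time would apply it in a simple RNN. Depending on whether the weight is below or above 1, the signal shrinks toward zero or blows up, which is the vanishing/exploding behaviour described above.

```python
# Toy illustration of vanishing/exploding gradients in a simple RNN.
# The function name and numbers are just for illustration.
def backprop_signal(weight, steps=50):
    grad = 1.0
    history = []
    for _ in range(steps):
        grad *= weight          # same recurrent weight applied at every timestep
        history.append(grad)
    return history

vanishing = backprop_signal(0.9)   # |w| < 1: signal shrinks toward zero
exploding = backprop_signal(1.1)   # |w| > 1: signal blows up

print(f"after 50 steps, w=0.9 -> {vanishing[-1]:.6f}")   # roughly 0.005
print(f"after 50 steps, w=1.1 -> {exploding[-1]:.2f}")   # roughly 117
```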

I started seeing code that uses the Xavier initializer, and found some introductory videos on the subject.
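
For reference, this is roughly what the Xavier (Glorot) scheme does, in a small numpy sketch of my own (the function name and layer sizes are just illustrative): the initial weights are scaled by the number of incoming and outgoing units, so the variance of the signal stays roughly constant from layer to layer instead of exploding or vanishing.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, seed=0):
    """Draw a (fan_in, fan_out) weight matrix with the Xavier/Glorot uniform scheme."""
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (fan_in + fan_out))  # bound from Glorot & Bengio (2010)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(256, 128)
print(W.shape, round(W.std(), 3))  # std comes out near sqrt(2/(fan_in+fan_out)) ≈ 0.072
```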