Towards the Next Generation of Artificial Neural Network

Min Jean Cho

In this article, 1) I’ll first talk about how the sigmoidal activation function makes artificial neurons mimic biological neurons more closely than any other activation function. This discussion will lead into several other sub-topics as well: 2) the pitfall of sigmoid activation in state-of-the-art deep learning, namely the vanishing gradient problem; 3) why other non-linear activation functions (e.g., ReLU, LeakyReLU) are ad-hoc solutions; 4) why the residual neural network (ResNet) is one step closer to the biological neural networks in the brain; 5) the next step for artificial neural networks.

This article is based on my thoughts, so please take it with a grain of salt.

Let’s first discuss why sigmoidal activation makes artificial neurons mimic biological neurons more closely than any other activation function. A neuron receives signals in the form of neurotransmitters from other neurons and releases signals, also in the form of neurotransmitters, to other neurons (for more details, see this article).

In this signal-receiving and releasing process, there are some important points to note about the amount of neurotransmitter released from the axon terminal. First, there is a threshold amount of neurotransmitter required to generate an action potential. Second, the strength of input signals is encoded in the frequency of pulses (e.g., pulses per millisecond). Third, the relationship between the rate of neurotransmitter release and the concentration of intracellular calcium ions is sigmoidal (that is, the release rate has a maximum). Fourth, released neurotransmitters rapidly diffuse away and are subject to metabolic degradation. Thus, the amount of neurotransmitter released by an activated neuron (\(y\)) is not proportional to the total amount of neurotransmitter received (\(\Sigma = \mathbf{w}\cdot\mathbf{x}\)). Even at the maximum release rate, the concentration of neurotransmitter in the synaptic cleft has a maximum at every unit time step due to diffusion and degradation. The best model for the amount of neurotransmitter released could therefore be a sigmoidal function such as the logistic function,

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

or scaled hyperbolic tangent (tanh) function (tanh is a rescaled logistic function),

\[y = 0.5[1+\text{tanh}(\Sigma)], \text{tanh}(\Sigma) = \frac{e^{\Sigma}-e^{-\Sigma}}{e^{\Sigma}+e^{-\Sigma}} = 2\sigma(2\Sigma)-1\]

both of which saturate (reach a maximum) as \(\Sigma\) approaches positive infinity. In other words, neurons in the brain almost never transmit signals as they are (i.e., through a linear activation function).
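As a quick numerical check (a minimal NumPy sketch added for illustration, not part of the original argument), the identity \(\tanh(z) = 2\sigma(2z) - 1\) and the saturation of \(0.5[1+\tanh(\Sigma)]\) toward 1 can be verified directly:

```python
import numpy as np

def sigmoid(x):
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# Sanity check of the identity tanh(z) = 2*sigma(2z) - 1,
# and of saturation: the output approaches 1 as the input grows.
z = np.linspace(-10, 10, 5)
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))   # True
print(0.5 * (1 + np.tanh(np.array([0.0, 2.0, 10.0]))))   # [0.5, ~0.982, ~1.0]
```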

In fact, biological systems, including neurons and enzymes, almost never behave linearly. For example, even if we feed someone an extremely large amount of food, there is still an upper limit to how much energy the person gets from the food; the person certainly won’t be able to lift an airplane regardless of eating a ton of food. Most biological systems actually display activation behavior similar to the sigmoid function. Then what is the pitfall of the sigmoid function in state-of-the-art deep learning?

The saturation of the sigmoid function is a double-edged sword. When \(x\) is a large positive or large negative number, the derivative \(\frac{d\sigma(x)}{dx}\) is effectively zero. In other words, the network can’t update (or learn) its weights. This is called the vanishing gradient problem. Other non-linear activation functions (e.g., ReLU, LeakyReLU) have therefore been used to remedy the vanishing gradient problem.
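To see how small the gradient becomes in the saturated regions, here is a minimal NumPy sketch (my own illustration) of the sigmoid derivative \(\sigma'(x) = \sigma(x)(1-\sigma(x))\):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d(sigma)/dx = sigma(x) * (1 - sigma(x)); its maximum is 0.25 at x = 0
    # and it decays toward 0 as |x| grows (the saturated regions).
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))    # 0.25, the largest possible value
print(sigmoid_grad(10.0))   # ~4.5e-5, effectively no gradient signal

# Backpropagating through many saturated sigmoid layers multiplies these
# small factors together, so the gradient reaching early layers vanishes.
print(sigmoid_grad(5.0) ** 10)   # ~1.7e-22
```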

The derivative of the LeakyReLU function is always non-zero, but the function has no upper bound (i.e., no saturation), which can cause the exploding gradient problem. Although LeakyReLU doesn’t suffer from the vanishing gradient problem, does it solve the fundamental problem? I don’t think so; I think these non-linear activations without an upper bound are ad-hoc solutions. As said earlier, biological systems almost never display linearity (e.g., an unbounded response). Then why is the vanishing gradient a problem in the first place? As shown in another article, Introduction to Deep Learning, the loss is backpropagated through each layer. In other words, there is a sequential order of layers (e.g., input layer → hidden layer → ... → hidden layer → output layer). However, the connectivity of neurons in the brain is certainly far more complex than such a sequential arrangement. It could be very difficult to model the actual connectivity of biological neurons (is it even possible?), and that’s probably why artificial neural network architectures use sequentially ordered layers of neurons. However, there is ResNet, which I think is one step closer to the neural networks of the brain.
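Before moving on to ResNet, here is a minimal NumPy sketch (my own illustration; the slope \(\alpha = 0.01\) is just a common default) of LeakyReLU and its derivative, showing why its gradient never vanishes but its output is unbounded:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Identity for positive inputs, a small linear slope for negative inputs.
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # The derivative is 1 (x > 0) or alpha (x <= 0): never exactly zero,
    # so gradients do not vanish the way a saturated sigmoid's do.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-100.0, -1.0, 0.5, 100.0, 1e6])
print(leaky_relu(x))       # the positive side grows without bound (no saturation)
print(leaky_relu_grad(x))  # [0.01, 0.01, 1., 1., 1.]
```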

ResNet is a neural network with residual blocks as shown above. In addition to the \(i^{th}\) layer feeding directly into the following \((i+1)^{th}\) layer, the \(i^{th}\) layer also feeds into \((i+k)^{th}\) layers (\(k>1\)) that are a few hops away. Therefore, residual blocks break the purely sequential nature of neural network architectures. But still, the residual block is not enough to model the connectivity of neurons in the brain. Besides the forward skip connection in the residual block, is there anything else that biological systems commonly display? What about feedback connections, such as those used for the fine regulation of enzyme activity?
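To make the forward skip connection concrete before turning to feedback, here is a minimal PyTorch-style sketch of a residual block (my own simplified illustration with fully connected layers, not the convolutional block from the original ResNet paper):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = activation(F(x) + x).

    The identity "skip" path lets layer i feed a layer a few hops ahead,
    so the gradient can flow around the transformation F during backprop.
    """
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        residual = x                      # identity shortcut
        out = self.act(self.fc1(x))       # F(x): the "normal" sequential path
        out = self.fc2(out)
        return self.act(out + residual)   # merge the skip and sequential paths

x = torch.randn(8, 32)
print(ResidualBlock(32)(x).shape)  # torch.Size([8, 32])
```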

I think that such feedback connections that connect layers to the past will be the next step in deep learning that makes neural networks more like the brain. Now, to answer the question asked earlier of why the vanishing gradient is a problem in state-of-the-art deep learning (other than the reason that the derivative becomes zero and the model stops learning), the fundamental reason is that the state-of-the-art neural network is sequentially ordered and the gradient needs to backpropagate layer by layer. However, this sequential nature is not at all like the connectivity of neurons in the brain. If there are feedback connections in addition to feedforward skip connections, is the neural network sequential anymore? I would like to call the models shown below a fully connected 'volume', which could be a possible architecture of the next generation of neural networks.
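As a rough illustration only (a toy sketch under my own assumptions, not the fully connected 'volume' itself or any published architecture), a feedback connection could be realized by feeding a later layer's output from the previous time step back into an earlier layer, alongside a forward skip connection:

```python
import torch
import torch.nn as nn

class FeedbackBlock(nn.Module):
    """Toy illustration: layer 1 receives both the current input and a
    feedback signal from layer 2's output at the previous time step,
    while a forward skip connection carries layer 1's output ahead.
    """
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim * 2, dim)  # current input + feedback from layer 2
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.Tanh()

    def forward(self, x, feedback):
        h1 = self.act(self.fc1(torch.cat([x, feedback], dim=-1)))
        h2 = self.act(self.fc2(h1))
        return h2 + h1, h2  # forward skip connection; h2 is fed back next step

block = FeedbackBlock(16)
feedback = torch.zeros(4, 16)     # no feedback at the first time step
for t in range(3):                # unroll the feedback loop over a few time steps
    out, feedback = block(torch.randn(4, 16), feedback)
print(out.shape)  # torch.Size([4, 16])
```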