We often use the tanh activation function in an RNN or LSTM. However, we usually cannot use ReLU in these models. Why? In this tutorial, we will explain the reason.
As for an RNN, the hidden state \(h_t\) can be defined as:
\[h_t = f(Wx_t+Uh_{t-1}+b)\]
Here \(f\) is the activation function. If \(f\) is ReLU, whose output is unbounded, \(h_t\) may grow very large, because the hidden state is fed back through \(U\) at every time step.
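To see this effect, here is a minimal NumPy sketch that unrolls the formula above with tanh and with ReLU and prints the norm of \(h_t\) at the last step. The sizes, random weights, and scales below are hypothetical and chosen only for illustration: with tanh the hidden state stays bounded, while with ReLU it can blow up.

```python
import numpy as np

np.random.seed(0)

hidden_size = 64   # hypothetical size, for illustration only
steps = 30

# Random weights following the formula h_t = f(W x_t + U h_{t-1} + b).
# U is scaled a little large on purpose so the effect is easy to see.
W = np.random.randn(hidden_size, hidden_size) / np.sqrt(hidden_size)
U = np.random.randn(hidden_size, hidden_size) * 3.0 / np.sqrt(hidden_size)
b = np.zeros(hidden_size)

def run_rnn(f):
    """Unroll the recurrence and return the norm of h_t at each step."""
    h = np.zeros(hidden_size)
    norms = []
    for _ in range(steps):
        x_t = np.random.randn(hidden_size)      # random input at each time step
        h = f(W @ x_t + U @ h + b)              # h_t = f(W x_t + U h_{t-1} + b)
        norms.append(np.linalg.norm(h))
    return norms

relu = lambda z: np.maximum(z, 0.0)

print("final |h_t| with tanh:", run_rnn(np.tanh)[-1])   # bounded: at most sqrt(hidden_size)
print("final |h_t| with relu:", run_rnn(relu)[-1])      # typically grows by orders of magnitude
```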
In the paper:
A Simple Way to Initialize Recurrent Networks of Rectified Linear Units
we can also find this sentence:
At first sight, ReLUs seem inappropriate for RNNs because they can have very large outputs so they might be expected to be far more likely to explode than units that have bounded values.
However, if the recurrent weight matrix \(U\) is initialized to the identity matrix and the biases to zero, ReLU can be used in an RNN. This is the IRNN proposed in that paper.
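Here is a minimal PyTorch sketch of that initialization, assuming a plain nn.RNN with ReLU activation and hypothetical sizes (this is an illustration of the identity initialization, not the paper's full training setup):

```python
import torch
import torch.nn as nn

input_size, hidden_size = 32, 64        # hypothetical sizes, for illustration only

# A vanilla RNN with ReLU activation, initialized in the spirit of the paper:
# recurrent weight matrix U = identity, biases = zero (IRNN-style initialization).
rnn = nn.RNN(input_size, hidden_size, nonlinearity='relu', batch_first=True)

with torch.no_grad():
    rnn.weight_hh_l0.copy_(torch.eye(hidden_size))  # U <- identity matrix
    rnn.bias_hh_l0.zero_()                          # recurrent bias <- 0
    rnn.bias_ih_l0.zero_()                          # input bias <- 0
    # weight_ih_l0 (W) keeps PyTorch's default random initialization

x = torch.randn(8, 100, input_size)                 # batch of 8 sequences, 100 time steps
output, h_n = rnn(x)
print(output.shape, h_n.shape)                      # (8, 100, 64) and (1, 8, 64)
```

With this initialization, \(h_t = \mathrm{ReLU}(Wx_t + h_{t-1} + b)\) roughly copies the previous hidden state forward when the input signal is small, so the activations do not explode at the start of training.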