An Empirical Study on Switching Activation Functions in Shallow and Deep Neural Networks

Authors: Apoorva Vinod, Archana Mathur, Snehanshu Saha

Abstract:

Although a plethora of Activation Functions (AFs) is used in single- and multiple-hidden-layer Neural Networks (NN), their behavior, whether used singly or in combination, has always raised curiosity. The popular AFs, Sigmoid, ReLU, and Tanh, have performed prominently well in both shallow and deep architectures. Most of the time, a single AF is used throughout a multi-layered NN, and, to the best of our knowledge, network performance has never been studied and analyzed in depth when two AFs are used in combination. In this manuscript, we experiment on multi-layered NN architectures (both shallow and deep: Convolutional NN and VGG16) and investigate how well the network responds when two different AFs (Sigmoid-Tanh, Tanh-ReLU, ReLU-Sigmoid) are applied alternately, compared against the traditional single-AF combinations (Sigmoid-Sigmoid, Tanh-Tanh, ReLU-ReLU). Our results show that, when two different AFs are used, the network achieves better accuracy, substantially lower loss, and faster convergence on 4 computer vision (CV) and 15 non-CV (NCV) datasets. Not only was the accuracy greater by 6-7%, but convergence was also twice as fast. We present a case study investigating the probability of networks suffering vanishing and exploding gradients when two different AFs are used. Additionally, we show theoretically that a composition of two or more AFs satisfies the Universal Approximation Theorem (UAT).
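The switching scheme described above can be illustrated with a minimal PyTorch sketch (not the authors' implementation): hidden layers cycle through two different AFs instead of repeating a single one. The layer widths, depth, input/output dimensions, and the Tanh-ReLU pairing below are assumptions chosen only to show the wiring; the same constructor with a single activation gives the traditional baseline, and the same idea carries over to the convolutional and VGG16 experiments by alternating the activation applied after successive layers.

import torch
import torch.nn as nn

def make_mlp(in_dim, hidden_dims, out_dim, activations):
    """Build an MLP whose hidden layers cycle through the given activation classes."""
    layers, prev = [], in_dim
    for i, width in enumerate(hidden_dims):
        layers.append(nn.Linear(prev, width))
        # Alternate activations layer by layer, e.g. Tanh, ReLU, Tanh, ReLU, ...
        layers.append(activations[i % len(activations)]())
        prev = width
    layers.append(nn.Linear(prev, out_dim))
    return nn.Sequential(*layers)

# Switching network (Tanh-ReLU) versus a traditional single-AF baseline (ReLU-ReLU).
switching_net = make_mlp(20, [64, 64, 64, 64], 2, [nn.Tanh, nn.ReLU])
baseline_net  = make_mlp(20, [64, 64, 64, 64], 2, [nn.ReLU])

x = torch.randn(8, 20)           # dummy batch: 8 samples, 20 features
print(switching_net(x).shape)    # torch.Size([8, 2])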

Keywords: Activation Function, Universal Approximation Theorem, Neural Networks, Convergence.

