Deep learning researchers and practitioners have accumulated a significant amount of experience in training a wide variety of architectures on various datasets. However, given a network architecture and a dataset, obtaining the best model (i.e., the model with the smallest test set error) while keeping the training time complexity low is still a challenging task. Hyper-parameters of deep neural networks, especially the learning rate and its (decay) schedule, strongly affect the network's final performance. The general approach is to search for the best learning rate and learning rate decay parameters within a cross-validation framework, a process that usually requires extensive experimentation and considerable time. In classical cross-validation (CV), a random part of the dataset is reserved for evaluating model performance on unseen data. This procedure is usually run multiple times, with random validation sets, to decide on the learning rate settings. In this paper, we explore batch-level cross-validation as an alternative to the classical dataset-level, hence macro, CV. The advantage of batch-level or micro CV methods is that the gradient computed during training is re-used to evaluate several different learning rates. We propose an algorithm based on micro CV and stochastic gradient descent with momentum, which automatically produces a learning rate schedule during training by selecting a learning rate for each epoch. In our algorithm, a random half of the current batch (of examples) is used for training and the other half is used for validating several different step sizes or learning rates. We conducted comprehensive experiments on three datasets (CIFAR10, SVHN and Adience) using three different network architectures (a custom CNN, ResNet and VGG) to compare the performance of our micro-CV algorithm with that of the widely used stochastic gradient descent with momentum in an early-stopping macro-CV setup. The results show that our micro-CV algorithm achieves comparable test accuracy to macro-CV at a much lower computational cost.
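As a rough illustration of the idea described in the abstract (the full algorithm is given in the paper), the sketch below splits each batch in half, re-uses the gradient computed on the training half to score several candidate learning rates on the held-out half, and selects the best-scoring rate for the next epoch. It is a minimal PyTorch sketch: the function name `micro_cv_epoch`, the candidate grid, the plain-SGD trial steps (in place of the paper's momentum update) and the in-batch split are all assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of batch-level (micro) CV for learning-rate selection,
# based only on the abstract's description. Names and details are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def micro_cv_epoch(model, batches, candidate_lrs, current_lr, momentum=0.9):
    """Train for one epoch with `current_lr` and return the candidate rate
    with the lowest accumulated validation loss, to be used next epoch."""
    optimizer = torch.optim.SGD(model.parameters(), lr=current_lr, momentum=momentum)
    val_loss = {lr: 0.0 for lr in candidate_lrs}

    for x, y in batches:
        half = x.size(0) // 2                    # half-batch split (shuffled loader assumed)
        x_tr, y_tr = x[:half], y[:half]          # training half of the batch
        x_va, y_va = x[half:], y[half:]          # validation half of the batch

        # Compute the gradient once on the training half; it is re-used to
        # score every candidate learning rate on the validation half.
        optimizer.zero_grad()
        F.cross_entropy(model(x_tr), y_tr).backward()

        for lr in candidate_lrs:
            trial = copy.deepcopy(model)         # tentative step; original model untouched
            with torch.no_grad():
                for p_t, p in zip(trial.parameters(), model.parameters()):
                    if p.grad is not None:
                        p_t -= lr * p.grad       # plain SGD step for the trial model
                val_loss[lr] += F.cross_entropy(trial(x_va), y_va).item()

        optimizer.step()                         # real momentum step at current_lr

    # Per-epoch selection: the rate whose trial steps generalised best.
    return min(val_loss, key=val_loss.get)

if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Linear(20, 5)                     # toy model and synthetic data
    batches = [(torch.randn(32, 20), torch.randint(0, 5, (32,)))
               for _ in range(10)]
    lr = 0.1
    for epoch in range(3):
        lr = micro_cv_epoch(model, batches, [0.1, 0.03, 0.01], current_lr=lr)
        print(f"epoch {epoch}: learning rate selected for next epoch = {lr}")
```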
deep learning, neural networks, learning rate, hyper-parameter search, adaptive learning rate, cross-validation
Primary Language | English |
---|---|
Subjects | Artificial Intelligence |
Journal Section | Articles |
Authors | |
Publication Date | December 31, 2021 |
Submission Date | May 10, 2021 |
Acceptance Date | November 4, 2021 |
Published in Issue | Year 2021, Volume: 4, Issue: 3 |
The papers in this journal are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License