Research Article

Machine Learning Supported Diabetes Prediction with Apache Spark

Year 2022, Volume: 10 Issue: 3, 1107 - 1117, 31.07.2022
https://doi.org/10.29130/dubited.999048

Abstract

Diabetes is a critical health problem that affects the organs of the human body and is recognized as a global health problem of the 21st century. To avoid the complications that arise from diabetes and to treat it before it worsens, a system that can predict and process diabetes is needed. In recent years, various technological tools and applications have been used for the early diagnosis of many diseases. One such application is analysis for early diagnosis with the help of data mining and machine learning techniques. In this study, diabetes analyses are carried out with Apache Spark, a technology that has recently become very popular in big data processing. The performances of five machine learning classification algorithms from the Apache Spark MLlib library, used for prediction in the analyses, are compared, and the Random Forest (RF) algorithm is found to perform best. The results show that Apache Spark can be used to detect such health problems.
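The comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the file name `diabetes.csv`, the column names (taken from the Kaggle copy of the Pima Indians Diabetes Database), the 80/20 split, the accuracy metric, and the particular five classifiers are all assumptions, since the abstract does not specify them. The helper names `compare_classifiers` and `best_model` are hypothetical.

```python
from typing import Dict

# Assumed column names from the Kaggle Pima Indians Diabetes CSV.
FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]


def compare_classifiers(csv_path: str, seed: int = 42) -> Dict[str, float]:
    """Train five MLlib classifiers and return test accuracy per model."""
    # PySpark is imported lazily so the pure-Python helper below
    # stays usable without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import (
        DecisionTreeClassifier, LinearSVC, LogisticRegression,
        NaiveBayes, RandomForestClassifier,
    )
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    spark = SparkSession.builder.appName("diabetes-sketch").getOrCreate()
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # Pack the feature columns into one vector column; rename the target.
    assembler = VectorAssembler(inputCols=FEATURES, outputCol="features")
    data = assembler.transform(df).withColumnRenamed("Outcome", "label")
    train, test = data.randomSplit([0.8, 0.2], seed=seed)

    # Assumed set of five classifiers; the study's exact choice may differ.
    models = {
        "LogisticRegression": LogisticRegression(),
        "DecisionTree": DecisionTreeClassifier(seed=seed),
        "RandomForest": RandomForestClassifier(seed=seed),
        "NaiveBayes": NaiveBayes(),
        "LinearSVC": LinearSVC(),
    }
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy"
    )
    return {
        name: evaluator.evaluate(clf.fit(train).transform(test))
        for name, clf in models.items()
    }


def best_model(scores: Dict[str, float]) -> str:
    """Return the name of the model with the highest score."""
    return max(scores, key=scores.get)
```

Calling `best_model(compare_classifiers("diabetes.csv"))` would name the top-scoring classifier on one random split. A single split on a small dataset is noisy, so cross-validation or repeated splits would give a fairer comparison.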

References

  • [1] World Health Organization. (2021, June 15). WHO Diabetes Program [Online]. Available: https://www.who.int/health-topics/diabetes
  • [2] Apache Flink. (2021, June 15). Apache Flink [Online]. Available: https://flink.apache.org/
  • [3] Apache Hadoop. (2021, June 15). Apache Hadoop [Online]. Available: https://hadoop.apache.org/
  • [4] Apache Spark. (2021, June 15). Apache Spark [Online]. Available: https://spark.apache.org/
  • [5] J. Han, J.C. Rodriguez, and M. Beheshti, “Discovering decision tree based diabetes prediction model,” in Advances in Software Engineering, 1st ed., Hainan Island, China: Springer, 2008, pp. 99-109.
  • [6] P.S. Kumar, and S. Pranavi, “Performance analysis of machine learning algorithms on diabetes dataset using big data analytics,” International Conference on Infocom Technologies and Unmanned Systems, Dubai, UAE, 2017, pp. 508-513.
  • [7] Q. Zou, K. Qu, Y. Luo, D. Yin, Y. Ju, and H. Tang, “Predicting diabetes mellitus with machine learning techniques,” Frontiers in Genetics, vol. 9, no. 515, pp. 1-10, 2018.
  • [8] N.H. Barakat, A.P. Bradley, and M.N. Barakat, “Intelligible support vector machines for diagnosis of diabetes mellitus,” IEEE Transactions on Information Technology in Biomedicine, vol. 14, no. 4, pp. 1114-1120, 2010.
  • [9] A. Mir, and S.N. Dhage, “Diabetes disease prediction using machine learning on big data of healthcare,” 4th International Conference on Computing Communication Control and Automation, Pune, India, 2018, pp. 1-6.
  • [10] F. Hassan and M.E. Shaheen, “Predicting diabetes from health-based streaming data using social media, machine learning and stream processing technologies,” International Journal of Engineering Research and Technology, vol. 13, no. 8, pp. 1957-1967, 2020.
  • [11] Kaggle. (2021, June 15). Pima Indians Diabetes Database [Online]. Available: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
  • [12] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing,” 9th Symposium on Networked Systems Design and Implementation, California, USA, 2012, pp. 15-28.
  • [13] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D.B. Tsai, M. Amde, S. Owen and D. Xin, “MLlib: machine learning in apache spark,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 1235-1241, 2016.
  • [14] S. Ameer, M.A. Shah, A. Khan, H. Song, C. Maple, S. Islam, and M.N. Asghar, “Comparative analysis of machine learning techniques for predicting air quality in smart cities,” IEEE Access, vol. 7, no. 2019, pp. 128325-128338, 2019.
  • [15] K. Kucuk, C. Bayilmis, A.F. Sonmez, and S. Kacar, “Crowd sensing aware disaster framework design with IoT Technologies,” Journal of Ambient Intelligence and Humanized Computing, vol. 11, no. 4, pp. 1709-1725, 2020.
  • [16] X. Tian, R. Han, L. Wang, G. Lu, and J. Zhan, “Latency critical big data computing in finance,” The Journal of Finance and Data Science, vol. 1, no. 1, pp. 33-41, 2015.
  • [17] L.R. Nair, S.D. Shetty, and S.D. Shetty, “Applying spark based machine learning model on streaming big data for health status prediction,” Computers & Electrical Engineering, vol. 65, pp. 393-399, 2018.
  • [18] M. Alber, “Masterarbeit: big data and machine learning: a case study with bump boost,” Department of Smart Systems and Robotics, Master Thesis, Freie University, Berlin, Germany, 2014.
  • [19] J.K. Basu, D. Bhattacharyya and T.H. Kim, “Use of artificial neural network in pattern recognition,” International Journal of Software Engineering and Its Applications, vol. 4, no. 2, pp. 23-34, 2010.
  • [20] B. E. Boser, I. M. Guyon, and V.N. Vapnik, “A training algorithm for optimal margin classifiers,” 5th Annual ACM Workshop on Computational Learning Theory, Pittsburgh, USA, 1992, pp. 144-152.
  • [21] G. Zhu and D. G. Blumberg, “Classification using ASTER data and SVM algorithms; the case study of Beer Sheva, Israel,” Remote Sensing of Environment, vol. 80, no. 2, pp. 233-240, 2002.
  • [22] D.W. Hosmer Jr, S. Lemeshow and R.X. Sturdivant, “Introduction to the logistic regression model”, Applied Logistic Regression, 3rd ed., New Jersey, USA: John Wiley & Sons, 2013, vol. 398, pp. 1-35.
  • [23] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
  • [24] P. Langley, W. Iba, and K. Thompson, “An analysis of bayesian classifiers,” Proceedings of The Tenth National Conference on Artificial Intelligence, California, USA, 1992, pp. 223-228.

Details

Primary Language: Turkish
Subjects: Engineering
Section: Articles
Authors

Emre Yıldırım 0000-0002-9072-9780

Ali Çalhan 0000-0002-5798-3103

Publication Date: July 31, 2022
Published in Issue: Year 2022 Volume: 10 Issue: 3

Cite

APA Yıldırım, E., & Çalhan, A. (2022). Apache Spark ile Makine Öğrenmesi Destekli Diyabet Rahatsızlığı Tahmini. Düzce Üniversitesi Bilim Ve Teknoloji Dergisi, 10(3), 1107-1117. https://doi.org/10.29130/dubited.999048
AMA Yıldırım E, Çalhan A. Apache Spark ile Makine Öğrenmesi Destekli Diyabet Rahatsızlığı Tahmini. DÜBİTED. July 2022;10(3):1107-1117. doi:10.29130/dubited.999048
Chicago Yıldırım, Emre, and Ali Çalhan. “Apache Spark ile Makine Öğrenmesi Destekli Diyabet Rahatsızlığı Tahmini”. Düzce Üniversitesi Bilim Ve Teknoloji Dergisi 10, no. 3 (July 2022): 1107-17. https://doi.org/10.29130/dubited.999048.
EndNote Yıldırım E, Çalhan A (July 1, 2022) Apache Spark ile Makine Öğrenmesi Destekli Diyabet Rahatsızlığı Tahmini. Düzce Üniversitesi Bilim ve Teknoloji Dergisi 10 3 1107–1117.
IEEE E. Yıldırım and A. Çalhan, “Apache Spark ile Makine Öğrenmesi Destekli Diyabet Rahatsızlığı Tahmini”, DÜBİTED, vol. 10, no. 3, pp. 1107–1117, 2022, doi: 10.29130/dubited.999048.
ISNAD Yıldırım, Emre - Çalhan, Ali. “Apache Spark ile Makine Öğrenmesi Destekli Diyabet Rahatsızlığı Tahmini”. Düzce Üniversitesi Bilim ve Teknoloji Dergisi 10/3 (July 2022), 1107-1117. https://doi.org/10.29130/dubited.999048.
JAMA Yıldırım E, Çalhan A. Apache Spark ile Makine Öğrenmesi Destekli Diyabet Rahatsızlığı Tahmini. DÜBİTED. 2022;10:1107–1117.
MLA Yıldırım, Emre and Ali Çalhan. “Apache Spark ile Makine Öğrenmesi Destekli Diyabet Rahatsızlığı Tahmini”. Düzce Üniversitesi Bilim Ve Teknoloji Dergisi, vol. 10, no. 3, 2022, pp. 1107-17, doi:10.29130/dubited.999048.
Vancouver Yıldırım E, Çalhan A. Apache Spark ile Makine Öğrenmesi Destekli Diyabet Rahatsızlığı Tahmini. DÜBİTED. 2022;10(3):1107-17.