Prediction of water quality for the Selangor rivers using data mining approach


Citation

Nurain Ibrahim, Hezlin Aryani Abd Rahman, Ahmad Arifuddin Azran, Muhammad Aiman Mohd Faddillah and Muhammad Adhwa’ Qayyum Mohd Qamarudin (2023) Prediction of water quality for the Selangor rivers using data mining approach. Journal of Sustainability Science and Management (Malaysia), 18 (9). pp. 171-183. ISSN 2672-7226

Abstract

Few studies have used the data mining approach to assess water quality, especially for the Selangor rivers. This study assesses water quality using data mining techniques and identifies the most significant variables that affect it. The machine learning techniques used are Decision Tree (Gini), Decision Tree (Entropy), Logistic Regression (Enter, Backward Elimination and Forward Selection) and Artificial Neural Network with 4 and 8 hidden nodes. The study found that Logistic Regression Enter is the best model, since it neither underfits nor overfits, with sensitivity, specificity, accuracy, mean squared error and misclassification rate values of 92.51%, 97.45%, 96.36%, 0.028 and 3.64%, respectively. Two other models also performed well: Decision Tree (Gini) and Artificial Neural Network with 4 hidden nodes. According to the variable importance output of Decision Tree (Gini), the variable with the greatest effect on water quality is Biochemical Oxygen Demand (BOD), with the highest importance value of 0.2284, followed by Chemical Oxygen Demand, with a value of 0.1471.
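
The evaluation measures named in the abstract follow from a binary confusion matrix. The sketch below shows the standard definitions; it is not the authors' code, and the function name `classification_metrics` and the counts passed to it are hypothetical values chosen only for illustration, not data from the study. It also makes explicit that the misclassification rate is the complement of accuracy, which matches the reported 96.36% and 3.64%.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary-classification measures from confusion-matrix counts.

    tp/fn: positive cases predicted correctly / incorrectly
    tn/fp: negative cases predicted correctly / incorrectly
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "accuracy": accuracy,
        "misclassification_rate": 1 - accuracy,
    }


# Hypothetical counts for illustration only (not taken from the paper).
metrics = classification_metrics(tp=90, fn=10, tn=380, fp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Mean squared error is not included here because it is computed from predicted probabilities rather than from the confusion-matrix counts.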



Additional Metadata

Item Type: Article
AGROVOC Term: rivers
AGROVOC Term: water quality
AGROVOC Term: biochemical oxygen demand
AGROVOC Term: chemical oxygen demand
AGROVOC Term: data mining
AGROVOC Term: machine learning
AGROVOC Term: statistical methods
AGROVOC Term: research
AGROVOC Term: environmental monitoring
Geographical Term: Malaysia
Uncontrolled Keywords: Water quality, decision tree, logistic regression, Artificial Neural Network
Depositing User: Nor Hasnita Abdul Samat
Date Deposited: 21 May 2025 06:16
Last Modified: 21 May 2025 06:16
URI: http://webagris.upm.edu.my/id/eprint/1931
