English Abstract
Risk score models are linear classifiers that let users make quick predictions by performing simple arithmetic operations on features. They are widely used in applications where humans ultimately make the decisions, because they are easy to understand and evaluate. Despite their widespread real-world use, risk scores are still constructed using a combination of statistical methods, heuristics, and expert judgment. Such methods can reduce performance and make it difficult for practitioners to develop risk scores that are both practical and accepted.
This thesis develops an interpretable classification method that builds on the RiskSlim approach, combining feature selection, small integer coefficients, and operational constraints. The resulting model is highly interpretable and simple to score: it requires neither a computer nor a calculator. The proposed method adds three capabilities. First, each feature coefficient has its own independently adjustable tuning parameter. Second, the cut-off points used to binarize features are selected automatically. Third, feature dimensionality reduction is performed in the preprocessing stage.
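As a minimal sketch of how a risk score of this kind might be applied, the Python example below binarizes raw features at cut-off points, sums small integer points, and maps the total to a probability with the logistic function. The feature names, cut-offs, point values, and intercept are hypothetical placeholders chosen for illustration; they are not taken from the models developed in this thesis.

```python
import math

# Hypothetical cut-off points for binarizing raw features (illustrative only).
CUTOFFS = {"age": 60, "temperature_c": 38.0}

# Hypothetical small integer points per binarized feature (illustrative only).
POINTS = {"age": 2, "temperature_c": 3, "loss_of_smell": 4}

# Hypothetical integer intercept (illustrative only).
INTERCEPT = -5


def binarize(record):
    """Map raw values to 0/1 indicators.

    Features with a cut-off become 1 when value >= cut-off;
    the remaining features are treated as yes/no answers.
    """
    binary = {}
    for name in POINTS:
        value = record[name]
        if name in CUTOFFS:
            binary[name] = 1 if value >= CUTOFFS[name] else 0
        else:
            binary[name] = 1 if value else 0
    return binary


def total_score(record):
    """Add up the points of every indicator that is switched on."""
    binary = binarize(record)
    return sum(POINTS[name] * binary[name] for name in POINTS)


def predicted_risk(score):
    """Convert an integer score to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-(INTERCEPT + score)))


if __name__ == "__main__":
    person = {"age": 65, "temperature_c": 37.2, "loss_of_smell": True}
    score = total_score(person)
    print(score, round(predicted_risk(score), 3))
```

Because the points are small integers and each feature is a yes/no indicator, the total can be added up by hand; only the final conversion to a probability needs a lookup table or a calculator.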
To validate the proposed method, two case studies are presented: predicting COVID-19 diagnosis, and predicting employee satisfaction with IT services at Mobarakeh Steel Company in Isfahan. Compared with RiskSlim, the proposed method reduced computational cost by a factor of approximately 13 and became resilient to incorrect data and able to protect sensitive information, all without losing performance. COVID-19 shares symptoms with other diseases such as influenza and the common cold, which can make diagnosis from self-report questionnaires difficult. Even so, the proposed risk score performed well on the test set, achieving an AUROC of 82% and an AUPRC of 27%. For comparison, penalized logistic regression achieved an AUROC of 85% and an AUPRC of 22%, and random forest achieved an AUROC of 86% and an AUPRC of 23%. The AUROC of the proposed method was therefore slightly lower than that of the other two models, but its AUPRC was higher. In addition, the method can fully automate the selection of operational constraints, a step otherwise carried out by domain experts, which helps it perform well in terms of both accuracy and speed.
The results indicate that risk scores developed with this method can be used successfully for the early identification of COVID-19 cases, helping companies take appropriate measures to prevent the spread of the virus among their employees. The results also show that the proposed method can produce risk scores that are both practical and accurate. Beyond predicting and diagnosing COVID-19, the method can serve as a tool for addressing other health and work-related challenges. The evaluation shows that the proposed method is competitive in accuracy and speed with well-known machine learning methods such as random forests and penalized logistic regression, as well as with the optimal RiskSlim method.