Classification Of Malaria Types Using Naïve Bayes Classification

ABSTRACT


Introduction
Malaria is a disease caused by infection with protozoa of the genus Plasmodium and is readily recognised by symptoms of fever, cold, and recurrent chills (Dinata, 2018). It is one of the most widespread mosquito-borne diseases (Madhusudan, 2020) and is transmitted by various vector species of the genus Anopheles (Alviyanil'Izzah et al., 2021). Malaria remains a threat to public health, especially for people living in remote areas. This is reflected in Presidential Regulation Number 2 of 2015 concerning the National Medium-Term Development Plan for 2015-2019, in which malaria is a priority disease to be overcome. RPJMN IV for 2020-2024 likewise states that the prevalence of major infectious diseases, including malaria, is still high and is accompanied by the threat of emerging diseases due to high population mobility, which affects the degree of public health (Ramadhan & Khoirunnisa, 2021). This commitment to malaria control is expected to be a shared concern nationally, regionally, and globally, as set out at the 60th World Health Assembly (WHA) in Geneva in 2007 on malaria elimination (Prajarini, 2016).
According to the World Health Organization (WHO), malaria can be classified into five types by its causative parasite: Plasmodium falciparum, which causes tropical malaria; Plasmodium vivax, which causes tertian malaria; Plasmodium ovale, which causes ovale malaria; Plasmodium malariae, which causes quartan malaria; and Plasmodium knowlesi, which also causes malaria (Madhusudan, 2020). Malaria is categorised as a disease with serious effects and a reasonably high mortality rate. The WHO recorded 229 million malaria cases and 409,000 deaths in 2019. The areas at risk are mainly in Africa, but Southeast Asia, the Western Pacific, and the Mediterranean are also listed as areas at risk. Each country strives to overcome malaria cases by referring to the comprehensive commitment on malaria elimination made at the 60th World Health Assembly (WHA) in 2007 (Jiang et al., 2021). The objectives of this study are:
1. To determine the accuracy of the naïve Bayes classification method in assigning cases to malaria types.
2. To measure the accuracy and performance of malaria type classification using the naïve Bayes algorithm.
3. To establish whether the naïve Bayes classification method classifies malaria types effectively.

Research Benefits
The results of this research are expected to be useful and to contribute to scientific insight. The benefits of conducting this research are as follows:
1. Easing and assisting the work of medical professionals in classifying types of malaria.
2. Providing information on the level of accuracy achieved when classifying malaria.
3. Adding insight for readers who want to learn naïve Bayes classification.

Research Methods
The researchers use quantitative research, a process based on mathematical calculation, to achieve the desired results. In this case, the dataset is processed with the Naïve Bayes algorithm to identify the most significant malaria-related impacts at each Puskesmas (community health centre) in Irian Jaya.

Nature of Research
The research is experimental in nature. An experiment is conducted to obtain accurate results or parameters by testing the Naïve Bayes algorithm. The accuracy results obtained can be used to support decisions in determining the type of malaria.

Research Approach
The research approach is quantitative, and the researchers carry out the study according to the research stages that have been defined.

Data Collection Methods
The data used in this study were obtained directly from the Darun Nahdla Capita Sharia Cooperative and are private data that have not been used in previous studies. The dataset consists of cooperative customer data from 2020 to 2022, totalling 166 records with 10 variables: gender, marital status, occupation, dependents, income, loan amount, term, interest, instalments, and categories.

Data Analysis Methods
The data analysis method for this study is quantitative and follows the stages of the Knowledge Discovery in Databases (KDD) process, using Microsoft Excel and Orange as software tools, as follows:

Preprocessing Data
The data preprocessing stage cleans duplicate data, missing values, and outliers from the dataset so that the data are valid during processing. At this stage, data transformation is also carried out by identifying variables that contribute no information to the prediction and by converting object-type data into integer form to simplify processing. The following preprocessing steps use Jupyter Notebook with the Python programming language (Lestari et al., 2018).
The first step is to import the libraries used to load and display the dataset, including NumPy and pandas, as shown in the code below.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

The second step is to read the CSV dataset into a data frame with the pd.read_csv function and display it; the code and output are shown in Figure 2.

    filecsv = 'Dataset_Patient_Malaria.CSV'
    teks = pd.read_csv(filecsv, header=0, delimiter=';', encoding='utf-8')
    df = pd.DataFrame(teks)
    print(df)
    df.head()

The third step deletes the columns that are not needed for the next process.

    columns = ['No.', 'Provinsi ', 'Kabupaten', 'Fasyankes', 'Nama Pasien']
    # drop returns a new data frame (with inplace=True it would return None)
    df = df.drop(columns=columns)
    list(df.columns)

After the unneeded columns are deleted, the following columns are used for the next process: type of discovery, number, month/year, gender, pregnant / not pregnant, hamlet address, village/kelurahan, type of parasite, symptoms1, symptoms2, symptoms3, symptoms4, symptoms5, symptoms6, symptoms7, symptoms8, symptoms9, symptoms10, livestock sheds, leaving the house at night, use of mosquito repellent, ventilation gauze, puddles, history of living in endemic areas, use of mosquito nets, walls, condition of the house ceiling, mosquito breeding grounds, air temperature (°C), humidity (%), rainfall (mm), and malaria diagnosis (Shofia, Putri, & Arwan, 2017).
The fourth step separates the variables into categorical and numerical variables using the following code:

    # define the categorical variables
    categorical = [var for var in df.columns if df[var].
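The cleaning and transformation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook used in the study: the small data frame, its column names, the `preprocess` helper, and the choice of mode/median filling are all hypothetical, and outlier handling is left out for brevity.

```python
import pandas as pd

# Hypothetical records; column names and values are illustrative only.
raw = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "M"],
    "humidity": [80.0, 75.0, 75.0, None, 82.0],
    "diagnosis": ["Tropica", "Tertiana", "Tertiana", "Ovale", "Tropica"],
})

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, fill missing values, and encode text columns as integers."""
    out = df.drop_duplicates().reset_index(drop=True)
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Assumption: missing numbers are filled with the column median.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Assumption: missing categories take the most frequent value,
            # then each category is mapped to an integer code.
            out[col] = out[col].fillna(out[col].mode()[0])
            out[col] = out[col].astype("category").cat.codes
    return out

clean = preprocess(raw)
print(clean)
```

Integer codes produced this way follow the alphabetical order of the categories, so they carry no meaningful ranking; they only make the columns usable by numeric routines.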

The missing values in each group of variables are then counted (Figures 3 and 4):

    df[categorical].isnull().sum()
    df[numerical].isnull().sum()

Next, the independent and dependent variables of the dataset are defined. The independent variables (predictors) selected are type of discovery, number, month/year, gender, pregnant / not pregnant, hamlet address, village/kelurahan, type of parasite, symptom1, symptom2, symptom3, symptom4, symptom5, symptom6, symptom7, symptom8, symptom9, symptom10, livestock shed, leaving the house at night, use of mosquito repellent, ventilation gauze, puddles, history of living in endemic areas, use of mosquito nets, walls, condition of the house ceiling, mosquito breeding site, air temperature (°C), humidity (%), and rainfall (mm). The dependent (target) variable is the malaria diagnosis. The iloc method is used to select both sets of variables by column index. Here X contains all independent variables and y contains the dependent, or target, variable. The code and output can be seen in Figures 5 and 6. The output shows that X consists of 31 independent variables, while the dependent variable y is the malaria diagnosis.
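One way to build the two variable lists and count their missing values is sketched below. It assumes that every non-numeric column is treated as categorical; the data frame and its column names are hypothetical stand-ins for the study's dataset.

```python
import pandas as pd

# Hypothetical frame; the real dataset's columns are not reproduced here.
df = pd.DataFrame({
    "gender": ["M", "F", None],
    "symptom1": ["fever", "chills", "fever"],
    "humidity": [80.0, None, 82.0],
    "rainfall": [120, 95, 110],
})

# Assumption: numeric dtypes are numerical, everything else is categorical.
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()

# Count the missing values per column in each group.
missing_categorical = df[categorical].isnull().sum()
missing_numerical = df[numerical].isnull().sum()
print(missing_categorical)
print(missing_numerical)
```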
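Selecting the predictors and the target by column position with iloc can be sketched as follows. The sketch assumes the target column (malaria diagnosis) is the last column of the data frame; the frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame with the target in the last column.
df = pd.DataFrame({
    "temperature": [30.1, 28.4, 31.0],
    "humidity": [80, 75, 82],
    "diagnosis": ["Tropica", "Tertiana", "Ovale"],
})

# Assumption: the target (malaria diagnosis) is the last column.
X = df.iloc[:, :-1]  # all rows, every column except the last (predictors)
y = df.iloc[:, -1]   # all rows, last column only (target)

print(X.shape, y.name)
```

Because iloc works on positions rather than names, the selection silently changes if the column order changes, so it is worth checking `X.columns` and `y.name` after slicing.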

Correlation of the independent variable to the dependent variable
The correlation between the independent variables and the dependent variable is computed to determine how much influence each independent variable (predictor) has on the dependent (target) variable (Shofia et al., 2017). The correlation of each independent variable with the target can be seen in Table 1. Based on Table 1, the variables discovery type, month/year, gender, hamlet address, village, symptom 1, symptom 2, symptom 5, symptom 7, symptom 9, symptom 10, livestock sheds, leaving the house at night, use of mosquito repellent, ventilation gauze, puddles, history of living in endemic areas, use of mosquito nets, mosquito breeding sites, and rainfall (mm) do not affect the dependent (target) variable. The correlation values calculated for these variables are negative, so they are considered not to have a strong influence on the target (Setiawan & Prihandono, 2019).
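A correlation screen of this kind might be computed as in the sketch below. It assumes the data are already integer-encoded and uses the Pearson coefficient, which may differ from the measure used for Table 1; the values and column names are hypothetical.

```python
import pandas as pd

# Hypothetical, already integer-encoded data (values are illustrative only).
df = pd.DataFrame({
    "temperature": [1, 2, 3, 4, 5],
    "humidity": [5, 4, 3, 2, 1],
    "rainfall": [2, 2, 2, 3, 2],
    "diagnosis": [0, 1, 1, 2, 3],
})

target = "diagnosis"
# Pearson correlation of each predictor with the target.
corr = df.drop(columns=target).corrwith(df[target])
# Order the predictors from strongest to weakest absolute correlation.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

The sketch ranks by absolute value because the sign of a correlation coefficient gives the direction of the relationship, while its magnitude gives the strength.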

Model Testing
The model tested on the research dataset is the naïve Bayes algorithm. Model testing produces the classification report of the model, which gives the values of classification evaluation metrics such as precision, recall, F1-score, and accuracy.
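To make these metrics concrete, the sketch below computes precision, recall, F1-score, and support for each class directly from lists of true and predicted labels. The `per_class_metrics` helper and the five example labels are hypothetical and are not taken from the study's data.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} from two label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # class p was predicted but is wrong
            fn[t] += 1      # an instance of class t was missed
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]   # everything predicted as label
        actual = tp[label] + fn[label]      # everything that truly is label
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0  # harmonic mean
        report[label] = (precision, recall, f1, actual)
    return report

# Hypothetical labels, for illustration only.
y_true = ["Ovale", "Ovale", "Tertiana", "Tertiana", "Tropica"]
y_pred = ["Ovale", "Tertiana", "Tertiana", "Tertiana", "Tropica"]

report = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
for label, (precision, recall, f1, support) in report.items():
    print(f"{label:10s} {precision:.2f} {recall:.2f} {f1:.2f} {support}")
print(f"accuracy   {accuracy:.2f}")
```

In this toy example a single Ovale record predicted as Tertiana lowers the recall of Ovale and the precision of Tertiana at the same time, which is how one misclassification shows up in two rows of a classification report.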

Naïve Bayes Algorithm Model Testing
Testing on datasets is carried out using the Naïve Bayes algorithm to determine the classification report and accuracy in making classifications or predictions.The following testing process uses Jupyter Notebook software with Python programming language.
The naïve Bayes algorithm is first tested with a 90/10 division between training and testing data; the code and output results can be seen below.

    from sklearn import metrics
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    # Hold out 10% of the data for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=0)
    # Assumption: gnb is a scikit-learn GaussianNB fitted on the training split
    gnb = GaussianNB().fit(X_train, y_train)
    y_pred = gnb.predict(X_test)

    cr1 = classification_report(y_test, y_pred)
    akurasi = metrics.accuracy_score(y_test, y_pred)
    print(cr1)
    print('The value of accuracy possessed by the model: %0.2f %%' % (akurasi * 100))

The results can be explained as follows. Precision is the ratio of correctly predicted positive observations to all predicted positive observations. The precision for the Ovale, Quartana, and Tropica classes is 1.00, which means every record predicted as one of these classes is correct. The precision for the Tertiana class is 0.98, which means that 98% of the records predicted as Tertiana are actually Tertiana. Recall is the ratio of correctly predicted positive observations to all actual positives. The recall for the Ovale, Tertiana, and Tropica classes is 1.00, indicating that the model correctly identifies all instances of those classes. The recall for the Quartana class is lower, at about 0.86: an accuracy of 99.22% on 129 test records corresponds to a single misclassification, a Quartana record predicted as Tertiana, so 6 of the 7 Quartana records are captured. The F1-score is the harmonic mean of precision and recall and ranges from 0 to 1, where 1 is the best value. The F1-score is 1.00 for the Ovale and Tropica classes, 0.99 for the Tertiana class, and 0.92 for the Quartana class, so the balance between precision and recall is somewhat lower for the Quartana and Tertiana classes than for the Ovale and Tropica classes. Support indicates the actual number of class occurrences in the test data. There are 5 Ovale, 7 Quartana, 61 Tertiana, and 56 Tropica records. The overall accuracy is 99.22%, representing the ratio of correctly predicted records to all records. Overall, the model performs well, especially for the Ovale, Tertiana, and Tropica classes, which achieve high precision and recall. For the Quartana class, the precision is perfect but the recall is lower, showing some difficulty in capturing all the records of the Quartana class (Shen & Shafiq, 2020).
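A self-contained version of the 90/10 test might look like the sketch below. It is run on synthetic data rather than the study's dataset, and it assumes the model is scikit-learn's `GaussianNB`, which the variable name `gnb` suggests but the text does not state; the feature values, class sizes, and seed are all hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: 4 classes, 50 records each, 4 numeric features.
rng = np.random.default_rng(0)
classes = ["Ovale", "Quartana", "Tertiana", "Tropica"]
X = np.vstack([rng.normal(loc=3.0 * i, scale=1.0, size=(50, 4)) for i in range(4)])
y = np.repeat(classes, 50)

# 90% training data, 10% testing data, as in the first test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0)

# Assumption: Gaussian naive Bayes; the model must be fitted before predicting.
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

akurasi = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred, zero_division=0))
print("Accuracy: %0.2f %%" % (akurasi * 100))
```

Fixing `random_state` makes the split reproducible, which matters when results from several split ratios are compared.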
The naïve Bayes algorithm is then tested with an 80/20 data division; the code and output results can be seen below. The results for 80% training and 20% testing data can be explained as follows. The precision for the Ovale class is 0.95, which means that 95% of the records predicted as Ovale are actually Ovale. The precision for the Quartana, Tertiana, and Tropica classes is 1.00, meaning every record predicted as one of these classes is correct. The recall for the Ovale, Quartana, and Tropica classes is 1.00, indicating that the model correctly identifies all instances of those classes. The recall for the Tertiana class is 0.99, which means the model captures 99% of the actual instances of the Tertiana class. The F1-score is the harmonic mean of precision and recall and ranges from 0 to 1, where 1 is the best value. The F1-score for the Quartana, Tertiana, and Tropica classes is 1.00, reflecting a good balance between precision and recall for these classes. The F1-score for the Ovale class is 0.97, indicating a somewhat lower balance between precision and recall than for the other three classes. Support indicates the actual number of class occurrences in the test data. There are 18 Ovale, 15 Quartana, 118 Tertiana, and 106 Tropica records. The overall accuracy is 99.61%, representing the ratio of correctly predicted records to all records. Overall, the model performs well, especially for the Tropica and Quartana classes, which achieve high precision and recall. For the Tertiana class, the precision is perfect but the recall is slightly lower, showing some difficulty in capturing all the records of the Tertiana class.
The naïve Bayes algorithm is then tested with a 70/30 data division; the code and output results can be seen below.

    from sklearn import metrics
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    # Hold out 30% of the data for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
    # Assumption: gnb is a scikit-learn GaussianNB fitted on the training split
    gnb = GaussianNB().fit(X_train, y_train)
    y_pred = gnb.predict(X_test)

    cr1 = classification_report(y_test, y_pred)
    akurasi = metrics.accuracy_score(y_test, y_pred)
    print(cr1)
    print('The value of accuracy possessed by the model: %0.2f %%' % (akurasi * 100))

The results for 70% training and 30% testing data can be explained as follows. The precision for the Ovale class is 0.96, which means 96% of the records predicted as Ovale are actually Ovale. The precision for the Quartana, Tertiana, and Tropica classes is 1.00, meaning every record predicted as one of these classes is correct. The recall for the Quartana and Tropica classes is 1.00, indicating that the model correctly identifies all instances of those classes. The recall for the Tertiana class is 0.99, which means the model captures 99% of the actual instances of the Tertiana class. The F1-score is the harmonic mean of precision and recall and ranges from 0 to 1, where 1 is the best value. The F1-score for the Quartana, Tertiana, and Tropica classes is 1.00, reflecting a good balance between precision and recall for these classes. Support indicates the actual number of class occurrences in the test data. There are 27 Ovale, 18 Quartana, 178 Tertiana, and 163 Tropica records. The overall accuracy is 99.74%, representing the ratio of correctly predicted records to all records. The model performs well, especially for the Quartana and Tropica classes, where high precision and recall are achieved. For the Tertiana class, the precision is perfect but the recall is slightly lower, showing some difficulty in capturing all the records of the Tertiana class.
The naïve Bayes algorithm is then tested with a 60/40 data division; the code and output results can be seen below.

    from sklearn import metrics
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    # Hold out 40% of the data for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=0)
    # Assumption: gnb is a scikit-learn GaussianNB fitted on the training split
    gnb = GaussianNB().fit(X_train, y_train)
    y_pred = gnb.predict(X_test)

    cr1 = classification_report(y_test, y_pred)
    akurasi = metrics.accuracy_score(y_test, y_pred)
    print(cr1)
    print('The value of accuracy possessed by the model: %0.2f %%' % (akurasi * 100))

Output:

                  precision    recall  f1-score   support

           Ovale       0.97      1.00      0.99        36
        Quartana       1.00      1.00      1.00        23
        Tertiana       1.00      1.00      1.00       242
         Tropica       1.00      1.00      1.00       213

        accuracy                           1.00       514
       macro avg       0.99      1.00      1.00       514
    weighted avg       1.00      1.00      1.00       514

    The accuracy value possessed by the model is 99.81 %

The results for 60% training and 40% testing data can be explained as follows. The precision for the Ovale class is 0.97, which means 97% of the records predicted as Ovale are actually Ovale. The precision for the Quartana, Tertiana, and Tropica classes is 1.00, meaning every record predicted as one of these classes is correct. The recall for the Ovale, Quartana, Tertiana, and Tropica classes is 1.00, indicating that the model correctly identifies all instances of those classes. The F1-score is the harmonic mean of precision and recall and ranges from 0 to 1, where 1 is the best value. The F1-score for the Quartana, Tertiana, and Tropica classes is 1.00, reflecting a good balance between precision and recall for these classes. Support indicates the actual number of class occurrences in the test data. There are 36 Ovale, 23 Quartana, 242 Tertiana, and 213 Tropica records. The overall accuracy is 99.81%, representing the ratio of correctly predicted records to all records. Overall, the model performs well, especially for the Quartana, Tertiana, and Tropica classes, which achieve high precision and recall.
The naïve Bayes algorithm is finally tested with a 50/50 data division; the code and output results can be seen below.

    from sklearn import metrics
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    # Hold out 50% of the data for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=0)
    # Assumption: gnb is a scikit-learn GaussianNB fitted on the training split
    gnb = GaussianNB().fit(X_train, y_train)
    y_pred = gnb.predict(X_test)

    cr1 = classification_report(y_test, y_pred)
    akurasi = metrics.accuracy_score(y_test, y_pred)
    print(cr1)
    print('The value of accuracy possessed by the model: %0.2f %%' % (akurasi * 100))

The results for 50% training and 50% testing data can be explained as follows. The precision for the Ovale class is 0.98, which means that 98% of the records predicted as Ovale are actually Ovale. The precision for the Quartana, Tertiana, and Tropica classes is 1.00, meaning every record predicted as one of these classes is correct. The recall for the Ovale, Quartana, Tertiana, and Tropica classes is 1.00, indicating that the model correctly identifies all instances of those classes. The F1-score is the harmonic mean of precision and recall and ranges from 0 to 1, where 1 is the best value. The F1-score for the Quartana, Tertiana, and Tropica classes is 1.00, reflecting a good balance between precision and recall for these classes. Support indicates the actual number of class occurrences in the test data. There are 44 Ovale, 31 Quartana, 299 Tertiana, and 268 Tropica records. The overall accuracy is 99.84%, representing the ratio of correctly predicted records to all records. Overall, the model performs well, especially for the Quartana, Tertiana, and Tropica classes, which achieve high precision and recall.
Based on the classification reports, the Naïve Bayes algorithm performs better on the Quartana, Tertiana, and Tropica categories than on the Ovale category, because their precision, recall, and F1-score values are higher than those of the Ovale category. The highest accuracy, 99.84%, was obtained in the fifth test with the 50/50 data division. More details can be seen in Table 2.
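The comparison across the five split ratios can be sketched as a single loop. As before, this is an illustration on synthetic data with an assumed `GaussianNB` model; the data, seed, and resulting accuracies are hypothetical and do not reproduce Table 2.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: 4 classes, 50 records each, 4 numeric features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=3.0 * i, scale=1.0, size=(50, 4)) for i in range(4)])
y = np.repeat(["Ovale", "Quartana", "Tertiana", "Tropica"], 50)

results = {}
for test_size in (0.10, 0.20, 0.30, 0.40, 0.50):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0)
    # Fit a fresh model on each training split before scoring it.
    model = GaussianNB().fit(X_train, y_train)
    results[test_size] = accuracy_score(y_test, model.predict(X_test))

for test_size, acc in results.items():
    train_pct = round((1 - test_size) * 100)
    test_pct = round(test_size * 100)
    print("%d/%d split: %.2f %%" % (train_pct, test_pct, acc * 100))

# Split ratio with the highest test accuracy.
best = max(results, key=results.get)
print("best test_size:", best)
```

Because the test sets differ in size from one ratio to the next, a small difference in accuracy between ratios can correspond to a single misclassified record, so it helps to read the accuracies together with the support counts.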

Evaluation
At this stage, the Naïve Bayes algorithm is evaluated using the confusion matrix and the Receiver Operating Characteristic (ROC) curve to assess the model's performance, using Jupyter Notebook with the Python programming language.
Based on the confusion matrix evaluation, the accuracy of the naïve Bayes model is 0.992 and the classification error is 0.008. The model is then evaluated with the ROC curve, which measures the performance of a classification model visually by relating the True Positive Rate to the False Positive Rate and so gives a general picture of the model's performance.
Based on the figure above, the naïve Bayes algorithm achieves an Area Under the Curve (AUC) of 0.976, which falls in the excellent classification category.
Based on the confusion matrix evaluation, the accuracy of the naïve Bayes model is 0.998 and the classification error is 0.002. The model is then evaluated with the ROC curve, which measures the performance of a classification model visually by relating the True Positive Rate to the False Positive Rate and so gives a general picture of the model's performance.
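The evaluation step can be sketched as follows: accuracy and classification error are read off the confusion matrix, and a multi-class AUC is computed from the predicted class probabilities. The one-vs-rest averaging used here is an assumption, since the text does not say how the four-class ROC was computed; the data and the `GaussianNB` model are again synthetic stand-ins.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: 4 classes, 50 records each, 4 numeric features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=2.0 * i, scale=1.0, size=(50, 4)) for i in range(4)])
y = np.repeat(["Ovale", "Quartana", "Tertiana", "Tropica"], 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=0)
gnb = GaussianNB().fit(X_train, y_train)
y_pred = gnb.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=gnb.classes_)
accuracy = np.trace(cm) / cm.sum()   # correct predictions lie on the diagonal
error = 1.0 - accuracy               # classification error

# Assumption: one-vs-rest AUC averaged over the four classes.
auc = roc_auc_score(y_test, gnb.predict_proba(X_test), multi_class="ovr")

print(cm)
print("accuracy = %.3f, error = %.3f, AUC = %.3f" % (accuracy, error, auc))
```

Accuracy and classification error always sum to 1, which matches the pairs reported above (0.992 with 0.008, and 0.998 with 0.002).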
Table 3 below shows the results of the performance evaluation of the Naïve Bayes algorithm model using the confusion matrix and the ROC curve.

Figure 2 Import Research Dataset

Figure 3 Check missing values in the categorical variables of the dataset.

Figure 4 Check missing values in the numerical variables of the dataset.

Figure 5 Dependent Variables