Temporal Prediction on Students’ Graduation using Naïve Bayes and K-Nearest Neighbor Algorithm


INTRODUCTION
Accreditation is an evaluation of the quality and feasibility of a higher education institution or study program, conducted by an organization such as the National Accreditation Board for Higher Education (BAN-PT) [1]. Institutions with better accreditation are more attractive to prospective students. One of the BAN-PT accreditation criteria is the percentage of on-time graduation for each program [1]. The on-time graduation rate is therefore vital to the accreditation assessment.
The percentage of students who graduate on time can be predicted with learning analytics, a way of processing data obtained from learning [2]. Learning analytics is increasingly used in higher education to improve the quality of education. Its benefits include increased graduation rates, curriculum development, improved lecturer performance, improved time of admission after graduation, and more research in the field of education. One method that is often used in learning analytics is classification.
This research combines the naïve Bayes and k-nearest neighbor algorithms. Naïve Bayes is used because its calculations are fast and simple [3]. Its weakness is that the probabilities of the attributes cannot measure the accuracy of a prediction [3]. The k-nearest neighbor algorithm is used because it can process large datasets, although it requires high computation time [4]. To offset the weaknesses and exploit the strengths of both algorithms, the combined method first selects the most influential attributes with naïve Bayes. The k-nearest neighbor algorithm then builds the classification model from only the selected attributes.
In this research the prediction is made temporally, that is, annually. Temporal prediction aims to predict early whether a student will graduate on time, so that preventive measures can be taken for students who are at risk of graduating late or dropping out. The dataset used for prediction is the grade index of students in the Informatics Engineering degree program at Telkom University, classes of 2008-2011. Using course and GPA attributes, classification is performed with the naïve Bayes and k-nearest neighbor algorithms.
Previous research [5] examined student graduation prediction using decision tree algorithms and artificial neural networks. That study compared the accuracy of the decision tree with that of the artificial neural network: the artificial neural network reached an accuracy of 79.74% and the decision tree 74.51%. The study concluded that the artificial neural network is more accurate than the decision tree method.
Another study [6] combined the naïve Bayes and k-nearest neighbor algorithms. Its combined method computed the probabilities of the data attributes with naïve Bayes and then classified the selected data using k-nearest neighbor. Combining k-nearest neighbor with naïve Bayes gave better accuracy and a faster running time than the algorithm without the combination.
Another approach is discussed in [7], which selected data attributes with Bernoulli and multinomial naïve Bayes. The study compared this attribute selection with the recursive feature elimination (RFE) algorithm, l1-regularized support vector machine (SVM), SVM with RFE, and LASSO. Feature selection with Bernoulli naïve Bayes is implemented in this study.

System Overview
This research builds a system that combines the naïve Bayes algorithm with k-nearest neighbor to predict student graduation temporally. Before the data can be entered into the model, the dataset must go through preprocessing. Table 1 describes the dataset used in this research. The dataset consists of 4 different classes, each with a different number of students. Each class has 4 datasets, for the 1st, 2nd, 3rd, and 4th year, that contain the attributes (courses and GPA).

Figure 1. System flowchart
The flowchart shows the process of the whole system built in this research. It starts with the collection and input of data for Telkom University undergraduate (S1) Informatics Engineering students, batches 2008-2011. The data are then preprocessed with Pentaho Data Integration and Microsoft Excel. Next, the naïve Bayes algorithm is run and the course attributes are selected from its results, up to a limit of 40% of the total attributes. Finally, classification is performed with the k-nearest neighbor algorithm to obtain the percentage of student graduation results.

Learning Analytics
Learning analytics is the use of intelligent data, learned information, and analysis models to predict people's learning, gain insight, and explore information and social relationships [8]. It focuses on education-related data, such as students' interactions with course material, grades, and lecturers, with the aim of improving the quality of education [9]. Its main goal is to improve learning outcomes and reduce attrition, especially among at-risk students. Learning analytics helps educators 'zoom in' on individual students who need additional assistance.

Classification
Classification is a method for finding a model that describes and distinguishes data classes, built by dividing the data into training data and test data. It is useful for predicting the class of data whose class is not yet known [10]. Classification algorithms include artificial neural networks, decision trees, naïve Bayes classifiers, statistical analysis, rule-based methods, genetic algorithms, memory-based reasoning, rough sets, k-nearest neighbor, and support vector machines (SVM).
Classification is a form of supervised learning because the class label of each tuple is provided. An algorithm that implements classification is called a classifier. This study uses the naïve Bayes and k-nearest neighbor algorithms.

K-Nearest Neighbor
K-nearest neighbor is a method that classifies new data based on its distance to the closest existing data. The algorithm first fixes K, the number of training points closest to the data point being tested. Finding the best K value can improve the classification model. Generally, a higher K value reduces the effect of noise but makes the boundaries between classes less distinct [11]. The algorithm is sensitive to noise in the training data. Many distance formulas are available, including Euclidean, Mahalanobis, cosine, city block, Chebyshev, correlation, Hamming, Jaccard, Minkowski, standardized Euclidean (seuclidean), and Spearman [12]. The k-nearest neighbor algorithm usually uses the Euclidean distance between test data and training data:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1}$$
where $x_i$ and $y_i$ are the $i$-th attribute values of the two data points and $n$ is the number of attributes.
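The Euclidean distance of Equation (1) and the majority vote among the K closest points can be illustrated with a minimal sketch. It assumes plain Python lists as input. The function names, the label strings, and the grade values are hypothetical and are not taken from the study's dataset or implementation.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors (Eq. 1)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training points by distance to the query and keep the k closest.
    neighbors = sorted(
        zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    # most_common(1) returns the label with the highest vote count.
    return votes.most_common(1)[0][0]


# Hypothetical grade-point vectors (two courses per student).
train_X = [[4.0, 3.5], [3.5, 4.0], [3.0, 3.0], [1.0, 1.5], [1.5, 1.0], [2.0, 1.0]]
train_y = ["on_time", "on_time", "on_time", "late", "late", "late"]

prediction = knn_predict(train_X, train_y, [3.2, 3.4], k=3)
```

An odd K is a common choice for two-class problems because it avoids tied votes.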

Naïve Bayes
Naïve Bayes is a simple classification algorithm that calculates probabilities from the frequencies and combinations of values in an existing dataset. It makes the strong assumption that the attributes are independent of one another given the value of the class variable. It is a probabilistic and statistical classification method derived from Bayes' theorem. The naïve Bayes algorithm has a very low error rate [13] and is known for its simple, fast, and highly accurate calculations [14]. Bayes' theorem is [15]:
$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)} \tag{2}$$
where $H$ is the hypothesis (here, the class), $X$ is the observed data, $P(H)$ is the prior probability of $H$, and $P(H \mid X)$ is its posterior probability given $X$.
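As a worked illustration of how Bayes' theorem and the independence assumption turn counts into class probabilities, the following minimal sketch trains a naïve Bayes classifier on binary attributes. The toy data, the function names, and the use of Laplace smoothing are assumptions made for the example and are not taken from the study.

```python
from collections import Counter


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute_index, value, label)] -> number of occurrences
    value_counts = Counter()
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, value, label)] += 1
    return class_counts, value_counts


def posterior(row, class_counts, value_counts):
    """Return normalized P(class | row) under the attribute-independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(class)
        for j, value in enumerate(row):
            # Assumed Laplace smoothing (+1 / +2 for binary attributes)
            # so that unseen values do not yield zero probabilities.
            score *= (value_counts[(j, value, label)] + 1) / (n_label + 2)
        scores[label] = score
    evidence = sum(scores.values())  # P(row), the normalizing constant
    return {label: s / evidence for label, s in scores.items()}


# Hypothetical binary attributes: (scored high in course A, scored high in course B).
rows = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
labels = ["on_time", "on_time", "on_time", "late", "late", "late"]

class_counts, value_counts = train_naive_bayes(rows, labels)
result = posterior((1, 1), class_counts, value_counts)
```

For the query `(1, 1)` the unnormalized scores are 0.5 × 0.8 × 0.6 = 0.24 for `on_time` and 0.5 × 0.2 × 0.4 = 0.04 for `late`, so the posterior for `on_time` is 0.24 / 0.28, about 0.857.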

Results of Testing the Most Influential Attributes with Naïve Bayes
The results of this test were obtained by feeding the students' final data into the naïve Bayes algorithm to select attributes. The selection uses a Bernoulli naïve Bayes algorithm, so the data are first converted into binary form (0 and 1) using the median of the dataset. The results were obtained from the 2008 student dataset, levels 1 to 4; this dataset was chosen because it contains more students than the other datasets. The test yields the course attributes that have the most influence on student graduation. The five most influential attributes are Probability and Statistics, Object Oriented Engineering, Software Engineering (RPL), Computer Architecture Organization, and Algorithm Analysis Design. Probability and Statistics significantly affects graduation because it is one of the most difficult courses.
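The median binarization and the 40% attribute limit can be sketched as follows. The source does not state how Bernoulli naïve Bayes scores each attribute, so the ranking rule here is an assumption: each attribute is scored by the absolute weight it would carry in the Bernoulli naïve Bayes decision function, that is, the difference in smoothed log-odds between the two classes. The threshold rule (1 if the value is at or above the column median), the function names, and the data are likewise hypothetical.

```python
import math
from statistics import median


def binarize_by_median(rows):
    """Map each column to 0/1: 1 if the value is at or above that column's median."""
    columns = list(zip(*rows))
    medians = [median(col) for col in columns]
    return [[1 if v >= m else 0 for v, m in zip(row, medians)] for row in rows]


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def rank_attributes(binary_rows, labels, positive="on_time"):
    """Rank attribute indices by an assumed Bernoulli naive Bayes weight.

    Assumed scoring rule (not stated in the paper): the absolute difference
    between the Laplace-smoothed log-odds of x_j = 1 in the two classes.
    """
    pos = [r for r, y in zip(binary_rows, labels) if y == positive]
    neg = [r for r, y in zip(binary_rows, labels) if y != positive]
    scores = []
    for j in range(len(binary_rows[0])):
        p_pos = (sum(r[j] for r in pos) + 1) / (len(pos) + 2)
        p_neg = (sum(r[j] for r in neg) + 1) / (len(neg) + 2)
        scores.append(abs(logit(p_pos) - logit(p_neg)))
    # Attribute indices, most discriminative first.
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def select_top_fraction(ranked, fraction=0.4):
    """Keep the top `fraction` of the ranked attributes (at least one)."""
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical grades for five courses; first three students graduate on time.
rows = [
    [4, 3, 4, 2, 3], [4, 2, 3, 3, 2], [3, 3, 4, 2, 3],
    [1, 3, 2, 2, 3], [2, 2, 1, 3, 2], [1, 3, 2, 3, 3],
]
labels = ["on_time"] * 3 + ["late"] * 3

binary_rows = binarize_by_median(rows)
selected = select_top_fraction(rank_attributes(binary_rows, labels), fraction=0.4)
```

With five attributes and a 40% limit, two attribute indices are kept. In this toy data they are the two courses whose binarized grades separate the classes perfectly.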

Attribute Selection Result Testing on Accuracy
This test compares the accuracy of the algorithm with attribute selection against the algorithm without it. It reports the classification percentage of the k-nearest neighbor algorithm with attribute selection at an attribute limit of 40% and of the k-nearest neighbor algorithm without attribute selection. In the comparison results, the two differ little: classification with naïve Bayes attribute selection, using only 40% of the attributes, yields approximately the same accuracy as the algorithm without attribute selection. The model with attribute selection is therefore more efficient, because the KNN algorithm does not need to compute over the entire dataset and so runs in a shorter time.
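A comparison of this kind can be sketched by evaluating the same nearest-neighbor classifier on all attributes and on a reduced set of columns. The sketch below is an illustration under stated assumptions, not the study's evaluation protocol: it uses leave-one-out evaluation with a single nearest neighbor, and the data, the function names, and the selected column indices are hypothetical.

```python
import math


def project(rows, columns):
    """Keep only the listed attribute columns of every row."""
    return [[row[j] for j in columns] for row in rows]


def loo_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    correct = 0
    for i, query in enumerate(rows):
        # Nearest other row by Euclidean distance (math.dist, Python 3.8+).
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: math.dist(rows[j], query),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


# Hypothetical grades for five courses; first three students graduate on time.
rows = [
    [4, 3, 4, 2, 3], [4, 2, 3, 3, 2], [3, 3, 4, 2, 3],
    [1, 3, 2, 2, 3], [2, 2, 1, 3, 2], [1, 3, 2, 3, 3],
]
labels = ["on_time"] * 3 + ["late"] * 3
selected = [0, 2]  # assumed output of an attribute selection step (40% of 5)

full_accuracy = loo_accuracy(rows, labels)
reduced_accuracy = loo_accuracy(project(rows, selected), labels)

# Each distance computation touches one term per attribute, so the reduced
# model does len(selected) / len(rows[0]) of the per-distance work.
work_ratio = len(selected) / len(rows[0])
```

On this toy data both variants classify every student correctly, while the reduced variant evaluates two of the five terms in each distance. Results on real data will differ.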

CONCLUSION
The implementation and analysis of the combined naïve Bayes and k-nearest neighbor model identify the course attributes that have the most influence on student graduation (Table 1). The five most influential course attributes are Probability and Statistics, Object Oriented Engineering, Software Engineering (RPL), Computer Architecture Organization, and Algorithm Analysis Design. Knowing these attributes is expected to help Telkom University's undergraduate Informatics study program focus its teaching so that students achieve better results in these courses, which would increase the percentage of student graduation. The combination of naïve Bayes and k-nearest neighbor is more effective and efficient than the algorithm without attribute selection. The classification percentage tends to increase as the level rises: in the temporal prediction model of student graduation for the classes of 2008 to 2011, the highest percentages obtained were 75.40% at level 1, 82.08% at level 2, 81.91% at level 3, and 90.42% at level 4. With these results, it is hoped that the study program can predict early whether students will graduate on time and take preventive action for students who are at risk of not graduating on time.
Further research could consider other attributes such as elective courses, non-academic attributes, and curriculum; build a model with a newer dataset; use a classification method other than k-nearest neighbor; and use another attribute selection method.