MACHINE LEARNING-BASED MALWARE DETECTION
A. Detection Data and Feature SelectionTwo types of data have been extracted for malware classifi-cation. The first, Android application Permissions, have beenextracted from the manifest of APK files. The second, SystemCalls, are collected at runtime as the System Calls are made tothe Android Linux kernel by a singular application. The aimfor the data analysis module is to take the classified examplesand develop a classification model with the capability toclassify new examples through predictive analysis and patternrecognition. The structure of the data has been laid out insuch a way that a single file contained the class value foreach observation. An observation can be thought of as beingsynonymous with an APK file or instance of the correspondingapplication. Each observation is mapped to features in aseparate but related data set. The features that comprise anapplication give insight as to determine the class value for theobservation. Multiple features can belong to many differentapplications. The nature of this portion of the research was figuring out how to determine which features in applicationsmake up attributes that point to being malicious.The resources that an application accesses on a device,for example, are obtained through permissions that allowactivities to take place. A list of the permissions required foran application to run successfully on the device is stored inthe application manifest, and can be extracted from the APKfile. Likewise, the system calls made by an application can beused to detect malicious activities. Many times, maliciouslybehaving applications are able to conduct their harmful activitybecause permission has been granted by the user of thedevice, obviously unaware that the application is malicious.The permissions that an application requires can be viewed bythe user, but the system calls made by the application requiremore of an involved understanding and effort to retrieve. Though there are dozens of permissions that are availableto the developer to request and require for installation of theapplication, these permissions are finite. In regards to thesystem, there are likewise system calls that are made fromthe application to the operating system. These system calls,ranging from file management calls to information gathering,are also finite in number. The malicious test data used in thispaper were derived from APK files supplied by the AndroidMalware Genome Project.
B. Machine Learning MethodsIn our study, we consider the following four machinelearning methods.
1) ZeroR:ZeroR is a simple classification method. Theclass value with the highest number of instances is taken as themajority class. The baseline can be visualized with a frequencytable. For example, if there were 100 instances of mobileapplications and of that 100, 70 were malicious instances, theratio of malicious to benign would be 70 to 30. The maliciousclass would be the dominant class and the base line would beestablished as a 70 % learning accuracy. In this example usingZeroR, the probability of a new unclassified instance beingmalicious is then 70 %, which is the base line. Notice thatthe other learning methods are expected to provide a higheraccuracy than the baseline due to the complexity enhancement.The baseline is established by taking a survey of the statisticsof the class value as it is. ZeroR was used to find the baseline,which is a straight forward snapshot of the classification resultsbefore running any machine learning algorithm. This numbercomes from taking a tally of all instances and their correspond-ing class value. Whichever class value has the most instancesin its group becomes the “correct” group and the others fallinto an incorrectly classified category. In sections to come,there will be percentage splits (training/testing) implementedto allow ZeroR to be used in predicting the classification oftest data.
2) OneR:The “One Rule” method is a slight step higher than ZeroR in complexity, yet still very simple in comparison to the other learning algorithms. In most cases, learning accuracy improves when comparing ZeroR and OneR. Themethod takes a single attribute, determines how many possible(distinct) values this attribute can be set to, and counts how many times each (distinct) value of the attribute appears in the target class. The most frequent class is taken into account, and this class is assigned to this (distinct) value of the attribute.Errors for each assignment of a class are calculated, and the attribute with the least number of errors is taken as the predictor for the data set. This method can be visualized with a frequency table for each attribute.Each attribute is examined for how it relates to the final classification. The error is the lesser of the options that are attached to each feature. For example, if1and0are the two options for the attribute “X” (either the instance has the feature or does not). To compute the error for this attribute (X), the count of benign instances that do not possessX(Xior0)and the count of benign instances that do possessX(Xjor1) are compared to the count of malicious instances that do not possessX(Xior0) and the count of malicious instances that do possessX(Xjor1). This is done for each attribute,and once the lowest error rate is found, it is used as the basis for the percentage of accuracy in classification
3) Naïve Bayes:Based on Bayes Theorem, this method is used for computing the probability of a class given a distinct value of an attribute. The Naïve Bayes process is different from OneR in that there is no single attribute singled out. All the attributes play an equal role in the prediction effort. Knowing the value of one attribute does not tell you anything about the value of the other attributes. Bayes theoremis computing the probability of a hypothesis (class) given some evidence (value of attributes). In this method, a priori probability of His taken into account for the calculation and the a posteriori probability of His the outcome. The notion of class conditional independence assumes that the effect of the value of an attribute on a class is independent of the values of other attributes. To find the posterior probability, you take the likelihood (attribute value) and multiply by the probability of the class, and then all of this is divided by the attributes prior probability.
4) Decision Tree:For the Decision Tree, J48, a tree classifier heuristic method based on the C4.8 system was selected.This is a decision support method that employs tree-like graph options in a cascading manner, moving from general to more specific in many instances of detailed classification based on attributes alone. The results of a decision tree can be converted to rules. Each path from the root of a tree to a leaf is a rule. To find the attribute with the most information gain in this particular case, there must be an assessment of the gain ratio (which is based on information theory). In discussing information gain, entropy has to be established in order to have a level of understanding of the uncertainty of a variable.
C. Evaluation and Results In our evaluation, we use one known data mining and predictive analytics tool, Weka , which was employed to mine the Android application data for detection. The data was converted to the.arffdata type to be able to be used with Weka.Here, the observation and class value must be combined, along with the mappings of the features. Once denormalized, each row of data consisted of the information from the observation with a class value appended to the end of the attributes/features of that application. In our experiments, we used 241 malware and 241 benign apps for the detection based on Permission data and used 91 malware and 95 benign apps for the detection based on System Call data.A row of Permission data would contain a 0 or 1 value for each feature, with the overall determination of the application as being either malicious or benign, also represented with a ‘M’ or ‘B’. Each example then became an independent example with a class value attached. Each row contained data representing 80 features that were representations of the existence of attributes of the application. From a relational standpoint, each feature served as a column in the data set. The features were permissions of the respective mobile application.Unlike the permission data, a row of System Call data would contain a weighted decimal value for each feature (system call)to depict how often the feature appeared in the application instance. Altogether, a row (application instance) will have values for each of the possible system calls that can be made by an application. Since the data is weighted, the total usage for each of the system calls by an application will add up to1. This data was numeric so it had to be translated in such away that would give it the effect of being nominal. To achieve the translation, a binning scheme was adopted. In the binning scheme, a range would be assigned to section the data into groups.
The bin size would determine the granularity in the data. For example, a binary attribute has a bin size of 2 since the only possible values for the attribute are true and false. In the case of the system call data, the bin sizes that were used were 5, 10, 15, and 20. This is significant because a bin size of5 will divide the infinite range of values between 0 and 1 into 5equal width groups. Thus, a value of an attribute in its numeric form would fall into one of these five groups depending on the upper and lower limits of the range of values determining the particular group that the value resides. The higher the bin size, the higher the granularity of the data.In the evaluation of any data for classification and prediction, there must be an adjustable level for the training to testing ratio in order to accurately find the best combination possible to classify new data sets from the work done in the learning effort. Each of the classification methods introduced previously were tested with four different training ratios. The four ratios were 20, 40, 60, and 80 percent training to testing splits of the data. For example, if a 20 % split was used, then 20 %of all the data would be used to train on, and the remaining80 % would be the test group. From this level of detail, the best method for predicting the classification of new instances would be evident in terms of performance.The data in Figure 3 (left) shows the detection rate of permission-based data and the data in Figure 3 (right) shows the false positives at each of the corresponding training ratios.From the figure, we can see the detection rate was measured at 20, 40, 60, and 80 percent for each of the four learning algorithms. This totaled up to 16 different tests. Of the four algorithms used, the best detection rate was achieved with both OneR and J48 equally reaching 100 % detection accuracy when 80 % of the total data set was used for training. This means that the leaning from the 80 % allowed the remaining20 % to be classified correctly. The poorest performing method was predictably ZeroR, which is to be expected since it is the least sophisticated of the algorithms and does not actually learn, but will instead classify based on majority percentages.Outside of ZeroR, which reached only 49.7 %, the next worst performing algorithms were OneR and J48 at a 20 % training ratio. However, the performance was still 94.6 % accuracy,which is certainly still acceptable.
From the examination of the false positives for each of the algorithms at the specific training ratios, it was determined that both OneR and J48 again had the best performance, with 0 false positives at an 80 % training ratio. Also of note, ZeroR had no false positives, but this is to be expected for this algorithm. The worst false positives was with Naive Bayes at 20 % training yielding a total of 4 false positives.The data in Figure 4 shows the detection rate of system call data with respect to a bin size of 5 (left) and 10 (right). From the figure, we can see that in analyzing the system call data,the number of tests increased due to the added binning scheme needed for translating the numeric data to that of a nominal nature. Binning levels of 5, 10, 15, and 20 were used for the tests at each of the training ratios of 20, 40, 60, and 80 percent.Figure 4 is showing the results data from both a bin size of5 and 10. In analyzing the 64 tests, the best detection rate of94.59 % was achieved using the Naïve Bayes algorithm at a40 % training ratio with 10 bins. It should be noted that the two next best accuracy levels were also achieved with NaïveBayes and were equal to or higher than the best level achieved with any of the other algorithms. The worst single performance(excluding the baseline) in terms of detection, proved to beJ48 using 20 bins at a training ratio of 20 percent. Figure 5shows the false positives data for two bin sizes of 5 (left) and10 (right). From the figure, we can see the best non-ZeroR algorithm proved to be J48 at only 1 false positive using a binsize of 20 and a training ratio of 60 %. The worst showing was OneR at a bin size of 20, a training ratio of 20 %, yielding18 false positives.