MACHINE LEARNING-BASED MALWARE DETECTION

A. Detection Data and Feature Selection

Two types of data have been extracted for malware classification. The first, Android application Permissions, has been extracted from the manifest of APK files. The second, System Calls, is collected at runtime as the calls are made to the Android Linux kernel by a single application. The aim of the data analysis module is to take the classified examples and develop a classification model capable of classifying new examples through predictive analysis and pattern recognition. The data is structured so that a single file contains the class value for each observation. An observation can be thought of as synonymous with an APK file or an instance of the corresponding application. Each observation is mapped to features in a separate but related data set. The features that make up an application give insight into the class value of the observation. A given feature can belong to many different applications. This portion of the research focused on determining which features of an application point to it being malicious.

The resources that an application accesses on a device, for example, are obtained through permissions that allow activities to take place. A list of the permissions required for an application to run successfully on the device is stored in the application manifest and can be extracted from the APK file. Likewise, the system calls made by an application can be used to detect malicious activities. Often, malicious applications are able to carry out their harmful activity because permission has been granted by the user of the device, who is unaware that the application is malicious. The permissions that an application requires can be viewed by the user, but the system calls made by the application require more understanding and effort to retrieve.

Though there are dozens of permissions available to the developer to request and require for installation of the application, these permissions are finite. Likewise, the application makes system calls to the operating system. These system calls, ranging from file management calls to information gathering, are also finite in number. The malicious test data used in this paper were derived from APK files supplied by the Android Malware Genome Project.
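
As a rough illustration of the data layout described in this subsection, the sketch below joins a table of class values per observation with a separate observation-to-feature mapping, producing one binary row per application. The application names, the permission names, and the helper `denormalize` are hypothetical and are not taken from the study's data.

```python
def denormalize(classes, mappings, features):
    """Build one row per observation: a 0/1 flag per feature, then the class value."""
    # Collect the set of features that each observation is mapped to.
    present = {}
    for obs, feat in mappings:
        present.setdefault(obs, set()).add(feat)

    rows = {}
    for obs, label in classes.items():
        flags = [1 if f in present.get(obs, set()) else 0 for f in features]
        rows[obs] = flags + [label]  # class value appended at the end
    return rows


# Hypothetical example data (not from the paper).
classes = {"app_a": "M", "app_b": "B"}
mappings = [("app_a", "SEND_SMS"), ("app_a", "INTERNET"), ("app_b", "INTERNET")]
features = ["INTERNET", "READ_CONTACTS", "SEND_SMS"]

rows = denormalize(classes, mappings, features)
```

Because the same feature can be mapped to many observations, the mapping is kept as a list of pairs and is only flattened into columns at the end.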

B. Machine Learning Methods

In our study, we consider the following four machine learning methods.

1) ZeroR: ZeroR is a simple classification method. The class value with the highest number of instances is taken as the majority class. The baseline can be visualized with a frequency table. For example, if there were 100 instances of mobile applications and 70 of them were malicious, the ratio of malicious to benign would be 70 to 30. The malicious class would be the dominant class, and the baseline would be established as a 70 % learning accuracy. In this example using ZeroR, the probability of a new unclassified instance being malicious is then taken to be 70 %, which is the baseline. Notice that the other learning methods are expected to provide a higher accuracy than the baseline because of their added complexity. The baseline is established by surveying the statistics of the class value as it is. ZeroR was used to find the baseline, which is a straightforward snapshot of the classification results before running any machine learning algorithm. This number comes from taking a tally of all instances and their corresponding class values. Whichever class value has the most instances becomes the “correct” group, and the others fall into an incorrectly classified category. In the sections to come, percentage splits (training/testing) are implemented to allow ZeroR to be used in predicting the classification of test data.
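
The ZeroR baseline from the 70/30 example can be sketched in a few lines. This is a minimal illustration and not Weka's implementation; the function name `zero_r` is hypothetical.

```python
from collections import Counter


def zero_r(labels):
    """Return the majority class and the fraction of instances it covers."""
    counts = Counter(labels)
    majority, n_majority = counts.most_common(1)[0]
    return majority, n_majority / len(labels)


# The example from the text: 70 malicious and 30 benign instances.
labels = ["M"] * 70 + ["B"] * 30
majority, baseline = zero_r(labels)
```

Here `majority` is the class that ZeroR predicts for every new instance, and `baseline` is the accuracy that any learned model is expected to beat.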

2) OneR: The “One Rule” method is a slight step higher than ZeroR in complexity, yet still very simple in comparison to the other learning algorithms. In most cases, learning accuracy improves when moving from ZeroR to OneR. The method takes a single attribute, determines how many possible (distinct) values this attribute can take, and counts how many times each (distinct) value of the attribute appears with each target class. The most frequent class is then assigned to this (distinct) value of the attribute. Errors for each assignment of a class are calculated, and the attribute with the fewest errors is taken as the predictor for the data set. This method can be visualized with a frequency table for each attribute. Each attribute is examined for how it relates to the final classification. For each value of a feature, the error is the smaller of the class counts attached to that value. For example, suppose 1 and 0 are the two options for the attribute “X” (either the instance has the feature or it does not). To compute the error for this attribute (X), the count of benign instances that do not possess X (X_i, or 0) and the count of benign instances that do possess X (X_j, or 1) are compared to the count of malicious instances that do not possess X (X_i, or 0) and the count of malicious instances that do possess X (X_j, or 1). This is done for each attribute, and once the lowest error rate is found, it is used as the basis for the percentage of accuracy in classification.
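
The per-attribute frequency tables and error counts described for OneR can be sketched as follows. This is a simplified illustration under the assumption of nominal attributes; the function name `one_r` and the toy data are hypothetical, and the code is not Weka's OneR implementation.

```python
from collections import Counter, defaultdict


def one_r(rows, labels):
    """Return (best attribute index, value -> class rule, error count)."""
    best = None
    for a in range(len(rows[0])):
        # Frequency table: for each value of attribute a, count each class.
        by_value = defaultdict(Counter)
        for row, label in zip(rows, labels):
            by_value[row[a]][label] += 1

        # Assign the most frequent class to each value of the attribute.
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        # Errors are the instances that do not belong to the assigned class.
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())

        if best is None or errors < best[2]:
            best = (a, rule, errors)
    return best


# Hypothetical binary data: two attributes per application.
rows = [(1, 0), (1, 1), (1, 0), (0, 1), (0, 0), (0, 1)]
labels = ["M", "M", "M", "B", "B", "M"]

attribute, rule, errors = one_r(rows, labels)
```

In this toy data the first attribute misclassifies one instance and the second misclassifies two, so the first attribute is chosen as the single rule.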

3) Naïve Bayes: Based on Bayes' theorem, this method computes the probability of a class given a distinct value of an attribute. The Naïve Bayes process differs from OneR in that no single attribute is singled out. All the attributes play an equal role in the prediction effort. The method assumes that knowing the value of one attribute does not tell you anything about the value of the other attributes. Bayes' theorem computes the probability of a hypothesis H (class) given some evidence (values of attributes). In this method, the a priori probability of H is taken into account in the calculation, and the a posteriori probability of H is the outcome. The notion of class conditional independence assumes that the effect of the value of an attribute on a class is independent of the values of the other attributes. To find the posterior probability, you take the likelihood (of the attribute value given the class), multiply it by the prior probability of the class, and then divide the result by the attribute's prior probability.
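
The posterior calculation can be sketched for nominal attributes as shown below. This is a minimal illustration, not Weka's NaiveBayes classifier. The function names are hypothetical, and the add-one (Laplace) smoothing of the likelihoods is an assumption that the text does not mention; it is included only so that unseen attribute values do not force a probability of zero.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count classes and, per (attribute, class), the attribute values seen."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, label in zip(rows, labels):
        for a, v in enumerate(row):
            value_counts[(a, label)][v] += 1
    return class_counts, value_counts


def posterior(model, row, n_values=2):
    """Return P(class | row) for every class, assuming n_values per attribute."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(c)
        for a, v in enumerate(row):
            # Likelihood P(v | c) with add-one smoothing (an assumption).
            score *= (value_counts[(a, c)][v] + 1) / (n_c + n_values)
        scores[c] = score
    evidence = sum(scores.values())  # probability of the evidence
    return {c: s / evidence for c, s in scores.items()}


# Hypothetical binary data: two attributes per application.
rows = [(1, 1), (1, 0), (0, 1), (0, 0)]
labels = ["M", "M", "B", "B"]

model = train_naive_bayes(rows, labels)
probs = posterior(model, (1, 1))
```

Multiplying the per-attribute likelihoods together is where the class conditional independence assumption enters the calculation.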

4) Decision Tree: For the Decision Tree, J48, a tree classifier heuristic method based on the C4.5 system (release 8), was selected. This is a decision support method that uses a tree-like graph of choices in a cascading manner, moving from general to more specific tests based on attribute values alone. The results of a decision tree can be converted to rules. Each path from the root of a tree to a leaf is a rule. To find the attribute with the most information gain in this particular case, the gain ratio (which is based on information theory) must be assessed. Information gain is defined in terms of entropy, which measures the uncertainty of a variable.
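
Entropy, information gain, and gain ratio for a single nominal attribute can be sketched as follows. This illustrates only the textbook definitions used by C4.5-style trees; it is not the J48 implementation, which applies further heuristics and pruning. The function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in items."""
    n = len(items)
    return -sum((k / n) * math.log2(k / n) for k in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on values, divided by the split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)

    # Expected entropy of the class after splitting on the attribute.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the attribute itself
    return info_gain / split_info if split_info > 0 else 0.0


labels = ["M", "M", "B", "B"]
informative = gain_ratio([1, 1, 0, 0], labels)    # attribute separates the classes
uninformative = gain_ratio([1, 0, 1, 0], labels)  # attribute tells nothing
```

An attribute that separates the classes perfectly removes all uncertainty, while one that is unrelated to the class leaves the entropy unchanged.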

C. Evaluation and Results

In our evaluation, we use a well-known data mining and predictive analytics tool, Weka [17], to mine the Android application data for detection. The data was converted to the .arff format so that it could be used with Weka. Here, the observation and class value must be combined, along with the mappings of the features. Once denormalized, each row of data consisted of the information from the observation with a class value appended to the end of the attributes/features of that application. In our experiments, we used 241 malware and 241 benign apps for the detection based on Permission data, and 91 malware and 95 benign apps for the detection based on System Call data.

A row of Permission data contains a 0 or 1 value for each feature, along with the overall determination of the application as malicious or benign, represented with an ‘M’ or a ‘B’. Each example thus became an independent example with a class value attached. Each row contained data for 80 features, each representing the presence or absence of an attribute of the application. From a relational standpoint, each feature served as a column in the data set. The features were the permissions of the respective mobile application.

Unlike the permission data, a row of System Call data contains a weighted decimal value for each feature (system call) to depict how often the feature appeared in the application instance. Altogether, a row (application instance) has values for each of the possible system calls that can be made by an application. Since the data is weighted, the values for all of the system calls made by an application add up to 1. This data was numeric, so it had to be translated in a way that would make it effectively nominal. To achieve the translation, a binning scheme was adopted. In the binning scheme, ranges are assigned to divide the data into groups.
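
The weighting and equal-width binning of system call data can be sketched as follows. The system call names and counts are hypothetical, as is the helper `equal_width_bin`. The text does not say how values on a bin edge were handled, so placing a value of exactly 1 in the last bin is an assumption.

```python
def equal_width_bin(value, n_bins, low=0.0, high=1.0):
    """Map a numeric value in [low, high] to a bin index in 0..n_bins-1."""
    index = int((value - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)  # assumption: value == high goes in the last bin


# Hypothetical system call counts for one application instance.
counts = {"read": 14, "write": 5, "open": 1}
total = sum(counts.values())

# Weighted values: each call's share of all calls, so the row sums to 1.
weights = {call: n / total for call, n in counts.items()}

# Nominal values: the index of the equal-width bin, here with 5 bins.
binned = {call: equal_width_bin(w, 5) for call, w in weights.items()}
```

With 5 bins, each bin covers a range of width 0.2, so a weight of 0.70 falls in bin 3 and a weight of 0.05 falls in bin 0.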

The bin size determines the granularity of the data. For example, a binary attribute has a bin size of 2, since the only possible values for the attribute are true and false. In the case of the system call data, the bin sizes used were 5, 10, 15, and 20. This is significant because a bin size of 5 divides the continuous range of values between 0 and 1 into 5 equal-width groups. Thus, a numeric attribute value falls into one of these five groups, depending on the upper and lower limits of each group's range. The higher the bin size, the higher the granularity of the data.

In the evaluation of any data for classification and prediction, the training-to-testing ratio must be adjustable in order to find the best combination for classifying new data sets from the work done in the learning effort. Each of the classification methods introduced previously was tested with four different training ratios: 20, 40, 60, and 80 percent training-to-testing splits of the data. For example, if a 20 % split was used, then 20 % of all the data would be used for training, and the remaining 80 % would be the test group. From this level of detail, the best method for predicting the classification of new instances would be evident in terms of performance.

Figure 3 (left) shows the detection rate for permission-based data, and Figure 3 (right) shows the false positives at each of the corresponding training ratios. From the figure, we can see that the detection rate was measured at 20, 40, 60, and 80 percent for each of the four learning algorithms, for a total of 16 different tests. Of the four algorithms used, the best detection rate was achieved by OneR and J48, which both reached 100 % detection accuracy when 80 % of the total data set was used for training. This means that the learning from the 80 % allowed the remaining 20 % to be classified correctly. The poorest performing method was ZeroR, which is to be expected since it is the least sophisticated of the algorithms and does not actually learn, but instead classifies based on the majority class. Outside of ZeroR, which reached only 49.7 %, the next worst performing algorithms were OneR and J48 at a 20 % training ratio. However, their performance was still 94.6 % accuracy, which is certainly still acceptable.

From the examination of the false positives for each of the algorithms at the specific training ratios, it was determined that both OneR and J48 again had the best performance, with 0 false positives at an 80 % training ratio. Also of note, ZeroR had no false positives, but this is to be expected for this algorithm. The highest number of false positives occurred with Naïve Bayes at 20 % training, which yielded a total of 4 false positives.

Figure 4 shows the detection rate for system call data with a bin size of 5 (left) and 10 (right). In analyzing the system call data, the number of tests increased because of the added binning scheme needed to translate the numeric data into nominal form. Binning levels of 5, 10, 15, and 20 were used for the tests at each of the training ratios of 20, 40, 60, and 80 percent. Across the 64 tests, the best detection rate of 94.59 % was achieved using the Naïve Bayes algorithm at a 40 % training ratio with 10 bins. It should be noted that the two next best accuracy levels were also achieved with Naïve Bayes and were equal to or higher than the best level achieved with any of the other algorithms. The worst single performance (excluding the baseline) in terms of detection proved to be J48 using 20 bins at a training ratio of 20 percent. Figure 5 shows the false positives for bin sizes of 5 (left) and 10 (right). The best non-ZeroR algorithm proved to be J48, with only 1 false positive using a bin size of 20 and a training ratio of 60 %. The worst showing was OneR at a bin size of 20 and a training ratio of 20 %, yielding 18 false positives.
