Android Applications Categorization Using Bayesian Classification

The rapid growth of Android applications and malware has increased the use of application categories in Android malware detection and application search. However, weaknesses in the management of Android Market lead to a large number of miscategorized applications. An approach that automatically distinguishes application categories would therefore help both the organization of Android Market and Android malware detection. In this paper, we present an effective approach for automatically categorizing Android applications based on Bayesian classification. Because the category of an application is determined by its function, we extract as classification features the used permissions and the strings that reflect the application's function, taken from the application itself and from Android Market. Finally, we conduct experiments with Naive Bayes on 13005 applications drawn from 18 categories. The evaluation results show that our approach achieves better accuracy and performance than previous coarse-grained feature extraction methods.
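As a rough illustration of the classification step, the following minimal sketch trains a Bernoulli-style Naive Bayes model in which each application is represented by the set of features (used permissions and function-related strings) it contains. The function names, the Laplace smoothing, and the toy feature values are assumptions made for this example; the paper does not specify these details.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """Train on a list of (feature_set, category) pairs.

    Assumption: each feature is binary (present or absent in the app).
    """
    class_counts = Counter(category for _, category in samples)
    # category -> feature -> number of apps in that category containing it
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, category in samples:
        for feature in set(features):
            feature_counts[category][feature] += 1
            vocabulary.add(feature)
    return {
        "class_counts": class_counts,
        "feature_counts": feature_counts,
        "vocabulary": vocabulary,
        "total": len(samples),
    }


def predict(model, features):
    """Return the category with the highest posterior log-probability."""
    features = set(features)
    best_category, best_score = None, -math.inf
    for category, n_cat in model["class_counts"].items():
        # log prior P(category)
        score = math.log(n_cat / model["total"])
        for feature in model["vocabulary"]:
            # Laplace-smoothed estimate of P(feature present | category)
            p = (model["feature_counts"][category][feature] + 1) / (n_cat + 2)
            score += math.log(p if feature in features else 1.0 - p)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Toy data for illustration only; not taken from the paper's dataset.
training = [
    ({"CAMERA", "photo", "filter"}, "Photography"),
    ({"CAMERA", "gallery", "photo"}, "Photography"),
    ({"INTERNET", "SEND_SMS", "chat"}, "Communication"),
    ({"INTERNET", "READ_CONTACTS", "chat"}, "Communication"),
]
model = train_naive_bayes(training)
```

Working in log space avoids numerical underflow when many feature probabilities are multiplied, and the smoothing keeps a feature unseen in one category from forcing that category's probability to zero.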

To validate our application categorization model, we use two standard metrics: Accuracy and AUC. As shown in Equation (2), TP is the number of positive-category applications that are correctly classified; FN is the number of positive applications that are misclassified as negative; TN is the number of negative applications that are correctly classified; and FP is the number of negative applications that are misclassified as positive. Concretely, TPR is the proportion of positive instances that are correctly classified, FPR is the proportion of negative instances that are wrongly classified as positive, and Accuracy is the proportion of correctly classified instances among all instances. AUC is the area under the ROC curve, which is obtained by plotting the TPR against the FPR.
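The metrics can be sketched in a few lines. The snippet below computes Accuracy, TPR, and FPR from the four confusion counts, and computes AUC through its rank interpretation: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, which equals the area under the ROC curve. The function names are hypothetical, and treating one category as positive and all others as negative is an assumption about how the 18-category problem is evaluated.

```python
def confusion_rates(tp, fn, tn, fp):
    """Return (accuracy, tpr, fpr) from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    tpr = tp / (tp + fn)  # share of positives that are correctly classified
    fpr = fp / (fp + tn)  # share of negatives wrongly classified as positive
    return accuracy, tpr, fpr


def auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Counts, over all positive/negative pairs, how often the positive
    instance is scored higher; ties count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative instance")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

The pairwise form is quadratic in the number of instances; for larger evaluation sets, a sort-based computation gives the same value more efficiently.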

We used 10 feature sets, with sizes ranging from 10 to 100 features, to obtain the results depicted in Fig. 3. We also compared the coarse-grained feature extraction method with our new method. As Fig. 3 shows, the traditional coarse-grained method reaches an accuracy of 88%, whereas our method reaches 94%, which indicates that our method improves classification accuracy. In addition, our method reaches its highest accuracy with 50 features, 20 dimensions fewer than the coarse-grained method requires. This result suggests that our method removes irrelevant features and reduces the dimensionality of the feature space. Nonetheless, the accuracy of both the coarse-grained method and our method decreases once the number of features exceeds a certain threshold, because enlarging the feature set beyond that point adds irrelevant features.

In this paper, we proposed a method to classify Android applications using machine-learning techniques. The features we use are extracted from the applications themselves and from Android Market. Unlike previous coarse-grained feature extraction methods, we analyzed the relationship between features and categories in detail. To handle the case in which permissions declared in the AndroidManifest.xml file are never actually used, we extracted the used permissions by statically analyzing the APIs that the application calls. Meanwhile, we extracted only the strings that reflect the function of the application, eliminating the negative effect that irrelevant strings have on the classification results. The results presented in this paper show that our model achieves better accuracy than the previous coarse-grained feature extraction method. As future work, we would like to investigate whether accuracy and performance can be improved by grouping together the different strings that represent the same function.
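The used-permission idea can be illustrated with a small sketch. It assumes that static analysis has already produced the list of API calls made by an application, and that a lookup table from API methods to the permissions they require is available. The three-entry table below is hand-written for illustration only, and intersecting with the declared permissions is an assumption rather than a detail confirmed by the paper.

```python
# Illustrative API-to-permission table (an assumption for this sketch; a real
# analysis would need a far more complete mapping).
API_PERMISSIONS = {
    "android.telephony.SmsManager.sendTextMessage": {
        "android.permission.SEND_SMS",
    },
    "android.hardware.Camera.open": {
        "android.permission.CAMERA",
    },
    "android.location.LocationManager.getLastKnownLocation": {
        "android.permission.ACCESS_FINE_LOCATION",
    },
}


def used_permissions(declared, api_calls, api_permissions=API_PERMISSIONS):
    """Return the declared permissions that the called APIs actually require."""
    required = set()
    for call in api_calls:
        # API calls missing from the table contribute no permissions.
        required |= api_permissions.get(call, set())
    return set(declared) & required


# Hypothetical application: declares three permissions but only calls the camera API.
declared = {
    "android.permission.CAMERA",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
}
calls = ["android.hardware.Camera.open", "java.lang.String.length"]
features = used_permissions(declared, calls)
```

In this example only the camera permission survives as a feature, since the other declared permissions are not backed by any API call in the table.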