DetReduce: Minimizing Android GUI Test Suites


Evaluation of DetReduce

The first part shows the following information about the test suites to be minimized: total branch coverage (#br.), total screen coverage (#s.), total number of transitions (#act.), and total number of traces (#tr.) of each initial test suite. The second and third parts of the table show information about the test suites generated after running the first and second phases, respectively, of DetReduce. The fourth part shows important statistics summarizing the experiment results: the running time of each phase of the algorithm (t_p1 and t_p2), the execution time (t_r) of the resulting reduced regression test suite, and the ratio of the execution time of the resulting regression suite to the execution time of the original test suite in percentage (t_r/t). We make the following observations from the data shown in the tables.

• RQ1: The execution time of the reduced test suites (t_r) is several orders of magnitude shorter than that of the original test suites (8 hours). This shows that DetReduce is highly effective in minimizing the test suites for the benchmark apps. Regarding the sizes of the test suites, phase 1 of DetReduce removes 91.27% of transitions (median) and 90.5% of restarts (median). Phase 2 of DetReduce further removes 33.07% of transitions and 31.81% of restarts from the test suites obtained after phase 1. These two phases of DetReduce cumulatively remove 93.84% of transitions and 93.52% of restarts. We also found that the rate of reduction is higher for test suites generated from Random, because these test suites have lower test coverage and more redundancies.

• RQ2: The running time of the algorithm is within a factor of 6× of the execution time of the original test suites generated by the test generation algorithms. More than half of the running time was spent in detecting and eliminating loops in phase 1 (note that DetReduce spends no time removing redundant traces because such traces do not require any execution). The time spent in phase 1 is reasonable because the phase searches for a minimized test suite while eliminating redundant loops from each trace. Note that these experiments employed a conservative parameter (ten) for the number of re-executions to perform to check trace replayability (a sketch of this check appears after this list), and the running cost of DetReduce can be further reduced by setting this parameter to eight.

• RQ3: Despite using an approximate method for checking whether a trace is replayable, the minimized test suites nonetheless cover most of the original branch and screen coverage. DetReduce fails to provide 100% coverage for amoney, explorer, ttable, and vlc. We manually analyzed the reasons for the missing branches and screens, and determined that non-replayable traces were not fully removed while generating the original test suites before phase 1 of DetReduce.

• RQ4: In order to check how DetReduce affects the fault-detection capability of test suites, we collected the exceptions raised while executing each test suite. We identified seven distinct exceptions based on their stack traces. All seven survived after applying DetReduce. Note that DetReduce does not consider exceptions to be part of the test coverage it tries to preserve.

• RQ5: We measured how many re-executions were required to identify each non-replayable trace created during our experiments. The following table summarizes the results.
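To make the replayability check referenced in RQ2 and RQ5 above concrete, the following is a minimal Python sketch, not DetReduce's actual implementation; the execute_trace callback and the is_replayable name are hypothetical placeholders, where execute_trace is assumed to re-run a trace on a freshly restarted app and return the sequence of abstract screens it observes.

from typing import Callable, List, Sequence

def is_replayable(trace: Sequence[str],
                  execute_trace: Callable[[Sequence[str]], List[str]],
                  runs: int = 10) -> bool:
    # Execute the trace once to obtain a reference sequence of screens,
    # then require every subsequent re-execution to observe the same screens.
    reference = execute_trace(trace)
    for _ in range(runs - 1):
        if execute_trace(trace) != reference:
            return False  # diverging behavior: the trace is not replayable
    return True

With runs set to ten, as in the experiments above, a single divergent re-execution suffices to classify a trace as non-replayable, which is why RQ5 can report the number of re-executions needed to identify each such trace.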

Splicing and the Number of Fragments in Traces

RQ6: To understand the effect of bounding the number of trace fragments in phase 2 of our algorithm, we measured the relationship between the bound and the likelihood of finding a replayable trace, as well as the average number of trace fragments in a trace generated by splicing. For these measurements, we used four relatively complex benchmark apps.

Bounding the Number of Fragments and the Replayability of Traces. We measured the correlation between the bound on the number of fragments and the likelihood of finding a replayable trace using ten different bounds on k (1 ≤ k ≤ 10). For each k, we constructed 200 random traces by combining k trace fragments from the test suite after phase 1. Furthermore, we restricted each trace to contain only 20 transitions. In order to construct the traces, we first collected at most 20,000 traces satisfying the requirement using breadth-first search of the transition system Q_Tr (described in Section 3.2). Note that the paths of Q_Tr consist of traces that can be constructed via splicing. We then sampled 200 traces from the set of 20,000 traces. Finally, we checked how many of the sampled traces are replayable by executing each trace ten times. Table 4 shows the results. The first column shows the name of the app and the rest of the columns show the number of replayable traces for each k. Our hypothesis was that increasing the number of fragments would decrease the likelihood of finding replayable traces, and the results confirm this hypothesis for the four apps.
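As an illustration of how such bounded spliced traces could be enumerated and sampled, the sketch below performs a breadth-first search over a simplified adjacency-list encoding of Q_Tr. The Edge representation, the function names enumerate_spliced_traces and sample_candidates, and the exactly-k-fragments criterion are illustrative assumptions, not DetReduce's actual data structures.

import random
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical encoding of Q_Tr: each state maps to its outgoing edges, and each
# edge is labeled with the trace fragment (id and transitions) it comes from.
Edge = Tuple[str, List[str], str]  # (fragment_id, transitions, next_state)

def enumerate_spliced_traces(q_tr: Dict[str, List[Edge]],
                             start: str,
                             k: int,
                             max_transitions: int = 20,
                             cap: int = 20000) -> List[List[str]]:
    # Breadth-first search that collects up to `cap` traces built by splicing
    # exactly `k` fragments and containing at most `max_transitions` transitions.
    results: List[List[str]] = []
    queue = deque([(start, [], 0)])  # (state, transitions so far, fragments used)
    while queue and len(results) < cap:
        state, actions, used = queue.popleft()
        for _fragment_id, transitions, nxt in q_tr.get(state, []):
            new_actions = actions + transitions
            if len(new_actions) > max_transitions:
                continue
            if used + 1 == k:
                results.append(new_actions)
                if len(results) >= cap:
                    break
            else:
                queue.append((nxt, new_actions, used + 1))
    return results

def sample_candidates(q_tr: Dict[str, List[Edge]], start: str, k: int,
                      n: int = 200) -> List[List[str]]:
    # Sample n candidate traces; each would then be executed ten times to
    # decide replayability, as in the experiment described above.
    candidates = enumerate_spliced_traces(q_tr, start, k)
    return random.sample(candidates, min(n, len(candidates)))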

Number of Fragments in Traces Generated by Splicing. Even if traces containing many trace fragments tend to be non-replayable, we would not need to bound the number of fragments during phase 2 of DetReduce if most of the traces that can be constructed from Q_Tr contain a small number of fragments. Our hypothesis was that, without a proper bound, a significant number of traces generated from Q_Tr would contain many trace fragments, making them non-replayable with high probability. Therefore, phase 2 of DetReduce would spend a considerable amount of time checking the replayability of non-replayable traces. In order to validate this hypothesis, we constructed 1000 traces composed of at most 20 transitions by sampling random paths from Q_Tr, and checked the number of trace fragments in each sampled trace. Table 5 shows the results. The first column shows the name of the app and the rest of the columns show the number of traces composed of k fragments for each k between 1 and 10. The results show that there are many more traces composed of a large number of fragments than traces composed of fewer fragments. Consequently, if we perform splicing without bounding the number of fragments, we are more likely to get traces composed of a large number of fragments. The results of the previous experiment (Section 5.3.1) suggest that such traces are likely to be non-replayable. This validates our hypothesis that phase 2 of DetReduce will not scale if the number of trace fragments is unbounded.
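The fragment-count measurement can be sketched in the same style, again over the hypothetical adjacency-list encoding of Q_Tr used in the previous sketch. A simple random walk stands in here for the paper's random path sampling, and random_fragment_count and fragment_histogram are illustrative names.

import random
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[str, List[str], str]  # (fragment_id, transitions, next_state), as before

def random_fragment_count(q_tr: Dict[str, List[Edge]],
                          start: str,
                          max_transitions: int = 20) -> int:
    # Follow a random path through Q_Tr, appending whole fragments until the
    # transition budget would be exceeded, and report how many fragments the
    # resulting spliced trace contains.
    state, length, fragments = start, 0, 0
    while q_tr.get(state):
        _fragment_id, transitions, nxt = random.choice(q_tr[state])
        if length + len(transitions) > max_transitions:
            break
        length += len(transitions)
        fragments += 1
        state = nxt
    return fragments

def fragment_histogram(q_tr: Dict[str, List[Edge]], start: str,
                       samples: int = 1000) -> Counter:
    # Distribution of fragment counts over `samples` random spliced traces.
    return Counter(random_fragment_count(q_tr, start) for _ in range(samples))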

Threats to Validity

We used a limited number of benchmark apps to evaluate DetReduce, so it is possible that our results do not generalize. To address this issue, we carefully selected the benchmark apps; the details of the selection process are explained in Section 5.1.1.

The selection of the test generation algorithms could potentially bias the evaluation results. Specifically, the results obtained from a single algorithm cannot determine whether the results can be generalized to other test generation algorithms. To address this issue, we used both the SwiftHand and Random algorithms. We could not use Monkey because it cannot generate replayable traces, as explained in Section 5.1.3. The results obtained using Random show that DetReduce is not an artifact that only works with SwiftHand. However, the results are still not strong enough to decisively conclude that DetReduce can effectively reduce test suites generated by an arbitrary test generation tool.

Finally, in the evaluation, we checked whether the exceptions raised by the original test suites are also raised by the test suites reduced by DetReduce. However, this is a limited evaluation. A more robust evaluation would involve injecting artificial faults into the benchmark apps and measuring the number of injected faults detected before and after test suite reduction. We did not take this approach because of the difficulty of injecting faults into the binary of an Android app.