One important attribute of a compiler is the quality of the code it generates. An experimental design can be used to assess whether two compilers differ in the quality of their generated code. Assume the following design is used.

Select n distinct programs, each as large as possible, such that no source file of one program appears in another program (compiler libraries excepted). This keeps the programs independent of each other and prevents correlation between them. If the sample size is not computed from a power formula for the test, select a sample size greater than 15. A sample size greater than 60 is extremely valuable. Only two compilers are compared, and every program must be compilable by both of them.

Execute the programs and record their success or failure in the following structure:

Program        CLang        GCC
------------   ----------   ---------
1              0 or 1       0 or 1
2              0 or 1       0 or 1
.
.
.
n              0 or 1       0 or 1

where 0 is success (only correct results, without a crash) and 1 is failure (a crash or incorrect results).

When there are failures, generate a cross tabulation of the above table, counting the (CLang, GCC) outcome pairs:

                      GCC Success (0)     GCC Failure (1)
                  |-------------------|-------------------|
CLang Success (0) | count of (0, 0)   | count of (0, 1)   |
                  | pairs             | pairs             |
                  |-------------------|-------------------|
CLang Failure (1) | count of (1, 0)   | count of (1, 1)   |
                  | pairs             | pairs             |
                  |-------------------|-------------------|

Depending on the table structure (especially the number of programs), one of the following tests may be applied.
http://en.wikipedia.org/wiki/Barnard%27s_exact_test (Barnard's test)
http://en.wikipedia.org/wiki/Fisher%27s_exact_test (Fisher's exact test)
http://en.wikipedia.org/wiki/Chi-square_test (Chi-square test)
http://en.wikipedia.org/wiki/Pearson%27s_chi-square_test (Pearson's chi-square test)

If the difference (the contingency coefficient) is significant, the compiler with the smaller number of failures is the better one and the compiler with the larger number of failures is the worse one.

----------------------------------------------------------

Now assume there are no failures, and execution times are available.

Program        CLang        GCC
------------   ----------   ---------
1              t            t
2              t            t
.
.
.
n              t            t

where t is the execution time of the program when built with the given compiler.

Apply a paired t test. If the paired differences are significant, the compiler with the shorter execution times (smaller mean) is the better one and the compiler with the longer execution times (larger mean) is the worse one.

---------------------------------------------------------

The same paired t test may be used for the generated program sizes. If the paired differences are significant, the compiler with the smaller program sizes (smaller mean) is the better one and the compiler with the larger program sizes (larger mean) is the worse one.

Thank you very much.

Mehmet Erol Sanliturk

Received on Wed Mar 16 2011 - 05:00:47 UTC
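As an illustration of the success/failure part of the design, the following is a minimal sketch that builds the cross tabulation from recorded (CLang, GCC) outcome pairs and computes a two-sided Fisher's exact test p-value using only the Python standard library. The function names and the outcome data are hypothetical and are not part of the original message; the eight programs shown are far fewer than the more than 15 recommended above and are used only to keep the example short.

```python
from math import comb


def cross_tabulate(outcomes):
    """Count (CLang, GCC) outcome pairs; 0 = success, 1 = failure.

    Returns a 2x2 list where table[clang][gcc] is the number of programs
    with that pair of outcomes.
    """
    table = [[0, 0], [0, 0]]
    for clang, gcc in outcomes:
        table[clang][gcc] += 1
    return table


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table.

    With the row and column totals held fixed, the top-left cell follows a
    hypergeometric distribution. The p-value is the total probability of
    all tables that are no more likely than the observed one.
    """
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical outcomes for 8 programs, as (CLang, GCC) pairs.
example_outcomes = [
    (0, 0), (0, 0), (0, 0), (0, 1),
    (1, 0), (1, 1), (1, 1), (1, 1),
]
example_table = cross_tabulate(example_outcomes)
example_p = fisher_exact_two_sided(example_table)
print(example_table)
print(f"two-sided p-value: {example_p:.4f}")
```

The resulting p-value would then be compared against the chosen significance level (0.05 is a common choice, assumed here rather than stated in the message).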
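The paired t test described above for execution times (or program sizes) can be sketched in the same way. The example below computes only the t statistic and its degrees of freedom, because the Python standard library does not provide Student's t distribution; the statistic would then be compared with a critical value from a t table for n - 1 degrees of freedom. The function name and the timing values are hypothetical, and the sample is deliberately smaller than the more than 15 programs recommended in the message.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(values_a, values_b):
    """Return (t, degrees_of_freedom) for paired measurements.

    values_a[i] and values_b[i] must belong to the same program, for
    example its execution time under compiler A and under compiler B.
    """
    if len(values_a) != len(values_b):
        raise ValueError("both samples must have the same length")
    diffs = [a - b for a, b in zip(values_a, values_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / sqrt(n))
    return t, n - 1


# Hypothetical execution times in seconds, one entry per program.
clang_times = [12.1, 8.4, 30.2, 5.5, 19.8, 7.3]
gcc_times = [11.6, 8.9, 28.7, 5.1, 19.0, 7.4]

t_value, dof = paired_t_statistic(clang_times, gcc_times)
print(f"t = {t_value:.3f} with {dof} degrees of freedom")
```

A positive t here means the first sample has the larger mean, so under this ordering a significant positive value would point to GCC producing the faster (or smaller) programs, and a significant negative value to CLang.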