Significant differences were found between runs for five of the measures investigated, at least when using the raw data. These measures were: Program Length, Replication Period, Cumulative Activity, Mean Cumulative Activity, and Diversity.

For Program Length and Replication Period, significant differences (at the 0.05 level) were observed in the raw data values between some runs. For these measures, the mean number of other runs from which each run differed significantly at this level was calculated as 12.3 for Program Length and 10.2 for Replication Period, but the high criterion differential on these scores suggests that the true values should be somewhat lower (judging from Figures 6.6 and 6.7, probably somewhere in the range of 6 to 10).
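The per-run counts described above can be computed mechanically from a matrix of pairwise *p*-values. The following sketch is purely illustrative (the function name and the example matrix are invented for this purpose, and are not taken from the Cosmos analysis):

```python
def mean_significant_differences(pvals, alpha=0.05):
    """Mean, over runs, of the number of other runs that each run
    differs from significantly (pairwise p-value below alpha).

    pvals: symmetric square matrix of pairwise p-values, where
    pvals[i][j] is the p-value for comparing run i against run j.
    """
    n = len(pvals)
    counts = [
        sum(1 for j in range(n) if j != i and pvals[i][j] < alpha)
        for i in range(n)
    ]
    return sum(counts) / n

# Hypothetical 3-run example: run 1 differs from both others at the
# 0.05 level, so the per-run counts are [1, 2, 1].
pvals = [
    [1.00, 0.01, 0.20],
    [0.01, 1.00, 0.03],
    [0.20, 0.03, 1.00],
]
print(mean_significant_differences(pvals))  # mean of counts [1, 2, 1]
```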

Looking at the derived measures suggested by Bedau and colleagues (i.e. Cumulative Activity, Mean Cumulative Activity and Diversity), significant differences were found between runs which held up even at the 0.001 level. Again, the true number of runs from which each run differed significantly probably lay in the range of roughly 6 to 10.

*These results indicate that each run, on average, performed
significantly differently to between a third and a half of the other
runs*.
One of the main reasons for doing these
experiments was to understand how we should deal with contingency when
conducting further experiments with Cosmos. If we assume that at least
the finding that each run is statistically different to more than a third of
the others is a general result, then we can use the following rule of
thumb: For each re-run of a trial with a different seed for the RNG,
the probability of its outcome being statistically equivalent (at the *p* = 0.05 level) to the original one is, at most, about 2/3.
Therefore, the
number of *re*-runs that should be conducted to be confident (at
the 95% level) of seeing at least one statistically different type of
behaviour is *n*, where (2/3)^*n* ≤ 0.05,
i.e. *n* ≥ log(0.05) / log(2/3) ≈ 7.4,
or, in round figures, *n* = 8.
This is the
number of *re*-runs *after* the original, so, finally, we
can say that *any trial should be conducted nine times with different
seeds for the RNG*.
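The arithmetic behind this rule of thumb can be checked directly. The short script below (an illustrative sketch, not part of the original analysis) solves (2/3)^*n* ≤ 0.05 for the smallest integer *n*:

```python
import math

p_equiv = 2 / 3     # assumed upper bound on the probability that a re-run
                    # is statistically equivalent to the original (p = 0.05)
confidence = 0.95   # desired confidence of seeing a different outcome

# Smallest integer n with p_equiv**n <= 1 - confidence:
n_reruns = math.ceil(math.log(1 - confidence) / math.log(p_equiv))
print(n_reruns)      # re-runs needed after the original
print(n_reruns + 1)  # total number of trials, including the original
```

With these values, log(0.05)/log(2/3) ≈ 7.4, giving 8 re-runs and hence 9 trials in total, in agreement with the rule of thumb stated above.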

Having said that each run performed significantly differently to at least
a third of the other runs, precisely *which* runs were
significantly different depended upon the particular measure being
looked at. This emphasises the fact that one should be clear about
exactly what measure is being used when talking about comparisons
between evolutionary runs.

The fact that *no* significant differences were found between any
of the runs for any of the measures when looking at *differenced*
sample data suggests that the significant
differences observed in *raw* sample data may be caused (at least
in part) by the cumulative magnification of initially small
differences as a run proceeds. If this effect is controlled for (which
was the purpose of using differenced data), the runs appear very
similar to one another in terms of the *change* in the values of the
measures over a given time period. However,
because of the cumulative magnification of small differences, the
*absolute* outcomes of the runs *do* differ significantly in
some cases, so contingency *does* play a big role.
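This cumulative-magnification effect is easy to reproduce with a toy model (the numbers below are invented for illustration and have nothing to do with Cosmos data): two series built by accumulating near-identical random increments end up far apart in absolute value, even though their step-by-step changes, i.e. the differenced data, remain almost indistinguishable.

```python
import random

random.seed(0)
STEPS = 1000

# Two "runs" whose per-step changes differ only slightly on average.
inc_a = [random.gauss(1.00, 0.1) for _ in range(STEPS)]
inc_b = [random.gauss(1.01, 0.1) for _ in range(STEPS)]

# Raw measures accumulate the increments, so small differences compound.
raw_a = raw_b = 0.0
for a, b in zip(inc_a, inc_b):
    raw_a += a
    raw_b += b

raw_gap = abs(raw_a - raw_b)                             # grows with run length
diff_gap = abs(sum(inc_a) / STEPS - sum(inc_b) / STEPS)  # stays tiny

print(raw_gap > 100 * diff_gap)  # the raw gap dwarfs the mean per-step gap
```

Algebraically, the final raw gap here is exactly STEPS times the mean per-step gap, which is why comparing differenced data can find no significant differences while comparing raw data does.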

As an aside, we can ask to what extent these results can be generalised to other evolutionary systems. Considering biological evolution first, it is clear that, even just in terms of population size and the length of runs, our system is trivial by comparison. Also, the role of contingency may be different in systems which have rich ecological interactions (of which Cosmos programs have very little). It would therefore be unwise to claim that these results can tell us much about the role of contingency in biological evolution, although they may be relevant in specific cases. As for other artificial evolutionary systems, Cosmos is of comparable design, so the results, and the rule of thumb about the number of trials that should be run, should be broadly applicable to these platforms as well. The extent to which ecological interactions affect the results could be investigated by running similar trials on systems that display stronger interactions of this kind (such as Tierra).

In the remainder of this chapter, we will look at the effects of changing various parameter values on the behaviour of the system. In light of the results reported in this section, each of the following experiments is conducted nine times, using different seeds for the RNG each time.

1999-05-29