Pseudorandom Data and Clone Analyses

NOTE: The pseudoegg database has been moved to offline storage. If you need those data, please contact Roger Nelson.

One of the questions that we must ask is how to do control experiments to compare with the analyses that indicate an anomalous effect of global events. On May 2nd, 2001, I got the following note from John Walker, describing a way to provide pseudorandom data suitable for clone analyses corresponding to the actual events in the formal database. The pseudodata described are accessible as compressed files for download.

Date: Wed, 02 May 2001 21:46:25 +0200 
From: John Walker  
To: Roger Nelson  
Subject: Pseudorandom basketdata control files

With the growing apparent significance of the formal prediction results from the GCP, I've been thinking about possible controls against which the REG output might be tested in order to exclude systematic errors or faulty statistical analysis as the source of the measured effect.

Given the assumptions underlying the effect for which we're testing and the need for as close correspondence as possible between the evaluations of data for the formal predictions and control tests, I developed the following scheme, which has the additional merit that it may be used as a control for any analysis whatsoever which is based on the eggsummary/basketdata* files. ... 

[Programs and scripts create an "eggsummary" file corresponding to the real data for a given day, in the CSV format but with the] "11,1" header line modified to indicate the file as containing pseudorandom data and each egg data sample replaced by a pseudorandom value generated using the same statistics. The pseudorandom output file will contain precisely the same number of samples as the input file--missing egg samples in the input map to missing samples in the output. The process used to generate the pseudorandom file is deterministic--processing the same input basketdata file should always generate the same pseudorandom clone file. 
 ...
Generation of the pseudorandom mirror files is done in an extremely conservative manner in which each value from an egg is replaced with the result of 200 calls on the high quality pseudorandom generator used by the PSEUDO REG driver in the egg software. This is computationally expensive--generating the pseudorandom mirror for a typical current eggsummary file takes about 10 minutes of CPU time on the noosphere machine. If we wish to create a complete mirror of the existing data set, this will be a big job, but it can be run using a low "nice" priority to avoid interfering with other tasks on noosphere. Once we've caught up to current data, the 10 minutes per day to create the pseudorandom counterpart of each eggsummary file as it is posted will be a minor addition to noosphere's workload.

With parallel live and pseudorandom data available, it will be easy to test both the existing set of predictions and new predictions as they are made against a pseudorandom data set. The pseudorandom data set will also be available as a control for other analyses such as inter-egg correlation studies. A checkbox could be added to the Café to perform a query using the pseudorandom data set, or perhaps it might generate a plot showing the actual data and pseudorandom control on the same graph in different colours. Once a complete pseudorandom data set is available, performing a control experiment will consist of simply re-running the original experiment with the data source directory set to that containing the control files; they have the same names and are in the same gzipped .csv format as their real data counterparts. 
 ... 
-----------------  ---------------- 
John Walker           | You can teach a rock to calculate, 
kelvin@fourmilab.ch   | but you'll never make it think.
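
As a rough illustration of the scheme John describes, here is a minimal Python sketch. It is not the program used on noosphere. It assumes a simplified, hypothetical file layout (a timestamp followed by one sample per egg, with an empty field for a missing sample) rather than the real eggsummary format. It also assumes that each of the 200 generator calls yields one bit, and it uses Python's built-in generator as a stand-in for the one in the egg software. Seeding from a hash of the input is one possible way to make the output deterministic; the note does not say how the real programs do it.

```python
import hashlib
import random

BITS_PER_SAMPLE = 200  # assumption: each of the 200 generator calls yields one bit


def clone_line(line, rng):
    """Replace every non-empty sample on one data line with a pseudorandom value."""
    timestamp, *samples = line.split(",")
    cloned = [
        # Missing samples (empty fields) stay missing in the clone.
        str(sum(rng.getrandbits(1) for _ in range(BITS_PER_SAMPLE))) if s.strip() else ""
        for s in samples
    ]
    return ",".join([timestamp] + cloned)


def clone_text(text):
    """Return a pseudorandom clone of simplified 'timestamp,sample,...' CSV text."""
    # Assumption: seeding from a hash of the input makes the clone reproducible,
    # so the same input always gives the same pseudorandom output.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest(), "big")
    rng = random.Random(seed)
    return "\n".join(clone_line(line, rng) for line in text.splitlines())
```

Calling clone_text twice on the same text returns identical output, and every empty field in the input is still empty in the result, which are the two properties the note asks for.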

I was delighted by John's proposed pseudorandom clone of the database, because it offers an immediate and functional approach to the problem of control comparisons. The solution is not as general as the designed (but unimplemented) resampling scheme, but ultimately it can greatly strengthen our confidence in the statistics, and that confidence will increase as the number of events grows. I sent a response to John with some practical considerations and the following observation:

It sounds great. The only hitch is that any given experiment based on the pseudo data will be just one sample of the distribution of chance outcomes. However, in the long run, doing the clone analyses for the whole formal database will provide the same type of high level outcome distribution as the real data -- and they should _not_ look alike.

As a full-blown test of the new scheme, I obtained the pseudo data for April 22, 2001, Earth Day, and processed it using the same functions as for the formal analysis. Everything worked fine. The mean for the day was 100, exactly, and the standard chi-square test yielded 86498 on 86400 df, with p = 0.4062. The graph below shows the cumulative deviation of the pseudo data, which looks, not surprisingly, just like a graph of real data. But in the formal analysis for Earth Day the trend for the real data is consistently positive and culminates in a probability of 0.037. We will gradually make one of these control analyses for each of the 70-odd formal events, and build up the pseudo equivalent of the formal database. It will be an interesting and useful addition to the statistical background.

[Figure: cumulative deviation of the pseudorandom data for Earth Day, April 22, 2001]
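
For readers who want to try this kind of calculation, the following Python sketch shows one common way to compute a chi-square statistic and a cumulative deviation curve. It is not the set of functions used for the formal analysis. It assumes each sample is a sum of 200 fair bits (expected mean 100, variance 50), that the eggs reporting in a given second are combined into a single Z-score, and that one squared Z-score per second is accumulated, which gives one degree of freedom per second. The function names are hypothetical, and the p-value uses the Wilson-Hilferty approximation rather than an exact chi-square distribution.

```python
import math

MEAN, VARIANCE = 100.0, 50.0  # assumption: each sample is a sum of 200 fair bits


def second_z(samples):
    """Combine the eggs reporting in one second into a single Z-score."""
    present = [s for s in samples if s is not None]
    if not present:
        return None  # no egg reported during this second
    total = sum((s - MEAN) / math.sqrt(VARIANCE) for s in present)
    return total / math.sqrt(len(present))


def chi_square(rows):
    """Return (chi-square, degrees of freedom) over all seconds with data."""
    zs = [z for z in map(second_z, rows) if z is not None]
    return sum(z * z for z in zs), len(zs)


def cumulative_deviation(rows):
    """Running sum of (Z^2 - 1), which hovers around zero for chance data."""
    running, curve = 0.0, []
    for z in map(second_z, rows):
        if z is not None:
            running += z * z - 1.0
            curve.append(running)
    return curve


def chi2_p_value(chi2, df):
    """Upper-tail p-value via the Wilson-Hilferty normal approximation."""
    k = 2.0 / (9.0 * df)
    z = ((chi2 / df) ** (1.0 / 3.0) - (1.0 - k)) / math.sqrt(k)
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

With the figures quoted above, chi2_p_value(86498, 86400) gives about 0.406, in line with the reported p = 0.4062.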