Pseudorandom Data and Clone Analyses

One of the questions that we must ask is how to do "control" experiments to compare with the analyses that indicate an anomalous effect of "global events". On May 2nd, 2001, I got the following note from John Walker, describing a way to provide pseudorandom data suitable for "clone" analyses corresponding to the actual events in the formal database. The pseudodata described are accessible as compressed files for download here.


Date: Wed, 02 May 2001 21:46:25 +0200
From: John Walker 
To: Roger Nelson 
Subject: Pseudorandom basketdata control files

With the growing apparent significance of the formal
prediction results from the GCP, I've been thinking
about possible controls against which the REG output
might be tested in order to exclude systematic
errors or faulty statistical analysis as the
source of the measured effect.

Given the assumptions underlying the effect for
which we're testing and the need for as close
correspondence as possible between the evaluations
of data for the formal predictions and control tests,
I developed the following scheme, which has the
additional merit that it may be used as a
control for any analysis whatsoever which is
based on the eggsummary/basketdata* files.
...


[Programs and scripts create an "eggsummary" file corresponding 
to the real data for a given day, in the CSV format but with the] 
"11,1" header line modified to indicate the file as containing 
pseudorandom data and each egg data sample replaced by a
pseudorandom value generated using the same statistics.
The pseudorandom output file will contain precisely
the same number of samples as the input file--missing
egg samples in the input map to missing samples in the
output.  The process used to generate the pseudorandom
file is deterministic--processing the same input basketdata
file should always generate the same pseudorandom
clone file.

...

Generation of the pseudorandom mirror files is done in an
extremely conservative manner in which each value from
an egg is replaced with the result of 200 calls on the
high quality pseudorandom generator used by the PSEUDO REG
driver in the egg software.  This is computationally
expensive--generating the pseudorandom mirror for a
typical current eggsummary file takes about 10 minutes
of CPU time on the noosphere machine.  If we wish to
create a complete mirror of the existing data set, this
will be a big job, but it can be run using a low "nice"
priority to avoid interfering with other tasks on
noosphere.  Once we've caught up to current data,
the 10 minutes per day to create the pseudorandom counterpart
of each eggsummary file as it is posted will be a minor
addition to noosphere's workload.

With parallel live and pseudorandom data available, it
will be easy to test both the existing set of
predictions and new predictions as they are made against
a pseudorandom data set.  The pseudorandom data set will
also be available as a control for other analyses such as
inter-egg correlation studies.  A checkbox could be added
to the Café to perform a query using the pseudorandom
data set, or perhaps it might generate a plot showing the
actual data and pseudorandom control on the same graph
in different colours.  Once a complete pseudorandom data
set is available, performing a control experiment will
consist of simply re-running the original experiment
with the data source directory set to that containing the
control files; they have the same names and are in the same
gzipped .csv format as their real data counterparts.

...

-----------------    ----------------
John Walker             |  You can teach a rock to calculate,
kelvin@fourmilab.ch     |  but you'll never make it think.
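
The cloning step John describes can be illustrated with a minimal sketch. It is not his program: it assumes that each of the 200 generator calls yields one bit and that the bits are summed to give the replacement value, it uses Python's built-in generator in place of the one in the PSEUDO REG driver, and it seeds the generator from a date string so that the same input always gives the same clone. The function name clone_samples, the seed choice, and the use of None for a missing sample are all hypothetical.

```python
import random
from typing import List, Optional

BITS_PER_SAMPLE = 200  # the note specifies 200 generator calls per replaced value


def clone_samples(samples: List[Optional[int]], seed: str) -> List[Optional[int]]:
    """Replace each real sample with a pseudorandom one; keep gaps as gaps."""
    # Assumption: a fixed seed per input file makes the clone reproducible.
    rng = random.Random(seed)
    clone: List[Optional[int]] = []
    for value in samples:
        if value is None:
            # A missing egg sample in the input maps to a missing sample in the output.
            clone.append(None)
        else:
            # Assumption: one bit per call, summed over 200 calls (expected value 100).
            clone.append(sum(rng.getrandbits(1) for _ in range(BITS_PER_SAMPLE)))
    return clone


# Illustrative input only; these are not values from the database.
real = [103, None, 98, 110, None]
pseudo = clone_samples(real, seed="2001-04-22")
print(pseudo)
```

Running the function twice with the same input and seed returns identical output, which is the deterministic property the note asks for.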

I was delighted by John's proposed pseudorandom clone of the database, because it offers an immediate and functional approach to the problem of control comparisons. It is not as general as the resampling scheme we have designed but not yet implemented, but it can ultimately do a great deal to strengthen our confidence in the statistics, and that confidence will increase as the number of events grows. I sent a response to John with some practical considerations and the following observation:

  

It sounds great.  The only hitch is that any given
experiment based on the pseudo data will be just one sample
of the distribution of chance outcomes.  However, in the long 
run, doing the clone analyses for the whole formal database will
provide the same type of high level outcome distribution as
the real data -- and they should _not_ look alike.

As a full-blown test of the new scheme, I obtained the pseudo data for April 22, 2001, Earth Day, and processed it using the same functions as for the formal analysis. Everything worked fine. The mean for the day was 100, exactly, and the standard Chi-square test yielded 86498 on 86400 df, with p = 0.4062. The graph below shows the cumulative deviation of the pseudo data, which looks, not surprisingly, just like a graph of real data. But in the formal analysis for Earth Day the trend for the real data is consistently positive and culminates in a probability of 0.037. We will gradually make one of these control analyses for each of the 70-odd formal events, and build up the pseudo equivalent of the formal database. It will be an interesting and useful addition to the statistical background.
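
The probability quoted for the pseudo data can be checked roughly from the Chi-square value and its degrees of freedom alone. The sketch below is not the function used in the formal analysis; it assumes the Wilson-Hilferty normal approximation to the Chi-square distribution, which is accurate at this many degrees of freedom and needs only the Python standard library. The function name is hypothetical.

```python
import math


def chi2_upper_tail(x: float, df: int) -> float:
    """Approximate P(X >= x) for a Chi-square variable with df degrees of freedom."""
    # Wilson-Hilferty: the cube root of X/df is approximately normal
    # with mean 1 - 2/(9 df) and variance 2/(9 df).
    c = 2.0 / (9.0 * df)
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c)
    # Upper tail of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


p = chi2_upper_tail(86498, 86400)
print(round(p, 3))
```

With 86498 on 86400 df the result should land close to the reported p = 0.4062, a value entirely consistent with chance expectation.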
