GCP: Discussion of Alternative Methods Last update 981126

Other ways to look at the problem, possible additions to the methodology

Discussion of procedures with colleagues reveals much about assumptions and leads to a better understood methodology and a clearer experimental design. The following material is drawn from such discussions in private or public forums. Some or all of the proposals for improvements will be integrated into the project as time permits.


Date: Tue, 10 Nov 1998 18:36:16 -0500
From: Richard Broughton

It indeed will be a challenge to deal with the [Global Event (GE)] situation, but there is an approach from a different angle that might be considered. Back in the 1970s, Brian Millar and I were urging the use of data-splitting techniques in experimental situations where one really didn't know what effects might be found. It was not a new technique, but the increasing availability of lab computers to record data blind made it much easier to employ. We were touting this as a way of combining pilot and confirmatory experiments in a single study. One would collect the data automatically and blindly, and then use some method to split part of it into a nominally "pilot" part and then a "confirmatory" part. One could scrounge through the pilot data as much as one wanted, but any findings could then be predicted to be found in the confirmatory part. If they didn't confirm, then one couldn't blame changed experimental conditions.

With [Global Events] it would seem that you are in a similar situation. You don't know what might result in an effect, what kind of an effect, and over what time period. Well, setting aside experimenter effect issues, if there is a "Gaia effect" on an EGG here or there that looks promising one would expect it to be distributed over the entire data sample. Suppose you simply doubled your data collection, but then only looked at the odd bits. Presumably, if Gaia affected the odd ones, she affected the even ones as well. So if you found a presumably [GE]-linked effect in the odd bits, you should be able to confirm that in the even bits as well. No?
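
A minimal sketch in Python of how such an odd/even split might be set up; the simulated trial generator, the block size, and the one-sided mean-shift test are illustrative assumptions, not part of any proposed GCP analysis:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Stand-in for REG data: 10000 trials, each the sum of 200 random bits (MCE = 100).
    trials = rng.integers(0, 2, size=(10000, 200)).sum(axis=1)

    pilot = trials[0::2]      # "odd bits": explore freely for candidate effects
    confirm = trials[1::2]    # "even bits": held out, used once to test pilot findings

    # Suppose exploration of the pilot half suggests a positive mean shift;
    # the same one-sided test is then applied, once, to the confirmation half.
    p_pilot = stats.ttest_1samp(pilot, popmean=100, alternative='greater').pvalue
    p_confirm = stats.ttest_1samp(confirm, popmean=100, alternative='greater').pvalue
    print(f"pilot p = {p_pilot:.3f}, confirmation p = {p_confirm:.3f}")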

That is just a crude example. Real data splitting can get pretty complicated in terms of the best ratio of pilot to confirmation data, and there are lots of ways one could decide to divide up the REG data. It runs a high risk of never finding anything that you can really talk about. In the few times I used it I never got anything to confirm, but that was before the greater awareness of effect sizes and power. On the other hand, if you did find [GE] effects in part of the data and confirmations in the other part, then folks would have to take notice.


--- Response from Roger Nelson:

Yes. We can do this as an adjunct to the simple predictions based on intuition and media intensity.

--- Date: Wed, 25 Nov 1998 11:28:51 -0500
From: Richard Broughton

An article on data splitting by Richard R. Picard and Kenneth N. Berk appeared in _The American Statistician_, May 1990, Vol 44, No 2, pp 140-47. The abstract reads:

"Data splitting is the act of partitioning available data into two portions, usually for cross-validatory purposes. One portion of the data is used to develop a predictive model and the other to evaluate the model's performance. This article reviews data splitting in the context of regression. Guidelines for slitting are described, and the merits of predictive assessments derived from data splitting relative to those derived from alternative approaches are discussed.

------- On Tue, 24 Nov 1998, James Spottiswoode wrote:

The Alternative
---------------
A way out of all these problems is as follows. Given the following hypothesis, which I think is a corollary of the GCP hypothesis:

GCP - H2
--------
When an effective (in the sense of H1) event occurs, all the RNG's in the GCP will have their means and/or variances modified from MCE.

Given H2, we can test the N*(N - 1)/2 independent correlations between the N RNG's, where we take as the summary variable the RNG's deviation from mean, perhaps Z^2, or the variance, the choice to be determined by trial. If we found, say, that there were time periods during which the intercorrelations were improbably large, we could look to see what events were occurring at those times which might, by H1 and H2, be responsible. Having then run the experiment for a pilot phase during which examples of "effective events" were gathered, one could then arrive at a set of definitions which covered them. That set of definitions could then be used in a prospective study to see if the associated deviations, and hence correlations, were replicated with the same event definitions being used to select a new set of "effective events."
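
A minimal sketch, in Python, of the windowed scan this suggests: compute all pairwise correlations within successive time windows and flag windows in which the mean inter-RNG correlation is improbably large. The window length and the per-block Z^2 summary variable are assumptions for illustration only:

    import numpy as np
    from itertools import combinations

    def mean_pairwise_r(block):
        """Mean Pearson correlation over all N*(N-1)/2 RNG pairs in a (time x N) block."""
        pairs = combinations(range(block.shape[1]), 2)
        return np.mean([np.corrcoef(block[:, i], block[:, j])[0, 1] for i, j in pairs])

    def scan_windows(zsq, window=60):
        """Mean pairwise correlation in each non-overlapping window.

        zsq: array of shape (time, N) holding each RNG's Z^2 (or variance) per time step.
        """
        n_win = zsq.shape[0] // window
        return np.array([mean_pairwise_r(zsq[k*window:(k+1)*window]) for k in range(n_win)])

Windows whose mean correlation stands far above the distribution over all windows would mark candidate "effective event" periods whose concurrent events could then be examined.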

The advantage of this method of working is that it sidesteps all the problems of whether a flood in Bangladesh is more effective at altering RNG's behavior than a football game watched by millions on TV. You get the data to tell you.

This is certainly data snooping in the pilot phase. But all is fine if the snoop leads to crisp definitions which are faithfully applied to filter events in the prospective part of the experiment.

I hope this is at least clear, even if wrong or silly.

Thanks. Much more clear, and it is useful, although I think the problems of identifying the qualities of effective events will remain. None of the methods 1, 2, or this alternate, 3, is likely to work out of the box, and all of them should improve with repetition. Richard Broughton has suggested a closely related procedure, modeled on the split-half reliability computations often used in psychology. I want to implement a version of that, as a complement to the simple analysis.

With some serious thought and planning, your plan might be practical, but I have 'til now rejected the general approach because I think it will be too easy to find corresponding events that are in fact irrelevant chance matches if one starts with a search in the data for big inter-correlations. Our pattern matching and wishful thinking propensities are deadly liabilities in this situation. But let's try it -- should be a good learning tool, one way or another.

Roger
On Wed, 25 Nov 1998, James Spottiswoode wrote:

[James responding to Ephraim Schechter] I'm not sure James' approach really is more objective than Roger's. As Roger notes, it may require just as much subjective decision making to identify events that seem associated with the RNG deviations. AND it requires a huge database of RNG outputs to identify the key time periods in the first place. Yes, RNG outputs may be less subject to "pattern matching and wishful thinking propensities" than world events are, but I'm not sure whether I'm more comfortable with using them as the dependent variable (Roger's approach) or as the starting place (James' approach). I don't see either as clearly the right way to go. Relative risks: Roger's approach risks picking wrong hypotheses about events in the first place, and therefore getting demonstrations but not the kind of results that would really test those hypotheses. James' approach might stand a better chance of zeroing in on the correct hypothesis in the long run, but only after immense time/effort to gather that huge database of RNG behaviors and world and experimenter events. If it's done with less data, it may be no better than Roger's approach at identifying hypotheses worth following up.

I agree with you about the shortcomings of my approach. However it has one distinct advantage which you haven't mentioned, which is why it was brought up in the first place. With NO assumptions about the nature of "Effective Events (EE)" it tests the GCP hypothesis that (all) RNG's get zapped by EE's. Simply testing whether there are more extra-chance inter-correlations than there should be will tell you whether that hypothesis is false. That is its real power. Finding the nature of EE's, if they are found to exist, seems to be a secondary task for which the method would be rather inefficient.

James

Date: Sun, 29 Nov 1998 07:48:47 -0800
From: James Spottiswoode
Subject: Noosphere project - back to beginning

Folks, I see that my original posting on this topic contained the following question:
I am afraid that I take a very restricted, Popperian, view of science. There are many reasons for so doing: it has been extraordinarily successful at understanding the world, it embodies realism (suitably modified for QM) and it provides a clear, and practically applicable, principle of demarcation between scientific theories and pseudoscience. A corollary of this demarcation principle is that "the more a theory forbids the more it tells us.(1)" Therefore, my next question is this. While I cannot understand the theory being tested by the GCP project, perhaps I can have explained to me, simply, _what does this theory forbid_? What should NOT be observed in the RNG data outputs?

I think I have now provided the answer to this myself:

H1: Certain kinds of events involving human mental states changing (currently their exact definition is unknown) cause all RNG's to deviate from MCE.

(Comment: H1 seems to be the most general version of the GCP hypotheses).

Test for H1: Select a summary statistic for the RNG outputs, e.g. the variance over a chosen period. Compute this variance as a function of time for each of the N RNG's as var1(t), var2(t), ..., varN(t) for the recording interval. Then by H1 the N(N - 1)/2 Pearson correlations between all pairings of RNG's are:

rIJ = r(varI(t), varJ(t))    (I = 1, ..., N; J = 1, ..., I-1)

where r(X,Y) is the Pearson correlation function. Then under the null (~H1) hypothesis the rIJ's should have chance expectation statistics. (1) Conversely, under H1, the rIJ's should have excessive positive excursions and these excursions should be seen simultaneously in all the rIJ's. (2)
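
A sketch in Python of this test as stated, assuming a block length for the variance computation and a Fisher-z comparison of the rIJ's against their chance value of zero (both choices are illustrative, not specified in the posting):

    import numpy as np
    from itertools import combinations
    from scipy import stats

    def summary_series(trials, block=600):
        """varI(t): variance of each RNG's trial scores within successive blocks."""
        n_blocks = trials.shape[0] // block
        cut = trials[:n_blocks * block].reshape(n_blocks, block, trials.shape[1])
        return cut.var(axis=1)                      # shape: (time blocks, N RNGs)

    def pairwise_r(var_t):
        """All rIJ = r(varI(t), varJ(t)) for I > J."""
        n = var_t.shape[1]
        return np.array([np.corrcoef(var_t[:, i], var_t[:, j])[0, 1]
                         for i, j in combinations(range(n), 2)])

    def excess_correlation(rij, n_time):
        """Fisher-z the rIJ's and test their mean against the null value of zero."""
        z = np.arctanh(rij) * np.sqrt(n_time - 3)   # approx. standard normal under the null
        zbar = z.sum() / np.sqrt(len(z))            # treating pairs as independent, as in the posting
        return zbar, stats.norm.sf(zbar)            # one-sided p for excess positive correlation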

Note 1: Some think that H1 should be restricted to RNG's in a certain locale, say the U.S. or a city. With this restriction the same test can be applied.

Note 2: There are two unspecified "parameters" in H1, the summary statistic to be used (variance in the above example) and the time interval over which to compute that statistic. Each choice of these gives rise to an individually testable version of H1.

(1) is what the GCP hypothesis forbids.
(2) is what it predicts.
Does everyone agree with this formulation? If not, why not?

James

------
Roger replied yes, and repeated his commitment to implement this correlation test as a complementary analysis model, with James' help in its development. Subsequent commentary by York Dobyns makes clear that this complementary model is not yet correctly formulated.

------
Date: Mon, 30 Nov 1998 20:03:35 -0500 (EST)
From: York H. Dobyns

I have followed the subsequent discussion on this thread, but have returned to James' initial proposal for clarity.

On Sun, 29 Nov 1998, James Spottiswoode wrote:
[...]
H1: Certain kinds of events involving human mental states changing (currently their exact definition is unknown) cause all RNG's to deviate from MCE. (Comment: H1 seems to be the most general version of the GCP hypotheses).

So far, so good. One caveat required is that it is the (unknown) underlying distribution of the RNG outputs that is presumed to have changed; there is no guarantee that any specific instance (sample) from that distribution will be detectably different from null conditions.

Test for H1: Select a summary statistic for the RNG outputs, e.g. the variance over a chosen period. Compute this variance as a function of time for each of the N RNG's as Var1(t), var2(t),... varN(t) for the recording interval. Then by H1 the N(N - 1)/2 Pearson correlations between all pairings of RNG's are:

rIJ = r(varI(t), varJ(t))    (I = 1, ..., N; J = 1, ..., I-1)
where r(X,Y) is the Pearson correlation function.
Then under the null (~H1) hypothesis the rIJ's should have chance expectation statistics. (1)

Conversely, under H1, the rIJ's should have excessive positive excursions and these excursions should be seen simultaneously in all the rIJ's. (2)

This is where I have problems. Why on earth is the extra step of the Pearson correlation useful or desirable? The hypothesis as formulated for fieldREG hitherto has been simply to compute a test statistic (not the variance, incidentally), and look for changes in its value. An effect that changes the average value of the test statistic in active segments need not have any effect on the correlation function; indeed, it will not, unless some further structure is presumed.

As far as I can tell the Pearson correlation function is simply the quantity I learned to call "the" correlation coefficient,
r_{xy} = (<xy> - <x><y>) / (s_x s_y),
where I am using <> to denote expectation values and s to denote the standard deviation. If I am calculating rIJ between RNG I and RNG J, and the effect has manifested as a simple increase in the mean value of the test statistic I am using, for either or both RNGs, there is no change whatever in the value of r.

Even though there have been some disagreements about the appropriate test statistic to use in fieldREG applications, the above problem applies to *any* test statistic. Suppose that we have datasets x and y, and calculate their correlation coefficient as above. Suppose we now look at x' = x+a and y'=y+b, a model for modified data where the values have been changed by some uniform increment. We all know this has no effect on the standard deviations. The numerator of r_{x'y'} becomes

<(x+a)(y+b)> - <x+a><y+b> = <xy + ay + bx + ab> - (<x>+a)(<y>+b)
= <xy> + a<y> + b<x> + ab - <x><y> - a<y> - b<x> - ab
= <xy> - <x><y>,

exactly as before, so the whole formula ends up unchanged. If the active condition in fieldREG adds an increment to the test statistic -- whatever one uses for a test statistic, whatever the value of the increment -- the correlation test will not show a thing.
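
A quick numerical check of this point, with arbitrary offsets a and b standing in for the hypothesized increment:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=1000)
    y = rng.normal(size=1000)
    a, b = 3.7, -12.0

    r_before = np.corrcoef(x, y)[0, 1]
    r_after = np.corrcoef(x + a, y + b)[0, 1]
    print(np.isclose(r_before, r_after))   # True: a uniform shift is invisible to r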

[...] Does everyone agree with this formulation? If not, why not?

I do not agree with this formulation, and have outlined my reasons above.

In a subsequent posting, James wrote:

But the _correlation_ between the summary stats of each RNG will be positive - that is the point and why this is such a general test for _anything_ being modified in the RNG's behaviour. All that is required is for the modification to be consistent across the RNG's and simultaneous - which seems to be the essence of the GCP hypothesis.

Since there is a clear class of models in which this test *fails* to detect the presence of a real effect, I do not see why James calls it "general". Since that class of models includes those tests currently and successfully in use in single-source and impromptu global field REG experiments, I am at a loss to understand the desirability of this non-test as the proper approach for GCP.

Date: Tue, 1 Dec 1998 12:12:58 -0500 (EST)
From: York H. Dobyns

On Mon, 30 Nov 1998, James Spottiswoode wrote: York - perhaps we have misunderstood each other here. Much of the early part of this thread revolved around the problems associated with defining the events deemed by the FieldREG-ers to be effective. So I wrote H1 with deliberate vagueness as to what these events might be. The point of my test is that it should work _with no specification of the nature or timings of "effective events"_ at all. Why? If, ex hypothesi, all the RNG's deviate in sync when an "effective event" occurs, then the correlations between suitable summary statistics of their outputs will demonstrate this.

I was about to object vehemently to this form of the hypothesis, because it does not correspond at all to the fieldREG protocol as exercised; but I then remembered that we are talking in terms of a generalized and as-yet unspecified test statistic. Unfortunately there remains a strong objection to this approach, but it's a different objection.

To deal with the straw-man first, the extant fieldREG protocol requires the identification of "active" data segments by the experimenter, with all the subjectivity that implies. The test statistic that PEAR has used is simply the Z score of the active segments, and we test for the presence of (undirected) mean shifts by summing all the Z^2 figures to arrive at a chi-squared for the overall database. So our "hypothesis" is *not*, as implied above, that all the sources deviate "in sync" during an active segment, but rather that their output distribution develops a larger variance (where "output" is taken as the mean of, typically, 10-40 minutes of data).
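
A minimal sketch of that test in Python; the segment Z scores are placeholders, not real data:

    import numpy as np
    from scipy import stats

    segment_z = np.array([1.8, -0.4, 2.3, 0.9, -1.1])   # hypothetical Z scores of active segments

    chisq = np.sum(segment_z ** 2)          # sum of Z^2 over active segments
    df = len(segment_z)                     # one degree of freedom per segment
    p = stats.chi2.sf(chisq, df)            # probability of so large a chi-squared by chance
    print(f"chi-squared = {chisq:.2f} on {df} df, p = {p:.3f}")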

The reason that this is a straw-man is that, even though your use of the phrase "deviate" above sounds like you're talking mean shifts, the phrase could be applied to any test statistic. The squared Z of, say, 10-minute blocks should "deviate in sync" more or less as you describe, even though it is less efficient. (The test we're using for fieldREG is provably optimal *if* one assumes that the experimenters can reliably identify active segments.)

But the real objection is that we *cannot* do this test in a meaningful fashion without having some identification of "special" time intervals in which an effect is expected. Lacking some external condition to predict that deviations (on whatever test statistic you care to use) are expected at some times and not at others, you can collect and correlate data for as long as you please, and all you are proving is that your RNGs are bogus.

This was the concern that (originally) caused me to be a bitter opponent of the GCP, especially after Dick's talk in Schliersee. Unless the experimental design includes a clear criterion for identifying active data periods that are distinct from a control condition, there is no way of distinguishing the GCP or EGG protocol from a gigantic multi-source calibration. There is no way whatever to distinguish the effect one is looking for from a design flaw in the RNG sources, no matter how cleverly one designs a test statistic. The inter-source correlation test merely defers the problem by forcing one to consider the ensemble of all the RNGs as the "source" under consideration; the answer comes out the same.

It was only Roger's assurance that active experimental periods would be identified and compared with "control" sequences that finally persuaded me that EGG was worth doing. Any effort to get away from that identification process in the name of "objectivity" or "precision" simply means you're doing a non-experiment. We are stuck with that identification process, subjective and flawed though it may be, because without it we can never demonstrate an effect, we can only demonstrate that our RNGs are broken.

Date: Fri, 4 Dec 1998 15:04:51 -0500 (EST)
From: York H. Dobyns

On Wed, 2 Dec 1998, James Spottiswoode wrote:
[...I wrote:]
To deal with the straw-man first, the extant fieldREG protocol requires the identification of "active" data segments by the experimenter, with all the subjectivity that implies. The test statistic that PEAR has used is simply the Z score of the active segments, and we test for the presence of (undirected) mean shifts by summing all the Z^2 figures to arrive at a chi-squared for the overall database. So our "hypothesis" is *not*, as implied above, that all the sources deviate "in sync" during an active segment, but rather that their output distribution develops a larger variance (where "output" is taken as the mean of, typically, 10-40 minutes of data).

[James responds]
I only used vague terms like "deviate" because I had understood from reading Roger's posting that various stats, mean shift, variance and others, were under consideration to use. Your portrayal of the project is quite different: your statistic has been fixed a priori. Jolly good, one less degree of freedom to fudge.

Let us be careful here. My description was of the *fieldREG* protocol, of which GCP is a generalization. While I believe Roger would be content to proceed with a field-REG type analysis on GCP (he will have to speak for himself on this one), other GCP participants have their own ideas about the optimal test statistic. I don't see this as any big deal; at worst it means a Bonferroni correction for the number of analysts. It is my understanding, however, that all of the proposed analyses share in common the concept of distinguishing between active and non-active segments.

But there is something mysterious, to me, about the above para. You say that your "hypothesis" is not that the RNG's vary "in sync" but that their Z^2, summed over a time window that you also roughly specify, should increase. You then proceed to argue that a correlation test is useless and that unless "effective events" can be pre-specified the GCP could not distinguish between a real effect and faulty RNG's. This argument I do not follow. What is wrong with the following:

H1b: Certain kinds of events involving human mental states changing (currently their exact definition is unknown) cause the mean Z^2 (averaged across a 10 minute block for instance) of all RNG's to increase.

With the caveat that what I was talking about was a Z^2 *for the entire block*, not some kind of average within blocks, this is an accurate statement of the concept. The problem is with the subsequent reasoning, see below.

You seem to agree, York, that:
The squared Z of, say, 10-minute blocks should "deviate in sync" more or less as you describe, even though it is less efficient.

Yes.

Therefore, we presumably agree that the correlation between two RNG's Z^2's has an MCE of zero and, ex hypothesi, a value greater than zero during an "effective event."

*Here* is the problem. The hypothesis says that the Z^2 test statistic (which has an expectation of 1 under the null hypothesis) should increase during "effective events" and remain at its theoretical value at other times. This does *not* produce a positive correlation during "effective events"; the expected correlation coefficient within effective periods remains zero (a change in the expectation value does not change the correlation). A positive correlation between sources is expected *only* if one looks at intervals that include both "effective" and "inactive" periods: then, and only then, are the correlated changes in the test statistic visible.

Equally, a set of N RNG's will give rise to N*(N-1)/2 inter-correlations between the Z^2 values which should all show a positive "blip" during an event.

Again, you *can't* get a positive blip on the correlation test: the positive blip is in the raw value of the test statistic, and gets you only one data point in each source's output (or N data points where N is (duration of event)/(duration of averaging block)). *If* all the devices respond, and you run over both active and inactive periods, you see that the correlation matrix for all sources is in fact net positive.
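
A small simulation in Python of this distinction, under the assumption that every RNG's per-block Z^2 receives the same added increment during one "effective" period (the effect size and block counts are arbitrary and deliberately exaggerated):

    import numpy as np

    rng = np.random.default_rng(2)
    n_blocks, effect = 2000, 3.0
    active = np.zeros(n_blocks, dtype=bool)
    active[500:700] = True                    # one "effective event" period

    # Z^2 per block for two RNGs (chi-squared, 1 df), shifted upward during the event.
    zsq1 = rng.chisquare(1, n_blocks) + effect * active
    zsq2 = rng.chisquare(1, n_blocks) + effect * active

    r_within = np.corrcoef(zsq1[active], zsq2[active])[0, 1]   # near zero inside the event
    r_mixed = np.corrcoef(zsq1, zsq2)[0, 1]                    # positive only across mixed periods
    print(f"within-event r = {r_within:.3f}, mixed-period r = {r_mixed:.3f}")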

And even that is a big if. The multiple-source fieldREG, even when a highly significant result is attained, usually derives its significance from large deviations in a small minority of the sources. For a chi-squared distribution there is nothing wrong with this; it is in fact the expected behavior given that a large chi-squared is attained. But this means that any given session is unlikely to increase the test statistic in *all* of the sources.

[...]
You seem to have considered the above argument and declared, if I understood right, that one could not distinguish between effective events and bust RNG's by the above method. But why should a set of bust RNG's show a simultaneous increase in their variance (assuming they're not all tied to the same dicky power supply, or similar SNAFU)?

Maybe they are sensitive to the same external parameter, e.g. geomagnetic activity. In fact, the whole *premise* of GCP is that they are sensitive to a common external parameter; what makes it a psi experiment is the assumption that said parameter is related to human consciousness. I think it important to have an identification criterion that justifies that conclusion rather than the presumption of some as-yet-unknown physical confound.

Email: rdnelson@princeton.edu