# Chisquare Peaks

Paul Bethke has been doing developmental work in search of useful analytical approaches. That work has focused on peaks in the distribution of statistical measures, in particular the Chisquare used in our standard analysis. Paul's own description follows, with some examples. The first note was actually written after the main work, but it shows the basic point of the analysis by looking at a single egg's data from the perspective that is developed later in the composite examples.

```
-----------------

Date: Thu, 14 Dec 2000 16:28:34 -0600

> Could you send one or two of the prior figs, the bell-curve fit?
> That will help see the train of logic and questions you are asking.

I am attaching 2 views of the "composite" graph. You'll notice one has sharp
little spikes rising from the graph at the time we're interested in. This
graph is sampled with a 120 second window. The other is the same composite
graph with a 1200 second window. That widens out the peaks so there is a
greater overlap, and the "thin peaks" blend into the others.

To see a good description of what I termed the "bell-curve fit", you should
view the RPKP Probability and Statistics page.  I call it a "fit", as you
basically take the expected value (probability) and measure the deviation of
the actual data from the theoretical bell-curve. So maybe I should call it a
"deviation" from the probability distribution.

> A "raw" look at it would make the naive observer think that
> single large spike in the middle is the most important, but
> I think your "multiple spike" criterion is much more
> meaningful.  It feeds into the notion of doing grand-scale
> intercorrelations.

Agreed. Actually, when it all comes down to it, these graphs seem to be just
graphical representations of your "Extrema by Egg" table. Sure, the method will
integrate multiple "extrema" into the chisquare calculation, but I think
each "peak" is easily attributable to a single extremum.

Here are the bitsums responsible for the peaks in these graphs...

Egg	Time		Value
----	---------------	-----
102	6333 [03:45:23]	128
102	6732 [03:52:02]	128
1000	6669 [03:50:59]	131
1005	6542 [03:48:52]	 70
1021	6552 [03:49:02]	131
1022	6824 [03:53:34]	130
1022	6876 [03:54:36]	131
1027	6757 [03:52:27]	131

Another reason the peaks look so nice and uniform is that most of them are
131s, so they have the same amplitude.  That isn't to say this is good or bad,
but it is (mostly) what causes what we see.

-----------------

Date: Wed, 13 Dec 2000 17:34:58 -0600
From: Paul Bethke

I did some analysis on the GCP data from last night. I did my
"gauss-curve-fit" chisquare analysis on the data and saw abnormally narrow
peaks. The only explanation I could determine was that there must have been
peaks from different eggs close to each other in time, and their overlap
produced these strangely thin peaks.

So I did the same analysis by individual eggs - !!! I could hardly believe
what I found. There are at least 6 separate EGGs peaking at the same time! I
am including the graph for you to see. I have never seen them coincide like
this.

What you are seeing: The data is 3 hours of EGG data starting at 0200 UTC.
[This corresponds to the time surrounding the Supreme Court decision on
the 12th which signaled the final resolution of the election question.]

The level is the chisquare value. (BTW the P=0.05 value for 200 df is
about 234) The width of the peaks is basically the window size, or how many
seconds make up the window over which the chisquare is calculated. (So the
actual data causing the large differences is at the center of the "peak").

I have yet to identify which EGGs are involved to see if that tells us more.

I will tell you that I had been skeptical of this analysis method from
looking at some of the "case" data in the prediction chart - as they seemed
to only be producing spurious bits of unusual data. But this is pretty
interesting...

------------------

Date: Thu, 14 Dec 2000 10:55:14 -0600
From: Paul Bethke

> If the level is the chisq value, which I understand is
> expected to be 200 df, why is the main mass at about 50 or 60?

I am still playing with "what the values mean". Obviously, the lower the
chisq, the closer to "random" the data is. The procedure is to take a
"window" of data, in this case 1200 seconds, (+/- 600 seconds) and perform
the bell-curve fit on those data. Generally, the data fit the bell-curve
pretty well (the norm is around 50). When a very low or very high value
(bitsum) actually occurs, however, it tends to produce one of these "peaks",
which lasts for one window of data.

Typically, these peaks occur sporadically. I've tended to think that they
were misleading and had been trying to determine a way that they would be
less conspicuous. But last night's data changed my thinking in that, yes -
they will occur, but perhaps seeing multiple peaks coinciding among various
EGGs might be what to look for.

The method I am using is almost identical to that explained at the RPKP
Probability and Statistics page -- it shows the probabilities for the
distribution of counts of bitsums, indicating how the counts of GCP trials
should be distributed according to theoretical expectations.  This analysis
compares that with the actual distribution.

> The peaks appear to be ranging up to values well past the
> .05 value of 234, which sounds like the Chisq values
> indicate large deviations, but I suspect I need much better
> understanding of what you are doing.  I have tried to figure
> out what you mean by "abnormally thin peaks" but definitely need help.

First the "abnormally thin peaks". That reference is actually to another,
preliminary graph that I didn't include [Examples are presented above
in the introduction to this page.]. I reference the thin peaks to indicate
what led me to further investigate, leading in turn to the graph with the
expanded peak presentation.  Those thin peaks don't show up in this graph.

Typically the "peaks" are the width of the window (1200 seconds).  Seeing
thin ones was peculiar, as is seeing overly wide ones, which is how the
"bad egg" problem exhibited itself a week or two ago.  Thus, this
presentation or analysis is diagnostic of major, persistent deviations.

The large amplitude of the peaks is generally related to the impact of
bitsums which are very distant from the mean. For instance, a bitsum of 65
(or 135) has a probability of 2.2e-7. So when it *does* occur in a sampling
of 1200, it has a large impact.

> Do you have any notion of the "physical" meaning for the values?

I have not yet found a good meaning. Perhaps it's not that there *are*
peaks, but their coincidence that is significant. I think I should show you
an example of the normal graphs I've been looking at so you see why I was so
excited by this one. I will attach such a graph and move onto your next
message. ;)

(Actually, the sample attached does look a little interesting toward the end
of the sample, but I just chose it randomly.)
```
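
The numbers quoted in Paul's notes can be checked with a short Python sketch. It rests on two assumptions that the notes do not state outright. First, each bitsum is taken to be the sum of 200 fair bits, which is consistent with the quoted probability of 2.2e-7 for a bitsum of 65 or 135. Second, the "bell-curve fit" is taken to be a Pearson chisquare comparing observed and expected counts for every possible bitsum within the window; Paul's actual procedure may differ in its details. The function names are hypothetical, and the demonstration data are synthetic, not GCP data.

```python
import math
import random
from statistics import NormalDist

N_BITS = 200  # assumption: each bitsum is the sum of 200 fair bits


def bitsum_probability(k, n_bits=N_BITS):
    """Exact binomial probability of observing bitsum k."""
    return math.comb(n_bits, k) / 2 ** n_bits


def chisq_critical(df, p=0.05):
    """Approximate upper-tail chisquare critical value (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1.0 - p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3


def sliding_chisq(series, window):
    """Pearson chisquare of bitsum counts for every full window of `series`.

    Element j of the result covers series[j : j + window].
    """
    expected = [window * bitsum_probability(k) for k in range(N_BITS + 1)]
    counts = [0] * (N_BITS + 1)
    values = []
    for i, bitsum in enumerate(series):
        counts[bitsum] += 1                  # newest trial enters the window
        if i >= window:
            counts[series[i - window]] -= 1  # oldest trial leaves it
        if i >= window - 1:
            values.append(
                sum((o - e) ** 2 / e for o, e in zip(counts, expected))
            )
    return values


if __name__ == "__main__":
    print(f"P(bitsum = 65)       = {bitsum_probability(65):.2e}")
    print(f"chisq(0.05, 200 df) ~= {chisq_critical(200):.1f}")

    rng = random.Random(0)  # synthetic stand-in for one egg's trials
    series = [bin(rng.getrandbits(N_BITS)).count("1") for _ in range(3600)]
    series[1800] = 131      # plant one extreme bitsum
    window = 1200
    values = sliding_chisq(series, window)
    peak = max(values)
    width = sum(v > peak / 2 for v in values)
    print(f"peak = {peak:.0f}, width above half-peak = {width} windows")
```

Under these assumptions a bitsum as rare as 131 has an expected count far below one in a 1200-second window, so a single occurrence adds on the order of a few hundred to the statistic, and it does so for every window that contains it. That is why each peak is about as wide as the window and why peaks caused by the same bitsum value have the same amplitude.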