|
Last Update 20 Jan 2005. This page is a working document and is not to be quoted or referenced.
Did the long-timescale character of the GCP data change after 9/11?
The plot shows the cumulative deviation of the daily values of the network variance [aka: netvar] from Oct. 1, 1998 to Sept. 8, 2004. The netvar for each day is expressed as a normal z-score. The parabola is the 5% probability envelope for the cumdev. There are 2165 plotted daily values and the vertical bar marks Sept. 11, 2001. [There is no GCP data for the days Aug. 5-8, 2002. NB: see the bottom of this document for comments on jargon and statistical procedures.]
An update with data through Nov. 11, 2005 shows the trend leveling out, but it still has a slight negative slope. As before, the Stouffer Z at 1 sec resolution is squared and summed for each day. The resulting chisqr stat with df=86400 is converted to a zscore, giving a daily Z. The cumdev of these is plotted for normalized and vetted GCP data. The subsequent analyses examining the trends statistically, based on splitting, blocking, and modeling the data, use the original sequence to Sept. 8 2004.
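The daily-Z and cumdev steps described above can be sketched as follows. This is a minimal illustration, not the GCP code: the function names are hypothetical, the chi-squared to z conversion uses the Wilson-Hilferty cube-root approximation as a stand-in for the exact transformation, and the envelope is assumed to be one-sided.

```python
import math
from statistics import NormalDist


def daily_z(stouffer_z, df=None):
    """Convert one day's per-second Stouffer Z values to a daily z-score.

    The sum of squares is treated as chi-squared with df degrees of freedom
    (86400 for a complete day). ASSUMPTION: the Wilson-Hilferty cube-root
    approximation is used here instead of an exact transformation.
    """
    if df is None:
        df = len(stouffer_z)
    chisq = sum(z * z for z in stouffer_z)
    spread = 2.0 / (9.0 * df)
    return ((chisq / df) ** (1.0 / 3.0) - (1.0 - spread)) / math.sqrt(spread)


def cumdev(values):
    """Running sum (cumulative deviation) of a sequence of z-scores."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out


def envelope(n_days, p=0.05):
    """Envelope for the sum of n_days independent standard normals.

    ASSUMPTION: one-sided; the curve grows like sqrt(n_days), which is
    the parabola shape seen in a cumdev plot.
    """
    return NormalDist().inv_cdf(1.0 - p) * math.sqrt(n_days)
```

With one `daily_z` value per day, `cumdev` gives the plotted curve and `envelope(n)` gives the height of the probability envelope at day n.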
The random deviations seem qualitatively greater after 9/11. Does this suggest a non-random trend in the post-9/11 data? These tests are too weak to support the suggestion of a non-random trend developing in the data after 9/11. However, visually, there does appear to be pronounced long-timescale structure after 9/11, with both positive and negative slopes. The mean and variance tests are not sensitive to this apparent structure.
Further comment? Some Email comments
Correlation with External Measures
While it is necessary to proceed with work (described below) to confirm the apparent structure, it is also useful to look for indirect validation of the structure by assessing correlations with independent external measures. For an instructive example, we compare GCP network variance with presidential poll data that represent a focus on matters of importance to great numbers of people around the world.
Splitting the Data into Alternate Seconds
Another approach is the following: calculate two netvar datasets using alternate seconds for each set. Because there is no autocorrelation in the random data sequences at short lags (in particular at lag 1, i.e. 1 sec), the interdigitated datasets are rigorously independent. If there is strong, non-random, long-timescale structure in the cumdev after 9/11, it will be present in both datasets. In that case correlations will exist between the alternate datasets, which is strong evidence for an anomalous effect.
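The alternate-second split can be sketched in a few lines. The function names are hypothetical, and the input is assumed to be one day's list of per-second Stouffer Z values.

```python
def split_alternate_seconds(stouffer_z):
    """Return two interleaved subsets of one day's per-second values.

    The first holds the values at even offsets (seconds 0, 2, 4, ...),
    the second the values at odd offsets (seconds 1, 3, 5, ...).
    """
    return stouffer_z[0::2], stouffer_z[1::2]


def chisq_and_df(values):
    """Sum of squares and its degrees of freedom for one subset."""
    return sum(z * z for z in values), len(values)
```

For a complete day of 86400 seconds, each subset has 43200 values, so each half-day chi-squared statistic has df = 43200 before it is converted to a daily z-score.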
Call the interdigitated datasets A and B. A plot of the two sets is shown below (vertical bar marks 9/11):
Visually, there is structural correspondence between the red and blue curves after 9/11, and both show structure similar to the full netvar curve for this period (in gray, rescaled and offset; vertical gray bar is 9/11). This is the main qualitative result of the A-B data splitting. Note that the strong peak of the Iraq campaign (near day 1700) and the preceding steep descent appear in all three curves. In order to test the correspondence quantitatively, we want a test sensitive to structure on this scale. Standard correlation coefficients are not sensitive to detailed structure since they only test linear, or at best monotonic, correlations, as the table below shows. In fact, there does appear to be a rough linear correspondence between the curves over the whole dataset, and this is evidenced by standard correlation measures. [Again, using daily netvars expressed as normal z-scores.] The z-scores for the Pearson and Kendall-rank correlations between the full, pre- and post-9/11 segments of A and B are:
Both of the tests find weak positive correlations which are larger for the pre- 9/11 data. [the Kendall z-score of 1.88 gives a pval of 0.03]. The Kendall rank correlation is greater than the Pearson because it measures monotonic correlation that can be non-linear whereas the Pearson value looks for linear correlation only. All these results are consistent with what you would surmise from inspecting the plot above. The correlations are weak and the difference between the pre- and post- datasets is not significant. However, these tests say nothing about the apparent similarities in the detailed structure of the A and B cumdevs. A similar analysis on the event-based series is available, and also shows correlation.
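For readers who want to reproduce this kind of comparison, here is a minimal sketch of the two correlation measures and of z-scores for them. How the z-scores in the table were actually obtained is not stated, so the Fisher transform for Pearson and the large-sample normal approximation for Kendall are assumptions, as are the function names.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pearson_z(r, n):
    """ASSUMPTION: z-score via the Fisher transform (requires |r| < 1)."""
    return math.atanh(r) * math.sqrt(n - 3)


def kendall_tau(x, y):
    """Kendall rank correlation (tau-a); O(n^2), ties count as zero."""
    n = len(x)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            if prod > 0:
                s += 1
            elif prod < 0:
                s -= 1
    return s / (n * (n - 1) / 2)


def kendall_z(tau, n):
    """ASSUMPTION: normal approximation, Var(tau) = 2(2n+5) / (9n(n-1))."""
    return 3.0 * tau * math.sqrt(n * (n - 1)) / math.sqrt(2.0 * (2 * n + 5))


def one_sided_p(z):
    """Upper-tail probability of a standard normal z-score."""
    return 1.0 - NormalDist().cdf(z)
```

As a check on the quoted figure, `one_sided_p(1.88)` is about 0.03, which suggests the pval given for the Kendall z-score is one-sided.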
Correlation of Fitting Parameters
One way to test for correlation in the structure is to fit the curves and test correlations between the two sets of fitting parameters. Below is a plot of fits to A and B using 51 cosine functions. The fit is done on the cumdev of the data because we want the structure to be prominent. The fitting parameters are the cosine amplitudes using wave vectors nπ / L, where L is the number of data points. The fits are done for n = [0,50]. Using n up to 50 allows fitting of structure on timescales of 1 month and longer. Thus we ignore short timescale structure in this approach. [Note: Roughly speaking, a cosine function fits structure as sharp as its half-wavelength. The shortest half-wavelength, in days, for the cosine fits is L/2N, where L here counts the doubled set of points and N is the maximum index n of the cosine fitting functions. Since the pre- and post- datasets are each about 1100 days long, the finest structure we fit is 2200/100 = 22 days, when N=50. There are 1100 days in the datasets, but we double the number of points for the fits in order to make centro-symmetric datasets.]
The cosine expansion is most efficient for centro-symmetric structures, because these have even parity. The even parity cosine functions are then a natural orthogonal expansion set. To make the dataset centro-symmetric, it is concatenated with its reflection before fitting. The figure below shows the fits (in gray) for the full 2165 data points (days) for curves A and B. The centro-symmetric reflection doubles the number of points to 4330. The fit of the full datasets uses 100 cosine functions. [The center of the plot, which is the last day of data, is the reflection point. It is marked by a vertical bar. The other bar to the left marks 9/11.]
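A minimal sketch of the reflect-and-fit procedure follows. It assumes a half-sample offset in the cosine argument, so that the reflected series is exactly even about its midpoint and the cosines are exactly orthogonal on it; the fitting code actually used may differ in such details. The function names are hypothetical.

```python
import math


def cosine_amplitudes(curve, n_max):
    """Fit cos(n*pi*(i + 0.5)/L), n = 0..n_max, to a curve of L points.

    The curve is first concatenated with its reflection (2L points).
    On that mirrored series the cosines are mutually orthogonal, so the
    least-squares amplitudes reduce to simple projections.
    Requires n_max < L. ASSUMPTION: half-sample offset (i + 0.5).
    """
    length = len(curve)
    mirrored = list(curve) + list(reversed(curve))
    amps = []
    for n in range(n_max + 1):
        proj = sum(v * math.cos(n * math.pi * (i + 0.5) / length)
                   for i, v in enumerate(mirrored))
        norm = 2 * length if n == 0 else length
        amps.append(proj / norm)
    return amps


def cosine_fit(amps, length):
    """Evaluate the fitted curve on the original (unmirrored) points."""
    return [sum(a * math.cos(n * math.pi * (i + 0.5) / length)
                for n, a in enumerate(amps))
            for i in range(length)]
```

Calling `cosine_amplitudes(cumdev_of_A, 50)` on a roughly 1100-day cumdev would return 51 amplitudes, with index n describing structure whose half-wavelength is about L/n days.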
The fitting procedure gives a set of coefficients (cosine amplitudes) for each curve A and B. To look at the difference between pre- and post-9/11, split the sets at that date and calculate correlations for the periods separately. The number of points is halved for each period and we need only 50 amplitudes. Let the coefficients be A[n] and B[n] where n = [0,50] labels the cosine functions. Then the correlation is the sum of pair products of the coefficients:
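Taking that description literally, and assuming no further normalization is applied, the correlation and the running sum that a cumulative plot over the wave index would trace can be written as below. The symbols C and C(m) are labels introduced here for illustration.

```latex
C = \sum_{n=0}^{50} A[n]\, B[n],
\qquad
C(m) = \sum_{n=0}^{m} A[n]\, B[n], \quad 0 \le m \le 50 .
```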
(Note: the horizontal axis is the cosine wave index. Low order indices contribute to the long-timescale features and high order indices to short timescale structure. As mentioned above, indices around number 50 correspond to structure in the netvar cumdev with half-widths of roughly a month.)
Probability envelopes for the correlation are estimated by 30,000 iterations of the same correlations calculated on pseudo-random normal datasets. The A/B correlation pval for the 51-amplitude fit of the post-9/11 data is about 0.001 (z-score = 3.0). The cumulative also shows that many wave indices between n=0 and n=50 contribute to the correlation. This is consistent with the netvar cumdev, which shows structure on the scale of months to years. Thus, at the 3-sigma level, the post-9/11 data contain non-random structure on long timescales. This is not seen in the pre-9/11 data. A preliminary look at alt mins and alt hours shows similar results, but indicates there is much to learn.
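The Monte Carlo estimate of the correlation probability can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the null datasets are cumdevs of independent standard normal daily values, the p-value is one-sided, the cosine basis uses a half-sample offset, and all function names are hypothetical.

```python
import math
import random


def make_basis(length, n_max):
    """Precompute cos(n*pi*(i + 0.5)/length) for n = 0..n_max."""
    return [[math.cos(n * math.pi * (i + 0.5) / length) for i in range(length)]
            for n in range(n_max + 1)]


def amplitudes(curve, basis):
    """Cosine amplitudes by projection.

    With the half-sample offset, projecting the unmirrored curve gives the
    same amplitudes as fitting the curve concatenated with its reflection.
    """
    length = len(curve)
    amps = []
    for n, row in enumerate(basis):
        proj = sum(v * c for v, c in zip(curve, row))
        amps.append(proj / length if n == 0 else 2.0 * proj / length)
    return amps


def pair_product_sum(a, b):
    """Correlation statistic: sum of products of paired amplitudes."""
    return sum(x * y for x, y in zip(a, b))


def random_cumdev(rng, length):
    """Cumulative sum of independent standard normal values."""
    total, out = 0.0, []
    for _ in range(length):
        total += rng.gauss(0.0, 1.0)
        out.append(total)
    return out


def null_pvalue(observed, length, n_max, iterations, seed=0):
    """Fraction of null pairs whose statistic is >= observed (one-sided)."""
    rng = random.Random(seed)
    basis = make_basis(length, n_max)
    hits = 0
    for _ in range(iterations):
        a = amplitudes(random_cumdev(rng, length), basis)
        b = amplitudes(random_cumdev(rng, length), basis)
        if pair_product_sum(a, b) >= observed:
            hits += 1
    return hits / iterations
```

A call such as `null_pvalue(observed, 1100, 50, 30000)` mirrors the sizes quoted above, but in pure Python it is slow; a vectorized numerical library would be the practical choice at that scale.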
Correlations of 90-day Blocks
The following figure shows the alt-sec correlations by 90-day blocks. We see that no blocks through the 2nd quarter of 2001 have correlations above the 0.05 level. But the 3rd quarter of 2001 is at pval=0.01, and after that 7 of 12 are over the 0.05 level and 5 are at the 0.01 level or above. The red trace shows the square of the average number of online regs, N, for the quarter. The plot shows that the alt-sec correlations may possibly increase with N. An important and basic question which needs to be answered is whether anomalous effects in general depend on the number of regs in the network and, if so, how. The N-dependence of effect size is a basic aspect of any model, and any measure of this dependence is thus a means of distinguishing between model types. The evolution of correlation over 90-day blocks is an indication that we may have a handle to address this key question.
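The blocking step can be sketched as below. The statistic computed within each block is not specified above, so `pair_product_mean` is a hypothetical stand-in, and dropping a trailing partial block is also an assumption. `binomial_tail` is a rough aid for judging how surprising a count of blocks beyond a threshold would be, and it assumes the blocks are independent.

```python
import math


def split_blocks(values, block_len=90):
    """Consecutive full blocks; ASSUMPTION: a trailing partial block is dropped."""
    n_blocks = len(values) // block_len
    return [values[k * block_len:(k + 1) * block_len] for k in range(n_blocks)]


def per_block(a, b, stat, block_len=90):
    """Apply a two-sample statistic to matching blocks of series a and b."""
    return [stat(x, y)
            for x, y in zip(split_blocks(a, block_len), split_blocks(b, block_len))]


def pair_product_mean(x, y):
    """Hypothetical stand-in statistic: mean product of paired daily z-scores."""
    return sum(p * q for p, q in zip(x, y)) / len(x)


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p); assumes independent blocks."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(k, n + 1))
```

With 2165 daily values, `split_blocks` yields 24 full 90-day blocks.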
Details of Cosine Fit: Timescale of Contributions
(You may wish to skip this section on a first reading.) To see where the correlations "reside", we can study the correlation by looking separately at different timescales. Steps in the correlation cumulative show that there is correlation associated with the wave indices where the correlation increases sharply. This information and the orthogonality of the cosine expansion functions let us decompose the fits to see which features are contributing to the overall correlation. The plots are a bit busy, but they confirm what your eye sees in the A/B netvar cumdevs: there is correlated structure in the data on timescales from years down to weeks. The figure shows the fits for A and B datasets using cosine amplitudes up through n = 5, 25, 39 and 73. These give fits with detail down to (roughly) 7 months, 6 weeks, 4 weeks and 2 weeks. The upper right plot reproduces the correlation cumdev, showing in color the selected amplitudes for decomposition. The other right-hand panels show fits with the preceding lower-order-n fits subtracted out. This isolates the structure responsible for correlation to the timescale windows evidenced by the correlation cumulative. The timescales detailed are (roughly) 6 weeks for the "26 - 6 amplitude" curves, and 4 and 2 weeks for the "40 - 26" and the "74 - 40" plots, respectively. The last panel, at the 2-week level, shows that there is no obvious correlation on this short timescale. However, there are clear correlations of structure at the 4- and 6-week timescales, as well as in the first plot, which covers timescales of 7 months to 2+ years.
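The timescales quoted for each cutoff index follow from the half-wavelength rule, as the small sketch below shows. It assumes a dataset length of about 1100 days and a half-wavelength of L/n days for index n, which reproduces the figures in the text. `band_fit` illustrates the subtraction of lower-order fits and assumes a cosine basis with a half-sample offset. All names are hypothetical.

```python
import math


def half_wavelength_days(n, length_days):
    """Approximate finest timescale resolved by cosine index n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return length_days / n


def band_fit(amps, lo, hi, length):
    """Partial fit using only cosine indices lo <= n < hi.

    Because the cosines are orthogonal, a fit through index hi - 1 minus
    a fit through index lo - 1 equals this band contribution.
    """
    return [sum(amps[n] * math.cos(n * math.pi * (i + 0.5) / length)
                for n in range(lo, hi))
            for i in range(length)]


LENGTH_DAYS = 1100  # approximate length of the pre- or post-9/11 dataset
for n in (5, 25, 39, 73):
    # roughly 220, 44, 28 and 15 days: about 7 months, 6, 4 and 2 weeks
    print(n, round(half_wavelength_days(n, LENGTH_DAYS), 1))
```

In this notation the "26 - 6 amplitude" curve corresponds to `band_fit(amps, 6, 26, length)`.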
Device Variance: Alternate Seconds Correlation
Now we repeat the alternate second A/B analysis for the device variance. Visually, as seen in the plot below, the A/B devvar datasets have very different cumdevs and no correlation is obvious. Here, the red curve looks remarkably similar to the netvar cumdevs in the region mid-2002 to late 2003. The pval of the big drop-off is about 0.001 [0.01 when Bonferroni corrected for the whole curve and the two sets]. But since there is no apparent correlation with the alternate dataset, we can't interpret this as a non-random trend.
Correlations for both the pre- and post- periods are negligible for the device variance and lie within the 0.05 probability envelopes. Cumulative correlations for the devvar pre- and post-9/11 periods are shown below.
Comparison of Network Variance and Device Variance Measures
The following plot compares the full datasets A and B for the netvar and the devvar. The cumulative A/B correlation for the full dataset from Oct. 1998 to Sept. 2004 is significant for the netvar (Z about 3.0) and insignificant for the devvar (Z less than 1.5) when we include correlations from structure down to the 1-month scale. Further calculation is needed in order to extend and increase the precision of the correlation probability envelopes of these full 2165-point datasets.
Application to External Correlates: Presidential Polls and GCP Network Variance
Given the significant structure in the network behavior, it is appropriate to ask what factors might drive it or be correlated with the trends. For example, we can extend the event-based analysis that has provided evidence of structure in the data to a more general attempt to determine whether changes in the GCP data are associated with world events. Among the possibilities are sociological measures. Our first general correlation attempt is with polls that ask the question: "Do you approve or disapprove of the way the president is handling his job?"
Notes on Statistics and Jargon
1. Normalized trials. All the results are generated from normalized trial sums. The raw trials are 200-bit sums that are collected once a second from each reg. Theoretically, they distribute as binomial [200, 0.5], but each reg has its own characteristic deviation from this ideal behavior. We normalize the regs individually so that each has a mean of 0 and a variance of 1. The normalization is done over all data produced by the reg up to Sept. 8, 2004. The resulting scores are nearly normally distributed and we refer to them as "trial z-scores" [small "z"].
2. Stouffer Z and Network Variance. A "Stouffer Z" is the sum of trial z-scores divided by Sqrt[N], where N = number of regs for the second. We use a capital "Z" for this Stouffer Z-score. Many analyses look at the sum of squares of the Z's over some designated period, which is a chi-squared statistic that can be compared against null hypotheses, etc. A closely related statistic is the sample variance of the Z's for a period. We call this the network variance (because Z measures the mean output of the whole network). In fact, the sum of Z-sqr's is just the network variance with respect to the theoretical mean. A short name is netvar.
3. Presentation and Normal statistics. It's convenient to represent the data, as much as possible, with standard normal statistics. We often will convert the chi-squared variances to normal statistics before plotting cumulative deviation curves (cumdevs) or doing analyses. We use the exact transformation to do the conversion.
4. Resolution. The netvar is calculated from Z's taken at each second. We can also calculate a Z over a longer block and use that to calculate a different netvar. Note that changing the blocking gives completely different statistics. We call the Z blocking the resolution and refer to "network variance at M-second resolution" when the Z blocking is over M seconds. When we just say "network variance" it's understood to be at a 1-second resolution.
5. Device variance. The sample variance of the reg z-scores for a given time period is called the device variance (or devvar, for short).
6. Timescales. When looking at long-time behavior in the data it's convenient to sum over deviations on short timescales. Much of the analysis above uses the daily network variance (instead of, for example, the netvar at one-minute intervals, which would make for huge data lists while adding only short timescale information). Condensing the data in this way greatly facilitates the calculations. Be aware that the statistic is still the network variance at 1-second resolution.
Further comments on implications? This has earlier versions, and a quantitative look at implications.
Converted by Mathematica January 13, 2005 |
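As a rough illustration of notes 1, 2 and 5, the basic quantities can be computed as in the sketch below. The function names are hypothetical, and each reg's empirical mean and standard deviation are assumed to be supplied by the caller.

```python
import math


def normalize(trials, mean, sd):
    """Trial z-scores for one reg, using that reg's own mean and sd."""
    return [(t - mean) / sd for t in trials]


def stouffer_z(trial_z):
    """Stouffer Z for one second: sum of the N reg z-scores over sqrt(N)."""
    return sum(trial_z) / math.sqrt(len(trial_z))


def netvar_chisq(stouffer_zs):
    """Sum of squared Stouffer Z's over a period (chi-squared, df = count)."""
    return sum(z * z for z in stouffer_zs)


def sample_variance(values):
    """Unbiased sample variance, as used for the device variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)
```

For an ideal reg, the theoretical values for a binomial [200, 0.5] trial are a mean of 100 and a standard deviation of sqrt(50), about 7.07.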