next up previous
Next: Chickenpox Epidemic Up: Examples Previous: Global Climate Data

United States Historical Climate Data

We downloaded this data set from the Carbon Dioxide Information Analysis Center at Oak Ridge National Laboratory (http://cdiac.ESD.ORNL.GOV/ftp/ndp019). The data contains mean monthly temperature from 1890 to 1996 from 1221 stations in US (note that this data set has been updated on this site since we downloaded it five years ago), geological locations of these stations in terms of longitude (long) and latitude (lat). We use this data set to illustrate how to fit periodic spline, thin-plate spline, SS ANOVA and NNR models. We use data from 1961 to 1990 from 48 stations in Texas only. The data is saved as TXtemp.

We first fit a periodic spline, compute the estimates without the constant term (yearly mean) and its amplitude for each station.

> data(TXtemp)  
> TXtemp$cm<- (TXtemp$month-0.5)/12  
> amp <- NULL 
> for(i in 1:48){ 
  tmpfit<- ssr(mmtemp~1, rk=periodic(cm), spar="m",
  p.tmpfit <- predict(tmpfit, terms=c(0,1), pstd=F, 
  amp <- c(amp,max(abs(p.tmpfit)))

Figure [*] shows the estimated amplitudes on the logarithm scale plotted against longitude and latitude. It is clear that the middle and northern parts of Texas tend to have larger amplitudes (hotter Summer and colder Winter).

Figure: Plot of log(amplitude) vs longitude (left) and latitude (right). Circles are observations and solid lines are cubic spline fits.

We now fit a thin-plate spline with $d=2$ and $m=2$ to model the effect of longitude and latitude on the log(amplitude):

> loc <- TXtemp[TXtemp[,4]==1961&TXtemp[,6]==1,2:3] 
> data <- data.frame(amp=log(amp),lat=loc[,1],long=loc[,2]) 
> tx.fit1 <- ssr(amp~long+lat,rk=tp.pseudo(list(long,lat)),data=data, spar="m") 
> i <- interp(data$long,data$lat,data$amp) 
> grid1 <- list(long=i$x,lat=i$y) 
> p.tx.fit1 <- predict(tx.fit1,expand.grid(grid1),pstd=F)
The contour and 3-d plots of the fit are shown in Figure [*].

Figure: Left: contour plot of raw data (dotted lines) and TPS fit (solid lines). Right: 3-d plot of the TPS fit.

We can use SS ANOVA or NNR models to investigate seasonal trend (temporal) and location effect (spatial) together. For this purpose, we first compute average mean monthly temperature from 1996-1990 for each station.

> y <- gapply(TXtemp, which=5, FUN=function(x) mean(x[x!=-99.99]), 
> tx.dat <- data.frame(y=as.vector(t(matrix(y,48,12,byrow=F))))
> tx.dat$month<-rep((1:12-0.5)/12, 48) 
> tx.dat$lat<-rep(TXtemp$lat[seq(1, nrow(TXtemp),by=360)] ,rep(12,48)) 
> tx.dat$long<-rep(TXtemp$long[seq(1, nrow(TXtemp),by=360)] ,rep(12,48)) 
> tx.dat$stacod<-rep(TXtemp$stacod[seq(1, nrow(TXtemp),by=360)] ,rep(12,48))
Denote $t_1=\mbox{\tt month}$, $x_1=\mbox{\tt longitude}$, $x_2=\mbox{\tt latitude}$, and $t_2=(x_1,x_2)$. We model month ($t_1$) effect using a periodic spline, and spatial ($t_2$) effect using a TPS. Then we have the following SS ANOVA model:
$\displaystyle y(t_1, t_2)$ $\textstyle =$ $\displaystyle \mu + \beta x_1 + \gamma x_2 + s_1(t_1) + s_2(t_2) +$  
    $\displaystyle sl_{12}^1(t_1,x_1) + sl_{12}^2(t_1,x_2) + ss_{12}(t_1,t_2) +
\epsilon (t_1,t_2),$ (70)

where components on the right hand side are constant, linear main effect of $x_1$, linear main effect of $x_2$, smooth main effect of $t_1$, smooth main effect of $t_2$, smooth-linear interaction between $t_1$ and $x_1$, smooth-linear interaction between $t_1$ and $x_2$, and smooth-smooth interaction between $t_1$ and $t_2$.

> tx.fit3 <- ssr(y~long+lat, data=tx.dat, spar="m",
    rk=list(periodic(month), tp(list(long,lat)), 
> tx.fit3
GML estimate(s) of smoothing parameter(s) : 8.184394e-06 2.426733e-02 
1.076039e-01 5.558457e-02 2.683768e-04 
Equivalent Degrees of Freedom (DF):  345.5728 
Estimate of sigma:  0.117142

If mean monthly temperature profiles from all stations have the same shape except a vertical shift and scale transformation, we may consider the following NNR model

$\displaystyle y(t_1,t_2) = g_1(t_2)+ \exp (g_2(t_2)) \times g_3(t_1) +
\epsilon (t_1,t_2),$     (71)

where $y(t_1,t_2)$ is the mean temperature in month $t_1$ of the station with longitude and latitude $t_2$. $g_1$, $g_2$ and $g_3$ are three unknown functions. $g_3$ represents seasonal trend. $g_1$ captures average climate differences between stations, and $g_2$ captures differences in the seasonal trend between stations. We will refer $\exp (g_2(t_2))$ as the amplitude. A bigger amplitude corresponds to a bigger seasonal variation. We model $g_1$ and $g_2$ using TPS's. Since $g_3$ is periodic and is close to a sinusoidal function, we use the $L$-spline with $L=D^2+(2\pi)^2$. To make model ([*]) identifiable, we use the following side conditions: (a) $\int_0^1 g_2(t) dt=0$, and (b) $\int_0^1 g_3(t) dt=0$. Therefore, the model spaces for $g_1$, $g_2$ and $g_3$ are TPS, $TPS\ominus \{ 1 \}$ and $W_2(per) \ominus \{ 1 \}$ respectively.

> S3 <- periodic(tx.dat$month) 
> f3.tmp <- ssr(y~1,rk=S3,data=tx.dat,spar="m") 
> f3.ini <- as.vector(S3%*%f3.tmp$rkpk.obj$c) 
> tx.fit4 <- nnr(y~f1(long,lat)+exp(f2(long,lat))*f3(month), 
> tx.fit4
Nonlinear Nonparametric Regression Model Fit by Gauss-Newton Method
 Model: y ~ f1(long, lat) + exp(f2(long, lat)) * f3(month) 
 Data: tx.dat 

 GCV estimate(s) of smoothing parameter(s): 2.576027e-06 1.736111e-03 4.471411e-07 
 Equivalent Degrees of Freedom (DF):  95.47625 

 Residual standard error: 0.9433764 
 Number of Observations: 576 
 Converged after 3 iterations

Contour plots of the estimates of $g_1$ and $g_2$ are shown in Figure [*]. There is clear spatial effects to the mean temperature and amplitude. Southern part of the state is warmer and has less seasonal variation.

Figure: Contour plots of the estimates of $g_1$ (left) and $g_2$ (right).

We now discuss two approaches two check the NNR model. We have fitted each station separately and saved the estimates without yearly mean in p.tx.fit1. One way to check if mean monthly temperature profiles from all stations have the same shape after removing yearly mean is to plot the estimates from one station against another to see if the points fall on a straight line. We rescale estimates from all stations such that all of them have amplitudes equal one. We then calculate Euclidean distances between stations. We select paired stations which have distances correspond to the 1%, 5%, 10%, 25%, 50%, 75%, 90%, 95% and 99% quantile of all possible paired stations. Figure [*] shows plots of these estimates for these selected stations. Some deviation from the straight line can be seen when distance becomes large.

> d <- dist(diag(1/amp)%*%t(p.tx.fit1))
> st <- NULL
> for (i in 1:47) { for (j in (i+1):48) st <- rbind(st,c(i,j))}
> ord <- order(d)
> tmp <- cbind(d[ord],st[ord,])

Figure: Plot of the centered and scaled estimates for the selected stations.

It is easy to see that model ([*]) reduces to the model ([*]) iff

$\displaystyle f_{12} = [\exp(g_2)/\int \exp(g_2) dx_2 - 1] f_2.$     (72)

Thus another way to check the NNR model ([*]), or equivalently condition ([*]), is to compute estimates of $f_2$ and $f_{12}$ for all stations, and then plot $f_2$ against $f_{12}$ to see if the points fall on a straight line.

> grid2 <- data.frame(month=rep(seq(0,1,len=40), 48), 
     lat=rep(TXtemp$lat[seq(1,nrow(TXtemp),by=360)] ,rep(40,48)), 
     long=rep(TXtemp$long[seq(1, nrow(TXtemp),by=360)] ,rep(40,48))) 
> p.tx.fit3.f2 <- predict(tx.fit3,grid2[1:40,],
> p.tx.fit3.f12 <- predict(tx.fit3,grid2, 
Such a plot is shown in Figure [*] which indicates certain deviation from the straight line.

Figure: Plot of $f_1$ vs $f_{12}$ for all stations.

For the purpose of illustration, we now fit the following additive model

y(t_1, t_2)= \mu + \beta x_1 + \gamma x_2 + s_1(t_1) + s_2(t_2)
+ \epsilon(t_1,t_2).

> tx.fit5 <- ssr(y~long+lat, rk=list(periodic(month), tp(list(long,lat))),  
                 data=tx.dat, spar="m") 
GML estimate(s) of smoothing parameter(s) :  0.05823508 16.36685971
Equivalent Degrees of Freedom (DF):  41.59342
Estimate of sigma:  1.773092

Number of Observations: 576

It is obvious that above additive model is a special case of the NNR model ([*]) with $g_2=0$. We can compare three models using a model selection procedure such as GCV, AIC or BIC.

> n <- 576
> rss1 <- sum(tx.fit3$resi**2)
> rss2 <- sum(tx.fit4$resi**2)
> rss3 <- sum(tx.fit5$resi**2)
> gcv1 <- rss1/n/(1-tx.fit3$df/n)**2
> gcv2 <- rss2/n/(1-tx.fit4$df$f/n)**2
> gcv3 <- rss3/n/(1-tx.fit5$df/n)**2
> aic1 <- n*log(rss1/n)+2*tx.fit3$df
> aic2 <- n*log(rss2/n)+2*tx.fit4$df$f
> aic3 <- n*log(rss3/n)+2*tx.fit5$df
> bic1 <- n*log(rss1/n)+log(n)*tx.fit3$df
> bic2 <- n*log(rss2/n)+log(n)*tx.fit4$df$f
> bic3 <- n*log(rss3/n)+log(n)*tx.fit5$df
> print(c(rss1,rss2,rss3))
[1]    3.161981  428.052072 1680.096971
> print(c(gcv1,gcv2,gcv3))
[1] 0.0343016 1.0672903 3.3885449
> print(c(aic1,aic2,aic3))
[1] -2306.88183    19.73069   699.79434
> print(c(bic1,bic2,bic3))
[1] -801.5293  435.1371  880.9798
All model selection procedures choose the SS ANOVA model ([*]).

next up previous
Next: Chickenpox Epidemic Up: Examples Previous: Global Climate Data
Yuedong Wang 2004-05-19