Next: Global Climate Data Up: Examples Previous: Examples

## Arosa Ozone Data

This is a data set in Andrews and Herzberg (1985). Monthly mean ozone thickness (Dobson units) in Arosa, Switzerland from 1926 to 1971 was recorded. The data is available as Arosa. We are interested in investigating how ozone thickness changes over months and years.

We use this data to illustrate how to fit a periodic spline, a partial spline, an -spline, an -spline with unequal variances, an -spline spline with correlated random errors, a partial spline with both variables month and year, an SS ANOVA model, partial splines for the whole time series, a semi-parametric linear mixed effects model, and some varying coefficients models.

Let us ignore the year effect first and concentrate on the month effect on ozone thickness. Let . We first fit a parametric sine and cosine model

 (58)

and a periodic spline
 (59)

where .
> data(Arosa)
> Arosa$csmonth <- (Arosa$month-0.5)/12
> attach(Arosa)
> ozone.fit1 <- lm(thick~sin(2*pi*csmonth)+cos(2*pi*csmonth), data=Arosa)
> summary(ozone.fit1)
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)           337.0986     0.7605 443.263  < 2e-16 ***
sin(2 * pi * csmonth) -47.3881     1.0719 -44.209  < 2e-16 ***
cos(2 * pi * csmonth)   7.7966     1.0790   7.226 1.81e-12 ***
---
Signif. codes:  0 ***' 0.001 **' 0.01 *' 0.05 .' 0.1  ' 1

Residual standard error: 17.31 on 515 degrees of freedom
Multiple R-Squared: 0.7958,	Adjusted R-squared: 0.795
F-statistic:  1003 on 2 and 515 DF,  p-value: < 2.2e-16

> ozone.fit2 <- ssr(thick~1, rk=periodic(csmonth), spar="m", data=Arosa)
> summary(ozone.fit2)
...
Coefficients (d):
(Intercept)
337.091

GML estimate(s) of smoothing parameter(s) : 1.717233e-06
Equivalent Degrees of Freedom (DF):  9.0131
Estimate of sigma:  16.78767

> anova(ozone.fit2,simu.size=500)
Testing H_0: f in the NULL space

test.value simu.size simu.p-value approximate.p-value
LMP  0.2663855       500            0
GML  0.2025431       500            0                   0


Figure shows that parts of the parametric sine and cosine fit are outside the confidence intervals of the periodic spline fit. It suggests that the simple parametric model may not be sufficient. We can test the departure from the sine-cosine model using a partial spline model by adding the sine and cosine functions to the null space of a periodic spline

 (60)

where .

> ozone.fit3 <- ssr(thick~sin(2*pi*csmonth)+cos(2*pi*csmonth),
rk=periodic(csmonth), spar="m", data=Arosa)
> summary(ozone.fit3)
...
Coefficients (d):
(Intercept) sin(2 * pi * csmonth) cos(2 * pi * csmonth)
337.0933             -47.42298              7.696095

GML estimate(s) of smoothing parameter(s) : 4.49682e-06
Equivalent Degrees of Freedom (DF):  7.46103
Estimate of sigma:  16.78036
> anova(ozone.fit3,simu.size=500)
Testing H_0: f in the NULL space

test.value simu.size simu.p-value approximate.p-value
LMP 0.001262064       500            0
GML   0.9553638       500            0        3.490208e-12


The departure from the parametric model is significant. Note that with penalty , sine and cosine functions we added to the null space are also in the space . Thus the parametric part of the partial spline model () is not orthogonal to . Another approach to test the parametric model is to use a periodic -spline model with :

 (61)

where . Now the sine and cosine functions are orthogonal to . Thus this approach will be more efficient. See Wang and Brown (1996) for detail.

> ozone.fit4 <- update(ozone.fit3, rk=lspline(csmonth,type="sine1"))
> summary(ozone.fit4,simu.size=500)
...
Coefficients (d):
(Intercept) sin(2 * pi * csmonth) cos(2 * pi * csmonth)
337.0937             -47.42247              7.695888

GML estimate(s) of smoothing parameter(s) : 5.404848e-09
Equivalent Degrees of Freedom (DF):  6.421139
Estimate of sigma:  16.77356
> anova(ozone.fit4,simu.size=500)
Testing H_0: f in the NULL space

test.value simu.size simu.p-value approximate.p-value
LMP 2.539163e-06       500            0
GML    0.9533696       500            0        1.164513e-12


Figure shows plots of the overall fits and their projections for three models: ozone.fit2, ozone.fit3 and ozone.fit4. As we can see, all three models have similar overall fits. But their projections are quite different. As expected, the confidence intervals for the projections of the -spline model () are narrower than those of the partial spline model ().

Figure suggests that the variances may not be a constant. We calculate residual variances for each month and plot them on the log scale (left panel of Figure ). It is obvious that they are not equal. It seems that simple sine and cosine functions can be used to model the variance function.

> v <- sapply(split(ozone.fit4$resi,month),var) > a <- unique(csmonth) > b <- lm(log(v)~sin(2*pi*a)+cos(2*pi*a)) > summary(b) ... Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.43715 0.05763 94.341 8.57e-15 *** sin(2 * pi * a) -0.71786 0.08151 -8.807 1.02e-05 *** cos(2 * pi * a) -0.49854 0.08151 -6.117 0.000176 *** --- Signif. codes: 0 ***' 0.001 **' 0.01 *' 0.05 .' 0.1  ' 1 Residual standard error: 0.1996 on 9 degrees of freedom Multiple R-Squared: 0.9274, Adjusted R-squared: 0.9113 F-statistic: 57.49 on 2 and 9 DF, p-value: 7.48e-06  Therefore we assume the following variance function for model () > ozone.fit5 <- update(ozone.fit4, weights=varComb(varExp(form= ~sin(2*pi*csmonth)), varExp(form=~cos(2*pi*csmonth)))) > summary(ozone.fit5) ... Coefficients (d): (Intercept) sin(2 * pi * csmonth) cos(2 * pi * csmonth) 337.0753 -47.42951 7.773344 GML estimate(s) of smoothing parameter(s) : 3.676933e-09 Equivalent Degrees of Freedom (DF): 6.84808 Estimate of sigma: 15.22469 Combination of: Variance function structure of class varExp representing expon -0.3555964 Variance function structure of class varExp representing expon -0.2496355  The estimated variance parameters, and , are very close to (up to a scale of 2 by definition) those based on residual variances, and . The fitted variance function is plotted on the left panel of Figure . As can be seen, it is almost identical to the fit based on residual variances. The right panel of Figure plots the -spline fit to the mean function with 95% Bayesian confidence intervals. Note that these confidence intervals are conditional on the estimated variance parameters. Thus they may have less coverage than the nominal value since the degrees of freedom for estimating the variance parameters are not counted. Nevertheless, we can see that unequal variances are reflected in the widths of these confidence intervals. Observations close in time may be correlated. In the following we include a first-order autoregressive structure for random errors. Since some observations are missing, we use the continuous AR(1) correlation structure. That is, we assume the covariance between the random error of month in year and the random error of month in year is . > Arosa$z <- month+12*(year-1)
> ozone.fit6 <- update(ozone.fit5,corr=corCAR1(form=~z))
> summary(ozone.fit6)
...
Coefficients (d):
(Intercept) sin(2 * pi * csmonth) cos(2 * pi * csmonth)
336.8969             -47.34584              7.788939

GML estimate(s) of smoothing parameter(s) : 3.573493e-09
Equivalent Degrees of Freedom (DF):  7.326922
Estimate of sigma:  15.31061

Correlation structure of class corCAR1 representing
Phi
0.3410923
Combination of:
Variance function structure of class varExp representing
expon
-0.3602399
Variance function structure of class varExp representing
expon
-0.3009847


Now let us consider the effects of both month and year variables. First, suppose that the simple sine and cosine functions are appropriate for modeling the month effect, and we do not want to assume a parametric model for the year effect. Then we may consider the following partial spline model

 (62)

where and .

> csyear <- (year-1)/45
> ozone.fit7 <- ssr(thick~sin(2*pi*csmonth)+cos(2*pi*csmonth)+I(csyear-0.5),
rk=cubic(csyear), spar="m", data=Arosa)
> summary(ozone.fit7)
...
Coefficients (d):
(Intercept) sin(2 * pi * csmonth) cos(2 * pi * csmonth) I(csyear - 0.5)
336.8798             -47.37047              7.776179        7.199672

GML estimate(s) of smoothing parameter(s) : 4.741079e-07
Equivalent Degrees of Freedom (DF):  16.49958
Estimate of sigma:  16.5235
> anova(ozone.fit7,simu.size=500)
Testing H_0: f in the NULL space

test.value simu.size simu.p-value approximate.p-value
LMP 0.01157664       500        0.026
GML  0.9891654       500            0        0.0004092629


If we want to model both month and year non-parametrically, we can fit the following SS ANOVA model with a periodic spline for month and a cubic spline for year:

 (63)

> ozone.fit8 <- ssr(thick~I(csyear-0.5),  spar="m", data=Arosa,
rk=list(periodic(csmonth), cubic(csyear),
rk.prod(periodic(csmonth), kron(csyear-.5)),
rk.prod(periodic(csmonth), cubic(csyear))))
> summary(ozone.fit8)
...
Coefficients (d):
(Intercept) I(csyear - 0.5)
336.8298        6.031531

GML estimate(s) of smoothing parameter(s) :
4.108999e-01 7.457006e-02 1.656897e+05 5.438436e+03
Equivalent Degrees of Freedom (DF):  24.59541
Estimate of sigma:  15.84154
> grid <- data.frame(csmonth=seq(0,1,len=50), csyear=seq(0,1,len=50))
> interaction.fit8 <- predict(ozone.fit8, grid, terms=c(0,0,0,0,1,1))
> max(abs(interaction.fit8$fit)/interaction.fit8$pstd)
[1] 0.009646192


The smoothing parameters for the interaction terms between month and year are large, which indicates that the interaction effects are small. Indeed the fitted values of interaction terms are very small (around ) compared to posterior standard deviations (around 0.01). Therefore, we delete these interaction terms in our final model.

> ozone.fit9 <- update(ozone.fit8, rk=list(periodic(csmonth), cubic(csyear)))
> summary(ozone.fit9)
...
Coefficients (d):
(Intercept) I(csyear - 0.5)
336.8287        5.989813

GML estimate(s) of smoothing parameter(s) : 0.41248034 0.07316391
Equivalent Degrees of Freedom (DF):  24.66142
Estimate of sigma:  15.8378


Figure shows the estimated main effects of month and year.

We may also consider observations as a long time series. Note that many observations are missing, thus the state space approach in Kitagawa and Gersch (1984) cannot be used directly. Our approach allows observations at unequally spaced points. Thus it can also be used to impute missing observations. Let us define a time variable as . Similar to Kitagawa and Gersch (1984), we may consider the following model

 (64)

where and model seasonal trend, models the long term trend and , is the maximum value of , and 's are random errors and . Model () is a partial spline. Since the domain is with , we use the cubic2 kernel function.
> Arosa$t <- csmonth+Arosa$year-1
> ozone.fit10 <- ssr(thick~t+sin(2*pi*t)+cos(2*pi*t), rk=cubic2(t),
spar="m", data=Arosa)
> summary(ozone.fit10)
...
Coefficients (d):
(Intercept)               t sin(2 * pi * t) cos(2 * pi * t)
330.8107521      -0.9901175     -47.3246660       7.7786452

GML estimate(s) of smoothing parameter(s) : 0.01568941
Equivalent Degrees of Freedom (DF):  20.38134
Estimate of sigma:  16.25793
> anova(ozone.fit10, simu.size=500)
Testing H_0: f in the NULL space

test.value simu.size simu.p-value approximate.p-value
LMP   1075.457       500         0.01
GML    0.98481       500            0         3.64301e-05


Estimate of the overall function and its two components, seasonal trend and long term trend, are shown in Figure . We can see that the long term trend is different from a constant.

Observations close in time are likely to be correlated. Suppose that we want to use the exponential correlation structure with nugget effect. Specifically, regard as a zero mean stochastic process with

 (65)

where is the nugget effect, and is the range parameter.
> ozone.fit11 <- update(ozone.fit10, corr=corExp(form=~t,nugget=T))
> ozone.fit11
...
GML estimate(s) of smoothing parameter(s) : 469618845337
Equivalent Degrees of Freedom (DF):  4
Estimate of sigma:  17.36963
Correlation structure of class corExp representing
range    nugget
0.3661093 0.6364710

New estimates are plotted in Figure . We can see that now the long term trend is almost a constant. The equivalent degrees of freedom is decreased to 4.008176.

The null space contains four components: the constant, linear, sine and cosine functions. Using the cubic2 kernel penalizes the sine and cosine functions which may lead to biases. If this is to be avoided, we can use the lspline kernel with type=''linSinCos''. The period for the seasonal component is 1, while the period for the function lspline with type=''linSinCos'' is . Therefore we multiply the covariate by a constant to match the scale.

> ozone.fat12 <- update(ozone.fit11, rk=lspline(2*pi*t,type="linSinCos"))
> ozone.fit12
...
GML estimate(s) of smoothing parameter(s) : 9.562702e+13
Equivalent Degrees of Freedom (DF):  4
Estimate of sigma:  17.37282
Correlation structure of class corExp representing
range    nugget
0.3665007 0.6360280


The estimates are plotted in Figure .

As in Kitagawa and Gersch (1984), sometimes we want to use a stochastic process to model the autocorrelation and regard this process as part of the signal. Then we need to separate this process with measurement errors and predict it at desired points. Specifically, assume

 (66)

where and are the same as in model (). Assume that is a stochastic process independent of with mean zero and , where is the range parameter as in ().

Denote as the vector of design points for the variable . Let be the vector of the process evaluated at design points. are random effects and , where depends the parameter . () is a SLM model. However, it cannot be fitted directly using slm since depends on the range parameter nonlinearly. We fit model () in two steps. We first regard as part of random error and estimate the range parameter. This is done in ozone.fit11. Then we calculate the estimate of (without nugget effect) and regard it as the true covariance matrix. We calculate the Cholesky decomposition of as , and transform the random effects , where . Now we are ready to fit the transformed SLM:

> tau <- coef(ozone.fit11$cor.est, F) > D <- corMatrix(initialize(corExp(tau[1],form=~t), data=Arosa)) > Z <- chol.new(D) > ozone.fit13 <- slm(thick~t+sin(2*pi*t)+cos(2*pi*t), rk=cubic2(t), random=list(pdIdent(~Z-1)), data=Arosa) > summary(ozone.fit13) ... Coefficients (d): (Intercept) t sin(2 * pi * t) cos(2 * pi * t) 332.2848808 0.3174285 -47.2930087 7.8994571 GML estimate(s) of smoothing parameter(s) : 50.62773 Equivalent Degrees of Freedom (DF): 4.517319 Estimate of sigma: 13.86935  Suppose that we want to predict on grid points , . Let . Then . > grid3 <- data.frame(t=seq(0,max(Arosa$t)+0.001,len=500))
> newdata <- data.frame(t=c(Arosa$t,grid3$t))
> RD <- corMatrix(initialize(corExp(tau[1],form=~t), data=newdata))
> R <- RD[(length(Arosa$t)+1):length(newdata$t),1:length(Arosa$t)] > u.new <- R%*%t(solve(Z))%*%as.vector(ozone.fit13$lme.obj$coef$random[[2]])


Estimates from ozone.fit13 are shown in Figure .

We now show how to use nnr and snr to fit varying coefficients models. Suppose that the thickness in each year can be well approximated by a sinusoidal function of month. We want to investigate how two coefficients, the average thickness and amplitude, change over years. Specifically, we consider the following model:

 (67)

where and are yearly average thickness and amplitude respectively, guarantees the horizontal shift to be between 0 and 1. Assume that . Note that the constraint is not enforced here. We may replace by as the amplitude function. The resulting model will be fitted later. Model () is a special case of the SNR model, and thus can be fitted using the function snr. We use the coefficients from ozone.fit1 to calculate the initial value for .
> tmp <- atan(-ozone.fit1$coef[3]/ozone.fit1$coef[2])/(2*pi)
> tmp <- log(tmp/(1-tmp))
> ozone.fit14 <- snr(thick~f1(csyear)+f2(csyear)*cos(2*pi*(csmonth+alogit(a))),
func=list(f1(x)+f2(x)~list(~x, cubic(x))), params=list(a~1),
data=Arosa, start=list(params=c(tmp)), spar="m")
> summary(ozone.fit14)
Semi-parametric Nonlinear Regression Model Fit by Gauss-Newton Method
Model: thick ~ f1(csyear) + f2(csyear) * cos(2 * pi * (csmonth + alogit(a)))
Data: Arosa

AIC      BIC    logLik
4362.429 4370.929 -2179.214

Coefficients:
Value  Std.Error   t-value p-value
a -1.243126 0.01933295 -64.30086       0

Standardized residuals:
Min          Q1         Med          Q3         Max
-3.37533987 -0.60538971 -0.03263471  0.49885054  3.13869256

GML estimate(s) of smoothing spline parameter(s): 0.0001225915 0.4999998276
Equivalent Degrees of Freedom (DF) for spline function:  16.95087
Residual standard error: 16.81619

Converged after 3 iterations

> p.ozone.fit14 <- intervals(ozone.fit14,data.frame(x=seq(0,1,len=100)),
terms=list(f1=matrix(c(1,1,1,1,1,0,0,0,1),ncol=3,byrow=T),
f2=matrix(c(1,1,1,1,1,0,0,0,1),ncol=3,byrow=T)))


Figures and show the estimated functions and their projections. The yearly average thickness has a similar trend as before, and the amplitude has a linear increasing, but non-significant, trend.

The sinusoidal function for monthly changes may be too restrictive. Then we can consider the following model

 (68)

where the function in model () is replaced by a nonparametric periodic function . Assume that . The constant functions were removed from the model space for identifiability. Thus we have . Also, for identifiability, we assume . Again, we did not enforce the constraint . Model () is a special case of the NNR model, thus can be fitted using nnr. In the following, instead of using nnr directly, we write a simple program using backfitting methods to estimate , and . The purpose is to show that writing S codes for more complicated models is fairly easy.
# transform time variable into [0,1]
z <- (Arosa$t-min(Arosa$t))/(max(Arosa$t)-min(Arosa$t))

# create matrices for RK's
S1 <- S2 <- cubic(z)
S3 <- periodic((max(Arosa$t)-min(Arosa$t))*z)

# find initial values
f3.tmp <- ssr(thick~1,rk=S3,data=Arosa,spar="m")
f3.est <- as.vector(S3%*%f3.tmp$rkpk.obj$c/max(abs(S3%*%f3.tmp$rkpk.obj$c)))
f2.est <- rep(max(abs(S3%*%f3.tmp$rkpk.obj$c)),length(thick))
f1.est <- rep(f3.tmp$rkpk.obj$d[1],length(thick))

# backfitting
prec <- 1
while (prec>.0001) {
# fix f2 and f3, fit f1
y.tmp <- thick-f2.est*f3.est
f1.tmp <- ssr(y.tmp~z,rk=S1,spar="m",limnla=c(-3,1))
f1.est.old <- f1.est
f1.est <- f1.tmp$fit # fix f1 and f3, fit f2. Note RK are multiplied by f3 y.tmp <- thick-f1.tmp$fit
f2.tmp <- ssr(y.tmp~f3.est+I(f3.est*z)-1,rk=rk.prod(S2,f3.est),
spar="m",limnla=c(-3,1))
f2.est.old <- f2.est
f2.est <- f2.tmp$rkpk.obj$d[1]+f2.tmp$rkpk.obj$d[2]*z
+as.vector(S2%*%f2.tmp$rkpk.obj$c)

# fix f1 and f2, fit f3, note the empty null space and weights
f3.tmp <- ssr(I(y.tmp/f2.est)~-1,rk=S3,spar="m",weights=f2.est)
f3.est.old <- f3.est
# enforce \sup |f_3(t)|=1
f3.est <- as.vector(S3%*%f3.tmp$rkpk.obj$c/
max(abs(S3%*%f3.tmp$rkpk.obj$c)))

# stopping criteria
prec <- max(sum((f1.est-f1.est.old)**2),
sum((f2.est-f2.est.old)**2),
sum((f3.est-f3.est.old)**2))
}

# for prediction of f3 on new grid, we need to refit which specifies the
# rk function rk=periodic. rk= a exist matrix will not work
y.tmp <- thick-f1.tmp$fit f3.tmp <- ssr(I(y.tmp/f2.est)~-1, rk=periodic((max(Arosa$t)-min(Arosa$t))*z), spar="m",weights=f2.est) # calculate fits for the plot.bCI function. # Note we inflate the posterior variances via degrees of freedom p.ozone.fit15.f1 <- predict(f1.tmp,terms=matrix(c(1,1,1,1,1,0,0,0,1),ncol=3, byrow=T)) p.ozone.fit15.f1$pstd <- p.ozone.fit15.f1$pstd*sqrt((length(thick)- f1.tmp$df)/(length(thick)-f1.tmp$df-f2.tmp$df-f3.tmp$df+1)) p.ozone.fit15.f2 <- predict(f2.tmp,terms=matrix(c(1,1,1,1,1,0,0,0,1),ncol=3, byrow=T)) p.ozone.fit15.f2$fit <- p.ozone.fit15.f2$fit/f3.est p.ozone.fit15.f2$pstd <- p.ozone.fit15.f2$pstd*sqrt((length(thick)- f2.tmp$df)/(length(thick)-f1.tmp$df-f2.tmp$df-f3.tmp$df+1))/abs(f3.est) grid4 <- data.frame(z=seq(0,1/(max(Arosa$t)-min(Arosa$t)),len=100)) p.ozone.fit15.f3 <- predict(f3.tmp,grid4) p.ozone.fit15.f3$pstd <- p.ozone.fit14.f3$pstd*sqrt((length(thick)- f3.tmp$df)/(length(thick)-f1.tmp$df-f2.tmp$df-f3.tmp$df+1))  Figures , and show the estimated functions and their projections. Yearly average thickness and amplitude have similar trends as before. is different from a sinusoidal function, even though the difference is small. To enforce the constraint in model (), we consider the following model  (69) where and . Since is close to a sinusoidal function, we use the -spline with . For identifiability, we remove the constant functions from the model space for : . > S3 <- cubic(z) > f3.tmp <- ssr(thick~1,rk=S3,data=Arosa,spar="m") > f3.ini <- as.vector(S3%*%f3.tmp$rkpk.obj\$c)
> ozone.fit16 <- nnr(thick~f1(csyear)+exp(f2(csyear))*f3(csmonth),
func=list(f1(x)~list(~I(x-.5),cubic(x)),
f2(x)~list(~I(x-.5)-1,cubic(x)),
f3(x)~list(~sin(2*pi*x)+cos(2*pi*x)-1,
lspline(x,type="sine0"))),
data=Arosa,
start=list(f1=mean(thick),f2=0,f3=f3.ini),
control=list(converg="coef"))
> ozone.fit16
Nonlinear Nonparametric Regression Model Fit by Gauss-Newton Method
Model: thick ~ f1(csyear) + exp(f2(csyear)) * f3(csmonth)
Data: Arosa

GML estimate(s) of smoothing parameter(s): 2.081637e-07 1.930501e-03 1.053201e-06
Equivalent Degrees of Freedom (DF):  34.20186

Residual standard error: 15.62308
Number of Observations: 518
Converged after 7 iterations

> x <- seq(0,1,len=50)
> u <- seq(0,1,len=50)
> p.ozone.fit16 <- intervals(ozone.fit16, newdata=list(csyear=x,csmonth=u),
terms=list(f1=matrix(c(1,1,1,1,1,0,0,0,1),nrow=3,byrow=T),
f2=matrix(c(1,1,1,0,0,1),nrow=3,byrow=T),
f3=matrix(c(1,1,1,1,1,0,0,0,1),nrow=3,byrow=T)))


Estimated functions and their projections are shown in Figures , and . They are similar to previous estimates.

Next: Global Climate Data Up: Examples Previous: Examples
Yuedong Wang 2004-05-19