Time Series Analysis of RTA Boardings, Route 16 (2008-2015)

November 20, 2016


This is a condensed version of an undergrad project for my Time Series analysis class at UCR. Files can be found at the rta-routes repository on GitHub.

Table of Contents

Data

1
2
3
4
5
6
7
8
  # Open TSA library
  library(TSA)

  # Set to directory where data files are
  # setwd("~/Documents/rta-routes/data/")

  # Import data files for routes 1 and 16; each file contains monthly numbers of riders
  route.16 <- ts(read.table(file="data/route16.txt"), start=c(2008,7), frequency = 12)

Although there are many bus routes, we are interested in Route 16 mainly due to the fact that it runs through UCR (University of California - Riverside). We create a time series data frame route.16 from the given .txt files. It contains monthly counts of how many riders take this bus route.

1
2
  # Plot time series
  plot(route.16, ylab="Number of Riders", xlab="Year", main="Riders Per Month - Route 16",type="o")

The time series plot suggest we can apply a seasonal model since there is a pattern. It also shows that there is an increasing trend over time, especially between years 2008-2013.


Model Specification

Data suggests an ARIMA(p,q,d) model. In order to reach stationarity, we look at possible transformations through a Box-Cox transformation with a 95% confidence interval and λ= 0.5.

Transformation

The output suggests a square root transformation.

1
2
3
4
5
  # Create subset of route 16 from Jan 2008 to Feb 2015
  route.16.window <- window(route.16, end=c(2015,2))

  # Use Box-Cox transformation to determine which power transf to use
  # BoxCox.ar(route.16.window)

Below is the plot of the transformed data:

1
  plot(sqrt(route.16.window),ylab="Passengers",xlab="Years",main="Squareroot transformation of Riders per month line 16")

Differentiation

The plot of the transformed data still shows stationarity, so we plot the first degree difference of the transformation. In addition to the stationarity, the plot still shows an increasing trend.

1
2
  # Differentiation
  plot(diff(sqrt(route.16.window),xlab="Years",ylab="First difference", main="First degree difference of the sqrt transformation"))

The Augmented Dickey-Fuller Test verifies the stationarity of the differentiated data with a p-value of 0.01. This value is sufficient to sustain the hypothesis that our data is stationary with a signficance level of α = 0.05.

1
2
3
4
5
6
7
8
9
  # Testing Stationarity
  adf.test(diff(sqrt(route.16.window)))

    ##
    ##  Augmented Dickey-Fuller Test
    ##
    ## data:  diff(sqrt(route.16.window))
    ## Dickey-Fuller = -6.8511, Lag order = 4, p-value = 0.01
    ## alternative hypothesis: stationary

We can now proceed to find the appropriate ARMA(p,q) model by using the sample ACF and PACF with a lag of 12:

1
2
  # Sample ACF
  acf(as.vector(diff(sqrt(route.16.window),lag=12)))

The Sample ACF resembles one of a seasonal MA(1) - it spikes at lag k = 1 and then drops below the margin of error. The Sample PACF also has a similar output.

1
2
  # Sample PACF
  pacf(as.vector(diff(sqrt(route.16.window),lag=12)))

Both figures suggest that an ARIMA(1,1,1) model is most suitable for our data. To confirm this, we can also look at the EACF:

1
2
3
4
5
6
7
8
9
10
11
12
13
  # Sample EACF
  eacf(as.vector(diff(sqrt(route.16.window),lag=12)))

    ## AR/MA
    ##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
    ## 0 x x o x x o o o o o o  o  o  o
    ## 1 x o o o o o o o o o o  o  o  o
    ## 2 x o o o o o o o o o o  o  o  o
    ## 3 x x x o o o o o x o o  o  o  o
    ## 4 x o o x o o o o o o o  o  o  o
    ## 5 x o o x o o o o o o o  o  o  o
    ## 6 o x o o o o o o o o o  o  o  o
    ## 7 o x o o x o o o o o o  o  o  o

The tip of the EACF is reached at the point (1,1), creating a wedge from there. In addition to the ARIMA(1,1,1) model, another suggested model for the data is an ARI(1,1) model since the sample ACF does not immediately drop below the margin of error. Different results can be obtained if the MA(1) of the ARIMA model is left out.


Model Fitting

Since we want to estimate how many people ride the bus each month, we will test two different approaches of our model, one with seasonality and one without it.

ARIMA(1,1,1) non-seasonal

1
2
3
4
5
6
7
8
9
10
11
12
13
14
  # ARIMA(1,1,1) non-seasonal
  arima111.ns.fit <- arima(sqrt(route.16.window), order=c(1,1,1), method='ML')
  arima111.ns.fit

    ##
    ## Call:
    ## arima(x = sqrt(route.16.window), order = c(1, 1, 1), method = "ML")
    ##
    ## Coefficients:
    ##          ar1      ma1
    ##       0.3046  -0.8660
    ## s.e.  0.1218   0.0487
    ##
    ## sigma^2 estimated as 315.9:  log likelihood = -339.87,  aic = 683.75

ARIMA(1,1,1) seasonal

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
  # ARIMA(1,1,1) seasonal
  arima111.fit <- arima(sqrt(route.16.window), order=c(1,1,1), seasonal=list(order=c(1,1,1), period=12), method="ML")
  arima111.fit

    ##
    ## Call:
    ## arima(x = sqrt(route.16.window), order = c(1, 1, 1), seasonal = list(order = c(1,
    ##     1, 1), period = 12), method = "ML")
    ##
    ## Coefficients:
    ##          ar1      ma1     sar1    sma1
    ##       0.2056  -0.8089  -0.8462  0.6708
    ## s.e.  0.1734   0.1168   0.3205  0.4487
    ##
    ## sigma^2 estimated as 36.52:  log likelihood = -217.05,  aic = 442.09

The following are the parameters:

which is equivalent to:

With ML estimates as \( \sigma^2_{e} \)≈ 315.9, \( \hat\phi \) = 0.3146, \( \hat\theta \) = -0.8660. The approximate 95% confidence intervals for \( \hat\phi \) and \( \hat\theta \) is (0.065872, 0.543328) and (-0.961452,-0.770548), respectively. We found that \( \hat\phi \) and \( \hat\theta \)are significantly different from 0.

ARI(1,1) non-seasonal

1
2
3
4
5
6
7
8
9
10
11
12
13
14
  # ARI(1,1) non-seasonal
  ari11.ns.fit <- arima(sqrt(route.16.window), order=c(1,1,0), method='ML')
  ari11.ns.fit

    ##
    ## Call:
    ## arima(x = sqrt(route.16.window), order = c(1, 1, 0), method = "ML")
    ##
    ## Coefficients:
    ##           ar1
    ##       -0.1733
    ## s.e.   0.1101
    ##
    ## sigma^2 estimated as 413.4:  log likelihood = -350.07,  aic = 702.15

ARI(1,1) seasonal

1
2
3
4
5
6
7
8
9
10
11
12
13
14
  ari11.fit <- arima(sqrt(route.16.window), order=c(1,1,0), seasonal=list(order=c(1,1,0), period=12), method="ML")
  ari11.fit

    ##
    ## Call:
    ## arima(x = sqrt(route.16.window), order = c(1, 1, 0), seasonal = list(order = c(1,
    ##     1, 0), period = 12), method = "ML")
    ##
    ## Coefficients:
    ##           ar1     sar1
    ##       -0.4081  -0.2667
    ## s.e.   0.1104   0.1282
    ##
    ## sigma^2 estimated as 44.52:  log likelihood = -222.77,  aic = 449.54

Model Diagnosis

Non-Seasonal Models

1
2
3
4
5
6
7
8
  #ARI(1,1)
  par(mfrow=c(2,2))
  plot(route.16, ylab="Number of Riders", xlab="Year", main="Riders Per Month -Route 16",type="o")
  plot(rstandard(ari11.ns.fit), ylab="Std. Residuals", xlab="Time", type="o")
  abline(h=0)
  hist(rstandard(ari11.ns.fit), xlab="Std. Residuals", main="")
  qqnorm(rstandard(ari11.ns.fit), main="")
  qqline(rstandard(ari11.ns.fit), main="")

Normality and Independence

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
  # Run Shapiro-Wilks test for normality
  shapiro.test(rstandard(ari11.ns.fit))  

    ##
    ##  Shapiro-Wilk normality test
    ##
    ## data:  rstandard(ari11.ns.fit)
    ## W = 0.95787, p-value = 0.009997

  shapiro.test(rstandard(arima111.ns.fit))

    ##
    ##  Shapiro-Wilk normality test
    ##
    ## data:  rstandard(arima111.ns.fit)
    ## W = 0.97308, p-value = 0.08925

  # Runs test for Independence
  runs(rstandard(ari11.ns.fit))  


    ## $pvalue
    ## [1] 0.916
    ##
    ## $observed.runs
    ## [1] 41
    ##
    ## $expected.runs
    ## [1] 40.975
    ##
    ## $n1
    ## [1] 39
    ##
    ## $n2
    ## [1] 41
    ##
    ## $k
    ## [1] 0

  runs(rstandard(arima111.ns.fit))

    ## $pvalue
    ## [1] 0.322
    ##
    ## $observed.runs
    ## [1] 35
    ##
    ## $expected.runs
    ## [1] 39.775
    ##
    ## $n1
    ## [1] 33
    ##
    ## $n2
    ## [1] 47
    ##
    ## $k
    ## [1] 0

Residual PACF

1
2
  # Residual PACF
  pacf(rstandard(ari11.ns.fit), main="Sample PACF")  

Overfitting

1
2
3
4
5
6
7
8
  #ARIMA(1,1,1)
  par(mfrow=c(2,2))
  plot(route.16, ylab="Number of Riders", xlab="Year", main="Riders Per Month - Route 16",type="o")
  plot(rstandard(arima111.ns.fit), ylab="Std. Residuals", xlab="Time", type="o")
  abline(h=0)
  hist(rstandard(arima111.ns.fit), xlab="Std. Residuals", main="")
  qqnorm(rstandard(arima111.ns.fit), main="")
  qqline(rstandard(arima111.ns.fit), main="")

1
2
3
4
  # Residual ACF / PACF
  par(mfrow=c(2,2))
  acf(rstandard(arima111.ns.fit), main="Sample ACF")
  pacf(rstandard(arima111.ns.fit), main="Sample PACF")


Seasonal Models

1
2
3
4
5
6
7
8
  #ARI(1,1)
  par(mfrow=c(2,2))
  plot(route.16, ylab="Number of Riders", xlab="Year", main="Riders Per Month -Route 16",type="o")
  plot(rstandard(ari11.fit), ylab="Std. Residuals", xlab="Time", type="o")
  abline(h=0)
  hist(rstandard(ari11.fit), xlab="Std. Residuals", main="")
  qqnorm(rstandard(ari11.fit), main="")
  qqline(rstandard(ari11.fit), main="")

Normality and Independence

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
  # Run Shapiro-Wilks test for normality
  shapiro.test(rstandard(ari11.fit))  

    ##
    ##  Shapiro-Wilk normality test
    ##
    ## data:  rstandard(ari11.fit)
    ## W = 0.90862, p-value = 2.871e-05

  shapiro.test(rstandard(arima111.fit))

    ##
    ##  Shapiro-Wilk normality test
    ##
    ## data:  rstandard(arima111.fit)
    ## W = 0.91728, p-value = 7.093e-05

  # Runs test for Independence
  runs(rstandard(ari11.fit))  

    ## $pvalue
    ## [1] 0.145
    ##
    ## $observed.runs
    ## [1] 34
    ##
    ## $expected.runs
    ## [1] 40.975
    ##
    ## $n1
    ## [1] 39
    ##
    ## $n2
    ## [1] 41
    ##
    ## $k
    ## [1] 0

  runs(rstandard(arima111.fit))

    ## $pvalue
    ## [1] 0.000994
    ##
    ## $observed.runs
    ## [1] 26
    ##
    ## $expected.runs
    ## [1] 40.975
    ##
    ## $n1
    ## [1] 41
    ##
    ## $n2
    ## [1] 39
    ##
    ## $k
    ## [1] 0

Residual PACF

1
2
  # Residual PACF
  pacf(rstandard(ari11.fit), main="Sample PACF")  


Forecasting

To test our model, we withold observations from February 2015 to February 2016. The table below summarizes the actual values in March 2015 and January 2016 and the predictions estimated by the four ARIMA and ARI models.

Time Observation ARIMA (1,1,1) Seasonal ARI (1,1) Seasonal ARIMA (1,1,1) Non-Seasonal ARI (1,1) Non-Seasonal
March 2015 119.5 122.3 119.43 125.42 131.9
January 2016 122.21 135.11 133.87 122.78 131.83

R time series