This document contains some discussion points from class.
Some of you (especially those with Windows machines) ran into error messages along the lines of:
*** arch - i386
ERROR: compilation failed for package 'survMisc'
Similar issues have been reported here, here, and elsewhere. A workaround is to run the following code before the command that produces the error message:
> library(devtools)
> assignInNamespace("version_info", c(devtools:::version_info, list("3.5" = list(version_min = "3.3.0", version_max = "99.99.99", path = "bin"))), "devtools")
See here for the original discussion. This enables the Rtools toolchain to be used alongside devtools. However, this is only a temporary fix, and the user might need to run these lines every time devtools is loaded.
A possible reason for this issue is that Rtools is not installed or not being detected properly. In that case, the user can try installing Rtools for Windows from this link. Make sure to check “Add rtools to system PATH”:
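A quick way to see whether R can find the build toolchain is sketched below. The base-R check is safe to run anywhere; the commented line assumes the pkgbuild package (a devtools dependency) is installed.

```r
## Quick check (base R): is a build toolchain visible on the PATH?
## On Windows, an empty result suggests Rtools is missing or not on the PATH.
Sys.which("make")

## With the pkgbuild package installed, a fuller diagnostic is available:
## pkgbuild::has_build_tools(debug = TRUE)
```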
In one of the class examples, we plotted the empirical survival curve with:
> whas100 %>% filter(fstat > 0) %>% mutate(surv = 1 - ecdf(lenfol)(lenfol)) %>%
+ ggplot(aes(lenfol, surv)) + geom_step()
The code can be modified as follows so that the empirical survival curve starts from \(\hat{S}_e(0)= 1\):
> whas100 %>% filter(fstat > 0) %>% mutate(surv = 1 - ecdf(lenfol)(lenfol)) %>%
+ add_row(lenfol = 0, surv = 1,.before = 1) %>% ggplot(aes(lenfol, surv)) +
+ geom_step() + labs(title = "Codes provided by Tian Jiang")
The following example produces a similar plot without the need to store \(\hat{S}_e(t)\) in surv.
> whas100 %>% filter(fstat > 0) %>%
+ ggplot(aes(x = lenfol)) + geom_step(aes(y = 1 - ..y..), stat = "ecdf", show.legend = TRUE) +
+ ylab("surv")
Here we have \(\hat{S}_e(t) = 1\) for \(t < 6\) and \(\hat{S}_e(t) = 0\) for \(t \ge 2710\).
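The `1 - ecdf(lenfol)(lenfol)` construction can be sanity-checked on a small toy vector (illustrative times, not the whas100 data): \(\hat{S}_e(t)\) is just the fraction of observations strictly greater than \(t\).

```r
## Sanity check of the 1 - ecdf(t)(t) construction on toy data.
times <- c(2, 5, 5, 9)
surv_hat <- 1 - ecdf(times)(times)
## By definition, S_e(t) = (# observations strictly greater than t) / n:
by_hand <- sapply(times, function(t) mean(times > t))
all.equal(surv_hat, by_hand)  # TRUE
```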
Suppose we want to generate the sequence \(\{1, -1, \ldots, -1, 1\}\) with 10 1’s and 9 -1’s (the vector then has length = 19). Here are some possible approaches:
> (-1)^(0:18)
[1] 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1
> diff(0:19 %% 2)
[1] 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1
> diff(rep(0:1, 10))
[1] 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1
> rep(c(1, -1), 10)[1:19]
[1] 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1
> ifelse(1:19 %% 2, 1, -1)
[1] 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1
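One more variation, not among those shown in class: a cumulative product flips the sign at every \(-1\), giving the same alternating pattern.

```r
## A cumulative product flips the sign at each -1, alternating 1, -1, ...
cumprod(c(1, rep(-1, 18)))
```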
lm
Suppose we want to fit a linear regression with y as the outcome variable and \(\{\)X, Z\(\}\) as the covariates, where X is a family of covariates \(\{\)x1, x2, …, x100\(\}\) generated from standard uniform and Z is a family of covariates \(\{\)z1, z2, …, z100\(\}\) generated from standard normal. The data frame is observed in a way that the X’s and Z’s are placed in no particular order. For example, consider the following dat.
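A data set with this structure can be simulated along the following lines. This is only a sketch: the seed, sample size, and true coefficients are illustrative assumptions, since the code that generated the actual dat is not shown.

```r
## Sketch of simulating `dat` (base R; seed, n, and coefficients are
## illustrative assumptions, not the values used for the actual data).
set.seed(123)
n <- 500
X <- matrix(runif(n * 100), n, 100, dimnames = list(NULL, paste0("x", 1:100)))
Z <- matrix(rnorm(n * 100), n, 100, dimnames = list(NULL, paste0("z", 1:100)))
dat <- as.data.frame(cbind(X, Z)[, sample(200)])  # columns in random order
## Illustrative outcome model; the fitted coefficients above suggest a few
## x's with effects near +/- 1.
dat$y <- -0.25 + X[, "x88"] + X[, "x56"] - X[, "x77"] + rnorm(n, sd = 0.1)
dim(dat)  # 500 rows, 201 columns
```

Wrap the result in as_tibble() if a tibble is preferred, as in the display above.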
> ## fake data
> datatable(dat, options = list(pageLength = 10, scrollX=TRUE)) %>% formatRound(1:201, 3)
This fits all 200 covariates:
> head(coef(lm(y ~ ., dat)), 10)
(Intercept) x88 z17 x56 x77
-0.154390393 0.846932399 -0.001144469 1.066026142 -0.991538457
z86 z14 z31 x52 x22
-0.005226447 0.051806282 0.039848434 1.032491798 0.987676788
> length(coef(lm(y ~ ., dat)))
[1] 201
This fits a linear regression with only the X’s:
> head(coef(lm(y ~ ., dat[,grep("x|y", names(dat))])))
(Intercept) x88 x56 x77 x52 x22
-0.2496609 0.8704349 1.0865774 -0.9767312 0.9717011 1.0111935
> length(coef(lm(y ~ ., dat[,grep("x|y", names(dat))])))
[1] 101
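The grep("x|y", …) subsetting works here only because none of the z names happen to contain an “x” or a “y”. A sketch of a more robust alternative builds the formula explicitly with base R’s reformulate(), using an anchored pattern (toy data below, not the dat above):

```r
## Alternative to subsetting columns: build the formula explicitly.
## grep("^x") anchors the match, so only names starting with "x" qualify.
set.seed(1)
toy <- data.frame(y = rnorm(20),
                  x1 = runif(20), x2 = runif(20),
                  z1 = rnorm(20), z2 = rnorm(20))
f <- reformulate(grep("^x", names(toy), value = TRUE), response = "y")
f                         # y ~ x1 + x2
length(coef(lm(f, toy)))  # 3: intercept plus x1 and x2
```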
where, as in most regression models, the residuals sum to zero: \(\sum_{i=1}^n M_i(t_i)=0\). Consider a simulated dataset where the true survival time is generated from a standard exponential and the censoring time from a uniform distribution.
> n <- 100
> set.seed(1)
> sim <- tibble(cenTime = runif(n, 0, 5), time = pmin(rexp(n), cenTime),
+ status = 1 * (time < cenTime)) %>%
+ select(-cenTime)
The following code can be used to create a new dataset containing variables like \(d_i\), \(n_i\), etc. (the shortcut ni = 100:1 works here because all 100 simulated event times happen to be distinct; the aml example further down uses the more general rev(cumsum(rev(ni)))):
> dat <- sim %>% arrange(time) %>% group_by(time) %>%
+ summarize(di = sum(status), ni = length(status)) %>%
+ mutate(Nt = cumsum(di), ni = 100:1)
> dat
# A tibble: 100 x 4
time di ni Nt
<dbl> <dbl> <int> <dbl>
1 0.0203 1 100 1
2 0.0372 1 99 2
3 0.0521 1 98 3
4 0.0593 1 97 4
5 0.0694 1 96 5
6 0.0897 1 95 6
7 0.0911 1 94 7
8 0.0988 1 93 8
9 0.110 1 92 9
10 0.117 0 91 9
# … with 90 more rows
The individual martingales can be computed with
> dat <- dat %>% mutate(hi = di / ni, Hi = cumsum(hi), Mi = di - Hi)
> dat
# A tibble: 100 x 7
time di ni Nt hi Hi Mi
<dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.0203 1 100 1 0.01 0.01 0.99
2 0.0372 1 99 2 0.0101 0.0201 0.980
3 0.0521 1 98 3 0.0102 0.0303 0.970
4 0.0593 1 97 4 0.0103 0.0406 0.959
5 0.0694 1 96 5 0.0104 0.0510 0.949
6 0.0897 1 95 6 0.0105 0.0616 0.938
7 0.0911 1 94 7 0.0106 0.0722 0.928
8 0.0988 1 93 8 0.0108 0.0829 0.917
9 0.110 1 92 9 0.0109 0.0938 0.906
10 0.117 0 91 9 0 0.0938 -0.0938
# … with 90 more rows
Checking the Nelson–Aalen estimator:
> qplot(time, Hi, data = dat, geom = "step") +
+ geom_abline(intercept = 0, slope = 1, col = 2) + ylab("H(t)")
In this case, the sum of the martingale residuals is sum(dat$Mi) = \(4.302114 \times 10^{-16}\), close to 0 as expected.
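The same check can be repeated on freshly simulated data. The sketch below mirrors the construction above in base R (no dplyr), with an assumed seed and sample size; the residuals again sum to zero up to floating-point error.

```r
## Base-R re-check: martingale residuals of the Nelson-Aalen fit sum to zero.
set.seed(2)
n <- 250
cen    <- runif(n, 0, 5)
time   <- pmin(rexp(n), cen)
status <- as.numeric(time < cen)
status <- status[order(time)]
ni <- n:1                  # risk-set sizes (times are distinct here)
Hi <- cumsum(status / ni)  # Nelson-Aalen cumulative hazard at each time
Mi <- status - Hi          # individual martingale residuals
sum(Mi)                    # numerically zero
```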
In the aml dataset, the sum of the martingale residuals is
> dat <- aml %>% arrange(time) %>% select(-x) %>% group_by(time) %>%
+ summarize(di = sum(status), ni = length(status)) %>%
+ mutate(Nt = cumsum(di), ni = rev(cumsum(rev(ni))), Na = cumsum(di/ni)) %>% right_join(aml)
> with(dat, sum(status - Na))
[1] 3.330669e-16
“Cox described the proportional hazards model in Cox (1972), in what is now one of the most quoted statistical papers in history. He also outlined in this paper the method for estimation which he referred to as using conditional likelihood. It was pointed out to him in the literature that what he proposed was not a conditional likelihood and that there may be some flaws in his logic. Cox (1975) was able to recast his method of estimation through what he called ‘partial likelihood’ and published this in Biometrika. This approach seemed to be based on sound inferential principles. Rigorous proofs showing the consistency and asymptotic normality were not published until 1981 when Tsiatis and others (1981) demonstrated these large sample properties. In 1982, Andersen and Gill (1982) simplified and generalized these results through the use of counting processes.”
The origin of this historical note can be found here.
Andersen, Per Kragh, and Richard David Gill. 1982. “Cox’s Regression Model for Counting Processes: A Large Sample Study.” The Annals of Statistics 10 (4): 1100–1120.
Cox, David R. 1972. “Regression Models and Life-Tables.” Journal of the Royal Statistical Society. Series B (Methodological) 34 (2): 187–220.
———. 1975. “Partial Likelihood.” Biometrika 62 (2): 269–76.
Tsiatis, Anastasios A., and others. 1981. “A Large Sample Study of Cox’s Regression Model.” The Annals of Statistics 9 (1): 93–108.