
Statistics and Probability Letters 122 (2017) 96–103

Individual-specific, sparse inverse covariance estimation in generalized estimating equations

Qiang Zhang a, Edward H. Ip a,∗, Junhao Pan b, Robert Plemmons c

a Wake Forest School of Medicine, USA
b Sun Yat-sen University, China
c Wake Forest University, USA

Article history: Received 18 December 2015; Received in revised form 21 October 2016; Accepted 21 October 2016; Available online 16 November 2016.
Keywords: Orthogonal matching pursuit; Covariance modeling; Asymptotic

Abstract

This paper proposes a data-driven approach that derives individual-specific sparse working correlation matrices for generalized estimating equations (GEEs). The approach is motivated by the observation that, in some applications of the GEE, the covariance structure across individuals is heterogeneous and cannot be appropriately captured by a single correlation matrix. The proposed approach enjoys both favorable computational and asymptotic properties. Simulation experiments and an analysis of intensively measured longitudinal data on 158 participants collected from a dietary and emotion study are presented. © 2016 Elsevier B.V. All rights reserved.

∗ Correspondence to: Department of Biostatistical Sciences, Wake Forest School of Medicine, Medical Center Blvd, Winston-Salem, NC 27157, USA. Fax: +1 336 716 6427. E-mail address: [email protected] (E.H. Ip). http://dx.doi.org/10.1016/j.spl.2016.10.023

1. Introduction

As a statistical methodology, the Generalized Estimating Equation (GEE) (Liang and Zeger, 1986) has been widely used for longitudinal/clustered data analysis. In longitudinal data collected from a biomedical study, for example, observations from the same subject over time cluster together and are therefore correlated. Ignoring the intra-subject correlation can overstate the precision of the estimated effects of risk factors or treatments, leading to spuriously significant test results. The broad adoption of the GEE is partly due to the property that only the first few marginal moments of the joint density need to be specified, rather than a full specification. The GEE has proven to be generally robust to misspecification of the correlation when the interest lies in the mean structure. The within-subject correlation is captured by a working correlation matrix, R, which is often assumed to be subject independent. Several correlation structures are commonly used: independent, order-1 autoregressive (AR(1)), exchangeable, and unstructured. These standard correlation structures have their limitations. The unstructured correlation, for example, is rarely used in high-dimensional situations. The other correlation structures are problematic when within-cluster correlations are present and heterogeneous. As Fitzmaurice (1995) observed, in such cases large efficiency losses may occur with the first three commonly used structures for R, and the loss in efficiency generally becomes more severe as the cluster size increases; see Wang and Carey (2003). The limitations of the standard correlation structures for R are that (1) they lack the flexibility to capture salient features of the correlation in high dimension, and (2) they are not designed to handle diverse patterns of correlation among individuals. For example, consider the case of n independent and identically distributed p-dimensional variables Y = (y_1, ..., y_p)^T. One direct solution may be to specify individual-specific covariances. However, even in situations in which the cluster size is large, the empirical covariance matrix for each subject, Σ_i = (y_i − µ(β))(y_i − µ(β))^T, where y_i is the response and µ(β) is the mean as a function of some parameter β, cannot be used to derive the individual correlation matrix R_i, due to possible rank deficiency and non-invertibility.

In this paper, we propose to estimate individual-specific covariance within the GEE context, in which each individual-specific covariance is assumed to take on a sparse structure. Individual-specific covariance estimation is appealing for several reasons. First, in longitudinal data with large clusters (e.g., intensively repeated measurements on the individuals), there could be a long-term influence of historical data on the present outcome, which would invalidate the first-order autoregressive (AR(1)) assumption (Liang and Zeger, 1986). Second, there may exist large individual differences, in terms of both the structure and the magnitude of the covariance matrix. For example, in a data set collected using the experience sampling method (ESM) in a dietary behavior study (see Section 3.2), the lag at which the largest correlation between current and lagged observations occurred varied widely across individuals, ranging between 1 and 40. Yet another reason for using individual-specific covariance is to enhance goodness-of-fit and improve the efficiency of the parameter estimates in the mean function. Misspecifying the correlation and covariance structure can lead to poor efficiency and convergence problems (Sutradhar and Das, 1999; Sutradhar and Kumar, 2001; Kauermann and Carroll, 2001; Wang and Carey, 2003; Wang and Hin, 2010). This is especially important when n is small and the focus of inference is on the effect of various factors on the mean response. In other words, if the true model is such that individual ''clusters'' do have different covariance matrices, then correctly specifying the individual covariance structure would improve estimation efficiency.
However, even when the number of observations within a cluster is large, naively using an unstructured covariance matrix for each individual could easily lead to overfitting. It is therefore desirable to limit the number of parameters in models for individual-specific covariance matrices through modern statistical techniques such as regularization. Our proposed individual-specific, sparse-covariance estimation falls into the broad literature on regularization of the covariance matrix and the inverse covariance matrix for high-dimensional data, in which the dimension p of the independent and identically distributed variables may be much larger than the number of observations n, p ≫ n (Wu and Pourahmadi, 2003; Levina et al., 2008; Rothman et al., 2009). Dempster (1972) argued for the use of the inverse covariance matrix, or precision matrix, which is the canonical parameter of the multivariate normal distribution, instead of the covariance matrix. Zero entries of a precision matrix correspond to conditional independence of the corresponding variables under a joint multivariate normal distribution (Lauritzen, 1996). A useful approach (Banerjee et al., 2008; Friedman et al., 2008) to choosing a sparse precision matrix is to regularize the matrix using the l1 norm:

    \max_{X \succ 0} L(X) = \log|X| - \langle X, S \rangle - \alpha \|\mathrm{vec}(X)\|_1,    (1)

where the sparse estimator X of the inverse covariance is constrained to be positive definite and |X| denotes the determinant of X. The matrix S denotes the sample covariance matrix, ⟨·,·⟩ denotes the inner product of two matrices, and vec denotes the vectorization operator. One drawback of this approach is the slow computation in high-dimensional cases, especially in the graphical LASSO (gLasso) approach (Friedman et al., 2008; Meinshausen and Bühlmann, 2006), where entries of X are individually estimated. Even with the faster approach that uses the alternating direction method of multipliers (ADMM) (Boyd et al., 2011; Yuan, 2012), the gLasso procedure involves an eigen-decomposition of a large covariance matrix, which is often quite slow. Regularization methods have also been used in the context of the GEE. For example, Fu (2003) proposed a bridge-penalty model for GEE in longitudinal studies. More recently, Blommaert et al. (2014) extended penalized GEE to high-dimensional data and to data with multicollinearity and time dependence. These approaches do not estimate individual-specific covariances.

The key idea that makes individual-specific sparse (inverse) covariance estimation possible is the representation of the inverse covariance with a regression-based model and the sparsification of that regression model. In practice, a regression-based model of the covariance structure in the GEE represents a trade-off between the extreme of an independence covariance structure and the unstructured covariance matrix with p(p + 1)/2 parameters at the other extreme (Zimmerman and Núñez-Antón, 2010). Regression-based modeling of the covariance structure within the GEE context has been investigated by Pourahmadi (2000) and Ye and Pan (2006). We follow Pourahmadi (2011) and Huang et al. (2006) to set up the regression-based model for individual-specific precision-matrix estimation as follows. Let Y = (y_1, ..., y_p)^T be a random vector with mean zero and a positive-definite covariance matrix Σ. The Cholesky decomposition of Σ is

    \Sigma = C C^{T},    (2)

where C = (c_{ij}) is a unique lower-triangular matrix with positive diagonal entries. Let D = diag(c_{11}, ..., c_{pp}). Then we have

    \Sigma^{-1} = T^{T} D^{-2} T,    (3)

where T = DC^{-1}. Further assume that the process y_t, t = 1, ..., p, can be represented by the following autoregressive (AR) model:

    y_t = \sum_{j=1}^{t-1} \phi_j y_{t-j} + \epsilon_t,    (4)


where ϵ_t represents Gaussian noise and the matrix T has an explicit form (Pourahmadi, 2011):

    T = \begin{pmatrix}
    1 & & & \\
    -\phi_1 & 1 & & \\
    -\phi_2 & -\phi_1 & 1 & \\
    \vdots & \ddots & \ddots & \ddots \\
    -\phi_{p-1} & \cdots & -\phi_1 & 1
    \end{pmatrix}.    (5)

It is clear that if the regression coefficient vector φ = (φ_1, ..., φ_{p−1}) is sparse, then the matrix T is sparse below the diagonal, and Σ^{-1} can then generally be expected to be a sparse matrix. Importantly, the regression model (4) for the covariance results in a Toeplitz matrix T with a diagonal band structure and only p − 1 distinct entries. If the variances on the diagonal of Σ are equal, or equivalently Y is a stationary random process, then D is a multiple of the identity matrix and Σ^{-1} is also a sparse matrix that is nearly Toeplitz, except for entries in the first and last few rows and columns. To regularize the Cholesky factor, Huang et al. (2006) used a doubly penalized approach and constructed the estimator of the inverse covariance from the regularized Cholesky factor. As pointed out by Zhang and Zou (2014), that approach always gives a positive-semidefinite matrix but does not guarantee a sparse estimator of the inverse covariance matrix. The approach used in this paper adopts a different tactic: it sparsifies the regression coefficients by limiting the number of nonzero regression coefficients using the orthogonal matching pursuit (OMP) algorithm (Tropp and Gilbert, 2007). One advantage of our approach is that if the AR process is stationary — i.e., the roots of the polynomial with coefficients φ_j are within the unit circle — the estimated precision matrix from (3) is always positive definite. The Toeplitz assumption simplifies the estimation to solving k regression models, where k is the number of nonzeros in φ, compared with O(p^2) models for the graphical Lasso (Friedman et al., 2008; Meinshausen and Bühlmann, 2006).
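To make the construction concrete, the mapping from a sparse coefficient vector φ to the precision matrix in (3) and (5) can be sketched as follows. This is a minimal illustration under the stationary assumption D = σI; the helper names are ours, not from the paper.

```python
import numpy as np

def toeplitz_T(phi, p):
    # Unit lower-triangular Toeplitz factor of eq. (5):
    # ones on the diagonal, -phi_j on the j-th subdiagonal.
    T = np.eye(p)
    for j, ph in enumerate(phi, start=1):
        idx = np.arange(j, p)
        T[idx, idx - j] = -ph
    return T

def ar_precision(phi, p, sigma=1.0):
    # Eq. (3): Sigma^{-1} = T^T D^{-2} T, with D = sigma * I (stationary case).
    T = toeplitz_T(phi, p)
    return T.T @ T / sigma**2
```

Because T is unit lower-triangular (hence nonsingular), the resulting precision matrix is positive definite, and a banded φ yields a banded precision matrix, illustrating the sparsity claim above.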
Additionally, under the regression model that leads to the decomposition of the inverse covariance matrix with a Toeplitz matrix T, direct inversion of the Cholesky factor can be avoided, resulting in much faster computation, which is important in problems with large cluster sizes. Of note, the model as captured by (5) pertains to Gaussian noise but can be extended to a more general model, as we shall see in the Method section.

2. Method

2.1. Notation and preliminaries

Let y_{ij} be the scalar observation of the jth measurement of individual i, with x_{ij} being a q × 1 covariate vector of individual i, where j = 1, ..., p_i and i = 1, ..., n. The cluster size p_i can vary by individual, but for simplicity we assume balanced data, i.e., p_i ≡ p. In this paper, p denotes the number of time points. Liang and Zeger (1986) assumed an exponential-family distribution for the marginal y_{ij} through the generalized linear model:

    f(y_{ij} \mid x_{ij}, \beta, \psi) = \exp\{[y_{ij}\theta_{ij} - a(\theta_{ij}) + b(y_{ij})]/\psi\},    (6)

where θ_{ij} = x_{ij}^T β, and ψ is a dispersion parameter. The vector β is the regression parameter of interest. The first and second moments are

    \mu_{ij}(\beta) = E(y_{ij} \mid x_{ij}, \beta, \psi) = a'(\theta_{ij}), \qquad \sigma_{ij}(\beta) = \mathrm{Var}(y_{ij} \mid x_{ij}, \beta, \psi) = a''(\theta_{ij})\psi.    (7)

With the canonical link function, h(t) = a'(t) is the inverse link function and g(t) = h^{-1}(t) is the link function; hence g(µ_{ij}) = x_{ij}^T β. Let y_i = (y_{i1}, ..., y_{ip})^T, X_i = (x_{i1}, ..., x_{ip})^T, µ_i = E(y_i | X_i, β, ψ) = (µ_{i1}, ..., µ_{ip})^T, and Σ_i = Cov(y_i | X_i, β, ψ). Denote A_i = diag(σ_{i1}^2, ..., σ_{ip}^2), D_i = ∂µ_i/∂β, and V_i = A_i^{1/2} R_i A_i^{1/2}. Here, R_i is the ''working'' correlation matrix. If we denote the true correlation matrix as R̄_i, then R_i = R̄_i implies V_i = Σ_i. Liang and Zeger (1986) proposed solving the following generalized estimating equations (GEEs):

    g(\beta) = \sum_{i=1}^{n} D_i^{T} V_i^{-1} (y_i - \mu_i).    (8)

There are usually four steps in the GEE estimation procedure:

1. Initialize β, say, using a univariate generalized linear model, i.e., by assuming within-individual independence.
2. Compute the Pearson residuals, ỹ_i = A_i^{-1/2}(y_i − µ_i), and estimate the parameters of a structured working correlation matrix and the dispersion parameter.
3. Compute R_i and V_i.


4. Update β iteratively using the following Newton–Raphson rule, until convergence:

    \beta^{(l+1)} = \beta^{(l)} - \left(\sum_{i=1}^{n} D_i^{T} V_i^{-1} D_i\right)^{-1} \left(\sum_{i=1}^{n} D_i^{T} V_i^{-1} (y_i - \mu_i)\right).    (9)
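The four steps above can be sketched for the Gaussian/identity-link special case, in which D_i = X_i and, assuming unit dispersion, V_i^{-1} reduces to R_i^{-1}. This is an illustrative reconstruction (not the paper's GEEQBox-based code), and we use the standard Fisher-scoring sign convention for the update in (9).

```python
import numpy as np

def gee_update(X_list, y_list, Rinv_list, beta):
    # One scoring step of eq. (9), identity link: D_i = X_i, V_i^{-1} = R_i^{-1}
    # (unit dispersion assumed).
    q = beta.shape[0]
    H = np.zeros((q, q))
    g = np.zeros(q)
    for X, y, Rinv in zip(X_list, y_list, Rinv_list):
        H += X.T @ Rinv @ X                 # sum_i D_i^T V_i^{-1} D_i
        g += X.T @ Rinv @ (y - X @ beta)    # sum_i D_i^T V_i^{-1} (y_i - mu_i)
    return beta + np.linalg.solve(H, g)

def gee_fit(X_list, y_list, Rinv_list, n_iter=10):
    beta = np.zeros(X_list[0].shape[1])     # step 1: initialize beta
    for _ in range(n_iter):                 # steps 3-4: iterate until convergence
        beta = gee_update(X_list, y_list, Rinv_list, beta)
    return beta
```

In the identity-link case the update converges in one step; for nonlinear links, D_i and the variances would be recomputed from β at each iteration.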

2.2. Model with individual sparse inverse covariance

Continuing with the setup in the GEE, assume that the residuals ỹ_{ij} = y_{ij} − µ_{ij}(β) follow an order-m autoregressive (AR(m)) model (m ≤ p):

    \tilde{y}_{ij} = \sum_{t=1}^{m} \phi_{it} \tilde{y}_{i,j-t} + \epsilon_{ij},    (10)

where φ_{it} is the contribution of the residual at time j − t to the residual at time j; ϵ_{ij} is i.i.d. and satisfies E(ϵ_{ij}) = 0 and E(ϵ_{ij}^2) = σ̃_i^2; and there exists δ > 0 such that E(ϵ_{ij}^{4+δ}) < ∞. This formulation generalizes (4), as we only need to specify the mean and the variance. The fourth-moment finiteness assumption clearly holds for binary y_{ij}. The autoregressive coefficient vector φ_i = (φ_{i1}, ..., φ_{im})^T is assumed to be individual specific and sparse, with k nonzero elements. The error variance in (10), σ̃_i^2, is related to σ_{ij} through φ_i, and hence σ_{ij} is independent of the index j, i.e., σ_{ij} = σ_i for some σ_i > 0. The GEE modeling effort described in the preceding section is augmented with the assumption that the individual covariance matrix Σ_i follows an autoregressive model whose Cholesky decomposition leads to the Toeplitz structure defined by (5). It is also assumed that the individual matrix is sparse, with the maximum number of nonzero entries of φ_i bounded by a predetermined limit. Except for the model for the Σ_i component, the GEE procedure for the proposed model remains unchanged.

2.3. Sparse inverse covariance estimation (SICE)

For each individual i, we solve the following minimization problem for an estimate of φ_i:

    \min_{\phi_i} \sum_{j} \Big(\tilde{y}_{ij} - \sum_{t=1}^{m} \phi_{it} \tilde{y}_{i,j-t}\Big)^2, \quad \text{subject to } \|\phi_i\|_0 \le k.    (11)

Note that the problem is set up slightly differently from the following regression model described in Wu and Pourahmadi (2003):

    \tilde{y}_{ij} = \sum_{t=1}^{j-1} \phi_{jt} \tilde{y}_{it} + \epsilon_{ij},    (12)

where, at a given time point j, φ_{jt} is lag dependent. In our model, φ_i is solved at the individual level, and the AR process is assumed to be stationary. Specification under (12) leads to a single unstructured precision matrix for all y_i, whereas the stationarity assumption here leads to individual-specific Toeplitz precision matrices. We use the compact OMP algorithm to solve (11), which can be rewritten in matrix–vector form as

    \hat{\phi}_i = \arg\min_{\phi_i} \|B_i \phi_i - b_i\|_2^2, \quad \text{subject to } \|\phi_i\|_0 \le k,    (13)

where b_i = (ỹ_{i,m+1}, ..., ỹ_{ip})^T and

    B_i = \begin{pmatrix}
    \tilde{y}_{i,m} & \cdots & \tilde{y}_{i,1} \\
    \vdots & & \vdots \\
    \tilde{y}_{i,p-1} & \cdots & \tilde{y}_{i,p-m}
    \end{pmatrix}.    (14)

The OMP algorithm initializes the residual vector r_i = b_i, the solution φ_i as a zero vector, and the support set S as empty. At each iteration, the OMP algorithm selects the index of the largest (in absolute value) entry of B_i^T r_i and adds it to S. It then seeks a least-squares (LS) solution on the subset S of entries of φ_i:

    \phi_{iS} = (B_{iS}^{T} B_{iS})^{-1} B_{iS}^{T} b_i,    (15)

where B_{iS} is the submatrix of B_i containing only the columns of B_i with indices in S. The residual vector r_i = b_i − B_i φ_i is then recomputed using the updated value of φ_i. The iterative algorithm stops after the kth iteration. At iteration l, l = 1, ..., k, the OMP essentially seeks a least-squares solution of an AR(l) model, and hence the asymptotic normality of the estimators φ̂_{iS} holds at each iteration if the bounded-fourth-moment condition is satisfied, i.e., E(ϵ_{ij}^4) < ∞ (see, e.g., Theorem 2.4 in Shumway and Stoffer, 2000). We are, however, interested in the rate of convergence of φ̂_i, which will subsequently be used for deriving the rate of convergence of the estimated working correlation matrix.
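The OMP iteration just described can be sketched compactly as follows. This is an illustrative implementation of the greedy-selection and least-squares-refit steps for problem (13), not the authors' code.

```python
import numpy as np

def omp(B, b, k):
    # Orthogonal matching pursuit: min ||B phi - b||_2^2 s.t. ||phi||_0 <= k.
    n_cols = B.shape[1]
    S = []                      # support set, initially empty
    r = b.copy()                # residual, initialized to b
    phi = np.zeros(n_cols)
    for _ in range(k):
        # greedy step: column most correlated with the current residual
        j = int(np.argmax(np.abs(B.T @ r)))
        if j not in S:
            S.append(j)
        # least-squares refit on the active set, eq. (15)
        coef, *_ = np.linalg.lstsq(B[:, S], b, rcond=None)
        phi = np.zeros(n_cols)
        phi[S] = coef
        r = b - B @ phi         # update residual with the refit solution
    return phi, sorted(S)
```

Because each refit re-solves the full least-squares problem on the active set, the residual is orthogonal to all selected columns, which is what distinguishes OMP from plain matching pursuit.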


It is important to point out that the value of k is typically small compared to the total number of entries in the m-banded Toeplitz matrix T, with m ≤ p, where p denotes the number of time points in the GEE. In practice, m is meant to be an upper limit on the plausible order of the AR model. Although the value of m is set to be uniform across individuals, the order of the AR model for each individual is actually characterized by the pattern of the k nonzero terms in the individualized inverse covariance matrix.

2.4. Asymptotic properties of the SICE

We provide two theoretical results for the SICE estimates. Because of space limitations, the proofs of the theorems are made available in a separate technical report. Here we only outline the ideas underlying the proofs.

Theorem 1. The OMP estimator φ̂_i satisfies

    \sup_{z \in \mathbb{R}^m} \big| P[\sqrt{p}(\hat{\phi}_i - \phi_i) \le z] - G(z) \big| \le C \frac{\log p}{\sqrt{p}},    (16)

where G(z) is the cumulative distribution function of N(0, Γ_{im}), Γ_{im} = E(W_{it} W_{it}^T), W_{it} = (ỹ_{it}, ..., ỹ_{i,t−m+1})^T, and C is a constant.

The proof largely follows that of Roy and Bhattacharya (2013), except that we use the least-squares estimator rather than the instrumental-variable estimator. The least-squares estimator is defined by

    \hat{\phi} = \left(\sum_{t=1}^{p} W_t W_t^{T}\right)^{-1} \sum_{t=1}^{p} W_t \tilde{y}_{t+1},    (17)

where, for simplicity, the subject index i is omitted and W_t = (ỹ_t, ..., ỹ_{t−m+1})^T. The OMP estimator takes the inverse of a k × k submatrix of Σ_{t=1}^{p} W_t W_t^T, and in the proof the convergence rate of the parameters in φ is evaluated.

Given the estimate φ̂_i, we can now construct an estimate of the working correlation matrix R_i from φ̂_i. Similar to (3) and (5), and noting that D_i = σ_i I, we write

    \hat{R}_i^{-1} = \hat{T}_i \hat{T}_i^{T},    (18)

where

    \hat{T}_i = \begin{pmatrix}
    1 & & & & & \\
    -\hat{\phi}_{i1} & 1 & & & & \\
    -\hat{\phi}_{i2} & -\hat{\phi}_{i1} & 1 & & & \\
    \vdots & & \ddots & \ddots & & \\
    -\hat{\phi}_{im} & \cdots & & -\hat{\phi}_{i1} & 1 & \\
    & \ddots & & & \ddots & \\
    0 & & -\hat{\phi}_{im} & \cdots & -\hat{\phi}_{i1} & 1
    \end{pmatrix}.    (19)
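Constructing the estimated inverse working correlation from a sparse φ̂_i per (18) and (19) is then immediate. The sketch below uses our own helper names; zeros in φ̂_i simply leave the corresponding bands empty, so the k-nonzero-band structure carries over.

```python
import numpy as np

def T_hat(phi_hat, p):
    # m-banded unit lower-triangular Toeplitz matrix of eq. (19).
    T = np.eye(p)
    for j, ph in enumerate(phi_hat, start=1):
        if ph != 0.0:
            idx = np.arange(j, p)
            T[idx, idx - j] = -ph
    return T

def R_inv_hat(phi_hat, p):
    # Eq. (18): estimated inverse working correlation R_i^{-1} = T_i T_i^T.
    T = T_hat(phi_hat, p)
    return T @ T.T
```

Since T̂_i is unit lower-triangular, the product in (18) is positive definite by construction, regardless of which entries of φ̂_i the OMP step set to zero.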

Here, under the model assumption in (11), T_i is an m-banded Toeplitz matrix and contains only k nonzero bands.

Theorem 2. The estimator of the inverse working correlation matrix in (18) satisfies

    \|\hat{R}_i^{-1} - R_i^{-1}\| = O_P\!\left(\frac{\log p}{\sqrt{p}}\right).    (20)

The proof makes use of the inequality for a sparse symmetric matrix M with at most k nonzeros in each column: ‖M‖ ≤ k‖M‖_∞, where ‖·‖_∞ denotes the l-infinity norm; thus ‖R̂_i − R_i‖ = O_P(k‖R̂_i − R_i‖_∞). The result can then be obtained by applying the triangle inequality to the decomposition in Eq. (18). Note that (20) is independent of the sample size n, which implies that the working correlation matrix for each individual can be estimated with guaranteed convergence as p → ∞.

3. Numerical experiments

3.1. Example 1: Gaussian variables

Correlated, normally distributed responses were generated by the model

    Y_{ij} = X_{ij}\beta + \epsilon_{ij},    (21)

where i = 1, ..., n and j = 1, ..., p, X_{ij} is a scalar covariate generated from a standard normal distribution, and β is fixed at 0.5. The noise vector ϵ_i is sampled from a multivariate normal distribution N(0, Σ_i), where the inverse of the correlation


Table 1
The GEE-SICE estimates of predictors on hunger score.

Variable       | β       | Std. error | z value | p-value | 95% CI low | 95% CI up
Intercept      | 51.107  | 1.866      | 27.381  | <0.001  | 47.449     | 54.766
Meal           | −9.349  | 1.010      | −9.258  | <0.001  | −11.328    | −7.369
Healthy        | 2.053   | 1.447      | 1.418   | 0.156   | −0.784     | 4.889
Less healthy   | 1.654   | 1.333      | 1.241   | 0.215   | −0.959     | 4.266
PE             | 0.003   | 0.002      | 2.182   | 0.029   | 0.000      | 0.006
NE             | 0.011   | 0.003      | 3.314   | 0.001   | 0.004      | 0.017

matrix, R_i^{-1}, is generated by T_i following (18) and (19). The positions of the nonzero entries of the sparse vector φ_i are randomly sampled for each subject from a uniform distribution on the set {1, ..., m}. We fixed k = 5 and m = 20 and chose the nonzero subset φ_Γ = {−0.5, 0.2, −0.1, −0.1, 0.1} to impose stationarity on the sequence y_i. The covariance matrix was constructed from R_i using a constant variance σ_i^2 = 0.25, so that Σ_i = σ_i^2 R_i. The simulation was run 100 times for each combination of sample size from the set {2, 4, ..., 30} and cluster size from the set {100, 200, ..., 1000}. The code for estimation in regular GEE was adapted from GEEQBox (Shults and Ratcliffe, 2008) and modified for working-correlation-matrix estimation with SICE. The regular GEE with an AR(1) working correlation and the GEE combined with SICE (SICE-GEE) were both applied to the simulated data set. The performance of the two estimators was compared in terms of the bias of β̂(n, p) and the asymptotic relative efficiency (ARE), defined as the ratio between the variance of the GEE estimate and the variance of the SICE-GEE estimate. Fig. 1 shows the means of the estimated β̂ (top panel) and the mean ARE (bottom panel) over 100 simulations with varying n and p. The bars on each side of the point estimates represent ±2 standard deviations. In the graph for β̂, no significant difference can be observed between the biases of the estimates derived from GEE under AR(1) (blue) and those from SICE-GEE (red), which indicates that both GEE methods produce satisfactory, unbiased estimators with variance that decreases with increasing sample size and cluster size. In terms of efficiency, the bottom panel shows the mean ARE along with the maximum and minimum AREs of the set of simulations, indicated by the upper and lower ends of an error bar.
The SICE-GEE appears to significantly improve the efficiency of the estimates over the regular GEE when the sample size is greater than 10, with cluster size one or two orders of magnitude greater. As requested by a reviewer, we conducted a few additional simulation experiments to assess the performance of the SICE-GEE estimate in terms of bias and ARE. Additional conditions included (1) larger values of n (up to 100), (2) smaller values of p (as low as 10), and (3) independent observations (diagonal covariance). Fig. 2 summarizes the results, with the columns reflecting these conditions. From left to right, the columns represent n = 2, ..., 100 for p = 100, 200; for p = 10, 20; and for the independent case with p = 100, 200. Except for p = 10 with small n, the bias is generally small and does not appear to change with increasing n. SICE-GEE maintains higher efficiency, and the advantage diminishes in the independent cases.
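The Example 1 generator can be sketched as follows. This is our reconstruction of the described setup, with illustrative helper names; R is obtained by inverting T T^T and is left unnormalized (i.e., without rescaling its diagonal to exactly one), so it plays the role of the working covariance shape up to the constant variance σ_i^2.

```python
import numpy as np

def simulate_cluster(p, phi, beta=0.5, sigma2=0.25, rng=None):
    # One cluster from model (21): Y_j = X_j * beta + eps_j,
    # with eps ~ N(0, sigma2 * R) and R^{-1} = T T^T per eqs. (18)-(19).
    rng = np.random.default_rng() if rng is None else rng
    T = np.eye(p)
    for j, ph in enumerate(phi, start=1):
        idx = np.arange(j, p)
        T[idx, idx - j] = -ph
    R = np.linalg.inv(T @ T.T)
    x = rng.standard_normal(p)                       # scalar covariate per time point
    eps = rng.multivariate_normal(np.zeros(p), sigma2 * R)
    return x, x * beta + eps
```

Repeating this for n clusters and sweeping (n, p) over the grids described above reproduces the structure of the simulation design.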

3.2. Example 2: Emotion and eating behavior data set

This data set was mentioned in the Introduction. A total of 158 white, adult, non-obese women reported their eating behavior and momentary emotional states every two hours, six times per day, over 10 observation days (Lu et al., 2011). At each of the 60 time points of the ESM-based data collection, participants were asked to indicate the degree to which they were feeling each of 29 emotion descriptors. They were asked to place a mark on a 15-cm visual analogue scale (from ''not at all'' to ''very much''). These scales were then transformed into numerical scales ranging from 0 to 150 with unit gradations. The 29 emotion descriptors were grouped into two domains: positive emotions (PE) and negative emotions (NE). Scores for each domain were derived by summing the scores of the descriptors in that domain. Simultaneously, participants in the ESM also recorded the meal composition status as compared with a typical meal established at a baseline interview. Three response categories were allowed: same as baseline, healthy, or less healthy than baseline. A hunger score was also recorded as part of the ESM using a scale similar to that for the emotion descriptors. The value m = 36 was selected based on exploratory analysis. We used several values of k (5, 10, 15), and the results suggested that the β coefficients were not sensitive to the choice of k. An alternative would be to use cross-validation; because of sample-size limitations we did not, and the result for k = 10 is reported here. Table 1 lists the GEE-SICE estimates for several predictors of the hunger score, which is treated as continuous. Not surprisingly, hunger is significantly and negatively associated with having a meal (variable Meal). In addition, both PE and NE have an impact on hunger and, interestingly, in the same direction, i.e., both tend to increase the level of hunger.

4. Conclusion

By ''thumbnail sketching'' individual inverse covariances and suppressing unimportant details through limiting the number of nonzero entries in the inverse covariance matrix, we develop a realistic and efficient representation of potentially diverse working correlation structures within a GEE framework. We also show that the proposed procedure can improve


Fig. 1. (Top panel) Estimated β̂(n, p) with varying sample sizes and cluster sizes for simulated continuous data. The true value of β is 0.5. (Bottom panel) Estimated ARE(n, p) with varying sample sizes and cluster sizes for simulated continuous data. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

efficiency of estimates in the regression model for the mean function. Further work needs to be done on issues such as the determination of m (history of the autoregressive process) and k (number of allowed non-zeros in the inverse covariance), missing values, and irregular time intervals (e.g., see Wang and Carey, 2004). Unless the time series requires going back to a long history, the more important parameter in the proposed regularization scheme is k. Not unlike tuning parameters in a regularization approach (e.g., multiplier of the penalty term in a penalized approach), k can be selected by methods such as cross-validation. This work is now in progress.

Acknowledgments

This work is supported by National Institutes of Health grants R01AG031827A, 1U01HL101066-01, and UL1TR001420, and National Science Foundation grant SES-1424875. The authors thank Dr. Laurette Dubé for contributing the data set on dietary behavior and emotion.


Fig. 2. (Top panel) Estimated β̂(n, p) with varying sample sizes and cluster sizes for simulated continuous data. Columns 3 and 4 from the left represent estimates for n = 2, ..., 100, and columns 5 and 6 represent estimates for the independent case. The true value of β is 0.5. (Bottom panel) Estimated ARE(n, p).

References

Banerjee, O., El Ghaoui, L., d'Aspremont, A., 2008. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9, 485–516.
Blommaert, A., Hens, N., Beutels, Ph., 2014. Data mining for longitudinal data under multicollinearity and time dependence using penalized generalized estimating equations. Comput. Statist. Data Anal. 71, 667–680.
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3 (1), 1–122.
Dempster, A.P., 1972. Covariance selection. Biometrics 28, 157–175.
Fitzmaurice, G.M., 1995. A caveat concerning independence estimating equations with multivariate binary data. Biometrics 51, 309–317.
Friedman, J., Hastie, T., Tibshirani, R., 2008. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 (3), 432–441.
Fu, W., 2003. Penalized estimating equations. Biometrics 59, 126–132.
Huang, J., Liu, N., Pourahmadi, M., Liu, L., 2006. Covariance matrix selection and estimation via penalized normal likelihood. Biometrika 93, 85–98.
Kauermann, G., Carroll, R., 2001. A note on the efficiency of sandwich covariance matrix estimation. J. Amer. Statist. Assoc. 96, 1387–1396.
Lauritzen, S.L., 1996. Graphical Models. Oxford University Press, London.
Levina, E., Rothman, A., Zhu, J., 2008. Sparse estimation of large covariance matrices via a nested Lasso penalty. Ann. Appl. Stat. 2, 245–263.
Liang, K.-Y., Zeger, S.L., 1986. Longitudinal data analysis using generalized linear models. Biometrika 73, 13–22.
Lu, J., Huet, C., Dubé, L., 2011. Emotional reinforcement as a protective factor for healthy eating in home settings. Am. J. Clin. Nutr. 94 (1), 254–261.
Meinshausen, N., Bühlmann, P., 2006. High-dimensional graphs and variable selection with the Lasso. Ann. Statist. 34, 1436–1462.
Pourahmadi, M., 2000. Maximum likelihood estimation of generalized linear models for multivariate normal covariance matrix. Biometrika 87, 425–435.
Pourahmadi, M., 2011. Covariance estimation: The GLM and regularization perspectives. Statist. Sci. 26 (3), 369–387.
Rothman, A., Levina, E., Zhu, J., 2009. Generalized thresholding of large covariance matrices. J. Amer. Statist. Assoc. 104, 177–186.
Roy, S.S., Bhattacharya, S., 2013. Rate of convergence in the central limit theorem for parameter estimation in a causal, invertible ARMA(p, q) model. J. Time Series Anal. 34, 130–137.
Shults, J., Ratcliffe, S.J., 2008. GEEQBOX: A MATLAB toolbox for generalized estimating equations and quasi-least squares. J. Stat. Softw. 25 (14), 1–14.
Shumway, R.H., Stoffer, D.S., 2000. Time series analysis and its applications. Stud. Inf. Control 9 (4), 375–376.
Sutradhar, B., Das, K., 1999. On the efficiency of regression estimators in generalised linear models for longitudinal data. Biometrika 86, 459–465.
Sutradhar, B., Kumar, P., 2001. On the efficiency of extended generalized estimating equation approaches. Statist. Probab. Lett. 55, 53–61.
Tropp, J.A., Gilbert, A.C., 2007. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. Inform. Theory 53 (12), 4655–4666.
Wang, Y., Carey, V., 2003. Working correlation structure misspecification, estimation and covariate design: Implications for generalized estimating equations performance. Biometrika 90, 29–41.
Wang, Y., Carey, V., 2004. Unbiased estimating equations from working correlation models for irregularly timed repeated measures. J. Amer. Statist. Assoc. 99, 845–853.
Wang, Y.G., Hin, L.Y., 2010. Modeling strategies in longitudinal data analysis: covariate, variance function and correlation structure selection. Comput. Statist. Data Anal. 54 (12), 3359–3370.
Wu, W.B., Pourahmadi, M., 2003. Nonparametric estimation of large covariance matrices of longitudinal data. Biometrika 90 (4), 831–844.
Ye, H., Pan, J., 2006. Modelling of covariance structures in generalized estimating equations for longitudinal data. Biometrika 93, 927–941.
Yuan, X., 2012. Alternating direction method for covariance selection models. J. Sci. Comput. 51 (2), 261–273.
Zhang, T., Zou, H., 2014. Sparse precision matrix estimation via lasso penalized D-trace loss. Biometrika 101, 103–120.
Zimmerman, D., Núñez-Antón, V.A., 2010. Antedependence Models for Longitudinal Data. Monographs on Statistics and Applied Probability, vol. 112. CRC Press, Boca Raton, FL.