
Linear shrinkage estimation of large covariance matrices using factor models

Yuki Ikeda, Tatsuya Kubokawa*
Faculty of Economics, University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo 113-0033, Japan

Article history: Received 12 February 2015; available online 21 August 2016.
AMS subject classifications: 62F12; 62H12; 62J07.
Keywords: Covariance matrix; Factor model; High dimension; Large sample; Non-normal distribution; Normal distribution; Portfolio management; Ridge-type estimator; Risk function.

Abstract. The problem of estimating a large covariance matrix using a factor model is addressed when both the sample size and the dimension of the covariance matrix tend to infinity. We consider a general class of weighted estimators which includes (i) linear combinations of the sample covariance matrix and the model-based estimator under the factor model, and (ii) linear shrinkage estimators without factors, as special cases. The optimal weights in the class are derived; since they depend on unknown parameters, plug-in weighted estimators are proposed. Numerical results show that our method performs well. Finally, we provide an application to portfolio management.

1. Introduction

The estimation of a large covariance matrix is a fundamental subject in economics, financial engineering, biology, signal processing, and other fields. Accordingly, the problem has been widely studied. In the estimation of a $p \times p$ covariance matrix, the classical large-sample theory assumes that $p$ is fixed and the sample size $N$ tends to infinity. In this setting, the covariance matrix can be estimated by its sample version, which is a consistent estimator. However, in applications, we often encounter large data sets that contain high-dimensional variables. In this case, using the sample covariance matrix is inappropriate because it becomes singular when $p$ is larger than $N$. Even if $p < N$, the sample covariance matrix is unstable, as noted by Fan et al. [9].

Various methods have been proposed to estimate the covariance matrix in high dimensions. Ledoit and Wolf [18], Schäfer and Strimmer [23], Chen et al. [6], Fisher and Sun [12], Touloumis [27], and others suggested well-conditioned estimators that combine the sample covariance matrix with more stable statistics; these are called linear shrinkage or weighted estimators. Many other well-conditioned estimators, such as regularization and thresholding techniques and non-linear shrinkage methods, have been studied by Bickel and Levina [1,2], Rothman et al. [22], Cai and Liu [3], Cai and Zhou [4], Ledoit and Wolf [19], Ledoit and Wolf [20], Fan et al. [9], Fan et al. [10], Fan et al. [11], and others.

* Corresponding author. E-mail addresses: [email protected] (Y. Ikeda), [email protected] (T. Kubokawa).

http://dx.doi.org/10.1016/j.jmva.2016.08.001 0047-259X/© 2016 Elsevier Inc. All rights reserved.


Y. Ikeda, T. Kubokawa / Journal of Multivariate Analysis 152 (2016) 61–81

When additional information on the covariance matrix is available, we can improve on the sample covariance matrix. Factor models use correlations between the variables of interest and several covariates, called factors. Factor models have been used in many applications, particularly to explain stock returns in financial economics. Fama and French [8] found that excess asset returns are well explained by three factors, namely sensitivity to market excess return, market capitalization and book-to-price ratio. Ledoit and Wolf [17], Ren and Shimotsu [21] and Fan et al. [9] considered shrinking the sample covariance matrix toward statistics constructed from factor models. Ledoit and Wolf [17] and Ren and Shimotsu [21] suggested linear shrinkage estimators but did not address high-dimensional settings. Fan et al. [9] proposed a covariance estimator based on a strict factor model in high-dimensional settings. The strict factor model assumes independence among idiosyncratic components; hence, the error covariance matrix becomes diagonal. However, cross-sectional independence is restrictive in many applications, as noted by Chamberlain and Rothschild [5]. Recently, Fan et al. [10] applied a thresholding technique and suggested an invertible covariance estimator based on an approximate factor model, which allows cross-sectional correlations. Fan et al. [10] derived favorable convergence rates under sparsity of the covariance of the idiosyncratic components when both $p$ and $N$ tend to infinity.

To describe the results in the present paper, let $y$ and $x$ be a variable of interest and a factor variable, respectively. The variance and covariance matrices of $y$ and $x$ are denoted by $\Sigma_y$, $\Sigma_x$ and $\Sigma_{yx}$, and their sample counterparts by $S_y$, $S_x$ and $S_{yx}$. The problem is the estimation of the $p \times p$ matrix $\Sigma_y$, but the sample covariance matrix $S_y$ is ill-conditioned in high dimensions. When $S_y$ is expressed as
$$S_y = S_{y|x} + S_{yx} S_x^{-1} S_{xy} \qquad \text{for } S_{y|x} = S_y - S_{yx} S_x^{-1} S_{xy},$$
the linear shrinkage estimators given in the literature shrink the term $S_{y|x}$. In this paper, we consider shrinking the term $S_{yx} S_x^{-1} S_{xy}$ as well as $S_{y|x}$. Specifically, we suggest a double shrinkage estimator, denoted by $\delta_{\alpha,\beta}$, with weights $\alpha$ and $\beta$. We obtain the optimal weights and derive approximations and estimators thereof, which lead to the proposed plug-in double shrinkage estimator of $\Sigma_y$.

In evaluating the optimal weights $\alpha$ and $\beta$, their estimators and their large-sample properties, we consider two cases: (i) $\mathrm{tr}(\Sigma_y^2) = O(p^2)$ and (ii) $\mathrm{tr}(\Sigma_y^2) = O(p)$. Here $\Sigma_y$ is expressed as $\Sigma_y = \Sigma_{y|x} + \Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy} = \Sigma_{y|x} + B\Sigma_x B^\top$ for $\Sigma_{y|x} = \Sigma_y - \Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy}$, because the factor loading matrix $B$ corresponds to $\Sigma_{yx}\Sigma_x^{-1}$, as seen in Section 2. For $\Sigma_{y|x}$, we allow cross-sectional correlations among the idiosyncratic components and assume that $\mathrm{tr}(\Sigma_{y|x}^2) = O(p)$, which corresponds to the boundedness of the eigenvalues of $\Sigma_{y|x}$ given in [10]. Thus, case (i) holds if the factor loadings are dense, specifically $B^\top B = O(p)$. On the other hand, when the maximal number of non-zero elements over the columns of the factor loading matrix $B$ is uniformly bounded with respect to $N$ and $p$, we have case (ii). Hence, when the factor loadings are dense, assumption (i) is reasonable, and when the factor loadings are sparse or not dense, assumption (ii) is appropriate. In this paper, we treat the two cases, depending on the density of the factor loadings.

Concerning the choice of loss function, we use the unnormalized loss $\mathrm{tr}(\delta - \Sigma_y)^2$ for an estimator $\delta$ of $\Sigma_y$, while Fan et al. [9] and Fan et al. [10] mainly work with the normalized quadratic loss function $\mathrm{tr}(\delta\Sigma_y^{-1} - I)^2$. Fan et al. [11] noted that the unnormalized loss functions of many estimators diverge even if the growth rate of $p$ is moderate (pp. 614-615). Because the unnormalized loss function is widely used, however, we treat it in this paper.

The remainder of this paper is organized as follows. In Section 2, we introduce the multivariate model and the factor model; we then express the factor model in the framework of the multivariate model. The double shrinkage estimators are introduced and the optimal weights are derived under appropriate assumptions for non-sparsity and sparsity of the factor loadings. The mean squared errors of the estimators with the optimal weights are provided. In Section 3, we give several estimators of the unknown parameters included in the optimal weights and suggest plug-in estimators obtained by substituting these estimators into the optimal weights. Section 4 reports numerical studies, and Section 5 presents an application to portfolio management. Concluding remarks are given in Section 6, and technical proofs are provided in the Appendix.

2. Linear shrinkage estimators

2.1. Multivariate and factor models

Consider the following multivariate model: $N$ random pairs $(y_1, x_1), \ldots, (y_N, x_N)$ are mutually independently and identically distributed with $E(y_i) = \mu_y$, $E(x_i) = \mu_x$, $E\{(y_i - \mu_y)(y_i - \mu_y)^\top\} = \Sigma_y$, $E\{(x_i - \mu_x)(x_i - \mu_x)^\top\} = \Sigma_x$ and $E\{(y_i - \mu_y)(x_i - \mu_x)^\top\} = \Sigma_{yx} = \Sigma_{xy}^\top$, where $y_i$ and $x_i$ are, respectively, $p$- and $q$-dimensional vectors, namely
$$\begin{pmatrix} y_i \\ x_i \end{pmatrix} \sim \text{i.i.d.}\left[\begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}, \begin{pmatrix} \Sigma_y & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_x \end{pmatrix}\right]. \tag{2.1}$$

Let
$$S_y = \frac{1}{n}\sum_{i=1}^N (y_i - \bar y)(y_i - \bar y)^\top, \qquad S_x = \frac{1}{n}\sum_{i=1}^N (x_i - \bar x)(x_i - \bar x)^\top$$


and
$$S_{yx} = S_{xy}^\top = \frac{1}{n}\sum_{i=1}^N (y_i - \bar y)(x_i - \bar x)^\top,$$

where $n = N - 1$ and $\bar y$ and $\bar x$ are the sample means. We consider the problem of estimating the covariance matrix $\Sigma_y$ of the $y_i$'s. A standard estimator of $\Sigma_y$ is the sample covariance matrix $S_y$, but it is ill-conditioned when $p > n$. To improve the performance of $S_y$ using the information in the $x_i$'s, the following factor model has been used in the literature: for $i = 1, \ldots, N$,
$$y_i = a + B x_i + \epsilon_i, \qquad \epsilon_i \sim \text{i.i.d.}(0, \Psi), \qquad x_i \sim \text{i.i.d.}(\mu_x, \Sigma_x), \tag{2.2}$$
where the $\epsilon_i$'s are idiosyncratic error components that are not correlated with the $x_i$'s, and $a$ and $B$ are, respectively, a $p$-variate unknown vector and a $p \times q$ unknown matrix. Here, $\Psi$ is a $p \times p$ unknown positive definite symmetric matrix with some structure. For example, Ledoit and Wolf [17] used the diagonal matrix $\Psi = \mathrm{diag}(\psi_1, \ldots, \psi_p)$. Another restriction is the sphericity $\Psi = \psi I_p$ for the $p \times p$ identity matrix $I_p$ and a scalar $\psi$. In the literature, factor model (2.2) and multivariate model (2.1) are also called a strict factor model and an approximate factor model, respectively.

Under factor model (2.2), the mean vector and the covariance matrix of $y_i$ are $\mu_{y(f)} = a + B\mu_x$ and $\Sigma_{y(f)} = B\Sigma_x B^\top + \Psi$. Additionally, the covariance of $y_i$ and $x_i$ is $\Sigma_{yx(f)} = E\{(y_i - \mu_{y(f)})(x_i - \mu_x)^\top\} = B\Sigma_x$. In this paper, we assume multivariate model (2.1) as the underlying model, and we consider the case where factor model (2.2) is estimated, but uncertain. To treat the factor model in the framework of the multivariate model, suppose that $\mu_y = \mu_{y(f)} = a + B\mu_x$, $\Sigma_{yx} = \Sigma_{yx(f)} = B\Sigma_x$ and $\Sigma_y = \Sigma_{y(f)} = B\Sigma_x B^\top + \Psi$. Then, $a$, $B$ and $\Psi$ correspond to
$$a = \mu_y - \Sigma_{yx}\Sigma_x^{-1}\mu_x, \qquad B = \Sigma_{yx}\Sigma_x^{-1}, \qquad \Psi = \Sigma_y - \Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy} \equiv \Sigma_{y|x}. \tag{2.3}$$

We recall that $\Psi$ in the factor model is a restricted matrix such as a diagonal or spherical matrix. The correspondence $\Psi = \Sigma_{y|x}$ assumes that $\Psi$ is a function of $\Sigma_{y|x}$. Two typical examples are
$$\Psi(\Sigma_{y|x}) = \mathrm{diag}(\Sigma_{y|x}) \qquad \text{and} \qquad \Psi(\Sigma_{y|x}) = p^{-1}\mathrm{tr}(\Sigma_{y|x}) I_p, \tag{2.4}$$
where $\mathrm{diag}(A) = \mathrm{diag}(a_{11}, \ldots, a_{pp})$ for a $p \times p$ matrix $A = (a_{ij})$. Thus, in the framework of multivariate model (2.1), it is reasonable to express the factor model as
$$y_i = (\mu_y - \Sigma_{yx}\Sigma_x^{-1}\mu_x) + \Sigma_{yx}\Sigma_x^{-1} x_i + \epsilon_i, \qquad \epsilon_i \sim \text{i.i.d.}[0, \Psi(\Sigma_{y|x})], \qquad i = 1, \ldots, N, \qquad x_i \sim \text{i.i.d.}(\mu_x, \Sigma_x), \tag{2.5}$$
where $\Psi(\Sigma_{y|x})$ is a function of $\Sigma_{y|x}$. The covariance matrix of $y_i$ under the factor model (2.5) is
$$\Sigma_{y(f)} = \Psi(\Sigma_{y|x}) + \Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy}. \tag{2.6}$$
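To make these quantities concrete, the following NumPy sketch (with hypothetical sizes; not code from the paper) forms the sample matrices $S_y$, $S_x$, $S_{yx}$ with divisor $n = N - 1$, checks the decomposition $S_y = S_{y|x} + S_{yx} S_x^{-1} S_{xy}$, and evaluates the two restriction maps in (2.4):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q = 50, 8, 3          # hypothetical sample size and dimensions
n = N - 1

# simulate from a factor structure: y_i = B x_i + eps_i
B = rng.normal(size=(p, q))
X = rng.normal(size=(N, q))
Y = X @ B.T + rng.normal(size=(N, p))

Yc = Y - Y.mean(axis=0)     # centered data
Xc = X - X.mean(axis=0)
S_y  = Yc.T @ Yc / n        # sample covariance of y
S_x  = Xc.T @ Xc / n
S_yx = Yc.T @ Xc / n

# regression part S_yx S_x^{-1} S_xy and residual part S_{y|x}
reg   = S_yx @ np.linalg.solve(S_x, S_yx.T)
S_ygx = S_y - reg
assert np.allclose(S_y, S_ygx + reg)   # the decomposition of S_y holds

# the two restriction maps Psi of (2.4), applied to S_{y|x}
Psi_diag   = np.diag(np.diag(S_ygx))            # diag(S_{y|x})
Psi_sphere = (np.trace(S_ygx) / p) * np.eye(p)  # p^{-1} tr(S_{y|x}) I_p
```

Both maps preserve the trace of their argument, which is why the sphericity case (2.17) below combines additively across the two terms of the decomposition.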

When a joint multivariate normal distribution is assumed for $y_i$ and $x_i$, the expression (2.5) can be derived from the conditional distribution of $y_i$ given $x_i$. That is, under the assumption that
$$\begin{pmatrix} y_i \\ x_i \end{pmatrix} \sim \mathcal{N}_{p+q}\left[\begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}, \begin{pmatrix} \Sigma_y & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_x \end{pmatrix}\right], \tag{2.7}$$
the marginal distribution of $y_i$ and the conditional distribution of $y_i$ given $x_i$ are
$$y_i \sim \mathcal{N}_p(\mu_y, \Sigma_y), \qquad y_i \mid x_i \sim \mathcal{N}_p(\mu_y + \Sigma_{yx}\Sigma_x^{-1}(x_i - \mu_x), \Sigma_{y|x}).$$
The conditional distribution is written as
$$y_i = (\mu_y - \Sigma_{yx}\Sigma_x^{-1}\mu_x) + \Sigma_{yx}\Sigma_x^{-1} x_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}_p(0, \Sigma_{y|x}), \tag{2.8}$$
which corresponds to the factor model (2.5) upon replacing $\Sigma_{y|x}$ with a restricted matrix $\Psi(\Sigma_{y|x})$. This implies that the factor model uses a restricted matrix $\Psi(\Sigma_{y|x})$ such as (2.4), while the multivariate model corresponds to the unrestricted case $\Psi = \Sigma_{y|x}$.


2.2. Linear shrinkage estimators

We now consider the estimation of the covariance matrix $\Sigma_y$ of $y_i$. The sample covariance matrix $S_y$ is not invertible for $p > n$ and requires modification. Ledoit and Wolf [17] considered shrinking $S_y$ toward the estimator constructed from the factor model. Under the factor model (2.5), the covariance matrix of $y_i$ is $\Sigma_{y(f)} = \Psi(\Sigma_{y|x}) + \Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy}$, which gives the estimator
$$\delta_f = \Psi(S_{y|x}) + S_{yx} S_x^{-1} S_{xy} \qquad \text{for } S_{y|x} = S_y - S_{yx} S_x^{-1} S_{xy}, \tag{2.9}$$
where $\Psi(S_{y|x})$ is a restricted matrix such as $\Psi(S_{y|x}) = \mathrm{diag}(S_{y|x})$ or $\Psi(S_{y|x}) = p^{-1}\mathrm{tr}(S_{y|x}) I_p$. Although $\delta_f$ is well-conditioned and invertible, it is uncertain whether such a restriction on $\Psi$ holds. Thus, it is reasonable to consider the linear shrinkage estimator
$$\delta_w = w S_y + (1 - w)\delta_f = S_y - (1 - w)(S_y - \delta_f), \tag{2.10}$$
where $w$ is a constant satisfying $0 \le w \le 1$. Using the expression (2.9), we rewrite $\delta_w$ as
$$\delta_w = w S_{y|x} + (1 - w)\Psi(S_{y|x}) + S_{yx} S_x^{-1} S_{xy}. \tag{2.11}$$

The sample covariance matrix $S_y$ is decomposed as $S_y = S_{y|x} + S_{yx} S_x^{-1} S_{xy}$. Thus, the weighted estimator $\delta_w$ shrinks the part $S_{y|x}$ of $S_y$ toward $\Psi(S_{y|x})$. When the true matrix $\Sigma_{y|x}$ is distant from the restriction $\Psi(\Sigma_{y|x})$ in the factor model, the optimal weight $w$ takes a value close to 1. Then, $\delta_w$ approaches the sample covariance matrix, and we face the ill-conditioning problem again in the high-dimensional case. To fix this problem, we consider shrinking the term $S_{yx} S_x^{-1} S_{xy}$ toward a restricted statistic $\Psi(S_{yx} S_x^{-1} S_{xy})$ as well. The resulting double shrinkage estimator is of the form
$$\delta_{\alpha,\beta} = \alpha S_{y|x} + (1 - \alpha)\Psi(S_{y|x}) + \beta S_{yx} S_x^{-1} S_{xy} + (1 - \beta)\Psi(S_{yx} S_x^{-1} S_{xy}), \tag{2.12}$$
where $\alpha$ and $\beta$ are constants satisfying $0 \le \alpha \le 1$ and $0 \le \beta \le 1$. Two examples of $\Psi(S_{yx} S_x^{-1} S_{xy})$ are the diagonal matrix $\mathrm{diag}(S_{yx} S_x^{-1} S_{xy})$ and the spherical matrix $p^{-1}\mathrm{tr}(S_{yx} S_x^{-1} S_{xy}) I_p$. The class of double shrinkage estimators includes $\delta_w$ and other specific estimators. For example, $\delta_{\alpha,\beta}$ with $\alpha = w$ and $\beta = 1$ is the estimator $\delta_w$. When $\alpha = \beta = \gamma$, the estimator $\delta_{\gamma,\gamma}$ reduces to
$$\delta_\gamma = \gamma S_y + (1 - \gamma)\frac{1}{p}\mathrm{tr}(S_y) I_p \tag{2.13}$$
in the spherical case for $\Psi(S_{y|x})$ and $\Psi(S_{yx} S_x^{-1} S_{xy})$, and to
$$\delta_\gamma = \gamma S_y + (1 - \gamma)\mathrm{diag}(S_y) \tag{2.14}$$
in the diagonal case. These estimators are obtained by shrinking the sample covariance matrix $S_y$ toward the spherical matrix $p^{-1}\mathrm{tr}(S_y) I_p$ or the diagonal matrix $\mathrm{diag}(S_y)$, and have been studied by Ledoit and Wolf [18], Fisher and Sun [12], Chen et al. [6], and others.

2.3. Approximation of optimal weights under normality

The double shrinkage estimator $\delta_{\alpha,\beta}$ given in (2.12) depends on $\alpha$ and $\beta$. We derive the optimal weights $\alpha$ and $\beta$ and provide their asymptotic approximations, where an estimator $\delta$ of $\Sigma_y$ is evaluated by $\mathrm{Risk}(\delta) = p^{-1} E[\mathrm{tr}\{(\delta - \Sigma_y)^2\}]$. Straightforward calculations yield the following result.

Lemma 1. If the function $\Psi$ satisfies $\Psi(A) + \Psi(B) = \Psi(A + B)$ for matrices $A$ and $B$, then the risk of the estimator $\delta_{\alpha,\beta}$ is written as
$$\mathrm{Risk}(\delta_{\alpha,\beta}) = (\alpha, \beta)\begin{pmatrix} J_{11} & J_{12} \\ J_{12} & J_{22} \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} - 2 (J_{10}, J_{20})\begin{pmatrix} \alpha \\ \beta \end{pmatrix} + R_0, \tag{2.15}$$
which gives the optimal weights $\alpha^*$ and $\beta^*$ from
$$\begin{pmatrix} \alpha^* \\ \beta^* \end{pmatrix} = \begin{pmatrix} J_{11} & J_{12} \\ J_{12} & J_{22} \end{pmatrix}^{-1}\begin{pmatrix} J_{10} \\ J_{20} \end{pmatrix}. \tag{2.16}$$
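Equation (2.16) is simply the first-order condition for the quadratic (2.15); writing $J = (J_{kl})_{k,l \in \{1,2\}}$ and assuming $J$ is positive definite, the one-step derivation reads:

```latex
\frac{\partial}{\partial(\alpha,\beta)^{\top}}\,\mathrm{Risk}(\delta_{\alpha,\beta})
  = 2\begin{pmatrix} J_{11} & J_{12} \\ J_{12} & J_{22} \end{pmatrix}
     \begin{pmatrix} \alpha \\ \beta \end{pmatrix}
  - 2\begin{pmatrix} J_{10} \\ J_{20} \end{pmatrix} = 0
\quad\Longrightarrow\quad
\begin{pmatrix} \alpha^{*} \\ \beta^{*} \end{pmatrix}
  = \begin{pmatrix} J_{11} & J_{12} \\ J_{12} & J_{22} \end{pmatrix}^{-1}
    \begin{pmatrix} J_{10} \\ J_{20} \end{pmatrix},
```

and positive definiteness of $J$ guarantees that this stationary point is the global minimum.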

The resulting risk of the estimator $\delta_{\alpha^*,\beta^*}$ with the optimal weights $\alpha^*$ and $\beta^*$ is
$$\mathrm{Risk}(\delta_{\alpha^*,\beta^*}) = R_0 - (J_{10}, J_{20})\begin{pmatrix} J_{11} & J_{12} \\ J_{12} & J_{22} \end{pmatrix}^{-1}\begin{pmatrix} J_{10} \\ J_{20} \end{pmatrix},$$


where
$$R_0 = p^{-1} E\,\mathrm{tr}[\{\Sigma_y - \Psi(S_y)\}^2], \qquad J_{10} = p^{-1} E\,\mathrm{tr}[\{S_{y|x} - \Psi(S_{y|x})\}\{\Sigma_y - \Psi(S_y)\}],$$
$$J_{20} = p^{-1} E\,\mathrm{tr}[\{S_{yx}S_x^{-1}S_{xy} - \Psi(S_{yx}S_x^{-1}S_{xy})\}\{\Sigma_y - \Psi(S_y)\}], \qquad J_{11} = p^{-1} E\,\mathrm{tr}[\{S_{y|x} - \Psi(S_{y|x})\}^2],$$
$$J_{22} = p^{-1} E\,\mathrm{tr}[\{S_{yx}S_x^{-1}S_{xy} - \Psi(S_{yx}S_x^{-1}S_{xy})\}^2], \qquad J_{12} = p^{-1} E\,\mathrm{tr}[\{S_{y|x} - \Psi(S_{y|x})\}\{S_{yx}S_x^{-1}S_{xy} - \Psi(S_{yx}S_x^{-1}S_{xy})\}].$$

We evaluate the moments given in Lemma 1 under normality assumption (2.7). Hereafter, we focus on the spherical cases of $\Psi$, namely
$$\Psi(S_{y|x}) = p^{-1}\mathrm{tr}(S_{y|x}) I_p, \qquad \Psi(S_{yx}S_x^{-1}S_{xy}) = p^{-1}\mathrm{tr}(S_{yx}S_x^{-1}S_{xy}) I_p, \tag{2.17}$$
and $\Psi(\Sigma_y)$, $\Psi(\Sigma_{y|x})$ and $\Psi(\Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy})$ are defined similarly. We assume the following conditions:

(A1) $n$, $p$, and $q$ satisfy $(n, p) \to \infty$, $n \ge q$, and $q$ is bounded.
(A2) $a_1 = p^{-1}\mathrm{tr}(\Sigma_y) = O(1)$, $b_1 = p^{-1}\mathrm{tr}(\Sigma_{y|x}) = O(1)$, $b_2 = p^{-1}\mathrm{tr}(\Sigma_{y|x}^2) = O(1)$, $\phi_{11} = p^{-1}\mathrm{tr}(\Sigma_y \Sigma_{y|x}) = O(1)$ and $(\Sigma_{xy})_{ij} = O(1)$, where $(M)_{ij}$ denotes the $(i, j)$-element of a matrix $M$.

Assumption (A1) concerns the high-dimensional case in which both $n$ and $p$ tend to infinity, but $q$ is bounded. This restriction enables us to use standard results about the Wishart distribution for $n S_x$. Assumption (A2) permits non-zero off-diagonal elements of $\Sigma_{y|x}$, i.e., cross-sectional correlations of the idiosyncratic components, although these elements cannot be too dense because $b_2 = p^{-1}\mathrm{tr}(\Sigma_{y|x}^2) = O(1)$. The corresponding assumption in [10] is the boundedness of the eigenvalues of $\Sigma_{y|x}$: constants $0 < c_1 < c_2$ exist such that $c_1 < \lambda_{\min}(\Sigma_{y|x}) \le \lambda_{\max}(\Sigma_{y|x}) < c_2$, where $\lambda_{\min}(\Sigma_{y|x})$ and $\lambda_{\max}(\Sigma_{y|x})$ denote the smallest and largest eigenvalues of $\Sigma_{y|x}$. This implies that $b_2 = O(1)$. Moreover, Fan et al. [10] assumed sparsity of $\Sigma_{y|x}$, which restricts the maximal number of non-zero elements over the rows of $\Sigma_{y|x}$ to the order $o(\sqrt{N/\log p})$. Thus, Fan et al. [10] treated a special case of the class of $\Sigma_{y|x}$ considered in this paper. One typical example is the covariance matrix with AR(1) structure, $(\Sigma_{y|x})_{ij} = \rho^{|i-j|}$, which satisfies $b_2 = O(1)$ but is clearly not sparse. Note that the sparsity assumption guarantees the fast rates of convergence of the estimators given in [10]. Additionally, Fan et al. [10] allow auto-correlations among the $x_i$'s, while we assume independence of the $x_i$'s.

Let $a_2 = p^{-1}\mathrm{tr}(\Sigma_y^2)$. Concerning the order of $a_2$, we need to treat the two cases
$$\text{(i) } a_2 = O(p) \qquad \text{or} \qquad \text{(ii) } a_2 = O(1).$$
Because $\Sigma_{yx} = B\Sigma_x$ from (2.3), we have $\Sigma_y = \Sigma_{y|x} + B\Sigma_x B^\top$, which yields $\mathrm{tr}(\Sigma_y^2) = \mathrm{tr}(\Sigma_{y|x}^2) + 2\,\mathrm{tr}(\Sigma_{y|x} B\Sigma_x B^\top) + \mathrm{tr}\{(B\Sigma_x B^\top)^2\}$. We consider the case where every element of the factor loading matrix $B = (b_1, \ldots, b_q)$ is uniformly bounded with respect to $(n, p)$, and the number of non-zero elements in every column vector $b_j$ is of the order $O(p^c)$ for some non-negative constant $c$. Then, $a_2 = O(p^{\max\{2c-1,\,0\}})$. Hence, $a_2 = O(p)$ if the number of non-zero elements in every column vector $b_j$ is of the order $O(p)$, and $a_2 = O(1)$ if it is of the order $O(p^{1/2})$. The former and latter cases correspond to non-sparsity and sparsity of the factor loadings, respectively.

Lemma 2. Assume conditions (A1), (A2), and non-sparsity condition $a_2 = O(p)$. Assume that the $(y_i, x_i)$'s are normally distributed as in (2.7). When $\Psi$ satisfies sphericity (2.17), one has
$$R_0 = a_2 - a_1^2 + O(n^{-1}), \qquad J_{10} = \phi_{11} - a_1 b_1 + O(n^{-1}), \qquad J_{20} = a_2 - a_1^2 - \phi_{11} + a_1 b_1 + O(n^{-1}),$$
$$J_{11} = b_2 - b_1^2 + n^{-1} p b_1^2 + O(n^{-1}) + O(n^{-2} p),$$
$$J_{22} = (1 + n^{-1}) a_2 - 2\phi_{11} + b_2 - (a_1 - b_1)^2 + n^{-1} p (a_1^2 - b_1^2) + O(n^{-1}) + O(n^{-2} p), \tag{2.18}$$
$$J_{12} = \phi_{11} - a_1 b_1 - b_2 + b_1^2 + O(n^{-1}).$$

Below, we also provide approximations of the moments under sparsity condition $a_2 = O(1)$; the proof is omitted.

Lemma 3. Assume conditions (A1), (A2), and sparsity condition $a_2 = O(1)$. Assume that the $(y_i, x_i)$'s are normally distributed as in (2.7). When $\Psi$ satisfies sphericity (2.17), the terms $R_0$ and $J_{22}$ in (2.18) should be replaced with $R_0 = a_2 - a_1^2 + O(n^{-1} p^{-1})$ and $J_{22} = a_2 - 2\phi_{11} + b_2 - (a_1 - b_1)^2 + n^{-1} p (a_1^2 - b_1^2) + O(n^{-1}) + O(n^{-2} p)$. The evaluation of the other $J_{ij}$ terms under sparsity of the factor loadings is the same as in Lemma 2.


Using these results, we evaluate the risk functions of $S_y$, $\delta_{w^*}$, $\delta_{\gamma^*}$, and $\delta_{\alpha^*,\beta^*}$ as follows.

[Risk of $S_y$] First, it follows from (A.13) that the risk of $S_y$ is
$$\mathrm{Risk}(S_y) = R_0 + J_{11} + 2J_{12} + J_{22} - 2(J_{10} + J_{20}) = a_2/n + (p/n) a_1^2 = O(p/n).$$

[Risk of $\delta_{w^*}$] Second, in the case of $\alpha = w$ and $\beta = 1$, using (2.18), we evaluate the risk of $\delta_w$ as $\mathrm{Risk}(\delta_w) = J_{11} w^2 - 2(J_{10} - J_{12}) w + J_{22} - 2J_{20} + R_0$, which is minimized at
$$w^* = \frac{J_{10} - J_{12}}{J_{11}} = \frac{b_2 - b_1^2 + O(n^{-1})}{b_2 - (1 - p/n) b_1^2 + O(n^{-1}) + O(n^{-2} p)}. \tag{2.19}$$
Thus, the risk of $\delta_{w^*}$ is given by
$$\mathrm{Risk}(\delta_{w^*}) = R_0 + J_{22} - 2J_{20} - (J_{10} - J_{12})^2/J_{11} = \mathrm{Risk}(S_y) - (J_{11} + J_{12} - J_{10})^2/J_{11}, \tag{2.20}$$
which shows that $\delta_{w^*}$ improves on $S_y$. The risk can be approximated as
$$\mathrm{Risk}(\delta_{w^*}) = \frac{a_2}{n} + \frac{p}{n} a_1^2 - \frac{\{n^{-1} p b_1^2 + O(n^{-1}) + O(n^{-2} p)\}^2}{b_2 - (1 - p/n) b_1^2 + O(n^{-1}) + O(n^{-2} p)} = O(p/n); \tag{2.21}$$
specifically,
$$\mathrm{Risk}(\delta_{w^*}) - \mathrm{Risk}(S_y) \approx \begin{cases} -(p/n)^2 b_1^4/(b_2 - b_1^2) & \text{when } p/n \to 0, \\ -c^2 b_1^4/\{b_2 - (1 - c) b_1^2\} & \text{when } p/n \to c, \\ -(p/n) b_1^2 & \text{when } p/n \to \infty, \end{cases} \tag{2.22}$$
for some constant $c$. Because $\mathrm{Risk}(S_y) = O(p/n)$, the difference between $\mathrm{Risk}(\delta_{w^*})$ and $\mathrm{Risk}(S_y)$ appears in the negligible order term $(p/n)^2$ when $p/n \to 0$, but in the leading term $p/n$ when $p/n \to \infty$ or $p/n \to c$. Thus, the improvement of $\delta_{w^*}$ over $S_y$ is significant unless $p/n \to 0$.

[Risk of $\delta_{\gamma^*}$] Third, in the case of $\alpha = \beta = \gamma$, from [6], the optimal weight $\gamma$ of $\delta_\gamma$ in (2.13) is
$$\gamma^* = \frac{J_{10} + J_{20}}{J_{11} + 2J_{12} + J_{22}} = \frac{a_2 - a_1^2 + O(n^{-1})}{(1 + 1/n) a_2 - (1 - p/n) a_1^2 + O(n^{-1}) + O(n^{-2} p)}, \tag{2.23}$$
so that the risk of $\delta_{\gamma^*}$ is given by
$$\mathrm{Risk}(\delta_{\gamma^*}) = R_0 - \frac{(J_{10} + J_{20})^2}{J_{11} + 2J_{12} + J_{22}} = \mathrm{Risk}(S_y) - \frac{(J_{11} + 2J_{12} + J_{22} - J_{10} - J_{20})^2}{J_{11} + 2J_{12} + J_{22}}, \tag{2.24}$$
which shows that $\delta_{\gamma^*}$ improves on $S_y$. The risk can be approximated as
$$\mathrm{Risk}(\delta_{\gamma^*}) = \frac{a_2}{n} + \frac{p}{n} a_1^2 - \frac{\{n^{-1} a_2 + n^{-1} p a_1^2 + O(n^{-1}) + O(n^{-2} p)\}^2}{(1 + n^{-1}) a_2 - (1 - p/n) a_1^2 + O(n^{-1}) + O(n^{-2} p)}.$$
Under sparsity condition $a_2 = O(1)$, the risk difference is
$$\mathrm{Risk}(\delta_{\gamma^*}) - \mathrm{Risk}(S_y) \approx \begin{cases} -(p/n)^2 a_1^4/(a_2 - a_1^2) & \text{when } p/n \to 0, \\ -c^2 a_1^4/\{a_2 - (1 - c) a_1^2\} & \text{when } p/n \to c, \\ -(p/n) a_1^2 & \text{when } p/n \to \infty. \end{cases} \tag{2.25}$$
Similarly to the discussion below (2.22), one can see that the improvement of $\delta_{\gamma^*}$ over $S_y$ is significant unless $p/n \to 0$. Under non-sparsity condition $a_2 = O(p)$, however, we have
$$\mathrm{Risk}(\delta_{\gamma^*}) - \mathrm{Risk}(S_y) \approx -\frac{p}{n^2}\,\frac{(a_2/p + a_1^2)^2}{a_2/p}. \tag{2.26}$$
Because $\mathrm{Risk}(S_y) = O(p/n)$, the risk difference appears in the negligible order term $O(p/n^2)$.

[Risk of $\delta_{\alpha^*,\beta^*}$] Finally, it follows from Lemma 1 that the risk of $\delta_{\alpha^*,\beta^*}$ is written as
$$\mathrm{Risk}(\delta_{\alpha^*,\beta^*}) = R_0 - \frac{J_{22} J_{10}^2 - 2 J_{12} J_{10} J_{20} + J_{11} J_{20}^2}{J_{11} J_{22} - J_{12}^2}.$$
To compare this risk with the risks of $\delta_{w^*}$ and $\delta_{\gamma^*}$, we express $\mathrm{Risk}(\delta_{\alpha^*,\beta^*})$ as
$$\mathrm{Risk}(\delta_{\alpha^*,\beta^*}) = \mathrm{Risk}(\delta_{w^*}) - \frac{\{(J_{10} - J_{12}) J_{12} + (J_{22} - J_{20}) J_{11}\}^2}{J_{11}(J_{11} J_{22} - J_{12}^2)} = \mathrm{Risk}(\delta_{\gamma^*}) - \frac{\{(J_{22} + J_{12}) J_{10} - (J_{11} + J_{12}) J_{20}\}^2}{(J_{11} + 2J_{12} + J_{22})(J_{11} J_{22} - J_{12}^2)},$$
which shows that $\delta_{\alpha^*,\beta^*}$ improves on $\delta_{w^*}$ and $\delta_{\gamma^*}$. Using Lemma 2, we find that
$$(J_{10} - J_{12}) J_{12} + (J_{22} - J_{20}) J_{11} \approx \frac{p}{n}\left[(b_2 - b_1^2)\left(\frac{a_2}{p} + a_1^2\right) + b_1^2\left\{\frac{a_2}{n} - \phi_{11} + a_1 b_1 + \frac{p}{n}(a_1^2 - b_1^2)\right\}\right],$$
$$J_{11} J_{22} - J_{12}^2 \approx \left(b_2 - b_1^2 + \frac{p}{n} b_1^2\right)\left\{a_2 - 2\phi_{11} + b_2 - (a_1 - b_1)^2 + \frac{p}{n}\left(\frac{a_2}{p} + a_1^2 - b_1^2\right)\right\} - (\phi_{11} - a_1 b_1 - b_2 + b_1^2)^2.$$
Because $J_{11} \approx b_2 - b_1^2 + (p/n) b_1^2$, we can verify that under condition $a_2 = O(1)$,
$$\mathrm{Risk}(\delta_{\alpha^*,\beta^*}) - \mathrm{Risk}(\delta_{w^*}) = \begin{cases} O\{(p/n)^2\} & \text{when } p/n \to 0, \\ O(1) & \text{when } p/n \to c, \\ O(p/n) & \text{when } p/n \to \infty, \end{cases} \tag{2.27}$$
and $\mathrm{Risk}(\delta_{\alpha^*,\beta^*}) - \mathrm{Risk}(\delta_{w^*}) = O(p/n^2)$ under condition $a_2 = O(p)$. Concerning the risk difference $\mathrm{Risk}(\delta_{\alpha^*,\beta^*}) - \mathrm{Risk}(\delta_{\gamma^*})$, we note that
$$(J_{22} + J_{12}) J_{10} - (J_{11} + J_{12}) J_{20} \approx \frac{p}{n}\left\{(\phi_{11} - a_1 b_1)\left(\frac{a_2}{p} + a_1^2\right) - b_1^2 (a_2 - a_1^2)\right\}, \qquad J_{11} + 2J_{12} + J_{22} \approx a_2 - a_1^2 + \frac{p}{n}\left(\frac{a_2}{p} + a_1^2\right).$$
When $a_2 = O(1)$, $\mathrm{Risk}(\delta_{\alpha^*,\beta^*}) - \mathrm{Risk}(\delta_{\gamma^*})$ is of negligible order $O(n/p)$ when $p/n \to \infty$, while it has the same order as in (2.27) when $p/n \to 0$ or $c$. Under non-sparsity condition $a_2 = O(p)$, we note that
$$\mathrm{Risk}(\delta_{\alpha^*,\beta^*}) - \mathrm{Risk}(\delta_{\gamma^*}) \approx \begin{cases} -(p/n)^2 b_1^4/(b_2 - b_1^2) & \text{when } p/n \to 0, \\ -c^2 b_1^4/\{b_2 - (1 - c) b_1^2\} & \text{when } p/n \to c, \\ -(p/n) b_1^2 & \text{when } p/n \to \infty, \end{cases} \tag{2.28}$$

which has the same order as in (2.27). The above results can be summarized as follows.

Theorem 1. Assume conditions (A1) and (A2) to approximate the risk functions.
(1) The sample covariance matrix $S_y$ is improved by $\delta_{w^*}$ and $\delta_{\gamma^*}$, both of which are further improved by $\delta_{\alpha^*,\beta^*}$.
(2) $\mathrm{Risk}(S_y) = O(p/n)$, and $\mathrm{Risk}(\delta_{w^*}) - \mathrm{Risk}(S_y) = O(p/n)$ when $p/n \to \infty$ or $c$, while $\mathrm{Risk}(\delta_{w^*}) - \mathrm{Risk}(S_y)$ is negligible when $p/n \to 0$.
(3) When $a_2 = O(1)$, $\mathrm{Risk}(\delta_{\gamma^*}) - \mathrm{Risk}(S_y) = O(p/n)$ and $\mathrm{Risk}(\delta_{\alpha^*,\beta^*}) - \mathrm{Risk}(\delta_{w^*}) = O(p/n)$, but $\mathrm{Risk}(\delta_{\alpha^*,\beta^*}) - \mathrm{Risk}(\delta_{\gamma^*})$ is negligible, when $p/n \to \infty$ or $c$. These risk differences are negligible when $p/n \to 0$.
(4) When $a_2 = O(p)$, the risk differences $\mathrm{Risk}(\delta_{\gamma^*}) - \mathrm{Risk}(S_y)$ and $\mathrm{Risk}(\delta_{\alpha^*,\beta^*}) - \mathrm{Risk}(\delta_{w^*})$ are negligible, but $\mathrm{Risk}(\delta_{\alpha^*,\beta^*}) - \mathrm{Risk}(\delta_{\gamma^*}) = O(p/n)$, when $p/n \to \infty$ or $c$. These risk differences are negligible when $p/n \to 0$.

Theorem 1 implies that when $p/n$ is large, the improvement of $\delta_{\alpha^*,\beta^*}$ over $\delta_{\gamma^*}$ is significant under $a_2 = O(p)$ but not under $a_2 = O(1)$. This result is supported by the simulation results in Section 4. Note that $\delta_{\gamma^*}$ is an estimator based only on $S_y$, whereas $\delta_{\alpha^*,\beta^*}$ uses the statistic $S_{yx}$ associated with the correlation between $y_i$ and $x_i$. When the factor loadings are dense, or the $y_i$'s are strongly correlated with the $x_i$'s, the result (2.28) suggests the use of the double shrinkage estimator $\delta_{\alpha^*,\beta^*}$ for large $p/n$.

3. Plug-in estimators and the evaluation of their risks

Because the optimal weights of $\delta_{w^*}$, $\delta_{\gamma^*}$ and $\delta_{\alpha^*,\beta^*}$ depend on unknown parameters, we need to estimate them. For $a_1$, $b_1$, $a_2$ and $b_2$, we use the estimators given in [24]:
$$\hat a_1 = p^{-1}\mathrm{tr}(S_y), \qquad \hat b_1 = \frac{n}{pm}\mathrm{tr}(S_{y|x}),$$
$$\hat a_2 = \frac{n^2}{p(n-1)(n+2)}\left\{\mathrm{tr}(S_y^2) - \frac{(\mathrm{tr}\,S_y)^2}{n}\right\}, \qquad \hat b_2 = \frac{n^2}{p(m-1)(m+2)}\left\{\mathrm{tr}(S_{y|x}^2) - \frac{(\mathrm{tr}\,S_{y|x})^2}{m}\right\},$$


and $\phi_{11}$ is estimated by
$$\hat\phi_{11} = \frac{n}{pm}\left\{\mathrm{tr}(S_y S_{y|x}) - \frac{m-2}{(m-1)(m+2)}\,\mathrm{tr}(S_{y|x}^2) - \frac{m}{(m-1)(m+2)}(\mathrm{tr}\,S_{y|x})^2\right\},$$
where $m = n - q$. Using these estimators and noting (2.18), we suggest the estimators
$$\hat J_{10} = \hat\phi_{11} - \hat a_1 \hat b_1, \qquad \hat J_{20} = \hat a_2 - \hat a_1^2 - \hat\phi_{11} + \hat a_1 \hat b_1, \qquad \hat J_{11} = \hat b_2 - \hat b_1^2 + n^{-1} p \hat b_1^2,$$
$$\hat J_{22} = (1 + n^{-1})\hat a_2 - 2\hat\phi_{11} + \hat b_2 - (\hat a_1 - \hat b_1)^2 + n^{-1} p (\hat a_1^2 - \hat b_1^2), \qquad \hat J_{12} = \hat\phi_{11} - \hat a_1 \hat b_1 - \hat b_2 + \hat b_1^2. \tag{3.1}$$
Then from (2.16), the optimal weights $(\alpha^*, \beta^*)$ are estimated by

$$\begin{pmatrix} \hat\alpha \\ \hat\beta \end{pmatrix} = \begin{pmatrix} \hat J_{11} & \hat J_{12} \\ \hat J_{12} & \hat J_{22} \end{pmatrix}^{-1}\begin{pmatrix} \hat J_{10} \\ \hat J_{20} \end{pmatrix}. \tag{3.2}$$
We suggest the plug-in estimator $\delta_{\hat\alpha,\hat\beta}$ obtained by substituting $(\hat\alpha, \hat\beta)$ for $(\alpha^*, \beta^*)$ in (2.12). Similarly, noting (2.19) and (2.23), we estimate $w^*$ and $\gamma^*$ by $\hat w = (\hat J_{10} - \hat J_{12})/\hat J_{11}$ and $\hat\gamma = (\hat J_{10} + \hat J_{20})/(\hat J_{11} + 2\hat J_{12} + \hat J_{22})$, respectively, and we suggest the plug-in estimators $\delta_{\hat w}$ and $\delta_{\hat\gamma}$.
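The plug-in recipe above can be sketched end-to-end as follows (NumPy; function and variable names are ours). Given data matrices, it computes $\hat a_1, \hat b_1, \hat a_2, \hat b_2, \hat\phi_{11}$, the $\hat J$ terms of (3.1), and then the estimated weights $(\hat\alpha, \hat\beta)$, $\hat w$ and $\hat\gamma$:

```python
import numpy as np

def plugin_weights(Y, X):
    """Estimated shrinkage weights from data matrices Y (N x p), X (N x q)."""
    N, p = Y.shape
    q = X.shape[1]
    n, m = N - 1, N - 1 - q                 # m = n - q
    Yc, Xc = Y - Y.mean(0), X - X.mean(0)
    S_y  = Yc.T @ Yc / n
    S_x  = Xc.T @ Xc / n
    S_yx = Yc.T @ Xc / n
    S_ygx = S_y - S_yx @ np.linalg.solve(S_x, S_yx.T)   # S_{y|x}

    t1, t2 = np.trace(S_y),   np.sum(S_y * S_y)         # tr S_y, tr S_y^2
    r1, r2 = np.trace(S_ygx), np.sum(S_ygx * S_ygx)
    cross  = np.trace(S_y @ S_ygx)                      # tr(S_y S_{y|x})

    a1 = t1 / p
    b1 = n * r1 / (p * m)
    a2 = n**2 / (p * (n - 1) * (n + 2)) * (t2 - t1**2 / n)
    b2 = n**2 / (p * (m - 1) * (m + 2)) * (r2 - r1**2 / m)
    phi11 = n / (p * m) * (cross
                           - (m - 2) / ((m - 1) * (m + 2)) * r2
                           - m / ((m - 1) * (m + 2)) * r1**2)

    # the J-hat terms of (3.1)
    J10 = phi11 - a1 * b1
    J20 = a2 - a1**2 - phi11 + a1 * b1
    J11 = b2 - b1**2 + p * b1**2 / n
    J22 = ((1 + 1/n) * a2 - 2 * phi11 + b2 - (a1 - b1)**2
           + p * (a1**2 - b1**2) / n)
    J12 = phi11 - a1 * b1 - b2 + b1**2

    alpha, beta = np.linalg.solve([[J11, J12], [J12, J22]], [J10, J20])
    w     = (J10 - J12) / J11
    gamma = (J10 + J20) / (J11 + 2 * J12 + J22)
    return alpha, beta, w, gamma
```

The returned weights would then be plugged into (2.12), (2.10), and (2.13); note that in finite samples they are not forced into $[0, 1]$, so a practical implementation might clip them.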

To show the consistency of $\hat w$, $\hat\alpha$ and $\hat\beta$ as $(n, p) \to \infty$, we assume the following condition:

(A3) $\mathrm{tr}(\Sigma_y^3) = O(p^3)$, $\mathrm{tr}(\Sigma_y^4) = O(p^4)$, $\mathrm{tr}(\Sigma_{y|x}^3) = O(p)$, $\mathrm{tr}(\Sigma_{y|x}^4) = O(p)$, $\mathrm{tr}(\Sigma_{y|x}\Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy}) = O(p^2)$ and $\mathrm{tr}(\Sigma_{y|x}^2\Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy}) = O(p)$.

We first show the ratio-consistency of the above estimators under non-sparsity condition $a_2 = O(p)$, where $\hat\theta$ is a ratio-consistent estimator of $\theta$ if $\hat\theta/\theta$ converges in probability to 1 as $(n, p) \to \infty$.

Lemma 4. Assume (A1), (A2), (A3), and non-sparsity condition $a_2 = O(p)$. Then, under normality, $\hat a_1$, $\hat a_2$, $\hat b_1$, $\hat b_2$, and $\hat\phi_{11}$ are all unbiased and ratio-consistent:
$$\hat a_1/a_1 = 1 + O_p(n^{-1/2}), \qquad \hat a_2/a_2 = 1 + O_p(n^{-1/2}),$$
$$\hat b_1/b_1 = 1 + O_p\{(np)^{-1/2}\}, \qquad \hat b_2/b_2 = 1 + O_p\{(np)^{-1/2}\} + O_p(n^{-1}), \tag{3.3}$$
$$\hat\phi_{11}/\phi_{11} = 1 + O_p(n^{-1/2}).$$

If sparsity condition $a_2 = O(1)$ holds instead of $a_2 = O(p)$, we can again show the ratio-consistency of all the estimators, as in the following lemma; the proof is omitted.

Lemma 5. Assume sparsity condition $a_2 = O(1)$ together with (A1), (A2) and (A3), where the conditions on $\mathrm{tr}(\Sigma_y^4)$ and $\mathrm{tr}(\Sigma_{y|x}\Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy})$ in (A3) are replaced with $\mathrm{tr}(\Sigma_y^4) = O(p)$ and $\mathrm{tr}(\Sigma_{y|x}\Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy}) = O(p)$. Then, under normality, we again obtain the following ratio-consistency:
$$\hat a_1/a_1 = 1 + O_p\{(np)^{-1/2}\}, \qquad \hat a_2/a_2 = 1 + O_p\{(np)^{-1/2}\} + O_p(n^{-1}),$$
$$\hat b_1/b_1 = 1 + O_p\{(np)^{-1/2}\}, \qquad \hat b_2/b_2 = 1 + O_p\{(np)^{-1/2}\} + O_p(n^{-1}), \tag{3.4}$$
$$\hat\phi_{11}/\phi_{11} = 1 + O_p\{(np)^{-1/2}\}.$$

The continuous mapping theorem and the above ratio-consistency imply that $\hat w$, $\hat\gamma$, $\hat\alpha$ and $\hat\beta$ are consistent for $w^*$, $\gamma^*$, $\alpha^*$, and $\beta^*$, respectively. The proof is presented in the Appendix.

Theorem 2. Under the assumptions of Lemma 4 with non-sparsity condition $a_2 = O(p)$, one has
$$\hat w - w^* = \begin{cases} O_p(n^{-1/2}) & \text{when } p/n \to 0 \text{ or } c, \\ O_p(n^{1/2}/p) & \text{when } p/n \to \infty, \end{cases} \qquad \hat\alpha - \alpha^* = \begin{cases} O_p(n^{-1/2}) & \text{when } p/n \to 0 \text{ or } c, \\ O_p(n^{1/2}/p) & \text{when } p/n \to \infty, \end{cases}$$
and $\hat\gamma - \gamma^* = O_p(n^{-1/2})$ and $\hat\beta - \beta^* = O_p(n^{-1/2})$. Under the assumptions of Lemma 5 with sparsity condition $a_2 = O(1)$, the differences $\hat w - w^*$, $\hat\gamma - \gamma^*$, $\hat\alpha - \alpha^*$ and $\hat\beta - \beta^*$ have order $O_p\{(np)^{-1/2}\}$ when $p/n \to 0$ or $c$, and order $O_p(p^{-1})$ when $p/n \to \infty$.

We can now evaluate the loss differences between the oracle estimators $\delta_{w^*}$, $\delta_{\gamma^*}$, $\delta_{\alpha^*,\beta^*}$ and the plug-in estimators $\delta_{\hat w}$, $\delta_{\hat\gamma}$, $\delta_{\hat\alpha,\hat\beta}$.


Theorem 3. Let $\mathrm{Loss}(\delta) = p^{-1}\mathrm{tr}(\delta - \Sigma_y)^2$. Under the assumptions of Lemma 4 with non-sparsity condition $a_2 = O(p)$, one has
$$\mathrm{Loss}(\delta_{\hat w}) = \mathrm{Loss}(\delta_{w^*}) + O_p(n^{-1/2}), \qquad \mathrm{Loss}(\delta_{\hat\gamma}) = \mathrm{Loss}(\delta_{\gamma^*}) + O_p(p/n), \qquad \mathrm{Loss}(\delta_{\hat\alpha,\hat\beta}) = \mathrm{Loss}(\delta_{\alpha^*,\beta^*}) + O_p(p/n). \tag{3.5}$$
Under the assumptions of Lemma 5 with sparsity condition $a_2 = O(1)$, one has
$$\mathrm{Loss}(\delta_{\hat w}) = \mathrm{Loss}(\delta_{w^*}) + \begin{cases} O_p\{(np)^{-1/2}\} & \text{when } p/n \to 0 \text{ or } c, \\ O_p(n^{-1}) & \text{when } p/n \to \infty, \end{cases} \qquad \mathrm{Loss}(\delta_{\hat\gamma}) = \mathrm{Loss}(\delta_{\gamma^*}) + O_p(n^{-1}), \qquad \mathrm{Loss}(\delta_{\hat\alpha,\hat\beta}) = \mathrm{Loss}(\delta_{\alpha^*,\beta^*}) + O_p(n^{-1}). \tag{3.6}$$

Although Theorem 3 gives approximations for the loss functions, it may be possible to provide corresponding approximations for the risk functions under appropriate but complicated conditions. Using the fact that $\Pr\{\mathrm{Loss}(\delta) > M\} < E\{\mathrm{Loss}(\delta)\}/M = \mathrm{Risk}(\delta)/M$ for any large $M$, we note that $\mathrm{Loss}(S_y)$, $\mathrm{Loss}(\delta_{w^*})$, $\mathrm{Loss}(\delta_{\gamma^*})$, and $\mathrm{Loss}(\delta_{\alpha^*,\beta^*})$ are of order $O_p(p/n)$ because the corresponding risk functions are of order $O(p/n)$. Under non-sparsity condition $a_2 = O(p)$, from (3.5), the loss difference $\mathrm{Loss}(\delta_{\hat\alpha,\hat\beta}) - \mathrm{Loss}(\delta_{\alpha^*,\beta^*})$ has the same order as $\mathrm{Loss}(\delta_{\alpha^*,\beta^*})$. This is because the weight $\hat\beta$ on $S_{yx}S_x^{-1}S_{xy}$ in $\delta_{\hat\alpha,\hat\beta}$ multiplies a statistic of larger order, such as $\mathrm{tr}(S_{yx}S_x^{-1}S_{xy}) = O_p(p^2)$, which is unstable when the factor loadings are dense, so the estimation error of $\hat\beta$ is amplified by the fluctuation of $S_{yx}S_x^{-1}S_{xy}$. A similar comment applies to $\mathrm{Loss}(\delta_{\hat\gamma}) - \mathrm{Loss}(\delta_{\gamma^*})$. Also from (3.5), $\mathrm{Loss}(\delta_{\hat w}) - \mathrm{Loss}(\delta_{w^*})$ has the same order as $\mathrm{Risk}(\delta_{w^*})$ for $p = \sqrt{n}$. Under sparsity condition $a_2 = O(1)$, on the other hand, the loss differences are negligible, as seen from (3.6).

4. Simulation studies

We now investigate the numerical performance of the proposed estimators by computing their empirical risks through simulations. We assume throughout that $\mu_x = 0$ and $\Sigma_x = I_3$ for $q = 3$. We then consider the following four cases:

(M1) The strict factor model (sphericity) with $\Sigma_{y|x} = 3 I_p$.
(M2) The approximate factor model with non-sparse error covariance matrix:
$$\Sigma_{y|x} = \mathrm{diag}(\sigma_1, \ldots, \sigma_p)\,\big(\rho^{|i-j|/7}\big)_{1 \le i, j \le p}\,\mathrm{diag}(\sigma_1, \ldots, \sigma_p),$$
for $\sigma_j = \sqrt{3}\{1 + 0.2(-1)^{j-1}(p - j + 1)/p\}$ and $\rho = 0.2$.
(M3) The approximate factor model with sparse error covariance matrix, viz.
$$\Sigma_{y|x} = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2) + s s^\top - \mathrm{diag}(s s^\top).$$
Here, $\sigma_i^2 \sim$ i.i.d. $\chi_3^2$ and $s = (s_1, \ldots, s_p)^\top$, where $s_i \sim$ i.i.d. $\mathcal{N}(0, 1)$ with probability 0.1 and $s_i = 0$ with probability 0.9.
(M4) The approximate factor model with the equi-correlation error covariance matrix: $(\Sigma_{y|x})_{ii} = 3$ and $(\Sigma_{y|x})_{ij} = 3\rho$ for $i \ne j$, with $\rho = 0.2$.

The covariance structure (M3) was considered by Fan et al. [10]. Condition (A2) does not hold in (M4) because $b_2 = O(p)$ there. Following Fan et al. [10], we take $p = 100$ and let $N$ increase from 30 to 300 in increments of 30. For each fixed $N$, we repeat the following steps 1000 times and report the percentage relative improvement in average loss (PRIAL) of each estimator $\delta$ over $S_y$, defined by
$$\mathrm{PRIAL}(\delta) = 100 \times \left[1 - \frac{E\{\mathrm{tr}(\delta - \Sigma_y)^2\}}{E\{\mathrm{tr}(S_y - \Sigma_y)^2\}}\right].$$

(1) Generate $\epsilon_i = \Sigma_{y|x}^{1/2} u_i$ and $x_i$ for $i = 1, \ldots, N$, where $u_i = (u_{ij})_{1 \le j \le p}$, $x_i = (x_{ij})_{1 \le j \le 3}$, and $u_{ij}, x_{ij} \sim$ i.i.d. $\mathcal{N}(0, 1)$.
(2) Generate $\Sigma_{yx}$. We consider the following two cases for its density:
(L1) $(\Sigma_{yx})_{ij} \sim$ i.i.d. $\mathcal{N}(0, 1)$. From the discussion in Section 2.3, this corresponds to the case of non-sparse factor loadings, namely $a_2 = O(p)$.
(L2) Each column of $\Sigma_{yx}$ consists of randomly ordered $0.1 \times p$ independent standard normal variables and $0.9 \times p$ zeros. This corresponds to the case of sparse factor loadings, namely $a_2 = O(1)$.


Y. Ikeda, T. Kubokawa / Journal of Multivariate Analysis 152 (2016) 61–81

Table 1
PRIAL comparison of the estimators of Σy under the strict factor model (M1) for p = 100.

      N        30    60    90   120   150   180   210   240   270   300
(L1)  δγ̂     28.1  16.1  12.1   9.0   7.6   6.0   6.0   4.4   3.7   3.3
      δŵ     20.9  21.9  22.1  22.0  22.6  22.5  22.4  22.8  23.0  23.2
      δα̂β̂   39.5  32.0  29.7  27.7  27.3  26.2  26.2  25.4  25.1  25.0
      δAFM   17.8  18.7  18.9  18.8  19.4  19.3  19.2  19.5  19.6  19.8
      δSFM   20.5  21.5  21.7  21.6  22.2  22.1  22.0  22.4  22.5  22.7
(L2)  δγ̂     91.5  84.4  78.8  73.5  67.9  64.7  61.2  58.3  55.2  53.8
      δŵ     73.4  78.0  79.5  80.4  80.2  80.8  81.1  81.2  81.1  81.6
      δα̂β̂   92.8  89.8  88.2  87.1  85.8  85.6  85.1  84.8  84.4  84.6
      δAFM   62.8  66.7  68.0  68.7  68.6  69.1  69.3  69.4  69.3  69.8
      δSFM   72.2  76.6  78.0  78.9  78.7  79.3  79.5  79.6  79.5  80.1

Table 2
PRIAL comparison of the estimators of Σy under the approximate factor model (M2) for p = 100.

      N        30    60    90   120   150   180    210    240    270    300
(L1)  δγ̂     28.1  16.6  11.2   7.9   6.7   5.8    5.0    4.8    4.2    3.8
      δŵ     12.0   8.4   6.6   5.4   4.7   4.1    3.7    3.3    3.1    2.9
      δα̂β̂   30.9  19.3  13.6  10.0   8.6   7.5    6.6    6.3    5.5    5.0
      δAFM   14.0  12.9  12.2  11.5  11.2  10.6   10.5   10.2    9.9    9.6
      δSFM    4.2 −16.5 −37.8 −58.4 −79.6 −98.0 −118.4 −136.6 −156.3 −176.9
(L2)  δγ̂     53.0  36.3  27.6  21.9  18.5  15.8   13.9   12.1   11.1    9.9
      δŵ     39.8  28.3  21.5  17.2  14.5  12.4   11.1    9.5    8.8    7.9
      δα̂β̂   52.8  36.5  28.0  22.2  18.8  16.0   14.1   12.3   11.2   10.0
      δAFM   46.6  43.1  39.6  36.8  34.6  32.6   31.3   29.2   28.3   27.0
      δSFM   12.9 −53.9 −119.9 −184.6 −243.2 −300.7 −353.1 −405.4 −454.8 −498.0

(3) Generate a ∼ Np(0, Ip).
(4) Generate the yi's by the formula yi = a + Bxi + ϵi.

Let δŵ, δγ̂ and δα̂,β̂ be as given in Sections 2 and 3, where the corresponding optimal weights are estimated by the consistent estimators. To compare the performance of these estimators with other methods from the literature, we also consider the approximate-factor-model-based thresholded estimator δAFM suggested by Fan et al. [10] and the strict-factor-model-based estimator δSFM suggested by Fan et al. [9], namely

  δAFM = Syx Sx⁻¹ Sxy + S*y|x,   δSFM = Syx Sx⁻¹ Sxy + diag(Sy|x),

where S*y|x is the statistic derived from Sy|x by the thresholding method. We then compare the performance of the six estimators: the sample covariance matrix Sy, δγ̂, δŵ, δα̂,β̂, δAFM, and δSFM. We compute δAFM with the R package POET introduced in [11], where we set the number of factors to zero (K = 0) and apply soft-thresholding with C = 0.5, which the authors reported works well empirically.

Tables 1–4 report the simulation results for the estimation of Σy under the models (M1)–(M4), respectively. The case of non-normal distributions yields similar results and is thus omitted. Table 1 reports the case of the strict factor model (M1), where the estimator δα̂,β̂ performs best, although δAFM and δSFM have comparable PRIAL when N becomes large. Tables 2, 3, and 4 report the cases of the approximate factor model with (M2) non-sparse, (M3) sparse and (M4) equi-correlated error covariance matrices, respectively. In Table 2, δAFM is the best in both cases (L1) and (L2) except when the sample size is small, in which case δα̂,β̂ performs better. Because (M2) satisfies neither the strict factor model nor the approximate factor model with sparse error covariance matrix, the estimator δSFM does not do as well as the sample covariance matrix Sy; δSFM thresholds all the off-diagonal elements of Sy|x and, thus, errors accumulate. On the other hand, in Table 3, although δα̂,β̂ performs better for small N, δAFM and δSFM are preferable when the sample size is large. In Table 4, δSFM is preferable; again, δα̂,β̂ works better for small sample size N.

In Tables 1 and 3, the PRIAL of our proposals decreases as N increases, but the PRIAL values of δAFM and δSFM are almost constant. To explain this phenomenon, we write PRIAL = C_N and consider the two cases C_N = C and C_N = C/N for a positive constant C. Then, from the definition of PRIAL, the risk function Risk(δ) = E{tr(δ − Σy)²} of an estimator δ is expressed as

  Risk(δ) = (1 − C_N)Risk(Sy) = (1 − C)Risk(Sy)            for C_N = C,
                              = Risk(Sy) − (C/N)Risk(Sy)   for C_N = C/N.

This implies that the improvement of δ over Sy appears in the first-order terms of the risk functions when C_N = C > 0, and in the second-order terms when C_N = C/N. In Tables 1 and 3, we treat the cases in which Σy|x is a spherical matrix and close


Table 3
PRIAL comparison of the estimators of Σy under the approximate factor model (M3) for p = 100.

      N        30    60    90   120   150   180   210   240   270   300
(L1)  δγ̂     35.0  20.4  15.7  12.2   9.2   8.7   7.4   6.3   5.9   4.1
      δŵ     24.3  22.7  19.9  17.9  16.6  15.0  13.7  13.0  11.9  11.3
      δα̂β̂   44.3  33.0  28.1  24.1  21.0  19.4  17.5  16.0  14.9  13.0
      δAFM   23.8  25.5  25.5  25.7  26.3  26.1  25.9  26.4  26.0  26.4
      δSFM   27.4  29.3  29.2  29.4  30.1  29.8  29.5  30.0  29.5  30.0
(L2)  δγ̂     81.4  68.3  58.9  51.8  46.2  41.8  38.0  34.9  32.5  30.3
      δŵ     65.8  59.5  53.1  47.6  43.1  39.5  36.2  33.5  31.4  29.3
      δα̂β̂   81.2  68.9  59.9  52.9  47.5  43.1  39.5  36.3  33.9  31.6
      δAFM   62.7  63.7  62.6  61.1  59.4  57.9  56.1  54.7  53.3  51.9
      δSFM   71.0  70.5  67.5  64.1  60.6  57.3  53.8  50.8  47.8  45.0

Table 4
PRIAL comparison of the estimators of Σy under the approximate factor model (M4) for p = 100.

      N        30    60    90   120   150   180   210   240   270   300
(L1)  δγ̂     27.3  15.7  11.7   8.7   7.2   5.8   5.2   4.5   4.2   3.7
      δŵ     17.3  14.8  12.7  11.1   9.9   8.9   8.0   7.3   6.7   6.2
      δα̂β̂   32.4  21.5  17.2  13.7  11.8  10.0   9.1   8.1   7.5   6.9
      δAFM   15.1  11.9   9.1   7.0   5.4   4.1   2.9   1.9   1.3   0.5
      δSFM   18.6  16.7  14.9  13.5  12.7  12.0  11.0  10.1  10.1   9.2
(L2)  δγ̂     49.0  32.4  24.1  19.3  16.1  13.8  12.1  10.8   9.7   8.8
      δŵ     38.8  26.8  20.3  16.4  13.7  11.8  10.4   9.2   8.3   7.6
      δα̂β̂   49.0  32.0  23.5  18.6  15.4  13.1  11.3  10.0   9.0   8.1
      δAFM   33.9  21.5  14.6  10.3   7.4   5.5   3.8   2.5   1.5   0.8
      δSFM   42.0  30.3  23.8  20.1  17.4  16.1  14.4  13.2  12.0  11.5

to a diagonal matrix, and these situations are preferred by the estimators δAFM and δSFM. Since the PRIAL values of δAFM and δSFM in Tables 1 and 3 are almost constant, the above argument implies that these estimators improve on Sy in the first order of the risk in the corresponding situations of Σy|x.

We next conduct experiments with fixed N = 100 and p = 100, 200, 300, 400, and 500. Steps (1)–(4) are repeated 200 times, and the results are reported in Table 5. The risk difference between δγ̂ and δα̂,β̂ is small in the cases of (L2) but relatively large in the cases of (L1), which is consistent with the results in Section 2.3. Overall, the risks of δAFM and δα̂,β̂ are stable for every covariance structure and distribution treated in the simulation. Additionally, δα̂,β̂ works well, particularly when the sample size is small relative to the dimension. We thus recommend the use of δα̂,β̂ for the estimation of Σy in high dimension.

5. Applications to portfolio management

In this section, we consider Markowitz's problem (variance minimization)

  min_c c⊤Σc   subject to c⊤1 = 1,

where c is the p × 1 vector of asset weights, 1 is the p × 1 vector of ones, and Σ is the p × p covariance matrix of the capital assets; Σ corresponds to Σy in the previous sections. The analytic solution of this quadratic programming problem is known to be c* = (1⊤Σ⁻¹1)⁻¹Σ⁻¹1, and the minimum variance becomes var_min = 1/(1⊤Σ⁻¹1). Because Σ is unknown, we estimate it by a suitable estimator Σ̂. As in [10], the plug-in minimum variance portfolio is given by

  ĉ = ĉ(Σ̂) = (1⊤Σ̂⁻¹1)⁻¹ Σ̂⁻¹1.
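The closed-form weights and the minimum variance can be sketched in a few lines of NumPy (a toy illustration with a made-up 3 × 3 covariance matrix, not the paper's data):

```python
import numpy as np

def min_variance_weights(Sigma_hat):
    """c = Sigma^{-1} 1 / (1^T Sigma^{-1} 1): plug-in minimum variance portfolio."""
    ones = np.ones(Sigma_hat.shape[0])
    w = np.linalg.solve(Sigma_hat, ones)   # Sigma^{-1} 1 without forming the inverse
    return w / (ones @ w)

# toy positive definite covariance matrix (made up for illustration)
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
c = min_variance_weights(Sigma)
var_min = c @ Sigma @ c                    # equals 1 / (1^T Sigma^{-1} 1)
assert np.isclose(c.sum(), 1.0)            # budget constraint c^T 1 = 1
assert np.isclose(var_min, 1.0 / (np.ones(3) @ np.linalg.solve(Sigma, np.ones(3))))
```

Solving the linear system rather than inverting Σ̂ is numerically preferable, which matters precisely because the estimators studied here are designed to be invertible and well-conditioned.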

We compare the actual variances of the portfolios ĉ constructed from the estimators of Σ. As the factor model, we use the standard Fama–French three-factor model treated by Fan et al. [9,10], viz. yij = β0 + β1 MKTij + β2 SMBij + β3 HMLij + εij, where yij is the daily excess return of the jth stock on day i, MKTij is the total market excess daily return, SMBij is the small-minus-large size factor, HMLij is the book-to-price ratio factor, and εij is the error term. We conduct the data analysis using the S&P 500 daily excess returns and the daily three-factor data of the Fama–French model from August 2003 to July 2013 (10 years). The data are obtained from QuantQuote (https://quantquote.com/historical-stockdata) and the data library on French's website.
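The three-factor fit underlying such an analysis can be sketched as follows (our variable names, synthetic data; the paper's estimators are built from sample cross-covariances such as Syx Sx⁻¹ Sxy, but an equivalent OLS view is often convenient):

```python
import numpy as np

def three_factor_fit(returns, factors):
    """OLS of each stock's excess returns on [1, MKT, SMB, HML];
    returns the p x 3 loading matrix and the residual covariance."""
    N = returns.shape[0]
    F = np.column_stack([np.ones(N), factors])
    coef, *_ = np.linalg.lstsq(F, returns, rcond=None)   # shape (4, p)
    resid = returns - F @ coef
    B = coef[1:].T                                       # drop the intercept row
    Sy_x = resid.T @ resid / N                           # residual ("error") covariance
    return B, Sy_x

rng = np.random.default_rng(6)
N, p = 252, 10                                           # roughly one trading year
factors = 0.01 * rng.standard_normal((N, 3))             # synthetic MKT, SMB, HML
B_true = rng.standard_normal((p, 3))
returns = factors @ B_true.T + 0.005 * rng.standard_normal((N, p))
B_hat, Sy_x = three_factor_fit(returns, factors)
assert np.allclose(B_hat, B_true, atol=0.25)             # loadings recovered up to noise
```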


Table 5
PRIAL comparison of the estimators of Σy under the factor models (M1)–(M4) for N = 100.

      p        100   200   300   400   500
(M1), (L1)
      δγ̂     12.4  12.2   8.8  11.4  10.3
      δŵ     21.9  22.0  22.8  22.8  21.9
      δα̂β̂   30.0  30.0  27.9  30.0  28.4
      δAFM   18.7  19.5  20.5  20.6  19.9
      δSFM   21.5  21.8  22.7  22.7  21.8
(M1), (L2)
      δγ̂     77.0  78.3  78.4  77.7  78.1
      δŵ     80.0  80.3  80.0  79.7  79.8
      δα̂β̂   87.8  88.3  88.4  88.0  88.1
      δAFM   68.4  70.9  71.7  72.1  72.7
      δSFM   78.5  79.5  79.4  79.3  79.5
(M2), (L1)
      δγ̂      8.0   9.4  12.3  10.1  11.9
      δŵ      6.4   9.3  11.2  12.9  13.9
      δα̂β̂   10.7  14.9  19.2  19.0  21.6
      δAFM   12.3  16.0  17.3  18.3  18.6
      δSFM  −43.7 −12.5  −1.1   4.9   8.3
(M2), (L2)
      δγ̂     24.6  37.5  45.5  50.6  54.7
      δŵ     19.1  30.5  38.6  44.4  48.8
      δα̂β̂   25.0  37.4  45.7  51.5  56.1
      δAFM   38.0  53.2  59.7  63.2  65.4
      δSFM −146.3 −43.2  −3.8  16.6  29.0
(M3), (L1)
      δγ̂     12.5  14.3  15.3  14.4  13.0
      δŵ     20.3  22.3  25.4  24.7  27.0
      δα̂β̂   26.6  29.9  33.5  32.6  33.3
      δAFM   24.9  25.0  27.0  24.8  26.6
      δSFM   28.3  27.6  29.8  27.2  29.2
(M3), (L2)
      δγ̂     60.7  71.7  74.9  77.9  77.1
      δŵ     55.0  66.8  71.0  73.5  73.2
      δα̂β̂   61.6  74.2  78.3  80.8  80.5
      δAFM   71.0  72.9  73.8  75.2  73.1
      δSFM   81.3  81.3  81.4  82.9  80.2
(M4), (L1)
      δγ̂      9.5  10.1   9.8  10.0  10.1
      δŵ     12.2  12.9  13.3  13.3  13.4
      δα̂β̂   15.0  15.8  15.7  16.0  16.2
      δAFM    8.3  11.5  12.8  13.3  14.0
      δSFM   14.3  23.1  26.4  27.4  28.9
(M4), (L2)
      δγ̂     18.9  19.0  19.0  19.1  19.0
      δŵ     22.3  22.4  22.4  22.5  22.4
      δα̂β̂   21.6  21.7  21.6  21.7  21.6
      δAFM   12.8  16.5  18.7  19.4  20.3
      δSFM   21.9  33.2  38.6  39.8  41.5

On the first trading day of each month t, for t = August 2004, …, July 2013, we construct an estimate of Σ based on the historical data of the excess daily returns and the three factors for the preceding 12 months t − 1, t − 2, …, t − 12; we denote the resulting estimator by δ_{t−1}. This implies that the sample size N for the estimation of Σ is approximately 252 for every t. The 449 stocks listed throughout the 10 years have complete data; that is, the dimension of Σ is p = 449. As the estimation procedures for constructing δ_{t−1}, we use the five estimators δγ̂, δŵ, δα̂,β̂, δAFM, and δSFM and compute the corresponding five estimated optimal portfolio weights ĉ_{t−1}(δ_{t−1}) based on the preceding 12 months t − 1, t − 2, …, t − 12. The value of the threshold in δAFM is chosen by the Cmin function suggested in Section 4.1 of Fan et al. [11]. As explained in [17,7], at the end of each month t, for t = August 2004, …, July 2013, we calculate the actual (out-of-sample) portfolio variance during month t with weight ĉ_{t−1}(δ), i.e.,

  APVt(δ) = {ĉ_{t−1}(δ)}⊤ St ĉ_{t−1}(δ),

where St is the sample covariance matrix of the daily excess returns of month t. Thus, we obtain 108 values of APVt(δ) for t = August 2004, …, July 2013. For each estimator δ, we compute the average actual portfolio variance and the largest actual portfolio variance over the T = 108 months, respectively given by

  average APV(δ) = (1/T) Σ_{t=1}^{T} APVt(δ),   largest APV(δ) = max_{1≤t≤T} APVt(δ).

These quantities for the four estimators δγ̂, δα̂,β̂, δAFM and δSFM are reported in Table 6, where the corresponding values for δŵ are omitted because the absolute value of the difference between the empirical portfolio variances of δŵ and δα̂,β̂ is less than 0.014. For each t, the minimum actual portfolio variance is given by minAPVt = min{APVt(δ); δ = δγ̂, δα̂,β̂, δAFM, δSFM}. For each estimator δ, Table 6 also reports the value

  rate of taking the smallest APV = (1/T) × the number of t such that APVt(δ) = minAPVt.
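The APV bookkeeping can be sketched as follows (a self-contained toy example with synthetic returns and identity-matrix stand-ins for the covariance estimates; the actual δ_{t−1} are constructed as in Sections 2 and 3 and are not reproduced here):

```python
import numpy as np

def apv_summary(estimates, monthly_returns):
    """APV_t = c_{t-1}^T S_t c_{t-1}; returns (average APV, largest APV)."""
    apv = []
    for delta_prev, R in zip(estimates, monthly_returns):
        ones = np.ones(delta_prev.shape[0])
        w = np.linalg.solve(delta_prev, ones)
        c = w / (ones @ w)                   # plug-in minimum variance weights
        Rc = R - R.mean(axis=0)
        S_t = Rc.T @ Rc / (R.shape[0] - 1)   # sample covariance of month t's returns
        apv.append(c @ S_t @ c)
    apv = np.asarray(apv)
    return apv.mean(), apv.max()

rng = np.random.default_rng(2)
p, days, T = 5, 21, 12
estimates = [np.eye(p) for _ in range(T)]    # stand-ins for delta_{t-1}
returns = [0.01 * rng.standard_normal((days, p)) for _ in range(T)]
avg_apv, max_apv = apv_summary(estimates, returns)
assert 0.0 < avg_apv <= max_apv
```

The same loop, with the loss tr{(δ_{t−1} − St)²}/p in place of the quadratic form, yields the EPRIAL comparison below.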

Table 6 shows that the highest rate of taking the smallest APV is attained by δAFM, at 47.2%, and δα̂,β̂ gives the second highest rate, at 33.3%. Over the 108 months, δα̂,β̂ has the smallest average actual portfolio variance, at 0.359. The largest actual variances of the portfolios based on δAFM and δSFM are higher than those of the others. This is because δAFM and δSFM have large actual variances during the global financial crisis. Thus, overall, δα̂,β̂ is favorable for constructing minimum variance portfolios.

We next compare the estimators based on the empirical quadratic losses for the same data. For t = August 2004, …, July 2013, the actual loss of δ_{t−1} is calculated by Losst(δ) = tr{(δ_{t−1} − St)²}/p, where St is the sample covariance matrix of month t, and δ_{t−1} is calculated based on the historical data of the excess daily returns and the three factors for the preceding 12 months t − 1, t − 2, …, t − 12. We then obtain the empirical percentage relative improvement in average loss (EPRIAL),


Table 6
Comparison of the empirical variances of the plug-in minimum variance portfolios based on the covariance estimators.

                                    δγ̂     δα̂β̂   δAFM    δSFM
Average APV                        0.416   0.359   0.390   0.570
Largest APV                        5.839   6.398   7.857   9.138
Rate of taking the smallest APV    11.1%   33.3%   47.2%    8.3%

Table 7
Comparison of the empirical PRIAL of the covariance estimators.

                                    δγ̂    δŵ     δα̂β̂   δAFM    δSFM
Average EPRIAL                      2.4    1.9     2.8     2.4     2.1
Smallest EPRIAL                    −4.5   −2.9    −3.1   −14.0   −19.8
Rate of taking the largest EPRIAL  33.3%  18.5%   7.4%   18.5%   22.2%

viz. EPRIALt(δ) = 100 × {Losst(δS) − Losst(δ)}/Losst(δS), for the estimators δγ̂, δŵ, δα̂,β̂, δAFM and δSFM, where δS denotes the estimator based on the corresponding sample covariance matrix. Similarly to the APV, we calculate the following quantities:

  average EPRIAL(δ) = (1/T) Σ_{t=1}^{T} EPRIALt(δ),
  smallest EPRIAL(δ) = min_{1≤t≤T} EPRIALt(δ),

and

  rate of taking the largest EPRIAL = (1/T) × the number of t such that EPRIALt(δ) = maxEPRIALt,

for maxEPRIALt = max{EPRIALt(δ); δ = δŵ, δγ̂, δα̂,β̂, δAFM, δSFM}. These values are reported in Table 7. Table 7 shows that the estimators δAFM and δSFM give the smallest EPRIALs, −14.0 and −19.8, and are unstable at the time of the global financial crisis. The estimator δα̂,β̂ gives the largest average EPRIAL, which suggests that it may be more stable than δAFM and δSFM during the global financial crisis.

6. Concluding remarks

To estimate a large covariance matrix, we considered double shrinkage estimators, which include, as special cases, (i) linear combinations of the sample covariance matrix and the specific estimators suggested under strict factor models, and (ii) the linear shrinkage estimators suggested for high-dimensional situations. Under the assumptions of non-sparse and sparse factor loadings, we derived the optimal weights and provided consistent estimators thereof. The resulting plug-in estimators are invertible and well-conditioned. They are useful not only when strict factor models are assumed, but also when approximate factor models hold or no factor model can be assumed. Numerical results show that the suggested estimators perform well for various structures of the covariance matrix. Our application to portfolio management shows that the procedures based on the suggested estimators give the smallest actual portfolio variance.

Concerning the matrix Ψ, we treated the spherical case (2.17) to simplify the presentation of the results. When Ψ is a diagonal matrix, we derived similar results, but the proofs and conditions are more complicated, and we omitted the diagonal case; for the details, see [15]. In numerical experiments, there is little difference between the performances of the estimators with diagonal and spherical Ψ.
Acknowledgments

We would like to thank the Editor, the Associate Editor and three reviewers for many valuable comments and helpful suggestions that led to an improved version of this paper. We would also like to thank Editage (www.editage.jp) for English language editing. The second author acknowledges support from Grants-in-Aid for Scientific Research (15H01943 and 26330036), Japan.

Appendix. Proofs

A.1. Proof of Lemma 2

For notational convenience, let V11 = nSy, V22 = nSx and V12 = V21⊤ = nSyx. Denote V11.2 = V11 − V12 V22⁻¹ V21. If n ≥ p + q, then, from the standard theory of the Wishart distribution, V11.2 is independent of (V12, V22), and V11.2 ∼ Wp(n − q, Σy|x),


V12 | V22 ∼ Np,q(Σyx Σx⁻¹ V22, Σy|x, V22) and V22 ∼ Wq(n, Σx). These properties are helpful in evaluating the risk when n ≥ p + q. When p > n, however, we cannot use them. To evaluate the risk in this case, we call on the following lemmas, which hold true for any order of n and p.

Lemma 6. Let W = XX⊤ = X1X1⊤ + ⋯ + XnXn⊤ for X = (X1, …, Xn), where the Xi's are mutually independently and identically distributed as Xi ∼ Np(0, Σ). Then E(W) = nΣ and E(W²) = n(n + 1)Σ² + n tr(Σ)Σ. For p × p symmetric matrices A and B, one has E{tr(AW²)} = n(n + 1)tr(AΣ²) + n tr(Σ)tr(AΣ) and E{tr(AW)tr(BW)} = n² tr(AΣ)tr(BΣ) + 2n tr(AΣBΣ). Also,

  E[{tr(W²)}²] = 4n(2n² + 5n + 5)tr(Σ⁴) + 16n(n + 1)tr(Σ³)tr(Σ) + n(n³ + 2n² + 5n + 4){tr(Σ²)}²
                 + 2n(n² + n + 4)tr(Σ²){tr(Σ)}² + n²{tr(Σ)}⁴,      (A.1)

and

  E[{tr(W)}⁴] = 48n tr(Σ⁴) + 32n² tr(Σ³)tr(Σ) + 12n²{tr(Σ²)}² + 12n³ tr(Σ²){tr(Σ)}² + n⁴{tr(Σ)}⁴.      (A.2)
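The first two moment identities of Lemma 6 are easy to check by Monte Carlo; the following sketch (ours, with an arbitrary 3 × 3 Σ and loose tolerances to absorb simulation error) verifies E(tr W) = n tr(Σ) and the trace of E(W²) = n(n + 1)Σ² + n tr(Σ)Σ.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 3, 7, 100_000       # the identities hold for any (n, p)

Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
L = np.linalg.cholesky(Sigma)

X = np.einsum("ij,rjk->rik", L, rng.standard_normal((reps, p, n)))
W = np.einsum("rik,rjk->rij", X, X)            # W = X X^T, one draw per replicate
tr_W = np.einsum("rii->r", W)
tr_W2 = np.einsum("rij,rji->r", W, W)

tS, tS2 = np.trace(Sigma), np.trace(Sigma @ Sigma)
assert abs(tr_W.mean() - n * tS) / (n * tS) < 0.01        # E(tr W) = n tr(Sigma)
expected = n * (n + 1) * tS2 + n * tS ** 2                # tr of E(W^2)
assert abs(tr_W2.mean() - expected) / expected < 0.02
```

Here n < p would also work, which is the point of the lemma: no Wishart density is needed.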

Watamori [28] derived the identities (A.1) and (A.2) using the properties of the Wishart distribution when n ≥ p. Lemma 6 confirms that the same identities hold for any pair (n, p) without assuming the Wishart distribution.

Proof. We note that E(W) = E(X1X1⊤ + ⋯ + XnXn⊤) = nΣ. To evaluate the other terms, we use the Stein–Haff (or Konno) identity and the Stein identity, respectively given by

  E(WG1) = E{nΣG1 + Σ(X∇⊤)⊤G1},      (A.3)
  E(XG2) = E(Σ∇G2),      (A.4)

where G1 and G2 are, respectively, p × p and n × p matrices of functions of W, and ∇ = (∂/∂xij) is the p × n matrix of differential operators. See [16,25,26] for (A.3) and (A.4), respectively. It follows from (A.3) that

  E{tr(WG1)} = E[n tr(ΣG1) + tr{X∇⊤(ΣG1⊤)}].      (A.5)

To compute ∇⊤(ΣG1⊤), the following equalities from Haff [13,14] are useful:

  ∇(UV) = (∇U)V + (U⊤∇⊤)⊤V,   tr{∇(UV)} = tr{(∇U)V} + tr{U⊤(∇⊤V⊤)},
  ∇⊤(A1X) = tr(A1)In,   ∇(A2X) = A2⊤,      (A.6)

where U and V are matrices of functions of W such that the product ∇UV is defined, and A1 and A2 are p × p and n × p matrices of constants, respectively. For E(W²), E(W²) = E{nΣW + Σ(X∇⊤)⊤W}. We can show that (X∇⊤)⊤W = W + tr(W)Ip, so that E(W²) = E{nΣW + ΣW + tr(W)Σ} = n(n + 1)Σ² + n tr(Σ)Σ. This implies that E{tr(AW²)} = n(n + 1)tr(AΣ²) + n tr(Σ)tr(AΣ). Using (A.5), we obtain E{tr(AW)tr(BW)} = E{n tr(AΣ)tr(BW)} + E tr[X∇⊤{ΣA tr(BW)}]. Because ∇⊤tr(BXX⊤) = 2X⊤B, we have

  E{tr(AW)tr(BW)} = E{n tr(AΣ)tr(BW) + 2tr(ΣAWB)} = n² tr(AΣ)tr(BΣ) + 2n tr(AΣBΣ).

The computation of E[{tr(W²)}²] is more involved. A sketch of the proof goes as follows. From (A.5), one has E[{tr(W²)}²] = E{n tr(ΣW)tr(W²)} + E tr[X∇⊤{ΣW tr(W²)}]. We can show that

  tr[X∇⊤{ΣW tr(W²)}] = tr(Σ)tr(W)tr(W²) + tr{X(X⊤Σ∇)⊤X⊤ tr(W²)},
  tr{X(X⊤Σ∇)⊤X⊤ tr(W²)} = tr(ΣW)tr(W²) + tr{X⊤Σ(X⊤X∇⊤)⊤ tr(W²)},
  tr{X⊤Σ(X⊤X∇⊤)⊤ tr(W²)} = 4tr(ΣW³).


Combining these terms yields

  E[{tr(W²)}²] = E{(n + 1)tr(ΣW)tr(W²) + tr(Σ)tr(W)tr(W²) + 4tr(ΣW³)}.      (A.7)

Similarly, we note that

  E{tr(W²)tr(ΣW)} = E[(n + 1){tr(ΣW)}² + tr(Σ)tr(W)tr(ΣW) + 2tr(Σ²W²)],
  E{tr(W²)tr(W)} = E[(n + 1)tr(ΣW)tr(W) + tr(Σ){tr(W)}² + 2tr(ΣW²)],
  E{tr(ΣW³)} = E[(n + 2)tr(Σ²W²) + tr(Σ²)tr(W²) + tr(W)tr(Σ²W)].      (A.8)

Combining (A.7) and (A.8), and using the second moments of W, we obtain identity (A.1). For E[{tr(W)}⁴], the identity in (A.5) gives E[{tr(W)}⁴] = E[n tr(Σ){tr(W)}³ + tr{ΣX∇⊤}{tr(W)}³]. Because tr{ΣX∇⊤}{tr(W)}³ = 6tr(ΣW){tr(W)}², we have

  E[{tr(W)}⁴] = E[n tr(Σ){tr(W)}³ + 6tr(ΣW){tr(W)}²].      (A.9)

Similarly, we can show that

  E[{tr(W)}³] = E[n tr(Σ){tr(W)}² + 4tr(W)tr(ΣW)],
  E[tr(ΣW){tr(W)}²] = E[n tr(Σ²){tr(W)}² + 4tr(Σ²W)tr(W)].      (A.10)

Hence, combining (A.9) and (A.10), and using the second moments of W, we obtain the equality in (A.2). □

Lemma 7. For the random matrices V11, V22, and V12 defined at the beginning of the Appendix, the following properties hold true under normality for any positive integers n, p and q with n ≥ q:
(1) V11.2 = V11 − V12V22⁻¹V21 is expressed as V11.2 = UU⊤ for U = (U1, …, U_{n−q}), where the Ui's are mutually independently and identically distributed as Ui ∼ Np(0, Σy|x).
(2) V11.2 is independent of (V12, V22).
(3) V12 | V22 ∼ Np,q(Σyx Σx⁻¹ V22, Σy|x, V22) and V22 ∼ Wq(n, Σx).

Proof. We first note that Y = (Y1, …, Yn) and X = (X1, …, Xn) exist such that V11 = YY⊤, V12 = YX⊤, V22 = XX⊤, and the (Yi, Xi)'s are mutually independently and identically distributed as

  (Yi⊤, Xi⊤)⊤ ∼ N_{p+q}( 0, [ Σy  Σyx ; Σxy  Σx ] ).

Let Z = (Z1, …, Zn) = Y − ΣyxΣx⁻¹X. We note that Z is independent of X, and the Zi's are mutually independently and identically distributed as Zi ∼ Np(0, Σy|x). Then, V11 and V12 may be rewritten as

  V11 = (Z + ΣyxΣx⁻¹X)(Z⊤ + X⊤Σx⁻¹Σxy)
      = ZZ⊤ + ZX⊤Σx⁻¹Σxy + ΣyxΣx⁻¹XZ⊤ + ΣyxΣx⁻¹XX⊤Σx⁻¹Σxy,
  V12 = (Z + ΣyxΣx⁻¹X)X⊤ = ZX⊤ + ΣyxΣx⁻¹XX⊤.      (A.11)

Additionally, V12V22⁻¹V21 may be expressed as

  V12V22⁻¹V21 = (Z + ΣyxΣx⁻¹X)X⊤(XX⊤)⁻¹X(Z⊤ + X⊤Σx⁻¹Σxy)
              = ZX⊤(XX⊤)⁻¹XZ⊤ + ZX⊤Σx⁻¹Σxy + ΣyxΣx⁻¹XZ⊤ + ΣyxΣx⁻¹XX⊤Σx⁻¹Σxy.

Thus, we obtain

  V11.2 = Z{I − X⊤(XX⊤)⁻¹X}Z⊤.      (A.12)
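Identity (A.12) is purely algebraic: it holds for any data matrices and for any matrix in place of ΣyxΣx⁻¹, which makes it easy to verify numerically (a small sketch of ours, taking Σx = Iq so that ΣyxΣx⁻¹ is just a p × q matrix M):

```python
import numpy as np

rng = np.random.default_rng(4)
p, q, n = 4, 2, 10

M = rng.standard_normal((p, q))     # plays the role of Sigma_yx Sigma_x^{-1}
X = rng.standard_normal((q, n))
Z = rng.standard_normal((p, n))     # Sigma_{y|x} = I_p for simplicity
Y = M @ X + Z

V11, V12, V22 = Y @ Y.T, Y @ X.T, X @ X.T
V112 = V11 - V12 @ np.linalg.solve(V22, V12.T)

# (A.12): V_{11.2} = Z (I_n - X^T (X X^T)^{-1} X) Z^T
P = X.T @ np.linalg.solve(X @ X.T, X)
assert np.allclose(V112, Z @ (np.eye(n) - P) @ Z.T)
```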

Because I − X⊤(XX⊤)⁻¹X is idempotent and of rank n − q, a p × (n − q) random matrix U exists such that V11.2 = UU⊤ and U ∼ N_{p,n−q}(0, Σy|x, I_{n−q}). This shows part (1) of Lemma 7. For part (2), note that XX⊤ is complete and sufficient for Σx and that I − X⊤(XX⊤)⁻¹X is ancillary, because XX⊤ ∼ Wq(n, Σx). It follows from Basu's theorem that I − X⊤(XX⊤)⁻¹X is independent of XX⊤. Recalling expression (A.12), V11.2 is independent of V22 = XX⊤. To check the independence between V11.2 and V12, from (A.11) and (A.12), it is sufficient to show that ZX⊤ is independent of Z{I − X⊤(XX⊤)⁻¹X}. Because the two are conditionally mutually independent given X, for measurable sets A ⊂ R^{p×q} and B ⊂ R^{p×n},

  Pr[ZX⊤ ∈ A, Z{I − X⊤(XX⊤)⁻¹X} ∈ B] = E[Pr[ZX⊤ ∈ A, Z{I − X⊤(XX⊤)⁻¹X} ∈ B | X]]
                                      = E[Pr(ZX⊤ ∈ A | X) Pr[Z{I − X⊤(XX⊤)⁻¹X} ∈ B | X]].


Because Z ∼ Np,n(0, Σy|x, In), we see that ZX⊤ | X ∼ Np,q(0, Σy|x, XX⊤); in particular, the conditional distribution of ZX⊤ depends on X only through XX⊤. Using again the fact that I − X⊤(XX⊤)⁻¹X is independent of XX⊤, we note that

  E[Pr(ZX⊤ ∈ A | X) Pr[Z{I − X⊤(XX⊤)⁻¹X} ∈ B | X]] = E{Pr(ZX⊤ ∈ A | X)} E[Pr[Z{I − X⊤(XX⊤)⁻¹X} ∈ B | X]],

which equals Pr(ZX⊤ ∈ A) Pr[Z{I − X⊤(XX⊤)⁻¹X} ∈ B]. For part (3), it follows that V12 | X ∼ Np,q(ΣyxΣx⁻¹XX⊤, Σy|x, XX⊤), because V12 = ZX⊤ + ΣyxΣx⁻¹XX⊤. Because this conditional distribution depends on X only through V22 = XX⊤, the conditional distribution of V12 given V22 is the same. Therefore, the proof of Lemma 7 is complete. □

Proof of Lemma 2. In this proof, we use the notation m = n − q for simplicity. From Lemmas 6 and 7, it follows that E(Sy) = Σy, E(Sy²) = n⁻¹(n + 1)Σy² + n⁻¹tr(Σy)Σy, E[{tr(Sy)}²] = 2n⁻¹tr(Σy²) + {tr(Σy)}², E(Sy|x) = n⁻¹mΣy|x, E(Sy|x²) = n⁻²m(m + 1)Σy|x² + n⁻²m tr(Σy|x)Σy|x, and E[{tr(Sy|x)}²] = 2n⁻²m tr(Σy|x²) + n⁻²m²{tr(Σy|x)}². Thus, we observe that

  p⁻¹E{tr(Sy)} = a1,
  p⁻¹E{tr(Sy|x)} = b1 + O(n⁻¹),
  p⁻¹E{tr(Sy²)} = (1 + 1/n)a2 + (p/n)a1²,
  E[{p⁻¹tr(Sy)}²] = a1² + O(n⁻¹),
  p⁻¹E{tr(Sy|x²)} = b2 + (p/n)b1² + O(n⁻¹) + O(pn⁻²),
  E[{p⁻¹tr(Sy|x)}²] = b1² + O(p⁻¹n⁻¹).      (A.13)
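The Wishart-type moment E(Sy|x) = n⁻¹(n − q)Σy|x used above, i.e., E(V11.2) = (n − q)Σy|x from Lemma 7(1), can be checked by simulation (our sketch, with Σx = Iq and a diagonal Σy|x for simplicity):

```python
import numpy as np

rng = np.random.default_rng(5)
p, q, n, reps = 3, 2, 8, 50_000
Sy_x = np.diag([1.0, 2.0, 0.5])                 # Sigma_{y|x}
M = rng.standard_normal((p, q))                 # Sigma_yx Sigma_x^{-1} with Sigma_x = I_q
L = np.sqrt(np.diag(Sy_x))[:, None]

acc = np.zeros((p, p))
for _ in range(reps):
    X = rng.standard_normal((q, n))
    Y = M @ X + L * rng.standard_normal((p, n))
    V12, V22 = Y @ X.T, X @ X.T
    acc += Y @ Y.T - V12 @ np.linalg.solve(V22, V12.T)

# Lemma 7(1): V_{11.2} ~ W_p(n - q, Sigma_{y|x}), so E(V_{11.2}) = (n - q) Sigma_{y|x}
assert np.allclose(acc / reps, (n - q) * Sy_x, atol=0.2)
```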

These evaluations are used to approximate the terms R0, J11, J12, J22, J10, and J20 given in Lemma 1. For R0 and J11,

  R0 = p⁻¹E[tr{Σy − Ψ(Sy)}²] = a2 − 2a1² + E[{p⁻¹tr(Sy)}²] = a2 − a1² + O(n⁻¹),
  J11 = p⁻¹E[tr{Sy|x − Ψ(Sy|x)}²] = p⁻¹E{tr(Sy|x²)} − E[{p⁻¹tr(Sy|x)}²]
      = b2 − b1² + (p/n)b1² + O(n⁻¹) + O(pn⁻²).

Because Sy|x and SyxSx⁻¹Sxy are mutually independent,

  J12 = p⁻¹E[tr[{Sy|x − Ψ(Sy|x)}{SyxSx⁻¹Sxy − Ψ(SyxSx⁻¹Sxy)}]]
      = p⁻¹tr[E{Sy|x − Ψ(Sy|x)} E{SyxSx⁻¹Sxy − Ψ(SyxSx⁻¹Sxy)}].

Noting that SyxSx⁻¹Sxy − Ψ(SyxSx⁻¹Sxy) = Sy − Ψ(Sy) − Sy|x + Ψ(Sy|x), we have

  J12 = {m/(pn)} tr[{Σy|x − Ψ(Σy|x)}[{Σy − Ψ(Σy)} − (m/n){Σy|x − Ψ(Σy|x)}]]
      = (m/n)(φ11 − a1b1) − (m/n)²(b2 − b1²) = φ11 − a1b1 − b2 + b1² + O(n⁻¹).

Because J22 is expressed as J22 = p⁻¹E tr[{Sy − Ψ(Sy)} − {Sy|x − Ψ(Sy|x)}]², it can be further rewritten as

  J22 = p⁻¹E tr{Sy − Ψ(Sy)}² − 2p⁻¹E tr[{Sy − Ψ(Sy)}{Sy|x − Ψ(Sy|x)}] + J11
      = p⁻¹E tr{Sy − Ψ(Sy)}² − 2(J12 + J11) + J11 = p⁻¹E tr{Sy − Ψ(Sy)}² − 2J12 − J11.

Similarly to J11, we have

  J22 = {(1 + n⁻¹)a2 + (p/n)a1² − a1² + O(n⁻¹)} − 2{φ11 − a1b1 − b2 + b1² + O(n⁻¹)}
        − {b2 − b1² + (p/n)b1² + O(n⁻¹) + O(pn⁻²)}
      = (1 + n⁻¹)a2 − 2φ11 + b2 + (p/n)(a1² − b1²) − (a1 − b1)² + O(n⁻¹) + O(n⁻²p).

For J10, we note that

  J10 = p⁻¹E tr[{Sy|x − Ψ(Sy|x)}{Σy − Ψ(Sy)}] = p⁻¹tr[(m/n){Σy|x − Ψ(Σy|x)}Σy]
      = (m/n)(φ11 − a1b1) = φ11 − a1b1 + O(n⁻¹).

For J20, we note that

  J20 = p⁻¹E tr[{SyxSx⁻¹Sxy − Ψ(SyxSx⁻¹Sxy)}{Σy − Ψ(Sy)}]
      = p⁻¹E tr[{Sy − Ψ(Sy)}Σy] − p⁻¹E tr[{Sy|x − Ψ(Sy|x)}Σy],


which leads to

  J20 = p⁻¹tr[{Σy − Ψ(Σy)}Σy] − {m/(pn)} tr[{Σy|x − Ψ(Σy|x)}Σy]
      = a2 − a1² − (m/n)(φ11 − a1b1) = a2 − a1² − φ11 + a1b1 + O(n⁻¹).

Therefore, the proof of Lemma 2 is complete. □

A.2. Proof of Lemma 4

Because the unbiasedness of the estimators is easy to show, we only prove their consistency. For this purpose, we evaluate the variances of the estimators. For â1 and â2, Srivastava [24] shows that var(â1) = 2a2/(np) and

  var(â2) = {8(n + 2)(n + 3)(n − 1)²/(pn⁵)} a4 + {4(n + 2)(n − 1)/n⁴}(a2² − a4/p),

so that â1 = a1 + Op(n^{−1/2}) and â2 = a2 + Op(n^{−1/2}p), because a1 = O(1) but a2 = O(p) and a4 = O(p³). Similarly, using Lemmas 6 and 7, we can show that var(b̂1) = 2b2/(np) and

  var(b̂2) = {8(m + 2)(m + 3)(m − 1)²/(pn⁵)} b4 + {4(m + 2)(m − 1)/n⁴}(b2² − b4/p),

so that b̂1 = b1 + Op{(np)^{−1/2}} and b̂2 = b2 + Op(n⁻¹) + Op{(np)^{−1/2}}, because b1 = O(1), b2 = O(1) and b4 = O(1). We write φ̂11 as

  φ̂11 − φ11 = [1/{pnm(m − 1)(m + 2)}] [ m²{tr(V11.2²) − m((m + 1)tr(Σ11.2²) + (trΣ11.2)²)}
               − m{(trV11.2)² − m(m(trΣ11.2)² + 2tr(Σ11.2²))}
               + (m − 1)(m + 2){tr(V12V22⁻¹V21V11.2) − (nm tr(Σ11Σ11.2) − m² tr(Σ11.2²))} ]
             =: [1/{pnm(m − 1)(m + 2)}] {m²I1 − mI2 + (m − 1)(m + 2)I3},      (A.14)

where E(I1) = E(I2) = E(I3) = 0. Hence, it is sufficient to evaluate the variances of I1, I2, and I3, because the cross-product terms are bounded via the Cauchy–Schwarz inequality. For I1, E(I1²) = E[{tr(V11.2²)}²] − m²{(m + 1)tr(Σ11.2²) + (trΣ11.2)²}². Because, from Lemma 6,

  E[{tr(V11.2²)}²] = 4m(2m² + 5m + 5)tr(Σ11.2⁴) + 16m(m + 1)tr(Σ11.2³)tr(Σ11.2) + m(m³ + 2m² + 5m + 4){tr(Σ11.2²)}²
                     + 2m(m² + m + 4)tr(Σ11.2²){tr(Σ11.2)}² + m²{tr(Σ11.2)}⁴,

we have

  E(I1²) = 4m(2m² + 5m + 5)tr(Σ11.2⁴) + 16m(m + 1)tr(Σ11.2³)tr(Σ11.2) + 4m(m + 1){tr(Σ11.2²)}² + 8m tr(Σ11.2²){tr(Σ11.2)}².

Hence, from conditions (A2), (A3) and a2 = O(p),

  I1 = Op(n^{3/2}p^{1/2}) + Op(np) + Op(n^{1/2}p^{3/2}).      (A.15)

Additionally, for I2, E(I2²) = E[{tr(V11.2)}⁴] − m²{m(trΣ11.2)² + 2tr(Σ11.2²)}². From Lemma 6,

  E[{tr(V11.2)}⁴] = 48m tr(Σ11.2⁴) + 32m² tr(Σ11.2³)tr(Σ11.2) + 12m²{tr(Σ11.2²)}² + 12m³ tr(Σ11.2²){tr(Σ11.2)}² + m⁴{tr(Σ11.2)}⁴,

so that

  E(I2²) = 48m tr(Σ11.2⁴) + 32m² tr(Σ11.2³)tr(Σ11.2) + 8m²{tr(Σ11.2²)}² + 8m³ tr(Σ11.2²){tr(Σ11.2)}².

Therefore, I2 = Op(n^{1/2}p^{1/2}) + Op(np) + Op(n^{3/2}p^{3/2}). Finally, for I3,

  tr(V12V22⁻¹V21V11.2) = tr(YY⊤A) = tr{(Y − θ + θ)(Y − θ + θ)⊤A}
                       = tr{(Y − θ)(Y − θ)⊤A} + tr(θθ⊤A) + 2tr{(Y − θ)θ⊤A},      (A.16)


where Y = Σ11.2^{−1/2} V12 V22^{−1/2}, θ = Σ11.2^{−1/2} Σ12 Σ22⁻¹ V22^{1/2} and A = Σ11.2^{1/2} V11.2 Σ11.2^{1/2}. Then, I3 is decomposed as

  I3 = {tr((Y − θ)(Y − θ)⊤A) − qm tr(Σ11.2²)}
       + {tr(θθ⊤A) − nm tr(Σ12Σ22⁻¹Σ21Σ11.2)} + 2tr((Y − θ)θ⊤A)
     =: I3,1 + I3,2 + 2I3,3,

where E(I3,1) = E(I3,2) = E(I3,3) = 0. Noting that (Y − θ) | V22 ∼ Np,q(0, Ip, Iq), from Stein's lemma,

  E[{tr((Y − θ)(Y − θ)⊤A)}²] = E tr[∇Y{tr((Y − θ)(Y − θ)⊤A)}(Y − θ)⊤A]
    = qE{tr(A) tr((Y − θ)(Y − θ)⊤A)} + E[{∇Y⊤ tr((Y − θ)(Y − θ)⊤A)} A⊤(Y − θ)]
    = q²E{(trA)²} + 2E tr{A⊤(Y − θ)(Y − θ)⊤A}.

Because Y is independent of A, using Lemma 6, E[{tr((Y − θ)(Y − θ)⊤A)}²] is evaluated as

  q²E[{tr(Σ11.2V11.2)}²] + 2qE tr{(Σ11.2V11.2)²}
    = q²{2m tr(Σ11.2⁴) + m²(trΣ11.2²)²} + 2q{m(m + 1)tr(Σ11.2⁴) + m(trΣ11.2²)²}
    = qm(2 + qm)(trΣ11.2²)² + 2qm(n + 1)tr(Σ11.2⁴),

so that, by (A3),

  E(I3,1²) = E[{tr((Y − θ)(Y − θ)⊤A)}²] − q²m²(trΣ11.2²)²
           = 2qm(trΣ11.2²)² + 2qm(n + 1)tr(Σ11.2⁴) = O(np²) + O(n²p).

Hence,

  I3,1 = Op(n^{1/2}p) + Op(np^{1/2}).      (A.17)

For I3,2,

  E(I3,2²) = E[{tr(θθ⊤A)}²] − n²m²{tr(Σ12Σ22⁻¹Σ21Σ11.2)}²
           = 2nm{tr(Σ12Σ22⁻¹Σ21Σ11.2)}² + 2nm(2m + 1)tr{(Σ12Σ22⁻¹Σ21Σ11.2)²}
           = O(n²p²) + O(n³p²),

which implies that

  I3,2 = Op(np) + Op(n^{3/2}p).      (A.18)

For I3,3, using Stein's lemma again,

  E(I3,3²) = E[{tr((Y − θ)θ⊤A)}²] = E tr[∇Y⊤ Aθ tr((Y − θ)θ⊤A)]
           = E tr[θ⊤A ∇Y tr((Y − θ)θ⊤A)] = E tr(θ⊤A²θ) = E tr(V22 Σ22⁻¹ Σ21 V11.2² Σ12 Σ22⁻¹),

which is evaluated as

  nm(m + 1)tr(Σ12Σ22⁻¹Σ21Σ11.2²) + nm tr(Σ11.2) tr(Σ12Σ22⁻¹Σ21Σ11.2) = O(n³p) + O(n²p²),

so that

  I3,3 = Op(n^{3/2}p^{1/2}) + Op(np).      (A.19)

Thus, combining (A.14)–(A.19) gives φ̂11 − φ11 = Op(n^{−1/2}). Therefore, the proof of Lemma 4 is complete. □
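As a numerical illustration of the variance formula var(â1) = 2a2/(np) used at the beginning of this proof (a sketch of ours, assuming â1 = p⁻¹tr(S) with mean-zero data and S = n⁻¹XX⊤, which may differ in minor conventions from the exact estimator of Srivastava [24]):

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, reps = 20, 50, 20_000
lam = np.linspace(0.5, 2.0, p)                 # eigenvalues of a diagonal Sigma
a1, a2 = lam.mean(), (lam ** 2).mean()         # a_i = p^{-1} tr(Sigma^i)

a1_hat = np.empty(reps)
for r in range(reps):
    X = np.sqrt(lam)[:, None] * rng.standard_normal((p, n))
    a1_hat[r] = (X ** 2).sum() / (n * p)       # p^{-1} tr(S), since tr(XX^T) = sum of squares

assert abs(a1_hat.mean() - a1) < 0.01                                   # unbiasedness
assert abs(a1_hat.var() - 2 * a2 / (n * p)) / (2 * a2 / (n * p)) < 0.1  # var = 2 a_2 / (np)
```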

A.3. Proof of Theorem 2

Using Lemmas 4 and 5, we can show that, under non-sparsity a2 = O(p),

  Ĵ10 = J10 + Op(n^{−1/2}),   Ĵ20 = J20 + Op(n^{−1/2}p),
  Ĵ11 = J11 + Op{(np)^{−1/2}} + Op(n⁻¹) + Op(n^{−3/2}p^{1/2}),
  Ĵ22 = J22 + Op(n^{−1/2}p),   Ĵ12 = J12 + Op(n^{−1/2}),


and, under sparsity a2 = O(1),

  Ĵ10 = J10 + Op{(np)^{−1/2}},   Ĵ20 = J20 + Op{(np)^{−1/2}} + Op(n⁻¹),
  Ĵ11 = J11 + Op{(np)^{−1/2}} + Op(n⁻¹) + Op(n^{−3/2}p^{1/2}),
  Ĵ22 = J22 + Op{(np)^{−1/2}} + Op(n⁻¹) + Op(n^{−3/2}p^{1/2}),
  Ĵ12 = J12 + Op{(np)^{−1/2}} + Op(n⁻¹).

We note that J20 = O(p) and J22 = O(p) under a2 = O(p), while J20 = O(1) and J22 = O(1) + O(p/n) under a2 = O(1). Regarding the other terms, we have J10 = O(1), J11 = O(1) + O(p/n) and J12 = O(1). These approximations show the consistency of ŵ, γ̂, α̂, and β̂. Concerning the consistency of ŵ, under a2 = O(p), ŵ − w* can be evaluated as

  ŵ − w* = (J10 − J12 + B)/(J11 + A) − (J10 − J12)/J11 = {J11B − (J10 − J12)A}/{J11(J11 + A)}
         = [{Op(1) + Op(p/n)}Op(n^{−1/2}) − Op(1)[Op{(np)^{−1/2}} + Op(n⁻¹) + Op(n^{−3/2}p^{1/2})]]
           / [{Op(1) + Op(p/n)}{Op(1) + Op(p/n) + Op(n^{−1/2}p^{−1/2}) + Op(n⁻¹) + Op(n^{−3/2}p^{1/2})}]
         = {Op(n^{−1/2}) + Op(n^{−3/2}p)} / {Op(1) + Op(p/n) + Op(n^{−3/2}p^{1/2}) + Op(n⁻²p²) + O(n^{−5/2}p^{3/2})}
         = Op(n^{−1/2})   when p/n → 0 or c,
           Op(n^{1/2}/p)  when p/n → ∞,

where A and B denote the remainder terms such that Ĵ10 − Ĵ12 = J10 − J12 + B and Ĵ11 = J11 + A.

where A and B denote the remainder terms such that Jˆ10 − Jˆ12 = J10 − J12 + B and Jˆ11 = J11 + A. Under a2 = O(1), on the other hand, we have

{Op (1) + O(p/n)}[Op {(np)−1/2 } + Op (n−1 )] + O(n−3/2 p1/2 ) {Op (1) + Op (p/n)}{Op (1) + Op (p/n) + Op (n−3/2 p1/2 )} O {(np)−1/2 } + Op (n−1 ) when p/n → 0 or c , = p −1 Op (p ) when p/n → ∞.

w ˆ − w∗ =

Concerning αˆ and βˆ , 2 αˆ = (Jˆ22 Jˆ10 − Jˆ12 Jˆ20 )/(Jˆ11 Jˆ22 − Jˆ12 ), 2 ˆβ = (Jˆ11 Jˆ20 − Jˆ12 Jˆ10 )/(Jˆ11 Jˆ22 − Jˆ12 ).

Under $a_2 = O(p)$, we have
$$\hat J_{11}\hat J_{22} - \hat J_{12}^2 = J_{11}J_{22} - J_{12}^2 + O_p(n^{-1}p^{1/2}) + O_p(n^{-1/2}p) + O_p(n^{-3/2}p^2),$$
$$\hat J_{22}\hat J_{10} - \hat J_{12}\hat J_{20} = J_{22}J_{10} - J_{12}J_{20} + O_p(n^{-1/2}p),$$
$$\hat J_{11}\hat J_{20} - \hat J_{12}\hat J_{10} = J_{11}J_{20} - J_{12}J_{10} + O_p(n^{-1}p^{1/2}) + O_p(n^{-1/2}p) + O_p(n^{-3/2}p^2) + O_p(n^{-2}p^{3/2}).$$
Also, under $a_2 = O(1)$,
$$\hat J_{11}\hat J_{22} - \hat J_{12}^2 = J_{11}J_{22} - J_{12}^2 + O_p\{(np)^{-1/2}\} + O_p(n^{-1}) + O_p(n^{-3/2}p^{1/2}) + O_p(n^{-2}p) + O_p(n^{-5/2}p^{3/2}),$$
$$\hat J_{22}\hat J_{10} - \hat J_{12}\hat J_{20} = J_{22}J_{10} - J_{12}J_{20} + O_p\{(np)^{-1/2}\} + O_p(n^{-1}) + O_p(n^{-3/2}p^{1/2}),$$
$$\hat J_{11}\hat J_{20} - \hat J_{12}\hat J_{10} = J_{11}J_{20} - J_{12}J_{10} + O_p\{(np)^{-1/2}\} + O_p(n^{-1}) + O_p(n^{-3/2}p^{1/2}) + O_p(n^{-2}p).$$
Based on these approximations, it follows under $a_2 = O(p)$ that

$$\hat\alpha - \alpha^* = \frac{O_p(n^{-1/2}) + O_p(n^{-3/2}p)}{\{O_p(1) + O_p(n^{-1}p) + O_p(n^{-3/2}p^{1/2})\}\{O_p(1) + O_p(n^{-1}p)\}} = \begin{cases} O_p(n^{-1/2}) & (p/n \to 0 \text{ or } c), \\ O_p(n^{1/2}/p) & (p/n \to \infty). \end{cases}$$
Under $a_2 = O(1)$, on the other hand, we have
$$\hat\alpha - \alpha^* = \frac{\{O_p(1) + O_p(n^{-2}p^2)\}[O_p\{(np)^{-1/2}\} + O_p(n^{-1}) + O_p(n^{-3/2}p^{1/2})]}{\{O_p(1) + O_p(n^{-2}p^2)\}^2 + O_p\{(np)^{-1/2}\} + O_p(n^{-1})} = \begin{cases} O_p\{(np)^{-1/2}\} + O_p(n^{-1}) & \text{when } p/n \to 0 \text{ or } c, \\ O_p(n^{1/2}p^{-3/2}) & \text{when } p/n \to \infty. \end{cases}$$


Similarly, $\hat\beta - \beta^* = O_p(n^{-1/2})$ under $a_2 = O(p)$ when $p/n \to 0$, $c$, or $\infty$, and under $a_2 = O(1)$,
$$\hat\beta - \beta^* = \begin{cases} O_p\{(np)^{-1/2}\} + O_p(n^{-1}) & (p/n \to 0 \text{ or } c), \\ O_p(p^{-1}) & (p/n \to \infty). \end{cases}$$

Concerning $\hat\gamma$, we can approximate $\hat J_{10} + \hat J_{20}$ and $\hat J_{11} + 2\hat J_{12} + \hat J_{22}$ by $J_{10} + J_{20} + O_p(n^{-1/2}p)$ and $J_{11} + 2J_{12} + J_{22} + O_p\{(np)^{-1/2}\} + O_p(n^{-1/2}p) + O_p(n^{-3/2}p^{1/2})$ under $a_2 = O(p)$, and by $J_{10} + J_{20} + O_p\{(np)^{-1/2}\} + O_p(n^{-1})$ and $J_{11} + 2J_{12} + J_{22} + O_p\{(np)^{-1/2}\} + O_p(n^{-1}) + O_p(n^{-3/2}p^{1/2})$ under $a_2 = O(1)$. Using these approximations, we obtain the results given in Theorem 2. The details are omitted because they can be verified similarly.

A.4. Proof of Theorem 3

We first evaluate the loss function $p^{-1}\mathrm{tr}\{(\delta_{\hat w} - \Sigma_y)^2\}$. Because $\delta_{\hat w} - \Sigma_y = (S_y - \Sigma_y) - (1 - \hat w)\{S_{y|x} - \Psi(S_{y|x})\}$, the loss can be expanded as
$$\mathrm{Loss}(\delta_{\hat w}) = \mathrm{Loss}(\delta_{w^*}) + (\hat w - w^*)\{(\hat w - w^*) + 2(w^* - 1)\}\, p^{-1}\mathrm{tr}\{S_{y|x} - \Psi(S_{y|x})\}^2 + 2(\hat w - w^*)\, p^{-1}\mathrm{tr}[\{S_{y|x} - \Psi(S_{y|x})\}(S_y - \Sigma_y)].$$
Using (A.1) and (A.2) in Lemma 6, we show that $E[p^{-1}\mathrm{tr}\{S_{y|x} - \Psi(S_{y|x})\}^2] = O(1) + O(n^{-1}p)$, which means that $p^{-1}\mathrm{tr}\{S_{y|x} - \Psi(S_{y|x})\}^2 = O_p(1) + O_p(n^{-1}p)$. We note that
$$[p^{-1}\mathrm{tr}\{S_{y|x} - \Psi(S_{y|x})\}(S_y - \Sigma_y)]^2 \le [p^{-1}\mathrm{tr}\{S_{y|x} - \Psi(S_{y|x})\}^2]\{p^{-1}\mathrm{tr}(S_y - \Sigma_y)^2\}.$$
Because $E\{p^{-1}\mathrm{tr}(S_y - \Sigma_y)^2\} = a_2/n + (p/n)a_1^2 = O(p/n)$ in the two cases $a_2 = O(p)$ and $a_2 = O(1)$, we have $p^{-1}\mathrm{tr}[\{S_{y|x} - \Psi(S_{y|x})\}(S_y - \Sigma_y)] = O_p\{(p/n)^{1/2}\} + O_p(p/n)$. Hence, it follows from Theorem 2 that $\mathrm{Loss}(\delta_{\hat w}) - \mathrm{Loss}(\delta_{w^*}) = O_p(n^{-1/2})$ under $a_2 = O(p)$, and under $a_2 = O(1)$,
$$\mathrm{Loss}(\delta_{\hat w}) - \mathrm{Loss}(\delta_{w^*}) = \begin{cases} O_p\{(np)^{-1/2}\} + O_p(n^{-1}) & \text{when } p/n \to 0 \text{ or } c, \\ O_p(n^{-1}) & \text{when } p/n \to \infty. \end{cases}$$
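The bound on the cross trace term in this proof is the Cauchy–Schwarz inequality for the trace (Frobenius) inner product, $[p^{-1}\mathrm{tr}(AB)]^2 \le \{p^{-1}\mathrm{tr}(A^2)\}\{p^{-1}\mathrm{tr}(B^2)\}$ for symmetric $A, B$. A quick numerical sanity check, with random symmetric matrices standing in for the centered statistics (purely illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5

# Random symmetric matrices standing in for the centered statistics
# S_{y|x} - Psi(S_{y|x}) and S_y - Sigma_y.
A = rng.standard_normal((p, p)); A = (A + A.T) / 2.0
B = rng.standard_normal((p, p)); B = (B + B.T) / 2.0

# Cauchy-Schwarz for the trace inner product <A, B> = tr(AB):
lhs = (np.trace(A @ B) / p) ** 2
rhs = (np.trace(A @ A) / p) * (np.trace(B @ B) / p)
print(lhs <= rhs)  # True for any symmetric A and B
```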

Concerning the loss of $\delta_{\hat\alpha,\hat\beta}$, we note that $\delta_{\hat\alpha,\hat\beta} = S_y + (\hat\beta - 1)\{S_y - \Psi(S_y)\} + (\hat\alpha - \hat\beta)\{S_{y|x} - \Psi(S_{y|x})\}$, which is further rewritten as $\delta_{\hat\alpha,\hat\beta} = \delta_{\alpha^*,\beta^*} + G$, where
$$G = (\hat\beta - \beta^*)\{S_y - \Psi(S_y)\} + \{(\hat\alpha - \alpha^*) - (\hat\beta - \beta^*)\}\{S_{y|x} - \Psi(S_{y|x})\}.$$
Hence, the loss function of $\delta_{\hat\alpha,\hat\beta}$ is expanded as $\mathrm{Loss}(\delta_{\hat\alpha,\hat\beta}) - \mathrm{Loss}(\delta_{\alpha^*,\beta^*}) = p^{-1}\mathrm{tr}(G^2) + 2p^{-1}\mathrm{tr}\{(\delta_{\alpha^*,\beta^*} - \Sigma_y)G\}$, where
$$p^{-1}\mathrm{tr}(G^2) = (\hat\beta - \beta^*)^2\, p^{-1}\mathrm{tr}\{S_y - \Psi(S_y)\}^2 + \{(\hat\alpha - \alpha^*) - (\hat\beta - \beta^*)\}^2\, p^{-1}\mathrm{tr}\{S_{y|x} - \Psi(S_{y|x})\}^2 + 2(\hat\beta - \beta^*)\{(\hat\alpha - \alpha^*) - (\hat\beta - \beta^*)\}\, p^{-1}\mathrm{tr}[\{S_y - \Psi(S_y)\}\{S_{y|x} - \Psi(S_{y|x})\}].$$
We note from (A.13) that
$$E[p^{-1}\mathrm{tr}\{S_y - \Psi(S_y)\}^2] = (1 + n^{-1})a_2 + (p/n)a_1^2 - a_1^2 + O(n^{-1}) = \begin{cases} O(p) & \text{for } a_2 = O(p), \\ O(1) + O(p/n) & \text{for } a_2 = O(1). \end{cases}$$
Hence, it follows from Theorem 2 that $p^{-1}\mathrm{tr}(G^2) = O_p(p/n)$ under $a_2 = O(p)$, and that $p^{-1}\mathrm{tr}(G^2) = O_p\{(np)^{-1}\}$ under $a_2 = O(1)$. Because $\mathrm{Loss}(\delta_{\alpha^*,\beta^*}) = O_p(p/n)$, the term $p^{-1}\mathrm{tr}\{(\delta_{\alpha^*,\beta^*} - \Sigma_y)G\}$ has order $O_p(p/n)$ for $a_2 = O(p)$ and order $O_p(n^{-1})$ for $a_2 = O(1)$. Thus, under $a_2 = O(p)$, we have $\mathrm{Loss}(\delta_{\hat\alpha,\hat\beta}) - \mathrm{Loss}(\delta_{\alpha^*,\beta^*}) = O_p(p/n)$. Additionally, under $a_2 = O(1)$, we have $\mathrm{Loss}(\delta_{\hat\alpha,\hat\beta}) - \mathrm{Loss}(\delta_{\alpha^*,\beta^*}) = O_p\{(np)^{-1}\} + O_p(n^{-1}) = O_p(n^{-1})$.

Concerning the loss of $\delta_{\hat\gamma}$, we have $\delta_{\hat\gamma} = \delta_{\gamma^*} + (\hat\gamma - \gamma^*)\{S_y - \Psi(S_y)\}$. Using the same arguments, we obtain the results given in Theorem 3, which is thus proved.
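The practical content of these rate results is that a consistent plug-in weight loses little relative to the oracle weight. Below is a simplified Monte Carlo sketch of the weighted-estimator setup, shrinking the sample covariance toward a scaled identity rather than the paper's factor-model-based target; the target, dimensions, and true covariance here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 20

# Hypothetical true covariance with a moderate eigenvalue spread.
Sigma = np.diag(np.linspace(1.0, 3.0, p))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S = X.T @ X / n  # sample covariance (mean known to be zero)

# Shrinkage target: scaled identity mu*I with mu = tr(S)/p
# (a Ledoit-Wolf-style target used here in place of the factor-model target).
mu = np.trace(S) / p
T = mu * np.eye(p)

def loss(w):
    """Normalized Frobenius loss p^{-1} tr{(w S + (1-w) T - Sigma)^2}."""
    D = w * S + (1.0 - w) * T - Sigma
    return np.trace(D @ D) / p

# "Oracle" weight: minimize the loss over a grid (the loss is quadratic in w).
grid = np.linspace(0.0, 1.0, 201)
losses = [loss(w) for w in grid]
w_star = grid[int(np.argmin(losses))]

# The optimally weighted combination does at least as well as either extreme.
print(loss(w_star) <= min(loss(0.0), loss(1.0)))  # True by construction
```

Repeating this over many replications would show the gap between the plug-in and oracle losses shrinking at the rates stated in Theorem 3.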

References

[1] P. Bickel, E. Levina, Covariance regularization by thresholding, Ann. Statist. 36 (2008) 2577–2604.
[2] P. Bickel, E. Levina, Regularized estimation of large covariance matrices, Ann. Statist. 36 (2008) 199–227.
[3] T. Cai, W. Liu, Adaptive thresholding for sparse covariance matrix estimation, J. Amer. Statist. Assoc. 106 (2011) 672–684.
[4] T. Cai, H. Zhou, Optimal rates of convergence for sparse covariance matrix estimation, Ann. Statist. 40 (2012) 2389–2420.
[5] G. Chamberlain, M. Rothschild, Arbitrage, factor structure and mean–variance analysis in large asset markets, Econometrica 51 (1983) 1305–1324.
[6] Y. Chen, A. Wiesel, Y.C. Eldar, A.O. Hero, Shrinkage algorithms for MMSE covariance estimation, IEEE Trans. Signal Process. 58 (2010) 5016–5029.
[7] V. DeMiguel, L. Garlappi, F.J. Nogales, R. Uppal, A generalized approach to portfolio optimization: Improving performance by constraining portfolio norms, Manage. Sci. 55 (2009) 798–812.
[8] E. Fama, K. French, Common risk factors in the returns on stocks and bonds, J. Financ. Econ. 33 (1993) 3–56.
[9] J. Fan, Y. Fan, J. Lv, High-dimensional covariance matrix estimation using a factor model, J. Econometrics 147 (2008) 186–197.
[10] J. Fan, Y. Liao, M. Mincheva, High-dimensional covariance matrix estimation in approximate factor models, Ann. Statist. 39 (2011) 3320–3356.
[11] J. Fan, Y. Liao, M. Mincheva, Large covariance estimation by thresholding principal orthogonal complements, J. R. Stat. Soc. Ser. B 75 (2013) 603–680.
[12] T.J. Fisher, X. Sun, Improved Stein-type shrinkage estimators for the high-dimensional multivariate normal covariance matrix, Comput. Statist. Data Anal. 55 (2011) 1909–1918.
[13] L.R. Haff, An identity for the Wishart distribution with applications, J. Multivariate Anal. 9 (1979) 531–542.
[14] L.R. Haff, Solutions of the Euler–Lagrange equations for certain multivariate normal estimation problems, Unpublished manuscript, 1982.
[15] Y. Ikeda, T. Kubokawa, Linear shrinkage estimation of large covariance matrices with use of factor models, Discussion Paper Series, CIRJE-F-958, Faculty of Economics, University of Tokyo, 2015.
[16] Y. Konno, Shrinkage estimators for large covariance matrices in multivariate real and complex normal distributions under an invariant quadratic loss, J. Multivariate Anal. 100 (2009) 2237–2253.
[17] O. Ledoit, M. Wolf, Improved estimation of the covariance matrix of stock returns with an application to portfolio selection, J. Empir. Finance 10 (2003) 603–621.
[18] O. Ledoit, M. Wolf, A well-conditioned estimator for large-dimensional covariance matrices, J. Multivariate Anal. 88 (2004) 365–411.
[19] O. Ledoit, M. Wolf, Nonlinear shrinkage estimation of large-dimensional covariance matrices, Ann. Statist. 40 (2012) 1024–1060.
[20] O. Ledoit, M. Wolf, Spectrum estimation: A unified framework for covariance matrix estimation and PCA in large dimensions, J. Multivariate Anal. 139 (2015) 360–384.
[21] Y. Ren, K. Shimotsu, Improvement in finite sample properties of the Hansen–Jagannathan distance test, J. Empir. Finance 16 (2009) 483–506.
[22] A. Rothman, E. Levina, J. Zhu, Generalized thresholding of large covariance matrices, J. Amer. Statist. Assoc. 104 (2009) 177–186.
[23] J. Schafer, K. Strimmer, An empirical Bayes approach to inferring large-scale gene association networks, Bioinformatics 21 (2005) 754–764.
[24] M.S. Srivastava, Some tests concerning the covariance matrix in high-dimensional data, J. Japan Statist. Soc. 35 (2005) 251–272.
[25] C. Stein, Estimation of the mean of a multivariate normal distribution, in: Proc. Prague Symp. Asymptotic Statist., 1973, pp. 345–381.
[26] C. Stein, Estimation of the mean of a multivariate normal distribution, Ann. Statist. 9 (1981) 1135–1151.
[27] A. Touloumis, Nonparametric Stein-type shrinkage covariance matrix estimators in high-dimensional settings, Comput. Statist. Data Anal. 83 (2015) 251–261.
[28] Y. Watamori, On the moments of traces of Wishart and inverted Wishart matrices, South African Statist. J. 24 (1990) 153–176.