Dynamic treatment assignment and evaluation of active labor market policies

Accepted Manuscript

Dynamic Treatment Assignment and Evaluation of Active Labor Market Policies
Johan Vikström

PII: S0927-5371(16)30153-1
DOI: 10.1016/j.labeco.2017.09.003
Reference: LABECO 1587

To appear in: Labour Economics

Received date: 30 September 2016
Revised date: 8 September 2017
Accepted date: 11 September 2017

Please cite this article as: Johan Vikström, Dynamic Treatment Assignment and Evaluation of Active Labor Market Policies, Labour Economics (2017), doi: 10.1016/j.labeco.2017.09.003

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights
• I consider evaluation with dynamic treatment assignment in a survival time setting.
• I provide identification results and provide a new estimator.
• The results cover both the case with one single treatment and sequences of programs.
• The application uses Swedish data to study a training program and a work practice program.

Dynamic Treatment Assignment and Evaluation of Active Labor Market Policies∗

Johan Vikström†

September 12, 2017

Abstract

This paper considers treatment evaluation in a discrete time setting in which treatment can start at any point in time. We consider evaluation under unconfoundedness and propose a dynamic inverse probability weighting estimator. A typical application is an active labor market program that can start after any elapsed unemployment duration. The identification and estimation results concern both cases with one single treatment as well as sequences of programs.

The new estimator is applied to Swedish data on participants in a training program and a work practice program. The work practice program increases re-employment rates. Most sequences of the two programs are inefficient when compared to one single program episode.

Keywords: Program evaluation; treatment effects; work practice; training; survival time

JEL classification: C21, J18



∗ I am grateful for helpful suggestions from Gerard J. van den Berg, Gregory Jolivet, Per Johansson, Bas van der Klaauw, Martin Huber, Helena Holmlund, Oskar Nordström Skans, Ingeborg Waernbaum and seminar participants at IFAU Uppsala, the 4th IZA/IFAU conference, IFAU, SOFI and EALE 2016. Financial support from the Jan Wallander and Tom Hedelius Foundation and FORTE is acknowledged.
† IFAU, Uppsala, Box 513, 751 20 Uppsala, Sweden and UCLS, Uppsala University, Department of Economics, [email protected]

1 Introduction

A common feature of many active labor market programs (ALMPs) is that they can start after any elapsed unemployment duration. This dynamic nature of the treatment assignment raises several methodological issues. The main issue is that currently non-treated individuals may become treated later on. The implication is that unconfoundedness-based methods that use static treatment status, defined as enrollment in treatment before exiting from unemployment, are no longer valid (see discussions in e.g., Sianesi 2004, Fredriksson and Johansson 2008, Crépon et al. 2009, Lechner et al. 2011). The reason is that static treatment status depends on survival time (i.e., the outcome), since the probability of treatment enrollment by construction increases with time in unemployment, and this confounds any analysis based solely on static treatment indicators.
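The confounding created by static treatment indicators can be seen in a small simulation. The Python sketch below is illustrative only and is not taken from the paper: the data-generating process (geometric exit and enrollment times, with the hypothetical parameters exit_prob and treat_prob) is an assumption, and the program has no effect on exits by construction. Even so, spells classified as treated under the static definition come out longer on average, simply because longer spells leave more time to enroll.

```python
import numpy as np

def static_comparison(n=200_000, exit_prob=0.2, treat_prob=0.1, seed=0):
    """Mean unemployment duration by static treatment status when the
    program has no effect at all (hypothetical data-generating process)."""
    rng = np.random.default_rng(seed)
    # Period in which each spell ends; same distribution for everyone,
    # so treatment cannot have any causal effect here.
    duration = rng.geometric(exit_prob, size=n)
    # Period in which the program would start if the spell is still ongoing.
    planned_start = rng.geometric(treat_prob, size=n)
    # Static treatment status: enrolled before exiting unemployment.
    treated = planned_start <= duration
    return duration[treated].mean(), duration[~treated].mean()

mean_treated, mean_untreated = static_comparison()
print(f"treated: {mean_treated:.2f}, never enrolled: {mean_untreated:.2f}")
```

A naive comparison of the two means would wrongly suggest that the program prolongs unemployment, which is the problem the dynamic approaches discussed next are designed to avoid.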

As a response, several papers explicitly address this dynamic treatment assignment problem. Sianesi (2004) proposes to transform the dynamic problem into a static problem by focusing on the effect of treatment now versus waiting for treatment. Applications of this approach include Sianesi (2008), Fitzenberger et al. (2008) and Biewen et al. (2014). Several other papers focus on the average effect of treatment after some elapsed duration compared with never receiving treatment, and this is also the average effect considered in this paper. Both Fredriksson and Johansson (2008) and Crépon et al. (2009) utilize the outcomes of the not-yet treated to obtain the counterfactual outcome under never receiving treatment. In a related paper, Kastoryano and van der Klaauw (2011) compare different evaluation approaches in a dynamic setting. Other influential studies include Lechner (1999), Gerfin and Lechner (2002) and Lechner et al. (2011). In particular, Lechner et al. (2011) evaluate the long-run effects of a training program and discuss the static evaluation model in a case in which the sample size is too small to use the methods in Fredriksson and Johansson (2008) and Sianesi (2008), adding to the literature on evaluation under dynamic treatment assignment.

This paper contributes to the literature on evaluation in cases with dynamic treatment assignment. We consider a setting in which the population of interest is a cohort of units that are in an initial state and the outcome of interest is the survival time in the initial state. In evaluations of ALMPs, the initial state is typically unemployment. The key feature is that exits out of the initial state and the start of treatment are allowed to occur at any point in time. Besides ALMPs, an important example of this setting is a medical treatment implemented at various times after the onset of a disease. For this setting we consider identification and estimation under the assumption of sequential unconfoundedness (selection on observables). That is, conditional on covariates, treatment assignment among the non-treated survivors is unrelated to future potential outcomes. In some settings this is a restrictive assumption, but it is more likely to hold in cases with detailed administrative data and/or survey data. Below we discuss related approaches for cases where unconfoundedness does not hold.

This paper contributes in several ways. One contribution is that we consider identification and estimation with selection on time-variant covariates. This may be important for evaluations of active labor market programs and other policies, since it is likely that selection to the program not only depends on characteristics measured at the beginning of the unemployment, but also on characteristics like marital status that may change during the unemployment spell. Allowing for selection on time-varying covariates extends the identification results for selection on time-invariant covariates in Crépon et al. (2009).

Another contribution is to provide a dynamic inverse probability weighting (DIPW) estimator for the average treatment effect on the treated for treatment in a certain period against no treatment now or thereafter. One advantage of the DIPW approach is that once the scores forming the weights have been estimated no additional functional form assumptions are needed. Recent applications of the new methods proposed in this paper include van den Berg et al. (2015) and Albanese et al. (2015). The finite sample properties of the DIPW estimator are compared with those of the two-step matching estimator in Fredriksson and Johansson (2008) and the blocking estimator in Crépon et al. (2009) in a Monte Carlo simulation. The estimator allows for selection on time-variant covariates.

As an illustration of the DIPW estimator consider the average effect on the treated at t. The survival rate under treatment is obtained directly from those actually treated at t. The counterfactual exit rate under no treatment at t is estimated by weighting the outcomes of the not-yet treated at t in order to mimic the distribution of the confounders in the population of treated at t. In subsequent periods, some of the not-yet treated at t are treated, and this creates selective censoring in the group of not-yet treated. However, under unconfoundedness the weights at t + 1 correct for this selective censoring, so that the DIPW estimator gives the desired exit rate at t + 1. Similar re-weighting occurs in subsequent periods. An estimator for the average effect averaged over all pre-treatment durations is provided.

We also extend the results to allow sequences of treatments. In an ALMP setting, this is important, since many unemployed workers enroll in several programs during a single unemployment spell.¹ As for the analysis of a single treatment we focus on a survival time setting in which the outcome of interest is a transition from an initial state to a destination state. We study both treatment sequences defined by sequences of the same treatment and sequences defined by multiple types of programs. Specifically, we examine identification and DIPW estimation of the difference in survival under two treatment sequences, and show that this parameter is identified under a generalization of the sequential unconfoundedness assumption for cases with a single treatment.

The analysis of sequences of treatments in a survival time setting is closely related to previous papers on sequences of treatments. In particular, Lechner (2008, 2009) and Lechner and Miquel (2010) develop a seminal framework for causal analysis of sequences of treatments, and propose and implement matching and inverse probability weighting estimators for various average effects.² This causal model assumes a setting with discrete periods for which treatments, confounders and outcomes are observed in all periods. The outcome of interest is the difference between two potential outcomes at a single point in time.

¹ For instance, in Sweden almost 24% of program participants participate in more than one program during one unemployment spell. This is conditional on participating at least once in a program during the time period 1999-2006.
² There is also a parallel epidemiological literature, see e.g., Robins (1986) and Hernan et al. (2001).

The main difference compared to our study is that we consider a survival time setting, which affects the causal effects that are possible to identify. One reason for this is that, in a duration setting, the outcome and the treatment at t are only observed for individuals who have survived up to time t − 1. This leads to the well known dynamic selection problem, which, for instance, implies that the difference in hazard rates cannot be identified without model assumptions since conditioning jointly on the outcome and a counterfactual treatment at time t − 1 is not possible (van den Berg, 2001). Instead we show that it is possible to identify effects on the survival rate, and this separates our analysis of sequences of treatment from previous studies on sequences of treatments.

This paper also relates to several other strands of the literature. In particular, if sequential unconfoundedness is believed not to hold there are several alternative approaches. First, the Timing-of-Events (ToE) approach by Abbring and van den Berg (2003) also considers evaluation in a dynamic setting, in which exits and treatments are allowed to occur at any point in time. One difference compared with our approach is that the ToE approach allows the selection into treatment to be based on both observed and unobserved heterogeneity. This is achieved at the expense of imposing the mixed proportional hazard structure, whereas the DIPW approach in this paper requires no parametric assumptions. Another difference is that we consider evaluation in discrete time while the ToE approach is for continuous time.³ Second, Heckman and Navarro (2007) establish semi-parametric identification of a discrete time model of time to the treatment and its effects. Their model requires the presence of additional covariates, besides the treatment indicator, that are independent of unobservable errors and have large support, whereas this paper considers evaluation under unconfoundedness. Third, with multiple spell data one can attempt to adjust for fixed unobserved heterogeneity, using approaches similar to the difference-in-differences design.

³ In evaluations of ALMPs, unemployment durations are typically measured at the daily level (discrete time). However, if the treatment rate is low, some aggregation (for instance to 30-day intervals) is necessary in order to obtain a sufficient number of treated in each time period. The ToE approach does not suffer from this aggregation issue.

The DIPW estimator is illustrated using data from a Swedish work practice program. Data for the period 2003–2006 are used and the result is that the program increases the employment rate 15 months after enrollment in the program. We also study the impact of sequences of work practice episodes and the effects of different combinations of the work practice program and a labor market training program. From this analysis we conclude that in most cases a second program episode does not reduce total unemployment more than a single program episode. This holds for most timings of the first program and most spacings between the two program episodes.

Our results on the effects of sequences of programs are related to previous studies on the effects of sequences of ALMP programs using the Lechner (2009) framework: Lechner (2009), Lechner and Miquel (2010), Lechner and Wiehler (2013) and Dengler (2015). Our comparison of two different programs is related to studies on the relative efficiency of different programs: Osikominu (2014), Hotz et al. (2006), Dyke et al. (2006), Sianesi (2008) and Fitzenberger et al. (2008).

The rest of the paper is organized as follows. Section 2 considers identification and DIPW estimation for the case with one single treatment. In Section 3, we extend the results allowing for sequences of treatments. Section 4 gives the simulation results for the DIPW estimator and related estimators. Section 5 reports the results from the application, and Section 6 concludes.

2 Identification and estimation

2.1 Evaluation framework

We consider the average effects on survival time when transitions as well as the start of treatment can occur at any point in discrete time. The time to the start of the treatment is denoted by $S$ and we let $Y_t(s)$ be an indicator of a transition in period $t$ if treated at $s$. The potential outcome if never treated is denoted by $Y_t(0)$ and the observed outcome in period $t$ is $Y_t$. The sequence of potential outcomes is denoted by $\overline{Y}_t(s) = \{Y_1(s), \ldots, Y_t(s)\}$, $\overline{Y}_t$ is a similar sequence of observed outcomes, and $\overline{Y}_t(s) = 0$ implies that all outcomes in the sequence are equal to zero. Throughout the paper we assume a sample of $N$ individuals $i = 1, \ldots, N$. Subsequently, the notation $Y_{t,i}$ is used to denote the observed outcome of a specific individual, and similar notation is used for other variables.

We consider the average effect of treatment at $s$ on the probability of surviving to time point $t$ compared with survival throughout the same interval if never treated:

\[
\mathrm{ATET}_t(s) = \Pr(\overline{Y}_t(s) = 0 \mid S = s, \overline{Y}_{s-1}(s) = 0) - \Pr(\overline{Y}_t(0) = 0 \mid S = s, \overline{Y}_{s-1}(s) = 0). \tag{1}
\]

Let us illustrate this setting using the application in this paper, which evaluates the effects of a Swedish work practice program. The aim of the program is to provide unemployed workers with practical experience in a certain profession, and the program can start after any elapsed unemployment duration. Then, $S$ is the time between the entry into unemployment and the start of the work practice program, $Y_t(s)$ is an indicator of a transition from unemployment to employment in time period $t$ if treated at $s$, and $\overline{Y}_t = 0$ implies that the unemployed worker remains in unemployment for at least $t$ periods. The average effect of interest, $\mathrm{ATET}_t(s)$, is the difference in the probability of remaining in unemployment up until period $t$ when comparing work practice after $s$ periods with no work practice.
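To make the notation concrete, the following Python sketch builds the transition indicators and computes the observed survival rate among those treated at s, which is the sample analogue of the first term in equation (1). It is only an illustration under assumed conventions: the function names, the toy data, and the coding of never-treated units as np.inf in the start array are hypothetical and not part of the paper.

```python
import numpy as np

def outcome_indicators(exit_period, t_max):
    """Y[i, t-1] = 1 if unit i leaves the initial state in period t."""
    periods = np.arange(1, t_max + 1)
    return (np.asarray(exit_period)[:, None] == periods[None, :]).astype(int)

def survives_through(Y, t):
    """True where Y_1, ..., Y_t are all zero (the sequence Ybar_t equals 0)."""
    return Y[:, :t].sum(axis=1) == 0

def treated_survival(Y, start, s, t):
    """Share of units treated at s, and still at risk at s, that survive t."""
    at_risk = (np.asarray(start) == s) & survives_through(Y, s - 1)
    return survives_through(Y, t)[at_risk].mean()

# Toy data: five spells; np.inf marks a unit never observed in treatment
# (a coding assumption made for this example only).
exit_period = np.array([1, 3, 2, 5, 4])
start = np.array([1, 1, 1, 2, np.inf])
Y = outcome_indicators(exit_period, t_max=5)
rate = treated_survival(Y, start, s=1, t=2)
print(rate)
```

In this toy sample three units are treated in period 1 and only one of them is still in the initial state after period 2, so the rate is one third. The counterfactual second term in equation (1) cannot be read off the data in the same way.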

2.2 Assumptions

We assume that we have data on the selection to treatment such that it is reasonable to assume that the potential durations are independent of assignment to treatment when conditioning on observed covariates. In particular, we allow for selection on time-variant covariates. In evaluations of active labor market programs (ALMPs), this allows treatment assignments to not only depend on characteristics measured at the beginning of the unemployment spell, but also to depend on characteristics that change during the unemployment spell. For instance, some unemployed workers divorce and local labor market conditions can change.

To rule out effects of the treatment in period $t$ on $X$, the covariates determining treatment assignment in $t$ should be measured before assignments are made. For that reason we use the notation $X_{t^-}$ for the observed covariates at $t$, where $t^-$ indicates that $X$ is measured at least slightly before $t$. Note that $X_{t^-}$ may include covariates from previous periods, even the entire vector of covariates from all previous periods. Also, let us introduce the notation $D_t$, which denotes treatment in period $t$.

Formally, the assumption is that unconfoundedness should hold sequentially conditional on the time-varying observed covariates among the non-treated survivors:

\[
\{Y_k(s);\ \forall k, s \ge t\} \perp D_t \mid X_{t^-},\ S > t - 1,\ \overline{Y}_{t-1}(0) = 0. \tag{A1}
\]

In evaluations of ALMPs, the assumption is that for unemployed workers still unemployed and non-treated at $t$, treatment assignments in period $t$ are independent of the potential outcomes (re-employment rates) after conditioning on covariates measured shortly before $t$. Whether the sequential unconfoundedness assumption holds depends on how treatment is assigned and on the information available. The assumption is more likely to hold in cases with detailed administrative data and/or survey data. For instance, if unemployed workers mainly self-select into the ALMP of interest, individual motivation, ability and personality traits may affect treatment decisions. In such cases, sequential unconfoundedness is not likely to hold, unless the data include survey information on these factors and/or other variables that are close proxies for individual motivation and the other factors. On the other hand, if caseworkers determine assignment to ALMPs and if the data set used in the analysis contains most of the information available when caseworkers decide on treatment enrollments, sequential unconfoundedness is more likely to hold. Thus, the sequential unconfoundedness assumption has to be assessed on a case-by-case basis based on knowledge about the treatment assignment process and the data available.

Assumption A1 is similar to assumptions made in related dynamic approaches. Biewen et al. (2014) make a dynamic conditional mean independence assumption (DCIA). They consider multiple treatments, but otherwise their DCIA assumption has similar implications as our sequential unconfoundedness assumption.⁴ Both Assumption A1 and the DCIA assumption concern treatment assignments in a given period for individuals surviving up until this time period. For this population those starting treatment and those who remain non-treated should be comparable conditional on the covariates.

⁴ One difference is that Biewen et al. (2014) consider selection on time-invariant covariates while we allow for selection on time-varying covariates.

Lechner (2009) and Lechner and Miquel (2010) consider the identification of effects of sequences of treatments under the "weak conditional independence assumption" (W-DCIA), which implies that conditional on previous treatments and possibly time-varying covariates, treatment assignments in the next period are unrelated to the potential outcomes in that period. This assumption is in the same spirit as our sequential unconfoundedness assumption. In both cases treatment assignments should be unrelated to potential outcomes conditional on past events and current covariates. The main difference is that this paper considers a survival time setting, so that our sequential unconfoundedness assumption is for the population of survivors while the W-DCIA is for the full population.⁵

Next, Fredriksson and Johansson (2008) and Crépon et al. (2009) both consider identification in settings similar to the one considered in this paper. These papers consider unconfoundedness assumptions allowing for selection on time-invariant covariates. In this paper, we extend these previous identification results and allow for selection on time-varying covariates.

The next assumption is the familiar no-anticipation assumption (see e.g., Abbring and van den Berg, 2003):⁶

\[
\Pr(Y_t(s') = 1) = \Pr(Y_t(s'') = 1), \quad \forall t < \min(s', s''). \tag{A2}
\]

The implication of Assumption A2 is that future treatments should not affect current outcomes. This holds if individuals are unaware of future treatments or if they do not alter their behavior as a response to knowledge of future treatments. In practice, this holds if individuals are only informed about upcoming treatments shortly before the start of the treatment, or if there is a great deal of uncertainty around future treatment assignments. On the other hand, if individuals are informed about future treatments far in advance the assumption is not likely to hold, since this gives individuals time to react to future treatments. For instance, if unemployed workers are assigned well in advance to an ALMP that they would like to avoid, this is expected to affect their job search intensity and reservation wages already before the actual start of the program.⁷

Besides Assumptions A1 and A2, an overlap condition⁸ and SUTVA need to hold. The latter rules out general equilibrium effects and other types of interference between the individuals in the sample.

⁵ Another difference is that the W-DCIA assumption concerns sequences of treatments, so that the assumption should hold conditional on previous treatments. In Section 3, we generalize A1 to a setting with sequences of treatments.
⁶ For estimation, Assumption A2 needs to hold given $X_{t^-}$.
⁷ However, there is one special case when the assumption still holds. As pointed out by a referee, if both treated and non-treated unemployed workers with the same characteristics have the same beliefs about future treatments and respond to these beliefs in the same way, both groups are affected by any future treatments in the same way, so that any anticipatory behavior cancels out when comparing treated and non-treated. Conceptually, this is not possible in an evaluation framework with one single treatment, but it may hold in many real world settings in which individuals may be treated several times.
⁸ We have: $\Pr(D_t = t \mid X_{t^-}, S \ge t, \overline{Y}_{t-1} = 0) < 1$ for all $t$. That is, as for static average treatment effects on the treated, the treatment propensity needs to be below one.

2.3 Identification

We show that $\mathrm{ATET}_t(s)$ is identified under Assumptions A1 and A2. The survival function under treatment at $s$ is directly identified by the outcomes of those actually treated at $s$. The main issue is instead how to select a proper control group in order to identify the counterfactual outcome, $\Pr(\overline{Y}_t(0) = 0 \mid S = s, \overline{Y}_{s-1}(s) = 0)$. One key problem is that the start of treatment may occur at any point in time, so that individuals not treated at $t$ may become treated after $t$. Another problem is that the start of treatment is unobserved if an individual leaves the initial state before receiving treatment. This identification problem is discussed by, for example, Fredriksson and Johansson (2008) and Crépon et al. (2009) in a setting with selection on time-invariant covariates. The idea in both papers is to successively use all those not-yet treated at $t$ to estimate the exit rate under no-treatment at $t$ for those treated at $s$. We now show that a similar logic applies with selection on time-varying covariates.

For the effect of treatment in the first period, $\mathrm{ATET}_2(1)$, our main identification result is summarized in Theorem 1. Similar expressions apply for $\mathrm{ATET}_t(s)$ for other $t$ and $s$.

Theorem 1 (Identification of ATET) Suppose that A1 and A2 hold. Then

\[
\mathrm{ATET}_2(1) = \Pr(Y_2 = 0, Y_1 = 0 \mid S = 1) - E_{X_{1^-} \mid S = 1}\Big[ E_{X_{2^-} \mid X_{1^-}, Y_1 = 0, S = 1}\big\{\Pr(Y_2 = 0 \mid X_{2^-}, Y_1 = 0, S > 2)\big\}\, \Pr(Y_1 = 0 \mid X_{1^-}, S > 1) \Big].
\]

Proof. See Appendix A.

In each period, only not-yet treated individuals are used, so that the control group successively changes as some previously non-treated individuals start treatment. In the first period not-yet treated individuals with $S > 1$ are used and in the second period not-yet treated individuals with $S > 2$ are used. It is possible to use this successively changing comparison group, since, conditional on the covariates, treatment assignments among the non-treated survivors are assumed to be unrelated to potential outcomes.

Note that both Assumption A1 and Assumption A2 are important for identification. Assumption A1 relates to the allocation of treatment across individuals, and assures that the treated and the not-yet treated have similar potential outcomes. Assumption A2 concerns the relationship between different potential outcomes for a given individual, and assures that the outcomes of the not-yet treated at $t$ can be used to mimic the outcomes under never treatment even if some of the not-yet treated at $t$ are treated after $t$.

2.4 Dynamic IPW estimation

We propose a dynamic inverse probability weighting (DIPW) estimator for the average treatment effect on the treated. Appendix A.1 shows that if Assumptions A1 and A2 hold a consistent estimator of $\mathrm{ATET}_t(s)$ is:

\[
\widehat{\mathrm{ATET}}_t(s) = \prod_{k=s}^{t}\left[1 - \frac{\sum_i Y_{k,i}\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i = s)}{\sum_i 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i = s)}\right] - \prod_{k=s}^{t}\left[1 - \frac{\sum_i \widehat{w}_i(s,k)\, Y_{k,i}\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i > k)}{\sum_i \widehat{w}_i(s,k)\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i > k)}\right] \tag{2}
\]

with the estimated weights

\[
\widehat{w}_i(s,k) = \frac{\widehat{p}_s(X_{i,s^-})}{1 - \widehat{p}_s(X_{i,s^-})}\, \frac{1}{\prod_{m=s+1}^{k}\left[1 - \widehat{p}_m(X_{i,m^-})\right]}, \tag{3}
\]

where $1()$ is an indicator function, and $p_s(X_{i,s^-}) = \Pr(S = s \mid X_{i,s^-}, S \ge s, \overline{Y}_{s-1} = 0)$ is a propensity score. Specifically, $\widehat{p}_s(X_{i,s^-})$ is the estimated probability of obtaining treatment in period $s$ given survival until time period $s$ and covariates $X$. Note that the weights are normalized by construction. One way to obtain standard errors is by bootstrapping. Since the selection probabilities and the weights are re-estimated in each bootstrap replication, this accounts for the variation in both the estimation of weights and the outcome equation.⁹

⁹ See, for instance, Hirano et al. (2003) for a discussion of the implications of using estimated scores instead of true scores.

Let us consider the intuition behind the estimator. If the interest lies in the average effect on the treated at $s$, the actually observed outcomes of those treated at $s$ can be used to estimate the survival rate under treatment (first part of the estimator). The counterfactual outcome under no-treatment is obtained using non-treated survivors at $s$, i.e. those not-yet treated at $s$. Under Assumption A1, the treated at $s$ and the not-yet treated at $s$ are comparable if we adjust for the fact that due to the assignment process the distribution of $X$ differs between these two populations. Thus, the counterfactual exit rate at $s$ is obtained by weighting the not-yet treated at risk at $s$ and the exits among this group, and the weights essentially follow from the IPW estimator of the average effect on the treated in the static evaluation literature (see e.g., Wooldridge, 2010).

At $s + 1$, i.e. in the second period after the start of treatment, a fraction of the not-yet treated at $s$ that survives up until $s + 1$ starts treatment. This creates selective censoring in the group of not-yet treated. However, under Assumption A1, assignments at $s + 1$ only depend on observed covariates, so that the selective censoring can be taken into account by weighting the outcomes of the not-yet treated at $s + 1$, and this is the purpose of the second part of the weights. The implication is that individuals who are still not-yet treated with covariates such that they have a high probability of starting treatment are given larger weight. Again, the exit rate is obtained by dividing the weighted exits by the weighted risk set. Similar weighting occurs in subsequent periods.
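The weighting logic described above can be sketched in a few lines of Python. This is a minimal illustration of equations (2) and (3) under stated assumptions, not the implementation used in the paper: the function name dipw_atet, the array layout, and the coding of never-treated units as np.inf are hypothetical, and the propensity scores are taken as given rather than estimated.

```python
import numpy as np

def dipw_atet(exit_period, start, p_hat, s, t):
    """DIPW estimate of ATET_t(s), following equations (2) and (3).

    exit_period : (N,) period in which unit i leaves the initial state
    start       : (N,) period in which treatment starts; np.inf if the unit
                  is never observed to start treatment (coding assumption)
    p_hat       : (N, T) array; p_hat[i, m-1] is the estimated probability of
                  starting treatment in period m given survival and covariates
    """
    exit_period = np.asarray(exit_period)
    start = np.asarray(start, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)

    # First factor of the weights: odds of treatment at s.
    odds = p_hat[:, s - 1] / (1.0 - p_hat[:, s - 1])
    surv_treated, surv_control = 1.0, 1.0
    for k in range(s, t + 1):
        at_risk = exit_period >= k   # 1(Ybar_{k-1,i} = 0)
        exits = exit_period == k     # Y_{k,i}
        # Treated at s: unweighted exit rate in period k.
        treated = at_risk & (start == s)
        surv_treated *= 1.0 - exits[treated].mean()
        # Not-yet treated at k: columns s..k-1 hold periods s+1..k, and the
        # empty product at k = s equals one.
        w = odds / np.prod(1.0 - p_hat[:, s:k], axis=1)
        control = at_risk & (start > k)
        surv_control *= 1.0 - np.sum(w[control] * exits[control]) / np.sum(w[control])
    return surv_treated - surv_control

# Toy data: four units treated in period 1 and four never treated.
exit_period = np.array([1, 2, 3, 3, 1, 2, 3, 3])
start = np.array([1, 1, 1, 1, np.inf, np.inf, np.inf, np.inf])
p_flat = np.full((8, 3), 0.5)          # constant scores: no reweighting
est_flat = dipw_atet(exit_period, start, p_flat, s=1, t=2)
p_tilt = p_flat.copy()
p_tilt[7, 0] = 0.75                    # up-weights one long-surviving control
est_tilt = dipw_atet(exit_period, start, p_tilt, s=1, t=1)
print(est_flat, est_tilt)
```

With constant scores the weights cancel and the estimate reduces to a difference between two unweighted product-limit survival rates, which is zero in this toy sample. Raising one control unit's score increases its weight and shifts the counterfactual survival rate. The sketch does not guard against empty risk sets.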

Note that in each period only not-yet treated individuals are used, so that the control group successively changes as some previously non-treated individuals start treatment. For evaluations of ALMPs, this means that the control group in each period consists of unemployed workers who are still unemployed and who have not enrolled in the ALMP yet. Some of these comparison workers, however, enroll in the program in subsequent periods, at which time they are removed from the control group, and the weights account for this selective censoring of the not-yet treated who are treated.

2.5 Right censoring

The DIPW estimator above ignores regular right-censoring, which is common in many applications due to, for instance, drop-out from the study or a limited follow-up period. A standard way of handling this type of right censoring is to include the individual up until the censoring point. Under completely random censoring we do not have to adjust for right censoring, while if the right-censoring depends on observed variables it will create another source of selective censoring since individuals with certain types of characteristics will be censored at a higher rate. We now consider identification and estimation under such selective censoring.¹⁰

Formally, let $C_t$ be an indicator for censoring in period $t$. We consider estimation if the censoring is sequentially independent conditional on time-varying covariates:

\[
\{Y_k(s);\ \forall k, s \ge t\} \perp C_t \mid X_{t^-},\ \overline{Y}_{t-1} = 0. \tag{A3}
\]

This is similar to Assumption A1, except that this assumption concerns right-censoring instead of treatment assignment. Importantly, the right-censoring is allowed to depend on time-varying covariates, such as local labor market conditions. As for Assumption A1, the plausibility of Assumption A3 depends on the exact censoring process and on the information available. In terms of identification we have:

Theorem 2 (Identification with right-censoring) Suppose that A1, A2 and A3 hold. Then

\[
\begin{aligned}
\mathrm{ATET}_2(1) ={}& E_{X_{1^-} \mid S = 1}\Big[ E_{X_{2^-} \mid X_{1^-}, Y_1 = 0, S = 1, C_1 = 0}\big\{\Pr(Y_2 = 0 \mid Y_1 = 0, S = 1, \overline{C}_2 = 0, X_{2^-})\big\} \\
& \times \Pr(Y_1 = 0 \mid S = 1, C_1 = 0, X_{1^-}) \Big] \\
& - E_{X_{1^-} \mid S = 1}\Big[ E_{X_{2^-} \mid X_{1^-}, Y_1 = 0, S = 1, C_1 = 0}\big\{\Pr(Y_2 = 0 \mid Y_1 = 0, S > 2, \overline{C}_2 = 0, X_{2^-})\big\} \\
& \times \Pr(Y_1 = 0 \mid S > 1, C_1 = 0, X_{1^-}) \Big].
\end{aligned}
\]

Proof. See Appendix B.

As before, only not-yet treated individuals are used to estimate the counterfactual outcome under no treatment. In cases with right censoring, the control group successively changes both because some previously non-treated individuals start treatment and because some non-treated individuals are right-censored. In both cases the process is assumed to be random conditional on the possibly time-varying covariates, and this allows us to adjust for the selection that occurs due to the successive treatments and right censoring.

The DIPW estimator can also be adjusted to accommodate right-censoring. Under Assumptions A1, A2 and A3, the DIPW estimator is:

\[
\begin{aligned}
\widehat{ATET}_t(s) ={}& \prod_{k=s}^{t}\left[1 - \frac{\sum_i \dfrac{1}{\prod_{m=s+1}^{k}[1-\hat{c}_m(X_{i,m^-})]}\, Y_{k,i}\, 1(\bar{Y}_{k-1,i} = 0)\, 1(S_i = s)\, 1(\bar{C}_{k,i} = 0)}{\sum_i \dfrac{1}{\prod_{m=s+1}^{k}[1-\hat{c}_m(X_{i,m^-})]}\, 1(\bar{Y}_{k-1,i} = 0)\, 1(S_i = s)\, 1(\bar{C}_{k,i} = 0)}\right] \\
& - \prod_{k=s}^{t}\left[1 - \frac{\sum_i \dfrac{\hat{p}_s(X_{i,s^-})}{1-\hat{p}_s(X_{i,s^-})}\, \dfrac{1}{\prod_{m=s+1}^{k}[1-\hat{p}_m(X_{i,m^-})][1-\hat{c}_m(X_{i,m^-})]}\, Y_{k,i}\, 1(\bar{Y}_{k-1,i} = 0)\, 1(S_i > k)\, 1(\bar{C}_{k,i} = 0)}{\sum_i \dfrac{\hat{p}_s(X_{i,s^-})}{1-\hat{p}_s(X_{i,s^-})}\, \dfrac{1}{\prod_{m=s+1}^{k}[1-\hat{p}_m(X_{i,m^-})][1-\hat{c}_m(X_{i,m^-})]}\, 1(\bar{Y}_{k-1,i} = 0)\, 1(S_i > k)\, 1(\bar{C}_{k,i} = 0)}\right],
\end{aligned}
\]

with $c_t(X_i) = \Pr(C_i = t \mid X_i, S_i \ge t, \bar{Y}_{t-1,i} = 0)$, i.e. the censoring probability in period t is similar to a propensity score. $\hat{c}_t(X_i)$ denotes estimated censoring probabilities. Note that any selective right censoring affects both the treated and the not-yet treated, so that the outcomes of both groups are re-weighted.

10 In each period, we assume that censoring occurs before any treatments and we implicitly assume that the estimation sample consists of all observations that are not censored in the first period.
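As an illustration, the weighted hazards in the DIPW estimator with right-censoring can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the function name, the person-level array layout (one exit, treatment and censoring period per individual, with `np.inf` meaning "never") and the convention that column `m` of the score arrays holds period `m` are all hypothetical choices made for the example.

```python
import numpy as np

def dipw_atet_censoring(exit_time, treat_time, cens_time, p_hat, c_hat, s, t):
    """Sketch of the DIPW estimate of ATET_t(s) with censoring weights.

    exit_time, treat_time, cens_time: 1-D float arrays of periods
        (np.inf = no exit / never treated / never censored). Hypothetical layout.
    p_hat[i, m], c_hat[i, m]: estimated treatment and censoring probabilities
        in period m (column 0 is unused so columns match period numbers).
    """
    exit_time = np.asarray(exit_time, dtype=float)
    treat_time = np.asarray(treat_time, dtype=float)
    cens_time = np.asarray(cens_time, dtype=float)

    # Starting weights at k = s: one for the treated, the odds of treatment
    # at s for the not-yet treated (the products over m are empty at k = s).
    w_treated = np.ones(len(exit_time))
    w_control = p_hat[:, s] / (1.0 - p_hat[:, s])

    surv_treated, surv_control = 1.0, 1.0
    for k in range(s, t + 1):
        if k > s:
            # Extend the products over m = s+1, ..., k by one more factor.
            w_treated = w_treated / (1.0 - c_hat[:, k])
            w_control = w_control / ((1.0 - p_hat[:, k]) * (1.0 - c_hat[:, k]))
        # Survived to the start of period k and not censored through k.
        at_risk = (exit_time >= k) & (cens_time > k)
        exits = (exit_time == k).astype(float)
        in_t = at_risk & (treat_time == s)   # treated at s
        in_c = at_risk & (treat_time > k)    # not yet treated at k
        haz_t = np.sum(w_treated[in_t] * exits[in_t]) / np.sum(w_treated[in_t])
        haz_c = np.sum(w_control[in_c] * exits[in_c]) / np.sum(w_control[in_c])
        surv_treated *= 1.0 - haz_t
        surv_control *= 1.0 - haz_c
    return surv_treated - surv_control
```

With constant estimated scores the weights cancel within each group, so the sketch reduces to the difference between two unweighted product-limit survival estimates, which gives a simple sanity check.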

2.6 Trimming

It is well known that IPW estimation is sensitive to extreme values of the propensity scores (p-scores), since single observations may receive too large a weight (see e.g., Frölich 2004, Huber et al. 2013, Busso et al. 2014). In this paper, the weights are a function of the product of several propensity scores, which exacerbates problems with extreme values if units with extreme values for one propensity score also have extreme values for other scores. It is, therefore, important to perform some kind of trimming. Here, we propose to use a variant of the three-step trimming approach in Huber et al. (2013). For average treatment effects, their approach implies setting the weights to zero for all treated (controls) whose share of the sum of all weights in the treatment (control) group is greater than t% (e.g., 4%). Then, the weights are normalized again. Finally, treated (control) observations whose propensity score is larger than the largest score in the control (treatment) group or smaller than the smallest score in the control (treatment) group are removed.

In this paper, the IPW weights are a function of several propensity scores and the weights given to a certain individual change with the survival time. We therefore consider a slightly modified version of the trimming approach in Huber et al. (2013). Firstly, using the t% rule, we obtain the cut-off values w(s, k) for all s ≤ k ≤ t. Thereafter, we only use individuals whose weights are below w(s, k) in all time periods (between s and t). This ensures that extreme values are discarded and that the same types of individuals are discarded in both treatment arms. Secondly, we apply the minimum and maximum logic to the product of the scores. We exclude observations whose product of the estimated scores if non-treated is smaller (greater) than the maximum (minimum) of the minimum (maximum) product if non-treated among the treated and controls (for time periods between s and t).
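The two building blocks of the trimming rule, the weight-share cut-off and the minimum/maximum logic, can be sketched as follows. This is only an illustrative sketch: the function names and the 4% default are assumptions, and the period-by-period application over all (s, k) described above is deliberately left out.

```python
import numpy as np

def trim_by_weight_share(weights, max_share=0.04):
    """Set to zero any weight whose share of the group total exceeds
    max_share, then renormalize the remaining weights to sum to one."""
    weights = np.asarray(weights, dtype=float)
    share = weights / weights.sum()
    trimmed = np.where(share > max_share, 0.0, weights)
    return trimmed / trimmed.sum()

def common_support_mask(score_treated, score_control):
    """Keep observations whose score (or product of scores) lies between the
    larger of the two group minima and the smaller of the two group maxima."""
    score_treated = np.asarray(score_treated, dtype=float)
    score_control = np.asarray(score_control, dtype=float)
    lower = max(score_treated.min(), score_control.min())
    upper = min(score_treated.max(), score_control.max())
    keep_treated = (score_treated >= lower) & (score_treated <= upper)
    keep_control = (score_control >= lower) & (score_control <= upper)
    return keep_treated, keep_control
```

In an application along the lines of the text, the first function would be evaluated separately for each period k between s and t and within each treatment arm, and an individual would be dropped if flagged in any period.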

2.7 Aggregated effect

$ATET_t(s)$ provides estimates for each separate pre-treatment duration. From a policy perspective, the overall effect is also of interest, that is, the average effect on the treated averaged over all pre-treatment durations. Specifically, the overall effect on the probability of surviving $t_0$ time periods after the start of the treatment can be estimated as:

\[
\widehat{ATET}_{t_0} = \sum_{s} \hat{P}(s)\, \widehat{ATET}_{s+t_0}(s),
\]

i.e., as an average over relevant pre-treatment durations. Here, $\hat{P}(s) = n_s / \sum_s n_s$, where $n_s$ is the number of treated at $s$, so that $\hat{P}(s)$ is the fraction in the sample of treated starting treatment at $s$.
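The aggregation is a weighted average and can be written in a few lines. The sketch below assumes that the duration-specific estimates and the numbers of treated are stored in dictionaries keyed by the pre-treatment duration s; the function and argument names are hypothetical.

```python
def aggregate_atet(atet_by_start, n_treated_by_start):
    """Average duration-specific effects with weights n_s / sum_s n_s.

    atet_by_start[s]: estimate of ATET_{s+t0}(s) for pre-treatment duration s.
    n_treated_by_start[s]: number of individuals starting treatment at s.
    """
    total = sum(n_treated_by_start[s] for s in atet_by_start)
    return sum(n_treated_by_start[s] / total * atet_by_start[s]
               for s in atet_by_start)
```

For example, with effects 0.10 and 0.04 at s = 1 and s = 2 and 30 and 10 treated respectively, the weights are 0.75 and 0.25.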

3 Sequential treatments with duration outcomes

So far we have considered the case with one single treatment. By focusing on one single treatment, we were able to focus on the key identification and estimation issues. However, in many cases we are interested in the effects of sequences of treatments. For instance, in Sweden many unemployed workers participate in more than one ALMP during one unemployment spell, including participation in different programs and participation in the same program more than once. We therefore consider an extension of the results for one single treatment into a setting with sequences of multiple treatments.

We continue to focus on a survival time setting. Formally, in each period there are M mutually exclusive treatments and no treatment. Treatment in time period t is denoted by $S_t$ and we use the notation $\bar{s}_t$ for a particular sequence of treatments. We have the binary potential outcomes $Y_t(\bar{s})$, as indicators of a transition in period t if the treatment sequence had been $\bar{s}$. As before, $\bar{A}_t$ is used to denote a sequence $\bar{A}_t = \{A_1, \ldots, A_t\}$ and $X_{t^-}$ denotes the observed covariates at t.

In a survival time setting, there are several interesting average effects of sequences of treatments. One alternative is to study the average effect on conditional transition probabilities, usually referred to as the transition rate or the hazard rate. However, it is well known that the identification of such conditional transition rates is difficult (see e.g. van den Berg, 2001). From the data we observe the outcomes for survivors under each treatment sequence, but due to dynamic selection it is difficult to learn about one population of survivors from another population of survivors. For that reason we focus on average effects on the survival rate and the comparison of the probability to survive from a starting point to time t under one treatment sequence compared with survival throughout the same time interval under a reference treatment sequence. Formally, we have the contrast between the two treatment sequences $\bar{s}_t$ and $\bar{s}_t^*$ that can be any sequences of the M treatments:11

\[
ATE(\bar{s}_t, \bar{s}_t^*) = E(\bar{Y}_t(\bar{s}_t) = 0) - E(\bar{Y}_t(\bar{s}_t^*) = 0). \tag{4}
\]

Note that the ATE captures the full effect of the sequence of treatments. As before, let us use ALMPs to illustrate the main points. Consider the causal contrast between two periods of work practice and two periods with no treatment. The two periods of work practice may affect the two-period survival rate in several ways. For instance, work practice in the first period can affect the job finding rate in both the first and the second period, so that some treated unemployed workers exit from unemployment already in the first period. The framework proposed in this paper captures such first period effects as well as any effects in the second period. As a comparison, note that any analysis using only those who do not find employment before the end of an entire sequence by definition disregards any effect in the first period. By examining effects on the survival rate, we also avoid restricting the analysis to a sample that completes an entire sequence, since this implies conditioning on survival under a specific series of treatments, that is, conditioning on future outcomes, which introduces additional selection problems.

11 The results of this paper concern the identification and estimation of this parameter of interest. Contrasting sequences with respect to a starting point t > 1 can also be considered for a population that follows the same treatment sequence in the time points before the new starting point of interest. One can also consider the ATE for subpopulations, such as $ATET^{\bar{s}_t, \bar{s}_t^*}(s'_1) = \Pr(\bar{Y}_t^{\bar{s}_t} = 0 \mid S_1 = s'_1) - \Pr(\bar{Y}_t^{\bar{s}_t^*} = 0 \mid S_1 = s'_1)$.

3.1 Assumptions and identification

Concerning identification, we extend Assumption A1 to a setting with sequences of treatments. For all t, $\bar{s}_{t-1}$ and all $\bar{s}_k^*$, $k \ge t$, with the first t − 1 components equal to $\bar{s}_{t-1}$, we assume that

\[
\{Y_k^{\bar{s}_k^*},\ k = t, t+1, \ldots\} \perp S_t \mid X_{t^-},\ \bar{S}_{t-1} = \bar{s}_{t-1},\ \bar{Y}_{t-1} = 0. \tag{A4}
\]

In other words, for individuals surviving up until t under sequence st−1 , treatment assignment in period t is random conditional on the observed covariates. This holds in all situations in which decisions are made sequentially based on the survivor experience up to a certain point in time,


for instance, if case workers assign unemployed individuals to ALMP programs based on time in unemployment, previous program participation and a set of observed covariates.

We also generalize the no-anticipation assumption to this setting with sequences of treatments. For all t, $\bar{s}$, and all $\bar{s}^*$ with the first t components equal to $\bar{s}_t$, we assume that

\[
\Pr(Y_t(\bar{s}) = 1) = \Pr(Y_t(\bar{s}^*) = 1). \tag{A5}
\]

Again, future treatments should not affect current outcomes. In this case, for sequences that are identical up until t the potential outcomes are the same in period t no matter future differences between the two sequences.12 Our main identification result is summarized in Theorem 3:

Theorem 3 (Identification of $ATE(\bar{s}_t, \bar{s}_t^*)$) Suppose that A4 and A5 hold. Then

\[
\begin{aligned}
ATE(\bar{s}_2, \bar{s}_2^*) ={}& E_{X_{1^-}}\Big[ E_{X_{2^-}|X_{1^-},Y_1=0,S_1=s_1}\big\{\Pr(Y_2 = 0 \mid X_{2^-}, Y_1 = 0, \bar{S}_2 = \bar{s}_2)\big\} \Pr(Y_1 = 0 \mid X_{1^-}, S_1 = s_1)\Big] \\
& - E_{X_{1^-}}\Big[ E_{X_{2^-}|X_{1^-},Y_1=0,S_1=s_1^*}\big\{\Pr(Y_2 = 0 \mid X_{2^-}, Y_1 = 0, \bar{S}_2 = \bar{s}_2^*)\big\} \Pr(Y_1 = 0 \mid X_{1^-}, S_1 = s_1^*)\Big].
\end{aligned}
\]

Proof See Appendix B.

Identification results for an arbitrary number of periods follow in the same way. The identification follows from the fact that the probability of the observed survival of the individuals, conditional on the covariates and the treatment sequence, can be used to estimate the corresponding probability for the potential outcomes under the same sequence. This is similar to the line of reasoning in the case with one single treatment.

12 Besides Assumptions A4 and A5 an overlap condition and SUTVA are also required. Here, the relevant overlap condition is: $0 < \Pr(S_t = s_t \mid X_{t^-}, \bar{S}_{t-1} = \bar{s}_{t-1}, \bar{Y}_{t-1} = 0) < 1,\ \forall t, \bar{s}_{t-1}, s_t$.

3.2 Estimation

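We also generalize the DIPW estimator. If Assumptions A4 and A5 hold, then we have

\[
\begin{aligned}
\widehat{ATE}(\bar{s}_t, \bar{s}_t^*) ={}& \prod_{k=1}^{t}\left[1 - \frac{\sum_{i=1}^{N} \hat{w}_{\bar{s}_k,i}\, Y_{k,i}\, 1(\bar{Y}_{k-1,i} = 0)\, 1(\bar{S}_{k,i} = \bar{s}_k)}{\sum_{i=1}^{N} \hat{w}_{\bar{s}_k,i}\, 1(\bar{Y}_{k-1,i} = 0)\, 1(\bar{S}_{k,i} = \bar{s}_k)}\right] \\
& - \prod_{k=1}^{t}\left[1 - \frac{\sum_{i=1}^{N} \hat{w}_{\bar{s}_k^*,i}\, Y_{k,i}\, 1(\bar{Y}_{k-1,i} = 0)\, 1(\bar{S}_{k,i} = \bar{s}_k^*)}{\sum_{i=1}^{N} \hat{w}_{\bar{s}_k^*,i}\, 1(\bar{Y}_{k-1,i} = 0)\, 1(\bar{S}_{k,i} = \bar{s}_k^*)}\right]
\end{aligned}
\tag{5}
\]

with the estimated weights

\[
\hat{w}_{\bar{s}_k,i} = \frac{1}{\hat{p}_{s_1}(X_{i,1^-}) \prod_{m=2}^{k} \hat{p}_{s_m}(X_{i,m^-}, \bar{s}_{i,m-1})}, \qquad
\hat{w}_{\bar{s}_k^*,i} = \frac{1}{\hat{p}_{s_1^*}(X_{i,1^-}) \prod_{m=2}^{k} \hat{p}_{s_m^*}(X_{i,m^-}, \bar{s}_{i,m-1}^*)}.
\]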

Here, $\hat{p}_{s_t}(X_{i,t^-}, \bar{s}_{i,t-1})$ is the probability of obtaining treatment $s_t$ in period t given survival up until t under treatment sequence $\bar{s}_{t-1}$ and covariates $X_{t^-}$. Thus, $p_{s_t}(X_{t^-}, \bar{s}_{t-1}) = \Pr(S_t = s_t \mid X_{t^-}, \bar{S}_{t-1} = \bar{s}_{t-1}, \bar{Y}_{t-1} = 0)$. Again, one way of obtaining standard errors is to use bootstrapping.

The intuition behind the weights is very similar to the case with one single treatment. In the first period, the only purpose of the weights is to re-weight the outcomes of the individuals on $\bar{s}_t$ and $\bar{s}_t^*$ in order to mimic the distribution of the covariates in the full population. In the second period, the weights also correct for the selective censoring due to treatment assignment in period 2. Specifically, the weights depend on the inverse probability of remaining on the treatment sequence of interest conditional on the observed covariates, so that individuals with covariates such that they are more likely to diverge from the sequence of interest are given larger weight.
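The sequence weights are inverse cumulative products of the period-specific probabilities, which can be illustrated with a short sketch. The array layout (one row per individual, one column per period) and the function name are assumptions made for the example; the probabilities themselves would come from separately estimated models.

```python
import numpy as np

def sequence_weights(p_on_sequence):
    """Inverse-probability weights for remaining on a treatment sequence.

    p_on_sequence[i, m-1]: estimated probability that individual i receives
    the period-m treatment of the sequence of interest, given covariates and
    having followed the sequence (and survived) up to period m.
    Returns w[i, k-1] = 1 / prod_{m=1..k} p_on_sequence[i, m-1].
    """
    p = np.asarray(p_on_sequence, dtype=float)
    return 1.0 / np.cumprod(p, axis=1)
```

An individual with probabilities 0.5 and 0.8 in periods 1 and 2 receives weights 2 and 2.5, so a lower probability of staying on the sequence translates into a larger weight in later periods.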

4 Simulations

This section examines the finite sample properties of the DIPW estimator with a focus on the case with a single treatment.13 Data are generated using a logistic model for the hazard rate out of the initial state

\[
\Pr(Y_t = 1 \mid \bar{Y}_{t-1} = 0, X, V_Y) = [1 + \exp(-(3.0 + b_Y X + c_Y V_Y))]^{-1}
\]

and for the hazard rate into treatment

\[
\Pr(S = t \mid S \ge t, X, V_S) = [1 + \exp(-(a_S + b_S X + c_S V_S))]^{-1},
\]

where X is assumed to be observed by the econometrician, and $V_S$ and $V_Y$ are unobserved. All three are independent random variables drawn from a uniform distribution on the interval [-1,1]. Thus, we allow for unobserved heterogeneity, but simulate under the restriction that the unobserved heterogeneity terms in the outcome and treatment equations are uncorrelated, so that the unconfoundedness assumption holds.

Three baseline settings are considered: no heterogeneity ($b_Y = b_S = c_Y = c_S = 0$), observed heterogeneity ($b_Y = b_S > 0$, $c_Y = c_S = 0$) and a full heterogeneity setting ($b_Y = b_S > 0$, $c_Y = c_S = 1$). We set $a_S$ equal to −2.0 or −3.0, i.e. either a high or a low treatment rate.14 Samples with a size of 10,000 are generated, and the number of replications is 10,000. We consider $ATET_t(1)$, that is, the effect of treatment in the first period, but similar results are obtained for other enrollment times.

The propensity scores in the DIPW estimator weights are estimated with a correct logistic model specification. The standard errors are calculated using bootstrapping (99 replications).

13 The working paper version of this paper reports simulation results for sequences of treatments (Vikström, 2015).

14 In the case with $b_Y = b_S = c_Y = c_S = 1$ this implies that about 14% and 6% respectively start treatment in each period.
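A data generating process of this type can be sketched as follows. This is a hedged illustration rather than the simulation code behind the reported results: the function name, the within-period ordering (treatment start drawn before exit), the absence of any treatment effect on the exit hazard and the intercepts passed in the usage note are all assumptions of the sketch.

```python
import numpy as np

def simulate_spells(n, n_periods, a_y, a_s, b_y, b_s, c_y, c_s, seed=0):
    """Simulate exit and treatment-start periods from two logistic hazards.

    X, V_Y and V_S are independent uniform draws on [-1, 1]. Returns the
    observed covariate and the exit / treatment periods (np.inf = no event
    within n_periods). The treatment has no effect in this sketch.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, n)
    v_y = rng.uniform(-1.0, 1.0, n)
    v_s = rng.uniform(-1.0, 1.0, n)
    haz_exit = 1.0 / (1.0 + np.exp(-(a_y + b_y * x + c_y * v_y)))
    haz_treat = 1.0 / (1.0 + np.exp(-(a_s + b_s * x + c_s * v_s)))

    exit_time = np.full(n, np.inf)
    treat_time = np.full(n, np.inf)
    for t in range(1, n_periods + 1):
        alive = np.isinf(exit_time)        # still in the initial state
        untreated = np.isinf(treat_time)   # not yet treated
        # Assumed ordering: treatment may start first, then exit may occur.
        starts = alive & untreated & (rng.uniform(size=n) < haz_treat)
        treat_time[starts] = t
        exits = alive & (rng.uniform(size=n) < haz_exit)
        exit_time[exits] = t
    return x, exit_time, treat_time
```

A call such as `simulate_spells(10_000, 10, a_y=-3.0, a_s=-2.0, b_y=1.0, b_s=1.0, c_y=1.0, c_s=1.0)` generates one sample in the spirit of the full heterogeneity setting; the intercept values in this call are illustrative choices for the sketch.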


4.1 Related estimators

In the simulations we compare the performance of the DIPW estimator with the two-step matching estimator in Fredriksson and Johansson (2008) (FJ henceforth) and the blocking estimator in Crépon et al. (2009) (CFJV henceforth). Both papers propose estimators of the same average effects that are considered in this paper.

The first step of the FJ estimator is one-to-one matching of the treated at a specific duration, s, to non-treated survivors at s. Then, the samples of matched treated and matched controls are used to construct estimates of the hazard rates under treatment and no-treatment. In this step, any matched control starting treatment after t is considered right-censored when (s)he starts treatment. As in FJ, 1-nearest neighbor propensity score matching is used for the FJ estimator, and the scores are estimated using logistic regression models.
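The matching step can be illustrated with a small sketch of 1-nearest neighbor matching on an estimated propensity score. The function name is hypothetical, and matching with replacement with ties resolved by the lowest index is an assumption of the sketch; the text does not specify these details of the FJ implementation.

```python
import numpy as np

def nearest_neighbor_match(score_treated, score_control):
    """For each treated unit, return the index of the control unit with the
    closest propensity score (with replacement; ties go to the lowest index)."""
    st = np.asarray(score_treated, dtype=float)[:, None]
    sc = np.asarray(score_control, dtype=float)[None, :]
    # Pairwise absolute score distances, one row per treated unit.
    return np.argmin(np.abs(st - sc), axis=1)
```

The returned indices select the matched non-treated survivors at s, whose subsequent spells would then be used, with censoring at treatment start, to estimate the hazard under no treatment.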

In the CFJV blocking estimator, propensity scores are estimated for each time to treatment, s. Then, the treated at s and the non-treated at s are divided into blocks based on the predicted scores. For block b, the average effect on the survival function is

\[
\prod_{k=s}^{t}\left[1 - \frac{\sum_{i \in J_b} Y_{k,i}\, 1(\bar{Y}_{k-1,i} = 0)\, 1(S_i = s)}{\sum_{i \in J_b} 1(\bar{Y}_{k-1,i} = 0)\, 1(S_i = s)}\right] - \prod_{k=s}^{t}\left[1 - \frac{\sum_{i \in J_b} Y_{k,i}\, 1(\bar{Y}_{k-1,i} = 0)\, 1(S_i > k)}{\sum_{i \in J_b} 1(\bar{Y}_{k-1,i} = 0)\, 1(S_i > k)}\right],
\]

where $J_b$ denotes the set of indices for all individuals in block b. Note that the hazard rate in each period is the fraction leaving the initial state among the treatment group and the control group, of which the latter only consists of not-yet treated individuals. The overall effect is obtained by averaging over all the blocks.15

15 For the CFJV estimator the scores can be estimated in several different ways. CFJV uses a proportional hazard model with a piecewise constant baseline function and unobserved heterogeneity. Another way, utilized here, is to estimate separate logit models for each s, using only the survivors at each s. Note that data are generated using logistic models, so this implies using the correct functional form. We use ten blocks and standard errors are obtained using bootstrapping (99 replications).

4.2 Simulation results

First, Figure 1 compares the bias of the DIPW estimator and the FJ estimator for a selection of values of $b_Y$ and $b_S$, with and without unobserved heterogeneity. We report results for survival over ten periods, and note that low (high) values of $b_Y$ and $b_S$ correspond to limited (extensive) observed heterogeneity. The results in the figure show that the bias of the DIPW estimator is small in all cases. In the no heterogeneity setting (with $b_Y = b_S = 0$) the bias of the FJ estimator is small, but with observed heterogeneity in the model the FJ estimator is biased. This is in line with the theoretical results in the working paper version of this paper, which show that FJ ignores a selective censoring problem, leading to bias in many relevant settings (Vikström 2014).16 The bias of the FJ estimator increases in $b_Y$ and $b_S$, so that more pronounced observed heterogeneity leads to greater bias, and the bias is reinforced by uncorrelated unobserved heterogeneity.

Figure 1 about here

These properties are confirmed by the full simulation results reported in Table 1. The results in this table for tests with a nominal size of 5% show that the DIPW estimator also has the correct size. Table 1 also reports results for the CFJV estimator. We find that for our three baseline settings the bias of the CFJV estimator is small and that its tests have the correct size. Additional exploration of the CFJV estimator reveals that this holds for both limited and extensive observed heterogeneity as well as for a low and high treatment rate.

Table 1 about here

Besides the three baseline models, a model with two covariates (both uniform on [-1,1]) with time-varying selection is considered:

\[
\begin{aligned}
\Pr(Y_t = 1 \mid \bar{Y}_{t-1} = 0, X_1, X_2, V_Y) &= [1 + \exp(-(3.0 + X_1 + X_2 + V_Y))]^{-1}, \\
\Pr(S = t \mid S \ge t, X_1, X_2, V_S) &= [1 + \exp(-(2.0 + X_1 + V_S))]^{-1}, \quad t = 1, \\
\Pr(S = t \mid S \ge t, X_1, X_2, V_S) &= [1 + \exp(-(2.0 + X_2 + V_S))]^{-1}, \quad t > 1,
\end{aligned}
\]

i.e. one covariate only affects selection into treatment in the first period and the other covariate affects selection in all subsequent periods. The results from this model, reported in the fourth panel of Table 1, show that the bias of the DIPW estimator is small, but that both the FJ estimator and the CFJV estimator are severely biased. The intuition behind the results for the CFJV blocking estimator is that the blocking is only based on the selection mechanism at t = 1 and not on the subsequent selection into treatment, and this leads to bias in the current setting with a time-varying impact of the covariates.17 Another advantage of the DIPW estimator compared to the CFJV estimator is that for the CFJV estimator the researcher has to decide upon the number of blocks, and it is unclear how to select the optimal number of blocks.

16 The intuition is that the FJ estimator censors the not-yet treated when they become treated, but the FJ estimator does not adjust for the fact that this censoring introduces a selection problem. The consequence of this is that individuals with X characteristics that make them less likely to enter treatment will be overrepresented among the not-yet treated, and this confounds the analysis if these X characteristics also affect the outcome.

5 Application on Swedish ALMPs

This section illustrates the DIPW estimator using data on a work practice program and a training program governed by the Swedish public employment service (PES).

5.1 The programs

The aim of the work practice program is to provide long-term unemployed individuals with practical experience and employer contacts in order to maintain and strengthen their productivity. The participants should perform regular tasks at regular firms, even though they are not employed by them. The duration of the program does not normally exceed six months, and is in fact usually much shorter. Participants receive a grant for their participation in this and the other programs. Those who are entitled to unemployment insurance (UI) benefits receive a grant equal to their UI benefits.

The main purpose of the training program is to improve the skills of the unemployed person and thereby enhance their chances of obtaining a job. The contents of the courses should be directed towards the upgrading of skills or the acquisition of skills that are in short supply or that are expected to be in short supply. These can be computer skills, technical skills, manufacturing skills, and skills in services and medical health care.

17 A generalized blocking estimator is to block based on both the propensity score at t = 1 and the scores at t > 1, addressing the problem raised in our simulations. However, with small samples this is problematic, since then the blocking is based on several propensity scores.

5.2 Data and estimation details

The population is taken from the register Händel administered by the PES, which includes all job seekers in Sweden. The register contains daily information on the time when an individual (i) became unemployed, (ii) entered into a labor market program and (iii) exited from unemployment. It also includes information on the reason for the exit (employment, education, social assistance, disability or sickness insurance programs and lost contact), and personal characteristics recorded at the beginning of the unemployment spell. To these data we merge information on marital status, household characteristics (e.g., number of children), labor income and income from various insurance schemes (e.g., sickness and disability) from the population registers. We also use the unemployment records to construct detailed information on previous unemployment and short-term labor market history (e.g., the number and length of previous spells).

We sample all unemployed individuals in Händel between January 1, 1999 and December 31, 2006 who were aged between 25 and 55 at the time of entry into unemployment. The study ends on April 8, 2011. The spell ends when the unemployed individual finds employment for a minimum period of 30 days. Spells with exits for reasons other than employment, such as lost contact, sickness or end of study, are censored. We aggregate the daily spell data to 30-day intervals. In robustness analyses we also use 15-day and 60-day intervals. We focus on the effects of work practice programs when work practice is the first program during the unemployment spell. Individuals entering another program before work practice are right censored. We use logit regression models to estimate the propensity scores and apply the trimming rule described in Section 2.6 (t is set to 1%).

5.3 Identification

In order to apply the DIPW estimator, two main assumptions have to be fulfilled: sequential unconfoundedness and no-anticipation. In this ALMP setting, the sequential unconfoundedness assumption implies that, conditional on the covariates, treatment assignments should be random among the non-treated survivors. For individuals still unemployed in a given period, treatment assignments in that period should be unrelated to future outcomes once we condition on the observed covariates. The credibility of this assumption depends to a large extent on the covariates that are controlled for in the analysis. We control for baseline characteristics, including gender, age, age squared, an indicator for at least one child in the household, marital status, country of origin (3 categories), and level of education (5 categories). We also control for time of entry into unemployment (inflow year dummies), regional variation (22 regions), local unemployment rate, unemployment benefits entitlement, short-term labor market history (labor income, social assistance and unemployment insurance benefits 1 and 2 years before unemployment), and medium-run unemployment history (number of unemployment days in the last 5 years).

In the baseline analysis we only use time-invariant covariates. One reason for this is that most of the characteristics used in the analyses do not change over time. For instance, the level of education and pre-unemployment labor market history remain the same throughout the unemployment spell. Later we explore two sets of time-varying covariates. First, we allow the local unemployment rate to vary over time, since labor market conditions are important for treatment assignments and these conditions may vary over time. Second, we include time-varying covariates measuring program participation before entering the work practice program. We use a time-varying indicator for any previous program participation, but we have also explored more elaborate models. In these analyses, we study all work practice episodes, whereas the main analysis focuses on cases where work practice is the first program during the unemployment spell.

The variables used in the analysis are selected based on the results from previous studies. For instance, Heckman et al. (1998) and Heckman and Smith (1999) stress that besides using basic socioeconomic variables, it is also very important to control for previous unemployment, lagged earnings, and local labor market characteristics. More recently, Lechner and Wunsch (2013) use German data and Empirical Monte Carlo methods to perform an innovative examination of which variables are important to control for in evaluations of ALMP programs. They conclude that it is important to control for baseline characteristics, information about the timing of entry into unemployment and the program, regional information, pre-treatment outcomes (e.g., employment and earnings four years before the program) and short-run labor market history. Here, we control for similar types of variables, but there are some differences compared to Lechner and Wunsch (2013). The main difference is that they include more detailed information on the short-run labor market history. In this paper we control for labor income, social assistance and unemployment insurance benefits one and two years before the start of the unemployment spell. We have also explored using more detailed controls in a similar way as Lechner and Wunsch (2013), but the results did not change in any significant way.18

In addition, we have some specific information on the program selection process in Sweden. The selection into the ALMP programs is a two-sided selection process, involving both the unemployed worker and the caseworker at the local employment office. However, there is some evidence that caseworkers have a large degree of discretionary power over enrollment in Swedish ALMP programs. In an experiment with caseworkers, Eriksson (1997) finds that caseworker heterogeneity is more important for program participation than individual heterogeneity. Using Swedish register data, Carling and Richardson (2001) show that program enrollment depends more on the local employment office affiliation than on the unemployed individuals' own characteristics. All this suggests that individual self-selection into the program is less of a problem, lending additional support to the unconfoundedness assumption.19

However, caseworkers might also have and use additional information about the unemployed workers. For instance, since caseworkers interact with the unemployed workers they learn about their personality traits and individual motivation. This and other unobserved variables are likely to affect treatment assignments, which would invalidate the unconfoundedness assumption. Using German data, Caliendo et al. (2014) exploit survey information on personality traits, expectations, attitudes and job search behaviour, and conclude that such survey information makes little difference compared to only using the rich administrative data. We also argue that previous outcomes are good proxies for individual motivation, since motivation should affect both previous and current outcomes. All this suggests that the sequential unconfoundedness assumption is fulfilled, but we can of course never rule out that there is some other unobserved factor that affects both treatment assignments and the outcome of interest.

No-anticipation holds if the unemployed workers do not alter their behavior as a response to their knowledge of future treatments, or if they are unaware of future treatments. In evaluations of ALMP programs this assumption is problematic since several recent studies have found that the unemployed react to information about future treatments (see e.g., Black et al., 2003; Crépon et al. 2013). However, in many cases the individual is informed about the work practice program shortly before the start of the program, and this limits any substantial anticipation effects. If the no-anticipation assumption is nevertheless violated, the not-yet treated react to future treatments. If the non-treated dislike future treatments, search intensity should increase and reservation wages decrease, leading to higher transition rates among the not-yet treated in the control group. This will bias the estimated effect downwards.

18 Another difference compared with Lechner and Wunsch (2013) is that we control for age using age and age squared, whereas Lechner and Wunsch (2013) use a set of age dummies for different age groups.

19 The caseworker discretion is problematic if caseworkers who frequently assign unemployed workers to ALMPs are systematically better or worse at helping the unemployed workers to find jobs, since this will introduce a correlation between the probability of becoming treated and unemployment durations.

Besides sequential unconfoundedness and no-anticipation, the overlap condition, SUTVA and sequentially unconfounded censoring need to hold. The sequential censoring assumption implies that, conditional on covariates, any right censoring, for instance due to drop-out from the study, should be random among the group of survivors. Here, the main reason for right censoring is that the PES has lost contact with the unemployed worker. This makes the right censoring assumption somewhat difficult to assess. However, note that any selective right censoring affects both the treated and the non-treated, having an impact on the survival rate estimates under both treatment and no treatment.

5.4 Effects of work practice

The estimation results for one work practice episode are presented in Figure 2. Each figure gives the results for effects after a certain number of months. Note that the results are for the effect on the fraction re-employed instead of the effect on the survival rate. The figure shows that in all cases, there are substantial locking-in effects with lower employment rates during the first months after enrollment. In the first month after assignment the re-employment rate is lower in the treatment group. After this period participants catch up and about 450 days after enrollment the employment rate on average is about 6 percentage points higher among the participants. We also see that the size of the effect varies with enrollment time, but Figure 2 indicates no clear pattern that early enrollment is better than late enrollment.

Figure 2 about here

This analysis adjusts for a large set of time-invariant covariates. Next, Figures C.1 and C.2 in Appendix C explore time-varying covariates. Figure C.1 allows the local unemployment rate to vary over time and Figure C.1 in addition adds time-varying covariates measuring previous participation in any other program besides work practice. In both cases, the results using time-varying covariates are very similar to our baseline results with only time-invariant covariates. This holds even though

AN US

both local labor conditions and previous program participation affects treatment assignments. In our main analysis, the daily unemployment data is aggregated to 30-day intervals. As robustness analyses, Figure C.3 in Appendix C present estimates using 15-day and 60-day intervals. The estimated effect of work practice on the re-employment rate is more negative with 60-day intervals and more positive with 15-day intervals compared with 30-day intervals. However, from

M

around 200 days and onwards the estimated effects are very similar with 15-day intervals and 30-day

5.5

ED

intervals.20

Effects of sequences of programs

We now consider effects of different sequences of the two programs. Specifically, we explore treatment sequences defined by a specific combination of the first and the second entry time.21 Table 2 reports the effects of different sequences of the two programs when these sequences are compared with not enrolling into a program. We present results for different times to the first episode and different times between the first and the second program.22 For presentation reasons, we average the effects over pre-treatment intervals and report effects on cumulated re-employment rates (first 36 months). The cumulated effects are scaled in such a way that they can be interpreted as the effect on the average truncated unemployment duration in months.

Table 2 about here

20 These patterns are explained by a dynamic selection problem within the intervals. With 60-day intervals the treated consist of those treated within a 60-day period. The non-treated include those who leave unemployment within the 60-day period without being treated and those who remain non-treated throughout the entire interval. Since the former group is positively selected, this creates a dynamic selection, which is more severe with larger intervals.

21 Due to sample size restrictions the data in this analysis are aggregated to 45-day intervals. We also adjust the trimming rule and set t to 4% and for sequences with two treatments we set t to 20%. Moreover, for sequences defined by a single work practice episode the propensity scores are estimated separately in each time period for the censoring due to a first program episode and thereafter the propensity scores are estimated jointly. The latter joint logit model includes time period dummies. Similar joint estimations are performed when the sequence consists of two treatments.
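The scaling of the cumulated effects can be illustrated with a small numerical sketch. For a duration measured in whole periods, the mean duration truncated at K periods equals the sum of the survival rates Pr(duration > k) for k = 0, ..., K-1, so an effect on cumulated re-employment fractions is the negative of the effect on the truncated mean duration. The function names and the survival numbers below are made up for illustration and are not estimates from the paper.

```python
def truncated_mean_duration(survival):
    """Mean duration truncated at K = len(survival) periods.

    survival[k] is Pr(duration > k) for k = 0, ..., K-1, so survival[0] == 1.0.
    """
    return sum(survival)


def cumulated_reemployment_effect(surv_treated, surv_control):
    """Sum over periods of the difference in the fraction re-employed."""
    return sum((1.0 - a) - (1.0 - b) for a, b in zip(surv_treated, surv_control))


# Hypothetical survival rates (fraction still unemployed), not from the paper.
surv_control = [1.0, 0.90, 0.80, 0.70]
surv_treated = [1.0, 0.95, 0.75, 0.60]

effect_duration = (truncated_mean_duration(surv_treated)
                   - truncated_mean_duration(surv_control))
effect_cumulated = cumulated_reemployment_effect(surv_treated, surv_control)

# The two effects are equal in size and opposite in sign.
print(round(effect_duration, 6), round(effect_cumulated, 6))
```

In this sketch a negative effect on the truncated duration corresponds to a positive effect on cumulated re-employment, which is the sense in which negative entries in Table 2 indicate shorter unemployment.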

Initially, consider the results in columns 2–4 of Panel A for different sequences of work practice episodes. Comparing these estimates with the estimates for one single work practice episode in column 1, we see that in all cases a second program episode leads to a smaller decrease of time in unemployment compared with one single work practice episode. This holds for all the timings of the first program and for all the spacings between the first and the second program episode, even though these differences are not always significant.

Columns 5–7 of Panel A present results for sequences with first work practice and then training.23 From a comparison of the results in column 1 for only one single episode of work practice and the results in the other columns, we see that such combinations of work practice and training in all cases lead to a less favorable employment effect than one single work practice episode.

Panel B reports results when training is the first program in the sequence. From column 1 we see that a single training program leads to shorter unemployment durations. We also see that later enrollment in training is associated with greater employment effects. Next, the results for sequences of training episodes in columns 2–4 of Panel B show that repeated episodes of training seem to be a waste of resources. One possible explanation is that another training episode only gives rise to an additional locking-in period and this is not counteracted by greater positive post-program effects of the second training episode.

Finally, columns 5–7 of Panel B show results for sequences with first training and then work practice. We find some evidence that work practice after completed training can reduce unemployment more than just one single training episode. One reason for this can be that work practice in the occupation for which the individuals have been trained can serve as a quick way into a new type of occupation or into a new type of labor market.

22 Note that the time between the two programs refers to the time between the start of the two programs, so that the actual time between the two programs is significantly shorter.

23 Since the training program normally lasts for several months we are unable to consider sequences where the second program starts less than 136 days after the start of the training program.

6 Conclusions

This paper has reconsidered treatment evaluation under unconfoundedness in a dynamic treatment assignment setting in which treatment may start at any point in time. The outcome of interest is survival time, and together with the dynamic treatment assignment this introduces well-known methodological issues. We have proposed a DIPW estimator to estimate average effects in this setting, focusing both on the effects of a single treatment and on sequences of treatments. The new estimator involves separate weights for each time period and the use of the not-yet treated in each period to estimate the counterfactual survival rate.

An analysis of data from two Swedish ALMP programs illustrates the DIPW estimator. Since the ALMP programs can start at any elapsed unemployment duration and the outcome of interest is the time in unemployment, this offers a key application of the estimators introduced in this paper. The result is that participation in the work practice program leads to significantly increased employment rates compared with never receiving treatment.

We have also studied effects of sequences of work practice and labor market training. One key result is that enrolling an unemployed individual twice in the same program or in two different programs in most cases leads to longer unemployment spells than only participating in a single program once. This holds for most timings of the first program and most spacings between the two program episodes.

References

Abbring J.H. and G.J. van den Berg (2003), “The non-parametric identification of treatment effects in duration models”, Econometrica, 71, 1491–1517.

Albanese A., Y. Thuy and B. Cockx (2015), “Working time reductions at the end of the career. Do they prolong the time spent in employment?”, IZA Discussion paper 9619.

Black D., J. Smith, M. Berger and B. Noel (2003), “Is the Threat of Reemployment Services More Effective than the Services Themselves? Evidence from Random Assignment in the UI System”, American Economic Review, 93:4, 1313–1327.

Biewen M., B. Fitzenberger, A. Osikominu and M. Paul (2014), “The Effectiveness of Public Sponsored Training Revisited: The Importance of Data and Methodological Choices”, Journal of Labor Economics, 32(4), 837–897.

Busso M., J. DiNardo and J. McCrary (2014), “New Evidence on the Finite Sample Properties of Propensity Score Reweighting and Matching Estimators”, Review of Economics and Statistics, 96(5), 885–897.

Carling K. and K. Richardson (2001), “The Relative Efficiency of Labor Market Programs: Swedish Experience From the 1990s”, Labour Economics, 26:4, 335–354.

Caliendo M., R. Mahlstedt and O. Mitnik (2017), “Some Practical Guidance for the Implementation of Propensity Score Matching”, Labour Economics, 46, 14–25.

Crépon B., M. Ferracci, G. Jolivet and G.J. van den Berg (2009), “Active Labor Market Policy Effects in a Dynamic Setting”, Journal of the European Economic Association, 7, 595–605.

Dengler K. (2015), “Effectiveness of Sequences of One-Euro-Jobs for Welfare Recipients in Germany”, Applied Economics, 47:57, 6170–6190.

Dyke A., C.J. Heinrich, P.R. Mueser, K.R. Troske and K.-S. Jeon (2006), “The Effects of Welfare-to-Work Program Activities on Labor Market Outcomes”, Journal of Labor Economics, 24(3), 567–607.

Eriksson M. (1997), “To choose or not to choose: Choice and choice set models”, Umeå Economic Studies 443, Department of Economics, Umeå University.

Fitzenberger B., A. Osikominu and R. Völter (2008), “Get Training or Wait? Long-Run Employment Effects of Training Programs for the Unemployed in West Germany”, Annales d’Économie et de Statistique, 91/92, 321–355.

Fredriksson P. and P. Johansson (2008), “Dynamic Treatment Assignment: The Consequences for Evaluations Using Observational Data”, Journal of Business & Economic Statistics, 26:4, 435–445.

Frölich M. (2004), “Finite Sample Properties of Propensity-Score Matching and Weighting Estimators”, Review of Economics and Statistics, 77–90.

Gerfin M. and M. Lechner (2002), “A Microeconometric Evaluation of the Active Labour Market Policy in Switzerland”, Economic Journal, 112, 854–893.

Heckman J.J. and S. Navarro (2007), “Dynamic Discrete Choice and Dynamic Treatment Effects”, Journal of Econometrics, 136, 341–396.

Heckman J.J., H. Ichimura and P. Todd (1998), “Matching As An Econometric Evaluation Estimator”, Review of Economic Studies, 65, 261–294.

Heckman J.J. and J. Smith (1999), “The Pre-programme Earnings Dip and the Determinants of Participation in a Social Programme. Implications for Simple Programme Evaluation Strategies”, Economic Journal, 109, 313–348.

Hernan M.A., B. Brumback and J.M. Robins (2001), “Marginal Structural Models to Estimate the Joint Causal Effect of Nonrandomized Treatments”, Journal of the American Statistical Association, 96, 440–448.

Hirano K., G. Imbens and G. Ridder (2003), “Efficient estimation of average treatment effects using the estimated propensity score”, Econometrica, 71, 1161–1189.

Hotz V.J., G.W. Imbens and J.A. Klerman (2006), “Evaluating the Differential Effects of Alternative Welfare-to-Work Training Components: A Reanalysis of the California GAIN Program”, Journal of Labor Economics, 24(3), 521–566.

Huber M., M. Lechner and C. Wunsch (2013), “The Performance of Estimators Based on the Propensity Score”, Journal of Econometrics, 175, 1–21.

Kastoryano S. and B. van der Klaauw (2011), “Dynamic Evaluation of Job Search Assistance”, IZA DP No. 5424.

Lechner M. (1999), “Earnings and employment effects of continuous off-the-job training in East Germany after unification”, Journal of Business and Economic Statistics, 17, 74–90.

Lechner M. (2008), “Matching Estimation of Dynamic Treatment Models: Some Practical Issues”, in: Millimet D., Smith J. and Vytlacil E. (Eds.), Advances in Econometrics 21, Modelling and Evaluating Treatment Effects in Econometrics, Emerald Group Publishing Limited, 289–333.

Lechner M. (2009), “Sequential Causal Models for the Evaluation of Labor Market Programs”, Journal of Business & Economic Statistics, 27:1, 71–83.

Lechner M. and R. Miquel (2010), “Identification of the Effects of Dynamic Treatments by Sequential Conditional Independence Assumptions”, Empirical Economics, 39, 111–137.

Lechner M., R. Miquel and C. Wunsch (2011), “Long-run effects of public sector sponsored training in West Germany”, Journal of the European Economic Association, 9(4), 742–784.

Lechner M. and S. Wiehler (2013), “Does the Order and Timing of Active Labour Market Programmes Matter?”, Oxford Bulletin of Economics and Statistics, 75(2), 180–212.

Lechner M. and C. Wunsch (2013), “Sensitivity of matching-based program evaluations to the availability of control variables”, Labour Economics, 21, 111–121.

Osikominu A. (2013), “Quick Job Entry or Long-Term Human Capital Development? The Dynamic Effects of Alternative Training Schemes”, Review of Economic Studies, 80:1, 313–342.

Robins J. (1986), “A New Approach to Causal Inference in Mortality Studies With Sustained Exposure Periods: Application to Control of the Healthy Worker Survivor Effect”, Mathematical Modelling, 7, 1293–1512.

Sianesi B. (2004), “An Evaluation of the Swedish System of Active Labour Market Programmes in the 1990s”, Review of Economics and Statistics, 86, 133–155.

Sianesi B. (2008), “Differential effects of active labour market programs for the unemployed”, Labour Economics, 15(3), 370–399.

van den Berg G. (2001), “Duration Models: Specification, Identification and Multiple durations”, in: J.J. Heckman and E.E. Leamer (eds.), Handbook of Econometrics, volume 5, Elsevier, 3381–3460.

van den Berg G.J., M. Caliendo, R. Schmidl and A. Uhlendorff (2015), “Matching or Duration Models? A Monte Carlo Study”, mimeo, University of Mannheim.

Vikström J. (2015), “Evaluation of sequences of treatments with application to active labor market policies”, IFAU Working paper 2015:25.

Wooldridge J.M. (2010), Econometric Analysis of Cross Section and Panel Data (2nd ed.), MIT Press.

Tables and Figures

Figure 1: Simulation results 1 – by observed and unobserved covariate impact

[Figure: average bias (vertical axis, −.08 to .04) plotted against bY, bS (horizontal axis, 0 to 2). Series: DIPW (cS=cY=0), DIPW (cS=cY=1), F−J (cS=cY=0), F−J (cS=cY=1).]

Note: Bias for $\text{ATET}_{10}(1)$, i.e. the average effect of treatment in the first period on survival 10 periods. bY and bS measure the impact of the observed covariate in the exit rate and treatment rate equations, respectively. cY = cS = 0 implies no unobserved heterogeneity and cY = cS = 1 includes unobserved heterogeneity. The data generating processes for the logistic simulation models are described in the text. DIPW is the dynamic IPW estimator and F-J the Fredriksson and Johansson (2008) estimator. Samples are of size 10,000 and results are based on 10,000 replications.

Figure 2: Effect of work practice on fraction reemployed. By pre-treatment duration

[Figure with four panels: (a) Treatment after 5 months, (b) Treatment after 6 months, (c) Treatment after 7 months, (d) Treatment after 8 months. Each panel plots the effect on the fraction with employment (vertical axis, −.1 to .15) against days since start of the program (horizontal axis, 0 to 450). Series: DIPW, DIPW low. 95%, DIPW upp. 95%.]

Note: DIPW estimates with bootstrapped standard errors (99 replications).
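As a rough illustration of how bootstrapped standard errors with 99 replications and a 95% band could be computed, consider the sketch below. It assumes that individuals are resampled with replacement and that the band uses a normal approximation; the paper does not spell out either choice here, and the function names, the toy data and the sample-mean estimator are all hypothetical stand-ins for the DIPW estimator.

```python
import random
import statistics


def bootstrap_se(data, estimator, n_rep=99, seed=0):
    """Standard deviation of the estimator over n_rep bootstrap resamples."""
    rng = random.Random(seed)
    n = len(data)
    draws = []
    for _ in range(n_rep):
        # Resample individuals with replacement (assumed resampling scheme).
        sample = [data[rng.randrange(n)] for _ in range(n)]
        draws.append(estimator(sample))
    return statistics.stdev(draws)


def normal_band(estimate, se, z=1.96):
    """Lower and upper 95% bounds under a normal approximation (assumed)."""
    return estimate - z * se, estimate + z * se


# Toy data: 1 = re-employed, 0 = still unemployed (made-up numbers).
data = [1] * 30 + [0] * 70
point = statistics.fmean(data)
se = bootstrap_se(data, statistics.fmean, n_rep=99, seed=0)
lower, upper = normal_band(point, se)
```

In an application the `estimator` argument would be replaced by a function that re-estimates the propensity scores and the effect on each resample, so that the estimation error in the weights is reflected in the standard error.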


Table 1: Simulation results for the DIPW estimator and related estimators

      DIPW                           FJ                             CFJV
t     size     bias       se         size     bias       se         size     bias       se
      [1]      [2]        [3]        [4]      [5]        [6]        [7]      [8]        [9]

No heterogeneity (bY = bS = cY = cS = 0)
1     0.053    0.0002     0.0065     0.063    -0.0044    0.0092     0.060    -0.0095    0.0094
2     0.053    -0.0056    0.0089     0.063    -0.0013    0.013      0.056    -0.030     0.013
5     0.056    0.0063     0.013      0.066    -0.0089    0.019      0.051    -0.011     0.018
10    0.055    0.0014     0.016      0.066    0.023      0.026      0.050    0.019      0.022

Observed heterogeneity (bY = bS > 0, cY = cS = 0)
1     0.055    0.0082     0.0066     0.063    0.0048     0.0095     0.053    0.0053     0.0068
2     0.051    0.0093     0.0089     0.062    -0.05      0.013      0.056    -0.0092    0.0094
5     0.057    0.015      0.013      0.068    -0.38      0.020      0.053    -0.021     0.013
10    0.056    0.017      0.016      0.10     -1.3       0.026      0.048    -0.036     0.016

Unobserved and observed heterogeneity (bY = bS > 0, cY = cS = 1)
1     0.051    -0.00008   0.0069     0.075    -0.0013    0.010      0.052    0.0087     0.0072
2     0.055    -0.0021    0.0093     0.076    -0.0097    0.014      0.052    -0.0041    0.0096
5     0.055    -0.0078    0.013      0.095    -0.72      0.019      0.054    -0.0044    0.013
10    0.052    -0.013     0.015      0.16     -2.0       0.025      0.051    -0.037     0.016

Time-varying selection effect of covariates
1     0.054    0.0007     0.0070     0.071    -0.0011    0.010      0.052    -0.01      0.0073
2     0.062    0.017      0.0095     0.071    -0.11      0.013      0.053    -0.11      0.0096
5     0.055    0.012      0.013      0.093    -0.71      0.019      0.091    -0.75      0.013
10    0.052    0.028      0.015      0.16     -1.8       0.025      0.24     -1.8       0.015

Note: The data generating processes for the logistic simulation models are described in Section 5. DIPW estimates with bootstrapped standard errors (99 replications). FJ is the Fredriksson and Johansson (2008) matching estimator implemented using 1-nearest neighbor propensity score matching. CFJV is the Crépon et al. (2009) blocking estimator applied using 10 blocks and bootstrapped standard errors (99 replications). Bias has been multiplied by 100. Size is for 5% level tests. The results are based on 10,000 replications.
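The three statistics reported in the table (bias, se and size) can be computed from Monte Carlo output along the following lines. This is only a sketch under stated assumptions: "se" is taken to be the standard deviation of the point estimates across replications, and size is the rejection rate of a two-sided 5% test of the true value based on each replication's own standard error. Neither definition is confirmed by the note, and the function name and inputs are hypothetical.

```python
import statistics


def summarize_replications(estimates, std_errors, true_value, z_crit=1.96):
    """Bias (times 100), se and size of a 5% test across replications."""
    bias = statistics.fmean(estimates) - true_value
    # Assumed definition: spread of the point estimates across replications.
    se = statistics.stdev(estimates)
    rejections = sum(
        abs(est - true_value) / s > z_crit
        for est, s in zip(estimates, std_errors)
    )
    return {
        "bias_x100": 100.0 * bias,
        "se": se,
        "size": rejections / len(estimates),
    }


# Made-up replication output, for illustration only.
estimates = [0.10, 0.12, 0.08, 0.11, 0.30]
std_errors = [0.02, 0.02, 0.02, 0.02, 0.02]
result = summarize_replications(estimates, std_errors, true_value=0.10)
```

With a correctly sized test the "size" entry should be close to 0.05 when the number of replications is large; values well above that, as for some FJ and CFJV rows, indicate over-rejection.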

Table 2: Treatment sequences and cumulative employment rates. By type of programs, timing and spacing of the programs

                                     Time between programs
                                     Work practice second program         Training second program
Time to first program   No second    45-135      136-225     226-315      45-135      136-225     226-315
                                     days        days        days         days        days        days
                        [1]          [2]         [3]         [4]          [5]         [6]         [7]

Panel A: Work practice first program
91-180 days             -0.93        0.26        0.09        0.17         0.16        3.17        -0.53
                        (0.14)       (0.74)      (0.44)      (0.39)       (0.70)      (0.68)      (0.62)
181-270 days            -1.32        -0.48       -0.75       -0.39        0.70        2.12        0.45
                        (0.20)       (0.85)      (0.54)      (0.70)       (1.12)      (0.72)      (1.19)
271-360 days            -0.78        0.44        0.27                     2.36        1.08
                        (0.21)       (0.89)      (0.91)                   (1.24)      (1.14)

Panel B: Training first program
91-180 days             -1.44                    -1.81       -0.93                    0.11        0.37
                        (0.13)                   (1.04)      (0.75)                   (0.81)      (0.74)
181-270 days            -1.86                    -3.34       -2.38                    0.56        -0.40
                        (0.17)                   (0.93)      (2.08)                   (1.12)      (1.25)
271-360 days            -1.90                    -2.48                                -1.17
                        (0.21)                   (2.37)                               (1.95)

Note: The table reports cumulated re-employment rates for the first 36 months of unemployment. The comparison sequence is participating in no program. Swedish data for the period 1999-2006. The covariates used in the weighting are gender, age, number of unemployment days in the last 5 years, level of education (3 categories), indicator for UI entitlement, region of residence (6 regions), indicator for at least one child in the household, marital status, foreign born, labor income, social assistance and unemployment benefits one and two years before the start of the unemployment, and calendar year (for inflow). The threshold for the trimming rate is 4% for the single program estimates and 10% for combinations of two programs.

Appendix A: Proofs

A.1 Identification

We show that if Assumptions A1 and A2 hold then $\text{ATET}_t(s)$ is identified. Consider identification of

\[
\text{ATET}_t(s) = \Pr(\overline{Y}_t(s) = 0 \mid S = s, \overline{Y}_{s-1}(s) = 0) - \Pr(\overline{Y}_t(0) = 0 \mid S = s, \overline{Y}_{s-1}(s) = 0) \tag{A.1}
\]

for $t = 2$ and $s = 1$. First, for treatment we have

\[
\Pr(\overline{Y}_2(1) = 0 \mid S = 1) = \Pr(Y_2 = 0, Y_1 = 0 \mid S = 1). \tag{A.2}
\]

Second, for the counterfactual outcome $\Pr(\overline{Y}_2(0) = 0 \mid S = 1)$ under no treatment we have

\begin{align*}
\Pr(\overline{Y}_2(0) = 0 \mid S = 1)
&= E_{X_{1^-} \mid S=1}\left[\Pr(\overline{Y}_2(0) = 0 \mid S = 1, X_{1^-})\right] \tag{A.3} \\
&= E_{X_{1^-} \mid S=1}\left[\Pr(Y_2(0) = 0 \mid Y_1(0) = 0, S = 1, X_{1^-}) \Pr(Y_1(0) = 0 \mid S = 1, X_{1^-})\right] \\
&= E_{X_{1^-} \mid S=1}\left[\Pr(Y_2(0) = 0 \mid Y_1(0) = 0, S > 1, X_{1^-}) \Pr(Y_1(0) = 0 \mid S > 1, X_{1^-})\right] \\
&= E_{X_{1^-} \mid S=1}\left[E_{X_{2^-} \mid X_{1^-}, Y_1(0)=0, S=1}\left\{\Pr(Y_2(0) = 0 \mid Y_1(0) = 0, S > 1, X_{2^-})\right\} \Pr(Y_1(0) = 0 \mid S > 1, X_{1^-})\right] \\
&= E_{X_{1^-} \mid S=1}\left[E_{X_{2^-} \mid X_{1^-}, Y_1(0)=0, S=1}\left\{\Pr(Y_2(0) = 0 \mid Y_1(0) = 0, S > 2, X_{2^-})\right\} \Pr(Y_1(0) = 0 \mid S > 1, X_{1^-})\right] \\
&= E_{X_{1^-} \mid S=1}\left[E_{X_{2^-} \mid X_{1^-}, Y_1=0, S=1}\left\{\Pr(Y_2 = 0 \mid Y_1 = 0, S > 2, X_{2^-})\right\} \Pr(Y_1 = 0 \mid S > 1, X_{1^-})\right]. \tag{A.4}
\end{align*}

Note that the first and the second equality follow from the law of iterated expectations and Bayes rule, the third equality by applying Assumption A1 for the first period, the fourth equality by the law of iterated expectations, and the fifth holds by applying Assumption A1 in the second period. Finally, the sixth equality holds by going from potential outcomes to observed outcomes. In this step, Assumption A2 plays an important role. This assumption ensures that the observed outcomes of the not-yet treated correspond to the potential outcome under no treatment, even though some of the not-yet treated become treated later on.

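A.2 Dynamic IPW estimation

We now show that if Assumptions A1 and A2 hold the DIPW estimator is a consistent estimator of

\[
\text{ATET}_t(s) = \Pr(\overline{Y}_t(s) = 0 \mid S = s, \overline{Y}_{s-1}(s) = 0) - \Pr(\overline{Y}_t(0) = 0 \mid S = s, \overline{Y}_{s-1}(s) = 0). \tag{A.5}
\]

First, consider estimation of the outcome under treatment, $\Pr(\overline{Y}_t(s) = 0 \mid S = s, \overline{Y}_{s-1}(s) = 0)$. For this part of $\text{ATET}_t(s)$, the DIPW estimator is:

\[
\prod_{k=s}^{t}\left[1 - \frac{\sum_i Y_{k,i}\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i = s)}{\sum_i 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i = s)}\right]. \tag{A.6}
\]

Taking the probability limit of (A.6) gives

\begin{align*}
\operatorname*{plim}_{N \to \infty} \prod_{k=s}^{t}\left[1 - \frac{\sum_i Y_{k,i}\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i = s)}{\sum_i 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i = s)}\right]
&= \prod_{k=s}^{t}\left[1 - \frac{\operatorname*{plim}_{N \to \infty} \sum_i Y_{k,i}\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i = s)}{\operatorname*{plim}_{N \to \infty} \sum_i 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i = s)}\right] \\
&= \prod_{k=s}^{t}\left[1 - \frac{E[Y_k\, 1(\overline{Y}_{k-1} = 0)\, 1(S = s)]}{E[1(\overline{Y}_{k-1} = 0)\, 1(S = s)]}\right]. \tag{A.7}
\end{align*}

Next, the observed outcome, $Y_k$, corresponds to the individual's actual treatment regime, so that for individuals with $S = s$ we have that $Y_k = Y_k(s)$. Then,

\[
\prod_{k=s}^{t}\left[1 - \frac{E[Y_k\, 1(\overline{Y}_{k-1} = 0)\, 1(S = s)]}{E[1(\overline{Y}_{k-1} = 0)\, 1(S = s)]}\right]
= \prod_{k=s}^{t}\left[1 - \frac{E[Y_k(s)\, 1(\overline{Y}_{k-1}(s) = 0)\, 1(S = s)]}{E[1(\overline{Y}_{k-1}(s) = 0)\, 1(S = s)]}\right]. \tag{A.8}
\]

By simplifying we obtain

\begin{align*}
\prod_{k=s}^{t}\left[1 - \frac{E[Y_k(s)\, 1(\overline{Y}_{k-1}(s) = 0)\, 1(S = s)]}{E[1(\overline{Y}_{k-1}(s) = 0)\, 1(S = s)]}\right]
&= \prod_{k=s}^{t}\left[1 - \frac{\Pr(Y_k(s) = 1, \overline{Y}_{k-1}(s) = 0, S = s)}{\Pr(\overline{Y}_{k-1}(s) = 0, S = s)}\right] \\
&= \prod_{k=s}^{t}\left[1 - \Pr(Y_k(s) = 1 \mid S = s, \overline{Y}_{k-1}(s) = 0)\right] \\
&= \Pr(\overline{Y}_t(s) = 0 \mid S = s, \overline{Y}_{s-1}(s) = 0). \tag{A.9}
\end{align*}

Thus, from (A.7)-(A.9) we have that

\[
\operatorname*{plim}_{N \to \infty} \prod_{k=s}^{t}\left[1 - \frac{\sum_i Y_{k,i}\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i = s)}{\sum_i 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i = s)}\right]
= \Pr(\overline{Y}_t(s) = 0 \mid S = s, \overline{Y}_{s-1}(s) = 0). \tag{A.10}
\]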
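The treated-side estimator in (A.6) is a product of one-minus-exit-rate terms over periods, computed among those who start treatment in period s. A minimal sketch is given below; the record layout (pairs of treatment start period and exit period, with `None` for "never treated" or "no exit observed"), the function name and the toy data are assumptions made for illustration and do not come from the paper.

```python
import math


def treated_survival(records, s, t):
    """Product over k = s..t of (1 - exit rate in k) among those with S = s.

    records: list of (S, T) pairs; S is the treatment start period (None if
    never treated) and T is the exit period (None if no exit is observed).
    """
    surv = 1.0
    for k in range(s, t + 1):
        at_risk = 0
        exits = 0
        for start, exit_period in records:
            exit_k = math.inf if exit_period is None else exit_period
            # 1(S_i = s) * 1(no exit through period k-1)
            if start == s and exit_k >= k:
                at_risk += 1
                if exit_k == k:  # Y_{k,i} = 1
                    exits += 1
        if at_risk == 0:
            break  # nobody left in the risk set
        surv *= 1.0 - exits / at_risk
    return surv


# Toy data: four individuals treated in period 1, one treated in period 2.
records = [(1, 1), (1, 2), (1, None), (1, 3), (2, 2)]
estimate = treated_survival(records, s=1, t=2)
```

In the toy data one of four treated individuals exits in period 1 and one of the remaining three exits in period 2, so the estimated probability of surviving both periods is 0.75 times 2/3.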

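Second, consider estimation of the outcome under no treatment, $\Pr(\overline{Y}_t(0) = 0 \mid S = s, \overline{Y}_{s-1}(s) = 0)$. For this part of $\text{ATET}_t(s)$ the DIPW estimator is:

\[
\prod_{k=s}^{t}\left[1 - \frac{\sum_i \dfrac{\hat{p}_s(X_{i,s^-})}{\prod_{m=s}^{k}[1 - \hat{p}_m(X_{i,m^-})]}\, Y_{k,i}\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i > k)}{\sum_i \dfrac{\hat{p}_s(X_{i,s^-})}{\prod_{m=s}^{k}[1 - \hat{p}_m(X_{i,m^-})]}\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i > k)}\right]. \tag{A.11}
\]

Taking the probability limit of (A.11) gives

\begin{align*}
&\operatorname*{plim}_{N \to \infty} \prod_{k=s}^{t}\left[1 - \frac{\sum_i \dfrac{\hat{p}_s(X_{i,s^-})}{\prod_{m=s}^{k}[1 - \hat{p}_m(X_{i,m^-})]}\, Y_{k,i}\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i > k)}{\sum_i \dfrac{\hat{p}_s(X_{i,s^-})}{\prod_{m=s}^{k}[1 - \hat{p}_m(X_{i,m^-})]}\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i > k)}\right] \\
&\quad = \prod_{k=s}^{t}\left[1 - \frac{\operatorname*{plim}_{N \to \infty} \sum_i \dfrac{\hat{p}_s(X_{i,s^-})}{\prod_{m=s}^{k}[1 - \hat{p}_m(X_{i,m^-})]}\, Y_{k,i}\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i > k)}{\operatorname*{plim}_{N \to \infty} \sum_i \dfrac{\hat{p}_s(X_{i,s^-})}{\prod_{m=s}^{k}[1 - \hat{p}_m(X_{i,m^-})]}\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i > k)}\right] \\
&\quad = \prod_{k=s}^{t}\left[1 - \frac{E\left[\dfrac{p_s(X_{s^-})}{\prod_{m=s}^{k}[1 - p_m(X_{m^-})]}\, Y_k\, 1(\overline{Y}_{k-1} = 0)\, 1(S > k)\right]}{E\left[\dfrac{p_s(X_{s^-})}{\prod_{m=s}^{k}[1 - p_m(X_{m^-})]}\, 1(\overline{Y}_{k-1} = 0)\, 1(S > k)\right]}\right]. \tag{A.12}
\end{align*}

Next, consider $E\left[\dfrac{p_s(X_{s^-})}{\prod_{m=s}^{k}[1 - p_m(X_{m^-})]}\, Y_k\, 1(\overline{Y}_{k-1} = 0)\, 1(S > k)\right]$ for $k = 2$ and $s = 1$. Then, we need to consider

\[
E\left[\frac{p_1(X_{1^-})}{[1 - p_1(X_{1^-})][1 - p_2(X_{2^-})]}\, Y_2\, 1(Y_1 = 0)\, 1(S > 2)\right]. \tag{A.13}
\]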

If no-anticipation as in Assumption A2 holds, the observed outcome $Y_k$ for individuals treated after $k$ (individuals with $S > k$) corresponds to the potential outcome under no treatment, $Y_k(0)$, so that

\[
E\left[\frac{p_1(X_{1^-})}{[1 - p_1(X_{1^-})][1 - p_2(X_{2^-})]}\, Y_2\, 1(Y_1 = 0)\, 1(S > 2)\right]
= E\left[\frac{p_1(X_{1^-})}{[1 - p_1(X_{1^-})][1 - p_2(X_{2^-})]}\, Y_2(0)\, 1(Y_1(0) = 0)\, 1(S > 2)\right]. \tag{A.14}
\]

Next, by the law of iterated expectations

\[
E\left[\frac{p_1(X_{1^-})}{[1 - p_1(X_{1^-})][1 - p_2(X_{2^-})]}\, Y_2(0)\, 1(Y_1(0) = 0)\, 1(S > 2)\right]
= E_{X_{1^-}}\left[E\left[\frac{p_1(X_{1^-})}{[1 - p_1(X_{1^-})][1 - p_2(X_{2^-})]}\, Y_2(0)\, 1(Y_1(0) = 0)\, 1(S > 2) \,\Big|\, X_{1^-}\right]\right]. \tag{A.15}
\]

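Simplifying gives

\begin{align*}
&E_{X_{1^-}}\left[E\left[\frac{p_1(X_{1^-})}{[1 - p_1(X_{1^-})][1 - p_2(X_{2^-})]}\, Y_2(0)\, 1(Y_1(0) = 0)\, 1(S > 2) \,\Big|\, X_{1^-}\right]\right] \tag{A.16} \\
&\quad = E_{X_{1^-}}\left[\frac{p_1(X_{1^-})}{1 - p_1(X_{1^-})}\, E\left[\frac{Y_2(0)\, 1(Y_1(0) = 0)\, 1(S > 2)}{1 - p_2(X_{2^-})} \,\Big|\, X_{1^-}\right]\right] \\
&\quad = E_{X_{1^-}}\left[\frac{p_1(X_{1^-})}{1 - p_1(X_{1^-})}\, E\left[\frac{Y_2(0)\, 1(Y_1(0) = 0)\, 1(S > 2)}{1 - p_2(X_{2^-})} \,\Big|\, S > 1, X_{1^-}\right] [1 - p_1(X_{1^-})]\right] \\
&\quad = E_{X_{1^-}}\left[p_1(X_{1^-})\, E\left[\frac{Y_2(0)\, 1(Y_1(0) = 0)\, 1(S > 2)}{1 - p_2(X_{2^-})} \,\Big|\, S > 1, X_{1^-}\right]\right] \\
&\quad = E_{X_{1^-}}\left[p_1(X_{1^-}) \Pr(Y_1(0) = 0 \mid S > 1, X_{1^-})\, E\left[\frac{Y_2(0)\, 1(S > 2)}{1 - p_2(X_{2^-})} \,\Big|\, Y_1(0) = 0, S > 1, X_{1^-}\right]\right].
\end{align*}

Next, by the law of iterated expectations

\begin{align*}
&E_{X_{1^-}}\left[p_1(X_{1^-}) \Pr(Y_1(0) = 0 \mid S > 1, X_{1^-})\, E\left[\frac{Y_2(0)\, 1(S > 2)}{1 - p_2(X_{2^-})} \,\Big|\, Y_1(0) = 0, S > 1, X_{1^-}\right]\right] \tag{A.17} \\
&\quad = E_{X_{1^-}}\left[p_1(X_{1^-}) \Pr(Y_1(0) = 0 \mid S > 1, X_{1^-})\, E_{X_{2^-} \mid X_{1^-}, Y_1(0)=0, S>1}\left[E\left[\frac{Y_2(0)\, 1(S > 2)}{1 - p_2(X_{2^-})} \,\Big|\, Y_1(0) = 0, S > 1, X_{2^-}\right]\right]\right].
\end{align*}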

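Simplifying gives

\begin{align*}
&E_{X_{1^-}}\left[p_1(X_{1^-}) \Pr(Y_1(0) = 0 \mid S > 1, X_{1^-})\, E_{X_{2^-} \mid X_{1^-}, Y_1(0)=0, S>1}\left[E\left[\frac{Y_2(0)\, 1(S > 2)}{1 - p_2(X_{2^-})} \,\Big|\, Y_1(0) = 0, S > 1, X_{2^-}\right]\right]\right] \tag{A.18} \\
&\quad = E_{X_{1^-}}\left[p_1(X_{1^-}) \Pr(Y_1(0) = 0 \mid S > 1, X_{1^-})\, E_{X_{2^-} \mid X_{1^-}, Y_1(0)=0, S>1}\left[\frac{E[Y_2(0)\, 1(S > 2) \mid Y_1(0) = 0, S > 1, X_{2^-}]}{1 - p_2(X_{2^-})}\right]\right] \\
&\quad = E_{X_{1^-}}\left[p_1(X_{1^-}) \Pr(Y_1(0) = 0 \mid S > 1, X_{1^-})\, E_{X_{2^-} \mid X_{1^-}, Y_1(0)=0, S>1}\left[\frac{E[Y_2(0) \mid Y_1(0) = 0, S > 2, X_{2^-}]\,[1 - p_2(X_{2^-})]}{1 - p_2(X_{2^-})}\right]\right] \\
&\quad = E_{X_{1^-}}\left[p_1(X_{1^-}) \Pr(Y_1(0) = 0 \mid S > 1, X_{1^-})\, E_{X_{2^-} \mid X_{1^-}, Y_1(0)=0, S>1}\left(\Pr(Y_2(0) = 1 \mid Y_1(0) = 0, S > 2, X_{2^-})\right)\right].
\end{align*}

If unconfoundedness as in Assumption A1 holds, then conditional on $X_{2^-}$ the potential outcomes for individuals with $S > 2$ on average equal the potential outcomes for those treated in period 2, i.e. $\Pr(Y_2(0) = 1 \mid S > 2, Y_1(0) = 0, X_{2^-}) = \Pr(Y_2(0) = 1 \mid S = 2, Y_1(0) = 0, X_{2^-})$, which implies that $\Pr(Y_2(0) = 1 \mid Y_1(0) = 0, S > 2, X_{2^-}) = \Pr(Y_2(0) = 1 \mid Y_1(0) = 0, S > 1, X_{2^-})$, so that

\begin{align*}
&E_{X_{1^-}}\left[p_1(X_{1^-}) \Pr(Y_1(0) = 0 \mid S > 1, X_{1^-})\, E_{X_{2^-} \mid X_{1^-}, Y_1(0)=0, S>1}\left(\Pr(Y_2(0) = 1 \mid Y_1(0) = 0, S > 2, X_{2^-})\right)\right] \tag{A.19} \\
&\quad = E_{X_{1^-}}\left[p_1(X_{1^-}) \Pr(Y_1(0) = 0 \mid S > 1, X_{1^-})\, E_{X_{2^-} \mid X_{1^-}, Y_1(0)=0, S>1}\left(\Pr(Y_2(0) = 1 \mid Y_1(0) = 0, S > 1, X_{2^-})\right)\right] \\
&\quad = E_{X_{1^-}}\left(p_1(X_{1^-}) \Pr(Y_1(0) = 0 \mid S > 1, X_{1^-}) \Pr(Y_2(0) = 1 \mid S > 1, Y_1(0) = 0, X_{1^-})\right).
\end{align*}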

Next, if unconfoundedness as in Assumption A1 holds, then conditional on $X_{1^-}$ the potential outcomes for individuals with $S > 1$ on average equal the potential outcomes for those treated in period 1, i.e., $\Pr(Y_2(0) = 1 \mid S > 1, Y_1(0) = 0, X_{1^-}) = \Pr(Y_2(0) = 1 \mid S = 1, Y_1(0) = 0, X_{1^-})$, so that

\begin{align*}
&E_{X_{1^-}}\left(p_1(X_{1^-}) \Pr(Y_1(0) = 0 \mid S > 1, X_{1^-}) \Pr(Y_2(0) = 1 \mid S > 1, Y_1(0) = 0, X_{1^-})\right) \tag{A.20} \\
&\quad = E_{X_{1^-}}\left(p_1(X_{1^-}) \Pr(Y_1(0) = 0 \mid S = 1, X_{1^-}) \Pr(Y_2(0) = 1 \mid S = 1, Y_1(0) = 0, X_{1^-})\right) \\
&\quad = E_{X_{1^-}}\left(p_1(X_{1^-}) \Pr(Y_2(0) = 1, Y_1(0) = 0 \mid S = 1, X_{1^-})\right) \\
&\quad = E_{X_{1^-}}\left(\Pr(Y_2(0) = 1, Y_1(0) = 0, S = 1 \mid X_{1^-})\right) \\
&\quad = \Pr(Y_2(0) = 1, Y_1(0) = 0, S = 1).
\end{align*}

Thus, from equations (A.12)-(A.20) we have that

\[
E\left[\frac{p_1(X_{1^-})}{[1 - p_1(X_{1^-})][1 - p_2(X_{2^-})]}\, Y_2\, 1(Y_1 = 0)\, 1(S > 2)\right] = \Pr(Y_2(0) = 1, Y_1(0) = 0, S = 1). \tag{A.21}
\]

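By similar reasoning

\[
E\left[\frac{p_s(X_{s^-})}{\prod_{m=s}^{k}[1 - p_m(X_{m^-})]}\, Y_k\, 1(\overline{Y}_{k-1} = 0)\, 1(S > k)\right] = \Pr(S = s, Y_k(0) = 1, \overline{Y}_{k-1}(0) = 0) \tag{A.22}
\]

and

\[
E\left[\frac{p_s(X_{s^-})}{\prod_{m=s}^{k}[1 - p_m(X_{m^-})]}\, 1(\overline{Y}_{k-1} = 0)\, 1(S > k)\right] = \Pr(S = s, \overline{Y}_{k-1}(0) = 0). \tag{A.23}
\]

Substituting (A.22) and (A.23) into (A.12) and simplifying gives

\begin{align*}
&\operatorname*{plim}_{N \to \infty} \prod_{k=s}^{t}\left[1 - \frac{\sum_i \dfrac{\hat{p}_s(X_{i,s^-})}{\prod_{m=s}^{k}[1 - \hat{p}_m(X_{i,m^-})]}\, Y_{k,i}\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i > k)}{\sum_i \dfrac{\hat{p}_s(X_{i,s^-})}{\prod_{m=s}^{k}[1 - \hat{p}_m(X_{i,m^-})]}\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i > k)}\right] \tag{A.24} \\
&\quad = \prod_{k=s}^{t}\left[1 - \frac{\Pr(S = s, Y_k(0) = 1, \overline{Y}_{k-1}(0) = 0)}{\Pr(S = s, \overline{Y}_{k-1}(0) = 0)}\right] \\
&\quad = \prod_{k=s}^{t}\left[1 - \Pr(Y_k(0) = 1 \mid S = s, \overline{Y}_{k-1}(0) = 0)\right] \\
&\quad = \prod_{k=s}^{t} \Pr(Y_k(0) = 0 \mid S = s, \overline{Y}_{k-1}(0) = 0) \\
&\quad = \Pr(\overline{Y}_t(0) = 0 \mid S = s, \overline{Y}_{s-1}(s) = 0).
\end{align*}

Together (A.10) and (A.24) imply that $\operatorname*{plim}_{N \to \infty} \widehat{\text{ATET}}_t(s) = \text{ATET}_t(s)$.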
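The counterfactual part of the estimator, as in (A.11), reweights the not-yet treated in each period k by the period-s propensity divided by the product of one-minus-propensity terms from s through k. The sketch below is one way to write this down, assuming the propensity scores have already been estimated and are passed in per individual and period; the record layout, the function name and the numbers are hypothetical and only illustrate the weighting.

```python
import math


def counterfactual_survival(records, s, t):
    """Weighted product-limit estimate using the not-yet treated.

    records: list of dicts with keys
      "S": treatment start period (None if never treated),
      "T": exit period (None if no exit is observed),
      "p": dict mapping period m to the estimated probability of starting
           treatment in m (assumed to be estimated beforehand).
    """
    surv = 1.0
    for k in range(s, t + 1):
        num = 0.0
        den = 0.0
        for r in records:
            start = math.inf if r["S"] is None else r["S"]
            exit_k = math.inf if r["T"] is None else r["T"]
            # Risk set: no exit through k-1 and not yet treated in period k.
            if exit_k >= k and start > k:
                w = r["p"][s]
                for m in range(s, k + 1):
                    w /= 1.0 - r["p"][m]
                den += w
                if exit_k == k:  # Y_{k,i} = 1
                    num += w
        if den == 0.0:
            break  # empty risk set
        surv *= 1.0 - num / den
    return surv


# Toy data with a constant propensity of 0.2, so all weights are equal.
p = {1: 0.2, 2: 0.2}
records = [
    {"S": None, "T": 1, "p": p},
    {"S": None, "T": 2, "p": p},
    {"S": None, "T": None, "p": p},
    {"S": 3, "T": None, "p": p},
    {"S": 1, "T": 2, "p": p},  # treated in period 1, never in the risk set
    {"S": None, "T": None, "p": p},
]
estimate = counterfactual_survival(records, s=1, t=2)
```

With equal propensities the weights cancel and the estimate reduces to the unweighted product-limit estimate among the not-yet treated (0.8 times 0.75 in the toy data). With covariate-dependent propensities the weights shift the comparison group toward individuals who resemble those treated in period s.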

Appendix B: Additional proofs (for online publication only)

B.1 Identification with right censoring

We show that if Assumptions A1, A2 and A3 hold then $\text{ATET}_t(s)$ is identified. Consider identification for $t = 2$ and $s = 1$. For the counterfactual $\Pr(\overline{Y}_2(0) = 0 \mid S = 1)$ under no treatment we have

\begin{align*}
\Pr(\overline{Y}_2(0) = 0 \mid S = 1)
&= E_{X_{1^-} \mid S=1}\left[\Pr(\overline{Y}_2(0) = 0 \mid S = 1, X_{1^-})\right] \tag{B.1} \\
&= E_{X_{1^-} \mid S=1}\left[\Pr(Y_2(0) = 0 \mid Y_1(0) = 0, S > 1, X_{1^-}) \Pr(Y_1(0) = 0 \mid S > 1, X_{1^-})\right] \\
&= E_{X_{1^-} \mid S=1}\left[\Pr(Y_2(0) = 0 \mid Y_1(0) = 0, C_1 = 0, S > 1, X_{1^-}) \Pr(Y_1(0) = 0 \mid S > 1, C_1 = 0, X_{1^-})\right] \\
&= E_{X_{1^-} \mid S=1}\left[E_{X_{2^-} \mid X_{1^-}, Y_1(0)=0, S=1, C_1=0}\left\{\Pr(Y_2(0) = 0 \mid Y_1(0) = 0, S > 1, C_1 = 0, X_{2^-})\right\} \Pr(Y_1(0) = 0 \mid S > 1, C_1 = 0, X_{1^-})\right] \\
&= E_{X_{1^-} \mid S=1}\left[E_{X_{2^-} \mid X_{1^-}, Y_1(0)=0, S=1, C_1=0}\left\{\Pr(Y_2(0) = 0 \mid Y_1(0) = 0, S > 2, C_1 = 0, X_{2^-})\right\} \Pr(Y_1(0) = 0 \mid S > 1, C_1 = 0, X_{1^-})\right] \\
&= E_{X_{1^-} \mid S=1}\left[E_{X_{2^-} \mid X_{1^-}, Y_1(0)=0, S=1, C_1=0}\left\{\Pr(Y_2(0) = 0 \mid Y_1(0) = 0, S > 2, \overline{C}_2 = 0, X_{2^-})\right\} \Pr(Y_1(0) = 0 \mid S > 1, C_1 = 0, X_{1^-})\right] \\
&= E_{X_{1^-} \mid S=1}\left[E_{X_{2^-} \mid X_{1^-}, Y_1=0, S=1, C_1=0}\left\{\Pr(Y_2 = 0 \mid Y_1 = 0, S > 2, \overline{C}_2 = 0, X_{2^-})\right\} \Pr(Y_1 = 0 \mid S > 1, C_1 = 0, X_{1^-})\right].
\end{align*}

As above, the first equality follows from the law of iterated expectations and the second equality by applying Assumption A1 for the first period. The third equality follows from Assumption A3 in the first period. The fourth equality follows by the law of iterated expectations and the fifth holds by applying Assumption A1 in the second period. The sixth equality follows from Assumption A3 in the second period. The seventh equality holds by going from potential outcomes to observed outcomes. As before, Assumption A2 plays an important role in this step.

Next, using similar reasoning we have

\[
\Pr(\overline{Y}_2(s) = 0 \mid S = 1) = E_{X_{1^-} \mid S=1}\left[E_{X_{2^-} \mid X_{1^-}, Y_1=0, S=1, C_1=0}\left\{\Pr(Y_2 = 0 \mid Y_1 = 0, S = 1, \overline{C}_2 = 0, X_{2^-})\right\} \Pr(Y_1 = 0 \mid S = 1, C_1 = 0, X_{1^-})\right].
\]

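B.2 DIPW estimation with right censoring

We show that if Assumptions A1-A3 hold the DIPW estimator is a consistent estimator of $\text{ATET}_t(s)$. Consider estimation of the outcome under no treatment, $\Pr(\overline{Y}_t(0) = 0 \mid S = s, \overline{Y}_{s-1}(s) = 0)$. For this part of $\text{ATET}_t(s)$ the DIPW estimator is:

\[
\prod_{k=s}^{t}\left[1 - \frac{\sum_i \dfrac{\hat{p}_s(X_{i,s^-})}{1 - \hat{p}_s(X_{i,s^-})}\, \dfrac{1}{\prod_{m=s+1}^{k}[1 - \hat{p}_m(X_{i,m^-})][1 - \hat{c}_m(X_{i,m^-})]}\, Y_{k,i}\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i > k)\, 1(\overline{C}_{k,i} = 0)}{\sum_i \dfrac{\hat{p}_s(X_{i,s^-})}{1 - \hat{p}_s(X_{i,s^-})}\, \dfrac{1}{\prod_{m=s+1}^{k}[1 - \hat{p}_m(X_{i,m^-})][1 - \hat{c}_m(X_{i,m^-})]}\, 1(\overline{Y}_{k-1,i} = 0)\, 1(S_i > k)\, 1(\overline{C}_{k,i} = 0)}\right]. \tag{B.2}
\]

Taking the probability limit of (B.2) gives

\[
\prod_{k=s}^{t}\left[1 - \frac{E\left[\dfrac{p_s(X_{s^-})}{1 - p_s(X_{s^-})}\, \dfrac{1}{\prod_{m=s+1}^{k}[1 - p_m(X_{m^-})][1 - c_m(X_{m^-})]}\, Y_k\, 1(\overline{Y}_{k-1} = 0)\, 1(S > k)\, 1(\overline{C}_k = 0)\right]}{E\left[\dfrac{p_s(X_{s^-})}{1 - p_s(X_{s^-})}\, \dfrac{1}{\prod_{m=s+1}^{k}[1 - p_m(X_{m^-})][1 - c_m(X_{m^-})]}\, 1(\overline{Y}_{k-1} = 0)\, 1(S > k)\, 1(\overline{C}_k = 0)\right]}\right]. \tag{B.3}
\]

Next, consider $E\left[\dfrac{p_s(X_{s^-})}{1 - p_s(X_{s^-})}\, \dfrac{1}{\prod_{m=s+1}^{k}[1 - p_m(X_{m^-})][1 - c_m(X_{m^-})]}\, Y_k\, 1(\overline{Y}_{k-1} = 0)\, 1(S > k)\, 1(\overline{C}_k = 0)\right]$ for $k = 2$ and $s = 1$. Then, we need to consider

\[
E\left[\frac{p_1(X_{1^-})}{[1 - p_1(X_{1^-})][1 - p_2(X_{2^-})][1 - c_2(X_{2^-})]}\, Y_2\, 1(Y_1 = 0)\, 1(S > 2)\, 1(\overline{C}_2 = 0)\right]. \tag{B.4}
\]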

If no-anticipation as in Assumption A2 holds, the observed outcome $Y_k$ for individuals treated after $k$ (individuals with $S > k$) corresponds to the potential outcome under no treatment, $Y_k(0)$, so that from (B.4) we have

\[
E\left[\frac{p_1(X_{1^-})}{[1 - p_1(X_{1^-})][1 - p_2(X_{2^-})][1 - c_2(X_{2^-})]}\, Y_2(0)\, 1(Y_1(0) = 0)\, 1(S > 2)\, 1(\overline{C}_2 = 0)\right]. \tag{B.5}
\]

Next, by the law of iterated expectations we obtain

\[
E_{X_{1^-}}\left[E\left[\frac{p_1(X_{1^-})}{[1 - p_1(X_{1^-})][1 - p_2(X_{2^-})][1 - c_2(X_{2^-})]}\, Y_2(0)\, 1(Y_1(0) = 0)\, 1(S > 2)\, 1(\overline{C}_2 = 0) \,\Big|\, X_{1^-}\right]\right]. \tag{B.6}
\]

Simplifying, and since the sample is selected such that all observations that are censored in the first period have already been removed, gives

\[
E_{X_{1^-}}\left[p_1(X_{1^-}) \Pr(Y_1(0) = 0 \mid S > 1, X_{1^-})\, E\left[\frac{Y_2(0)\, 1(S > 2)\, 1(C_2 = 0)}{[1 - p_2(X_{2^-})][1 - c_2(X_{2^-})]} \,\Big|\, Y_1(0) = 0, S > 1, X_{1^-}\right]\right]. \tag{B.7}
\]

Next, by the law of iterated expectations we obtain EX1−

    Y2 (0)1(S > 2)1(C2 = 0) p1 (X1− ) Pr(Y1 (0) = 0|S > 1, X1− )EX2− |X1− ,Y1 =C1 =0,S>1 E |Y1 (0) = 0, S > 1, C1 = 0, X2− [1 − p2 (X2− )][1 − c2 (X2− )] (B.8)

Simplifying gives

CR IP T

  EX1− p1 (X1− ) Pr(Y1 (0) = 0|S > 1, X1− )EX2− |X1− ,Y1 =C1 =0,S>1 Pr(Y2 (0) = 1|Y1 (0) = 0, S > 2, C 2 = 0, X2− ) . (B.9)

If assumption A1 and A3 hold, then conditional on X2− the potential outcomes for individuals with S > 2 and C 2 = 0 on average equals the potential outcomes individuals with S > 1 and C 1 = 0, so that by

AN US

simplifying (B.9) equals

EX1− (p1 (X1− ) Pr(Y1 (0) = 0|S > 1, X1− ) Pr(Y2 (0) = 1|S > 1, Y1 (0) = 0, C1 = 0, X1− )) .

(B.10)

Next, if If assumption A1 and A3 hold, then conditional on X1− the potential outcomes for individuals with S > 1 and C1 = 0 on average equals the potential outcomes for individuals with S = 1, so that (B.10)

M

simplifies to

(B.11)

ED

EX1− (p1 (X1− ) Pr(Y1 (0) = 0|S = 1, X1− ) Pr(Y2 (0) = 1|S = 1, Y1 (0) = 0, X1− )) = Pr(Y2 (0) = 1, Y1 (0) = 0), S = 1).

Thus, from equations (B.3)-(B.11) we have that

$$E\left[\frac{p_1(X_{1^-})\,Y_2\,\mathbf{1}(Y_1=0)\,\mathbf{1}(S>2)\,\mathbf{1}(\bar C_2=0)}{[1-p_1(X_{1^-})][1-p_2(X_{2^-})][1-c_2(X_{2^-})]}\right]=\Pr(Y_2(0)=1,Y_1(0)=0,S=1). \tag{B.12}$$
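The equality in (B.12) can be checked numerically. The following is a minimal Monte Carlo sketch, not the paper's implementation: it assumes a hypothetical two-period model without censoring ($c_2 = 0$), one binary covariate per period, and known (not estimated) assignment probabilities. All numerical values and the function names `simulate_not_yet_treated`, `weighted_not_yet_treated_mean` and `target_joint_probability` are illustrative assumptions.

```python
import numpy as np


def simulate_not_yet_treated(n, seed=0):
    """Draw a hypothetical two-period sample (all parameter values are assumptions)."""
    rng = np.random.default_rng(seed)
    x1 = rng.random(n) < 0.5                              # binary covariate X_1
    p1 = np.where(x1, 0.4, 0.2)                           # p_1(X_1) = Pr(S = 1 | X_1)
    s_is_1 = rng.random(n) < p1                           # treated in period 1
    y1_0 = rng.random(n) < np.where(x1, 0.30, 0.15)       # Y_1(0): exit in period 1
    x2 = rng.random(n) < np.where(x1, 0.7, 0.3)           # covariate X_2 given X_1
    p2 = np.where(x2, 0.5, 0.25)                          # p_2(X_2) among S > 1, Y_1 = 0
    s_is_2 = ~s_is_1 & ~y1_0 & (rng.random(n) < p2)       # treated in period 2
    # Y_2(0) = 1 requires survival of period 1 under no treatment.
    y2_0 = ~y1_0 & (rng.random(n) < np.where(x2, 0.35, 0.20))
    return {"p1": p1, "p2": p2, "s_is_1": s_is_1, "s_is_2": s_is_2, "y2_0": y2_0}


def weighted_not_yet_treated_mean(d):
    """Left-hand side of (B.12) with c_2 = 0: weighted exits of the not-yet treated."""
    not_yet = ~d["s_is_1"] & ~d["s_is_2"]                 # S > 2
    weight = d["p1"] / ((1.0 - d["p1"]) * (1.0 - d["p2"]))
    # For S > 2 the observed outcomes equal Y(0) under no-anticipation.
    return np.mean(weight * (d["y2_0"] & not_yet))


def target_joint_probability(d):
    """Right-hand side of (B.12): Pr(Y_2(0) = 1, Y_1(0) = 0, S = 1)."""
    return np.mean(d["s_is_1"] & d["y2_0"])


if __name__ == "__main__":
    sample = simulate_not_yet_treated(400_000, seed=1)
    print(weighted_not_yet_treated_mean(sample), target_joint_probability(sample))
```

In this sketch the right-hand side uses the latent no-treatment outcomes of those treated in period 1, which are never observed in real data; the simulation is only meant to illustrate that the weighted mean over the not-yet treated recovers the same quantity.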

By similar reasoning

$$E\left[\frac{p_s(X_{s^-})}{1-p_s(X_{s^-})}\,\frac{1}{\prod_{m=s+1}^{k}[1-p_m(X_{m^-})][1-c_m(X_{m^-})]}\,Y_k\,\mathbf{1}(\bar Y_{k-1}=0)\,\mathbf{1}(S>k)\,\mathbf{1}(\bar C_k=0)\right]=\Pr(S=s,Y_k(0)=1,\bar Y_{k-1}(0)=0) \tag{B.13}$$

and

$$E\left[\frac{p_s(X_{s^-})}{1-p_s(X_{s^-})}\,\frac{1}{\prod_{m=s+1}^{k}[1-p_m(X_{m^-})][1-c_m(X_{m^-})]}\,\mathbf{1}(\bar Y_{k-1}=0)\,\mathbf{1}(S>k)\,\mathbf{1}(\bar C_k=0)\right]=\Pr(S=s,\bar Y_{k-1}(0)=0). \tag{B.14}$$

Substituting (B.13) and (B.14) implies that (B.3) equals $\Pr(\bar Y_t(0)=0\mid S=s,\bar Y_{s-1}(s)=0)$. This and similar reasoning for estimation of the outcome under treatment, $\Pr(\bar Y_t(s)=0\mid S=s,\bar Y_{s-1}(s)=0)$, imply that $\operatorname{plim}_{N\to\infty}\widehat{ATET}_t(s)=ATET_t(s)$.

B.3 Identification with sequences of treatments

In this section of the appendix we show that if Assumptions A4 and A5 hold, $ATE(s_t,s_t^*)$ is identified. First, consider $E(\bar Y_2(s_2)=0)$ in detail:

$$\begin{aligned}
E(\bar Y_2(s_2)=0) &= E_{X_{1^-}}\left[E(\bar Y_2(s_2)=0\mid X_{1^-})\right]\\
&= E_{X_{1^-}}\left[E(\bar Y_2(s_2)=0\mid X_{1^-},S_1=s_1)\right]\\
&= E_{X_{1^-}}\left[\Pr(Y_2(s_2)=0\mid X_{1^-},Y_1(s_1)=0,S_1=s_1)\,\Pr(Y_1(s_1)=0\mid X_{1^-},S_1=s_1)\right]\\
&= E_{X_{1^-}}\left[E_{X_{2^-}\mid X_{1^-},Y_1(s_1)=0,S_1=s_1}\{\Pr(Y_2(s_2)=0\mid X_{2^-},Y_1(s_1)=0,S_1=s_1)\}\,\Pr(Y_1(s_1)=0\mid X_{1^-},S_1=s_1)\right]\\
&= E_{X_{1^-}}\left[E_{X_{2^-}\mid X_{1^-},Y_1(s_1)=0,S_1=s_1}\{\Pr(Y_2(s_2)=0\mid X_{2^-},Y_1(s_1)=0,\bar S_2=s_2)\}\,\Pr(Y_1(s_1)=0\mid X_{1^-},S_1=s_1)\right]\\
&= E_{X_{1^-}}\left[E_{X_{2^-}\mid X_{1^-},Y_1=0,S_1=s_1}\{\Pr(Y_2=0\mid X_{2^-},Y_1=0,\bar S_2=s_2)\}\,\Pr(Y_1=0\mid X_{1^-},S_1=s_1)\right].
\end{aligned} \tag{B.15}$$

Note that the first equality follows from the law of iterated expectations, the second from applying Assumption A4 in the first period, the third from Bayes' rule, the fourth from the law of iterated expectations, and the fifth from applying Assumption A4 in the second period. Finally, the sixth equality holds by going from potential outcomes to observed outcomes. In this step, Assumption A5 plays an important role. This assumption ensures that the observed outcomes of the not-yet treated correspond to the potential outcomes under no treatment, even though some of the not-yet treated become treated later on.
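The identification argument in (B.15) can be illustrated with an exact calculation. The sketch below is an assumption-laden illustration, not the paper's code: it builds the observed-data distribution of a hypothetical two-period model with binary covariates and treatments (all probabilities and the names `observed_joint`, `g_formula` and `true_survival` are made up for illustration), evaluates the last line of (B.15) using only that observed distribution, and compares it with the counterfactual survival probability implied by the structural model.

```python
import itertools
import numpy as np

# Hypothetical structural model; every number below is an illustrative assumption.
P_X1 = 0.5                                                   # Pr(X1 = 1)
def p_s1(x1): return 0.3 + 0.4 * x1                          # Pr(S1 = 1 | X1)
def h1(x1, s1): return 0.2 + 0.1 * x1 + 0.1 * s1            # Pr(Y1(s1) = 1 | X1)
def p_x2(x1, s1): return 0.4 + 0.2 * x1 - 0.1 * s1          # Pr(X2 = 1 | X1, s1, Y1 = 0)
def p_s2(x2, s1): return 0.2 + 0.3 * x2 + 0.2 * s1          # Pr(S2 = 1 | X2, S1, Y1 = 0)
def h2(x2, s1, s2): return 0.1 + 0.2 * x2 + 0.1 * s1 + 0.15 * s2  # exit hazard, period 2


def bern(p, v):
    """Probability that a Bernoulli(p) variable takes the value v."""
    return p if v == 1 else 1.0 - p


def observed_joint():
    """Joint law of (X1, S1, Y1, X2, S2, Y2); X2, S2, Y2 are set to 0 after an exit."""
    joint = np.zeros((2,) * 6)
    for x1, s1 in itertools.product((0, 1), repeat=2):
        base = bern(P_X1, x1) * bern(p_s1(x1), s1)
        joint[x1, s1, 1, 0, 0, 0] += base * h1(x1, s1)
        for x2, s2, y2 in itertools.product((0, 1), repeat=3):
            joint[x1, s1, 0, x2, s2, y2] += (
                base * (1.0 - h1(x1, s1)) * bern(p_x2(x1, s1), x2)
                * bern(p_s2(x2, s1), s2) * bern(h2(x2, s1, s2), y2)
            )
    return joint


def g_formula(joint, seq):
    """Last line of (B.15), computed from the observed-data distribution only."""
    a1, a2 = seq
    total = 0.0
    for x1 in (0, 1):
        arm1 = joint[x1, a1]                          # axes: (y1, x2, s2, y2)
        surv1 = arm1[0].sum() / arm1.sum()            # Pr(Y1 = 0 | X1, S1 = a1)
        inner = 0.0
        for x2 in (0, 1):
            w = arm1[0, x2].sum() / arm1[0].sum()     # Pr(X2 | X1, Y1 = 0, S1 = a1)
            cell = arm1[0, x2, a2]                    # axis: (y2,)
            inner += w * cell[0] / cell.sum()         # Pr(Y2 = 0 | X2, Y1 = 0, S1, S2)
        total += joint[x1].sum() * surv1 * inner
    return total


def true_survival(seq):
    """Counterfactual Pr(Y1(s1) = 0, Y2(s2) = 0) implied by the structural model."""
    a1, a2 = seq
    return sum(
        bern(P_X1, x1) * (1.0 - h1(x1, a1)) * bern(p_x2(x1, a1), x2) * (1.0 - h2(x2, a1, a2))
        for x1 in (0, 1) for x2 in (0, 1)
    )


if __name__ == "__main__":
    joint = observed_joint()
    for seq in itertools.product((0, 1), repeat=2):
        print(seq, g_formula(joint, seq), true_survival(seq))
```

Because treatment assignment in this toy model depends only on the covariate history, the two quantities coincide up to floating-point error for every sequence, which is what sequential unconfoundedness delivers in (B.15).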

By similar reasoning we have that

$$E(\bar Y_2(s_2^*)=0)=E_{X_{1^-}}\left[E_{X_{2^-}\mid X_{1^-},Y_1=0,S_1=s_1^*}\{\Pr(Y_2=0\mid X_{2^-},Y_1=0,\bar S_2=s_2^*)\}\,\Pr(Y_1=0\mid X_{1^-},S_1=s_1^*)\right] \tag{B.16}$$

and, again by similar reasoning, that $E(\bar Y_t(s_t)=0)$ and $E(\bar Y_t(s_t^*)=0)$, and thus $ATE(s_t,s_t^*)$, are identified.

B.4 Estimation with sequences of treatments

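We now show that if Assumptions A4 and A5 hold, the dynamic sequential inverse probability weighting (DSIPW) estimator is a consistent estimator of $ATE(s_t,s_t^*)=\Pr(\bar Y_t(s_t)=0)-\Pr(\bar Y_t(s_t^*)=0)$. We consider estimation of $\Pr(\bar Y_t(s_t)=0)$ in detail. For this part of $ATE(s_t,s_t^*)$ the DSIPW estimator is

$$\prod_{k=1}^{t}\left[1-\frac{\sum_{i=1}^{N}\dfrac{Y_{k,i}\,\mathbf{1}(\bar Y_{k-1,i}=0)\,\mathbf{1}(\bar S_{k,i}=s_k)}{\hat p_{s_1}(X_{i,1^-})\prod_{m=2}^{k}\hat p_{s_m}(X_{i,m^-},s_{m-1})}}{\sum_{i=1}^{N}\dfrac{\mathbf{1}(\bar Y_{k-1,i}=0)\,\mathbf{1}(\bar S_{k,i}=s_k)}{\hat p_{s_1}(X_{i,1^-})\prod_{m=2}^{k}\hat p_{s_m}(X_{i,m^-},s_{m-1})}}\right]. \tag{B.17}$$

Taking the probability limit of (B.17) gives

$$\operatorname*{plim}_{N\to\infty}\prod_{k=1}^{t}\left[1-\frac{\sum_{i=1}^{N}\dfrac{Y_{k,i}\,\mathbf{1}(\bar Y_{k-1,i}=0)\,\mathbf{1}(\bar S_{k,i}=s_k)}{\hat p_{s_1}(X_{i,1^-})\prod_{m=2}^{k}\hat p_{s_m}(X_{i,m^-},s_{m-1})}}{\sum_{i=1}^{N}\dfrac{\mathbf{1}(\bar Y_{k-1,i}=0)\,\mathbf{1}(\bar S_{k,i}=s_k)}{\hat p_{s_1}(X_{i,1^-})\prod_{m=2}^{k}\hat p_{s_m}(X_{i,m^-},s_{m-1})}}\right]=\prod_{k=1}^{t}\left[1-\frac{E\left[\dfrac{Y_k\,\mathbf{1}(\bar Y_{k-1}=0)\,\mathbf{1}(\bar S_k=s_k)}{p_{s_1}(X_{1^-})\prod_{m=2}^{k}p_{s_m}(X_{m^-},s_{m-1})}\right]}{E\left[\dfrac{\mathbf{1}(\bar Y_{k-1}=0)\,\mathbf{1}(\bar S_k=s_k)}{p_{s_1}(X_{1^-})\prod_{m=2}^{k}p_{s_m}(X_{m^-},s_{m-1})}\right]}\right]. \tag{B.18}$$

Next, as illustration consider

$$E\left[\frac{Y_k\,\mathbf{1}(\bar Y_{k-1}=0)\,\mathbf{1}(\bar S_k=s_k)}{p_{s_1}(X_{1^-})\prod_{m=2}^{k}p_{s_m}(X_{m^-},s_{m-1})}\right]$$

for $k=2$, i.e. consider

$$E\left[\frac{Y_2\,\mathbf{1}(\bar Y_1=0)\,\mathbf{1}(\bar S_2=s_2)}{p_{s_1}(X_{1^-})\,p_{s_2}(X_{2^-},s_1)}\right]. \tag{B.19}$$

Initially,

$$E\left[\frac{Y_2\,\mathbf{1}(Y_1=0)\,\mathbf{1}(\bar S_2=s_2)}{p_{s_1}(X_{1^-})\,p_{s_2}(X_{2^-},s_1)}\right]=E\left[\frac{Y_2(s_2)\,\mathbf{1}(Y_1(s_1)=0)\,\mathbf{1}(\bar S_2=s_2)}{p_{s_1}(X_{1^-})\,p_{s_2}(X_{2^-},s_1)}\right],$$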

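where the equality holds since the observed outcome, $Y_2$, equals the potential outcome under the actual treatment sequence. For instance, $Y_2=Y_2(s_2)$ for individuals with $\bar S_2=s_2$. Next,

$$\begin{aligned}
&E\left[\frac{Y_2(s_2)\,\mathbf{1}(Y_1(s_1)=0)\,\mathbf{1}(\bar S_2=s_2)}{p_{s_1}(X_{1^-})\,p_{s_2}(X_{2^-},s_1)}\right]\\
&\quad= E_{X_{1^-}}\left[E\left[\frac{Y_2(s_2)\,\mathbf{1}(Y_1(s_1)=0)\,\mathbf{1}(\bar S_2=s_2)}{p_{s_1}(X_{1^-})\,p_{s_2}(X_{2^-},s_1)}\,\Big|\,X_{1^-}\right]\right]\\
&\quad= E_{X_{1^-}}\left[\frac{1}{p_{s_1}(X_{1^-})}\,E\left[\frac{Y_2(s_2)\,\mathbf{1}(Y_1(s_1)=0)\,\mathbf{1}(\bar S_2=s_2)}{p_{s_2}(X_{2^-},s_1)}\,\Big|\,S_1=s_1,X_{1^-}\right]p_{s_1}(X_{1^-})\right]\\
&\quad= E_{X_{1^-}}\left[E\left[\frac{Y_2(s_2)\,\mathbf{1}(Y_1(s_1)=0)\,\mathbf{1}(\bar S_2=s_2)}{p_{s_2}(X_{2^-},s_1)}\,\Big|\,S_1=s_1,X_{1^-}\right]\right]\\
&\quad= E_{X_{1^-}}\left[\Pr(Y_1(s_1)=0\mid S_1=s_1,X_{1^-})\,E\left[\frac{Y_2(s_2)\,\mathbf{1}(\bar S_2=s_2)}{p_{s_2}(X_{2^-},s_1)}\,\Big|\,Y_1(s_1)=0,S_1=s_1,X_{1^-}\right]\right]\\
&\quad= E_{X_{1^-}}\left[\Pr(Y_1(s_1)=0\mid S_1=s_1,X_{1^-})\,E_{X_{2^-}\mid Y_1(s_1)=0,S_1=s_1,X_{1^-}}\left[E\left[\frac{Y_2(s_2)\,\mathbf{1}(\bar S_2=s_2)}{p_{s_2}(X_{2^-},s_1)}\,\Big|\,Y_1(s_1)=0,S_1=s_1,X_{2^-}\right]\right]\right]\\
&\quad= E_{X_{1^-}}\left[\Pr(Y_1(s_1)=0\mid S_1=s_1,X_{1^-})\,E_{X_{2^-}\mid Y_1(s_1)=0,S_1=s_1,X_{1^-}}\left[\frac{E\left[Y_2(s_2)\mid Y_1(s_1)=0,\bar S_2=s_2,X_{2^-}\right]p_{s_2}(X_{2^-},s_1)}{p_{s_2}(X_{2^-},s_1)}\right]\right]\\
&\quad= E_{X_{1^-}}\left[\Pr(Y_1(s_1)=0\mid S_1=s_1,X_{1^-})\,E_{X_{2^-}\mid Y_1(s_1)=0,S_1=s_1,X_{1^-}}\left[E\left[Y_2(s_2)\mid Y_1(s_1)=0,\bar S_2=s_2,X_{2^-}\right]\right]\right].
\end{aligned} \tag{B.20}$$

Note that the first and the fifth equality follow from the law of iterated expectations, and the other equalities follow from simplification. The next step is to use the sequential unconfoundedness in Assumption A.1. If this assumption holds we have that

$$E\left[Y_2(s_2)\mid Y_1(s_1)=0,\bar S_2=s_2,X_{2^-}\right]=E\left[Y_2(s_2)\mid Y_1(s_1)=0,S_1=s_1,X_{2^-}\right]. \tag{B.21}$$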

Substituting (B.21) into the expression after the last equality in (B.20) and simplifying gives

$$E_{X_{1^-}}\left[\Pr(Y_1(s_1)=0\mid S_1=s_1,X_{1^-})\,E_{X_{2^-}\mid Y_1(s_1)=0,S_1=s_1,X_{1^-}}\left[E\left[Y_2(s_2)\mid Y_1(s_1)=0,\bar S_2=s_2,X_{2^-}\right]\right]\right]=E_{X_{1^-}}\left(\Pr(Y_2(s_2)=1,Y_1(s_1)=0\mid S_1=s_1,X_{1^-})\right). \tag{B.22}$$

From the sequential unconfoundedness (Assumption A.1) we have that

$$E_{X_{1^-}}\left(\Pr(Y_2(s_2)=1,Y_1(s_1)=0\mid S_1=s_1,X_{1^-})\right)=E_{X_{1^-}}\left(\Pr(Y_2(s_2)=1,Y_1(s_1)=0\mid X_{1^-})\right)=\Pr(Y_2(s_2)=1,Y_1(s_1)=0), \tag{B.23}$$

where the first equality follows from Assumption A.1 and the second from the law of iterated expectations.

Thus, from (B.19)-(B.23) we have that

$$E\left[\frac{Y_2\,\mathbf{1}(Y_1=0)\,\mathbf{1}(\bar S_2=s_2)}{p_{s_1}(X_{1^-})\,p_{s_2}(X_{2^-},s_1)}\right]=\Pr(Y_2(s_2)=1,Y_1(s_1)=0). \tag{B.24}$$

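By similar reasoning we have

$$E\left[\frac{Y_k\,\mathbf{1}(\bar Y_{k-1}=0)\,\mathbf{1}(\bar S_k=s_k)}{p_{s_1}(X_{1^-})\prod_{m=2}^{k}p_{s_m}(X_{m^-},s_{m-1})}\right]=\Pr(Y_k(s_k)=1,\bar Y_{k-1}(s_{k-1})=0) \tag{B.25}$$

and

$$E\left[\frac{\mathbf{1}(\bar Y_{k-1}=0)\,\mathbf{1}(\bar S_k=s_k)}{p_{s_1}(X_{1^-})\prod_{m=2}^{k}p_{s_m}(X_{m^-},s_{m-1})}\right]=\Pr(\bar Y_{k-1}(s_{k-1})=0). \tag{B.26}$$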

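Next, substituting (B.26) and (B.25) into (B.18) and simplifying gives

$$\prod_{k=1}^{t}\left[1-\frac{E\left[\dfrac{Y_k\,\mathbf{1}(\bar Y_{k-1}=0)\,\mathbf{1}(\bar S_k=s_k)}{p_{s_1}(X_{1^-})\prod_{m=2}^{k}p_{s_m}(X_{m^-},s_{m-1})}\right]}{E\left[\dfrac{\mathbf{1}(\bar Y_{k-1}=0)\,\mathbf{1}(\bar S_k=s_k)}{p_{s_1}(X_{1^-})\prod_{m=2}^{k}p_{s_m}(X_{m^-},s_{m-1})}\right]}\right]=\prod_{k=1}^{t}\left[1-\Pr(Y_k(s_k)=1\mid\bar Y_{k-1}(s_{k-1})=0)\right]=\Pr(\bar Y_t(s_t)=0).$$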


Estimation results for $\Pr(\bar Y_t(s_t^*)=0)$ follow using similar reasoning.
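The consistency argument can be illustrated with a small simulation. The following is a hedged sketch rather than the paper's implementation: it assumes a hypothetical two-period model with binary covariates and treatments, uses the true assignment probabilities in place of estimated ones, and computes the sample analogue of the product of one minus weighted hazards for $t = 2$. All parameter values and the names `simulate_sequences`, `dsipw_survival` and `true_survival` are illustrative assumptions.

```python
import numpy as np


def simulate_sequences(n, seed=0):
    """Draw a hypothetical two-period sample (all parameter values are assumptions)."""
    rng = np.random.default_rng(seed)
    x1 = (rng.random(n) < 0.5).astype(int)
    s1 = (rng.random(n) < 0.3 + 0.4 * x1).astype(int)               # Pr(S1 = 1 | X1)
    y1 = (rng.random(n) < 0.2 + 0.1 * x1 + 0.1 * s1).astype(int)    # exit in period 1
    x2 = (rng.random(n) < 0.4 + 0.2 * x1 - 0.1 * s1).astype(int)    # covariate X2
    s2 = (rng.random(n) < 0.2 + 0.3 * x2 + 0.2 * s1).astype(int)    # Pr(S2 = 1 | X2, S1)
    exit2 = rng.random(n) < 0.1 + 0.2 * x2 + 0.1 * s1 + 0.15 * s2
    y2 = ((y1 == 0) & exit2).astype(int)                            # exit in period 2
    return x1, s1, y1, x2, s2, y2


def prop1(x1, a1):
    """True Pr(S1 = a1 | X1), used here in place of an estimated propensity score."""
    p = 0.3 + 0.4 * x1
    return np.where(a1 == 1, p, 1.0 - p)


def prop2(x2, a1, a2):
    """True Pr(S2 = a2 | X2, S1 = a1, Y1 = 0)."""
    p = 0.2 + 0.3 * x2 + 0.2 * a1
    return np.where(a2 == 1, p, 1.0 - p)


def dsipw_survival(x1, s1, y1, x2, s2, y2, seq):
    """Product of one minus weighted hazards for the treatment sequence seq = (a1, a2)."""
    a1, a2 = seq
    w1 = (s1 == a1) / prop1(x1, a1)
    w2 = (y1 == 0) * (s1 == a1) * (s2 == a2) / (prop1(x1, a1) * prop2(x2, a1, a2))
    hazard1 = np.sum(w1 * y1) / np.sum(w1)
    hazard2 = np.sum(w2 * y2) / np.sum(w2)
    return (1.0 - hazard1) * (1.0 - hazard2)


def true_survival(seq):
    """Counterfactual survival probability implied by the simulation design."""
    a1, a2 = seq
    total = 0.0
    for x1 in (0, 1):
        surv1 = 1.0 - (0.2 + 0.1 * x1 + 0.1 * a1)
        px2 = 0.4 + 0.2 * x1 - 0.1 * a1
        for x2, w in ((1, px2), (0, 1.0 - px2)):
            total += 0.5 * surv1 * w * (1.0 - (0.1 + 0.2 * x2 + 0.1 * a1 + 0.15 * a2))
    return total


if __name__ == "__main__":
    data = simulate_sequences(200_000, seed=2)
    for seq in ((1, 1), (0, 0)):
        print(seq, dsipw_survival(*data, seq), true_survival(seq))
```

The difference between the estimates for two sequences gives the corresponding ATE in this toy setting. In applied work the assignment probabilities would have to be estimated, which this sketch deliberately leaves out.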
