- Email: [email protected]


Filtering hidden semi-Markov chains

Robert Elliott a,b,∗, Nikolaos Limnios c, Anatoliy Swishchuk b

a University of Adelaide, Australia
b University of Calgary, Calgary, Canada
c Université de Technologie de Compiègne, Compiègne, France

Article history: Received 17 March 2013; Received in revised form 9 May 2013; Accepted 9 May 2013; Available online 21 May 2013.

Abstract. In this paper, we consider hidden semi-Markov chain filters having possible applications in areas such as genomics, statistical studies of earthquakes, and reliability.

Keywords: Semi-Markov chain; Reference probability; Modified Viterbi filters; Genomics

1. Introduction

A semi-Markov chain, or Markov renewal chain, is a random process for which the length of time spent in any state is a random variable depending on the current state and the arriving state. Its distribution can be any distribution on N, the set of natural numbers, including the Dirac measure, i.e., a constant duration. This is a generalization of both Markov chains and renewal processes (see, e.g., Anselone (1960) and Howard (1971)). A related model, the hidden semi-Markov model (HSMM), has been studied by Barbu and Limnios (2008), Yu (2010), and Trevezas and Limnios (2009). It is an extension of hidden Markov models (HMMs); see Elliott (1994) and Elliott et al. (1995). These processes have found applications in several areas: DNA analysis (Barbu and Limnios, 2008; Mode and Pickens, 1998; Trevezas and Limnios, 2009), statistical analysis of earthquakes (Rotondi and Varini, 2003), finance and insurance (Bulla and Bulla, 2006; Stenberg, 2007), reliability (Barbu and Limnios, 2008), and rainfall data (Sansom and Thomson, 2001).

In this paper, we consider a discrete-time HSMM related to a finite state semi-Markov chain X. We suppose the chain X is not observed directly but only through a second, noisy, finite state process Y. Using a reference probability we derive a recursive filter. To obtain a tractable recursion we use modified Viterbi maximum a posteriori estimates. To our knowledge the results presented here are new.

The paper is organized as follows. Section 2 presents the semi-Markov chain definitions, the backward recurrence chain, and martingale representations. Section 3 introduces the observed process Y and its semimartingale representation. Section 4 gives a measure change under which the process Y becomes an i.i.d. random sequence. Finally, Section 5 derives filters for the estimation of a hidden semi-Markov chain given observations of Y.

∗ Corresponding author at: University of Calgary, Calgary, Canada. Tel.: +1 403 220 3819.
E-mail addresses: [email protected] (R. Elliott), [email protected] (N. Limnios).

http://dx.doi.org/10.1016/j.spl.2013.05.007


R. Elliott et al. / Statistics and Probability Letters 83 (2013) 2007–2014

2. Semi-Markov chain and its representation

To define a semi-Markov chain we first define a Markov renewal process. The process has a finite state space which can be identified with the set of unit vectors E = {e_1, e_2, . . . , e_N}, where e_i = (0, . . . , 0, 1, 0, . . . , 0)′ ∈ R^N and N is the number of elements in the state space. Time is discrete, so t ∈ N := {0, 1, 2, . . .}. Our processes are defined on the probability space (Ω, F, P). Suppose T_0 = 0 and T_i ∈ N \ {0}, for i ∈ N \ {0}, is an increasing sequence of random times. A marked point process with values in E is a sequence {(T_n, Z_n), n ∈ N} with Z_n ∈ E.

Definition 1. A discrete time Markov renewal process with values in E is a marked point process {(T_n, Z_n)} such that

P(Z_{n+1} = e_j, T_{n+1} − T_n = k | Z_n = e_i, Z_0, . . . , Z_n, T_0, . . . , T_n) = P(Z_{n+1} = e_j, T_{n+1} − T_n = k | Z_n = e_i)

=: q_{ij}(k).

Definition 2. A discrete time semi-Markov process {X_t, t ≥ 0} is the process such that

X_t = Z_n, if T_n ≤ t < T_{n+1}.

For 1 ≤ i ≤ N, define the sojourn time distribution in state e_i,

h_i(k) := P(T_{n+1} − T_n = k | Z_n = e_i) = Σ_{j=1}^{N} q_{ij}(k),   (1)

and the corresponding cumulative distribution

H_i(k) := P(T_{n+1} − T_n ≤ k | Z_n = e_i) = Σ_{l=1}^{k} h_i(l).   (2)

Let H̄_i(k) := 1 − H_i(k).

Lemma 1. The semi-Markov kernel has the following representation:

q_{ij}(k) = P(Z_{n+1} = e_j, T_{n+1} − T_n = k | Z_n = e_i)
= P(Z_{n+1} = e_j | Z_n = e_i) P(T_{n+1} − T_n = k | Z_n = e_i, Z_{n+1} = e_j)
= P(i, j) f_{ij}(k).

The matrix P := (P(i, j)) is the transition kernel of the embedded Markov chain (EMC) (Z_n), and the f_{ij}(k) are the conditional transition distributions.

Backward recurrence time process. Suppose at time t ∈ N the chain is in state e_i and has been in this state for exactly k time steps; we then write U_t = k. That is, U_t = k if and only if X_t = e_i, X_{t−1} = e_i, . . . , X_{t−k+1} = e_i, X_{t−k} ≠ e_i. Clearly, as claimed, U_t counts the number of time steps since the last jump, that is, back to the most recent X_{t−h} for which ⟨X_t, X_{t−h}⟩ = 0. If T_n ≤ t < T_{n+1}, we have U_t = t − T_n + 1, so U_t takes values in N \ {0} and U_0 = 1.

Lemma 2. The following recurrence relations hold:

U_1 = 1 + ⟨X_1, X_0⟩,
U_2 = 1 + ⟨X_2, X_1⟩ + ⟨X_2, X_1⟩⟨X_1, X_0⟩ = 1 + ⟨X_2, X_1⟩U_1,

and in general U_t = 1 + ⟨X_t, X_{t−1}⟩U_{t−1}.

The last equation also shows that the pair (X_t, U_t) is a Markov chain. The formula

U_t = 1 + Σ_{k=1}^{t} Π_{j=0}^{k−1} ⟨X_{t−j}, X_{t−j−1}⟩

shows that, for any t ∈ N, U_t is F_t^X-measurable, where F_t^X := σ{X_0, X_1, . . . , X_t} ⊂ F.
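The recurrence of Lemma 2 can be checked numerically. The following sketch, with hypothetical jump times T_n and embedded states Z_n (all values illustrative, not from the paper), builds U_t both directly as t − T_n + 1 and via U_t = 1 + ⟨X_t, X_{t−1}⟩U_{t−1}:

```python
# Hypothetical jump times T_n and embedded states Z_n (as indices, not unit vectors).
T = [0, 3, 5, 9]
Z = [0, 1, 0, 1]

horizon = 9
X, U_direct = [], []
for t in range(horizon):
    n = max(k for k in range(len(T)) if T[k] <= t)   # index n with T_n <= t < T_{n+1}
    X.append(Z[n])
    U_direct.append(t - T[n] + 1)                    # U_t = t - T_n + 1

# Recurrence of Lemma 2: U_t = 1 + <X_t, X_{t-1}> U_{t-1}, with U_0 = 1.
U_rec = [1]
for t in range(1, horizon):
    same = 1 if X[t] == X[t - 1] else 0              # <X_t, X_{t-1}> for unit vectors
    U_rec.append(1 + same * U_rec[-1])

print(U_direct)
print(U_rec)
```

The two constructions agree, illustrating that U_t is recoverable from the path of X alone.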


Lemma 3 (Chryssaphinou et al. (2008) and Barbu and Limnios (2008, Proposition 6.1, p. 137)). The transition probabilities of the chain (X_t, U_t) are given by

P(X_{t+1} = e_j, U_{t+1} = ℓ | X_t = e_i, U_t = k)
  = q_{ij}(k+1)/H̄_i(k),   if e_i ≠ e_j, ℓ = 1,
  = H̄_i(k+1)/H̄_i(k),    if e_i = e_j, ℓ = k + 1,
  = 0,                    elsewhere.   (3)
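The transition law (3) can be assembled from any semi-Markov kernel. The sketch below is a hypothetical two-state example in which q_{ij}(k) = P(i, j) h_i(k) with truncated geometric sojourn laws (all parameters and names illustrative); it builds one row of the transition law of (X, U) and checks that it sums to one:

```python
import numpy as np

N, k_max = 2, 20
P_emb = np.array([[0.0, 1.0],
                  [1.0, 0.0]])                 # embedded chain, no self-transitions

h = np.zeros((N, k_max + 1))                   # h[i, k] = h_i(k), k = 1..k_max
for i in range(N):
    p = 0.4 + 0.2 * i                          # hypothetical geometric parameters
    h[i, 1:] = p * (1 - p) ** np.arange(k_max)
    h[i] /= h[i].sum()                         # renormalise after truncation

q = np.einsum('ij,ik->ijk', P_emb, h)          # q[i, j, k] = q_ij(k)
H_bar = 1.0 - np.cumsum(h, axis=1)             # H_bar[i, k] = 1 - H_i(k)

def row_of_P_tilde(i, k):
    """Transition law of (X, U) from (e_i, k), per Eq. (3)."""
    out = {(j, 1): q[i, j, k + 1] / H_bar[i, k] for j in range(N) if j != i}
    out[(i, k + 1)] = H_bar[i, k + 1] / H_bar[i, k]   # stay and age by one step
    return out

row = row_of_P_tilde(0, 3)
print(row)                                     # probabilities over (state, duration)
```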

Denote by P̃ this transition matrix. If X_t = e_i and P(U_t = k | X_t = e_i) > 0, it is clear from the above transition probabilities that

P(X_{t+1} ≠ e_i | X_t = e_i, U_t = k) = h_i(k)/H̄_i(k).   (4)

We shall write

h(k)/H̄(k) = (h_1(k)/H̄_1(k), . . . , h_N(k)/H̄_N(k)),
H̄(k+1)/H̄(k) = (H̄_1(k+1)/H̄_1(k), . . . , H̄_N(k+1)/H̄_N(k)).   (5)

Define

B_t(X_t) = ⟨X_t, h(U_t)/H̄(U_t)⟩ + ⟨X_t, H̄(U_t + 1)/H̄(U_t)⟩.

Theorem 4. The semi-Markov chain has the following semimartingale representation:

X_{t+1} = B_t(X_t)X_t + M_{t+1} ∈ R^N, t ∈ N,   (6)

where M_{t+1} is a vector martingale increment: E[M_{t+1} | X_0, X_1, . . . , X_t] = 0 ∈ R^N.

Proof. If U_t = k and j ≠ i, then a jump occurs at time t + 1. Using Lemma 3 and (4), we have

E[X_{t+1} | F_t^X] = E[X_{t+1} I(next jump is at t + 1) + X_{t+1} I(next jump is after t + 1) | F_t^X]
= X_t ⟨X_t, h(U_t)/H̄(U_t)⟩ + X_t ⟨X_t, H̄(U_t + 1)/H̄(U_t)⟩
= B_t(X_t)X_t ∈ R^N.   (7)

Therefore, E[M_{t+1} | F_t^X] = E[X_{t+1} − B_t(X_t)X_t | F_t^X] = 0 ∈ R^N. Here, B_t(X_t) is F_t^X-measurable.

Remark 1. Suppose X_t, t ∈ N, is a finite state time homogeneous Markov chain with state space {e_1, . . . , e_N} and p_{ij} = P(X_{t+1} = e_j | X_t = e_i), P = (p_{ij}). Then X_{t+1} = PX_t + M_{t+1}, where E[M_{t+1} | X_t] = 0 ∈ R^N. This is the semimartingale representation of a Markov chain; see Elliott et al. (1995).

The recursion (6), while exact, becomes cumbersome, so in the filters below we substitute MAP (maximum a posteriori) estimates for B_t(X_t) at each time step.

3. Observations

We suppose the process X is not observed directly. Rather, there is another finite state process Y which is a noisy observation of X. Suppose the finite state space of Y is identified with the unit vectors {f_1, f_2, . . . , f_M} of R^M, where f_j = (0, . . . , 0, 1, 0, . . . , 0)′ ∈ R^M. We can have M > N, M < N or M = N. Suppose

c_{ji} = P(Y_k = f_j | X_k = e_i) ≥ 0, with Σ_{j=1}^{M} c_{ji} = 1.   (8)

Write C = (c_{ji}), 1 ≤ i ≤ N, 1 ≤ j ≤ M, and F_k^Y := σ{Y_l, 0 ≤ l ≤ k}.
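An observation matrix C as in (8) can be simulated directly. The sketch below uses a hypothetical 3 × 2 matrix (M = 3 observation letters, N = 2 hidden states) whose columns are the conditional laws of Y_t; the numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 2, 3
C = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])          # C[j, i] = c_ji = P(Y = f_j | X = e_i)

def observe(i):
    """Sample the index j of Y_t = f_j given X_t = e_i."""
    return int(rng.choice(M, p=C[:, i]))

samples = [observe(0) for _ in range(1000)]
print(C.sum(axis=0))                # each column sums to 1, as required by (8)
print(samples[:10])
```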


Remark 2. Note that always

Σ_{i=1}^{N} ⟨X_t, e_i⟩ = 1,   (9)

and

Σ_{i=1}^{M} ⟨Y_t, f_i⟩ = 1.   (10)

We often multiply by 1 (= Σ_{i=1}^{N} ⟨X_t, e_i⟩).

Lemma 5. The chain Y has the following representation:

Y_t = CX_t + L_t ∈ R^M,   (11)

where E[L_t | F_t^X] = 0 ∈ R^M.

Proof. By definition,

E[L_t | F_t^X] = E[Y_t − CX_t | F_t^X] = E[Y_t − CX_t | X_t] = E[Y_t | X_t] − CX_t
= Σ_{j=1}^{M} Σ_{i=1}^{N} E[f_j ⟨Y_t, f_j⟩⟨X_t, e_i⟩ | X_t] − CX_t
= Σ_{j=1}^{M} Σ_{i=1}^{N} ⟨X_t, e_i⟩ c_{ji} f_j − CX_t = 0 ∈ R^M.

Remark 3. From Eq. (11) it follows that L_t models the 'error' Y_t − CX_t. Also, CX_t is the best estimate of Y_t given X_t, i.e., E[Y_t | X_t] = CX_t = E[Y_t | F_t^X].

4. Reference probability

The above dynamics of X and Y are under the 'real world' probability P. Suppose we have another probability P̄ under which X has the same dynamics as under P, but under which Y is a process independent of X, consisting of i.i.d. random variables uniformly distributed at each time t, so that

P̄(Y_t = f_j | F_t^X ∨ F_{t−1}^Y) = P̄(Y_t = f_j) = 1/M.   (12)

Remark 4. We consider here finite state processes as we have in mind applications to genomics. If the state space of Y were countable we could take P̄(Y_t = f_i) = 1/2^i, or any other probability. If the state space were R we could, for example, take Y_t to be normally distributed at each time t.

Definition 3. For t = 0, 1, 2, . . . write

λ_t = M Σ_{j=1}^{M} ⟨CX_t, f_j⟩⟨Y_t, f_j⟩   (13)

and

Λ_t := Π_{k=0}^{t} λ_k.   (14)

The real world probability P can then be defined in terms of P̄ by setting

dP/dP̄ |_{F_t^X ∨ F_t^Y} = Λ_t.   (15)
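Since X_t and Y_t are unit vectors, λ_t in (13) collapses to M c_{ji} when X_t = e_i and Y_t = f_j, and Λ_t in (14) is a running product of such factors. A small sketch with a hypothetical observation matrix (values illustrative only):

```python
import numpy as np

M = 3
C = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])          # hypothetical observation matrix

def lam(i, j):
    """lambda_t = M <C X_t, Y_t> = M c_ji when X_t = e_i, Y_t = f_j."""
    return M * C[j, i]

# Under P-bar each Y_t is uniform over the M letters, so
# E-bar[lambda_t | X_t = e_i] = sum_j (1/M) M c_ji = sum_j c_ji = 1, as in (16) below.
means = [sum(lam(i, j) / M for j in range(M)) for i in range(2)]
print(means)

# Lambda_t is the running product of the lambda factors along a path.
path = [(0, 0), (0, 2), (1, 1)]     # hypothetical (state, observation) index pairs
Lambda = 1.0
for i, j in path:
    Lambda *= lam(i, j)
print(Lambda)
```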


Lemma 6. Under P the dynamics of X are unchanged, so X_{t+1} = B_t(X_t)X_t + M_{t+1}, and P(Y_t = f_j | X_t = e_i) = c_{ji}. That is, under P, Y_t = CX_t + L_t, where E[L_t | F_t^X] = 0 ∈ R^M.

Proof. Let Ē denote the expectation under P̄. Consider first

Ē[λ_t | F_t^X ∨ F_{t−1}^Y] = M Σ_{j=1}^{M} Ē[⟨CX_t, f_j⟩⟨Y_t, f_j⟩ | X_t]
= M Σ_{j=1}^{M} ⟨CX_t, f_j⟩ Ē[⟨Y_t, f_j⟩ | X_t]
= Σ_{j=1}^{M} Σ_{i=1}^{N} c_{ji} ⟨X_t, e_i⟩ = 1.   (16)

From Bayes' Theorem (see Elliott et al. (1995)) we obtain

E[⟨Y_t, f_j⟩ | F_t^X ∨ F_{t−1}^Y] = Ē[Λ_t ⟨Y_t, f_j⟩ | F_t^X ∨ F_{t−1}^Y] / Ē[Λ_t | F_t^X ∨ F_{t−1}^Y]
= Λ_{t−1} Ē[λ_t ⟨Y_t, f_j⟩ | F_t^X ∨ F_{t−1}^Y] / (Λ_{t−1} Ē[λ_t | F_t^X ∨ F_{t−1}^Y])
= M Σ_{r=1}^{M} Ē[⟨CX_t, f_r⟩⟨Y_t, f_r⟩⟨Y_t, f_j⟩ | F_t^X ∨ F_{t−1}^Y]
= M Ē[⟨CX_t, f_j⟩⟨Y_t, f_j⟩ | F_t^X ∨ F_{t−1}^Y] = ⟨CX_t, f_j⟩.   (17)

Therefore, if X_t = e_i,

P(Y_t = f_j | X_t = e_i) = E[⟨Y_t, f_j⟩ | X_t = e_i] = c_{ji}.   (18)

Remark 5. Under P̄ the Y_t are i.i.d. random variables. Under P, X_{t+1} = B_t(X_t)X_t + M_{t+1} and Y_t = CX_t + L_t. That is, under P the dynamics are the 'real world' dynamics.

5. Filters

We wish to estimate X_t given the observations Y_0, Y_1, . . . , Y_t. That is, we wish to calculate E[X_t | F_t^Y] ∈ R^N. This is the conditional distribution of X_t given F_t^Y and so is the optimal mean square estimate. By Bayes' rule again,

E[X_t | F_t^Y] = Ē[Λ_t X_t | F_t^Y] / Ē[Λ_t | F_t^Y] ∈ R^N.   (19)

The numerator is an unnormalized conditional distribution of X_t given F_t^Y. Write

q_t := Ē[Λ_t X_t | F_t^Y].   (20)


Also note that, with 1 := (1, 1, . . . , 1)′ ∈ R^N,

⟨q_t, 1⟩ = Ē[Λ_t ⟨X_t, 1⟩ | F_t^Y] = Ē[Λ_t | F_t^Y].

Recall that under P̄ the Y_t are all uniformly distributed and i.i.d., so P̄ is a nicer measure with which to work.

At t = 0:

q_0 = Ē[Λ_0 X_0 | F_0^Y].   (21)

As Λ_0 = λ_0, this is equal to

q_0 = Ē[M Σ_{j=1}^{M} ⟨CX_0, f_j⟩⟨Y_0, f_j⟩ X_0 | F_0^Y].   (22)

Suppose we know the initial distribution p_0 of X_0 ∈ R^N:

p_0 := E[X_0] = Ē[X_0].   (23)

Then, given Y_0,

q_0 = M Σ_{j=1}^{M} Σ_{i=1}^{N} Ē[⟨X_0, e_i⟩ | F_0^Y] ⟨Y_0, f_j⟩ c_{ji} e_i
= M Σ_{i=1}^{N} (Σ_{j=1}^{M} c_{ji} ⟨Y_0, f_j⟩) p_0(i) e_i,   (24)

where p_0(i) = ⟨p_0, e_i⟩. The normalized estimated distribution of X_0 given Y_0 is then

p*_0 = q_0 / ⟨q_0, 1⟩,   (25)

where 1 = (1, 1, . . . , 1)′ ∈ R^N. The MAP (maximum a posteriori) estimate is then

e*_0 = arg max_{e_j} ⟨p*_0, e_j⟩ = arg max_{e_j} ⟨q_0, e_j⟩.   (26)

That is, e*_0 is the state corresponding to the highest probability; in case of equality an arbitrary choice is made.

At t = 1: Consider now q_1 = Ē[Λ_1 X_1 | F_1^Y] ∈ R^N. Then

q_1 = M Σ_{j=1}^{M} Ē[Λ_0 ⟨CX_1, f_j⟩ X_1 | F_1^Y] ⟨Y_1, f_j⟩
= M Σ_{j=1}^{M} Σ_{i=1}^{N} Ē[Λ_0 ⟨X_1, e_i⟩ | F_1^Y] ⟨Y_1, f_j⟩ c_{ji} e_i
= M Σ_{j=1}^{M} Σ_{i=1}^{N} Ē[Λ_0 ⟨B_0(X_0)X_0, e_i⟩ | F_1^Y] ⟨Y_1, f_j⟩ c_{ji} e_i.   (27)

At this stage we substitute the MAP estimate e*_0 for X_0 in B_0, so the recursion becomes

q_1 ≈ q*_1 = M Σ_{j=1}^{M} Σ_{i=1}^{N} Ē[Λ_0 ⟨B_0(e*_0)X_0, e_i⟩ | F_1^Y] ⟨Y_1, f_j⟩ c_{ji} e_i
= M diag(Σ_{j=1}^{M} c_{ji} ⟨Y_1, f_j⟩) B_0(e*_0) q_0.   (28)

Here, for a vector α = (α_1, α_2, . . . , α_N) ∈ R^N, diag(α) is the matrix with α on the diagonal. The approximate normalized distribution of X_1, given F_1^Y, is then

p*_1 = q*_1 / ⟨q*_1, 1⟩ ∈ R^N.   (29)


The MAP estimate of X_1 given F_1^Y is then

e*_1 = arg max_{e_i} ⟨p*_1, e_i⟩ = arg max_{e_i} ⟨q*_1, e_i⟩.   (30)

Consider now the general case t = k + 1. Suppose q_0, q*_1, . . . , q*_k and e*_0, e*_1, . . . , e*_k have been determined. Then

q_{k+1} = Ē[Λ_{k+1} X_{k+1} | F_{k+1}^Y]
= M Σ_{j=1}^{M} Ē[Λ_k ⟨CX_{k+1}, f_j⟩ X_{k+1} | F_{k+1}^Y] ⟨Y_{k+1}, f_j⟩
= M Σ_{j=1}^{M} Σ_{i=1}^{N} Ē[Λ_k ⟨X_{k+1}, e_i⟩ | F_k^Y] ⟨Y_{k+1}, f_j⟩ c_{ji} e_i
= M Σ_{j=1}^{M} Σ_{i=1}^{N} Ē[Λ_k ⟨B_k(X_k)X_k, e_i⟩ | F_k^Y] ⟨Y_{k+1}, f_j⟩ c_{ji} e_i.   (31)

Again, we replace B_k(X_k) by B_k(e*_0, e*_1, . . . , e*_k), so

q_{k+1} ≃ q*_{k+1} = M Σ_{j=1}^{M} Σ_{i=1}^{N} Ē[⟨B_k(e*_0, . . . , e*_k) Λ_k X_k, e_i⟩ | F_k^Y] ⟨Y_{k+1}, f_j⟩ c_{ji} e_i
= M Σ_{j=1}^{M} Σ_{i=1}^{N} ⟨B_k(e*_0, . . . , e*_k) q_k, e_i⟩ ⟨Y_{k+1}, f_j⟩ c_{ji} e_i
= M diag(Σ_{j=1}^{M} c_{ji} ⟨Y_{k+1}, f_j⟩) B_k(e*_0, . . . , e*_k) q_k.   (32)

The normalized conditional estimate is then

p*_{k+1} = q*_{k+1} / ⟨q*_{k+1}, 1⟩.   (33)

Again, take e*_{k+1} = arg max_{e_i} ⟨p*_{k+1}, e_i⟩ = arg max_{e_i} ⟨q*_{k+1}, e_i⟩. This enables us to define B_{k+1}(e*_0, e*_1, . . . , e*_{k+1}) at the next step.
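The recursion of Eqs. (24), (32), (33) can be sketched in code. The following is a minimal illustration under simplifying assumptions of our own: a hypothetical two-state model whose sojourn laws depend only on the current state, and a plug-in matrix B_k built from the current MAP state via a duration counter maintained along the MAP path (a bookkeeping device for this sketch, not spelled out in the paper). All parameters are illustrative.

```python
import numpy as np

N, M, k_max = 2, 3, 30
P_emb = np.array([[0.0, 1.0],
                  [1.0, 0.0]])                     # embedded chain
h = np.zeros((N, k_max + 1))                       # sojourn laws h_i(k)
for i in range(N):
    p = 0.4 + 0.2 * i
    h[i, 1:] = p * (1 - p) ** np.arange(k_max)
    h[i] /= h[i].sum()
H_bar = np.clip(1.0 - np.cumsum(h, axis=1), 1e-12, None)
C = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])                         # observation matrix as in (8)

def B_plugin(u):
    """Plug-in one-step transition matrix given the MAP duration u:
    jump with prob h_i(u+1)/H_bar_i(u), spread via the embedded chain,
    stay with prob H_bar_i(u+1)/H_bar_i(u); cf. Lemma 3."""
    jump = h[:, u + 1] / H_bar[:, u]
    stay = H_bar[:, u + 1] / H_bar[:, u]
    return (P_emb * jump[:, None]).T + np.diag(stay)

ys = [0, 0, 2, 2, 1]                               # hypothetical observed indices of Y_t
p0 = np.array([0.5, 0.5])

q = M * C[ys[0], :] * p0                           # initial step, Eq. (24)
q /= q.sum()                                       # normalise, Eq. (25)
e_star, u = int(np.argmax(q)), 1                   # MAP estimate, Eq. (26)

for y in ys[1:]:
    q = M * C[y, :] * (B_plugin(u) @ q)            # Eq. (32): diag factor times B q
    q /= q.sum()                                   # Eq. (33)
    e_new = int(np.argmax(q))                      # MAP estimate
    u = u + 1 if e_new == e_star else 1            # duration along the MAP path
    e_star = e_new

print(q, e_star)
```

The normalisation at each step only rescales q*_k, so the arg max in (33) is unaffected; it is done here for numerical stability.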

Remark 6. From Eq. (32) above we have, componentwise,

q*_{k+1}(i) = M (Σ_{j=1}^{M} c_{ji} ⟨Y_{k+1}, f_j⟩) Σ_{r=1}^{N} B_k(i, r) q_k(r).   (34)

Closer in spirit to the Viterbi algorithm would be to replace the second sum here by a maximum:

q̄_{k+1}(i) = M (Σ_{j=1}^{M} c_{ji} ⟨Y_{k+1}, f_j⟩) max_r B_k(i, r) q_k(r).   (35)
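The difference between (34) and (35) is just a sum versus a maximum over the previous state r. With hypothetical numbers (an illustrative plug-in matrix B_k, filter vector q_k, and the observation matrix used in the sketches above):

```python
import numpy as np

M = 3
C = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])
Bk = np.array([[0.3, 0.7],
               [0.7, 0.3]])         # hypothetical plug-in B_k(i, r)
qk = np.array([0.6, 0.4])           # hypothetical unnormalised filter q_k
y = 2                               # observed Y_{k+1} = f_2

q_sum = M * C[y, :] * (Bk @ qk)                 # Eq. (34): sum over r
q_max = M * C[y, :] * (Bk * qk).max(axis=1)     # Eq. (35): max over r
print(q_sum, q_max)
```

In this example both variants select the same MAP state, though in general they can differ.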

This change could also be made in the recursion for q_1.

Acknowledgments

The authors wish to thank NSERC for continuing support. The first author wishes to thank the ARC for support.

References

Anselone, P., 1960. Ergodic theory for discrete semi-Markov chains. Duke Math. J. 27, 33–40.
Barbu, V., Limnios, N., 2008. Semi-Markov Chains and Hidden Semi-Markov Models, Toward Applications: their Use in Reliability and DNA Analysis. In: Lecture Notes in Statistics, vol. 191. Springer, New York.
Bulla, J., Bulla, I., 2006. Stylized facts of financial time series and hidden semi-Markov models. Comput. Statist. Data Anal. 51 (4), 2192–2209.
Chryssaphinou, O., Karaliopoulou, M., Limnios, N., 2008. On discrete-time semi-Markov chains and applications in words occurrences. Comm. Statist. Theory Methods 37 (8), 1306–1322.
Elliott, R., 1994. Exact adaptive filters for Markov chains observed in Gaussian noise. Automatica 30 (9), 1399–1408.


Elliott, R., Aggoun, L., Moore, J., 1995. Hidden Markov Models: Estimation and Control. Springer.
Howard, R., 1971. Dynamic Probabilistic Systems, Vol. II. Wiley, New York.
Mode, C., Pickens, G., 1998. Computational methods for renewal theory and semi-Markov processes with illustrative examples. Amer. Statist. 42 (2), 143–152.
Rotondi, R., Varini, E., 2003. Bayesian analysis of a marked point process: application in seismic hazard assessment. Statist. Method. Appl. 12, 79–92.
Sansom, J., Thomson, P., 2001. Fitting hidden semi-Markov models to breakpoint rainfall data. J. Appl. Probab. 38 (A), 142–157.
Stenberg, F., 2007. Semi-Markov models for insurance and option rewards. Ph.D. Thesis. Institutionen för Matematik och Fysik, Mälardalen University, Sweden.
Trevezas, S., Limnios, N., 2009. Maximum likelihood estimation for general hidden semi-Markov processes with backward recurrence time dependence. J. Math. Sci. 163 (3), 262–274.
Yu, S.-Z., 2010. Hidden semi-Markov models. Artif. Intell. 174, 215–243.