- Email: [email protected]

Journal Pre-proof

Effective Privacy Preserving Data Publishing by Vectorization Chris Soo-Hyun Eom, Charles Cheolgi Lee, Wookey Lee, Carson K. Leung PII: DOI: Reference:

S0020-0255(19)30889-8 https://doi.org/10.1016/j.ins.2019.09.035 INS 14866

To appear in:

Information Sciences

Received date: Revised date: Accepted date:

2 February 2019 2 July 2019 21 September 2019

Please cite this article as: Chris Soo-Hyun Eom, Charles Cheolgi Lee, Wookey Lee, Carson K. Leung, Effective Privacy Preserving Data Publishing by Vectorization, Information Sciences (2019), doi: https://doi.org/10.1016/j.ins.2019.09.035

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier Inc.

Highlights • We suggested a privacy-preserving data-publishing model to balance data utility and privacy preservation. • The model is based on surrogate vectors. • The model is applicable on grid environments. • The model also protects the private location information of individuals. • The model satisfies ε-differential privacy.

1

Effective Privacy Preserving Data Publishing by Vectorization Chris Soo-Hyun Eoma , Charles Cheolgi Leea , Wookey Leea,∗, Carson K. Leungb,∗ a Inha

University, South Korea of Manitoba, Canada

b University

Abstract As smart devices and cloud services are rapidly expanding, a large amount of location information can easily be gathered. However, there is a conflict between collecting location data and protecting personal data since obtaining and utilizing the data may be restricted due to privacy concerns. Various methods for anonymity and on the original location data have been studied, but these methods have excessively reduced data utility while stressing highly on privacy preservation. In this research, we suggest a novel model to overcome this fundamental dilemma via a surrogate vector based on the grid environment. Compared to the existing approaches, our study shows a new theoretical advancement in privacy protection, and outstanding performance with respect to time complexity and data utility has been achieved. Keywords: Location-based services (LBS ), K-anonymity, Trajectory database, Surrogate vector

1

2 3 4 5 6 7 8 9 10 11

1. Introduction A generic controversy exists over anonymity and data utility in location-based services (LBS). Location data about individuals–which is growing exponentially with the increasing number of smart devices–can extensively be collected, stored, analyzed, and utilized. However, gathering and distributing this data might be legally prohibited due to widely-held privacy concerns [22]. If anonymity could be guaranteed, ground-breaking advances could be achieved, which may bring various LBS business models in terms of data analysis, services, and mashups. This is an essential task that must definitely be settled since a huge amount of data related to personal information is inevitably incorporated into the big data trend. ∗ Corresponding

authors: Wookey Lee, Carson K. Leung Email addresses: [email protected] (Wookey Lee), [email protected] (Carson K. Leung)

Preprint submitted to Elsevier

September 26, 2019

12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

37 38

39 40

41 42

43 44 45 46 47 48 49 50 51 52 53

Nowadays, enormous amount of location information will be produced by digital devices such as smartphones [31], smart watches [11], smart glasses [8, 37], navigators [19], traffic information systems [4, 14], closed-circuit cameras [15], Internet of Things (IoT) sensors and robots [36], medical devices, biomimetic microtaggants [13], drones and satellite [19], etc. Meanwhile, related to the location information, sub-sequence matching, time-series analysis, and trajectory pattern mining [17] have been employed, but they have a wellknown limitation of safety assurance. To overcome the limitation, some studies have been conducted about the anonymity of location data such as K-anonymity [29], l-diversity [23], t-closeness [21], location semantics [18], differential privacy [6, 8]. They have been usually employed on the point-based location information, but they are not appropriate to the anonymize trajectory database. In addition, the researches have revealed a critical drawback in that they have greatly restricted data utility (e.g., 95% of data removal) [22]. Basically, the stronger to stress on securing anonymity, the smaller is the data utility. Without a fundamental breakthrough that solves this dilemma, it is inevitable that the negative effect of utility constraint will be continued, since personal information is the more important one to be secured (e.g., GDPR in EURO [24]). Furthermore, differential privacy for queries has been widely used to protect individual privacy. This returns a changed result by adding and removing Laplace noise to an original result for the queries to protect individual privacy included in database [4, 6]. We suggest an effective solution that guarantees not only secure database but also high data utility simultaneously. Our contributions of this article can be summarized as follows: • Extreme security on the privacy of the trajectory information by employing a novel vector model. • The fundamental dilemma between data utility and privacy can be solved by the efficient lower dimension spatial scheme. • Outstanding performance in time complexity can be achieved compared to the existing approaches. Regarding the concept of privacy, our model overwhelmingly outperforms existing methods in terms of both data utility and time complexity. Specifically, one of the most important strengths of our model is that privacy is secured fundamentally. In contrast to the point-based data (which must reveal a specific location), our model ensures that the data cannot be guessed until the anonymized trajectory is mapped into a real trajectory. The remainder of this article is organized as follows. Section 2 introduces our goal through a motivating example. Section 3 presents related works. Section 4 discusses the problem of anonymity. Section 5 explains our main method and algorithm. Section 6 validates our model by experiments. Section 7 draws conclusions.

3

54

55 56 57 58 59 60

2. Motivating Example Let us consider an example of moving object’s trajectory in Table 1, which shows the trajectories of five objects—namely, Obj 1 through Obj 5. The trajectory information can be represented as spatio-temporal data, which denote Cell ID with the timestamp. For example, the trajectory of Obj 4 reveals that it visited the cells c, h, m, h, and i corresponding to timestamps 1, 2, 3, 6, and 7, respectively, and denoted by c1, h2, m3, h6, and i7 as in Table 1. Table 1: An example trajectory database D ID Obj 1 Obj 2 Obj 3 Obj 4 Obj 5

61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89

a1 k2 l1 c1 c1

f2 p3 q2 h2 h2

Trajectory k3 p4 q4 r5 v3 w4 m3 h6 m3 r4

u5 m6 x5 i7 w5

p6 r6 r6

Disease Flu Cancer Fever Cancer Flu

Regarding the trajectory database D, there might be a privacy attack. When an attacker possesses a certain background knowledge on a target person, the privacy information can be attacked for malicious use through the trajectory. Consider the anonymity approach in more detail through Table 1. We refer to an attacker as an adversary, and the target person as a victim. Assume that two people named Tommy and Hector moved to certain locations in Table 1. For example, an adversary knows that Tommy visited a cell i at Time 7, denoted by hi7i. This background knowledge hi7i is a sequence whose length is 1 (L = 1). Obj 4 is the only record at the trajectory which has hi7i. So, the attacker can easily identify Obj 4 as Tommy and can attack his privacy (disease info: Cancer). As another example, assuming that the adversary knows that Hector visited the c1 and m3, this background knowledge hc1, m3i is a sequence (L = 2). In Table 1, the number of records including c1 and m3 is two. So, the adversary cannot accurately identify Hector. Now, we will explain this scenario by using the K-anonymity and background knowledge whose length is L. An anonymized database satisfies K-anonymity if and only if the number of records is at least K, and the records include every sequence whose length is at most L. In short, the larger the number of records that have the same data, the more guaranteed the anonymity. In Table 1 with K = 2 and L = 2, let us explain how to modify and publish data. In the existing methods, regarding the original point-based information, K-anonymity is simply enforced. For example, to satisfy the 2-anonymity, all data in Obj 1 and r4, w5, and r6 in Obj 5 should be removed. Moreover, all data in Obj 2 cannot satisfy 2-anonymity. So, they should be removed since they are vulnerable to the malicious privacy attack. According to the existing methods, most of the data in the database will be removed to satisfy 2-anonymity. Consequently, out of the total 28 data points, only 6 data points c1, h2, and m3 in both Obj 4 and Obj 5 will be remained, which means about 80% of the data will be removed. Hence, the data utility will be undermined

4

92

severely. On the other hand, our model can guarantee privacy without sacrificing data utility significantly by using the surrogate vectors. Specifically, we only need to remove 14% of the data. More details are in Section 5.

93

3. Related Works

94

3.1. Trajectory Privacy

90 91

95 96

97 98 99 100 101 102 103 104 105 106 107 108 109

110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131

The trajectory privacy-preserving model has been mainly classified into the following two distinct models: 1. Privacy-preserving data publishing (PPDP) model [9], in which data will be published without disclosing identity of the data subjects. This model allows researchers to conduct de-identification which removes the relationship between the data subjects and the identified data. Kanonymity [29] is relevant to de-identification so that it tries to protect privacy by employing the number of records having the same sequences in the trajectory database. 2. Privacy-preserving data mining (PPDM) model, in which data that can be used for machine learning or statistical processing are not published, but are released as a form of statistical summary or as results based on aggregation, calculation, etc. Differential privacy [6] prevents individual information disclosure based on a mathematical definition by adding noises. For the K-anonymity, several methods have been conducted such as suppression, generalization, clustering, obfuscation, etc. Gao et al. [10] guaranteed anonymity by employing trajectory similarity. Xin et al. [35] released the dynamic trajectory satisfying K-anonymity by using adaptive clustering. Poulis et al. [28] assumed that an adversary already knows about m locations, and they adapt k m -anonymity. Peng et al. [25] obfuscated trajectories by issuing fake queries and meet the LBS requirement through user collaborations. Zhou et al. [39] identified a neighborhood attack related to the neighbors of a target victim. Although the researches handle K-anonymity, there are some limitations in that they do not publish a trajectory database specifically, nor do they consider sequences related to the adversary’s background knowledge. With other perspectives, researchers have studied privacy models according to the attack types [9]. Examples include identity linkage attack, which assumes that an adversary confirms a target’s identity and specifies sensitive information by using a trajectory database and background knowledge [5]. Moreover, we also adopt the suppression method for our surrogate vector model. Some researches [2, 26] have exploited the suppression method. For instance, Terrovitis et al. [30] proposed a suppression method that reduces exposure possibility of the real trajectory through suppressing some trajectory segments iteratively. Chen et al. [5] suggested the (K, C)L -privacy model employing local suppression method for anonymity and data utility. Al-Hussaeni et al. [1] employed a global suppression model that causes less data utility than 5

139

local suppression to reduce high computational cost. Unlike the global suppression, despite the high computational cost, the local suppression guarantees high data utility. In summary, our model is very effective in overcoming this trade-off since it exhibits low time complexity and high data utility simultaneously. Note that we focus on significantly improving the methods (K, C)L -privacy and ITSA [1] preventing an identity linkage attack by using local suppression and global suppression, respectively.

140

4. Preliminaries

132 133 134 135 136 137 138

141 142 143 144

145 146 147 148

149 150 151 152 153 154 155 156 157

158 159

160 161 162

163 164 165 166 167 168 169

170 171

To protect personal information in the trajectory database D, the database has to be anonymized prior to publishing. We suggest a creative surrogate vector-based model, named Privacy Preserving Data Publishing by Vectorization (P4V ) with the following definitions, theorems, and examples. Definition 1 (KL -privacy). For integers K and L, a trajectory database D satisfies KL -privacy for any sequence q ⊂ D if and only if q satisfies 0 < |q| ≤ L and |D (q)| ≥ K where |q| is the length of q and |D (q) | is the frequency of q in D. Ghasemzadeh et al. [40] introduced LK-privacy model to achieve anonymity in a trajectory database. Their goal was preserving the information to support passenger flow analysis. KL -privacy was originated from LK-privacy, but two key differences between our work and theirs are as follows. First, they employed the point-based location data, but we employ the surrogate vector which can reduce the data dimension significantly. Second, they handled all candidate sequences by using Flowgraph when searching MVSs and removing MVSs, but we can effectively skip a lot of tasks for unnecessary candidate sequences through LFP-Tree. Definition 2 (Violating Sequence). A sequence q is a violating sequence if q satisfies 0 < |q| ≤ L and |D (q)| < K. An adversary attacks with background knowledge whose length ≤ L. Concerning the trajectory database D, a sequence that does not satisfy KL -privacy is a violating sequence, which needs to be deleted. Example 1. Given K = 2 and L = 2. A sequence q1 = hr4i of Obj 5 in Table 1 exists only at Obj 5. It is a violating sequence because it satisfies 0 < |q1 | ≤ 2 but |D (q1 )| < 2. Another sequence with length 2 and q2 = hr4, w5i is also a violating sequence due to the same reason. However, the sequence q3 = hc1i is not a violating sequence because it exists at Obj 5 as well as at Obj 4 so that it does not satisfy |D (q3 )| < 2. Similarly, the sequence q4 = hc1, m3i is not a violating sequence. Every V S has to be removed to make the anonymized database satisfying anonymity. A naive method is to remove all violating sequences. It is possible 6

172 173 174 175

176 177 178

179 180 181 182 183 184

185 186 187

188 189

but impractical due to many V Ss. Consequently, only minimum utility will be achieved. For example, there will be nothing left in Obj 5 if we remove the violating sequences hc1, r4i, hh2, m3, w5i, and hc1, m3, r4, r6i from Obj 5. To overcome this, the concept of minimal violating sequences (M V S) is defined [5]. Definition 3 (Minimal Violating Sequence (M V S) [5]). A violating sequence q is an M V S, if all subsequences of q are not VS. An M V S q with |q| = i is denoted as M V Si . Example 2. Given K = 2 and L = 2, a violating sequence q1 = hc1, r6i is an M V S because its subsequences hc1i and hr6i are not V Ss, and it satisfies 0 < |q1 | ≤ 2, |D (q1 )| < 2. However, a violating sequence q2 = hr4, w5i is not an M V S since its subsequences hr4i and hw5i are V Ss. Note that the violating sequence q3 = hr4i of Obj 5 in Table 1 is an M V S according to Definition 3. We call it M V S1 whose length is 1. Regarding trajectory database D, the number of M V S is much smaller than the number of violating sequences, and KL -privacy is satisfied when the trajectory database D does not include any M V S. Theorem 1. A trajectory database D satisfies KL -privacy if and only if any M V S is not in D.

194

Proof. Let us assume that D does not satisfy KL -privacy even though it does not include M V S. By Definition 1, D should include at least one violating sequence. Meanwhile, by Definition 3, a violating sequence should have an M V S or be an M V S itself. It is contradictory to the assumption that D does not include M V S. So, D must satisfy KL -privacy.

195

5. Anonymization Algorithm via Surrogate Vector and LFP-Tree

190 191 192 193

196 197 198 199

200 201 202 203 204 205 206 207

208 209 210 211

In this section, we suggest an innovative methodology that guarantees anonymity while preserving data utility. For this, we treat Figure 1(a) representing the trajectory database in Table 1 and Figure 1(b) describing surrogate vectors. Table 2 shows several terminologies for our model. 5.1. Surrogate Vector We apply a grid model to the relationship between two given points by using directions in two-dimensional space. Figure 1(b) shows the eight surrogate vectors, in which each vector denotes moving direction to an adjacent cell. For example, a vector N represents north, SE represents southeast, etc. This is the result of the preprocessing step, which maps trajectory into surrogate vectors. With this process, each location data can be transformed into surrogate vectors, and then the M V S-based method will be explained to satisfy K-anonymity. Definition 4 (Doublet for an Instance). Given c, v, and t, which means cell point, surrogate vector, and time, respectively, each doublet b = ct is included in the original database D, and bV = vt is included in the collection of vector database DV . Moreover, each cell point inheres in each vector. 7

Table 2: Terminologies. Notation DV |D(q)| |q| K L VN Φ [·] Φ−1 VS

Sequence q satisfying q 0 ∈ / V S, q 0 ⊆ q

MV S M V Si Si T LF P b, bV βj βj .b |βj .b| βj (K+) h βj (K−)i βj ./ βj0 212 213 214 215 216 217

218 219 220 221

222

223

224 225 226

227 228

Definition Surrogate vector trajectory database Frequency of the sequence q in D Length of the sequence q Threshold value for K-anonymity Length of background knowledge Set of surrogate vector in N dimension Surrogate vector mapping function Inverse function on the surrogate vector Sequence q where 0 ≤ |q| ≤ L, |D (q)| < K MVS q with |q| = i A Set of M V Si A Length based frequent pattern tree A doublet for an instance in D, DV jth branch including instances in T LF P An instance (node) b in βj Frequency of βj .b A set of βj .b with |βj .b| ≥ K A set of βj .b with |βj .b| < K

L

A combination between βj and βj0 with L

An instance is a doublet included in each object. For example, in Table 1, m3 is a doublet b, and each m3 in Obj 4 and Obj 5 is an instance. We address a function Φ[·] that converts the cell-based sequence to the vector-based sequence. In contrast, the inverse function Φ−1 [·] of Φ[·] converts the surrogate vectorbased sequence to the cell-based sequence, vice versa. Note that these two functions are also utilized to publish an anonymized database. Definition 5 (Surrogate vector and mapping). For V n = {v1 , v2 , . . . , vn } which is a set of surrogate vectors v in n dimension. There is a function Φ [·] mapping cells (b1 , b2 , . . . , bn ) in original database D into the surrogate vectors b∗1 , bV2 , . . . , bVn , and it is denoted as follows: Φ [b1 , b2 , . . . , bn ] −→ b∗1 , bV2 , . . . , bVn (1) where b∗1 is an encrypted instance, and bVi ∈ V N .

Example 3. V 8 = {N, N E, E, . . . , N W } is a set of the v in 8 dimensions. Definition 6 (Inverse Function, Φ−1 ). Through the inverse function of the Φ [·] and b∗1 which is encrypted and is an inherent cell in bV2 , original point-based sequence can be found as follows: Φ−1 b∗1 , bV2 , . . . , bVn −→ [b1 , b2 , . . . , bn ] (2)

Example 4. The original trajectory of Obj 1 can be gained by inverse function Φ−1 and the inherent cell point f of S2: 8

Figure 1: The trajectory for Table 1 and Surrogate Vector

229

230

Φ−1 [a1∗ , S2, S3, S4, S5, N 6] = [a1, f 2, k3, p4, u5, p6] . Lemma 1. Φ−1 [Φ [b1 , b2 , . . . , bn ]] = [b1 , b2 , . . . , bn ].

240

Consider a trajectory based on the surrogate vector resulted from mapping function for all cell point data except the encrypted initial point. Because all the cell points are mapped to the 8-dimensional vector, the high-dimensional data of the general cell point drops to 8 dimensions. So, we could expect that it helps to satisfy KL -privacy much better since low dimension causes an increase in the frequency of the sequences from an anonymity point of view. For an anonymized database, our model follows two steps. First, we search M V Ss disrupting the anonymity in the surrogate vector trajectory database DV . In the next step, we guarantee K-anonymity by checking each instance or doublet for the underlying procedure.

241

5.2. MVS Search Procedure Using LFP-Tree

231 232 233 234 235 236 237 238 239

242 243 244 245 246 247

Table 3 shows a surrogate vector database DV (encrypted initial points b∗i are hidden) converted from the original database D in Table 1. With an encrypted initial point, each trajectory in DV has surrogate vectors, and each vector has its inherent cell point. For example, the first vector S2 in Obj 1 has its inherent cell point f. Note that the points will be given via encrypted private keys to the users. Table 3: Surrogate Vector Trajectory Database D V ID Obj 1 Obj 2 Obj 3 Obj 4 Obj 5

Surrogate S2 S3 S3 E4 S2 S3 S2 S3 S2 S3

Vector Trajectory S4 S5 N6 E5 N6 E4 E5 NW 6 N6 E7 S4 S5 N6

9

Disease Flu Cancer Fever Cancer Flu

248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285

286 287 288 289 290 291 292

Previously, we mentioned that an adversary identifies personal information by background knowledge. For example, if an adversary who possesses background knowledge even for a surrogate vector, hE7i, from DV in Table 3, Obj 4 will clearly be identified and attacked. So, we should remove hE7i from Obj 4 since it is an M V S1 in DV . Now, we consider a mechanism to search M V Ss in the surrogate vector database in DV . Algorithm 1 is the M V S search procedure. In order to identify M V S1 , we could simply consider each doublet. Meanwhile, a process to search an M V S whose length is bigger than 1 is quite different from M V S1 . For example, in Table 3, each sequence hS2i and hE5i is not an M V S1 because each frequency of the sequences is not less than K = 2. However, a sequence hS2, E5i whose length is equal to 2, is an M V S since it exists only in Obj 3. In other words, even if the individual instance satisfies the privacy threshold, the pair of instances and a sequence of instances may violate the privacy threshold. So, the frequencies of all length 2 sequences {hS2, S3i , hS2, S4i , hS2, S5i , . . . } should be counted. Alike, sequences {hS2, S3, S4i, hS2, S3, S5i , hS2, S3, N 6i , . . . } whose length are 3, sequences {hS2, S3, S4, N 5i , hS2, S3, S4, N 6i , . . . } whose length are 4, and so forth, which will be exploited up to the maximum number of sequences that should be checked. Thus, we face the needs for all the sequence combinations to be investigated whether they are M V Ss or not. In reality, the numbers of sequence combinations will be factorial. The longer the sequences, the more combinations will be enumerated. The existing methods enumerate all the instance combinations, to check the V Ss and even the M V Ss, so that they should explore the whole DB and calculate the frequency of all combinations. So, it is required to emphasize the importance of the candidate sequence size because lots of unnecessary candidate sequences consume excessive time. In contrast, our approach utilizes a novel method combining the instance sets differently through the LFP-Tree, which is extended to anonymity from the FP-tree [1, 12]. A difference between the original FP-tree and our LFP-Tree is that our method generates the FP-tree at each step to find M V S with length i, M V Si . In addition, LFP-Tree handles sequences through each branch. Fung et al. [41] generated the Critical Violation Tree (CVT ) by employing the FP-tree. They removed M V Ss (which were already searched) through CVT (which was generated by using only M V Ss). In contrast, we can effectively reduce numerous tasks for removing M V Ss and for searching M V Ss by employing LFP-Tree (which is an improved method for CVT ). 5.3. LFP-Tree Method LFP-Tree is generated by the following three steps. First, each frequency of doublets is obtained by scanning the surrogate vector database. Second, the doublets are sorted by their frequencies in descending order. Third, the LFPTree is generated by the sorted doublets. For example, we scan the surrogate vector database and obtain each frequency of doublets, and then, the doublets are sorted so that we can generate the LFP-Trees. Figure 2(a) and (b) represent 10

Algorithm 1 Identify M V S set from LFP-Tree V Input: Vector Trajectory D , thresholds K and L V Output: M V S set S D 1: C1 , SC1, SC2, ∀di and ∀Si ← ∅ V 2: C1 ← all distinct doublets in D V V 3: Scan D once to compute D (b) for all doublets b ∈ C1 V 4: S1 ← ∀b with D (b) < K 5: d1 ← C1 − S1 6: i = 2 7: while i ≤ L and di 6= ∅ do LF P 8: Generate LFP-Tree Ti−1 by gLF P T (di−1 ) LF P 9: for each βj in Ti−1 do 10: Add [βj (K+) ./ βj (K+)]i to SC1 11: for each sequence q ∈ [βj (K−) ./ βj ]i do 12: if q ∈ / SC1 then LF P 13: Add q to SC2 and update |Ti−1 (q) | 14: end if 15: end for 16: end for 17: for each sequence q ∈ SC2 do 18: if q is a super sequence of any q 0 ∈ Si−1 then 19: Remove q from SC2 20: end if 21: end for LF P 22: Si ← Q for Q = q|q ∈ SC2 ∩ |Ti−1 (q) | < K 23: di ← all distinct doublets in SC2 − Si 24: SC1, SC2 ← ∅ 25: i=i+1 26: end while 27: return S D V = S1 ∪ S2 ∪ · · · ∪ Si−1 293 294 295 296 297 298 299

300 301 302 303

LFP-Trees to search M V S2 and M V S3 , respectively. For a given node whose frequency is k, each frequency of its parent nodes is equal to or bigger than k since the LFP-Tree generates the nodes from top to bottom in descending order by the frequencies so that we can completely exclude calculations for many unnecessary sequences. Note that the function gLF P T at line 8 in the Algorithm 1 generates LFP-Tree, but we skip the description of the meaning of it. Definition 7 (LFP-Tree T LF P and the Branch βj ). LFP-Tree T LF P is a special case of Frequent Pattern Tree (F P − T ree) [12] and is generated to search an Si which is a set of M V Si . A branch βj of T LF P is a set including from a leaf node and its parent nodes to the root. βj is a set of instance b whose frequency

11

Figure 2: LFP-Tree for MVS s

304

305 306 307 308 309 310

311 312 313 314

315

316 317 318

319 320 321 322 323 324 325

is denoted by |b| and is classified as two different sets by K as follows: ( βj (K+) = {(b|b ∈ βj ) ∧ (|βj .b| ≥ K)} βj βj (K−) = {(b|b ∈ βj ) ∧ (|βj .b| < K)}

(3)

Example 5. For K = 2 and β1 = {S5, S4, N6, S2, S3} in Figure 2(a), the frequency of S5 is 2, and the frequencies of the instances above S5 are not less than 2. So, the nodes S5, S4, N6, S2, and S3 are included in β1 (K+). However, the nodes E5 and E4 in β2 = {E5, E4, S2, S3} are included in β2 (K−) because their frequency is less than K = 2. Next, let us see the instance set combination in more detail. Definition 8 (Instance set combination). Given instance sets βj0 and βj00 (βj0 , βj00 ⊆ βj ). An instance set combination βj0 ./ βj00 L generates a set of sequence q (|q| = L), and q is one of the βj0 ∪ βj00 which includes at least one element of the set βj0 . With a join operator, ./, the combination can be defined as follows: 0 βj ./ βj00 L = {q|q ⊆ (βj0 ∪ βj00 ) ∩ min(|q ∩ βj0 |) ≥ 1}

(4)

where |q| = L. Note that if βj0 6=βj00 , [βj0 ./ βj00 ]L 6=[βj00 ./ βj0 ]L .

Corollary 1. By Definition 7 and 8, all sequences generated by [βj (K+) ./ βj (K+)]L always appear more frequently than K since each frequency of their subsequences is more than K. Hence, they are not M V Ss. Let us explain how to reduce the size of the unnecessary candidate set by using the LFP-Tree in Figure 2. In order to search M V Ss whose length are greater than or equal to L, the existing methods handle needless candidate sequences by using the naive enumeration. However, we could exclude lots of unnecessary sequences by Corollary 1. Meanwhile, we cannot be sure whether the frequency of sequences including βj (K−) is bigger than K or not. This is because some combinations could exist not only in one branch but also in the 12

326 327 328 329

330 331 332 333 334

335 336 337 338 339 340

341 342 343 344 345 346 347 348 349 350 351

352 353

354 355 356 357 358

359 360 361 362 363 364 365 366 367

other branches. For example, the sequence hE5, E4i locates β2 as well as β3 . This can be processed in that the location of a branch for a doublet is identified easily by the LFP-Tree. As a result, all sequences in the LFP-Tree would be classified into two sequence cases: • Sequence Case 1 (SC1 ): For a sequence q (|q| ≥ K), we need not calculate the frequency of q anymore because q is always not an M V S. This is one of the greatest advantages of using the LFP-Tree due to [βj (K+) ./ βj (K+)]L . However, the existing method enumerates all combinations and spends a lot of time on processing it inefficiently. • Sequence Case 2 (SC2 ): It is not clear whether a sequence q is an M V S or not if q satisfies |q| < K in a branch βj . SC2 is a set of sequence including an instance whose frequency is lower than K in a branch. That is, SC2 is resulted from [βj (K−) ./ βj ]L . Note that doublets whose frequency is lower than K have already been eliminated before generating the LFP-Tree. 5.4. Secure Trajectory Database Publisher From now on, we consider how to remove M V S effectively. There are two methods to achieve this: global suppression and local suppression [5, 1]. Firstly, in order to delete an M V S, global suppression removes all instances of a target doublet in DV . From the perspective of privacy, if a doublet is removed in the whole database, no new M V S will be generated. However, it may devastate the entire database so that data utility will be severely decreased. Secondly, local suppression removes instances in the records including the M V S. It offers advantages in data utility, but is difficult to achieve optimal data utility that is proved as NP-Hard [5]. To effectively remove the M V S, we would delete a doublet p included in the M V S. The related theorem is as follows. Theorem 2. If a doublet p included in the M V S q is deleted, the sequence q 0 (q 0 = q\p) is no longer a VS. Proof. Let us assume that q 0 (q 0 = q\p) is a violating sequence. By Definitions 2 and 3, an M V S is a subsequence that does not include a violating sequence. Thus, q 0 is no longer a violating sequence because, for q 0 ⊆ q, the assumption that q 0 is not a violating sequence contradicts the fact that the sequence q is an M V S. Meanwhile, for K = 2 and L = 3, by applying Algorithm 1 to the trajectory database in Table 3, we could obtain the set of M V S, S(DV ) = {hN W 6i , hE7i, hE4, S2i, hE5, S2i , hE4, N 6i , hE5, N 6i}. Let us explain the excellent advantage of our suppression method employing LFP-Tree. At first, global suppression should be processed for each doublet whose frequency is below K, since they need not to be considered. For example, two M V S1 , hN W 6i and hE7i, should be removed by global suppression in that they consist of only one doublet. Now, consider the local suppression of hE4, S2i, hE5, S2i, hE4, N 6i, and hE5, N 6i which have doublets exceeding one. 13

368 369 370 371 372 373 374 375 376 377 378 379

380 381 382 383 384 385 386 387 388

389 390 391 392 393 394 395

396 397 398 399 400

401 402 403

404 405 406 407 408 409

The existing method [5] naively checks whether the local suppression of each instance p (p ∈ m) is valid or not for all records including the M V S, m. In order to treat the M V S, two steps are needed: First, for each record D(m), the method obtains candidate sequences that may become an M V Ss if every p is deleted and considers each sequence q (p ∈ q) that is not an M V S yet in D(m) as the candidate sequence. Next, it checks whether or not the new M V S is generated after the local suppression of each p by calculating the frequency of each q from all records including p except D(m) (i.e., D(p) − D(m)). Thus, it causes too much cost due to the calculations for the numerous records as well as all the candidate sequences. However, our method can improve the cost significantly by utilizing the following two properties based on the adoption of the LFP-Tree not only searching M V Ss but also removing them. Property 1 (Needless candidate sequences). Let us check that a local suppression of p (p ∈ m) in all the branches including m is valid for the removal of m. To do this, we can reduce the frequency of node (instance) p in βj , |βj .p|, such as |βj .p| − minb∈m |βj .b| where minb∈m |βj .b| is the frequency of m in βj . The frequency of each candidate sequence generated from [{p} ./ βj (K+)] is still greater than or equal to K if the frequency of the node p is greater than or equal to K (i.e., |βj .p| − minb∈m |βj .b| ≥ K). In other words, one of the specific advantages of our method is that we can skip the process for the candidate sequences due to LFP-Tree. Example 6. In Figure 2(a), let us assume that in order to remove M V S hE4, S2i, we have to check whether or not a local suppression of S2 of β2 . The M V S will be removed if we deduct the frequency of the node S2 by one which is the frequency of the M V S in β2 . Then, we need not calculate the number of sequences such as hS2, S3i generated from [S2 ./ β2 (K+)] anymore, since the frequency of the node S2 is 4 − 1 = 3 ≥ K so that the frequencies of the sequences are still greater than or equal to K. This is the advantage of employing the LFP-Tree. Property 2 (Huge reduction: |T LF P | ≤ |D|). The reason why |T LF P | ≤ |D| is that the duplicated data is integrated by the same branch. In our vector model, the more duplicated cases happen, since the number of vectors is dramatically smaller than that of cell points. Thus, trajectories in the different points can be incorporated into the same branch. The existing method has to check numerous records by not only M V S search but also the validity of the local suppression. However, we can reduce the costs significantly because it processes not on the records but on the branches. Example 7. The five records in Table 3 were shrunk to 3 branches in Figure 2(a). For example, in order to remove the M V S hE4, S2i and check that the local suppression of the instance S2 is valid, the existing method should touch the 3 records Obj 1, Obj 4, and Obj 5 among four records including S2 in Table 3. However, we need only check only one branch, β1 , except β2 including the M V S if we employing Figure 2(a).

14

Algorithm 2 Local suppression through LFP-Tree update Input: Vector Trajectory database DV , T1LF P , a doublet p in M V S m, thresholds K and L Output: Boolean value showing local suppression using p is proper 1: if |D V (p)| − |D V (m)| < K then 2: return false 3: end if 4: d ← ∀b with |D V (b)| ≥ K 5: generate T LF P by gLF P T (d) 6: for each branch βj (∀b ∈ βj with b ∈ m) do 7: i=2 8: if |βj .p| − minb∈m (|βj .b|) ≥ K then 9: for i ≤ L do 10: Q0 ← Q0 ∪ [p ./ βj (K+)]L 11: Q∗ ← [p ./ βj ]L − [p ./ βj (K+)]L 12: Q ← Q ∪ Q∗ 13: Update |T LF P (q) | where q ∈ Q and q ∈ Q∗ 14: i←i+1 15: end for 16: else 17: for i ≤ L do 18: Q∗ ← [p ./ βj ]L − Q0 19: Q ← Q ∪ Q∗ 20: Update |T LF P (q) | where q ∈ Q and q ∈ Q∗ 21: i←i+1 22: end for 23: end if 24: end for 25: for each sequence q where q ∈ Q and q ∈ / S(DV ) do LF P 26: if 0 < |T (q) | < K then 27: return false 28: end if 29: end for 30: return true

410 411 412 413 414 415 416 417 418

Now, how can we clearly verify whether or not the removal of the instance is valid for a certain instance about local suppression? We can obtain an answer for this question through Algorithm 2. Which instance should we remove to preserve data utility and to remove more M V Ss simultaneously? Unfortunately, while preserving the data utility at the maximum level, finding the optimal solution to remove an M V S is NP-hard [5, 1]. To select an instance for removal of an M V S, we consider a score related to both the number of M V Ss to be removed and the number of instances to be removed by the selected instances. The Score table in Algorithm 3 could be used to identify which instance should

15

419

be selected.

Figure 3: Local suppressions of S 2 and N 6 420 421 422 423 424

425 426 427 428 429 430 431 432 433 434 435

436 437 438 439 440 441 442 443 444 445

446 447 448 449

We previously checked that the global suppression causes too low data utility. So, let us explain two examples of utilizing the LFP-Tree update to check whether a local suppression is valid or not. It is because a removal of an instance in problem may provoke another M V S, cascadingly. For example, two M V S2 hE4, S2i and hE4, N 6i need to be removed. Example 8 (Removal of M V S hE4, S2i). First, if we delete the node E4 of β2 from Figure 2(a), the new M V S hE4i will be generated because the frequency of the sequence hE4i will be only one throughout LFP-Tree. So, the local suppression of E4 for the removal of M V S hE4, S2i is not valid. Second, each candidate M V S is hS2i and hS2, S3i if we do a local suppression of the node S2 in β2 . When we deduct the frequency of S2 in β2 by one, which is the frequency of the M V S in β2 , we need not calculate the candidate sequences anymore since the frequencies of them are still greater than or equal to K (i.e., 4 − 1 = 3 ≥ K). So, the local suppression of S2 in β2 is valid. After the local suppression of S2 for the removal of M V S hE4, S2i, LFP-Tree will be updated as shown in the right side of Figure 3(a). Example 9 (Removal of M V S hE4, N 6i). First, if we delete the node E4 of β3 from Figure 2(a), the new M V S hE4i also will be generated because of the same reason of the above example. Second, each candidate M V S is hN 6i , hN 6, S3i if we do a local suppression of the node N 6 in β3 . When we deduct the frequency of N 6 in β3 by one (which is the frequency of the M V S in β2 ), the node N 6 in β3 will be removed, and no new M V S will be generated. This is because both the frequencies of the candidate sequences are already 3 (≥ K) as we can see in β1 in Figure 2(a). After the local suppression of N 6 for the removal of M V S hE4, N 6i, LFP-Tree will be updated as shown in the right side of Figure 3(b). As a result, the LFP-Tree will be updated as shown in Figure 4 after removing all M V S through the above local suppressions. We define a connected trajectory to describe multiple trajectories in a record. A connected trajectory would be divided into some connected trajectories if some 16

Algorithm 3 Anonymized Trajectory Database Publisher Input: Vector Trajectory database DV , a doublet p in M V S m, thresholds K, L Output: Secure Trajectory Database DS 1: Find S(D V ) by Algorithm 1 2: Build a Score table via Algorithm 2 3: while S(D V ) 6= ∅ do 4: select an M V S m 5: T ←∅ 6: for each p with the highest score from M V S m do 7: if p can be used for local suppression then 8: S 0 ← ∀m0 for m0 ∈ S(DV ) where (p ∈ m0 ) ∧ {DV (m0 ) = DV (m)} 9: Remove p from DV (m) 10: else if m 6= ∅ then 11: T ←p 12: m←m−p 13: continue 14: else 15: select p with the highest score from T 16: S 0 ← ∀m0 with {m0 ∈ S(DV )} ∧ (p ∈ m0 ) 17: Remove all instances of p from DV 18: end if 19: Update the T LF P and Score table for removing the instances of p 20: Remove S 0 from S(DV ) 21: end for 22: end while 23: return the anonymized D V

450 451 452 453 454 455

456 457 458

459 460 461 462 463

464

instances are deleted to remove M V S. In other words, a connected trajectory in a record could be divided by some connected trajectories after removing M V Ss which were within the connected trajectory. In this case, each anonymized connected trajectory could be represented as a with surrogate vectors by employing the mapping function Φ [·] and Φ−1 [·] in Definition 5 and 6, respectively. Thus, the number of private keys is equal to the number of connected trajectories. Definition 9 (connected trajectory). In the grid model, a connected trajectory q with |q| > 1 is connected if there is a route between every pair of cells included in q. Next, we can get the anonymized trajectories from the comprehensive Algorithm 3. Note that we use |T | instead of |T LF P | which is the number of branches in LFP-Tree to avoid confusion over time complexity because the ‘LF P ’ might seem like an exponent. Suppose a worst case. The time complexity of the best L current model is O((mnt) |D|), where • m, n are the number of x-axis, y-axis cells, respectively (i.e., mn means 17

Figure 4: LFP-Tree update result

465

466 467

the total number of cells); • t is the number of Time stamp (i.e., mnt is equal to the number of dimension);

468

• L is the length of the background knowledge, and

469

• |D| is the number of records in the trajectory database D [5].

470 471 472 473 474 475 476 477 478 479 480 481 482 483

However, by our model, as explained previously, the total number of cells in the existing model can be reduced from mn to 8, which is a constant. LFPTree is generated by reading the database only twice. Even though scanning through the LFP-Tree is unnecessary, the number of all candidate sequences is notably decreased by removing the unnecessary processes. In other words, the time complexity drops from O((mnt)L |D|) to O(2|D|) for a best case that all frequencies of the leaf nodes in LFP-Tree are equal to or bigger than K. In conclusion, the time complexity of our algorithm is O (8t)L |T | where |T | ≤ |D| by employing the branches in LFP-Tree. Moreover, in many real-world cases, we expect |T | |D| because there will be numerous overlapped branches by the surrogate vector. We previously claimed that our vector-based model is much better than the cell point-based model in terms of data utility. Concerning the M V Ss, Theorem 3 verify this assumption.

485

Theorem 3. ||DV || ≥ ||D|| where each ||DV ||, ||D|| is the data utility (the number of data) of the anonymized DV , D, respectively.

486

Proof. We verify our theorem by examining each sequence case:

484

487 488 489 490 491

1. For a given sequence q, (a) if |q| = 1, then |D(q)| = |DV (Φ(q))| because the two models denote the q via cell point, not the vector data, so q = Φ(q). (b) if |q| > 1, then |D(q)| = |DV (Φ(q))| because the vector-based sequences can be converged into point-based sequences by using Φ−1 . 18

492 493 494 495 496

In other words, the number of M V S in the vector-based model is at least equal to the number of M V S in the point-based model. 2. Given a sequence set Q = {q1 , q2 , . . . , q|Q| } (|Q| > 1) where |qi | = |qj | ≥ 2 (for all i, j with i 6= j), each element of the set exists in the different records. (a) For the same reason of Case 1 above, |D(q)| = |DV (Φ(q))| if both locations and surrogate vectors of them are all different, i.e., qi 6= qj and Φ(qi ) 6= Φ(qj ).

497 498 499

(b) We i. ii. iii.

suggest three conditions to prove our theorem: All sequences in Q are connected trajectories. Original cell points of the sequences are all different. Vectors of the sequences are all the same, i.e., qi 6= qj and Φ(qi ) = Φ(qj ).

500 501 502 503

504 505

If sequences satisfy these three conditions, then |DV (Φ(qi ))| = P V qi ∈Q |D(qi )|. So, we know that |D (Φ(q))| > |D(q)| is valid. V Hence, for q (|D(q)| < K ≤ |D (Φ(q))|), q is an M V S in the point-based model, but not in the vector-based model. Thus, our theorem is verified since the smaller the number of M V S is, the smaller the number of the data that should be removed is.

Figure 5: Decrypted Trajectory Database

506 507 508 509 510 511 512 513 514

Note that we utilize the private key, which is identical to an original cell point for the decrypted trajectory database. Encryption does not allow the attacker or third party to grasp the information through the encryption of data. Instead, only the authorized (trusted) users can receive the private key and utilize the information through decryption. Likewise, our study makes it impossible for an attacker or a third party to grasp an accurate real trajectory through an encrypted trajectory satisfying privacy. More importantly, our study allows only the authorized (trusted) users to utilize a real trajectory by providing a private key. For K = 2 and L = 3, Figure 5 represents the anonymized 19

Table 4: Data Disclosure on Trust Level

Level

Release Vector Grid Key

Low

√

Medium

√

√

High

√

√

Anonymity No specific location Area Information, Not exact location

√

Individual tory

trajec-

Utility Flow, trajectory complexity, trajectory sparseness Statistics about trajectory with LBS Various LBS

541

and decrypted vector trajectories for Table 1 generated by the surrogate vector model and our algorithm. We can utilize much more data than the result of the existing method for various applications by employing the private keys. To protect privacy with private key, we introduce the level of data disclosure according to the user-level based on trust through Table 4. First, for low-level users who are not reliable, only surrogate vector data will be provided without grid information nor decryption key. Hence, from the point of view on anonymization, this is really anonymized data since the user’s physical location point cannot be provided at all (i.e., our approach will not provide even a single specific location) when compared with existing conventional methods (which inevitably disclose “specific location”, which per se can be a hint). Moreover, from the point of view on utilization, the overall trajectory flow can be guessed. For example, we can see (i) whether there is a lot of North-eastward flow as a whole; (ii) whether it is a complex flow or largely pass by; or (iii) whether it is a dense trajectory or a very sparse trajectory? For medium-level users, the area information (map grid) and surrogate vector would be provided. Through this, although it may be possible to know which area the information is, the accurate flow cannot be guessed at all. However, in terms of utilization, statistics about the trajectory can be known based on a specific area. For example, we can know that (i) about 63% of users moved toward to “South-eastward”, or (ii) about 2% of users are outliers. For high-level users, not only grid information and surrogate vectors for maps but also private keys could be provided. In terms of utilization, we can know a lot of information such as (i) “how many users pass through my store?”, (ii) “what is the distance traveled for an individual route?”, (iii) “how many people go together at a specific time?” To guarantee anonymization and protect privacy, this information will only be available to trusted users.

542

5.5. Surrogate Vector Applications

515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540

543 544

We can utilize many things with the private key and surrogate vectors. If so, without the private key, what useful applications we can play with only

20

545 546 547 548 549

surrogate vectors? We introduce some novel and useful applications by employing surrogate vectors. Let us assume that (i) Table 5 is a small part of an anonymized database through our model, and (ii) Figure 6 represents trajectories of five objects in Table 5, in which each cell in the grid does not indicate any exact location. Then: Table 5: Example for Novel Applications with Surrogate Vector

ID Obj a Obj b Obj c Obj d Obj e

Surrogate Vector Trajectory E1-N2-E3-N4-W5-S6-N7-NE8-E9-S10-S11-W12 E1-E2-E3-E4-N5-N6-N7-N8 E1-N2-N3 E1-E2-E3-N4-W5-N6-W7-W8-N9-E10 W1-S2-S3-SE4-E5-E6-E7-N8-N9-N10-W11-W12

Figure 6: Example Trajectory for Applications with Surrogate Vector

550 551 552 553 554 555

556 557 558 559

• User tendency: It is easy to grasp the user tendency on an area by using the surrogate vectors. For instance, Obj ‘a’ in Table 5 and Figure 6 shows a complicated trajectory with diverse vectors. It can be classified as a tourist who is interested in the area. In contrast, Obj ‘b’ shows a simple trajectory with repeated vectors. It can be classified as a user who simply passes the area. • Efficiency of trajectory: We can measure the efficiency of trajectory through the surrogate vectors. The length of the trajectory traveled by the user to the destination is equal to the number of surrogate vectors. In other words, the distance to the destination can be easily obtained by a simple 21

560 561 562 563 564 565 566 567

568 569 570 571 572 573 574 575 576 577 578 579

580

581 582

summation of the vectors because the result of summation will be offset by the inverse vectors. Hence, it is possible to easily measure how efficiently the user has moved to the destination when we employ the simple summation of the vectors. For example, consider Obj ‘c’ and ‘Obj ‘d’ in Table 5 who moved to their destinations. Obj ‘c’ moved to its destination through the shortest path without passing unnecessary cells, but Obj ‘d’ reached its destination in an inefficient way by passing many unnecessary cells so that the East and West vectors neutralize each other. • Cycle or quasi-cycle search: For a given trajectory, we can efficiently identify the shape of trajectory by employing surrogate vectors. Consider Obj ‘e’ in Table 5 as an example. In order to detect a cycle or quasi-cycle trajectory, the existing methods calculated the distance between the starting point and all the other points while saving all the information in each point of the trajectory. However, we can be easily identified a cycle or quasi-cycle trajectory by adding up the surrogate vectors where inverse vectors neutralize each other. In other words, the closer the summation of surrogate vectors to zero, the higher is the probability of having a quasicycle. This also can be applied to each x-axis and y-axis. Furthermore, we need not to save each point of the trajectory when searching a cycle or quasi-cycle because this can be achieved by only adding up the vectors. 6. Experiments We conducted experiments with an Intel(R) CoreTM i7-4790K CPU @ 4.00 GHz PC with 16 GB of the main memory. Table 6: Experiment Datasets Data Set Synthetic data Athens trucks T-Drive San Francisco Taxis

583 584 585 586 587 588 589 590 591 592 593 594

Data Size 73M 1.3M 227M 1.98G

Record |D| 10,000 273 8,179 11,237

Dimension 7,290 4,830 27,720 177,012

We conducted experiments using four types of datasets as summarized in Table 6: Synthetic dataset, Athens trucks [34], T-drive [38], and San Francisco Taxis [27]. Original datasets consist of records, location information (i.e., GPS coordinates - Latitude and Longitude), timestamps, etc. For the experiments, each trajectory in the datasets was generated by the mapping function Φ [·] and the grid model. First, the Synthetic dataset is a synthetic data set, which consists of cell points with timestamp in a grid model. Second, Athens trucks is a real-world trajectory dataset including 1,100 trajectories composed of 112,300 GPS positions of 50 trucks transporting concrete in Athens. In order to increase the K value in the experiment, we consider the same truck IDs which recorded in different date but moved in same spatial space as different truck IDs. Last, T-drive is a real-world trajectory dataset generated from a smart driving 22

595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615

616 617 618 619 620

service based on GPS trajectories of many taxies. It contains one-week trajectories of 10,357 taxies. Last, San Francisco Taxis dataset contains mobility traces of taxis in San Francisco, USA. It contains GPS coordinates of approximately 500 taxis collected over 30 days in the San Francisco Bay Area. Note that we consider several taxis with the same ID but different date as different taxis. First, we experimented to know how the cell area affects the accuracy of the data, the number of dimensions, the data utility, and the time complexity in the grid model through the above real-world datasets Athens trucks, T-Drive, and San Francisco Taxis. Second, we performed an experiment to compare global suppression and local suppression for data utility via T-Drive with respect to K. Third, we checked the anonymity level in terms of data removal rate for three data types. The anonymity level is defined as the number of anonymized records over that of initial vulnerable records. The initial vulnerable record is the record including M V Ss in the original database, and the anonymized record is the record not including any M V Ss while anonymizing database. Moreover, the data removal rate is the number of removed data over the number of initial data in the original database. Finally, we performed an experiment to compare the time complexity with respect to dimensionality. Note that we skipped the results for L ≥ 4 since, firstly, the M V Ss whose length are longer than 3 were few, and, secondly, the results were almost the same as the result for L = 3 as shown in [5]. 6.1. Analysis on Cell Areas In order to know how the cell area affects the accuracy of the data, the number of dimensions, the data utility, and the time complexity, we gave different conditions to each dataset such as K = 6, L = 3 for Athens trucks dataset and K = 60, L = 3 for T-Drive, San Francisco Taxis datasets.

Figure 7: Avg. data accuracy and dimension for varied cell area 621 622 623 624 625 626

First, to measure the accuracy of the data depending on the size of the cell area, we define the average data accuracy, which refers to the ratio of the average number of cells in which one data is contained. For example, “average data accuracy = 1” means one point is represented by one cell, which refers that the point shows high accuracy. On the other hand, “average data accuracy = 0.1” means that 10 points are expressed by only one cell, thus, the accuracy of 23

Figure 8: Data utility with varied cell area

627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647

the points is low. Moreover, as the size of a cell increases, the total number of cells decreases and the size of the dimension decreases. Figure 7 shows that both the average data accuracy and the number of dimension decrease as the cell area increases. Figure 8 shows the number of data remaining data after removing MVSs according to the cell area, i.e., the data utility. We could learn two implications through the two results. First, the larger the cell area, the larger the number of data remaining after anonymizing the database i.e., data utility. The reason is that representing the multiple data as one cell causes increase in the number of records satisfying K as the number of dimensions decreases. Second, our model P 4V still shows better performance in terms of data utility than that of KL. Additionally, we performed an experiment in order to know more specifically about the time complexity depending on the cell area. Figure 9 shows the run time for three datasets. Previously, we have confirmed that the larger the size of the dimension, the higher the average data accuracy. Considering these, as shown in Figure 9, it is evident that the existing method KL is inefficient in terms of time complexity for the high average data accuracy. However, our model P 4V shows much better performance than the existing method KL by using the surrogate vector. Finally, from the time complexity perspective, it is obvious that our model is theoretically superior to the existing model as we previously mentioned. In order to analyse the time complexity in terms of the

Figure 9: Runtime with varied cell area

24

648 649 650 651

dimension size in detail, we also conduct an experiment for this via synthetic datasets where each dataset has a different dimension each other. With the fixed K = 60, the result in Figure 10 demonstrates that P4V is two or three orders of magnitude faster than KL.

Figure 10: Runtime with the varied dimension

652

6.2. Global Suppression vs. Local Suppression

Figure 11: Global vs. Local suppression in data utility 653 654 655 656

Figure 11 shows a comparison of the data utility for L = 3 in the global suppression and the local suppression. Each figure shows that the local suppression is superior to the global suppression with varied K. Thus, we used only local suppression in the following experiments for data utility.

25

657

6.3. Anonymity Level vs. Data Utility

Figure 12: Anonymity Level for each dataset (L = 3) 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676

For the higher anonymity level, the data removal rate should be increased generally. It means we should remove much more data when K value increase because the greater K values is, the more M V Ss are generated. The result from the Synthetic data in Figure 12(a) shows the anonymity level values according to the varied data removal rate. Through P 4V , 100% anonymity level is guaranteed by nearly 60%, 70% data removal rate for K = 40 and K = 100, respectively. In order to achieve the same goal, however, KL should remove almost 90%, 100% of the whole data in Figure K = 40 and K = 100, respectively. Figure 12(b) is the result from the Athens trucks. Through P 4V , 100% anonymity level is guaranteed by nearly 60% data removal rate for K = 4 and K = 10. To achieve the same goal, however, KL should remove almost 80% of the whole data for the same K values. Figure 12(c) is the result from the T-Drive. Through P 4V , 100% anonymity level is highly guaranteed by about 60%, 75% data removal rate for K = 40 and K = 100, respectively. However, for the same goal, KL should remove greater than 90% of the whole data for the same K values. Figure 12(d) is the result from the San Francisco Taxis. Through P 4V , 100% anonymity level is highly guaranteed by about 60%, 70% data removal 26

678

rate for K = 40 and K = 100, respectively. However, for the same goal, KL should remove almost 100% of the whole data for the same K values.

679

7. Conclusions

677

710

We proposed an effective model to publish anonymized trajectory database. Although an individual’s location information is essential for various services, it can be abused by attackers in terms of privacy concerns. In order to protect privacy, there have been many studies which introduce various methods based on point-based data. When compared with existing studies based on point-based method, our vector-based model employs both LFP-Tree and the surrogate vectors. Our model is far superior to the existing studies in many respects. First, the generic difference between point-based methods and our vector-based method is as follows. From the point of view on anonymization, our vector-based method provides really anonymized data since the user’s physical location point cannot be provided at all (e.g., not even a single location will be provided) when compared with the existing point-based methods which inevitably disclose “specific location”. Second, from the usability perspective of our vector-based method, there are a lot of applications that the existing point-based methods cannot achieve. Vectors can be applied to some novel applications such as user tendency, trajectory efficiency, and cycle or quasi-cycle search. Also, in terms of the data disclosure according to the user-level based on trust, the overall trajectory flow can be guessed, and statistics about the trajectory can be known for a specific area. Third, in terms of time complexity, our vector-based method performed much faster than the existing point-based methods by reducing the number of data dimension. Moreover, to search and remove MVSs, the existing pointbased methods need to consider all candidate sequences, but our vector-based method can effectively skip a lot of tasks for unnecessary candidate sequences through the use of LFP-Tree. Finally, in regard to utility perspective, our vector-based method preserves much more data than the existing point-based methods through vectors while satisfying anonymity. As ongoing and future work, we plan to study the anonymity of high-level users with private key theoretically as well as experimentally in terms of our vector approach. In addition, we plan to apply various sizes of the cell to our vector-based method in grid model.

711

Acknowledgements

680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709

712 713 714 715 716

This project is supported by (i) National Research Foundation of Korea (NRF) Grant funded by the South Korean Government Ministry of Education (MOE) (NRF-2018K2A9A1A01090616); (ii) Institute of Information & Communications Technology Planning & Evaluation (IITP) Grant funded by the South Korean Government Ministry of Science and ICT (MSIT) (2019-0-00136,

27

719

Development of AI-Convergence Technologies for Smart City Industry Productivity Innovation) (iii) Natural Sciences and Engineering Research Council of Canada (NSERC); and (iv) University of Manitoba.

720

References

721

References

717 718

722 723

724 725 726

727 728

729 730 731

732 733 734

735 736

737 738

739 740

741 742 743

744 745 746

747 748

749 750

[1] K. Al-Hussaeni, B. C. Fung, W. K. Cheung, Privacy-preserving trajectory stream publishing, Data & Knowledge Engineering 94 (2014) 89–109. [2] B. Bamba, L. Liu, P. Pesti, T. Wang, Supporting anonymous location queries in mobile environments with privacygrid, in: WWW 2008, 237– 246, 2008. [3] Y. Cao, M. Yoshikawa, Y. Xiao, L. Xiong, Quantifying differential privacy under temporal correlations, in: IEEE ICDE 2017, 821–832, 2017. [4] R. Chen, B. Fung, B. C. Desai, N. M. Sossou, Differentially private transit data publication: a case study on the Montreal transportation system, in: ACM KDD 2012, 213–221, 2012. [5] R. Chen, B. C. Fung, N. Mohammed, B. C. Desai, K. Wang, Privacypreserving trajectory data publishing by local suppression, Information Sciences 231 (2013) 83–97. [6] C. Dwork, Differential privacy: a survey of results, in: TAMC 2008, 1–19, 2008. [7] C. Dwork, F. McSherry, K. Nissim, A. Smith, Calibrating noise to sensitivity in private data analysis, in: TCC 2006, 265–284, 2006. [8] E. ElSalamouny, S. Gambs, Differential privacy models for location-based services, Transactions on Data Privacy 9(1) (2016) 15–48. [9] B. Fung, K. Wang, R. Chen, P. S. Yu, Privacy-preserving data publishing: A survey of recent developments, ACM Computing Surveys 42(4) (2010) 14. [10] S. Gao, J. Ma, C. Sun, X. Li, Balancing trajectory privacy and data utility using a personalized anonymization model, Journal of Network and Computer Applications 38 (2014) 125–134. [11] F. Giannotti, M. Nanni, F. Pinelli, D. Pedreschi, Trajectory pattern mining, in: ACM KDD 2007, 330–339, 2007. [12] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, in: ACM SIGMOD 2000, 1–12, 2000.

28

751 752 753 754

755 756

757 758 759

760 761

762 763

764 765

766 767 768

769 770 771

772 773

774 775

776 777

778 779 780 781 782

783 784 785

786 787 788

[13] S. Han, H. J. Bae, J. Kim, S. Shin, S.-E. Choi, S. H. Lee, S. Kwon, W. Park, Lithographically encoded polymer microtaggant using high-capacity and error-correctable QR code for anti-counterfeiting of drugs, Advanced Materials 24(44) (2012) 5924–5929. [14] J. Hua, Y. Gao, S. Zhong, Differentially private publication of general timeserial trajectory data, in: IEEE INFOCOM 2015, 549–557, 2015. [15] Z. Huo, X. Meng, H. Hu, Y. Huang, You can walk alone: trajectory privacypreserving through significant stays protection, in: DASFAA 2012, 351– 366, 2012. [16] Z. Jorgensen, T. Yu, G. Cormode, Conservative or liberal? personalized differential privacy, in: IEEE ICDE 2015, 1023–1034, 2015. [17] Y. Kim, J. Han, C. Yuan, Toptrac: Topical trajectory pattern mining, in: ACM KDD 2015, 587–596, 2015. [18] B. Lee, J. Oh, H. Yu, J. Kim, Protecting location privacy using location semantics, in: ACM KDD 2011, 1289–1297, 2011. [19] J.-G. Lee, J. Han, X. Li, H. Cheng, Mining discriminative patterns for classifying trajectories on road networks, IEEE TKDE 23(5) (2011) 713– 726. [20] M. Li, L. Zhu, Z. Zhang, R. Xu, Achieving differential privacy of trajectory data publishing in participatory sensing, Information Sciences 400 (2017) 1–13. [21] N. Li, T. Li, S. Venkatasubramanian, t-closeness: Privacy beyond kanonymity and l-diversity, in: IEEE ICDE 2007, 106–115, 2007. [22] T. Li, N. Li, On the tradeoff between privacy and utility in data publishing, in: ACM KDD 2009, 517–526, 2009. [23] A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, ldiversity: Privacy beyond k-anonymity, in: IEEE ICDE 2006, art. 24, 2006. [24] Parliament and of the European Union, Regulation on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (Data Protection Directive), Official Journal of the European Union, L119 (2011) 1–88. [25] T. Peng, Q. Liu, D. Meng, G. Wang, Collaborative trajectory privacy preserving scheme in location-based services, Information Sciences 387 (2017) 165–179. [26] D. J. Peuquet, Z. Ci-Xiang, An algorithm to determine the directional relationship between arbitrarily-shaped polygons in the plane, Pattern Recognition 20(1) (1987) 65–74. 29

789 790

791 792 793

794 795 796

797 798

799 800 801

802 803 804

805 806

807 808

809 810 811

812 813 814

815 816

817 818

819 820

821 822 823

824 825

[27] M. Piorkowski, N. Sarafijanovic-Djukic, M. Grossglauser, CRAWDAD dataset epfl/mobility, doi:10.15783/C7J010, 2009. [28] G. Poulis, S. Skiadopoulos, G. Loukides, A. Gkoulalas-Divanis, Aprioribased algorithms for k m -anonymizing trajectory data, Transactions on Data Privacy 7(2) (2014) 165–194. [29] L. Sweeney, k-anonymity: a model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5) (2002) 557–570. [30] M. Terrovitis, N. Mamoulis, Privacy preservation in the publication of trajectories, in: MDM 2008, 65–72, 2008. [31] L. Wang, D. Yang, X. Han, T. Wang, D. Zhang, X. Ma, Location PrivacyPreserving Task Allocation for Mobile Crowdsensing with Differential GeoObfuscation, in: WWW 2017, 627–636, 2017. [32] N. Wang, X. Xiao, Y. Yang, Z. Zhang, Y. Gu, G. Yu, PrivSuper: A superset-first approach to frequent itemset mining under differential privacy, in: IEEE ICDE, 809–820, 2017. [33] X. Xiao, G. Wang, J. Gehrke, Differential privacy via wavelet transforms, IEEE TKDE 23(8) (2011) 1200–1214. [34] M. Xie, EDS: a segment-based distance measure for sub-trajectory similarity search, in: ACM SIGMOD 2014, 1609–1610, 2014. [35] Y. Xin, Z.-Q. Xie, J. Yang, The privacy preserving method for dynamic trajectory releasing based on adaptive clustering, Information Sciences 378 (2017) 131–143. [36] F. Xu, Z. Tu, Y. Li, P. Zhang, X. Fu, D. Jin, Trajectory recovery from ash: user privacy is not preserved in aggregated mobility data, in: WWW 2017, 1241–1250, 2017. [37] T.-H. You, W.-C. Peng, W.-C. Lee, Protecting moving trajectories with dummies, in: MDM 2007, 278–282, 2007. [38] J. Yuan, Y. Zheng, X. Xie, G. Sun, Driving with knowledge from the physical world, in: ACM KDD 2011, 316–324, 2011. [39] B. Zhou, J. Pei, Preserving privacy in social networks against neighborhood attacks, in: IEEE ICDE 2008, 506–515, 2008. [40] M. Ghasemzadeh, B.C.M. Fung, R. Chen, A. Awasthi, Anonymizing trajectory data for passenger flow analysis, Transportation Research Part C: Emerging Technologies, 39 (2014) 63–79. [41] B. Fung, M. Cao, B.C. Desai, H. Xu, Privacy protection for RFID data, in: ACM SAC 2009, 1528–1535, 2009. 30