title: Sensitive attribute privacy preservation of trajectory data publishing based on l-diversity
authors: Yao, Lin; Chen, Zhenyu; Hu, Haibo; Wu, Guowei; Wu, Bin
date: 2020-11-17
journal: Distrib Parallel Databases
DOI: 10.1007/s10619-020-07318-7

Abstract: The wide application of positioning technology has made collecting the movement of people feasible for knowledge-based decision making. Data in its original form often contain sensitive attributes, and publishing such data will leak individuals' privacy. In particular, a privacy threat occurs when an attacker can link a record to a specific individual based on some known partial information. Therefore, maintaining privacy in the published data is a critical problem. To prevent record linkage, attribute linkage, and similarity attacks based on the background knowledge of trajectory data, we propose a data privacy preservation scheme with enhanced l-diversity. First, we determine those critical spatial-temporal sequences which are more likely to cause privacy leakage. Then, we perturb these sequences by adding or deleting some spatial-temporal points while ensuring that the published data satisfy our (l, α, β)-privacy, an enhanced privacy model derived from l-diversity. Our experiments on both synthetic and real-life datasets suggest that our proposed scheme can achieve better privacy while still ensuring high utility, compared with existing privacy preservation schemes on trajectories.

The popularity of smart mobile devices with positioning technologies triggers the collection of location information by suppliers, corporations, individuals, etc. for knowledge-based decision making. Therefore, vast amounts of trajectory data are collected together with other information. Data miners have also shown great interest in analyzing these data to provide plentiful services for people. For example, recent studies [1, 2] have shown that tracking the environmental exposure of a person through his daily trajectories helps to improve diagnosis. Therefore, wearable devices have been generating tremendous amounts of location-rich, real-time, and high-frequency sensing data along with physical symptoms for the remote monitoring of patients with common chronic diseases, including diabetes, asthma, and depression [3]. However, the original data may contain sensitive information about individuals, such as health status. Let us take Table 1 to illustrate this. Table 1 [4] is an original table without omitting any attribute. In this table, there are four typical types of attributes: explicit identifier, quasi-identifiers, sensitive attribute, and non-sensitive attribute [5]. An Explicit Identifier (EI), such as the name, is used to identify an individual uniquely and is always removed from the published table shown in Table 2. On the other hand, a single Quasi-Identifier (QI) cannot uniquely identify a specific individual, but a few QIs can be combined to achieve it. In this paper, our focused QI is Trajectory, which consists of a set of spatial-temporal trajectory points, each with a location and a time stamp. The Sensitive Attribute (SA) contains private information of users, such as Disease in Table 1. A non-sensitive attribute can be known by the public without any privacy concern.
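For concreteness, the following minimal Python sketch shows one way a record of Table 1 could be represented; the field names and sample values are illustrative only and are not taken from the authors' implementation (which is written in Java). A point such as c5 is read here as location c at time stamp 5, in line with the examples later in the paper.

    # A minimal sketch of one record of Table 1 (illustrative names, not the authors' code).
    from dataclasses import dataclass
    from typing import List, Tuple

    # A spatial-temporal point: (location label, time stamp), e.g. ("c", 5) for c5.
    Point = Tuple[str, int]

    @dataclass
    class Record:
        name: str                 # explicit identifier (EI), removed before publishing
        trajectory: List[Point]   # quasi-identifier (QI)
        disease: str              # sensitive attribute (SA)

    alice = Record(name="Alice Freeman",
                   trajectory=[("c", 5), ("f", 6), ("e", 9)],
                   disease="SARS")

    # Publishing drops the EI but keeps the QI and the SA.
    published = {"trajectory": alice.trajectory, "disease": alice.disease}
    print(published)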
If the attacker has limited background knowledge of a certain trajectory sequence, the following three attacks are mostly considered in current approaches: the record linkage attack, the attribute linkage attack, and the similarity attack [4, 6].

Table 1 Original table (excerpt): ID, Name, Trajectory, Disease. Record 1: Alice Freeman, c5 → f6 → e9, SARS. Record 7: Georgia Ishtar, e4 → f6 → e8, Fever.

- Record linkage attack. An adversary could identify the unique record of the victim from the published table according to a certain trajectory sequence whose length is no more than m. For example, when an adversary gets the background knowledge of Alice's trajectory sequence d2 → e4, the adversary can infer that the 1st record in Table 2 belongs to Alice. As a result, Alice's record in Table 2 is leaked.
- Attribute linkage attack. An adversary could infer the sensitive attribute of the victim from the published table according to a certain trajectory sequence whose length is no more than m. For example, when an adversary gets the background knowledge of Bob's trajectory sequence c5 → c7, the adversary can infer that Bob's record is either the 2nd or the 5th in Table 2. Because the two records have the same disease Flu, the adversary can infer that Bob has the Flu.
- Similarity attack. An adversary could infer the sensitive attribute category of the victim from the published table according to a certain trajectory sequence whose length is no more than m together with a semantic dictionary which contains the semantic relevance among sensitive attributes. For example, when an adversary gets the background knowledge of Tom's trajectory sequence c7, the adversary can infer that Tom may suffer from Flu, Fever, or SARS in Table 2. Based on the semantic dictionary in which Flu and SARS both belong to lung infections, the adversary can learn that the probability of Tom's lung infection is 4/5.

These attacks generally cause identity disclosure, attribute disclosure, and similarity disclosure [4]. Identity disclosure refers to re-identifying a target user from some background knowledge. Attribute disclosure occurs when some QI values can be linked to a specific SA value with a high probability. Similarity disclosure happens when some similar QI values can be linked to a type of SA values with a high probability.
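The following short Python sketch makes the three inference probabilities above concrete on a toy published table. The rows only mimic the Bob and Tom examples in the text; Table 2 itself is not reproduced in this excerpt, so the data and the semantic dictionary below are assumptions.

    # Illustrative computation of record, attribute and similarity linkage probabilities.
    from collections import Counter

    published = [  # (trajectory, disease) rows; these mimic, not reproduce, Table 2
        (["d2", "e4", "e9"], "SARS"),    # Alice's record
        (["c5", "c7", "e8"], "Flu"),
        (["c7", "f6"],       "Fever"),
        (["b3", "c7"],       "SARS"),
        (["c5", "c7", "f6"], "Flu"),
        (["a1", "c7"],       "SARS"),
    ]
    lung_infections = {"Flu", "SARS"}    # toy semantic dictionary

    def matches(traj, seq):
        # True if seq occurs in traj as an ordered (not necessarily contiguous) subsequence.
        it = iter(traj)
        return all(p in it for p in seq)

    def attack(seq):
        candidates = [d for t, d in published if matches(t, seq)]
        p_record = 1 / len(candidates)                                        # record linkage
        p_attribute = max(Counter(candidates).values()) / len(candidates)     # attribute linkage
        p_similarity = sum(d in lung_infections for d in candidates) / len(candidates)  # similarity
        return p_record, p_attribute, p_similarity

    print(attack(["d2", "e4"]))  # Alice: a single candidate record is left
    print(attack(["c5", "c7"]))  # Bob: both candidates share Flu
    print(attack(["c7"]))        # Tom: 4 of 5 candidates are lung infections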
To prevent the above three kinds of disclosure caused by the background knowledge attack, where an adversary has some prior knowledge (or auxiliary information) about the target of his attack, some anonymization operations should be taken to modify the original table. The typical anonymization approaches in publishing trajectory data include generalization, suppression, and perturbation [4, 6]. Generalization and suppression aim to replace values of specific attributes with less specific values. For trajectory data, generalization and suppression may eliminate a certain number of moving points by replacing some spatial-temporal points with a broader category or the wildcard "*". In perturbation, the data are distorted by adding noise, swapping values, or generating synthetic data. Comparatively, perturbation can protect privacy by distorting the dataset while keeping some statistical properties [6], whereas generalization and suppression cause a significant loss of data utility. To protect user privacy while ensuring data utility, we propose an Enhanced l-diversity Data Privacy Preservation scheme for publishing trajectory data (called EDPP). Compared with t-closeness, k-anonymity and l-diversity can resist identity disclosure [7]. Compared with k-anonymity, l-diversity can provide stronger privacy preservation by guaranteeing l different sensitive attribute values in a group [8]. To resist attribute disclosure and similarity disclosure, we propose our (l, α, β)-privacy model, where l-diversity ensures that each trajectory sequence matches more than l types of SA values in the published table, α-privacy ensures that the probability of determining each SA value is not greater than α, and β-privacy guarantees that the probability that an attacker obtains similar SA values is not larger than β. To summarize, this paper has the following contributions:

- We propose our (l, α, β)-privacy model to resist the attacks based on background knowledge, including the record linkage, attribute linkage and similarity attacks, without changing any sensitive attribute. The three parameters l, α and β, which are used to prevent identity disclosure, attribute disclosure and similarity disclosure respectively, can be set based on the requirements of data owners.
- We design a novel perturbation approach by executing addition or subtraction operations on the chosen critical sequences, based on which the attacker could otherwise infer some sensitive information about an individual. Compared with generalization and suppression, perturbation can keep the statistical properties of the original trajectory data.
- Our privacy analysis proves that our EDPP scheme can meet the l, α and β privacy requirements of our model.
- We evaluate the performance through extensive simulations based on a real-world data set. Compared with PPTD [4], KCL-Local [9] and DPTD [10], our EDPP is superior in terms of data utility ratio and privacy.

The remainder of this paper is organized as follows. In Sect. 2, we discuss the related work. The privacy model is given in Sect. 3. In Sect. 4, we present the details of our approach. The privacy analysis is given in Sect. 5. Simulations on data utility are presented in Sect. 6. Finally, we conclude our work in Sect. 7.

Different from those studies which have investigated the re-identification attack or the semantic attack, i.e., re-identifying an individual or inferring semantic information about the victim's visited locations from the published trajectory dataset, we aim to prevent attacks based on background knowledge and to protect the privacy of an individual's sensitive attribute, such as a disease linked through frequently visited locations. In this section, we only discuss works that are related to our approach. Generalization replaces some QI values with a broader category, such as a parent value in the taxonomy of an attribute. In [4], sensitive attribute generalization and trajectory local suppression were combined to achieve a tailored personalized privacy model for trajectory data publication. In [11], an effective generalization method was proposed to achieve k^{τ,ε}-anonymity in spatiotemporal trajectory data. Combining suppression and generalization, a dynamic trajectory releasing method based on adaptive clustering was designed to achieve k-anonymity in [12]. In [13], a new approach that uses frequent paths to construct k-anonymity was proposed. In the suppression method, a certain number of moving points are eliminated from the trajectory data. In [14], extreme-union and symmetric anonymization were proposed to build anonymous groups and avoid a moving object being identified through the correlation between anonymization groups. [9] was the first paper to adopt suppression to prevent record linkage and attribute linkage attacks.
To thwart record linkage on identities, a passenger flow graph was first extracted from the raw trajectory data to satisfy the LK-privacy model [15]. In [16], k^m-anonymity was proposed to suppress the critical location points chosen from quasi-identifiers to protect against the record linkage attack. In [17], location suppression and trajectory splitting were used to prevent privacy leaks and improve the data utility of aggregate queries and frequent sequences. Perturbation aims to protect privacy while limiting the upper bound of the utility loss. Recently, differential privacy has become a main form of data perturbation [18]. Differential privacy aims to maximize the accuracy of queries from statistical databases while minimizing the chances of identifying their records. In [19], differential privacy was first adopted to protect the privacy of trajectory data. Different from the traditional method in which privacy was achieved by perturbing the query result [19], sampling and interpolation were combined to achieve differential privacy [20]. Differentially private synthetic trajectories were first proposed in [21]. The original database was built as a prefix tree, where trajectories are grouped based on the length of the matching location subsequences; spatial generalization was then combined to protect trajectory privacy at each tree layer. To solve the problem that frequent sequential patterns can be identified in [21], differential privacy was applied to sequential data by extracting the essential information in the form of variable-length n-grams [21]. In [22], a model-based prefix tree was also constructed and a candidate set of substring patterns was determined; the frequency of the substring patterns was then further refined to transform the original data. The problem of constructing a differentially private synopsis for a two-dimensional dataset was tackled in [23], where the uniform-grid approach was used as the partition granularity to balance the noise error and the non-uniformity error. Based on the work in [21], a prediction suffix tree model of trajectory micro-data was proposed to automatically adapt the tree height to the data [24], and multiple prefix trees corresponding to different spatial resolutions were proposed to ensure strong privacy protection in the form of ε-differential privacy [25]. Hua et al. proposed a generalization algorithm for differential privacy that merges nodes based on their distances [26]. To solve the problem of random and unbounded noise in [26], Li et al. proposed a novel differentially private algorithm with bounded noise generation [10]. To address the privacy of continuous publication in population statistics, a monitoring framework with a w-event privacy guarantee was designed [27], including adaptive budget allocation, dynamic grouping and perturbation. In [28], an n-body Laplace framework was proposed to prevent the inference of social relations through the correlation between trajectories. A methodical framework for publishing trajectory data with a differential privacy guarantee as well as high utility preservation was designed by automatically splitting the privacy budget among the different trajectory sequences [29]. As introduced before, compared with generalization and suppression, perturbation can protect privacy by distorting the dataset while keeping some statistical properties, which causes less loss of data utility. We therefore prefer the perturbation technique to design our privacy preservation scheme.
In our previous work, we proposed a privacy model based on perturbation to resist attacks based on the critical trajectory sequences [30]. To the best of our knowledge, we are the first to propose a perturbation approach to protect the sensitive attribute of published trajectory data. However, our previous work ignored the special case that adding points on the critical sequences may bring new critical sequences. Besides, the data owner cannot set the privacy parameters flexibly based on his privacy requirements. To solve the above problems, we propose a privacy model called the (l, α, β)-privacy model to resist the record linkage, attribute linkage and similarity attacks without changing any sensitive attribute, and to further prevent identity disclosure, attribute disclosure and similarity disclosure.

In this paper, we focus on publishing trajectory data as in Table 1 while protecting the privacy of the sensitive attribute, such as Disease, against attackers with background knowledge about the trajectory. In Table 1, each record corresponds to one individual and contains an identifier as well as a set of geo-referenced and time-stamped elements, or spatiotemporal points [18]. These spatiotemporal points constitute an individual's trajectory as one kind of quasi-identifier. Therefore, each trajectory is a sequence of geographical positions of a monitored individual over time in the form (ID, loc, t), where ID is the owner's unique identifier, loc is the owner's location, and t is a time stamp. The set of locations is arranged in chronological order to form a trajectory L_t, which is defined as follows: L_t = (loc_1, t_1) → (loc_2, t_2) → ⋯ → (loc_n, t_n), where n is the length of the trajectory, t_i is the time stamp, and loc_i represents the owner's location at t_i. A trajectory sequence is a non-empty subset of a trajectory, and the length of a sequence is the number of spatiotemporal points contained in it.

In this paper, we mainly consider the record linkage attack, attribute linkage attack and similarity attack based on background knowledge [4]. Generally, background knowledge is a part of the victim's information, in this paper a sequence of spatiotemporal points. How different attackers obtain the background knowledge is not considered in our scheme. We only need to consider the maximum background knowledge over all adversaries to design our preservation approach. The maximum background knowledge corresponds to the maximum length m of the trajectory sequence in this work, which ensures that all adversaries launch attacks within the range of m. To resist the record linkage attack, attribute linkage attack and similarity attack based on the trajectory sequence, we define our (l, α, β)-privacy model in this paper. l-diversity ensures that each trajectory sequence whose length is no more than m matches more than l types of SA values in the published table. α-privacy ensures that the probability of determining each SA value is not greater than α. β-privacy guarantees that the probability of obtaining similar SA values is not larger than β. Given the original trajectory table T and the three privacy parameters l, α and β, our goal is to anonymize T into T* that satisfies the (l, α, β)-privacy model, i.e., each record in T* simultaneously satisfies l-diversity, α-sensitive-association and β-similarity-association. First, we define Q = {q_1, q_2, …, q_n} as the sequence set of an attacker's background knowledge.
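A minimal Python sketch of a trajectory and of enumerating the sequences of length at most m that an adversary could hold as background knowledge is given below; the variable names are illustrative and this is not the authors' code.

    # Enumerating the candidate background-knowledge sequences of a trajectory.
    from itertools import combinations

    # A trajectory: spatial-temporal points (loc, t) in chronological order.
    trajectory = [("c", 5), ("f", 6), ("e", 9)]

    def sequences_up_to(traj, m):
        """All non-empty ordered subsequences of traj with at most m points."""
        seqs = []
        for length in range(1, min(m, len(traj)) + 1):
            seqs.extend(combinations(traj, length))
        return seqs

    # With m = 2, the adversary's knowledge is any one of these sequences:
    for q in sequences_up_to(trajectory, 2):
        print(q)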
For each q_i ∈ Q, we have |q_i| ≤ m, where m is the maximum length of the background-knowledge sequence.

Definition 2 (l-diversity) T* satisfies l-diversity if, for every q_i ∈ Q, the set of SA values matched by q_i in T*, denoted ASA(q_i), contains at least l distinct values, i.e., |ASA(q_i)| ≥ l.

Definition 3 (α-sensitive-association) T* satisfies α-sensitive-association if, for every record r, the probability of inferring r's SA value with the background knowledge ∀q_i ∈ Q satisfies Pr[ASA(r)] ≤ α for 0 ≤ α ≤ 1. For example, an adversary may know that Bob and Freeman both possess the trajectory sequence f6 → e9; from Table 2, the records containing f6 → e9 form the candidate set from which the SA value is inferred.

Definition 4 (β-similarity-association) All the records can be divided into k groups T = {g_1, g_2, …, g_k} according to the SA value type, where g_j represents the j-th group. T* satisfies β-similarity-association if the probability of inferring the right group g_j of a record r satisfies Pr[r ∈ g_j] ≤ β for 0 ≤ β ≤ 1 with the background knowledge ∀q_i ∈ Q. For example, the records in Table 2 can be grouped by disease category; the probability that the adversary assigns the victim's record to the correct category must then stay below β.

Our main research goal is to protect the SA privacy while retaining the utility of the published data. In this section, we first introduce our basic framework and then elaborate the details of EDPP. The major notations used in this section are listed in Table 3; in particular, max_α denotes the number of records whose SA value has the most records in T(q), and max_β denotes the number of records whose category has the most records in T(q).

Our EDPP scheme includes two processes: (1) determining the critical sequences for a given length of trajectory segment, and (2) performing the anonymization operation. A critical sequence is a part of a trajectory which meets the predefined length but whose matched SA values do not meet the (l, α, β)-privacy model. The anonymization operation aims to make each SA value satisfy the (l, α, β)-privacy model by adding or deleting moving points in each sequence. EDPP includes the following procedures: (1) The Explicit Identifier (EI) is first removed from the original table (Table 1). (2) To determine the critical sequences, we find all possible sequences of length no more than m whose SA values do not satisfy the (l, α, β)-privacy model. (3) By adding or subtracting points in each sequence obtained from Step (2), we either make the corresponding SA values of this sequence satisfy l-diversity or eliminate this sequence. (4) By adding trajectory points in each sequence obtained from Step (2), we make the corresponding SA values of each sequence satisfy α-sensitive-association and β-similarity-association. Similarly, we make all the sequences of length no more than m satisfy the α or β requirement by adding points.

As mentioned before, our (l, α, β)-privacy model can guarantee that the published data T* satisfy the l, α and β privacy requirements to resist the record linkage attack, attribute linkage attack and similarity attack. In this subsection, we give the definitions of the l, α and β requirements.

l requirement: Based on any trajectory sequence q_i ∈ Q, the inferred total number of distinct SA values |ASA(q_i)| must be larger than l. We define c_s^i as the inferred total number of distinct SA values based on q_i. The probability of inferring the target individual's record r then satisfies Pr[r] ≤ 1 / c_s^i ≤ 1 / l.

α requirement: For each trajectory sequence q_i ∈ Q, the probability of inferring the target individual's SA in a specific record, Pr[ASA(r)], must be less than α. We define c_f^i as the maximum number of identical SA values and c_t^i as the number of inferred records based on q_i. The probability of inferring the right SA value is bounded by the ratio between c_f^i and c_t^i, i.e., Pr[ASA(r)] ≤ c_f^i / c_t^i. To satisfy α-sensitive-association, each c_f^i should satisfy c_f^i ≤ α · c_t^i.

β requirement: For each trajectory sequence q_i ∈ Q, the probability of inferring the right group g_j to which the target individual's record r belongs, Pr[r ∈ g_j], must be smaller than β. We define c_g^i as the maximum number of the same type of SA values inferred according to q_i. The probability of inferring the right group of r then satisfies Pr[r ∈ g_j] ≤ c_g^i / c_t^i. To satisfy β-similarity-association, each c_g^i should satisfy c_g^i ≤ β · c_t^i.
In what follows, we give the detailed algorithm for each step of the above EDPP scheme. Recall that m is the upper bound of the attacker's background knowledge on the trajectory sequence; our goal is to identify all the critical sequences of length no more than m in T. A critical sequence is defined as follows:

Definition 5 (critical sequence) A sequence q with |q| ≤ m is a critical sequence, denoted CS(q), if |ASA(q)| < l and |ASA(q_i)| ≥ l for every proper subsequence q_i of q, i.e., ∀q_i ⊂ q.

Based on the above definition, we can state two assertions:

Assertion 1 An anonymized table T* satisfies the l-diversity requirement if and only if it satisfies ∀q ∈ T*: CS(q) → |q| > m, where CS(q) represents that q is a critical sequence.

Proof. Let T* satisfy CS(q) → |q| > m for all q ∈ T*, and let q be a sequence in T* with |q| ≤ m. Based on Definition 5, q is not a critical sequence. Then we get |ASA(q)| ≥ l according to Definition 5. In this case, T* satisfies l-diversity according to Definition 2. Conversely, let q be a critical sequence in T* with |q| ≤ m. Then T* does not satisfy the l-diversity requirement according to Definition 2.

Assertion 2 A critical sequence q is no longer a critical sequence after eliminating a spatial-temporal point p with p ∈ q.

Proof. Let q be a critical sequence and p a spatial-temporal point in q. After eliminating p from the original sequence q, we get a new sequence q_i with q_i ⊂ q. Obviously, we have |ASA(q_i)| ≥ l. Based on Definition 5, q_i is not a critical sequence.

According to the two assertions, we can anonymize T into T* to satisfy the l-diversity requirement by eliminating all critical sequences of length no more than m. The following steps are used to determine the critical sequences:

Step 1: First, we obtain all the sequences of length no more than m from T.
Step 2: For each sequence q, if the α requirement or the β requirement is not satisfied, q is added into a list called QNAB.
Step 3: Then, we treat these sequences as vertices. If two sequences q_1 and q_2 satisfy ||q_1| − |q_2|| = 1 ∧ (q_1 ⊂ q_2 ∨ q_2 ⊂ q_1), we add an edge between the corresponding vertices. By repeating this step, we obtain an m-partite graph G, whose vertex set can be partitioned into m subsets according to the sequence length from 1 to m. As a result, we obtain a layered graph ordered by sequence length, where all sequences in a layer have the same length and the smaller lengths are in the upper layers. Figure 1 is an example of a 3-partite graph.
Step 4: For each sequence q in G, q is deleted, from the top layer downward, if |ASA(q)| ≥ l holds.
Step 5: Step 4 is repeated until no sequence q ∈ G in the top layer satisfies |ASA(q)| ≥ l. Figure 2 shows an example after a sequence is deleted from the top layer in Fig. 1.
Step 6: For each sequence q ∈ G, q is added into a list called QNL if it is not in the top layer; otherwise, q is inserted into a list called QCQ.
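The following simplified Python sketch detects critical sequences by checking Definition 5 directly rather than building the m-partite graph of Steps 3–6; the toy table, ASA() and the helper names are illustrative assumptions, not the authors' implementation.

    # Simplified critical-sequence detection following Definition 5.
    from itertools import combinations

    table = [  # (trajectory, SA value) -- toy data
        (("a1", "d2", "b3"), "Flu"),
        (("a1", "d2", "c5"), "Flu"),
        (("d2", "b3", "e4"), "SARS"),
        (("a1", "b3", "e4"), "Fever"),
    ]

    def ASA(q):
        """Distinct SA values of records whose trajectory contains q as an ordered subsequence."""
        def contains(traj, seq):
            it = iter(traj)
            return all(p in it for p in seq)
        return {sa for traj, sa in table if contains(traj, q)}

    def critical_sequences(m, l):
        crit = []
        seqs = {q for traj, _ in table
                  for k in range(1, m + 1)
                  for q in combinations(traj, k)}
        for q in seqs:
            if len(ASA(q)) >= l:
                continue  # q itself already satisfies the l requirement
            subseqs = [s for k in range(1, len(q)) for s in combinations(q, k)]
            if all(len(ASA(s)) >= l for s in subseqs):
                crit.append(q)  # minimal violating sequence (Definition 5)
        return crit

    print(critical_sequences(m=2, l=2))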
To achieve l-diversity better, we try to eliminate a common spatial-temporal point from the sequences in QCQ. Therefore, we collect statistics on each point over all sequences of QCQ and determine which point should be deleted.

Step 1: We make statistics on the spatial-temporal points in all the sequences of QCQ and rank these points by their occurrence frequency. Then, we eliminate the point p ranked first from the sequences in QCQ that include p.
Step 2: The last step ensures that the newly generated sequences in QCQ are not critical ones, and they are removed from QCQ. In this step, we also delete p from the sequences in QNL that include it, where the newly generated critical sequences are moved to QCQ and the non-critical sequences satisfying the l requirement are removed from QNL. To achieve this, we rely on G to determine the newly generated critical sequences in QNL: first, we delete p in G; then, we execute Step 5 of the previous subsection so that the newly generated top layer contains all critical sequences; finally, we update QNL and QCQ according to Step 6 of the previous subsection.
Step 3: Steps 1 and 2 are repeated until G is empty. After Steps 1 to 3 are executed in each round, the total number of sequences in G decreases. Consequently, our algorithm is strictly convergent no matter what l is.
Step 4: If the α requirement or the β requirement is not satisfied, q will be added into QNAB.

Before publishing T*, we adopt the addition operation to achieve the α requirement and the β requirement on those sequences that already satisfy l-diversity. For a sequence q in QNAB, the steps of the addition operation are as follows. First, we choose the records whose SA values do not belong to ASA(q) to execute the addition. In order to insert a trajectory point at a time stamp, we must ensure that no point in the selected record is already associated with that time, as a person cannot appear in two different places at the same time; otherwise, the record cannot be modified and will not be chosen. Besides, adding a new point to a record may produce more than one new sequence within the limited length m. Consequently, we must choose the records strictly so that no new critical sequences belonging to Q are generated after the addition operation. Then, we sort the chosen records in descending order of the Longest Common Subsequence (LCS). The LCS is a sequence of points common to q and a chosen record. For example, the LCS of the sequence a1 → d2 → b3 and the record a1 → d2 → c5 → f6 → c7 is a1 → d2.

Step 1: For each q, we first pick some records to execute the addition operation. To satisfy the α requirement and the β requirement, a record satisfying the following two conditions will be chosen: (1) its SA value is not the one which has the maximum number of records, max_α, in T(q); and (2) it does not belong to the category which possesses the maximum number of records, max_β, in T(q). These two conditions ensure that the worst case still meets the α requirement and the β requirement. For example, the sequence f6 → e8 has five corresponding records in Table 1: the 1st, 3rd, 4th, 7th and 9th. The corresponding SA values are HIV, SARS, Fever, Fever and Fever, so Fever possesses the maximum number of records. If we set α to 50%, we should select another record, such as the 2nd one, to construct q in order to reduce the probability of inferring Fever; after adding e8 to the 2nd record, this probability is 50%. Similarly, we prefer the records not belonging to the category which possesses the maximum number of records. Furthermore, all the chosen records are sorted in descending order of the LCS between q and each record.

Step 2: For each q, we compute num_p, the number of records that need the addition operation to satisfy the α requirement, and num_g, the number of records to be added to satisfy the β requirement. We use max(num_p, num_g) to denote the maximum of num_p and num_g. According to the first max(num_p, num_g) chosen records, we compute the metric PriGain to strike a balance between privacy protection and utility loss. PriGain(q) is defined as follows:

PriGain(q) = [λ ΔH_s(q) + (1 − λ) ΔH_c(q)] / W(q),

where H_s^{T*}(q) and H_s^{T}(q) represent the entropy of SA values in T*(q) and T(q), respectively.
ΔH_s(q) = H_s^{T*}(q) − H_s^{T}(q) represents the entropy difference. H_c^{T*}(q) and H_c^{T}(q) represent the entropy of categories in T*(q) and T(q), respectively, and ΔH_c(q) represents the difference in category entropy. k is the number of categories. λ is a weight constant representing the impact factor of ΔH_s(q). A bigger λ ΔH_s(q) + (1 − λ) ΔH_c(q) brings more privacy protection. The utility loss W(q) after anonymization is defined as W(q) = Σ_i num_i · w_i, where num_i represents the number of times that the i-th point needs to be added, and w_i is the weight value of the i-th point. w_i is defined as the reciprocal of the number of occurrences of the i-th point in all the critical sequences of QNAB. If a point occurs more frequently, it means the point is required by more sequences to meet their privacy requirements. So its addition may benefit more sequences, and fewer points overall need to be added to make the table meet the privacy requirement. As an example, consider the sequences a1 → b3, a1 → c5 and a1 → e4. To process the 1st sequence, a1 may be added into several records. This may make some records contain a1 → c5 or a1 → e4, which avoids modifying more records specifically for the two other sequences. Thus, adding a1 brings more usability and causes a lower utility loss. Finally, q is put into a list in which the elements are sorted in descending order of PriGain.

Step 3: In this step, we aim to add points to the above selected records to achieve the α requirement and the β requirement. We choose records from the list generated in Step 1 and add points to them to form q, until max(num_p, num_g) records have been processed. During this process, we will not add points into a record if the number of records that possess the same SA value has reached max_α or the number of records associated with a category has reached max_β. Then, q is removed from QNAB. If any revised record cannot be further modified to construct a new record for the next sequence(s), it will be deleted from the candidate record list of the corresponding sequences, and a new candidate needs to be selected as done in Step 1. For example, suppose e5 has been added into one record for the 1st sequence; this record cannot be used by another sequence if a different location needs to be attached with the time stamp 5. The above process is repeated until nothing is left in the list.

Step 4: Eventually, we obtain the anonymous data T* satisfying the (l, α, β)-privacy model.

As discussed in the above sections, the data owners can set the three parameters l, α, and β according to their privacy requirements. In this subsection, we give some instructions on how to set the parameters reasonably. l is used to resist the record linkage attack; a bigger l represents a smaller probability of successfully launching the record linkage attack. The α parameter is used to resist the attribute linkage attack; the smaller α, the smaller the probability of successfully launching the attribute linkage attack. The β parameter is used to resist the similarity attack; the smaller β, the smaller the probability of successfully launching the similarity attack. The more privacy, the less data utility. When the data owner pays more attention to privacy than to data utility, he can set a bigger l with smaller α and β. On the contrary, if data utility is of more concern, a smaller l with bigger α and β is optimal. As a result, different owners can set l, α, and β within their tolerance.
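Before turning to the privacy analysis, the following Python sketch makes the entropy-based PriGain heuristic of Step 2 concrete. The exact closed form of PriGain is not reproduced in this excerpt, so the ratio below is an assumption consistent with the description: a larger weighted entropy gain and a smaller utility loss W(q) should rank a sequence higher.

    # Hedged sketch of the PriGain heuristic (formula assumed, see lead-in).
    from collections import Counter
    from math import log2

    def entropy(values):
        counts = Counter(values)
        total = sum(counts.values())
        return -sum(c / total * log2(c / total) for c in counts.values())

    def pri_gain(sa_before, sa_after, cat_before, cat_after, W, lam=0.5):
        d_hs = entropy(sa_after) - entropy(sa_before)    # SA-value entropy gain
        d_hc = entropy(cat_after) - entropy(cat_before)  # category entropy gain
        return (lam * d_hs + (1 - lam) * d_hc) / W       # balance gain against utility loss W(q)

    # Toy example: adding a record with a new SA value increases both entropies.
    print(pri_gain(["Fever", "Fever", "SARS"], ["Fever", "Fever", "SARS", "HIV"],
                   ["lung", "lung", "lung"],   ["lung", "lung", "lung", "immune"],
                   W=1.0))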
In this section, we prove that our EDPP can both satisfy the three privacy requirements of our (l, α, β)-privacy model and resist the corresponding attacks. These three parameters can be set based on the data owner's privacy requirements. We divide the sequences of length no more than m into two types in the original table T. Sequences of the first type do not satisfy the l requirement; they are put into QNL to execute the subtraction operation, and critical sequences of length no more than m are eliminated. Sequences of the second type satisfy the l requirement. After our anonymization approach, there is no critical sequence of length no more than m in T*. According to Assertion 1, T* therefore satisfies l-diversity. For the record linkage attack, the attacker aims to infer the exact record of the target individual (e.g., Alice) based on a trajectory sequence q_i with |q_i| ≤ m. l-diversity guarantees that at least l different records include q_i (i.e., |ASA(q_i)| ≥ l). Then the probability of inferring Alice's record is no more than 1/l, i.e., the probability of identity disclosure is no more than 1/l. In conclusion, our EDPP scheme satisfies the l privacy requirement and resists the record linkage attack.

To satisfy α-sensitive-association and β-similarity-association, we perform the addition operation for num_p and num_g records including q of length no more than m, based on Definitions 3 and 4. To simplify our algorithm, the max(num_p, num_g) records are selected to construct q. Because max_α and max_β are constant, the following inequalities hold:

max_α / (|T(q)| + max(num_p, num_g)) ≤ max_α / (|T(q)| + num_p) ≤ α, and
max_β / (|T(q)| + max(num_p, num_g)) ≤ max_β / (|T(q)| + num_g) ≤ β,

which proves that all the sequences of length no more than m in T* satisfy both α-sensitive-association and β-similarity-association. For the attribute linkage attack, the attacker aims to infer the sensitive information of the target individual (e.g., Alice) based on a trajectory sequence q_i with |q_i| ≤ m. The probability of inferring Alice's SA value of record r, Pr[ASA(r)], is no more than α, which implies that the probability of attribute disclosure is no more than α. For the similarity attack, the attacker aims to infer the exact group of the target individual (e.g., Alice) based on the background knowledge of a trajectory sequence q_i with |q_i| ≤ m. The probability of inferring the right group g_j of Alice's record r, Pr[r ∈ g_j], is no more than β. Based on the β requirement, we have Pr[r ∈ g_j] ≤ β, which implies that the probability of similarity disclosure is no more than β.

Setup: We implement our EDPP algorithm in Java. We conduct all experiments on a Mac PC with an Intel Core i5 2.3 GHz CPU and 8 GB RAM.

Dataset: To evaluate the performance of our EDPP, we use a real-world dataset that joins the Foursquare dataset and the MIMIC-III dataset. The Foursquare dataset [31] is a real-world trajectory dataset containing the routes of 140,000 users in a certain area with 92 venues, sampled every hour, forming 2,208 dimensions. MIMIC-III [32] is a freely accessible critical care database. The SA is Disease, which contains 36 possible values, 9 of which are considered sensitive. The SA values are divided into 6 categories, one of which is private. Following [4], we match the diseases in MIMIC-III with the trajectories in Foursquare under a uniform distribution. We compare our EDPP with PPTD [4], KCL-Local [9] and DPTD [10]. KCL-Local adopts local suppression to anonymize the trajectory data and protect the sensitive information. Its (k, C)_m-privacy model adopts k-anonymity to prevent the record linkage attack, where C is the confidence threshold to resist the attribute linkage attack, i.e., the probability of each SA value is not greater than C.
In PPTD, sensitive attribute generalization and trajectory local suppression are combined to achieve a tailored personalized privacy model for the publication of trajectory data. In DPTD, a novel differentially private trajectory data publishing algorithm is proposed with bounded Laplace noise generation, and trajectory points are merged based on trajectory distances.

The aim of EDPP is to preserve the privacy of the published data while preserving the data utility. We use information loss to evaluate the utility; in this section, the trajectory information loss and the frequent sequences loss are used as metrics. Table 4 shows that the trajectory information loss and the frequent sequences loss increase slowly with l, because the subtraction or addition operation aims to minimize the number of changed points needed to satisfy l-diversity, which keeps the information loss from increasing much. In addition, both types of loss increase with m. However, when the number of records changes from 50K and 100K to 150K, both types of loss stay relatively stable. (Table 4: effect of l.) α varies from 0.1 to 0.5 for different combinations of l, β, and m. Table 5 shows that the information loss increases as α decreases, because more sequences do not satisfy α-sensitive-association. As discussed before, we select records based on the LCS and add points based on PriGain, which reduces the number of points to be added; as such, the information loss increases slowly. In addition, Table 5 shows that the information loss increases with m, while both types of loss remain relatively stable as the number of records changes from 50K and 100K to 150K. Under different numbers of records, for selected parameters l, α, and m, we vary β from 0.1 to 0.5. Similar to the effect of α, Table 6 shows that the information loss increases slowly as β decreases and as m increases. K′ varies from 50 to 130 with a set of representative parameters l = 3, α = 0.4, and β = 0.5. Figure 3 shows that the frequent sequences loss decreases as K′ increases, because the number of frequent sequences not satisfying (l, α, β) begins to drop with the increase of K′.

We use the disclosure risk as a metric to measure the probability of privacy breach for each sequence q: 1/|ASA(q)|, max_α/|T(q)|, and max_β/|T(q)| represent the probability of identity disclosure, attribute disclosure, and similarity disclosure, respectively. We randomly select 50K sub-trajectories of length no more than m from the anonymous database and calculate the probability of privacy disclosure for these sequences. Figure 4 shows that the average disclosure probability decreases with the increase of l and the decrease of α or β, because the privacy requirements become higher. Moreover, the average disclosure probability increases with m.

We also compare our EDPP with KCL-Local, PPTD and DPTD on trajectory information loss, frequent sequences loss and run time. Since these schemes adopt different privacy models, we cannot compare them directly. To have a fair comparison, we modify our algorithm EDPP to implement the (k, C)_m-privacy model as used in KCL-Local, called EDPP-KC. The ε used in the differential privacy method DPTD is assigned as P_dis(ε) = P_dis(k, C) to keep the disclosure risk at the same level as that of the other three schemes, where P_dis(k, C) represents the disclosure probability under different k and C, and P_dis(ε) represents the disclosure probability under different ε, which is determined according to the disclosure risk level. Here, 1/|ASA(q)| and max_α/|T(q)| again denote the probability of identity disclosure and attribute disclosure, respectively.
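An illustrative Python sketch of the disclosure-risk metric is given below: for sampled sub-trajectories q it averages 1/|ASA(q)|, max_α/|T(q)| and max_β/|T(q)| over the sample. The data layout, helper names and toy table are assumptions, not the authors' evaluation code.

    # Averaging the three disclosure probabilities over sampled sequences.
    from collections import Counter

    def disclosure_risk(anonymous_table, sample, match):
        """anonymous_table: list of (trajectory, sa_value, category);
           sample: list of background sequences q; match(traj, q) tests containment."""
        id_risk = attr_risk = sim_risk = 0.0
        for q in sample:
            T_q = [(sa, cat) for traj, sa, cat in anonymous_table if match(traj, q)]
            sa_counts = Counter(sa for sa, _ in T_q)
            cat_counts = Counter(cat for _, cat in T_q)
            id_risk += 1 / len(sa_counts)                      # 1 / |ASA(q)|
            attr_risk += max(sa_counts.values()) / len(T_q)    # max_alpha / |T(q)|
            sim_risk += max(cat_counts.values()) / len(T_q)    # max_beta / |T(q)|
        n = len(sample)
        return id_risk / n, attr_risk / n, sim_risk / n

    def contains(traj, q):
        it = iter(traj)
        return all(p in it for p in q)

    table = [(("a1", "b3"), "Flu", "lung"), (("a1", "c5"), "SARS", "lung"),
             (("b3", "c5"), "HIV", "immune")]
    print(disclosure_risk(table, sample=[("a1",), ("b3",)], match=contains))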
k varies from 5 to 25 with C = 0.5, m = 3 and K′ = 50 under 140K records. Figure 5 shows that both kinds of loss increase with k, because more sequences fail to satisfy k-anonymity, which causes higher information loss. Our EDPP-KC has the best performance because we aim to minimize the number of changed points. KCL-Local has the worst performance because too many moving points are eliminated from the trajectory data in the global suppression. DPTD generates Laplace noise to achieve differential privacy. As ε decreases in Fig. 5, DPTD achieves better privacy; however, the larger noise causes more trajectory information loss and frequent sequences loss than PPTD. PPTD only handles the sensitive records that may cause the privacy disclosure, so PPTD has a lower information loss than DPTD. C varies from 0.1 to 0.5 with k = 5, m = 3 and K′ = 50 under 140K records. In Fig. 6, both types of information loss decrease as C increases because fewer sequences fail to satisfy the confidence threshold C, making the loss lower. Similar to the above discussion, EDPP-KC has the best performance and KCL-Local the worst. As ε decreases, the trajectory information loss and frequent sequences loss of DPTD become greater, yet remain slightly better than those of KCL-Local. Compared with KCL-Local, PPTD, and DPTD, the trajectory information loss of EDPP can be improved by up to 76.90%, 48.17% and 72.86%, respectively, and the frequent sequences loss can be improved by up to 71.03%, 28.99% and 69.32%, respectively. KCL-Local shows good performance on run time because only suppression is adopted. In PPTD, sensitive attribute generalization and trajectory local suppression are combined to achieve the privacy, which causes the longest run time. In EDPP-KC, much of the time is spent determining the critical sequences.

We design and implement an anonymization technique named EDPP to protect the sensitive attribute during the publication of trajectory data. To resist the record linkage, attribute linkage and similarity attacks based on background knowledge of critical sequences, we adopt perturbation to process these sequences by adding or deleting some moving points, so that the published data satisfy our (l, α, β)-privacy model. Our performance studies based on a comprehensive set of real-world data demonstrate that EDPP can provide higher data utility compared to peer schemes. Our privacy analysis shows that EDPP can provide better privacy for the sensitive attribute. In future work, we will optimize our algorithm to handle extremely large trajectory datasets with the aid of indexing and pruning.
References
[1] RFID technology for IoT-based personal healthcare in smart spaces
[2] The trajectory of recovery and the inter-relationships of symptoms, activity and participation in the first year following total hip and knee replacement
[3] Hygeia: a practical and tailored data collection platform for mobile health
[4] PPTD: preserving personalized privacy in trajectory data publishing by sensitive attribute generalization and trajectory local suppression
[5] Privacy models for big data: a survey
[6] Privacy-preserving data publishing: a survey of recent developments
[7] t-closeness: privacy beyond k-anonymity and l-diversity
[8] l-diversity: privacy beyond k-anonymity
[9] Privacy-preserving trajectory data publishing by local suppression
[10] Achieving differential privacy of trajectory data publishing in participatory sensing
[11] Preserving mobile subscriber privacy in open datasets of spatiotemporal trajectories
[12] The privacy preserving method for dynamic trajectory releasing based on adaptive clustering
[13] Novel privacy-preserving algorithm based on frequent path for trajectory data publishing
[14] Anonymizing moving objects: how to hide a MOB in a crowd?
[15] Anonymizing trajectory data for passenger flow analysis
[16] A distributed approach for privacy preservation in the publication of trajectory data
[17] Local suppression and splitting techniques for privacy preserving publication of trajectories
[18] Privacy in trajectory micro-data publishing: a survey
[19] Differentially private trajectory data publication
[20] Publishing trajectory with differential privacy: a priori vs. a posteriori sampling mechanisms
[21] Differentially private transit data publication: a case study on the Montreal transportation system
[22] A two-phase algorithm for mining sequential patterns with differential privacy
[23] Differentially private grids for geospatial data
[24] PrivTree: a differentially private algorithm for hierarchical decompositions
[25] DPT: differentially private trajectory synthesis using hierarchical reference systems
[26] Differentially private publication of general time-serial trajectory data
[27] Real-time and spatio-temporal crowdsourced social network data publishing with differential privacy
[28] Releasing correlated trajectories: towards high utility and optimal differential privacy
[29] Differentially private and utility preserving publication of trajectory data
[30] Publishing sensitive trajectory data under enhanced l-diversity model
[31] Friendship and mobility: user movement in location-based social networks
[32] MIMIC-III, a freely accessible critical care database