
A Solution to Privacy Preservation in Publishing Human Trajectories

  • Li, Xianming (School of Computer Science and Technology, University of Science and Technology of China) ;
  • Sun, Guangzhong (School of Computer Science and Technology, University of Science and Technology of China)
  • Received : 2017.07.09
  • Accepted : 2019.08.15
  • Published : 2020.08.31

Abstract

With the rapid development of ubiquitous computing and location-based services (LBSs), human trajectory data and associated activities are increasingly easily recorded. Inappropriately publishing trajectory data may leak users' privacy. Therefore, we study publishing trajectory data while preserving privacy, denoted privacy-preserving activity trajectories publishing (PPATP). We propose S-PPATP to solve this problem. S-PPATP comprises three steps: modeling, algorithm design and algorithm adjustment. During modeling, two user models describe users' behaviors: one based on a Markov chain and the other based on the hidden Markov model. We assume a potential adversary who intends to infer users' privacy, defined as a set of sensitive information. An adversary model is then proposed to define the adversary's background knowledge and inference method. Additionally, privacy requirements and a data quality metric are defined for assessment. During algorithm design, we propose two publishing algorithms corresponding to the user models and prove that both algorithms satisfy the privacy requirement. Then, we perform a comparative analysis on utility, efficiency and speedup techniques. Finally, we evaluate our algorithms through experiments on several datasets. The experimental results verify that our proposed algorithms preserve users' privacy. We also test utility and discuss the privacy-utility tradeoff that real-world data publishers may face.

Keywords

1. Introduction

The last few years have seen the rapid development of smartphones and emerging location-based services (LBSs). LBS providers can effectively obtain user locations by GPS sensors as well as users’ activities by various sensors [1][2] and specific applications (microblogs, tweets, etc.). LBS providers collect a large quantity of data for further use, such as point of interest (POI) recommendations and advertisements. Additionally, people are paying increasing attention to the big data collected by traditional smart card systems, which are most widely used in campuses and transportation systems. The locations and implicit activity information in the data are helpful for various data mining tasks, such as analysis of lifestyles and some personalized services.

In the real world, human trajectory data and activity information collected by an organization need to be published to other organizations for various reasons, such as scientific research and administrative regulations [3]. Since raw data usually contain individual sensitive information, publishing such data may result in privacy leakage, which has a negative effect on both data publishers and users. Ensuring that the published data remain useful in practice while protecting individual privacy is quite challenging and thus attracts increasing attention [4].

Trajectory data are quite different from relational data, as studied in privacy-preserving data publishing (PPDP) [5]. The dependence between consecutive records may be exploited by potential adversaries to infer users' privacy. Furthermore, the attached activities may be considered sensitive by users, so they should also be covered by privacy preservation mechanisms [6]. Many studies have addressed privacy-preserving trajectory data publishing [3][7], but these approaches either do not consider activity information or ignore the dependence between records and therefore cannot be applied to publishing activity trajectories.

Therefore, we study the problem of privacy-preserving activity trajectories publishing (PPATP) and propose S-PPATP, a solution to PPATP. S-PPATP consists of three steps. First, we formulate the problem by making necessary assumptions and defining appropriate parameters. Second, we devise privacy checking algorithms to guarantee that the published data satisfy the privacy requirement and optimize the data quality. Finally, we adjust the algorithms to meet practical requirements. In summary, we make the following contributions:

● Study PPATP. The difference between PPATP and the previous data publishing problems lies in the fact that the data contain users’ activity information. To solve this problem, we propose a three-step solution, S-PPATP, which involves modeling, algorithm design and algorithm adjustment.

● Formulate PPATP from the aspects of privacy requirements, user and adversary behavior modeling and data quality metrics. In user behavior modeling, we propose an extended topic model for parameter learning in the hidden Markov model (Section 3).

● Propose two data publishing algorithms (PAs), PA-Markov and PA-HMM, for different user models. We prove that both algorithms satisfy the privacy requirement and optimize utility to some extent. We show that both algorithms run in polynomial time. Then, we propose several techniques to speed up the algorithms for better practical use (Section 4).

● Evaluate PA-Markov and PA-HMM on simulated and real-world datasets. The results show that both algorithms preserve privacy. We also test utility and discuss the privacy-utility tradeoff that data publishers may face in real-world scenarios (Section 5).

2. Problem Statement

Trajectories consist of a sequence of geospatial points with timestamps. However, people would like to conduct activities when staying at a place of interest. Here, we follow the semantics of activity defined in [8].

Definition 1 (Activity): An activity 𝛼 ∈ 𝔸 represents a type of human action that an individual can take, such as working and eating. 𝔸 is a finite set that contains all the activities that can be performed by the users.

Activity information is similar to location semantic information, which is studied in [6], since location semantic information indicates what a person does in a place. However, an activity is not limited to a location semantic. For example, a student may post a tweet at a restaurant in addition to eating. In other words, here, activity is a more general concept than location semantic information. Location, timestamp and corresponding activity together make up an event about a given user.

Definition 2 (Event): An event 𝑒 is a triple that includes activity information as well as when and where the user performs this activity, i.e., 𝑒 = ⟨𝛼, 𝑡, 𝑙⟩, where 𝛼 ∈ 𝔸, 𝑡 ∈ 𝕋, 𝑙 ∈ 𝕃, and 𝑒 ∈ 𝔼 = 𝔸 × 𝕋 × 𝕃. 𝔸, 𝕋 and 𝕃 are the predefined vocabularies of activities, timestamps and locations.

The granularity of the timestamp should be specified (e.g., hourly or every half day), and its legal values make up a finite set. Location and activity should also be finite variables. An event acts as a record in the dataset, which may contain many records for a given user.

Definition 3 (Activity Trajectory): An activity trajectory 𝛤 is a chronological sequence of events of a user, i.e., 𝛤 = {𝑒1, 𝑒2, ⋯ , 𝑒𝑛}, subject to 𝑒1.𝑡 ≤ 𝑒2.𝑡 ≤ ⋯ ≤ 𝑒𝑛.𝑡.
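For concreteness, the following minimal Python sketch mirrors Definitions 1-3; the vocabularies, field names and sample values are illustrative assumptions rather than part of the formal model.

```python
from dataclasses import dataclass
from typing import List

# Illustrative finite vocabularies (assumed for this sketch only).
ACTIVITIES = {"eating", "working", "shopping"}
TIMES = {"morning", "noon", "evening"}
LOCATIONS = {"canteen", "library", "store"}

@dataclass(frozen=True)
class Event:
    """An event e = <activity, time, location> (Definition 2)."""
    a: str
    t: str
    l: str

# An activity trajectory is a chronological list of events (Definition 3).
Trajectory = List[Event]

trajectory: Trajectory = [
    Event("eating", "morning", "canteen"),
    Event("working", "noon", "library"),
    Event("shopping", "evening", "store"),
]
```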

Tables 1 and 2 show an example dataset and example activity trajectories.

Table 1. A sample of datasets


Table 2. Activity trajectories of user 001 and 002


In the real world, activity trajectories collected by an organization need to be published to other organizations for various reasons. For example, the transit data collected by smart card automated fare collection systems in transportation systems may be shared due to administrative regulations or profit sharing [3]. A typical scenario for data publishing is described in Fig. 1. The data publisher collects data from the data generators or users and releases the collected data to a data miner or the public (e.g., on the Internet), who is called the data recipient. The data recipient will then use the published data for scientific research or other purposes. For example, an LBS provider collects its application users' check-ins and releases them to a scientific research organization to improve the recommendation algorithm in their application. In this case, the LBS provider is the data publisher, the application users are the data generators, and the scientific research organization is the data recipient.


Fig. 1. A typical scenario of data publishing

Activity trajectory data publishing benefits scientific research and helps to improve user experiences, but the individual sensitive information in raw data raises serious privacy concerns. For example, many people consider a hospital a sensitive location since being there implies that they are ill. Additionally, sensitive information usually varies from person to person. For example, a hospital may be sensitive for a patient but may not be as sensitive for a doctor. Therefore, it is necessary to answer the following question: To preserve user privacy, what data can be published, and what data should be suppressed?

Fig. 2 illustrates the problem we want to solve and indicates the relationship between different tasks and roles. Overall, there are three tasks in PPATP. The first task helps to formulate the problem. Specifically, the following aspects should be covered.


Fig. 2. Three tasks in PPATP

● What is privacy, and to what extent should it be preserved (privacy requirement)?

● How powerful is the adversary (adversary model)? The adversary’s background knowledge and reasoning methods have a substantial influence on the privacy preservation mechanism and thus should be well defined.

● Can user behaviors be modeled (user behavior modeling)? Everyone has his/her own living habits. If the adversary is a person who has knowledge of the user’s habits (e.g., a friend or relative), he/she may utilize the behavior patterns to infer privacy.

● How should the published data be evaluated (data quality metric)? Activity trajectories are published for certain uses. To preserve user privacy, the published data will definitely be less useful than the raw data. It is important to consider to what extent the usefulness is affected; therefore, a data quality metric is necessary.

The second task is designing a publishing algorithm based on the above modeling. The publishing algorithm should not only satisfy the privacy requirement but also ensure the data value. The last task is adjusting the publishing algorithm to meet practical needs, including how the publishing algorithm is implemented and how the parameters are selected.

We thus propose a solution to PPATP, S-PPATP, which is shown in Fig. 3. This solution consists of three steps corresponding to the three tasks in PPATP. First, we build models for the users and the adversary, define the privacy requirement and propose a utility function to measure the published data quality. Second, we propose two data publishing algorithms that satisfy the privacy requirement and optimize the utility function. Last, we conduct extensive experiments to elucidate how the publishing algorithms should be adjusted. We discuss the three steps in S-PPATP in detail in the following sections.


Fig. 3. Framework of S-PPATP

3. Modeling

3.1 Privacy requirement

Given user 𝑢 with user model 𝑀 and event space 𝔼, the user is required to declare the information he/she does not want to publish. Sensitive information could be a location, a timestamp or an activity. For example, it is not good for an employee to let his/her boss know that he/she is on vacation because it implies that the employee is absent from work. In this case, the sensitive information to the employee is activity information (traveling). Formally, a sensitive set 𝑆 ⊂ 𝔸 ∪ 𝕋 ∪ 𝕃 is identified for each user.

Informally, if the adversary gains no more knowledge about the sensitive information of 𝑢 after accessing the published activity trajectory, we say the publication preserves the privacy of user 𝑢. Here, we apply the probabilistic attack model [5], which aims to achieve the uninformative principle [9]. Under this principle, the posterior probability of each type of sensitive information at every sampling point is not much larger than the prior probability. Here, we apply 𝛿-privacy [7] and extend its semantics by also regarding activity and timestamp as possible sensitive information.

Definition 4 (𝛿-privacy): A publishing algorithm 𝒜 preserves 𝛿-privacy if, for all possible input activity trajectories generated from user model 𝑀, all possible outputs 𝑂 and all sensitive information 𝑠 ∈ 𝑆,

\(P\left[e_{i} \cdot x=s \mid O\right]-P\left[e_{i} \cdot x=s\right] \leq \delta\)       (1)

where \(1 \leq i \leq n, x \in\{a, t, l\}\) is the type of 𝑠.

Publishing algorithm 𝒜 checks each event of the given user and determines whether to publish or suppress it. If the decision is “publish”, 𝒜 outputs the event. If the decision is “suppress”, 𝒜 outputs “NIL”. In real-world applications, if the decision for an event is “NIL”, the data should be replaced by “Unknown” or other default values.

We say the user’s privacy is breached at position 𝑖 of some output 𝑂 if \(\exists s \in S: P\left[e_{i} \cdot x=s \mid O\right]-P\left[e_{i} \cdot x=s\right]>\delta\), where 𝑥 is a component of the event. To measure the extent to which privacy is breached in a dataset, we define the breach rate. Assume that dataset 𝐷 consists of data from 𝑈 users. The breach rate is the ratio of events at which privacy is breached, i.e.,

\(\text { Breach rate }(D)=\frac{1}{U} \sum_{u} \frac{\left|\left\{i \mid P\left[e_{i} \cdot x=s \mid D\right]-P\left[e_{i} \cdot x=s\right]>\delta\right\}\right|}{n}\)       (2)

In the following, the breach rate of algorithm 𝒜 refers to the breach rate of 𝒜’s output dataset.
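As a sketch of how the breach rate in (2) could be computed for a single user, assuming arrays of prior and posterior probabilities are already available (the helper name and data layout are hypothetical, not the paper's code):

```python
def breach_rate_one_user(prior, posterior, delta):
    """Fraction of positions where posterior - prior exceeds delta.

    prior[i][s], posterior[i][s]: probabilities that the sensitive value s holds at position i.
    The dataset-level breach rate in (2) averages this quantity over all U users.
    """
    n = len(prior)
    breached = sum(
        1 for i in range(n)
        if any(posterior[i][s] - prior[i][s] > delta for s in prior[i])
    )
    return breached / n

# Toy example with one sensitive value 's' over three positions.
prior = [{"s": 0.2}, {"s": 0.3}, {"s": 0.1}]
posterior = [{"s": 0.25}, {"s": 0.9}, {"s": 0.15}]
print(breach_rate_one_user(prior, posterior, delta=0.2))  # 1/3
```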

3.2 Data quality

Intuitively, if we always suppress the whole activity trajectory, the privacy would never be breached, which does not make sense since nothing is published. The more truthful data is published, the more useful the dataset is. We measure the quality of a dataset 𝐷 by defining the utility as the expected fraction of truthful events in the output activity trajectory.

\(\text { Utility }(\mathrm{D})=\mathrm{E} \text { (Fraction of truthful events) }\)       (3)

3.3 User model

Despite different lifestyles, a deep-rooted regularity is hidden behind human daily behaviors [10]. Since it has been proven that the Markov chain is useful in modeling human behavior patterns [11], we utilize two alternative models to describe user behaviors: a Markov-based model and an HMM-based model.

3.3.1 Markov-based model

We use a Markov model to characterize individual behavior patterns. Specifically, we regard an activity trajectory 𝛤 as three independent sequences: the activity sequence (A), time sequence (T) and location sequence (L). The sequences are generated from Markov chains 𝑀𝑎, 𝑀𝑡, and 𝑀𝑙, respectively. According to the property of a Markov chain, the current state depends only on the previous state, i.e.,

\(P\left[x_{i} \mid x_{1}, x_{2}, \cdots, x_{i-1}\right]=P\left[x_{i} \mid x_{i-1}\right], \quad 2 \leq i \leq n, x \in\{a, t, l\}\)       (4)
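A minimal sketch of the Markov-based user model follows, assuming each of the three sequences is a list of discrete states; the maximum-likelihood estimates with add-one smoothing are an illustrative assumption, not the paper's prescribed estimator.

```python
import numpy as np

def fit_markov(sequences, states):
    """Estimate the initial distribution and one-step transition matrix from state sequences."""
    idx = {s: i for i, s in enumerate(states)}
    m = len(states)
    init = np.ones(m)            # add-one smoothing (assumed)
    trans = np.ones((m, m))
    for seq in sequences:
        init[idx[seq[0]]] += 1
        for prev, cur in zip(seq, seq[1:]):
            trans[idx[prev], idx[cur]] += 1
    init /= init.sum()
    trans /= trans.sum(axis=1, keepdims=True)
    return init, trans

# Example: a user's location sequences over several days (toy data).
locations = ["home", "canteen", "library"]
seqs = [["home", "canteen", "library"], ["home", "library", "library"]]
b, T = fit_markov(seqs, locations)
```

The same procedure would be applied separately to the activity and time sequences to obtain 𝑀𝑎 and 𝑀𝑡.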

3.3.2 HMM-based model

Diverse human behaviors sometimes share something in common. For example, some people go to a restaurant for breakfast, while others may go to the bakery or make other choices. Nevertheless, all of these actions imply the same topic: having breakfast. Such hidden topics usually cannot be explicitly observed from the data. The HMM is suitable for this situation, where a latent variable exists. A typical HMM is defined by an initial state distribution (𝜋), transition probabilities (𝑇r) and emission probabilities (𝐵). 𝜋 and 𝑇r determine the state sequence, which indicates individual behavior patterns and cannot be observed. 𝐵 determines the observation sequence, which corresponds to a user’s activity trajectory.

In the HMM, the space of hidden states is usually much smaller than the space of observation variables. Therefore, dimension reduction must be leveraged for modeling. Here, we resort to topic modeling [12], which extracts a set of common “topics” from activity trajectories. Drawing an analogy, activity trajectories can be regarded as documents, and events can be regarded as words. Thus, topics can be explained as users’ behavior patterns. For example, eating and entertainment after work could be two topics. In the former topic, words such as {8:00, paying a bill, McDonald’s} may appear. In the latter topic, words such as {18:40, watching TV, home} may appear. In other words, topic models reduce the complexity of activity trajectories by providing an interpretable low-dimensional representation.

A typical implementation of topic modeling is latent Dirichlet allocation (LDA). However, LDA does not consider word dependence. Although a topic constraint within the same sentence can be added to LDA [13], the topics of different sentences remain independent, so the method does not fit our situation. Therefore, we extend LDA by incorporating topic transitions between consecutive words to capture the temporal dependence between consecutive events, which is common in daily life. For instance, people usually start working after breakfast and watch TV after dinner. These correlations reflect people’s regularity in daily life.

The samples can be generated by the following process:

1. For each document 𝑢 ∈ {1,2, ⋯ ,𝑈} and topic 𝑘 ∈ {1,2, ⋯ ,𝐾}, draw 𝜃𝑢,𝑘 ∼ 𝐷ir(𝛼𝑘).

2. For each topic 𝑘 ∈ {1,2, ⋯ ,𝐾}, draw 𝜙𝑘,𝑡 ∼ 𝐷ir(𝛽𝑡) , 𝜙𝑘,𝑎 ∼ 𝐷ir(𝛽𝑎) , and 𝜙𝑘,𝑙 ∼ 𝐷ir(𝛽𝑙).

3. For each word 𝑤𝑖 ∈ 𝑢, draw topic \(z_{u, i} \sim \operatorname{Mult}\left(\theta_{u, z_{u, i-1}}\right)\), time \(t_{u, i} \sim \operatorname{Mult}\left(\phi_{z_{u, i}, t}\right)\), activity \(a_{u, i} \sim \operatorname{Mult}\left(\phi_{z_{u, i}, a}\right)\) and location \(l_{u, i} \sim \operatorname{Mult}\left(\phi_{z_{u, i}, l}\right)\).

Mult and Dir represent multinomial distributions and Dirichlet distributions, respectively. 𝜃 is a distribution over topics for a document. 𝜙𝑘 is a discrete distribution over time, activity or location of topic 𝑘. 𝛼 and 𝛽 are hyperparameters for Dirichlet distributions.
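To make the generative process concrete, the following sketch draws a synthetic activity trajectory for one user. The number of topics, vocabulary sizes, hyperparameters and the uniform initial topic are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_words = 4, 20                       # topics and trajectory length (assumed)
TIMES, ACTS, LOCS = 6, 5, 8              # vocabulary sizes (assumed)
alpha, beta = 0.5, 0.1                   # Dirichlet hyperparameters (assumed)

# Step 1: per-user topic-transition rows; row k gives P[next topic | current topic k].
theta = rng.dirichlet([alpha] * K, size=K)
# Step 2: per-topic emission distributions over time, activity and location vocabularies.
phi_t = rng.dirichlet([beta] * TIMES, size=K)
phi_a = rng.dirichlet([beta] * ACTS, size=K)
phi_l = rng.dirichlet([beta] * LOCS, size=K)

# Step 3: draw a topic chain and emit <time, activity, location> words.
z = rng.integers(K)                       # initial topic (assumed uniform)
trajectory = []
for _ in range(n_words):
    t = rng.choice(TIMES, p=phi_t[z])
    a = rng.choice(ACTS, p=phi_a[z])
    l = rng.choice(LOCS, p=phi_l[z])
    trajectory.append((t, a, l))
    z = rng.choice(K, p=theta[z])         # topic transition for the next word
```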

For parameter learning, it is intractable to optimize the log-likelihood of the observed samples, as analyzed in [12]. There are many approximation algorithms for this problem, such as variational expectation-maximization (EM) and Gibbs sampling. Due to its easier derivation, we use Gibbs sampling for approximate inference. Gibbs sampling is a Markov chain Monte Carlo method that approximates the posterior distribution over the topic sequence \(P\left(\mathbf{z} \mid a_{1: n}, t_{1: n}, l_{1: n}\right)\) for an activity trajectory of length 𝑛. It samples the topic 𝑧𝑖 of word 𝑖, i.e., the triple ⟨𝑎𝑖, 𝑡𝑖, 𝑙𝑖⟩, conditioned on all the other words and iteratively resamples the topic of every word until convergence. In particular, the following posterior probability is used to sample the topic 𝑧𝑖 of word 𝑖:

\(\begin{array}{l} P\left[z_{i} \mid \mathbf{z}^{-i}, a_{1: n}, t_{1: n}, l_{1: n}\right] \propto P\left[z_{i} \mid \mathbf{z}^{-i}, \alpha\right] \cdot P\left[t_{i} \mid z_{i}, t_{1: n}^{-i}, \mathbf{z}^{-i}, \beta_{t}\right] \\ \cdot P\left[a_{i} \mid z_{i}, a_{1: n}^{-i}, \mathbf{z}^{-i}, \beta_{a}\right] \cdot P\left[l_{i} \mid z_{i}, l_{1: n}^{-i}, \mathbf{z}^{-i}, \beta_{l}\right] \end{array}\)       (5)

where superscript −𝑖 means ignoring the 𝑖th token. Considering topic transition, the topic of the 𝑖th sample, 𝑧𝑖, depends on both the multinomial parameter and the previous topic 𝑧𝑖−1. Therefore, the first term on the right side of ∝ is:

\(P\left[z_{i} \mid \mathbf{z}^{-i}, \alpha\right] \propto \frac{n_{z_{i-1}, z_{i}}^{u}+\alpha}{n_{z_{i-1}}^{u}+K \alpha} \frac{n_{z_{i}, z_{i+1}}^{u}+I\left(z_{i-1}=z_{i}=z_{i+1}\right)+\alpha}{n_{z_{i}}^{u}+I\left(z_{i-1}=z_{i}\right)+K \alpha}\)       (6)

where 𝐾 is the number of topics. The second term on the right side of ∝ in (6) adjusts the transition count from 𝑧𝑖−1 to 𝑧𝑖 since 𝑧𝑖 is excluded. The other terms in (5) can be calculated using the following probabilities:

\(P\left[t_{i} \mid z_{i}, t_{1: n}^{-i}, \mathbf{z}^{-i}, \beta_{t}\right] \propto \frac{n_{z_{i}} t_{i}+\beta_{t}}{n_{z_{i}}+T \beta_{t}}\)       (7)

\(P\left[a_{i} \mid z_{i}, a_{1: n}^{-i}, \mathbf{z}^{-i}, \beta_{a}\right] \propto \frac{n_{z_{i}, a_{i}}+\beta_{a}}{n_{z_{i}}+A \beta_{a}}\)       (8)

\(P\left[l_{i} \mid z_{i}, l_{1: n}^{-i}, \mathbf{z}^{-i}, \beta_{l}\right] \propto \frac{n_{z_{i}, l_{i}}+\beta_{l}}{n_{z_{i}}+L \beta_{l}}\)       (9)

where 𝑇, 𝐴 and 𝐿 represent the vocabulary sizes of time, activity and location, respectively. The above sampling is performed iteratively until the sampling results change little. Then, we can estimate the document-specific topic distribution

\(p[k \mid u]=\frac{n_{k}^{u}+\alpha}{\sum_{k}\left(n_{k}^{u}+\alpha\right)}\)       (10)

and topic-specific vocabulary distribution

\(\phi_{k, w}=\frac{n_{k, w}+\beta_{w}}{\sum_{w}\left(n_{k, w}+\beta_{w}\right)}, \quad w \in\{t, a, l\}\)       (11)

and the topic transition probability of each user

\(p\left[z_{i} \mid z_{i-1}, u\right]=\frac{n_{z_{i-1}, z_{i}}^{u}+\alpha}{n_{z_{i-1}}^{u}+K \alpha}\)       (12)

The output distributions of Gibbs sampling act as the three basic parameters of an HMM, i.e., the state transition probability \(\left(p\left[z_{i} \mid z_{i-1}, u\right]\right)\), emission probability (𝜙𝑘,𝑤) and prior state distribution (𝑝[𝑘|𝑢]) for all \(1 \leq i \leq n, k \in \mathbb{E}, j \in \mathbb{E}.\)
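Once Gibbs sampling has converged, the HMM parameters in (10)-(12) can be read off the accumulated count matrices. The sketch below assumes the counts for one user are already available; the function and argument names are illustrative.

```python
import numpy as np

def hmm_params_from_counts(n_uk, n_trans_u, n_kw, alpha, beta):
    """Estimate the HMM parameters from Gibbs-sampling counts.

    n_uk:      (K,)   topic counts for one user            -> prior state distribution, eq. (10)
    n_trans_u: (K, K) topic-transition counts for the user -> state transition, eq. (12)
    n_kw:      (K, W) topic-word counts for one vocabulary -> emission phi, eq. (11)
    """
    pi = (n_uk + alpha) / (n_uk + alpha).sum()
    Tr = (n_trans_u + alpha) / (n_trans_u + alpha).sum(axis=1, keepdims=True)
    phi = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
    return pi, Tr, phi

# Toy counts with K = 3 topics and a vocabulary of W = 5 words.
pi, Tr, phi = hmm_params_from_counts(
    np.array([4., 1., 2.]), np.ones((3, 3)), np.ones((3, 5)), alpha=0.5, beta=0.1)
```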

3.4 Adversary model

We study a powerful adversary who accesses the whole published dataset and has two types of background knowledge: 1) the user model, meaning the adversary is well aware of the user’s habits; and 2) the publishing algorithm, meaning the adversary knows when sensitive information is suppressed by the algorithm and can infer users’ privacy from the additional information leaked by the suppression rules.

This type of adversary exists in our daily lives, e.g., our close friends, because an individual’s behavior patterns can easily be learned by his/her close friends. Bayesian inference can be used by the adversary to infer sensitive information. Given a user model 𝑀, the adversary estimates the prior probability of the sensitive information at each position, i.e., \(P\left[e_{i} \cdot x=s\right](1 \leq i \leq n, x \in\{a, t, l\}, s \in S)\). Upon observing a published activity trajectory 𝑂, the adversary updates his/her inference by computing the posterior probability.

One technique an adversary can use to infer sensitive information exploits event dependence. There are two types of dependence: external dependence, the correlation between consecutive events, and internal dependence, the correlation among the components inside an event. For example, if 𝑙 is a sensitive location for user 𝑢, who usually goes there after lunch on Monday, then it makes no sense to merely suppress “go to 𝑙 on Monday afternoon” because this event can be inferred from the lunch event through the dependence. In this case, the adversary infers sensitive information via external dependence. As another example, assume user 𝑣 usually goes to 𝑙′ for drugs; then, merely suppressing the activity “taking drugs” makes no sense if the adversary knows the correlation between 𝑙′ and drugs. In this case, the adversary infers sensitive information via internal dependence.

Another inference technique exploits knowledge of the privacy-preserving mechanism itself. Sometimes “suppression” implies “publishing”. For instance, government officials are not allowed to enter a casino for gambling. Consider a publishing algorithm that tries to preserve this privacy by suppressing an event if and only if it contains the “gambling” activity. When the adversary observes a suppressed activity, he/she can easily infer the sensitive information due to his/her knowledge of the suppression rule.

4. Privacy-preserving Algorithms for Publishing Activity Trajectories

4.1 Algorithm for the Markov-based model

The pseudocode of the publishing algorithm for the Markov-based model (PA-Markov) is shown in Algorithm 1. Of the four inputs of PA-Markov, 𝛤 can be extracted from the raw dataset, 𝛿 and 𝑆 are provided by the data publisher or users, and 𝑀 is learned from the user’s historical records. PA-Markov finally outputs a modified activity trajectory 𝑂, which preserves 𝛿-privacy. When PA-Markov is executed for all users, the whole dataset is published. Here, we assume that the activity trajectories of different users are independent. In other words, the adversary cannot infer a user’s privacy with the help of other users’ activity trajectories.

At each position 𝑖, PA-Markov first checks external dependence with externalCheck and then internal dependence with internalCheck. The procedures of externalCheck and internalCheck are shown in Algorithms 2 and 3. Given a position 𝑖 and the current temporary output 𝑂, externalCheck returns true if and only if, for all possible values at position 𝑖 and all values in 𝑆, publishing the information does not breach 𝛿-privacy.

Probability estimation of sensitive information is the key step of externalCheck. The prior probability can be computed by:

\(P\left[e_{j} \cdot x=s\right]=\left(\vec{b}^{\prime} T^{j-1}\right)_{s}\)       (13)

where 𝑥 ∈ {𝑎, 𝑡, 𝑙}, \(\vec{b}\) is the initial distribution of 𝑥, 𝑇 is the one-step transition matrix of 𝑀, and the subscript 𝑠 denotes the entry corresponding to 𝑠.
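A sketch of the prior estimate in (13), assuming the initial distribution and one-step transition matrix have been learned as in Section 3.3.1; the function name and toy values are illustrative.

```python
import numpy as np

def markov_prior(b, T, j, s_index):
    """Prior probability that the j-th element (1-indexed) equals the sensitive value s, eq. (13)."""
    dist = b @ np.linalg.matrix_power(T, j - 1)
    return dist[s_index]

b = np.array([0.6, 0.3, 0.1])                 # initial distribution of x (assumed)
T = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
print(markov_prior(b, T, j=3, s_index=2))
```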

For the posterior probability, 𝑃[𝑒𝑗. 𝑥 = 𝑠|⟨𝑂, 𝑦⟩], we consider a temporary output 𝑂 = ⟨𝑜1, 𝑜2, ⋯ , 𝑜𝑖⟩. Let 𝑗′ be the last position before or at position 𝑗 at which an event was published. Let 𝑗″ be the first position after position 𝑗 at which an event was published. If no such position exists, then 𝑗″ = 𝑛 + 1. It was proven in [14] that:

\(P\left[e_{j} \cdot x=s \mid\langle O, y\rangle\right]=P\left[e_{j} \cdot x=s \mid e_{j^{\prime}} \cdot x=o_{j^{\prime}}, e_{j^{\prime \prime}} \cdot x=o_{j^{\prime \prime}}\right]\)       (14)

Using the Markov assumption on the activity trajectory, the probability in (14) becomes:

\(P_{\text {post}_{e}}\left[e_{j} \cdot x=s \mid\langle O, y\rangle\right]=P\left[e_{j} \cdot x=s \mid e_{j^{\prime}} \cdot x=o_{j^{\prime}}, e_{j^{\prime \prime}} \cdot x=o_{j^{\prime \prime}}\right]=\frac{P\left[e_{j^{\prime \prime}} \cdot x=o_{j^{\prime \prime}} \mid e_{j} \cdot x=s\right] P\left[e_{j} \cdot x=s \mid e_{j^{\prime}} \cdot x=o_{j^{\prime}}\right]}{P\left[e_{j^{\prime \prime}} \cdot x=o_{j^{\prime \prime}} \mid e_{j^{\prime}} \cdot x=o_{j^{\prime}}\right]}\)       (15)

Here, externalCheck checks the complete value set to avoid leaking additional information, because the adversary knows the publishing algorithm. For example, if 𝑒𝑛−1 is published but 𝑒𝑛 is suppressed, the adversary can infer that 𝑒𝑛 contains sensitive information based on (14). The complete value set at position 𝑖 is determined by 𝑀, starting from the nearest position before 𝑖 where the true data are published. Specifically, assume 𝑖′ is the nearest position where the output ≠ NIL; we can obtain 𝑃[𝑥𝑖|𝑥𝑖′] by 𝑖 − 𝑖′ transition steps.
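The posterior estimate in (15) conditions the suppressed position 𝑗 on the nearest published values at 𝑗′ and 𝑗″ and applies the Markov property. A sketch under those assumptions follows; the variable names are illustrative.

```python
import numpy as np

def markov_posterior(T, steps_left, steps_right, o_prev, o_next, s):
    """P[x_j = s | x_j' = o_prev, x_j'' = o_next] for a Markov chain with transition matrix T.

    steps_left  = j - j''s predecessor j'  (transitions from the last published position to j)
    steps_right = j'' - j                  (transitions from j to the next published position)
    """
    left = np.linalg.matrix_power(T, steps_left)    # P[x_j | x_j']
    right = np.linalg.matrix_power(T, steps_right)  # P[x_j'' | x_j]
    numer = right[s, o_next] * left[o_prev, s]
    denom = np.linalg.matrix_power(T, steps_left + steps_right)[o_prev, o_next]
    return numer / denom

T = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
print(markov_posterior(T, steps_left=1, steps_right=2, o_prev=0, o_next=1, s=2))
```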

After checking the external dependence of a given activity trajectory, PA-Markov starts to check the internal dependence. Here, we resort to the concept of frequent pattern mining. We define the posterior probability of sensitive information 𝑠 given another piece of information 𝑦0 as the confidence of the rule 𝑒.𝑦 = 𝑦0 ⇒ 𝑒.𝑥 = 𝑠, i.e.,

\(P_{\text {post}_{i}}\left[e . x=s \mid e . y=y_{0}\right]=\operatorname{conf}\left(e . y=y_{0} \Rightarrow e . x=s\right)=\frac{\left|\left\{e \mid e . x=s, e . y=y_{0}\right\}\right|}{\left|\left\{e \mid e . y=y_{0}\right\}\right|}\)       (16)

As shown in Algorithm 3, given an event 𝑒, internalCheck checks each component of 𝑒 that is not NIL. During each check, internalCheck iterates over all the sensitive information in 𝑆 to compute the posterior probability. If for some 𝑠 ∈ 𝑆 the posterior probability 𝑃[𝑒.𝑥 = 𝑠|𝑒.𝑦] exceeds the prior probability 𝑃[𝑒.𝑥 = 𝑠] by more than 𝛿, then 𝑒.𝑦 is suppressed, where 𝑦 ∈ {𝑎, 𝑡, 𝑙}. Thus, internalCheck ensures that for all sensitive information 𝑠 ∈ 𝑆, every posterior probability 𝑃[𝑒.𝑥 = 𝑠|⋅] exceeds the prior probability 𝑃[𝑒.𝑥 = 𝑠] by at most 𝛿, which preserves 𝛿-privacy. Additionally, publishing or suppressing data in internalCheck does not invalidate the result of externalCheck, since the current event is not used for the external dependence check in externalCheck.
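The following sketch illustrates the internal-dependence check based on the rule confidence in (16). The helper names are hypothetical, the event history is assumed to be available as a list of ⟨a, t, l⟩ records, and the sketch simplifies Algorithm 3 rather than reproducing it.

```python
def confidence(events, x, s, y, y0):
    """conf(e.y = y0 => e.x = s), estimated from historical events (dicts with keys 'a', 't', 'l')."""
    matching = [e for e in events if e[y] == y0]
    if not matching:
        return 0.0
    return sum(1 for e in matching if e[x] == s) / len(matching)

def internal_check(event, events, prior, sensitive, delta):
    """Suppress any component of `event` whose publication raises some P[e.x=s|e.y] above prior + delta."""
    out = dict(event)
    for y in ("a", "t", "l"):
        for (x, s) in sensitive:                      # sensitive values together with their type x
            if confidence(events, x, s, y, event[y]) - prior[(x, s)] > delta:
                out[y] = None                         # NIL
                break
    return out

history = [{"a": "take drugs", "t": "noon", "l": "pharmacy"},
           {"a": "shopping",   "t": "noon", "l": "pharmacy"}]
event = {"a": "shopping", "t": "noon", "l": "pharmacy"}
print(internal_check(event, history, prior={("a", "take drugs"): 0.1},
                     sensitive=[("a", "take drugs")], delta=0.2))
```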

Proposition 1: externalCheck preserves 𝛿-privacy as defined in Definition 4.

Proof: Consider a user 𝑢 with sensitive set 𝑆 for whom externalCheck produces 𝑂. At any position 𝑗, we consider two cases: 1) If 𝑗″ ≤ 𝑛, externalCheck ensures that the posterior probability is not more than 𝛿 larger than the prior probability when it publishes 𝑒𝑗″.𝑥. After 𝑒𝑗″ is published, whether the events after 𝑒𝑗″.𝑥 in 𝛤 are published or suppressed has no influence on the posterior probability, according to (14). Thus, externalCheck preserves 𝛿-privacy until the algorithm terminates. 2) If 𝑗″ = 𝑛 + 1, consider how externalCheck worked when publishing 𝑒𝑗′.𝑥. Its decision to publish 𝑒𝑗′.𝑥 implies that the posterior probability of 𝑒𝑗.𝑥 = 𝑠 was not more than 𝛿 larger than the prior probability given 𝑗″ = 𝑛 + 1 during the check.

Since both externalCheck and internalCheck preserve 𝛿-privacy, PA-Markov preserves 𝛿-privacy (Proposition 2).

4.2 Algorithm for HMM-based model

We assign a probability \(p_{i}^{j}\) to each event 𝑖, with which 𝑖 is published at position 𝑗. With probability 1 − \(p_{i}^{j}\), event 𝑖 is suppressed at position 𝑗. We further define a publishing vector 𝒑 containing all the publishing probabilities. Given the publishing vector 𝒑, our publishing algorithm outputs each event with its publishing probability in order. The pseudocode is shown in Algorithm 4. With the publishing probabilities defined, whether an output activity trajectory breaches 𝛿-privacy can be checked directly from the publishing vector. The pseudocode is shown in Algorithm 5. The algorithm returns true if and only if 𝛿-privacy is not breached at any position for any of the sensitive information.
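The randomized publishing step described for Algorithm 4 can be sketched as follows; the function name, event encoding and toy publishing vector are assumptions made for illustration.

```python
import random

def publish(trajectory, p, event_index):
    """Publish event e at position j with probability p[j][event_index[e]]; otherwise output None (NIL)."""
    output = []
    for j, e in enumerate(trajectory):
        if random.random() < p[j][event_index[e]]:
            output.append(e)
        else:
            output.append(None)
    return output

events = ["e0", "e1"]                       # tiny event space (assumed)
event_index = {e: i for i, e in enumerate(events)}
p = [[1.0, 0.2], [0.5, 0.0]]                # publishing vector: p[j][i] for event i at position j
print(publish(["e0", "e1"], p, event_index))
```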

In privacyCheck, the calculation of the prior and posterior probabilities of sensitive information is again the crucial step. The prior probability can be computed from the Markov property. Assume we want to compute the prior probability of some sensitive information 𝑠 at position 𝑖. The prior distribution of 𝑒𝑖 can be computed from the (𝑖 − 1)-step transition probability, i.e.,

\(P\left[e_{i}=e\right]=\left(\pi T r^{i-1} B\right)_{e}\)       (17)

Then, the prior probability of 𝑠 can be computed by

\(P\left[e_{i} \cdot x=s\right]=\sum_{e . x=s}\left(\pi T r^{i-1} B\right)_{e}\)       (18)
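A sketch of the prior computation in (17)-(18) for the HMM-based model; pi, Tr and B are assumed to be the learned HMM parameters, and `has_s` is a hypothetical predicate testing whether an event contains the sensitive value 𝑠.

```python
import numpy as np

def hmm_prior(pi, Tr, B, i, has_s):
    """P[e_i.x = s] = sum over events e with e.x = s of (pi Tr^(i-1) B)_e, eqs. (17)-(18)."""
    event_dist = pi @ np.linalg.matrix_power(Tr, i - 1) @ B   # distribution over events at position i
    return sum(p for e, p in enumerate(event_dist) if has_s(e))

# Toy HMM with 2 latent states and 3 events; event index 2 is treated as "sensitive".
pi = np.array([0.5, 0.5])
Tr = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])
print(hmm_prior(pi, Tr, B, i=3, has_s=lambda e: e == 2))
```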

Computing the posterior probability is more complex. Since an event is never partially published in PA-HMM, there is no need to check internal dependence separately. As each event has a publishing probability, the adversary cannot infer sensitive information from the original HMM. From the adversary’s point of view, the output is generated by a new HMM, 𝑀′ = (𝜋′, 𝑇r′, 𝐵′). The initial state distribution and transition probabilities do not change, i.e., 𝜋′ = 𝜋, 𝑇r′ = 𝑇r. However, the emission probabilities 𝐵′ change to

\(b_{i, j}^{\prime k}=\left\{\begin{array}{ll} b_{i, j} p_{j}^{k} & j \in \mathbb{E} \\ 1-\sum_{l} b_{i, l} p_{l}^{k} & j=N I L \end{array}, 1 \leq k \leq n\right.\)       (19)

According to (3) in [7], the distribution of latent state 𝑌 is

\(P\left[Y_{i}=y \mid O\right]=\frac{\alpha_{y}(i) \beta_{y}(i)}{\sum_{y^{\prime}} \alpha_{y^{\prime}}(i) \beta_{y^{\prime}}(i)}, \quad 1 \leq y \leq K\)       (20)

\(\alpha_{y}(i)=P\left[Y_{i}=y, o_{1}, o_{2}, \cdots, o_{i-1}\right], \beta_{y}(i)=P\left[o_{i}, o_{i+1}, \cdots, o_{n} \mid Y_{i}=y\right]\)       (21)

𝛼 and 𝛽 can be obtained by the forward-backward algorithm. Thus, we have

\(\begin{aligned} P\left[e_{i} \cdot x=s \mid O\right] &=\sum_{y} P\left[Y_{i}=y \mid O\right] P\left[e_{i} \cdot x=s \mid Y_{i}=y\right] \\ &=\sum_{y} P\left[Y_{i}=y \mid O\right] \cdot \sum_{e} I(e . x=s) b_{y, e}^{\prime i} \end{aligned}\)       (22)
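The steps in (19)-(22) can be sketched as follows: build the adversary-visible emission matrices 𝐵′ from the publishing vector, run the forward-backward recursions for the observed output 𝑂 (with NIL as an extra symbol), and sum the emission mass of sensitive events. The helper names and toy parameters are assumptions for illustration.

```python
import numpy as np

def build_B_prime(B, p):
    """Per-position emission matrices B' from (19); the last column is the NIL symbol."""
    n = len(p)
    K, E = B.shape
    Bp = np.zeros((n, K, E + 1))
    for k in range(n):
        Bp[k, :, :E] = B * p[k]                       # b_{i,j} * p_j^k
        Bp[k, :, E] = 1.0 - Bp[k, :, :E].sum(axis=1)  # mass moved to NIL
    return Bp

def posterior_sensitive(pi, Tr, Bp, O, sensitive_events):
    """P[e_i.x = s | O] for every position, following (20)-(22).

    O: observed output as event indices, with NIL encoded as the last emission column.
    """
    n, K = len(O), len(pi)
    alpha = np.zeros((n, K))
    beta = np.zeros((n, K))
    alpha[0] = pi                                     # alpha_y(1) = P[Y_1 = y], per (21)
    for i in range(1, n):
        alpha[i] = (alpha[i - 1] * Bp[i - 1][:, O[i - 1]]) @ Tr
    beta[n - 1] = Bp[n - 1][:, O[n - 1]]
    for i in range(n - 2, -1, -1):
        beta[i] = Bp[i][:, O[i]] * (Tr @ beta[i + 1])
    post = []
    for i in range(n):
        state = alpha[i] * beta[i]
        state /= state.sum()                          # P[Y_i = y | O], eq. (20)
        post.append(float(state @ Bp[i][:, sensitive_events].sum(axis=1)))  # eq. (22)
    return post

# Toy setup: 2 latent states, 2 events (+ NIL), 3 positions; event 1 is sensitive, 2 denotes NIL.
pi = np.array([0.5, 0.5])
Tr = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.7, 0.3], [0.2, 0.8]])
p = np.array([[1.0, 0.5], [1.0, 0.0], [1.0, 1.0]])    # publishing vector
Bp = build_B_prime(B, p)
print(posterior_sensitive(pi, Tr, Bp, O=[0, 2, 1], sensitive_events=[1]))
```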

Obviously, if we increase the publishing probabilities, the utility also increases. Our goal is to find the publishing vector that maximizes utility while preserving 𝛿-privacy. However, the privacy constraint checked by privacyCheck is neither convex nor concave. We observe that if we decrease the publishing probabilities, privacy can only improve. We say vector 𝐩 dominates 𝐪 if \(q_{i}^{j} \leq p_{i}^{j}\) for all 𝑖, 𝑗. Then, we make the following proposition.

Proposition 3: If 𝐩 preserves 𝛿-privacy, then so does any 𝐪 dominated by 𝐩.

Proof: Our proof is quite similar to the proof of the monotonicity property of the probabilistic check in [7]. Consider two publishing vectors, 𝐩 and 𝐪. 𝐩 is larger by 𝜖 in exactly one dimension: \(p_{i}^{j}=q_{i}^{j}+\epsilon\). Assume that 𝐩 preserves 𝛿-privacy. It was proven in [7] that for all outputs 𝑂 and sensitive information 𝑠 ∈ 𝑆, the maximum posterior probability of 𝑒𝑗, 𝑃[𝑒𝑗|𝑂], does not increase when the publishing vector goes from 𝐩 to 𝐪. Thus, we have

\(\begin{aligned} P_{\mathbf{q}}\left[e_{j} \cdot x=s \mid O\right]-P\left[e_{j} \cdot x=s\right] &=\sum_{i} I(i . x=s) P_{\mathbf{q}}\left[e_{j}=i \mid O\right] q_{i}^{j}-P\left[e_{j} \cdot x=s\right] \\ & \leq \sum_{i} I(i . x=s) P_{\mathbf{p}}\left[e_{j}=i \mid O\right] p_{i}^{j}-P\left[e_{j} \cdot x=s\right] \\ & \leq P_{\mathbf{p}}\left[e_{j} \cdot x=s \mid O\right]-P\left[e_{j} \cdot x=s\right] \leq \delta \end{aligned}\)       (23)

Since the gap between posterior and prior probability does not increase when the publishing vector goes from 𝐩 to 𝐪, the privacy is preserved. In other words, privacy is an anti-monotone property.

The range of 𝐩 is [0,1]^{𝑛|𝔼|}, which contains infinitely many vectors. We discretize [0,1] into {0, 0.1, ⋯, 0.9, 1} and use a greedy algorithm, ALGP [15], to optimize 𝐩, as shown in Algorithm 6. We call a privacy-preserving publishing vector 𝐩 an extreme point if increasing any dimension of 𝐩 would breach privacy. The idea of vectorSearch is to find all the extreme points, which are maintained in MaxTrueSet, by iterative binary search. Then, the publishing vector with the highest utility is chosen from MaxTrueSet and returned.
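The following is a simplified sketch of the search idea: starting from the all-zero vector, each coordinate is raised as far as the privacy check allows using binary search over the discretized grid, which is valid because privacy is anti-monotone (Proposition 3). It performs a single greedy pass rather than the full ALGP exploration of extreme points, and `privacy_check` and `utility` are placeholders for the procedures described above.

```python
import itertools

GRID = [round(0.1 * i, 1) for i in range(11)]        # discretized [0, 1]

def greedy_vector_search(dims, privacy_check, utility):
    """Greedy sketch: raise each coordinate as far as privacy_check allows (binary search on the grid)."""
    p = {d: 0.0 for d in dims}                       # start from the all-suppress vector
    for d in dims:                                   # a single greedy pass; ALGP explores more orders
        lo, hi = 0, len(GRID) - 1
        best = 0
        while lo <= hi:                              # binary search is valid because privacy is anti-monotone
            mid = (lo + hi) // 2
            p[d] = GRID[mid]
            if privacy_check(p):
                best, lo = mid, mid + 1
            else:
                hi = mid - 1
        p[d] = GRID[best]
    return p, utility(p)

# Toy check: "privacy" holds as long as the sum of publishing probabilities stays below 1.5.
dims = [(j, e) for j, e in itertools.product(range(2), range(2))]   # (position, event) coordinates
check = lambda p: sum(p.values()) <= 1.5
util = lambda p: sum(p.values()) / len(p)
print(greedy_vector_search(dims, check, util))
```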

PrivacyCheck determines whether a publishing vector preserves privacy by checking 𝛿-privacy; thus, PA-HMM preserves 𝛿-privacy.

4.3 Comparison of PA-Markov and PA-HMM

4.3.1 Utility

We have proven that privacy is anti-monotone and utility is monotone; thus, we use a greedy search algorithm, ALGP, to maximize the publishing vector 𝐩, so the publishing vector of PA-HMM is optimized. In contrast, PA-Markov is only locally optimal: if the current information were published despite externalCheck deciding to suppress it, privacy could be breached when the temporary sequence turns out to be exactly the final output. Although internalCheck guarantees publishing as much information as possible, the whole algorithm, PA-Markov, is not globally optimal in utility.

4.3.2 Efficiency

We denote 𝑚 = max{|𝐴|, |𝑇|, |𝐿|}. The time complexity of computing the prior and posterior probabilities in PA-Markov is 𝑂(𝑛𝑚³). The sensitive set 𝑆 is given, and we denote its size by |𝑆|. Since the innermost iteration of externalCheck runs 𝑂(𝑛²𝑚|𝑆|) times, the total time complexity of PA-Markov is 𝑂(𝑛³𝑚⁴|𝑆|).

For PA-HMM, the running time of Gibbs sampling is 𝑂(𝑅𝐾𝑛), where 𝑅 is the number of iterations. Since we exploit a greedy binary search to maximize 𝐩, the number of calls to privacyCheck in vectorSearch is 𝑂(𝑛|𝔼|log(𝑑)), where 𝑑 is the number of intervals into which we divide [0,1]. The time complexities of the prior and posterior probability estimation are 𝑂(𝑛|𝔼|𝐾²) and 𝑂(𝑛|𝔼|²), respectively. In addition, privacyCheck iterates 𝑂(𝑛|𝑆|) times. Since 𝐾 is usually considerably smaller than |𝔼|, the total running time of PA-HMM is 𝑂(𝑛³|𝔼|³|𝑆|log(𝑑)).

4.3.3 Speedup

To speed up PA-Markov, we can further improve the procedures of dependence check:

a) At position 𝑖, externalCheck checks for privacy breaches by computing Pprior and Ppost at all positions. Assume 𝑖′ is the nearest position before 𝑖 where the event is published; according to (15), the events after 𝑖′ do not affect the posterior probability of 𝑒𝑖″.𝑥 = 𝑠 for any 𝑖″ ≤ 𝑖′ and any 𝑠. Thus, Line 3 in externalCheck can be improved as follows:

1: 𝑖′ ← the last position before 𝑖 where 𝑒𝑖′. 𝑥 is published

2: for 𝑗 = 𝑖′ + 1 → 𝑛 do

b) In externalCheck, each possible value 𝑦 and each piece of sensitive information 𝑠 ∈ 𝑆 are checked separately; therefore, we can use data parallelism to further accelerate the process. Thus, Lines 1 to 2 in externalCheck can be improved as follows:

1: for each possible value 𝑦 parallel do

2: for each 𝑠 ∈ 𝑆 parallel do

To speed up PA-HMM, we can run vectorSearch offline since it is independent of the input activity trajectory. Then, we just need to start at Line 3 in Algorithm 4 online. Thus, the online running time is reduced to 𝑂(𝑛), which is acceptable for real-time applications.

5. Evaluation

5.1 Datasets

A small-scale simulated dataset and two real-world datasets are used in our experiments. The simulated dataset (denoted SD) is generated randomly, while the real-world datasets (accessible at https://github.com/PPATP/Campus-smart-card) are collected from the campus smart card system of a university in China. In the system, hundreds of point-of-sale (POS) machines are set up where payment is needed (e.g., canteens, stores) or for check-in/out (e.g., libraries, dormitories). When a student swipes a smart card on a POS machine, a record including the timestamp, expense, location and some other metadata is saved temporarily on the POS machine and later uploaded to a centralized database. We use these records as students’ activity trajectories in the experiments. We choose two datasets: one collected from September to December 2011 (denoted D11) and the other from September to December 2015 (denoted D15). The statistics are shown in Table 3.

Table 3. Statistics of the datasets


5.2 Baselines

a) noMask. NoMask publishes the raw activity trajectory without any suppression. According to the adversary model, the adversary knows this mechanism. Therefore, the posterior probability of 𝑠 is:

\(P\left[e_{i} \cdot x=s \mid O\right]=\left\{\begin{array}{ll} 1 & o_{i} \cdot x=s \\ 0 & o_{i} \cdot x \neq s \end{array}, \quad 1 \leq i \leq n\right.\)       (24)

b) sensitiveMask. SensitiveMask is a naive approach that suppresses an event when it contains sensitive information and publishes it otherwise. When a suppression occurs, the adversary knows that the event contains sensitive information but does not know the exact event. Here, the HMM is also applied to update the adversary’s posterior probabilities. What the adversary observes is a new HMM 𝑀″. The initial state distribution and transition probabilities of 𝑀″ remain the same, while the emission probabilities change to:

\(\begin{equation} b_{j}(k)=P\left[o_{i}=k \mid e_{i}=j\right]=\left\{\begin{array}{ll} 1 & k=\mathrm{NIL} \text { and } \exists j . x \in S \text { or } k=j \\ 0 & \text { others } \end{array}\right. \end{equation}\)       (25)

for all \(1 \leq i \leq n, k \in \mathbb{E} \cup\{\mathrm{NIL}\}, j \in \mathbb{E}\). The posterior probability can be computed by (20).

c) noInternal. We use PA-Markov without the internal check (denoted noInternal) to test whether internalCheck is necessary. NoInternal is the same as PA-Markov except that Line 10 in Algorithm 1 is removed.

5.3 Results on the simulated dataset

We randomly choose an activity as sensitive information (|𝑆| = 1); the experimental results on SD are shown in Table 4. The breach rates of PA-Markov and PA-HMM are always 0, outperforming the baselines and demonstrating that PA-Markov and PA-HMM preserve users’ privacy. Additionally, the breach rate of sensitiveMask is also 0, which means that merely suppressing sensitive information is sufficient to preserve privacy on SD. In Section 5.4, we conduct a more detailed analysis of the experimental results on D11 and D15 and show that sensitiveMask sometimes causes breaches of users’ privacy.

5.4 Results on the real-world datasets

5.4.1 Privacy Breaches

The breach rates of PA-Markov and PA-HMM are shown in Figs. 4, 5, 6 and 7. We conducted experiments with two types of sensitive sets: 1) a sensitive activity, a sensitive timestamp and a sensitive location; 2) a sensitive activity, or a sensitive timestamp, or a sensitive location. All the sensitive sets are chosen randomly. NoMask has a high breach rate even when 𝛿 = 0.5. When 𝛿 is small, noInternal has a very high breach rate, which indicates the risk of internal dependence attacks by the adversary. PA-Markov considers both external and internal dependence and therefore always preserves privacy. For the HMM-based user model, we find that PA-HMM also performs the best. Simply suppressing all the events with sensitive information merely lowers the breach rate but cannot guarantee that all privacy is preserved. As shown in Figs. 5 and 7, the adversary can sometimes still infer sensitive information using the correlation between events.


Fig. 4. Breach rate of PA-Markov, noMask and nointernal on D11


Fig. 5. Breach rate of PA-HMM, noMask and sensitiveMask on D11


Fig. 6. Breach rate of PA-Markov, noMask and nointernal on D15


Fig. 7. Breach rate of PA-HMM, noMask and sensitiveMask on D15

Fig. 8 shows the distribution of published and suppressed data on D11 and D15. For PA-Markov, the bars show the fraction of sensitive or nonsensitive information in the whole dataset. For PA-HMM, since it only determines whether to publish an entire event, the bars show the fraction of events that contain sensitive information. We choose 𝛿 = 0.2 for PA-HMM and 𝛿 = 0.65 for PA-Markov and randomly tag sensitive information. PA-Markov and PA-HMM both publish some of the sensitive information or events with sensitive information. Based on Definition 4, publishing 𝑒𝑖 preserves privacy if the prior probability of 𝑠 exceeds 1 − 𝛿 at position 𝑖. To explain this: if the adversary is already very sure about some sensitive information, publishing it or not would not shake his/her belief. It may seem counterintuitive, however, that PA-Markov and PA-HMM suppress a large amount of nonsensitive information or events without sensitive information. According to Definition 4, although these events preserve privacy at their own positions, publishing them would breach 𝛿-privacy at other positions. Fig. 8 again demonstrates that PA-Markov and PA-HMM can protect against correlation attacks.


Fig. 8. The composition of published and suppressed data

5.4.2 Utility

The utility of the output datasets (for simplicity, the utility of an algorithm and the utility of its output are used synonymously) under varying 𝛿 is shown in Figs. 9 and 10. For PA-Markov, we test the utility for four types of sensitive information. For PA-HMM, we test the utility as the number of latent states varies (𝐾 = 4, 6, 8, 10). We observe that the trends of utility under different settings are similar: when we strengthen privacy preservation, the utility decreases. This result implies that we must sacrifice some utility to satisfy privacy requirements in practical use. One proper tradeoff is setting a different 𝛿 for different users. We can assign a low 𝛿 to those who care more about their privacy (e.g., famous singers, actors) and a high 𝛿 to those who care less about their privacy (e.g., ordinary citizens). Contract theory seems to be a useful approach to balancing privacy and data utility [16].


Fig. 9. Privacy-utility tradeoff of PA-Markov


Fig. 10. Privacy-utility tradeoff of PA-HMM

5.4.3 Running time

To compare the efficiency of PA-Markov, PA-HMM and the baselines, each algorithm is run 10 times. The average running times are shown in Table 5. PA-Markov and PA-HMM need much more time than sensitiveMask and noMask, as they require multiple iterations while sensitiveMask and noMask take constant time. Additionally, after vectorSearch is conducted offline, the running time of PA-HMM is reduced significantly.

6. Related Work

Recent years have seen many inspiring works on PPDP. We provide a brief introduction to these works and compare S-PPATP with their approaches in terms of the user model, adversary model, privacy requirement and data quality.

PPDP solutions can be broadly classified into two categories based on their attack principles [5]. The first category defends against linkage attacks, which consider that a privacy threat occurs when an attacker is able to link a record owner to a record in a published dataset or to a sensitive attribute [17]. 𝑘-anonymity is a famous model to prevent linkage attacks; it requires that each record be indistinguishable from at least 𝑘 − 1 other records. Li et al. [18] applied partitioning-based and clustering-based algorithms to real-world soccer fitness data publication to achieve the 𝑘-anonymity model. Gao et al. [19] proposed a personalized 𝑘-anonymity model that takes trajectory similarity and direction into account when selecting the anonymity set. Gurung et al. [20] and Dong et al. [21] adopted clustering-based anonymization algorithms to group similar trajectories and select representative trajectories. Their methods guarantee strict 𝑘-anonymity of the published data but may reduce its utility. Originally proposed for relational data, 𝑘-anonymity makes no assumptions about the patterns of victims’ data. Compared with these approaches, S-PPATP assumes that trajectory data reflect human mobility patterns, which can be characterized by the user model.

Many works have proposed improved techniques for 𝑘-anonymity. It has been found that in some cases, another type of privacy leakage, the homogeneity attack, may still exist in 𝑘-anonymized data [17]. To address the homogeneity attack, 𝑙-diversity has been proposed, which requires each sensitive attribute to possess at least 𝑙 distinct values in each anonymity group [9]. Wang et al. [22] proposed a novel privacy-preserving framework for LBS data publication. The framework considers the topological properties of the road network when providing privacy-preserving mechanisms for a single user and a batch of users. They also proposed two cloaking algorithms to achieve both 𝑘-anonymity and 𝑙-diversity. Zhu et al. [23] proposed a noise technique to publish anonymized data and fulfill the 𝑙-diversity requirements. Li et al. [24] proposed a data partitioning method in PPDP under the constraints of 𝑘-anonymity and 𝑙-diversity. However, 𝑙-diversity does not prevent attribute linkage attacks when the overall distribution of sensitive information is skewed [5][25]. As a result, some works used a stricter privacy model called 𝑡-closeness to anonymize published data [26], which requires the distribution of sensitive information in any anonymity group to be close to the distribution in the overall dataset. To sum up, 𝑘-anonymity and its improvements preserve privacy well only when the adversary has limited background knowledge about the trajectory generator, and they provide undifferentiated protection for sensitive information. By contrast, S-PPATP addresses a more severe threat, in which the adversary may have background knowledge of the user model, and provides a flexible protection level by using 𝛿-privacy.

The other category defends against probabilistic attacks, which study how the adversary’s probabilistic belief about a victim’s privacy changes after accessing the published dataset [9]. Privacy-preserving solutions in this category normally make a strong assumption about the adversary’s background knowledge. Gotz et al. [7] originally assumed that the adversary knows both the temporal correlations and the publishing algorithm. They proposed a framework, MaskIt, to filter user data while preserving 𝛿-privacy. Li et al. [27] proposed a data publishing algorithm to prevent such attacks based on a naive Markov user model. Gramaglia et al. [28] used both spatiotemporal generalization and suppression to ensure that the adversary gains few samples of the target user’s trajectory, at the cost of significant data loss. S-PPATP belongs to this category. Compared with these works, S-PPATP extends the boundary of sensitive information and strengthens the adversary’s power: the adversary is aware of a victim’s hidden lifestyle (topic) and can infer sensitive information via the internal dependence inside an event. Additionally, S-PPATP applies suppression without generalization to ensure that the published data make more sense.

ϵ-Differential privacy is an extremely strict privacy model first proposed in [29]. Instead of comparing the prior probability and the posterior probability, ϵ-differential privacy imposes a strict requirement that the addition or removal of any single database record does not significantly influence the outcome of any inference. Some works applied the ϵ-differential privacy model to trajectory data publication [3][30][31]. ϵ-Differential privacy seems to be an ultimate solution because it is proven that it protects against attackers with arbitrary background knowledge [29]. However, the privacy requirement is too rigorous for the LBS scenario [32]. Therefore, in S-PPATP, we relax the privacy requirement by ensuring that the adversary’s belief in sensitive information does not increase too much, without taking further data removal or addition into account.

7. Conclusion and Future Work

In this paper, we propose a solution for PPATP, S-PPATP, which consists of modeling, algorithm design and algorithm adjustment. Although S-PPATP is an effective approach to privacy-preserving activity trajectories publishing, it can be further improved at each step. During modeling, more sophisticated user models can be used to better describe user behavior patterns, and a more powerful adversary can be assumed to have more background knowledge. During algorithm design, since we have discussed that neither PA-Markov nor PA-HMM is globally optimal in utility, a hybrid of PA-Markov and PA-HMM may further enhance the utility. During algorithm adjustment, the privacy-utility tradeoff is an important issue for data publishers and can be further studied for more application scenarios.

References

  1. H. Ghasemzadeh, P. Panuccio, S. Trovato, G. Fortino, and R. Jafari, "Power-aware activity monitoring using distributed wearable sensors," Human-Machine Systems, IEEE Transactions on, vol. 44, no. 4, pp. 537-544, 2014. https://doi.org/10.1109/THMS.2014.2320277
  2. B. Zhou, Q. Li, Q. Mao, W. Tu, and X. Zhang, "Activity sequence-based indoor pedestrian localization using smartphones," Human-Machine Systems, IEEE Transactions on, vol. 45, no. 5, pp. 562-574, Oct 2015. https://doi.org/10.1109/THMS.2014.2368092
  3. R. Chen, B. Fung, B. C. Desai, and N. M. Sossou, "Differentially private transit data publication: a case study on the montreal transportation system," in Proc. of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp. 213-221, 2012.
  4. J. Wan, C. Byrne, M. OGrady, and G. OHare, "Managing wandering risk in people with dementia," Human-Machine Systems, IEEE Transactions on, vol. 45, no. 6, pp. 819-823, Dec 2015. https://doi.org/10.1109/THMS.2015.2453421
  5. B. Fung, K. Wang, R. Chen, and P. S. Yu, "Privacy-preserving data publishing: A survey of recent developments," ACM Computing Surveys (CSUR), vol. 42, no. 4, p. 14, 2010.
  6. B. Agir, K. Huguenin, U. Hengartner, and J.-P. Hubaux, On the privacy implications of location semantics, Ph.D. dissertation, EPFL, 2015.
  7. M. Gotz, S. Nath, and J. Gehrke, "Maskit: privately releasing user context streams for personalized mobile applications," in Proc. of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, pp. 289-300, 2012.
  8. K. Zheng, S. Shang, N. J. Yuan, and Y. Yang, "Towards efficient search for activity trajectories," in Proc. of Data Engineering (ICDE), 2013 IEEE 29th International Conference on. IEEE, pp. 230-241, 2013.
  9. A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, "l-diversity: Privacy beyond k-anonymity," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 1, no. 1, p. 3, 2007. https://doi.org/10.1145/1217299.1217302
  10. C. Song, Z. Qu, N. Blumm, and A. Barabasi, "Limits of predictability in human mobility," Science, vol. 327, no. 5968, pp. 1018-1021, 2010. https://doi.org/10.1126/science.1177170
  11. A. Mannini and A. M. Sabatini, "Accelerometry-based classification of human activities using markov modeling," Computational intelligence and neuroscience, vol. 2011, p. 10, 2011.
  12. D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," the Journal of machine Learning research, vol. 3, pp. 993-1022, 2003.
  13. A. Gruber, Y. Weiss, and M. Rosen-Zvi, "Hidden topic markov models," in Proc. of International Conference on Artificial Intelligence and Statistics, pp. 163-170, 2007.
  14. M. Gotz, On user privacy in personalized mobile services, Ph.D. dissertation, Cornell University, 2012.
  15. A. Arasu, M. Gotz, and R. Kaushik, "On active learning of record matching packages," in Proc. of the 2010 ACM SIGMOD International Conference on Management of data. ACM, pp. 783-794, 2010.
  16. L. Xu, C. Jiang, Y. Chen, Y. Ren, K. R. Liu, "Privacy or utility in data collection? a contract theoretic approach," IEEE Journal of Selected Topics in Signal Processing, 9 (7), 1256-1269, 2015. https://doi.org/10.1109/JSTSP.2015.2425798
  17. S. Yu, "Big privacy: Challenges and opportunities of privacy study in the age of big data," IEEE access, 4, 2751-2763, 2016. https://doi.org/10.1109/ACCESS.2016.2577036
  18. R. Li, S. An, D. Li, J. Dong, W. Bai, H. Li, Z. Zhang, Q. Lin, "K-anonymity model for privacy-preserving soccer fitness data publishing," in Proc. of MATEC Web of Conferences, Vol. 189, p. 03007, 2018.
  19. S. Gao, J. Ma, C. Sun, and X. Li, "Balancing trajectory privacy and data utility using a personalized anonymization model," Journal of Network and Computer Applications, vol. 38, pp. 125-134, 2014. https://doi.org/10.1016/j.jnca.2013.03.010
  20. S. Gurung, D. Lin, W. Jiang, A. Hurson, R. Zhang, "Traffic information publication with privacy preservation," ACM Transactions on Intelligent Systems and Technology (TIST), 5 (3), 44, 2014.
  21. Y. Dong, D. Pi, "Novel privacy-preserving algorithm based on frequent path for trajectory data publishing," Knowledge-Based Systems, 148, 55-65, 2018. https://doi.org/10.1016/j.knosys.2018.01.007
  22. Y. Wang, Y. Xia, J. Hou, S.-m. Gao, X. Nie, Q. Wang, "A fast privacy-preserving framework for continuous location-based queries in road networks," Journal of Network and Computer Applications, 53, 57-73, 2015. https://doi.org/10.1016/j.jnca.2015.01.004
  23. H. Zhu, S. Tian, M. Xie, M. Yang, "Preserving privacy for sensitive values of individuals in data publishing based on a new additive noise approach," in Proc. of 2014 23rd International Conference on Computer Communication and Networks (ICCCN), IEEE, pp. 1-6, 2014.
  24. S. Li, H. Shen, Y. Sang, H. Tian, "An efficient method for privacy-preserving trajectory data publishing based on data partitioning," The Journal of Supercomputing, 1-25, 2019.
  25. P. R. M. Rao, S. M. Krishna, A. S. Kumar, "Privacy preservation techniques in big data analytics: a survey," Journal of Big Data, 5(1), 33, 2018. https://doi.org/10.1186/s40537-018-0141-8
  26. Z. Tu, K. Zhao, F. Xu, Y. Li, L. Su, D. Jin, "Protecting trajectory from semantic attack considering k-anonymity, l-diversity, and t-closeness," IEEE Transactions on Network and Service Management, 16(1), 264-278, 2018. https://doi.org/10.1109/tnsm.2018.2877790
  27. X. Li, S. Wei, and G. Sun, "A scheme for activity trajectory dataset publishing with privacy preserved," in Proc. of UIC-ATC-ScalCom-CBDCom-IoP 2015. IEEE, pp. 247-254, 2015.
  28. M. Gramaglia, M. Fiore, A. Tarable, A. Banchs, "Preserving mobile subscriber privacy in open datasets of spatiotemporal trajectories," in Proc. of IEEE INFOCOM 2017-IEEE Conference on Computer Communications, IEEE, pp. 1-9, 2017.
  29. C. Dwork, "Differential privacy," Automata, Languages and Programming, pp. 1-12, 2006.
  30. Y. Xiao, L. Xiong, "Protecting locations with differential privacy under temporal correlations," in Proc. of the 22nd ACM SIGSAC Conference on Computer and Communications Security, ACM, pp. 1298-1309, 2015.
  31. Q. Miao, W. Jing, H. Song, "Differential privacy-based location privacy enhancing in edge computing," Concurrency and Computation: Practice and Experience, 31(8), e4735, 2019. https://doi.org/10.1002/cpe.4735
  32. K. Chatzikokolakis, C. Palamidessi, M. Stronati, "A predictive differentially-private mechanism for mobility traces," in Proc. of International Symposium on Privacy Enhancing Technologies Symposium, Springer, pp. 21-41, 2014.
