Improving efficiency of remote data audit for cloud storage

  • Fan, Kuan (School of Computer and Communication Engineering, Northeastern University) ;
  • Liu, Mingxi (School of Computer and Communication Engineering, Northeastern University) ;
  • Shi, Wenbo (School of Computer and Communication Engineering, Northeastern University)
  • Received : 2018.02.06
  • Accepted : 2018.11.13
  • Published : 2019.04.30

Abstract

Cloud storage services are a rising trend built on cloud computing, which makes remote data integrity auditing a hot topic. Existing research can audit the integrity and correctness of user data and address the problem of user privacy leakage. However, these schemes cannot achieve better auditing results with fewer data blocks. In this paper, we observe that the random sampling used in most auditing schemes does not handle well the situation in which the cloud service provider (CSP) deletes data that users rarely use, and we adopt probability proportionate to size sampling (PPS) to handle this situation. A new scheme, named improving audit efficiency of remote data for cloud storage, is designed. The proposed scheme supports public auditing with fewer data blocks and constrains the server's malicious behavior to extend the auditing cycle. Compared with the relevant schemes, the experimental results show that the proposed scheme is more effective.

1. Introduction

Cloud computing is a type of distributed computing that automatically splits a large number of computational processes into numerous smaller subprograms through a network and sends them to a large system of multiple servers for processing. After calculation and analysis, the results are transmitted back to the user. In 2011, the National Institute of Standards and Technology (NIST) defined cloud computing as a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction [1]. In 2015, the typical cloud service market represented by IaaS, PaaS and SaaS reached 52.24 billion US dollars, and it is expected to reach 143.53 billion US dollars by 2020 [2].

Although cloud computing has many advantages, there are still some debates and hesitations before it is widely adopted. Data security and privacy are the major considerations when users use cloud computing [3]. Firstly, users store personal data on the cloud server and then delete the local copy, which means that they lose control of their own data. The cloud server may intentionally delete part of the users' data for its own financial benefit [4,5]. Secondly, because of server hardware damage and other irresistible reasons, the users' data may be damaged. When the problems above occur, the cloud server may try to hide these errors and make the users believe that the data is still stored on the cloud server correctly. Thirdly, when the third party auditor (TPA) is interested in certain data of a user, it may constantly challenge these data. The TPA obtains the data by calculating over the proofs returned by the CSP, which can lead to privacy leakage of user data [6]. Thus, how to efficiently verify the correctness of outsourced data without the local copy and protect the privacy of users' data becomes a big challenge for data storage security in cloud computing. To solve this challenge, many studies present different schemes and security models [7-18]; these schemes continuously improve data auditing and strive to achieve the goals of universality, high security, strong privacy and efficiency.

In 2007, Ateniese et al. proposed the provable data possession (PDP) model [7]. The PDP protocol supports public auditability to ensure that data is securely stored on untrusted servers, but it only supports static auditing and does not support dynamic data verification. In 2008, Ateniese et al. proposed another scalable PDP protocol that supports dynamic auditability but not fully dynamic operations, such as insertions [8]. The next year, Wang et al. proposed a challenge-response protocol which can detect the correctness of the data and the location of errors, but it still cannot support fully dynamic verification [9]. In the same year, Erway et al. proposed a dynamic provable data possession scheme that extends the PDP model to support fully dynamic data [10]; however, public auditing is not supported. In 2007, Juels and Kaliski proposed the proof of retrievability (POR) protocol that allows data recovery, which not only checks data integrity but also restores the original data [11]. However, their scheme only supports static data storage. The next year, Shacham and Waters proposed an improved POR protocol that uses BLS signatures instead of RSA signatures to shorten the length of evidence in auditing, but the protocol also considers only static data [12]. In 2011, Wang et al. proposed a new protocol to solve the problem of fully dynamic public auditing based on the above work [13]. In 2013, Wang et al. [14] pointed out that the protocol of Wang et al. [13] had privacy issues: the TPA can calculate user data through the audit proofs. Wang et al. then proposed a scheme to avoid this privacy issue. As a result, dynamic public verification was basically achieved. Since then, several researchers have studied auditing efficiency, auditing settings and user key security based on the above schemes [15-19]. For example, in the dynamic auditing process, the audit data block size is fixed, which affects the efficiency of updating. In 2014, Liu et al. proposed a protocol in which audit data blocks are not fixed in size during dynamic auditing [20]. In 2017, Song et al. proposed a signature mechanism with additive homomorphic operations to handle the modification of shared data on the server during dynamic auditing [21]. This signature mechanism supports correctness and completeness verification of multi-user modifications without requiring an always-online data owner. In the same year, Fu et al. proposed a new privacy-aware public auditing mechanism to address the issue that public auditing of the integrity of shared data may reveal data owners' sensitive information [22]. In Fu et al.'s protocol, the managers of member groups can jointly generate a trace key to prevent abuse of audit rights, and a binary tree structure is constructed so that group members can trace data changes and roll back the data. In 2017, Yu et al. proposed an anti-key-compromise protocol to solve the problem of user key leakage in the auditing process [23]. At different stages of Yu et al.'s protocol, users generate different key signatures; as long as the attacker cannot obtain the current key, the attacker does not pose a threat to the security of the data. In 2017, Wang et al. put forward the incentive and unconditionally anonymous identity-based public PDP (IAID-PDP) protocol to solve the problem of user identity privacy in the auditing process [24].
Wang et al.'s scheme treats an agency as a judiciary, providing identity protection and incentives to users who provide important information. At present, most audit schemes mainly focus on auditing performance, changes of auditing roles, user privacy and other aspects based on the classic audit protocols. However, few researchers focus on selecting audit data to improve audit efficiency. Our scheme pays attention to this issue.

In the majority of audit schemes, the random method is used to extract challenge data blocks, and little attention has been paid to how to choose data to improve audit efficiency. Is the random method the most efficient way of selecting data? It is generally known that data is accessed with a certain frequency and cycle. Based on this feature, the proposed scheme uses the PPS method to select challenge data blocks. The PPS refers to probability sampling in which the population is divided into primary sampling units (PSU) of unequal capacity based on auxiliary information [25]. In multi-stage sampling, especially two-stage sampling, the probability of sampling a PSU under the PPS depends on the size of the PSU: the larger the PSU, the greater the probability of the PSU being selected, and the smaller the PSU, the smaller the probability of the PSU being drawn. For this reason, the PPS has a wide range of applications [26-28]. In the proposed scheme, the cloud server may delete the data that is not commonly used by users, so the access frequency of data is chosen as auxiliary information for dividing the users' data into PSUs. Therefore, a PSU consisting of uncommonly used data has a large scale, and the probability of picking such a PSU for challenge data blocks is very high. The experimental results show that the PPS selection applied in our scheme is more efficient than the random selection in a general audit scheme: it can generate the same auditing result as the random method with fewer challenge blocks.

In public auditing, after the TPA completes the audits, it stores many auditing results about users' data. At present, few articles consider how to use these results as feedback for improving an audit scheme's efficiency. In the proposed scheme, the TPA is an honest party that can collect the auditing results. After a period of time, the TPA can form a judgment on the cloud server's credit according to the auditing results. Announcing these auditing results can affect the reputation of the cloud server, which forces the server to improve data integrity and gives users enough confidence to increase or decrease their auditing cycle. Specifically, the contributions of this work can be summarized in the following three aspects:

1) The CSP deletes the data that users infrequently use, for economic or benefit reasons [6,7,29,30]. The PPS is used to extract data blocks as challenge data blocks. Under the same auditing conditions, the PPS has higher accuracy than common random sampling in identifying the malicious behavior of the CSP.

2) The TPA collects the auditing results of users who use the CSP's storage service. After a period of time, the TPA can estimate the credit of the CSP based on the users' auditing results, which can force the CSP to improve its service quality.

3) To illustrate the effectiveness and security of the proposed scheme, security analysis and experimental comparison are provided, showing that the scheme is indeed safe and effective.

The rest of this paper is organized as follows: Section 2 introduces some definitions. Section 3 presents the system model and the design goals. The detailed description of the proposed scheme is given in Section 4. Section 5 provides the security analysis of the proposed scheme. The evaluation of performance is shown in Section 6. Finally, the concluding remarks are given in Section 7.

2. Preliminaries

2.1 Bilinear maps

Let G1, G2 be two multiplicative cyclic groups of prime order p, and g be a generator of G1. A bilinear map \(\begin{equation} e: G_{1} \times G_{1} \rightarrow G_{2} \end{equation}\) satisfies the following properties [31]:

1) Computability: there exists an efficiently computable algorithm for computing the map \(\begin{equation} e: G_{1} \times G_{1} \rightarrow G_{2} \end{equation}\).

2) Bilinearity: for all \(\begin{equation} u, v \in G_{1} \text { and } a, b \in Z_{p}^{*}, e\left(u^{a}, v^{b}\right)=e(u, v)^{a b} \end{equation}\)

3) Non-degeneracy: \(\begin{equation} e(g, g) \neq 1 \end{equation}\)

2.2 The PPS

The PPS belongs to unequal probability sampling [25]. The PPS is described as follows [32]: Let there be N first-stage sampling units \(\begin{equation} C_{1}, C_{2}, \ldots, C_{N} \end{equation}\) of sizes \(\begin{equation} M_{1}, M_{2}, \ldots, M_{N} \end{equation}\), respectively. n first-stage sampling units are drawn from the N units. Further, let \(\begin{equation} \left(i_{1}, i_{2}, \dots, i_{n}\right) \end{equation}\) denote any combination of n integers taken from the N integers 1, 2, ..., N. In drawing the n first-stage sampling units, we employ the probability proportionate to the sum of their sizes, that is, \(\begin{equation} C_{a 1}, C_{a 2}, \ldots, C_{a n} \end{equation}\) is drawn with probability proportional to

\(\begin{equation} \sum_{i=1}^{n} M_{a i}=M_{a 1}+M_{a 2}+\ldots+M_{a n} \end{equation}\)       (1)

\(\begin{equation} \operatorname{Pr}\left(C_{a 1}, C_{a 2}, \ldots, C_{a n}\right)=\frac{\sum_{i=1}^{n} M_{a i}}{\left(\begin{array}{c} {N-1} \\ {n-1} \end{array}\right) M} \end{equation}\)       (2)

where \(\begin{equation} M=\sum_{i=1}^{N} M_{i} \end{equation}\). For the second-stage sampling, l second-stage sampling units are subsampled with a probability that is equal for any combination of l second-stage sampling units in \(\begin{equation} C_{a} \end{equation}\). The specific PPS algorithm is as follows:

As described above, the total number of units is N, and n units are drawn from the N in the first-stage sample. The sampling interval is \(\begin{equation} K=\frac{M}{n} \end{equation}\). A value R is drawn randomly between 1 and K, so the unit containing R is the first extracted unit; then the units at every K size measurements are drawn (i.e., at \(\begin{equation} R+K, R+2 K, R+3 K, \ldots, R+(n-1) K \end{equation}\)).
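
To make the interval-based first-stage selection concrete, the following minimal Python sketch (ours, for illustration only; the function and variable names are not from the paper) implements it over a list of unit sizes:

import random

def pps_first_stage(sizes, n):
    # Draw n first-stage units with probability proportionate to size,
    # using cumulative sizes and the sampling interval K = M / n.
    cum, total = [], 0
    for s in sizes:
        total += s
        cum.append(total)          # running totals M_1, M_1+M_2, ...
    K = total / n                  # sampling interval
    R = random.uniform(0, K)       # random start between 0 and K
    chosen, idx = [], 0
    for j in range(n):
        point = R + j * K          # R, R+K, R+2K, ...
        while idx < len(cum) - 1 and cum[idx] < point:
            idx += 1               # unit whose cumulative range covers the point
        chosen.append(idx)
    return chosen

# Example: with unit sizes [5, 1, 1, 8, 2], drawing 2 units strongly favors units 0 and 3.
print(pps_first_stage([5, 1, 1, 8, 2], 2))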

3. Problem statement

3.1 System model

The proposed scheme considers a cloud data storage service involving four different entities, as illustrated in Fig. 1: the cloud users, who have large amounts of data to store in the cloud and are independent of each other; the CSP, who is the cloud server administrator and has a certain amount of storage space and computing resources to serve the users' data; the TPA, which is a third party with certain computation resources and communication capability, is trusted by users and can interact with the CSP to audit user data [30]; and the public, who are using storage services or want to use storage services provided by the CSP.

 

Fig. 1. The system model

In order to improve the auditing efficiency, the user uses the PPS method to choose challenge data blocks. Therefore, in the preprocessing stage, the user sorts the data by frequency and then generates a dataset which contains data frequency and id information for the PPS. The data frequency is chosen as auxiliary information to divide the outsourced data into a number of PSUs. Meanwhile, the user generates homomorphic verifiable tags and sends these tags and the outsourced data blocks to the CSP. When the user initiates an audit request, he uses the PPS to select a number of challenge data blocks' ids, which are sent to the TPA along with an audit authorization. Then, the TPA sends an auditing challenge together with the audit authorization to the CSP. After receiving these messages, the CSP checks the legitimacy of the TPA's authorization. If valid, the CSP responds to the TPA with a proof; otherwise, it does not. Finally, the TPA checks the correctness of the proof, sends the auditing result to the user, and then stores the auditing result. In order to reduce the calculation pressure of the auditing system, the TPA counts the auditing results and publishes them regularly, which can constrain the CSP's malicious behavior and help the public make decisions such as selecting storage servers and determining auditing cycles.

3.2 Threat model

In this work, the CSP is assumed to be "semi-honest" [7,33]. That means the CSP will forge proofs when the outsourced data to be audited is destroyed. The TPA is considered to be "honest-but-curious" [33]. That is to say, it audits the data correctly and collects and publishes the auditing results honestly, but it is curious about users' data. There are two types of attackers: 1) an internal attacker, referring to the CSP; 2) a privacy attacker, referring to the TPA. In this paper, three potential security threats are considered:

1) Data corruption: The adversaries, for their own benefit, might neglect to keep or deliberately delete infrequently accessed data owned by cloud users. Their goal is to corrupt users' data without being detected by the TPA. The adversaries could be the CSP or other internal attackers.

2) Forgery attack: The adversaries may destroy some data of a user and forge the data tags without knowing the user's private key, in order to maintain their reputation. The adversaries could be the CSP or other internal attackers.

3) Privacy disclosure: The adversaries may infer some users' data from the proofs during the auditing process. The adversaries aim at getting users' data without informing the users. The adversaries could be the TPA or other privacy attackers.

3.3 Design goals

To efficiently check the integrity of user data, the proposed scheme should be designed to achieve the following properties:

1) Public auditability: to allow the TPA to verify the correctness of users' data stored on the cloud without a copy of the whole data, which reduces the users' computational burden.

2) Storage correctness: to ensure that the CSP that does not store users’ data cannot pass the auditing.

3) Privacy-preserving: to ensure that there is no way for TPA to infer users’ data content from the proofs during the auditing process.

4) Effectiveness: to ensure that the PPS method can detect malicious behavior of the CSP using fewer data blocks than the random method. The TPA is allowed to publish the statistical data calculated from the auditing results to reduce the CSP's malicious behavior.

4. The proposed scheme

4.1 Description of the Scheme

Fig. 2 is the algorithm flow chart of the scheme; the detailed algorithms are as follows:

 

Fig. 2. Algorithm flow chart

1) Initialization parameters

Let G1 and G2 be two multiplicative cyclic groups of prime order p. The global security parameters of the proposed scheme are \(\begin{equation} \left(G_{1}, G_{2}, e, p, g, u, H\right) \end{equation}\), where \(\begin{equation} e: G_{1} \times G_{1} \rightarrow G_{2} \end{equation}\) is the bilinear map introduced in the preliminaries, g is a random generator of G1, u is a random element of G1, and \(\begin{equation} H:\{0,1\}^{*} \rightarrow G_{1} \end{equation}\) is a secure one-way hash function mapping arbitrary strings to elements of G1.

2) Algorithm pretreat() :

The pretreatment consists of two algorithms. The first algorithm generates a new dataset Mlocal that is kept locally for the PPS. Mlocal consists of the frequency and id of each of the user's data blocks. The input is \(\begin{equation} M=\left\{m_{1}, m_{2}, \ldots, m_{n}\right\} \end{equation}\), the data that will be uploaded to the CSP by the user; the output is Mlocal. The algorithm is as follows:

Algorithm 1 the dataset for PPS
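
The listing of Algorithm 1 is not reproduced in this copy. As a hedged sketch of what it computes (the Python form and field names are our assumption, not the authors' code):

def build_m_local(num_blocks, access_freq):
    # Sketch of Algorithm 1: build M_local = [(block_id, frequency), ...],
    # the dataset the user keeps locally for PPS sampling.
    # access_freq[i] is the (estimated) access frequency of block m_i.
    m_local = [(i, access_freq[i]) for i in range(num_blocks)]
    m_local.sort(key=lambda item: item[1])   # rarely used blocks first
    return m_local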

 

The second algorithm generates a cumulative table Tcum for the PPS. The frequency is sampled from this cumulative table using the PPS. The inputs are the minimum frequency fmin, the maximum frequency fmax and the frequency interval interv; the output is Tcum. The algorithm is as follows:

Algorithm 2 cumulative Table for PPS
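
The listing of Algorithm 2 is likewise missing here; a plausible Python sketch of the cumulative table construction (the row layout is our assumption) is:

def build_cum_table(m_local, f_min, f_max, interv):
    # Sketch of Algorithm 2: group the blocks of M_local into PSUs by
    # frequency range [low, low + interv) and record the cumulative block
    # count, which serves as the PPS size measure in T_cum.
    t_cum, cumulative, low = [], 0, f_min
    while low <= f_max:
        ids = [bid for bid, f in m_local if low <= f < low + interv]
        cumulative += len(ids)
        t_cum.append({"range": (low, low + interv),
                      "ids": ids,
                      "size": len(ids),
                      "cum": cumulative})
        low += interv
    return t_cum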

 

The table Tcum generated by Algorithm 2 is as follows:

Table 1. Cumulative Table Tcum

 

3) Algorithm keyGen \(\begin{equation} \left(1^{k}\right) \end{equation}\):

The user first chooses a random value \(\begin{equation} \alpha \leftarrow Z_{p} \end{equation}\) as his private key \(\begin{equation} s k_{u} \end{equation}\), then he computes \(\begin{equation} v \leftarrow g^{\alpha} \end{equation}\) as his public key \(\begin{equation} p k_{u} \end{equation}\). After that, the user uploads his public key \(\begin{equation} p k_{u} \end{equation}\) to the TPA and stores his private key \(\begin{equation} s k_{u} \end{equation}\) locally. The CSP generates his own key pair \(\begin{equation} \left\{s k_{c s p}, p k_{c s p}\right\} \end{equation}\), sends his public key \(\begin{equation} p k_{c s p} \end{equation}\) to the TPA, and keeps his private key \(\begin{equation} s k_{c s p} \end{equation}\) locally.

4) Algorithm \(\begin{equation} \operatorname{sig} \operatorname{Gen}\left(M, s k_{u}\right): \end{equation}\)

For each \(\begin{equation} m_{i} \in M(i \in[1, n]) \end{equation}\), the user computes the signature \(\begin{equation} \sigma_{i}=\left(H\left(m_{i}\right) u^{m_{i}}\right)^{\alpha} \end{equation}\) with his private key α and collects the signature set \(\begin{equation} \Phi=\left\{\sigma_{i}\right\}_{1 \leq i \leq n} \end{equation}\). The user generates the tag \(\begin{equation} \operatorname{tag}=\text { name }\|\mathrm{n}\| u \| \operatorname{sig}_{s k_{u}}(\text { name }\|\mathrm{n}\| u) \end{equation}\) of M, where \(\begin{equation} \text { name } \in Z_{P}^{*} \end{equation}\) is a random value chosen as the identifier of M by the user. The user sends \(\begin{equation} \{M, \Phi, t a g\} \end{equation}\) to the CSP and deletes M and Φ from his local storage.

5) Algorithm \(\begin{equation} \operatorname{sign} T P A\left(s k_{u}, t a g\right) \end{equation}\) :

The user asks the TPA for its ID in order to grant it audit permission. The TPA returns PID, which is the ciphertext of its ID encrypted with the CSP's public key. The user computes \(\begin{equation} \operatorname{sig}_{T P A}=\operatorname{sig}_{s k_{u}}(A U T H\|P I D\| \operatorname{tag}) \end{equation}\), where AUTH is a random value selected by the user. Then, the user sends AUTH to the CSP and \(sig _{T P A}\) to the TPA.

6) Algorithm \(\begin{equation} \text { PPSample }\left(M_{\text {local }}, T_{\text {cum }}, n, m\right) \end{equation}\):

The proposed scheme samples n units using the PPS from the cumulative table Tcum in the first stage. Normally, these n units have lower frequencies. In the second stage, m elements are drawn from every unit using random sampling. Therefore, a total of \(\begin{equation} n \times m \end{equation}\) elements are sampled. The inputs are \(\begin{equation} M_{\text {local}}, T_{\text {cum}} \end{equation}\), n and m. The output is a dataset IChal that contains the ids of the \(\begin{equation} n \times m \end{equation}\) elements. The algorithm is as follows:

Algorithm 3 PPSample
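
The listing of Algorithm 3 is also missing from this copy. The Python sketch below illustrates the two-stage sampling described above; it reuses the t_cum structure from the Algorithm 2 sketch and is our assumption about the implementation, not a transcription of it:

import random

def pps_sample(t_cum, n, m):
    # Sketch of Algorithm 3: stage one draws n PSUs from T_cum with
    # probability proportionate to PSU size (systematic selection over the
    # cumulative counts); stage two draws m block ids uniformly at random
    # from each selected PSU. Returns the challenge id set IChal.
    total = t_cum[-1]["cum"]
    K = total / n
    R = random.uniform(0, K)
    units, idx = [], 0
    for j in range(n):
        point = R + j * K
        while idx < len(t_cum) - 1 and t_cum[idx]["cum"] < point:
            idx += 1
        units.append(t_cum[idx])
    i_chal = []
    for unit in units:
        k = min(m, len(unit["ids"]))              # guard against small PSUs
        i_chal.extend(random.sample(unit["ids"], k))
    return i_chal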

 

7) Algorithm \(\begin{equation} \text { challenge }\left(p k_{c s p}, s i g_{T P A}, I C h a l\right) \end{equation}\):

According to the audit requirements in the public auditing process, the TPA chooses random coefficients \(\begin{equation} v_{i} \leftarrow Z_{p} \end{equation}\) for \(\begin{equation} i \in I C h a l \end{equation}\), and then it generates the challenge message \(\begin{equation} \text {chal}=\left\{\operatorname{sig}_{T P A},\{P I D\}_{p k_{\mathrm{csp}}},\left\{i, v_{i}\right\}_{i \in I C h a l}\right\} \end{equation}\). The TPA sends the challenge message to the CSP.

8) Algorithm \(\begin{equation} \text { proof }(\operatorname{chal}, \Phi, M): \end{equation}\)

Upon receiving the challenge message chal, the CSP decrypts \(\begin{equation} \{P I D\}_{p k_{\mathrm{csp}}} \end{equation}\) with its private key \(\begin{equation} s k_{c s p} \end{equation}\), then it uses AUTH, PID, tag and the user's public key \(\begin{equation} p k_{u} \end{equation}\) to verify whether this TPA is indeed authorized by the user by checking \(\begin{equation} e\left(\operatorname{sig}_{T P A}, g\right)=e(H(A U T H\|P I D\| \operatorname{tag}), v) \end{equation}\), since \(\begin{equation} \operatorname{sig}_{T P A}=(H(A U T H\|P I D\| \operatorname{tag}))^{\alpha} \end{equation}\). If the equation holds, the CSP computes \(\begin{equation} \mu=\sum_{i \in I C h a l} v_{i} m_{i} \in Z_{P} \end{equation}\) and \(\begin{equation} \sigma=\prod_{i \in I C h a l} \sigma_{i}^{v_{i}} \end{equation}\), where IChal is the index set drawn by the PPS. Then the CSP responds to the TPA with the proof \(\begin{equation} P=\left\{\mu, \sigma,\left\{H\left(m_{i}\right)\right\}_{i \in I C h a l}\right\} \end{equation}\).
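
The aggregation in this step can be sketched compactly. The Python fragment below models the G1 elements as integers modulo p purely to show the structure of the computation (a real implementation would use a pairing library such as PBC, as in Section 6):

def aggregate_proof(chal, blocks, sigmas, p):
    # Schematic proof aggregation: chal is the list of (i, v_i) pairs from
    # the challenge; blocks[i] = m_i and sigmas[i] = sigma_i are modeled as
    # integers mod p. Returns (mu, sigma) = (sum v_i*m_i, prod sigma_i^v_i).
    mu, sigma = 0, 1
    for i, v_i in chal:
        mu = (mu + v_i * blocks[i]) % p
        sigma = (sigma * pow(sigmas[i], v_i, p)) % p
    return mu, sigma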

9) Algorithm \(\begin{equation} \text {verify}\left(p k_{u}, \text {chal}, P\right): \end{equation}\)

The TPA checks the received proof:

\(\begin{equation} e(\sigma, g)=e\left(\prod_{i \in I C h a l} H\left(m_{i}\right)^{v_{i}} \cdot u^{\mu}, v\right) \end{equation}\)       (3)

If the equation holds, the algorithm returns TRUE, and the TPA believes that the data stored on the cloud is complete; otherwise, it returns FALSE. The TPA sends the result to the user.

10) Algorithm collect (auditingResults) :

The TPA collects the auditing results of each user of the CSP's storage service and classifies the auditing results according to the number of challenge data blocks. After a period of time, the TPA obtains statistics about the storage behavior of the CSP.
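
A minimal sketch of this bookkeeping (the log format is our assumption) is:

from collections import defaultdict

def summarize_results(audit_log):
    # Sketch of the TPA's statistics: audit_log is a list of
    # (num_challenge_blocks, passed) tuples. For each challenge size the
    # fraction of failed audits is the detected error probability, and the
    # implied CSP credit is its complement (the two sum to 100%).
    buckets = defaultdict(lambda: [0, 0])            # size -> [failed, total]
    for c, passed in audit_log:
        buckets[c][1] += 1
        if not passed:
            buckets[c][0] += 1
    return {c: {"error_probability": failed / total,
                "csp_credit": 1 - failed / total}
            for c, (failed, total) in buckets.items()}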

5. Security analysis

The security of the proposed scheme is discussed from several aspects:

Theorem 1. When the user data are not faithfully stored in the cloud, the CSP cannot generate a valid proof P that passes the TPA's audit.

Proof: Based on the properties of bilinear maps, the correctness of the proposed scheme can be proved through the verification equation, deducing the right-hand side from the left-hand side:

\(\begin{equation} \begin{aligned} e(\sigma, g) &=e\left(\prod_{i \in I C h a l} \sigma_{i}^{v_{i}}, g\right) \\ &=e\left(\prod_{i \in I C h a l}\left(H\left(m_{i}\right) \cdot u^{m_{i}}\right)^{\alpha v_{i}}, g\right) \\ &=e\left(\prod_{i \in I C h a l}\left(H\left(m_{i}\right) \cdot u^{m_{i}}\right)^{v_{i}}, g^{\alpha}\right) \\ &=e\left(\prod_{i \in I C h a l} H\left(m_{i}\right)^{v_{i}} \cdot u^{\sum_{i \in I C h a l} v_{i} m_{i}}, v\right) \\ &=e\left(\prod_{i \in I C h a l} H\left(m_{i}\right)^{v_{i}} \cdot u^{\mu}, v\right) \end{aligned} \end{equation}\)

 

Therefore, as long as the cloud server faithfully stores the user data, the audit equation will be valid.

Theorem 2. In the authorization process, the TPA cannot receive the CSP's response with an integrity proof P unless the TPA obtains the authorization from the user.

Proof: The TPA cannot forge the audit authorization \(\begin{equation} \operatorname{sig}_{T P A} \end{equation}\) by itself in the proposed scheme.

The TPA does not know the private key \(\begin{equation} s k_{u} \end{equation}\) that generated the authorized signature. If the TPA forged the authorization, the CSP would detect it during the bilinear map verification using the user-generated public key \(\begin{equation} p k_{u} \end{equation}\). The CSP will not generate a proof as long as the TPA's authorization is not validated.

Theorem 3. In the process of selecting the challenge data blocks, the PPS method is better than simple random sampling in dealing with the situation where the CSP deletes data that the user has not used for a long time.

Proof: The proof consists of two steps. First, it is necessary to prove that the effect of the PPS is the same as that of simple random sampling if the sizes of the groups are the same or similar.

Assume there are n pieces of data and m pieces of data are extracted from the n. The probability of simple random sampling is \(\begin{equation} \frac{m}{n} \end{equation}\). In the PPS, the data is divided into a groups, each group containing \(\begin{equation} \frac{n}{a} \end{equation}\) pieces of data. Sampling proceeds in two stages: in the first stage, b groups are extracted from the a groups, and in the second stage \(\begin{equation} \frac{m}{b} \end{equation}\) pieces of data are extracted from each of the b groups. So the probability of extracting a given piece of data can be expressed as \(\begin{equation} p(\alpha \beta)=p(\alpha) p(\beta | \alpha) \end{equation}\), where \(\begin{equation} p(\alpha) \end{equation}\) is the probability that the piece's group is among the b groups extracted from the a groups, and \(\begin{equation} p(\beta | \alpha) \end{equation}\) is the conditional probability that the piece is among the \(\begin{equation} \frac{m}{b} \end{equation}\) pieces extracted in the second stage, given that its group has been extracted in the first stage:

\(\begin{aligned} p(\alpha \beta) &=p(\alpha) p(\beta | \alpha) \\ &=\frac{b}{a} \times \frac{\frac{m}{b}}{\frac{n}{a}} \\ &=\frac{m}{n} \end{aligned}\)

Thus, if every group has the same amount of data, the probability of the PPS drawing a given piece of data is the same as that of simple random sampling.

Secondly, it is necessary to prove that the PPS tends to extract large groups instead of small ones if the group sizes are different. Assume there are n pieces of data and m pieces of data are extracted from the n, so the probability of sampling a given piece is \(\begin{equation} p(\alpha \beta)=\frac{m}{n} \end{equation}\). The number of groups drawn in the first stage is fixed at b; then, in the ideal state, c pieces of data are extracted from each group regardless of the group's size. So \(\begin{equation} m=b \times c \end{equation}\) and \(\begin{equation} p(\beta | \alpha)=\frac{c}{k_{i}} \end{equation}\), where \(\begin{equation} k_{i} \end{equation}\) denotes the size of the group. According to the formula \(\begin{equation} p(\alpha \beta)=p(\alpha) p(\beta | \alpha) \end{equation}\), \(\begin{equation} p(\alpha) \end{equation}\) can be calculated:

\(\begin{aligned} p(\alpha) &=\frac{p(\alpha \beta)}{p(\beta | \alpha)} \\ &=\frac{m / n}{c / k_{i}} \\ &=\frac{b \times c / n}{c / k_{i}} \\ &=\frac{b \times k_{i}}{n} \end{aligned}\)

Thus, if a group contains more data, the probability that the PPS method samples data from this group is relatively large.

The first step of the proof shows that the probability of the PPS extracting a piece of data is the same as that of simple random sampling if the amount of data in each group is the same; the second step means that, under the same overall probability, the PPS prefers groups with more data during the sampling process. So compared with simple random sampling, the PPS prefers groups with more data and extracts data from them.

6. Performance evaluation

6.1 Experimental results

In order to evaluate the efficiency of the proposed scheme, the experiments are conducted on a Linux server with a 2.7 GHz CPU and 4 GB of memory running Ubuntu. In the experiments, the GNU Multiple Precision Arithmetic (GMP) [34] and Pairing-Based Cryptography (PBC) [35] libraries are used to implement the proposed audit algorithm. The PPS algorithm is implemented in Python and the rest is based on C. The experiment employs an MNT d159 curve, which has a 160-bit group order; thus |p| is 160 bits. The size of each data block is the same, 167 bits.

Pretreatment and authenticator generation: According to Section 4.1, the user has to pre-process the data blocks, including grouping the data blocks by frequency and generating data signatures, and then send the processed dataset to the CSP. To show that the pretreatment incurs little extra computation overhead, the following experiment is designed. Fig. 3 shows the time for grouping data blocks and generating authenticators with different numbers of data blocks. The time spent by the proposed scheme is compared with the SW scheme [12], which does not support data grouping by frequency. From Fig. 3, it can be seen that the time of our scheme is basically the same as that of the SW scheme. For 10,000 data blocks, our scheme spends 28.221 seconds, while the SW scheme spends 27.8 seconds. The time difference between the proposed scheme and the SW scheme ranges from 0.01 seconds to 0.42 seconds for 1,000 to 100,000 data blocks.

 

Fig. 3. Computation cost of grouping and generating authenticators

Performance of auditing: As mentioned in Section 4.1, when users want to audit data stored on the CSP, they use the PPS method to select a certain number of data ids from the dataset they keep and then authorize the TPA, to prevent disclosure of user privacy. To evaluate the auditing computation overhead of the proposed scheme, three diagrams are illustrated: sampling time, proof generation time and proof verification time. Fig. 4 shows the time of extracting 100-1000 data blocks from 1 million data blocks. The time spent on the PPS is compared with the random sampling that is used in most auditing schemes. Because the PPS has two sampling stages, it spends more time than random sampling. The time difference between the PPS and random sampling is 0.19 to 0.2 seconds for 100 to 1000 data blocks. Fig. 5 shows the time spent on generating proofs with different numbers of data blocks, compared with the SW scheme. It can be seen that our scheme's time is close to the SW scheme's. Fig. 6 shows the time of verifying the proof. From Fig. 6, it can be seen that the time of our scheme is basically the same as that of the SW scheme; the time difference between our scheme and the SW scheme is 0.001 to 0.05 seconds for 100 to 1000 data blocks.

 

Fig. 4. Computation cost of sample

 

Fig. 5. Computation cost of proof generation

 

Fig. 6. Computation cost of audit proof

Probability of detecting CSP data deletion: Assume that a user stores n data blocks in the CSP and the CSP deletes l data blocks. Let c be the number of challenge data blocks. Let X be a discrete random variable defined as the number of challenged data blocks that detect the CSP's deletion of user data.

PX is the probability that at least one of the data blocks picked by the c challenges matches one of the data blocks deleted by the CSP. So:

\(\begin{equation} P_{x}=P\{X \geq 1\}=1-P\{X=0\}=1-\frac{n-l}{n} \cdot \frac{n-1-l}{n-1} \cdot \frac{n-2-l}{n-2} \ldots \ldots \frac{n-c+1-l}{n-c+1} \end{equation}\)

since, 

\(\begin{equation} \frac{n-j-l}{n-j} \geq \frac{n-j-1-l}{n-j-1} \end{equation}\)

 it follows that:

\(\begin{equation} 1-\left(\frac{n-l}{n}\right)^{c} \leq P_{X} \leq 1-\left(\frac{n-c+1-l}{n-c+1}\right)^{c} \end{equation}\)

PX is the probability that, if the CSP deletes l of the user's n data blocks, the TPA will detect the CSP's malicious behavior with c challenges. When l is a fixed fraction of n, the TPA can detect the CSP's misbehavior with a certain probability by asking for proofs of a certain number of data blocks, independently of the total number of data blocks n: e.g., if l = 1% of n, 460 and 300 data blocks must be selected by the TPA in order to achieve a PX of at least 99% and 95%, respectively [7].
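
PX is easy to evaluate numerically; the short Python sketch below (ours, for illustration) computes the exact value for uniformly random challenges:

def detection_probability(n, l, c):
    # P_X = 1 - P{X = 0}: probability that at least one of c randomly
    # chosen challenge blocks hits one of the l deleted blocks out of n.
    p_miss = 1.0
    for j in range(c):
        p_miss *= (n - j - l) / (n - j)
    return 1.0 - p_miss

# With l = 1% of n, roughly 300 and 460 challenges reach about 95% and 99%:
print(detection_probability(10000, 100, 300))   # ~0.95
print(detection_probability(10000, 100, 460))   # ~0.99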

Assume n is 10,000 data blocks and the CSP randomly deletes 100 data blocks, which is 1% of n. The PPS and random sampling are used to extract the challenge data blocks c. From Fig. 7, it can be seen that the probability of detecting the CSP's malicious behavior by the PPS is not significantly different from the probability of detection by random sampling. When c=100, the probability using random sampling is 58% and the probability using the PPS is 61%; when c=400, the probabilities using random sampling and the PPS are 97% and 98% respectively; when c=500, both probabilities reach 100%. We can conclude that there is no significant difference between the detection probability of the PPS and that of random sampling when the CSP deletes user data at random.

 

Fig. 7. Comparison of the error probability between PPS and random sampling under the condition of the CSP deleting data randomly

Assume n is 10,000 data blocks and the CSP deletes 100 infrequently used data blocks, which is 1% of n. PPS sampling and random sampling are used to extract the challenge data blocks c. Fig. 8 shows the probability of detecting the CSP's malicious behavior using the PPS and random sampling. It can be seen that the probability using random sampling is 57% and using the PPS is 78% in the case of c=100. When c=300, the probability using random sampling is 94%, while the probability using the PPS is 99%. When c=460, the probability using random sampling reaches 99%; however, already at c=350 the probability using the PPS is 100%. Obviously, if the CSP deletes infrequently used data, the probability of detecting the CSP's malicious behavior by the PPS is higher than that by random sampling. Specifically, the proposed scheme can use 300 challenge data blocks instead of 460. According to Fig. 4, Fig. 5 and Fig. 6, the computation costs of the PPS, proof generation and proof verification are 0.231 seconds, 0.115 seconds and 0.679 seconds in the proposed scheme when c=300, while the computation costs of sampling, proof generation and proof verification are 0.037 seconds, 0.125 seconds and 0.91 seconds in the SW scheme.

 

Fig. 8. Comparison of the error probability between PPS and random sampling under the condition of the CSP deleting infrequently used data

The TPA collects auditing results: The TPA collects auditing results from many users and publishes the aggregated auditing results periodically. Assume the CSP deletes 85% of the users' infrequently used data, the proportion of deleted data for each user is 0.1%-1% of that user's total data, and the credit of the CSP is therefore 15%. In this experiment, 200 users are selected, the CSP deletes 170 users' data, and the number of challenge data blocks ranges from 100 to 1000. The TPA collects the auditing results of these 200 users. Table 2 shows the auditing results collected by the TPA. With the increase of challenge data blocks, the probability of detecting the CSP's data deletion gradually increases. When the number of challenge data blocks is 1000, the probability is 83.5%, which is close to the assumed CSP deletion probability of 85%. The CSP's credit is related to the error probability; their sum is 100%. When the deleted data is between 0.1% and 1%, the probability of detecting the error is given in Table 3. Under the condition of 300 challenge data blocks, the detection accuracy gradually improves as the amount of deleted data increases. This is the reason the error-detection rate is 69.5%, not 99%, with 300 challenge data blocks.

Table 2. The error probability and credit of the CSP with different challenge blocks

 

Table 3. The check error probability with different ratios of deleted data

 

7. Conclusion

In this paper, we propose a scheme for improving audit efficiency for cloud storage. The PPS is used to extract the challenge data blocks to handle the situation in which the CSP deletes the infrequently used data of users. The proposed scheme uses fewer challenge data blocks than the ordinary auditing scheme to achieve the same auditing result. Considering that the CSP attaches great importance to its reputation, the TPA collects the auditing results of users and publishes the statistical results regularly. This behavior constrains the server's malicious behavior and extends the auditing cycle, which can reduce the computing pressure of the whole system. The experimental results demonstrate the high efficiency of the proposed scheme.

Acknowledgement

The authors thank the editors and the anonymous reviewers for their valuable comments. This research was supported by the National Natural Science Foundation of China (Grant Nos. 61472074 and U1708262) and the Fundamental Research Funds for the Central Universities (No. N172304023).

References

  1. Mell P, Grance T, "The NIST definition of cloud computing[J]," Communications of the Acm, vol. 5(6), pp. 50-50, 2011.
  2. https://en.wikipedia.org/wiki/Cloud_computing
  3. Fox, Armando, et al., "Above the clouds: A berkeley view of cloud computing," Dept. Electrical Eng. and Comput. Sciences, University of California, Berkeley, Rep. UCB/EECS 28, vol. 13 (2009), 2009.
  4. Customer Presentations on Amazon Summit Australia, Sydney, 2012, accessed on: March 25, 2013.
  5. D. Zissis and D. Lekkas, "Addressing Cloud Computing Security Issues," Future Gen. Comput. Syst., vol. 28, no. 3, pp. 583-592, Mar. 2011. https://doi.org/10.1016/j.future.2010.12.006
  6. M. A. Shah, R. Swaminathan, and M. Baker, "Privacy-preserving audit and extraction of digital contents," Cryptology ePrint Archive, Report 2008/186, 2008.
  7. G. Ateniese, R. Burns, R. Curtmola, J. Herring,L. Kissner, Z. Peterson, and D. Song, "Provable data possession at untrusted stores," in Proc. of the 14th ACM Conference on Computer and Communications Security, pp. 598-609, Virginia, USA, 2007.
  8. G. Ateniese, R. D. Pietro, L. V. Mancini, and G. Tsudik, "Scalable and efficient provable data possession," in Proc. of the 4th International Conference on Security and Privacy in Communication Netowrks, pp. 9:1-9:10, Istanbul, Turkey, 2008.
  9. C. Wang, Q. Wang, K. Ren, and W. Lou, "Ensuring data storage security in cloud computing," in Proc. of the 17th International Workshop on Quality of Service (IWQoS'09), pp. 1-9, South Carolina, USA, 2009.
  10. C. Erway, A. K, C. Papamanthou, and R. Tamassia, "Dynamic provable data possession," in Proc. of the 16th ACM Conference on Computer and Communications Security, pp. 213-222, Illinois, USA, 2009.
  11. A. Juels and J. Burton S. Kaliski, "Pors: Proofs of retrievability for large files," in Proc. of the 14th ACM Conference on Computer and Communications Security, pp.584-597, Virginia, USA, 2007.
  12. H. Shacham and B. Waters, "Compact proofs of retrievability," in Proc. of the 14th International Conference on the Theory and Application of Cryptology and Information Security (ASIACRYPT'08), pp 90-107, Melbourne, Australia, 2008.
  13. Q. Wang, C. Wang, K. Ren, W. Lou, and J. Li, "Enabling public auditability and data dynamics for storage security in cloud computing," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 5,pp. 847-859, 2011. https://doi.org/10.1109/TPDS.2010.183
  14. C. Wang, S. S. M. Chow, Q. Wang, K. Ren, and W. Lou, "Privacy-preserving public auditing for secure cloud storage," IEEE Transactions on Computers, vol. 62, no. 2, pp. 362-375, 2013. https://doi.org/10.1109/TC.2011.245
  15. Wang J, Chen X, Huang X, et al., "Verifiable Auditing for Outsourced Database in Cloud Computing[J]," IEEE Transactions on Computers, vol. 64, no. 11, pp. 3293-3303, 2015. https://doi.org/10.1109/TC.2015.2401036
  16. C. Liu, C. Yang, X. Zhang, and J. Chen, "External integrity verification for outsourced big data in cloud and iot: A big picture," Future Generation Computer Systems, vol. 49, no. 6, pp. 58-67, 2015. https://doi.org/10.1016/j.future.2014.08.007
  17. J. Yu, R. Hao, H. Zhao, M. Shu, and J. Fan, "IRIBE: Intrusion-Resilient Identity-Based Encryption," Information Sciences, vol. 329, pp. 90-104, 2016. https://doi.org/10.1016/j.ins.2015.09.020
  18. Y. Zhang and M. Blanton, "Efficient Dynamic Provable Possession of Remote Data via Balanced Update Trees," in Proc. of Department of the 8th ACM SIGSAC symposium on Information, computer and communications security. ACM, pp.183-194, 2013.
  19. W.Shen, J.Yu,G. Yang, Y.Zhang, Z.Fu, and R.Hao. "Access-Authorizing and Privacy-Preserving Auditing with Group Dynamic for Shared Cloud Data," KSII Transactions on internet and information systems, vol. 10, no. 7, pp. 3319-3338, Jul. 2016. https://doi.org/10.3837/tiis.2016.07.025
  20. C. Liu, J. Chen, L. T. Yang, X. Zhang, C. Yang,R. Ranjan, and K. Ramamohanarao, "Authorized public auditing of dynamic big data storage on cloud with efficient verifiable fine-grained updates," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 9, pp. 2234-2244, 2014. https://doi.org/10.1109/TPDS.2013.191
  21. W. Song, B. Wang, Q. Wang, Z. Peng, and W. Lou, "Tell me the truth: Practically public authentication for outsourced databases with multiuser modification," Inf. Sci., vol. 387, pp. 221-237, May 2017. https://doi.org/10.1016/j.ins.2016.07.031
  22. A.Fu,S.Yu, Y. Zhang, H. Wang, and C. Huang, "NPP: A New Privacy-Aware Public Auditing Scheme for Cloud Data Sharing with Group Users," IEEE Transactions on Big Data, pp. (99)1-1, 2017.
  23. J.Yu, K. Ren, C. Wang, "Enabling Cloud Storage Auditing With Key-Exposure Resistance," IEEE transactions on information forensics and security, vol. 11, no. 6, pp. 1362-1375, june 2016. https://doi.org/10.1109/TIFS.2016.2528500
  24. H.Wang, D. He, J.Yu, Z.Wang, "Incentive and Unconditionally Anonymous Identity-Based Public Provable Data Possession," IEEE Transactions on Services Computing, pp. (99)1-1, 2016.
  25. Cochran W G, "Sampling Techniques," New York, John Wiley & Sons, 1963.
  26. A.J.R. Cotter, G. Course, S.T. Buckland, C. Garrod, "A PPS sample survey of English fishing vessels to estimate discarding and retention of North Sea cod,haddock, and whiting," Fisheries Research, vol. 55, no. 1-3, pp. 25-35, 2002. https://doi.org/10.1016/S0165-7836(01)00306-X
  27. Myint T, Htoon M T, Shwe T, "Estimation of leprosy prevalence in Bago and Kawa townships using two-stage probability proportionate to size sampling technique.[J]," International Journal of Epidemiology, vol. 21, no. 41, pp. 778-783, 1992. https://doi.org/10.1093/ije/21.4.778
  28. Hoogduin L A, Manager S, Statistician, et al., "Modified Sieve Sampling: A Method for Single-and Multi-Stage Probability-Proportional-to-Size Sampling[J]," Auditing A Journal of Practice & Theory, vol. 29, no. 1, pp. 125-148, 2010. https://doi.org/10.2308/aud.2010.29.1.125
  29. Q. Wang, C. Wang, J. Li, K. Ren, and W. Lou, "Enabling public verifiability and data dynamics for storage security in cloud computing," in Proc. of ESORICS'09, Saint Malo, France, pp. 355-370, Sep. 2009.
  30. C. Wang, Q. Wang, K. Ren, and W. Lou, "Privacy-Preserving Public Auditing for Data Storage Security in Cloud Computing," in Proc. of IEEE INFOCOM, pp. 1-9, 2010.
  31. D. Boneh, B. Lynn, and H. Shacham, "Short signatures from the weil pairing," in Proc. of the 7th International Conference on the Theory and Application of Cryptology and Information Security: Advances in Cryptology (ASIACRYPT'01), pp. 514-532, Gold Coast, Australia, 2001.
  32. Midzuno H, "On the sampling system with probability proportionate to sum of sizes[J]," Annals of the Institute of Statistical Mathematics, vol. 3, no. 1, pp. 99-107, 1951. https://doi.org/10.1007/BF02949779
  33. WF Hsien , CC Yang , MS Hwang, "A Survey of Public Auditing for Secure Data Storage in Cloud Computing," International Journal of Network Security, vol.18, no.1, pp.133-142, Jan. 2016.
  34. Free Software Foundation, The GNU multiple precision arithmetic library, 2015.
  35. B. Lynn, The pairing-based cryptographic library, 2015.