The Novel Efficient Dual-field FIPS Modular Multiplication

  • Zhang, Tingting (School of Computer and Information, Anhui Normal University) ;
  • Zhu, Junru (School of Computer and Information, Anhui Normal University) ;
  • Liu, Yang (School of Computer and Information, Anhui Normal University) ;
  • Chen, Fulong (School of Computer and Information, Anhui Normal University)
  • Received : 2019.06.03
  • Accepted : 2019.11.13
  • Published : 2020.02.29

Abstract

Modular multiplication is the key operation of public-key cryptosystems such as RSA (Rivest-Shamir-Adleman) and ECC (Elliptic Curve Cryptography). However, the efficiency of modular multiplication, especially the modular square, is very low. To reduce operation cycles and power consumption and to improve the efficiency of public-key cryptosystems, a dual-field efficient FIPS (Finely Integrated Product Scanning) modular multiplication algorithm is proposed. The algorithm makes full use of the correlation of the data when the two operands are equal so as to avoid redundant operations. The experimental results show that the modular square is 23.8% faster than with the traditional algorithm: the multiplication and addition operations are reduced by about \((s^2 - s)/2\), and the read operations by about \(s^2 - s\), where \(s = n/32\) for n-bit operands. In addition, since the algorithm supports length-scalable, dual-field modular multiplication, applications focused on either performance or cost can be served by adjusting the relevant parameters.

1. Introduction

Cryptography provides a strong support for data security with encryption and decryption. In particular, the public key cryptography represented by RSA (Rivest-Shamir-Adleman) and ECC (Elliptic Curve Cryptography) is the core component of PKI/CA system. It solves the critical issues such as key distribution and identity recognition of symmetric cryptography, and ensures data confidentiality, integrity and authenticity of both parties in the e-commerce, e-government, e-bank and social networks applications [1]. Compared with RSA, ECC has a shorter key length at the same security strength, and has multiple advantages in a variety of wireless devices, smart cards and other resource-constrained device applications.

In this work, we study the design of an efficient dual-field FIPS (Finely Integrated Product Scanning) modular multiplication that supports both the RSA and SM2 cryptosystems in prime and binary fields. It exploits the case of equal operands in the modular multiplication of public-key algorithms: it reduces the multiplication, addition and read operations in that case, improves the efficiency of modular multiplication, decreases the power consumption, and provides a unified dual-field modular multiplication algorithm with which resource-constrained devices can implement both RSA and SM2. Moreover, an efficient modular exponentiation circuit is designed by reorganizing the operation flow of the dual-field modular square and optimizing its logic structure. The circuit uses a pipelined architecture that targets small area and low power consumption, with a dual-field multiplier, a dual-field adder and a modular multiplication controller governing the data operations on the key data path.

2. Related Works

RSA and ECC algorithms involve a large number of complex operations, and software-only implementations are very inefficient. To improve the efficiency and security of public-key algorithms, the key operations are usually implemented on an FPGA (Field-Programmable Gate Array) or in VLSI (Very Large Scale Integration) [2], and the cryptographic algorithm is integrated into the circuit design of a cryptographic coprocessor, cipher chip, or similar device. Such designs focus on complex operations such as modular multiplication, modular exponentiation and modular inversion.

In [3] and [4], the residue number system (RNS) is used to improve the parallelism of modular multiplication and modular exponentiation, and the operation is accelerated by an efficient arithmetic unit architecture. In [6,7], large-number subtraction is used to optimize modular multiplication; although the inner loop is executed one more time, the period of modular multiplication and modular exponentiation is reduced. Li [8] and Chen [9] have made effective improvements in the design and implementation of dual-field modular multiplication. Kadar [10] and Qi [11] use reversible logic gates to design modular multiplication and modular inversion circuits for cryptographic algorithms so as to prevent the loss of energy during computation. However, a security chip that supports both the RSA and SM2 algorithms consumes more resources. In recent years, with the development of chip manufacturing technology, the hardware design of cryptographic systems has emphasized high speed while neglecting area, power and resource consumption. Therefore, more efficient implementations are needed to meet the constraints of the application environment.

For reasons of efficiency and security, most of the key operations of RSA and ECC, especially the highly complex multiplication, are implemented in hardware. The Montgomery modular multiplication algorithm [12] replaces the tedious division of traditional algorithms with simple addition, multiplication and shift operations, which makes it suitable for hardware implementation. In recent studies, dependency graphs and multiple processing elements (PEs) have been the research focus for the Montgomery modular multiplication algorithm. Based on the algorithm, Lin et al. [5, 13, 14] proposed a hardware architecture in which multiple PEs work in parallel to reduce the delay and the memory bandwidth requirement and to achieve higher throughput. Renardy et al. [15] designed an iterative modular architecture on FPGA and achieved a better \(AT^2\) (area-delay) figure. A novel iterative architecture [16] for the prime field GF(p) was recently proposed by Morales-Sandoval to reduce the area. However, the above studies consider only a single field, either the prime field GF(p) or the binary field GF(\(2^m\)). Among the fewer dual-field designs, Savas [17], Zheng [18] and Liao [19] et al. have contributed significantly to the scalability and speed of modular multiplication hardware and to RSA/ECC coprocessors.

In general, on the one hand, the above research on the Montgomery modular multiplication algorithm offers little support for dual fields. On the other hand, these works pursue speed, and in terms of area, power consumption and other aspects they are not suitable for wireless devices, smart cards and other resource-constrained devices. To meet the requirements of resource-constrained devices, we redesign the Montgomery modular multiplication algorithm and optimize the modular square based on the characteristics of FIPS, so as to increase the speed of the modular square and reduce the operand accesses, multiplications and additions. Furthermore, the improved algorithm is extended to the GF(\(2^m\)) field to support both the RSA and ECC cryptosystems, and the corresponding circuits are implemented on FPGA.

3. FIPS Algorithm Optimization

In the RSA cryptosystem, modular exponentiation is implemented by repeated modular multiplication. Modular multiplication is also the basic operation of the SM2 cryptosystem. It is the lowest-level operation in the encryption and decryption algorithms, and its performance determines the overall speed and efficiency. The Blakley [20], Barrett [21] and Montgomery [12] algorithms, among others, are commonly used to implement it. Among the modular multiplications performed in RSA and SM2, the probability that an operation is a modular square is close to 50%. However, the computing speed of the modular square is limited by that of the modular multiplication. It is therefore necessary to reduce the cycles of the modular square operation so as to improve the efficiency of encryption and decryption.
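To make the role of the modular square concrete, the sketch below builds a modular exponentiation from repeated modular multiplications and counts how many of them are squarings. It uses left-to-right binary exponentiation, which is assumed here purely for illustration; the function name `modexp_left_to_right` is hypothetical, and Python's built-in integer arithmetic stands in for the hardware modular multiplier.

```python
def modexp_left_to_right(base, exponent, modulus):
    """Left-to-right square-and-multiply.

    Returns (base**exponent mod modulus, squarings, multiplications).
    """
    result = 1
    squarings = 0
    multiplications = 0
    # Scan the exponent bits from the most significant to the least significant.
    for bit in bin(exponent)[2:]:
        # One modular square per exponent bit (both operands are equal).
        result = (result * result) % modulus
        squarings += 1
        if bit == "1":
            # One general modular multiplication per set bit.
            result = (result * base) % modulus
            multiplications += 1
    return result, squarings, multiplications


value, squarings, multiplications = modexp_left_to_right(5, 0b1011, 97)
print(value, squarings, multiplications)
```

Every exponent bit costs one squaring regardless of its value, so any saving in the modular square is paid back on every iteration of the loop.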

3.1 Basic FIPS Algorithm

Through the analysis of different modular multiplication algorithms, the Montgomery modular multiplication algorithm is more suitable for hardware design, and the highest efficiency can be achieved. As a kind of efficient implementation, its FIPS modular multiplication algorithm is described as shown in Algorithm 1.

Algorithm 1. Basic FIPS modular multiplication algorithm

It can be seen that the FIPS modular multiplication algorithm, which is based on product scanning, mainly consists of two loops, which calculate the least significant s words and the most significant s words of the result respectively. Taking 6 × 6 word operands A and B as an example, the calculation process of the FIPS modular multiplication algorithm is shown in Fig. 1. The red box shows i from 0 to 1, that is, \(m_i\) is computed from right to left in the product scanning mode. The result \(m_i\) of the current column i is used in the subsequent operations that involve the ith row, as shown by the black solid arrow. The low 6 words \(m_i\) are replaced by the high 6 words once the operation is completed.

Fig. 1. FIPS Modular Multiplication
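The two-loop structure described above can be sketched in Python as follows. This is a minimal sketch of the standard product-scanning Montgomery multiplication, not the authors' exact listing. The helper names (`to_words`, `from_words`, `fips_montmul`) are hypothetical, a Python integer `acc` stands in for the three-word accumulator (v2, v1, v0), 32-bit words are assumed, and N is assumed odd with A, B < N.

```python
W_BITS = 32                 # word width w (assumed 32 bits)
W = 1 << W_BITS
MASK = W - 1


def to_words(x, s):
    """Split x into s little-endian w-bit words."""
    return [(x >> (W_BITS * k)) & MASK for k in range(s)]


def from_words(words):
    """Reassemble little-endian w-bit words into an integer."""
    return sum(w << (W_BITS * k) for k, w in enumerate(words))


def fips_montmul(A, B, N, s):
    """Return A*B*R^-1 mod N with R = W**s, scanning one column at a time."""
    a, b, n = to_words(A, s), to_words(B, s), to_words(N, s)
    n0_inv = (-pow(n[0], -1, W)) % W      # n0' = -N^-1 mod W (needs odd N)
    m = [0] * s
    acc = 0                               # models (v2, v1, v0)

    # First loop: columns 0 .. s-1, each column yields one m_i.
    for i in range(s):
        for j in range(i):
            acc += a[j] * b[i - j] + m[j] * n[i - j]
        acc += a[i] * b[0]
        m[i] = ((acc & MASK) * n0_inv) & MASK
        acc += m[i] * n[0]
        assert acc & MASK == 0            # low word is cancelled by m_i * n_0
        acc >>= W_BITS                    # shift the accumulator right by one word

    # Second loop: columns s .. 2s-1 produce the s result words,
    # which overwrite the m_i values that are no longer needed.
    for i in range(s, 2 * s):
        for j in range(i - s + 1, s):
            acc += a[j] * b[i - j] + m[j] * n[i - j]
        m[i - s] = acc & MASK
        acc >>= W_BITS

    result = from_words(m) + (acc << (W_BITS * s))
    # Final conditional subtraction (prime-field case).
    return result - N if result >= N else result


N = (1 << 127) - 1
A, B = 0x1234567890ABCDEF1234567890ABCDEF % N, 0x0FEDCBA9876543210FEDCBA987654321
print(fips_montmul(A, B, N, 4) == (A * B * pow(1 << 128, -1, N)) % N)
```

The inner loops touch every pair (j, i - j), so a multiplication with unrelated operands performs one a·b product and one m·n product per pair.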

3.2 Optimized FIPS Algorithm

Assuming that only multiplication operations are considered, when \(A \neq B\) the algorithm simply accumulates the partial products and takes \(s^2\) word-based multiplication operations, as shown in Algorithm 2.

Algorithm 2. Optimized FIPS modular multiplication algorithm

When \(A = B\), the accumulation of the partial products can be simplified, as shown in Algorithm 3. By accumulating \(a_j b_{i-j}\) twice where the original algorithm computes two symmetric products, the calculation flow is optimized and ultimately takes only \((s^2+s)/2\) word-based multiplication operations. As a result, the computation for \(A = B\) needs \(s^2 - (s^2+s)/2 = (s^2-s)/2\) fewer multiplication operations than that for \(A \neq B\).

Algorithm 3. FIPS modular square algorithm

With the modular square optimization, as shown in Fig. 1, the FIPS modular multiplication algorithm can be optimized. Since the partial products in grey are symmetric about the current column, the symmetric partial product can be accumulated ahead of time when the grey partial product is accumulated, i.e., when \(A = B\), \(a_j b_{i-j} = a_{i-j} b_j\). The accumulations of \(a_j b_{i-j}\) over the first s columns and the last s columns can be written as \(\sum_{j=0}^{i} a_{j} b_{i-j}\) and \(\sum_{j=i-s+1}^{s-1} a_{j} b_{i-j}\) respectively. \(\sum_{j=0}^{i} a_{j} b_{i-j}\) can be expanded as

\(\left\{\begin{array}{ll} a_{0} b_{1}+a_{1} b_{0}=2 a_{0} b_{1} & i=1 \\ a_{2} b_{0}+a_{1} b_{1}+a_{0} b_{2}=2 a_{0} b_{2}+a_{1} b_{1} & i=2 \\ a_{3} b_{0}+a_{2} b_{1}+a_{1} b_{2}+a_{0} b_{3}=2 a_{0} b_{3}+2 a_{1} b_{2} & i=3 \\ a_{4} b_{0}+a_{3} b_{1}+a_{2} b_{2}+a_{1} b_{3}+a_{0} b_{4}=2 a_{0} b_{4}+2 a_{1} b_{3}+a_{2} b_{2} & i=4 \\ a_{5} b_{0}+a_{4} b_{1}+a_{3} b_{2}+a_{2} b_{3}+a_{1} b_{4}+a_{0} b_{5}=2 a_{0} b_{5}+2 a_{1} b_{4}+2 a_{2} b_{3} & i=5 \end{array}\right.\)       (1)

The following conclusions can be drawn

\(\sum_{j=0}^{i} a_{j} b_{i-j}=\left\{\begin{array}{ll} 2 \sum_{j=0}^{i / 2} a_{j} b_{i-j} & i \% 2=1 \\ 2 \sum_{j=0}^{i / 2-1} a_{j} b_{i-j}+a_{i / 2} b_{i / 2} & i \% 2=0 \& i \neq 0 \end{array}\right.\)       (2)

Similarly, for \(\sum_{j=i-s+1}^{s-1} a_{j} b_{i-j}\), the form depends on the parity of the column index i: in an even column there is a term with \(j = i - j\), which yields a single partial product, \(a_{i/2} b_{i/2}\), with no symmetric partner. Therefore, the partial products of the ith column are symmetric, and the cumulative form differs only according to the parity of i. In this way, for \(i \in [1, 2s-2]\), the computation of \(a_j b_{i-j}\) can be reduced by nearly 50%. Moreover, in logic circuits, \(2 a_j b_{i-j}\) can be generated by a single left shift of \(a_j b_{i-j}\).
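The symmetry argument can be checked numerically. The sketch below compares the full column sum with the halved form of Formula 2 for every column of a squaring and counts the word multiplications used; the function names `column_sum_full` and `column_sum_half` and the sample words are hypothetical.

```python
def column_sum_full(a, i):
    """Sum of a[j] * a[i - j] over every valid j in column i (A = B)."""
    s = len(a)
    lo, hi = max(0, i - s + 1), min(i, s - 1)
    return sum(a[j] * a[i - j] for j in range(lo, hi + 1))


def column_sum_half(a, i):
    """Same sum using the symmetry a[j]*a[i-j] == a[i-j]*a[j].

    Returns (sum, number of word multiplications used).
    """
    s = len(a)
    j = max(0, i - s + 1)
    total = 0
    mults = 0
    while j < i - j:
        # Off-diagonal pair: compute once, double (a left shift in hardware).
        total += (a[j] * a[i - j]) << 1
        mults += 1
        j += 1
    if j == i - j:
        # Diagonal term of an even column has no symmetric partner.
        total += a[j] * a[j]
        mults += 1
    return total, mults


a = [3, 5, 7, 11, 13, 17]            # s = 6 sample words
columns = range(2 * len(a) - 1)
print(all(column_sum_full(a, i) == column_sum_half(a, i)[0] for i in columns))
print(sum(column_sum_half(a, i)[1] for i in columns))
```

For s = 6 the halved form uses (36 + 6)/2 = 21 products instead of 36, in line with the \((s^2+s)/2\) count given in Section 3.2.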

4. Efficient Dual-field FIPS Modular Multiplication Algorithm

4.1 Algorithm improvement

The analysis of the principle and operation characteristics of the FIPS modular multiplication algorithm shows that its logic optimization must act on the critical computing process, as shown in Fig. 2. In addition, a dual-field modular multiplication unit that supports both the prime field GF(p) and the binary field GF(\(2^m\)) is needed. Therefore, the efficient dual-field FIPS algorithm is improved on the basis of modular multiplication, where field controls the selection of the operation field, and equal ← A=B ? 1 : 0 selects between the modular multiplication and the modular square.

Fig. 2. Critical Computation Process

The improved FIPS modular multiplication algorithm is shown in Algorithm 4.

Algorithm 4. Efficient Dual-field FIPS modular multiplication algorithm

In the calculation process, two variables, j and i - j , are used to determine the flow of calculation, so as to avoid the increase of redundant variables.

  • When \(A \neq B\), i.e., equal = 0, the flow is the same as in the traditional FIPS modular multiplication algorithm: each inner loop simply accumulates \(a_j b_{i-j}\) and \(m_j n_{i-j}\). As shown in Fig. 2, Formula 3 is executed until the algorithm is over.

\((v_2, v_1, v_0)_w \leftarrow (v_2, v_1, v_0)_w + m_j n_{i-j}\)       (3)

  • When \(A = B\), i.e., equal = 1, as shown in Fig. 1, the symmetric partial product is accumulated in advance when the grey partial product \(a_j b_{i-j}\) is accumulated, and after that only \(m_j n_{i-j}\) needs to be accumulated. The algorithm selects the formula to execute according to the relationship between j and i - j each time. If j < i - j, the partial product \(a_j b_{i-j}\) is accumulated twice according to Formula 4. If j = i - j, there is no symmetric partial product, so it is accumulated once according to Formula 5. In both cases each pass of the inner loop needs two multiplication operations and two addition operations. If j > i - j, the partial product \(a_j b_{i-j}\) has already been accumulated and only the \(m_j n_{i-j}\) part is left, so each pass of the inner loop executes Formula 3 and needs only one multiplication operation and one addition operation, until the algorithm is over.

\((v_2, v_1, v_0)_w \leftarrow (v_2, v_1, v_0)_w + 2 a_j b_{i-j} + m_j n_{i-j}\)       (4)

\((v_2, v_1, v_0)_w \leftarrow (v_2, v_1, v_0)_w + a_j b_{i-j} + m_j n_{i-j}\)       (5)

When calculating the modular square, the traditional FIPS modular multiplication algorithm follows the same computational flow as for a general modular multiplication and needs \(2s^2 + s\) multiplication operations. The improved FIPS modular multiplication algorithm uses the equal, j and i - j signals to distinguish and control the modular multiplication and modular square operations. It removes unnecessary operations, but it also increases the difficulty of hardware implementation.
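The selection between Formulas 3, 4 and 5 for the modular square can be sketched in software as follows. This is an illustrative model under stated assumptions, not the authors' Algorithm 4: it covers only the prime-field, equal-operand case, assumes 32-bit words and an odd N with A < N, uses a Python integer for the accumulator (v2, v1, v0), and the names `to_words` and `fips_montsqr` are hypothetical. It also counts word multiplications, including the one that produces each \(m_i\).

```python
W_BITS = 32
MASK = (1 << W_BITS) - 1


def to_words(x, s):
    """Split x into s little-endian 32-bit words."""
    return [(x >> (W_BITS * k)) & MASK for k in range(s)]


def fips_montsqr(A, N, s):
    """Return (A*A*R^-1 mod N, word multiplications), R = 2**(32*s)."""
    a, n = to_words(A, s), to_words(N, s)
    n0_inv = (-pow(n[0], -1, 1 << W_BITS)) % (1 << W_BITS)  # n0' = -N^-1 mod W
    m = [0] * s
    acc = 0          # models the accumulator (v2, v1, v0)
    mults = 0

    def step(i, j):
        nonlocal acc, mults
        if j < i - j:                 # Formula 4: doubled partial product
            acc += 2 * a[j] * a[i - j] + m[j] * n[i - j]
            mults += 2
        elif j == i - j:              # Formula 5: diagonal term, no partner
            acc += a[j] * a[j] + m[j] * n[i - j]
            mults += 2
        else:                         # Formula 3: a-part already accumulated
            acc += m[j] * n[i - j]
            mults += 1

    for i in range(s):                # first s columns
        for j in range(i):
            step(i, j)
        if i == 0:                    # a_0 * a_0 is added only in column 0
            acc += a[0] * a[0]
            mults += 1
        m[i] = ((acc & MASK) * n0_inv) & MASK
        acc += m[i] * n[0]
        mults += 2                    # one for m_i, one for m_i * n_0
        acc >>= W_BITS

    for i in range(s, 2 * s):         # last s columns
        for j in range(i - s + 1, s):
            step(i, j)
        m[i - s] = acc & MASK
        acc >>= W_BITS

    result = sum(w << (W_BITS * k) for k, w in enumerate(m)) + (acc << (W_BITS * s))
    return (result - N if result >= N else result), mults


N = (1 << 1024) - 105
A = (1 << 1000) + 12345
value, mults = fips_montsqr(A, N, 32)
print(value == (A * A * pow(1 << 1024, -1, N)) % N, mults)
```

For s = 32 the sketch uses 1584 word multiplications, which equals \(2s^2 + s - (s^2 - s)/2\), i.e. the traditional count of 2080 minus the 496 products removed by the symmetry.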

4.2 Algorithm efficiency

In [22], the optimization of the modular square is also based on FIPS. Nevertheless, some operations in that algorithm, such as operand accesses, multiplications and additions, remain redundant, and operations on the GF(\(2^m\)) field are not supported. Moreover, no further analysis is made of the application of the improved modular multiplier.

For the FIPS algorithm, a coordinate system is constructed with i - j as the abscissa and j as the ordinate, as shown in Fig. 3. The FIPS algorithm uses the product scanning mode, indicated by the red dotted line, where i increments from 0 to 2s−1 and the dashed arrow points in the ascending order of the subscript of \(b_{i-j}\). When \(A = B\), the products \(a_j b_{i-j}\) on the dot matrix are symmetric about the line j = i - j, and the cumulative directions shown by the red dashed line are also symmetric about this line. This makes it easier to adopt Algorithm 3 to calculate \(\sum_{j=0}^{i} a_{j} b_{i-j}\) for the front s columns and \(\sum_{j=i-s+1}^{s-1} a_{j} b_{i-j}\) for the rear s columns without extra storage space.

Fig. 3. FIPS Algorithm

According to Fig. 3, when \(A = B\), the traditional FIPS algorithm needs to complete \(s^2\) w-bit × w-bit product operations in the square matrix and accumulate them. With the improved FIPS algorithm, only the \((s^2+s)/2\) multiplication operations in the shaded part of the square matrix are needed, so \((s^2-s)/2\) multiplication operations are saved. Similarly, by Formulas 3, 4 and 5, the read operations on multiplication operands and the addition operations on product results are reduced, by \(s^2-s\) and \((s^2-s)/2\) respectively. The number of operations saved varies with the width of the operands. Table 1 compares the cycles of modular square operations for different word widths, with the frequency set to 500 MHz. The modular exponentiation operations of RSA and the point doubling operations of SM2 can take full advantage of the efficient dual-field FIPS modular multiplication algorithm.

Table 1. Comparison of modular square operation time in different word widths

In addition, the difference of algorithm design in dual-field is represented in the following two aspects:

  • First, additions in the prime field propagate carries, whereas in the binary field they are XOR operations without carry. The accumulation of the partial products in the multiplier differs in the same way.
  • Second, because there is no carry in the binary field, the final result is never greater than N, and the subtraction step is omitted.
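The two differences listed above can be illustrated with a small software model of the dual-field word operations. It is a sketch only: the names `clmul`, `df_add` and `df_mul` are hypothetical, the `field` argument follows the convention used later in the text (0 selects GF(p), any other value selects GF(\(2^m\))), and the binary-field product is a plain carry-less polynomial product with no reduction step.

```python
def clmul(x, y):
    """Carry-less product: multiply x and y as polynomials over GF(2)."""
    r = 0
    while y:
        if y & 1:
            r ^= x          # partial products are accumulated with XOR
        x <<= 1
        y >>= 1
    return r


def df_add(x, y, field):
    """Dual-field addition: integer add (with carry) or XOR (no carry)."""
    return x + y if field == 0 else x ^ y


def df_mul(x, y, field):
    """Dual-field word multiplication."""
    return x * y if field == 0 else clmul(x, y)


print(df_add(3, 1, 0), df_add(3, 1, 1))     # 4 and 2
print(df_mul(3, 3, 0), df_mul(3, 3, 1))     # 9 and 5, since (x + 1)^2 = x^2 + 1
print(hex(df_mul(0xFFFFFFFF, 0xFFFFFFFF, 1)))
```

Because no carries are generated, the carry-less product of two 32-bit words never exceeds 63 bits, whereas the integer product can occupy all 64.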

5. Design of Efficient Dual-field FIPS Modular Multiplication

Although the dual-field efficient FIPS modular multiplication algorithm improves efficiency to a certain degree, its data path is more complex, which makes the controller harder to design. The modular multiplication circuit employs resource reuse, focuses on area and power consumption, and optimizes the circuit design as a whole. It provides important support for the applications of public-key cryptography in resource-constrained devices.

5.1 Logic structure

As shown in Fig. 4, the logic structure of the efficient dual-field FIPS modular multiplication mainly consists of the following modules.

Fig. 4. Logic structure of efficient dual-field FIPS modular multiplication

  •  Data input unit

It uses a 32-bit bus to receive the operands A, B, N and n0' in turn and stores them in the corresponding memory.

  • Modular multiplication operation unit

This module is the main part of the operation and is responsible for the calculation on the critical path of the modular multiplication algorithm. The dual-field operation is mainly embodied in the multiplier and adder of this part.

  •  Data register file

It is mainly responsible for holding the intermediate results of the operation process, including the field and equal signals, which are stored in registers. field controls the selection between the two fields: if field = 0, the operation on the GF(p) field is executed; otherwise, the operation on the GF(\(2^m\)) field is executed. equal selects the modular square: if equal = 0, the operands are not equal and the modular multiplication operation is executed; otherwise, the modular square operation is executed.

  • Modular multiplication control unit

The control part of the modular multiplication algorithm is implemented by a state machine, which mainly controls the data flow of the modular multiplication operation circuit.

5.2 Modular multiplication operation unit

The design of the dual-field modular multiplication operation unit mainly concerns the multipliers and adders. An example is the 32 × 32-bit multiplier, which compresses the partial products with a Wallace tree [24] and obtains the products in the GF(p) field and the GF(\(2^m\)) field respectively. The structure of the dual-field multiplier is shown in Fig. 5. For the dual-field adder, n dual-field adder units are cascaded, as shown in Fig. 6, to complete the addition of two n-bit operands in both fields.

Fig. 5. Dual-field multiplier

Fig. 6. Dual-field adder

Considering the different requirements on speed, power consumption and area, the critical path of the pipelined modular multiplication circuit is implemented in two schemes. When the application environment places tighter constraints on resources, one 32-bit dual-field multiplier together with a 96-bit and a 65-bit dual-field adder can be used, as shown in Fig. 7, in which the data stream and controller are relatively simple. A higher-performance implementation uses two 32-bit dual-field multipliers and two dual-field adders, as shown in Fig. 8, in which the area is larger and the controller is more complex.

Fig. 7. Data Path of Single-multiplier

Fig. 8. Data Path of Double-multiplier

The main design issues of the efficient dual-field FIPS modular multiplication circuit based on the critical path lie in two aspects. One is the operand width, which determines the speed, area and power consumption of the circuit. For resource-constrained applications, a narrow multiplier has the advantages of small size and low power consumption, but it is slow. The other is the number of multipliers. Two multipliers increase the speed, but they double the area and increase the complexity of the controller.

5.3 Modular multiplication control unit

The modular multiplication control unit is implemented by a state machine. It sends read addresses to the RAM, controls the data flow on the data path, and sends write addresses to the RAM. As shown in Fig. 9, it has the following nine states.

Fig. 9. State Transition of the Single Multiplier

• Init: Initial state, waiting for data to be transferred to the RAM.

• S0: The start signal becomes true when the data transmission is over, and then the reading signal is issued to start multiplying.

• S1: a0 and b0 are read.

• S2: Save the result to the register file after multiplying and adding.

• S3: \((v_2, v_1, v_0)_W \leftarrow (v_2, v_1, v_0)_W + a_j b_0\).

• S4: \(m_i \leftarrow v_0 n_0' \bmod W\).

• S5: \((v_2, v_1, v_0)_W \leftarrow (v_2, v_1, v_0)_W + m_i n_0\), and then right shift.

• S6: Internal calculation of the second cycle.

• S7: Send signal done for denoting that the modular multiplication is over.

This machine is triggered by three input signals such as clk, rst and start and will generate an output signal V_add.

• clk. clk is a clock signal.

• rst. rst is used to initialize the modular multiplication controller and registers.

• start. When start becomes true, it denotes that the data is loaded and the modular multiplication operation needs to be executed.

• V_add. In the state S2, if i=0|equal=0 , then V_add =1, otherwise V_add = 0 .

• done. The done signal is initialized to 0 at the time of rst reset or in the Init state, and it will be set to be 1 when the operation in the state S7 is completed.

• V_Straight and V_Shift. The design employs three 32-bit registers, V2, V1 and V0, as the accumulation registers of the efficient dual-field FIPS modular multiplication, in which V2 holds the 32 bits on the MSB (Most Significant Bit) side so as to facilitate the shift operations shown in Algorithm 4. The machine controls the changes of the register values through V_Straight and V_Shift: a) when V_Straight = 1, the sum add_z of the dual-field adder is saved into the registers; b) on the falling edge of clk in the S5 state, i.e., when the accumulation of \(m_i n_0\) has been completed, V_Shift = 1 and the registers execute the shift operation.

• P0 - P3. They correspond to the 4 judgment conditions in Fig. 2. They also determine the read-write signal and enable signal of the memory A, B and N, and the bit extension in the dual-field multiplier. When the accumulation of each column in the front s columns is completed, P4 is used to control whether to add aib0 to the current column.

• i and j. These two registers and equal together decide the value of P0 - P3. They also determine the beginning and end of each inner loop and each outer loop.

Under the combined action of the registers and the control signals, the specific pipeline of the dual-field multiplier and adder is as follows.

• The register i is initialized to 0 at the start of the operation. As described in Algorithm 2, this value does not satisfy the condition of the inner loop in the first pass of the outer loop. Thus, the accumulation of \(a_i b_0\) can be carried out directly. This is known before any modular multiplication operation starts. Therefore, in the first cycle after the start signal becomes valid, the operands \(a_0\) and \(b_0\) are read directly.

• In the second cycle, the multiply-add operation will be carried out. And, according to field, equal and P0 - P4 and other related signals, the next operands to be calculated will be read.

• According to Fig. 2, when Formula 4 and 5 are executed, after each time aj and bi-j are read, in the next time mj and ni−j must be read. This facilitates the control of read/write and enable signals of memory.

• When j < i−j and p23 =1, ajbi-j is shifted left.

• When j > i-j, only the operand mj and ni-j are read. In each subsequent cycle, the operations including multiply-add, read, and storing the result of the calculation in the corresponding register stack are executed repeatedly.

• When A≠B , S6 is always executed. When A=B , S6 is executed only when the first column is computed.

• In the state S7, the signal done =1 so as to generate the completeness signal of the modular multiplication operation.

To perform a 1024-bit modular multiplication operation and modular square operation with the pipeline organization, the traditional FIPS algorithm requires 2084 and 1044 clock cycles respectively, whereas the improved FIPS algorithm requires only 1588 and 796 clock cycles.

6. Experiment and Simulation

With single-multiplier and double-multiplier settings at a frequency of 500 MHz, the cycle counts and the optimization rates of the modular multiplication and modular square operations for different word widths are shown in Fig. 10. The measured number of cycles is 4 more than the theoretical number shown in Table 1, and the optimization rate is the ratio of the number of saved cycles to the number of cycles of the modular multiplication. For n-bit operands, the number of cycles of the modular square is reduced by \((s^2-s)/2\), where \(s = n/32\). The optimization rate increases with the number of operand bits; as s tends to infinity, it approaches 25%.

Fig. 10. Cycles comparison of modular multiplication and modular square in different word widths
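The optimization rate quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes a single multiplier, takes the traditional count of \(2s^2 + s\) multiplications as the theoretical cycle count, and adds the 4 extra cycles mentioned in the text; the function name `square_cycles` and its parameters are hypothetical.

```python
def square_cycles(n_bits, word_bits=32, extra_cycles=4):
    """Return (traditional cycles, improved cycles, optimization rate)."""
    s = n_bits // word_bits
    traditional = 2 * s * s + s + extra_cycles   # same flow as a multiplication
    saved = (s * s - s) // 2                     # products removed when A = B
    improved = traditional - saved
    return traditional, improved, saved / traditional


for n_bits in (256, 512, 1024, 2048):
    traditional, improved, rate = square_cycles(n_bits)
    print(n_bits, traditional, improved, f"{rate:.1%}")
```

Under these assumptions the 1024-bit case gives 2084 and 1588 cycles and a rate of about 23.8%, and the rate moves towards the 25% limit as the operand length grows.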

When half of the key bits are 1, the numbers of cycles of L_R and ML for different word widths are shown in Fig. 11, with the frequency set to 500 MHz. L_R guarantees the maximum utilization of the modular square circuit. L_R is 1.25 times as fast as ML, which needs two modular multipliers.

Fig. 11. Cycles comparison of L_R and ML in different word widths

The synthesis results of 1024-bit modular multiplication on the Xilinx Artix-7 FPGA device are compared in Table 2. Because different FPGA devices have different maximum clock frequencies, for a more objective evaluation we mainly list the area (LUTs) and the latency (cycles), and compare the designs by the throughput metric \(AT^2\) [15], in which A is the area (LUTs) and T is the latency (cycles). Unlike [31] and [32], the proposed scheme supports dual-field operations. The \(AT^2\) of the proposed design is slightly larger than that of Sudhakar [33], but the number of LUTs is smaller. Compared with other modular multiplication circuits, the proposed method has a good \(AT^2\) with less power consumption.

Table 2. Implementation comparison of modular multiplication circuits

7. Conclusions and Future Works

By optimizing the logic of the modular square based on the FIPS Montgomery modular multiplication algorithm, we effectively improve the efficiency of public-key cryptography algorithms and remove the power consumed by redundant operations, which makes the design suitable for resource-constrained devices. Experimental results show that the proposed circuit is 23.8% faster than the traditional FIPS on the 1024-bit modular square. In addition, the circuit is highly scalable, and the supported modular multiplication length can be increased according to actual demand.

The dual-field FIPS modular multiplication supports both the GF(p) and GF(\(2^m\)) fields and provides the basic modular operation for the RSA and SM2 cryptosystems. Based on it, modular operations such as the modular square and modular exponentiation can be implemented for the RSA cryptosystem. However, the higher-level operations above the SM2 modular arithmetic layer, including the single point operation layer and the multiple point operation layer, are not addressed. A worthwhile direction is to employ the efficient dual-field FIPS modular multiplication module to perform point operations, so as to realize hardware support of the dual-field modular multiplication for both the RSA and SM2 cryptosystems. In addition, the dual-field operations include multiplication and addition. An arithmetic unit supporting both multiplication and addition needs to be further optimized in its logic design in order to reduce logic resources and power consumption.

References

  1. F. Chen, Y. Luo, J. Zhang, J. Zhu, Z. Zhang, C. Zhao and T. Wang, "An infrastructure framework for privacy protection of community medical internet of things-Transmission protection, Storage Protection and Access Control," World Wide Web, vol. 21, no. 1, pp. 33-57, 2018. https://doi.org/10.1007/s11280-017-0455-z
  2. N. Rajitha and R. Sridevi, "Implementations of Reconfigurable Cryptoprocessor A Survey," in Proc. of Third International Conference of Information Systems Design and Intelligent Applications, pp. 11-19, 2016.
  3. F. Gandino, F. Lamberti, G. Paravati, et al., "An Algorithmic and Architectural Study on Montgomery Exponentiation in RNS," IEEE Transactions on Computers, vol. 61, no. 8, pp. 1071-1083, 2012. https://doi.org/10.1109/TC.2012.84
  4. J. Wei, W. Guo, H. Liu, et al, "A Unified Cryptographic Processor for RSA and ECC in RNS," in Proc. of CCF National Conference on Computer Engineering and Technology, pp. 19-32, 2013.
  5. W. C. Lin, J. H. Ye and M. D. Shieh, "Scalable Montgomery Modular Multiplication Architecture with Low-Latency and Low-Memory Bandwidth Requirement," IEEE Transactions on Computers, vol. 63, no. 2, pp. 475-483, 2014. https://doi.org/10.1109/TC.2012.218
  6. G. Hachez and J. J. Quisquater, "Montgomery Exponentiation with no Final Subtractions: Improved Results," in Proc. of International Workshop on Cryptographic Hardware and Embedded Systems, pp. 293-301, 2000.
  7. J. Shao, L. Wu and X. Zhang, "Design and Implementation of Long Integer Modular Exponentiation Unit of Asymmetric Encryption in Smart Card," Microelectronics & Computer, vol. 32, no. 2, pp. 37-41, 2015.
  8. M. Li, D. Wu, K. Dai and X. Zou, "Research and Design of a High-Performance Scalable Public-Key Cipher Coprocessor," Acta Electronica Sinica, vol. 39, no. 3, pp. 665-670, 2011.
  9. G. Chen,J. Zhu, M. Liu and W. Zeng, "Dual-field Modular Multiplication Algorithm and Modular Inversion Algorithm with VLSI Implementation," Journal of Electronics & Information Technology, vol. 32, no. 9, pp. 2095-2100, 2010. https://doi.org/10.3724/SP.J.1146.2009.01258
  10. M. M. A. Kadar and A. V. Ananthalakshmi, "An energy efficient Montgomery modular multiplier for security systems using reversible gates," in Proc. of IEEE International Conference on Communications and Signal Processing, pp. 0071-0074, 2015.
  11. X. Qi, Q. Tang, F. Chen, et al, "Design of Modular Inversion Circuits Using Reversible Logic on Galois Field," Journal of Frontiers of Computer Science & Technology, vol. 9, no. 5, pp. 555-564, 2015.
  12. P. L., "MontgomeryModular multiplication without trial division," Mathematics of Computation, vol. 44, no. 170, pp. 519-521, 1985. https://doi.org/10.1090/S0025-5718-1985-0777282-X
  13. G. Wu, X. Xie, D. Wu, et al, "Design and implementation of high radix Montgomery modular multiplication array structures," Computer Engineering and Science, vol. 36, no. 2, pp. 201-205, 2014. https://doi.org/10.3969/j.issn.1007-130X.2014.02.002
  14. J. H. Ye, T. W. Hung and M. D. Shieh, "Energy-efficient architecture for word-based Montgomery modular multiplication algorithm," in Proc. of International Symposium on VLSI Design, Automation and Test, pp. 1-4, 2013.
  15. A. P. Renardy, N. Ahmadi, A. A. Fadila, et al, "Hardware implementation of montgomery modular multiplication algorithm using iterative architecture," International Seminar on Intelligent Technology and ITS Applications, pp. 99-102, 2015.
  16. M. Morales-Sandoval and A. Diaz-Perez, "Scalable GF(p) Montgomery multiplier based on a digit-digit, computation approach," Iet Computers & Digital Techniques, vol. 10, no. 3, pp. 102-109, 2016. https://doi.org/10.1049/iet-cdt.2015.0055
  17. E. Savas and A. F. Tenca, "A Scalable and Unified Multiplier Architecture for Finite Fields GF(p) and GF(2m)," in Proc. of International Workshop on Cryptographic Hardware and Embedded Systems, pp. 277-292, 2000.
  18. Z. Zheng, Y. Zi, Y. Tian, et al, "Design and Application of High Speed Dual-Field Multiplier," Microelectronics and Computer, vol. 33, no. 5, pp. 1-5, 2016.
  19. W. Liao, M. Wan, K. Dai, et al, "Design and research of dual-field scalable modular multiplier," Huazhong Univ. of Sci. and Tech. (Natural Science Edition), vol. 43, no. 9, pp. 51-54, 2015.
  20. G. R. Blakely, "A computer algorithm for calculating the product AB modulo M," IEEE Transactions on Computers, vol. 32, no. 5, pp. 497-500, 1983. https://doi.org/10.1109/TC.1983.1676262
  21. P. Barrett, "Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor," Proceedings of Advances in Cryptology, pp. 311-323, 1986.
  22. Q. Shao, "Improvment of RSA Crypitography Algorithm and Implementation of Its IP Core," Shanghai Jiao Tong University, 2014.
  23. M. Joye and S. M. Yen, "The Montgomery powering ladder," in Proc. of 4th International Workshop on Cryptographic Hardware and Embedded Systems, vol. 2523, pp. 291-302, 2002.
  24. H. Bansal, K. G. Sharma and T. Sharma, "Wallace Tree Multiplier Designs: A Performance Comparison Review," Innovative Systems Design & Engineering, vol. 5, no. 5, pp. 60-67, 2014.
  25. L. Chen, W. Sun, X. Chen, et al, "Montgomery Modular Inversion Algorithm Based on Signed Digit System and Hardware Implementation," Acta Electronica Sinica, vol. 40, no. 3, pp. 489-494, 2012. https://doi.org/10.3969/j.issn.0372-2112.2012.03.013
  26. E. A. Kuzu and A. Tangel, "A new style CPA attack on the ML implementation of RSA," in Proc. of IEEE Computer Science and Engineering Conference, pp. 323-328, 2014.
  27. Verma R., Dutta M. and Vig R., "RSA Cryptosystem Based on Early Word Based Montgomery Modular Multiplication," SERVICES 2018 in Computer Science, Springer, vol. 10975, pp. 33-47, 2018.
  28. S.S. Erdem, T. Yanik and A. Celebi, "A general digit-serial architecture for montgomery modular multiplication," IEEE Transactions on Very Large Scale Integration Systems, vol. 25, no. 5, pp.1658-1668, 2017. https://doi.org/10.1109/TVLSI.2017.2652979
  29. W. Dai, D. D. Chen, R. C. C. Cheung and C. K. Koc, "Area-Time Efficient Architecture of FFT-Based Montgomery Multiplication," IEEE Transactions on Computers, vol. 66, no. 3, pp. 375-388, 1 March 2017. https://doi.org/10.1109/TC.2016.2601334
  30. T. Wu, "Improving radix-4 feedforward scalable montgomery modular multiplier by precomputation and double booth-encodings," in Proc. of 2013 3rd International Conference on Computer Science and Network Technology, pp. 596-600, 2013.
  31. M.-D. Shieh, J.-H. Chen, H.-H. Wu, and W.-C. Lin, "A New Modular Exponentiation Architecture for Efficient Design of RSA Cryptosystem," IEEE Transactions on VLSI Systems, vol. 16, no. 9, pp. 1151-1161, 2008. https://doi.org/10.1109/TVLSI.2008.2000524
  32. S. S. Erdem, T. Yanik and A. Celebi, "A General Digit-Serial Architecture for Montgomery Modular Multiplication," IEEE Transactions on VLSI Systems, vol. 25, no. 5, pp. 1658-1668, 2017. https://doi.org/10.1109/TVLSI.2017.2652979
  33. M. Sudhakar, R.V. Kamala and M.B. Srinivas, "A bit-sliced, scalable and unified Montgomery multiplier architecture for RSA and ECC," in Proc. of IFIP International Conference on Very Large Scale Integration, pp. 252-257, 2007.
  34. S. Wang, W. Lin, J. Ye and M. Shieh, "Fast scalable radix-4 Montgomery modular multiplier," in Proc. of 2012 IEEE International Symposium on Circuits and Systems, pp. 3049-3052, 2012.