## Latin American applied research

*Print version* ISSN 0327-0793

### Lat. Am. appl. res. v.37 n.1 Bahía Blanca Jan. 2007

**A fixed-point implementation of the expanded hyperbolic CORDIC algorithm**

**D. R. Llamocca-Obregón ^{1} and C. P. Agurto-Ríos^{1}**

^{1} *Grupo de Procesamiento Digital de Señales, Pontificia Universidad Católica del Perú, Av. Universitaria cdra 18, Lima 32- Perú. llamocca.dr@pucp.edu.pe; agurto.cp@pucp.edu.pe*

*Abstract*: The original hyperbolic CORDIC (Coordinate Rotation Digital Computer) algorithm (Walther, 1971) limits the domain of the inputs, which renders the algorithm useless for applications in which a greater range of the function is needed. To address this problem, Hu *et al.* (1991) have proposed a scheme which adds iterations to the original hyperbolic CORDIC algorithm and allows an efficient mapping of the algorithm onto hardware. A fixed-point implementation of the hyperbolic CORDIC algorithm with the expansion scheme proposed by Hu *et al.* (1991) is presented. Three architectures are proposed: a low cost iterative version, a fully pipelined version, and a bit serial iterative version. The architectures were described in VHDL and, to test them, they were targeted to a Stratix FPGA. Various standard numerical formats for the inputs are analyzed for each hyperbolic function directly obtained: *Sinh*, *Cosh*, *Tanh^{-1}* and *exp*. For each numerical format and for each hyperbolic function an error analysis is performed.

**I. INTRODUCTION**

The hyperbolic CORDIC algorithm as originally proposed by Walther (1971) allows the computation of hyperbolic functions in an efficient fashion. However, the domain of the inputs is limited in order to guarantee that outputs converge and yield correct values, and this limitation will not satisfy the applications in which nearly the full range of the hyperbolic functions is needed.

Various strategies have been proposed to address the problem of limited convergence of the hyperbolic CORDIC algorithm. One strategy is to use mathematical identities to preprocess the CORDIC input quantities (Walther, 1971). While such mathematical identities work, there is no single identity that will remove or reduce the limitations of all the functions in the hyperbolic mode. In addition, the mathematical identities are cumbersome to use in hardware applications because their implementation requires a significant increase in processing time and hardware (Hu *et al.*, 1991). Another approach, proposed by Hu *et al.* (1991), involves a modification to the basic CORDIC algorithm (inclusion of additional iterations) that can be readily implemented in a VLSI architecture or in an FPGA without excessively increasing the processing time.

Three architectures for the fixed-point implementation of the hyperbolic CORDIC algorithm with the expansion scheme proposed by Hu *et al.* (1991) are presented: a low cost iterative version, a fully pipelined version, and a bit serial iterative version. Results in terms of resource count and speed were obtained by targeting the architectures, described in VHDL, to a Stratix FPGA of ALTERA®.

Four different numerical formats are proposed for the inputs. For each hyperbolic function, an analysis of each numerical format is performed and the optimal number of iterations along with the optimal format for the angle are obtained. Finally, an error analysis is performed for each hyperbolic function with each numerical format. The data obtained with the fixed-point architectures are contrasted with the ideal values obtained with MATLAB®.

**II. EXPANSION SCHEME FOR THE HYPERBOLIC CORDIC ALGORITHM**

**A. Original Hyperbolic CORDIC algorithm**

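The original hyperbolic CORDIC algorithm, first described by Walther (1971), is defined by the following iterative equations:

$$
\begin{aligned}
X_{i+1} &= X_i + d_i\,Y_i\,2^{-i} \\
Y_{i+1} &= Y_i + d_i\,X_i\,2^{-i} \\
Z_{i+1} &= Z_i - d_i\,\theta_i
\end{aligned}
\tag{1}
$$

Where:

$$\theta_i = \tanh^{-1}\left(2^{-i}\right) \tag{2}$$

And *i* is the index of the iteration (*i*=1,2,3,...*N*). The following iterations must be repeated in order to guarantee convergence: *4, 13, 40, ..., k, 3k + 1, ...*. The value of d_{i} is either +1 or -1 depending on the mode of operation:

$$
d_i =
\begin{cases}
-1 \text{ if } Z_i < 0,\ +1 \text{ otherwise} & \text{(rotation mode)} \\
+1 \text{ if } X_i Y_i < 0,\ -1 \text{ otherwise} & \text{(vectoring mode)}
\end{cases}
\tag{3}
$$

In the rotation mode, the quantities X, Y and Z tend to the following results, for sufficiently large *N*:

$$
\begin{aligned}
X_N &\to A_n\left[X_0\cosh Z_0 + Y_0\sinh Z_0\right] \\
Y_N &\to A_n\left[Y_0\cosh Z_0 + X_0\sinh Z_0\right] \\
Z_N &\to 0
\end{aligned}
\tag{4}
$$

And, in the vectoring mode, the quantities X, Y and Z tend to the following results, for sufficiently large *N*:

$$
\begin{aligned}
X_N &\to A_n\sqrt{X_0^2 - Y_0^2} \\
Y_N &\to 0 \\
Z_N &\to Z_0 + \tanh^{-1}\left(Y_0/X_0\right)
\end{aligned}
\tag{5}
$$

Where *A_{n}* is:

$$A_n = \prod_{i=1}^{N}\sqrt{1 - 2^{-2i}} \tag{6}$$

with the factors of the repeated iterations included in the product.

With a proper choice of the initial values X_{0}, Y_{0}, Z_{0} and the operation mode, the following functions can be directly obtained: *Sinh*, *Cosh*, *Tanh^{-1}* and *exp*. Additional functions (e.g. *ln*, *sqrt*, *Tanh*) may be generated by applying mathematical identities, performing extra operations and/or using the circular or linear CORDIC algorithms (Meyer-Baese, 2001).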
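As a rough illustration of the iteration described in this subsection, the following Python sketch runs the basic hyperbolic CORDIC in floating point. It assumes the standard Walther recurrences (X and Y updated with a shift by 2^{-i}, Z updated with the angle tanh^{-1}(2^{-i})) and the usual sign rules for d_{i}. The function names are hypothetical, and the gain is accumulated over the executed schedule, repeated iterations included.

```python
import math

def iteration_schedule(n_basic):
    """Indices 1..n_basic, with 4, 13, 40, ... (k -> 3k + 1) executed twice."""
    schedule, repeat = [], 4
    for i in range(1, n_basic + 1):
        schedule.append(i)
        if i == repeat:
            schedule.append(i)
            repeat = 3 * repeat + 1
    return schedule

def gain(n_basic):
    """Scale factor A_n accumulated over the executed iterations."""
    return math.prod(math.sqrt(1.0 - 4.0 ** -i) for i in iteration_schedule(n_basic))

def hyperbolic_cordic(x, y, z, n_basic=16, mode="rotation"):
    """Floating-point model of the basic algorithm (no range expansion)."""
    for i in iteration_schedule(n_basic):
        if mode == "rotation":
            d = -1.0 if z < 0 else 1.0        # drive z towards 0
        else:
            d = 1.0 if x * y < 0 else -1.0    # drive y towards 0
        t = 2.0 ** -i
        x, y, z = x + d * y * t, y + d * x * t, z - d * math.atanh(t)
    return x, y, z

# Rotation mode: X0 = 1/A_n, Y0 = 0 yields (cosh, sinh) of Z0.
cosh_w, sinh_w, _ = hyperbolic_cordic(1.0 / gain(16), 0.0, 0.5)
# Vectoring mode: X0 = 1, Z0 = 0 accumulates atanh(Y0 / X0) in Z.
_, _, atanh_w = hyperbolic_cordic(1.0, 0.6, 0.0, mode="vectoring")
print(cosh_w, sinh_w, atanh_w)
```

Both example inputs lie inside the basic range of convergence; larger arguments are not expected to converge without the expansion scheme.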

**B. Basic Range of Convergence**

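The basic range of convergence, obtained by a method developed by Hu *et al.* (1991), is the following:

Rotation Mode:

$$|Z_0| \le \theta_N + \sum_{i=1}^{N}\theta_i \tag{7}$$

$$|Z_0| \le \tanh^{-1}\left(2^{-N}\right) + \sum_{i=1}^{N}\tanh^{-1}\left(2^{-i}\right) \tag{8}$$

$$|Z_0| \le 1.1182 \tag{9}$$

The sums include the repeated iterations. This is the restriction imposed on the domain of the input argument of the hyperbolic functions in the rotation mode. Note that the domain of the functions *Sinh* and *Cosh* is <-∞,+∞>.

Vectoring Mode:

$$\left|\tanh^{-1}\left(Y_0/X_0\right)\right| \le \theta_N + \sum_{i=1}^{N}\theta_i \tag{10}$$

$$\left|\tanh^{-1}\left(Y_0/X_0\right)\right| \le 1.1182 \tag{11}$$

$$\left|Y_0/X_0\right| \le 0.8069 \tag{12}$$

This is the limitation imposed on the domain of the quotient of the input arguments of the hyperbolic functions in the vectoring mode. Note that the domain of *Tanh^{-1}* is <-1,+1>, and thus this function remains greatly limited in its domain.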

**C. Expansion of the Range of Convergence**

The convergence range described by Eq. 9 and Eq. 12 is unsuitable to satisfy all applications of the hyperbolic CORDIC algorithm.

One strategy to address the problem of limited convergence is the use of mathematical identities to preprocess the CORDIC input quantities (Walther, 1971). However, a different preprocessing scheme is necessary for each function, making it very difficult to have a unified hyperbolic CORDIC hardware. Moreover, the preprocessing leads to a significant increase in processing time and hardware.

Hu *et al.* (1991) have proposed another scheme to address the problem of the range of convergence. The approach consists in the inclusion of additional iterations in the basic CORDIC algorithm. As will be shown in Section III, the increase in hardware and processing time is moderate and suitable for VLSI and FPGA implementation.

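The method proposed by Hu *et al.* (1991) consists in the inclusion of additional iterations for non-positive indexes *i*, with elementary angles:

$$\theta_i = \tanh^{-1}\left(1 - 2^{i-2}\right), \quad i \le 0 \tag{13}$$

Therefore, the modified algorithm becomes, for *i ≤ 0*:

$$
\begin{aligned}
X_{i+1} &= X_i + d_i\left(1 - 2^{i-2}\right)Y_i \\
Y_{i+1} &= Y_i + d_i\left(1 - 2^{i-2}\right)X_i \\
Z_{i+1} &= Z_i - d_i\tanh^{-1}\left(1 - 2^{i-2}\right)
\end{aligned}
\tag{14}
$$

and, for *i > 0*:

$$
\begin{aligned}
X_{i+1} &= X_i + d_i\,2^{-i}\,Y_i \\
Y_{i+1} &= Y_i + d_i\,2^{-i}\,X_i \\
Z_{i+1} &= Z_i - d_i\tanh^{-1}\left(2^{-i}\right)
\end{aligned}
\tag{15}
$$

The results for the rotation and vectoring modes tend to the same values stated in Eq. 4 and Eq. 5. The value of d_{i} is the same as indicated in Eq. 3. But the quantity *A_{n}*, described in Eq. 6, must be redefined as follows:

$$A_n = \prod_{i=-M}^{0}\sqrt{1 - \left(1 - 2^{i-2}\right)^2}\;\prod_{i=1}^{N}\sqrt{1 - 2^{-2i}} \tag{16}$$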

The range of convergence, stated in Eq. 7 and Eq. 10 for the basic hyperbolic CORDIC algorithm, now becomes:

Rotation Mode:

$$|Z_0| \le \theta_{max} \tag{17}$$

Vectoring Mode:

$$\left|\tanh^{-1}\left(Y_0/X_0\right)\right| \le \theta_{max} \tag{18}$$

Where:

$$\theta_{max} = \sum_{i=-M}^{0}\tanh^{-1}\left(1 - 2^{i-2}\right) + \left[\tanh^{-1}\left(2^{-N}\right) + \sum_{i=1}^{N}\tanh^{-1}\left(2^{-i}\right)\right] \tag{19}$$

Although Eq. 17 and Eq. 18 look nearly the same, they are interpreted differently: Eq. 17 states the maximum input angle the user can enter to obtain a valid result, whereas Eq. 18 states the maximum value attainable for the *Tanh^{-1}* function to which *Z-Z_{0}* tends (according to Eq. 5). If *Z_{0}=0*, Eq. 18 states the maximum value attainable at *Z*, and therefore imposes a limitation on the inputs *X_{0}* and *Y_{0}*.

The values for θ_{max} have been tabulated for M between 0 and 10 and are shown in Table 1.

**Table 1**. θ _{max} versus M for the Modified Hyperbolic CORDIC algorithm. (Hu *et al.*, 1991)

For example, if M = 5 is chosen (six additional iterations), then θ_{max} = 12.42644, and the domain of the functions *Cosh* and *Sinh* is greatly expanded to [-12.42644,+12.42644] compared with the domain in Eq. 9. Similarly, the range of the function *Tanh^{-1}* is increased to [-12.42644,+12.42644], which means that the domain of the quotient *Y_{0}/X_{0}* becomes nearly <-1,+1>, which is the entire domain of *Tanh^{-1}*.

From the last example, it is clear that the expansion scheme does work. The larger the required domain of the functions, the more additional iterations *(M+1)* must be executed.
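The relation between M and θ_{max} can be checked numerically. The sketch below assumes that θ_{max} is the sum of the angles tanh^{-1}(1 - 2^{i-2}) of the extra iterations (i = -M..0) plus the basic range, taken as the last basic angle plus the sum of tanh^{-1}(2^{-i}) with the repeated iterations counted twice. The function names and the default of 40 basic iterations are illustrative choices, not values from the paper.

```python
import math

def basic_theta_max(n_basic=40):
    """theta_N plus the sum of atanh(2^-i), i = 1..N, counting repeats at 4, 13, 40, ..."""
    total, repeat = math.atanh(2.0 ** -n_basic), 4
    for i in range(1, n_basic + 1):
        theta = math.atanh(2.0 ** -i)
        total += theta
        if i == repeat:          # repeated iteration contributes its angle twice
            total += theta
            repeat = 3 * repeat + 1
    return total

def expanded_theta_max(m, n_basic=40):
    """Add the angles atanh(1 - 2^(i-2)) of the extra iterations i = -m..0."""
    extra = sum(math.atanh(1.0 - 2.0 ** (i - 2)) for i in range(-m, 1))
    return extra + basic_theta_max(n_basic)

for m in range(11):
    print(f"M = {m:2d}  theta_max = {expanded_theta_max(m):.5f}")
```

Under these assumptions the sketch reproduces the value 12.42644 quoted above for M = 5.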

**III. ARCHITECTURES PROPOSED FOR THE EXPANDED HYPERBOLIC CORDIC ALGORITHM**

The architectures presented here implement the expanded hyperbolic CORDIC algorithm described in Eq. 14 and Eq. 15. The architectures are such that the inputs and outputs have an identical bit width. The intermediate registers and operators can have a higher bit width due to particular details of the algorithm and precision considerations, which will be explored later in this paper. As will be shown in Section IV, the bit width of the intermediate registers, the fixed-point format of the inputs and outputs, and the number of iterations vary considerably with the input/output bit width and the particular function desired. That is, to obtain an optimum architecture which yields a particular hyperbolic function (e.g. *Tanh^{-1}*, *Sinh/Cosh* and *exp*), the architecture has to be changed for each function and for each input/output bit width. In Section IV, we explore the particularities of each architecture for *Tanh^{-1}*, *Sinh*, *Cosh* and *exp*.

It is worth noting, however, that a unified hyperbolic CORDIC hardware, capable of obtaining all the functions within the same architecture, is desirable for certain applications, as has been shown in Hu (1992). The same principle which will be applied to the analysis of *Tanh^{-1}*, *Sinh*, *Cosh* and *exp* can be applied to this case and thus the optimum architecture can be attained.

In addition, there is a precision consideration which extends the bit width. It is a 'rule of thumb' found in Hu (1992): "If *n* bits is the desired output precision, the internal registers should have *log_{2}(n)* additional guard bits at the LSB position". This consideration, although arbitrary, has proved to work very well.

With these considerations in mind, three fixed-point architectures are presented: a low cost iterative version, a fully pipelined version, and a bit serial iterative version. But first, it is necessary to define the nomenclature used:

- n: input/output bit width
- nr: bit width of the internal registers and operators
- ng: additional guard bits, ng = log_{2}(n)
- N: number of basic iterations
- M: number of additional iterations minus one

Note that *nr ≥ ng + n*. We define the quantity *na = nr - (ng+n)* as the additional bits that are added to the MSB part, which will be necessary as we will demonstrate in Section IV.

**A. Low-Cost Iterative Architecture**

Figure 1 depicts the architecture that implements Eq. 14 and Eq. 15 in an iterative fashion. The two LUTs (look-up tables) store the two sets of elementary angles defined in Eq. 2 and Eq. 13.

**Figure 1**. Iterative - CORDIC

The process begins when a start signal is asserted. After '*M+1+N+v*' clock cycles ('*v*' is the number of repeated iterations stated in Section II-A), the result is obtained in the registers *X, Y* and *Z*, and a new process can be started.

In Fig. 1, the inputs are X_0, Y_0 and Z_0, the outputs are X_N, Y_N and Z_N, and the stage counters are labeled j (ranging between M and 0) and it (ranging from 1 to N).

There are two stages. The upper part of Fig. 1 implements the iterations for *i ≤ 0* and needs two multiplexers, two registers, four adders and two barrel shifters. This is the most critical part of the design: it introduces considerable delay and thus reduces the frequency of operation. The lower part of Fig. 1 implements the iterations for *i > 0* with the classical hardware found in many textbooks and papers.

A state machine controls the loading of the registers, the data that passes through the multiplexers, the add/subtract decision of the adder/subtracters, and the count given to the barrel shifters.

**B. Bit serial iterative architecture**

The simplified interconnect and logic in a bit serial design should allow it to work at a much higher frequency than other architectures. However, the design needs to be clocked '*n*' times for each iteration ('*n*' is the width of the data). This architecture maps well onto FPGAs (Andraka, 1998) and is depicted in Fig. 2.

**Figure 2.** Bit Serial Iterative-CORDIC

The input data (*X_0, Y_0* and *Z_0*) is loaded into the registers bit by bit. Then the calculation starts. The array of serial adders/subtractors, multiplexers and flip-flops is arranged and controlled by a state machine so that one output bit is computed every cycle, and after '*n + n(M+1+N+v)*' cycles a new result is obtained in the registers ('*v*' is the number of repeated iterations as stated in Section II-A).

**C. Fully Pipelined Architecture**

To develop the architecture, the algorithm described in Eq. 14 and Eq. 15 is unfolded. In addition, the stages that implement the expansion (the upper part of Fig. 2) need to be partitioned in order to avoid large delays. Therefore '2*(M+1) + N + v' stages will appear ('*v*' is the number of repeated iterations as stated in Section II-A). The architecture is depicted in Fig. 3.

**Figure 3**. Pipelined-CORDIC

This architecture can obtain a new result each cycle. The initial latency is '2*(M+1) + N + v' cycles.

At each stage, *X* and *Y* have a fixed shift that can be implemented in the wiring, thus removing the barrel shifters of Section III-A. In addition, the look-up values for the θ_{i} are distributed across the stages as hardwired constants, hence removing the look-up table. The entire hardware is reduced to an array of interconnected adder/subtractors and registers. A little additional hardware is needed to obtain the '*dix*' signals, which are obtained as indicated in Eq. 3.
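The cycle counts quoted for the three architectures can be compared with a small Python sketch. It assumes that '*v*' is the number of terms of the sequence 4, 13, 40, ... that do not exceed N, as Section II-A suggests. The function names are hypothetical, and the example parameters (n = 16, M = 3, N = 12) are those used for the 16-bit *Tanh^{-1}* case in Section IV-A.

```python
def repeated_iterations(n_basic):
    """Count the repeated indices 4, 13, 40, ... that fall within 1..n_basic."""
    count, k = 0, 4
    while k <= n_basic:
        count += 1
        k = 3 * k + 1
    return count

def latency_cycles(n, m, n_basic):
    """Clock cycles until the first result, per the formulas of Section III."""
    v = repeated_iterations(n_basic)
    steps = (m + 1) + n_basic + v
    return {
        "iterative": steps,                       # M+1+N+v
        "bit_serial": n + n * steps,              # n + n(M+1+N+v)
        "pipelined": 2 * (m + 1) + n_basic + v,   # 2(M+1)+N+v, then one result per cycle
    }

print(latency_cycles(n=16, m=3, n_basic=12))
```

The pipelined figure is only the initial latency; after it, that architecture delivers one result per clock cycle, whereas the other two must run their full cycle count for every result.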

**IV. ANALYSIS OF NUMERICAL FORMATS FOR EACH HYPERBOLIC FUNCTION**

For our fixed-point hardware, four standard bit widths for the inputs/outputs are explored: 12, 16, 24 and 32. Our aim is to obtain an optimum architecture that implements just one hyperbolic function. As a result, we restrict our analysis to architectures that implement one of the following functions: *Tanh^{-1}*, *Cosh*, *Sinh* and *exp*, which can be directly obtained from the hyperbolic CORDIC Eq. 14 and Eq. 15. In case a unified hyperbolic CORDIC hardware is desired, the same analysis can be applied to obtain the optimum architecture. We will show that the intermediate registers and operators need to be augmented in the MSB part and that the format for X, Y, Z and the number of iterations vary for each architecture that implements a particular hyperbolic function. The LSB positions are always extended by *log2(n)* bits, where *n* is the input/output bit width. In the following subsections we calculate the internal datapath width, but this value does not include the guard bits (*log2(n)*), because they are always present and omitting them keeps the explanation simple. The numerical format is defined as [T D], where:

- T: total number of bits
- D: total number of fractional bits
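To make the [T D] notation concrete, the following Python sketch converts between T-bit words and real values. It assumes two's complement words, as used in the subsections below, and quantization by truncation; the helper names are hypothetical.

```python
import math

def fixed_to_float(word, total_bits, frac_bits):
    """Interpret an unsigned T-bit word as a two's complement [T D] value."""
    if word >= 1 << (total_bits - 1):   # sign bit set
        word -= 1 << total_bits
    return word / (1 << frac_bits)

def float_to_fixed(value, total_bits, frac_bits):
    """Quantize by truncation (floor) and wrap into a T-bit word."""
    raw = math.floor(value * (1 << frac_bits))
    return raw & ((1 << total_bits) - 1)

print(fixed_to_float(0x3FF, 12, 10))         # largest positive [12 10] value
print(fixed_to_float(0x8000, 16, 13))        # most negative [16 13] value
print(hex(float_to_fixed(-4.0, 16, 13)))
```

With T = 12 and D = 10 there are two integer bits (sign included), so the representable range is [-2, 2) in steps of 2^{-10}.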

**A. Inverse Hyperbolic Tangent (Tanh^{-1})**

To obtain this function in *Z*, we have to set *Z_{0}=0* and *X_{0}=1* in the vectoring mode. Then, *Z_{N}* → *Tanh^{-1}(Y_{0})*. As the domain of *Tanh^{-1}* is *<-1,+1>*, the input *Y_{0}* is restricted to 1 integer bit in the 2's complement fractional fixed-point representation (*|Y_{0}|<1*). But, as the input *X_{0}=1* requires 2 integer bits and the format for *X* and *Y* must be the same, *X* and *Y* must have 2 integer bits. The critical case occurs when *Y_{0}* is at its maximum value, from which the maximum value of *Z_{N}* is obtained. Then we use Table 1 to find the number of additional iterations (*M*) needed to correctly represent *Z_{N}*, by locating the smallest tabulated θ_{max} that is not below it. It is unnecessary to add more bits to *X* and *Y*, because they tend to decrease as shown in Eq. 5. At each bit width (12, 16, 24 and 32) we will obtain an adequate format for *Z*.

Table 2 shows the number of basic iterations (N), the additional iterations (M+1), and the set of elementary angles for each bit width. The way these values are obtained along with the format of the elementary angles is specified in the following subsections.

**Table 2.** Elementary angles for each bit width

**Input/Output bit width: 12. Format for X and Y: [12 10]**

*|Y_{0}|_{max} = 3FFh = 0.9990234375*

*Z_{N max} = Tanh^{-1}(3FFh) = 3.812065*

From Table 1, 3 additional iterations are needed (M=2, *θ_{max}=5.162*). From Table 2, we need N=9 iterations. Given *Z_{N max}*, *Z* needs 3 integer bits. However, the maximum intermediate value for *Z* is *4.04*, and 1 bit must be extended to the MSB. This change will be implemented in the internal architecture. Thus, the format for *Z* is [12 9] and the internal datapath is 13 bits.

**Input/Output bit width: 16. Format for X and Y: [16 14]**

*|Y_{0}|_{max} = 3FFFh = 0.99993896484375*

*Z_{N max} = Tanh^{-1}(3FFFh) = 5.1985885952*

Table 1 indicates that 4 additional iterations are needed (M=3, *θ_{max}=7.23*). From Table 2, we need N=12 iterations. Given the maximum *Z_{N}*, 4 integer bits are needed to represent *Z*. And, as *θ_{max}=7.23* needs 4 integer bits, no bit will be extended. Thus, the format for *Z* remains [16 12] and the internal datapath is 16 bits.

**Input/Output bit width: 24. Format for X and Y: [24 22]**

*|Y_{0}|_{max} = 3FFFFFh*

*Z_{N max} = Tanh^{-1}(3FFFFFh) = 7.9711925*

Table 1 specifies that 5 additional iterations are needed (M=4, *θ_{max}=9.592634*). From Table 2, we choose N=16 iterations. Given *Z_{N max}*, we found that 4 integer bits are needed to represent *Z*. However, in this case, the maximum intermediate value for *Z* is *8.5346*, so we have to extend 1 bit to the MSB. This change will be implemented in the internal architecture. Thus, the format for *Z* remains [24 20] and the internal datapath is 25 bits.

**Input/Output bit width: 32. Format for X and Y: [32 30]**

*|Y_{0}|_{max} = 3FFFFFFFh*

*Z_{N max} = Tanh^{-1}(3FFFFFFFh) = 10.743781*

Table 1 specifies that 6 additional iterations are needed (M=5, *θ_{max}=12.42644*). From Table 2, we choose N=16 iterations. Given the maximum *Z_{N}*, 5 integer bits are needed to represent *Z*. And, as *θ_{max}=12.42644* requires 5 bits, no bit needs to be extended. Thus, the format for *Z* remains [32 27]. The internal datapath is 32 bits.

We have chosen the number of iterations to be 16 for 24 and 32 bits. While 20 iterations can be executed for 24 bits, and 27 iterations for 32 bits, it would increase the amount of hardware excessively.
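The 12-bit case can be reproduced with a floating-point model of the expanded algorithm in vectoring mode. The sketch below assumes multipliers 1 - 2^{i-2} for the extra iterations i = -M..0, multipliers 2^{-i} for the basic iterations with 4, 13, 40, ... executed twice, and d_{i} chosen to drive Y towards zero. It deliberately ignores fixed-point truncation, so it only shows that M = 2 and N = 9 are enough for the largest [12 10] input to converge; the function names are hypothetical.

```python
import math

def shift_factors(m, n_basic):
    """Multipliers of the executed iterations: 1 - 2^(i-2) for i = -m..0,
    then 2^-i for i = 1..n_basic with 4, 13, 40, ... executed twice."""
    factors = [1.0 - 2.0 ** (i - 2) for i in range(-m, 1)]
    repeat = 4
    for i in range(1, n_basic + 1):
        factors.append(2.0 ** -i)
        if i == repeat:
            factors.append(2.0 ** -i)
            repeat = 3 * repeat + 1
    return factors

def expanded_vectoring(x, y, z, m, n_basic):
    """Vectoring mode: drive y to 0 while z accumulates atanh(y0 / x0)."""
    for f in shift_factors(m, n_basic):
        d = 1.0 if x * y < 0 else -1.0
        x, y, z = x + d * f * y, y + d * f * x, z - d * math.atanh(f)
    return x, y, z

y0 = 0x3FF / 1024.0    # largest [12 10] input, 0.9990234375
_, _, z_n = expanded_vectoring(1.0, y0, 0.0, m=2, n_basic=9)
print(z_n, math.atanh(y0))
```

In this idealized model the residual error is bounded by the last elementary angle (about 2^{-9}); the fixed-point hardware adds truncation error on top of that.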

**B. Hyperbolic Sine and Hyperbolic Cosine**

To obtain these functions in *X* and *Y*, it is necessary to set *Y_{0}=0* and *X_{0}=1/A_{n}* in the rotation mode. Then, *Y_{N}* → *Sinh(Z_{0})* and *X_{N}* → *Cosh(Z_{0})*. As the domain of *Sinh* and *Cosh* is <-∞,+∞>, there is no input restriction. Our strategy will consist in fixing to [12 10] the input *Z* for the bit width of 12, and incrementing 1 integer bit for each larger bit width, so that by augmenting the bit width, the range of the functions *Sinh* and *Cosh* is incremented.

The critical case occurs when |Z_{0}| is at the maximum value attainable at each format, from which the maximum values of *X_{N}* and *Y_{N}* are obtained. Then we use Table 1 to find the number of additional iterations (*M*) needed to correctly represent *Z_{0}*. There is no need to add more bits to *Z*, because *Z* tends to 0 as shown in Eq. 4. At each different bit width (12, 16, 24 and 32) we will obtain an adequate format for *X* and *Y*.

Table 3 shows the number of basic iterations (N) and the additional iterations (M+1) necessary for each bit width. The way these values are obtained is shown in the following subsections. In addition, it also shows the set of elementary angles needed for each bit width. The format of these angles and the procedure to obtain these formats are specified in the following subsections.

**Table 3**. Elementary angles for each bit width

**Input/Output bit width: 12. Input format for Z: [12 10]**

*|Z_{0}|_{max} = |800h| = |-2|*

From Table 1, we need 1 additional iteration (M=0). From Table 3, we need N=10 iterations. Then, *A_{n}=0.547776019905* and *X_{0}=1/A_{n}=1.82556366774193*. Also, the maximum values of *X_{N}* and *Y_{N}*, given *|Z_{0}|_{max}*, are:

*Y_{N} = Sinh(800h) = -3.62686040784702*

*X_{N} = Cosh(800h) = +3.76219569108363*

Given these maximum values, 3 integer bits are needed to represent *X* and *Y*. In addition, the maximum intermediate value for *X* is *2.510150043145* and for *Y* is *2.281954584677*, so no bit needs to be extended. Thus the format for X and Y remains [12 9] and the internal datapath for X and Y is 12 bits.
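The scale factor used to initialize *X_{0}* can be recomputed for each (M, N) pair of this section. The sketch assumes that *A_{n}* is the product of sqrt(1 - f^2) over every executed iteration, with f = 1 - 2^{i-2} for the extra iterations and f = 2^{-i} for the basic ones, and that the repeated iterations 4, 13, 40, ... contribute their factor twice. Under that assumption it reproduces the value 0.547776 quoted above for M = 0 and N = 10; the function name is hypothetical.

```python
import math

def gain(m, n_basic):
    """A_n accumulated over the extra iterations i = -m..0 and the basic
    iterations i = 1..n_basic, with 4, 13, 40, ... counted twice."""
    a, repeat = 1.0, 4
    for i in range(-m, 1):
        a *= math.sqrt(1.0 - (1.0 - 2.0 ** (i - 2)) ** 2)
    for i in range(1, n_basic + 1):
        reps = 2 if i == repeat else 1
        a *= math.sqrt(1.0 - 4.0 ** -i) ** reps
        if i == repeat:
            repeat = 3 * repeat + 1
    return a

# (bit width, M, N) as chosen for Sinh/Cosh in this section.
for bits, m, n_basic in [(12, 0, 10), (16, 2, 13), (24, 4, 16), (32, 7, 16)]:
    a_n = gain(m, n_basic)
    print(f"{bits} bits: A_n = {a_n:.6e}, X0 = 1/A_n = {1.0 / a_n:.6e}")
```

Because *A_{n}* shrinks quickly as M grows, *X_{0}=1/A_{n}* grows accordingly, which is why the wider formats need more integer bits for *X* and *Y*.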

**Input/Output bit width: 16. Input format for Z: [16 13]**

*|Z_{0}|_{max} = |8000h| = |-4|*

From Table 1, we need 3 additional iterations (M=2). Table 3 shows that N=13 iterations are needed. Any further iteration would yield a value less than or equal to 001h for the fixed angle rotation, which is useless. Then, *A_{n} = 0.09228252133203* and *X_{0} = 1/A_{n} = 10.83628823276322*. Also, the maximum values of *X_{N}* and *Y_{N}*, given *|Z_{0}|_{max}*, are:

*Y_{N} = Sinh(8000h) = -27.289917197*

*X_{N} = Cosh(8000h) = +27.3082328361*

These maximum values indicate that 6 integer bits are needed to represent *X* and *Y*. In addition, in this case, the maximum intermediate value for *X* is *34.45601024011* and for *Y* is *-34.4348456146*, so we have to extend 1 bit to the MSB. This change will be implemented in the internal architecture. Thus, the format for *X* and *Y* remains [16 10] and the internal datapath for X and Y is 17 bits.

**Input/Output bit width: 24. Input format for Z: [24 20]**

*|Z_{0}|_{max} = |800000h| = |-8|*

From Table 1, 5 additional iterations are needed (M=4). With the *Z* format [24 20] the θ_{i} values defined for the LUT are those of Table 3, from which we have chosen N=16 basic iterations. While 20 iterations can be executed, doing so would increase the amount of hardware excessively. Then, *A_{n}=4.0305251x10^{-3}* and *X_{0}=1/A_{n}=248.1066*. Also, the maximum values of *X_{N}* and *Y_{N}*, given *|Z_{0}|_{max}*, are:

*Y_{N} = Sinh(800000h) = -1490.47882*

*X_{N} = Cosh(800000h) = +1490.47916*

Given these maximum values, we found that 12 integer bits are needed to represent *X* and *Y* correctly. In addition, in this case, the maximum intermediate value for *X* is *3081.0854* and for *Y* is *3081.085173*, so we have to extend 1 bit to the MSB. This change will be implemented in the internal architecture. Thus, the format for *X* and *Y* remains [24 12] and the internal datapath for *X* and *Y* is 25 bits.

**Input/Output bit width: 32. Format for Z: [32 27]**

*|Z_{0}|_{max} = |80000000h| = |-16|*

Table 1 states that 8 additional iterations are needed (M=7). With the Z format [32 27] the θ_{i} values defined for the LUT are those of Table 3. We have chosen N=16 basic iterations. While 27 iterations can be executed, doing so would increase the amount of hardware excessively. Then, *A_{n}=2.7737x10^{-6}* and *X_{0}=1/A_{n}=3.605287519x10^{5}*. Also, the maximum values of *X_{N}* and *Y_{N}*, given *|Z_{0}|_{max}*, are:

*Y_{N} = Sinh(80000000h) = 4.44305526x10^{6}*

*X_{N} = Cosh(80000000h) = 4.44305526x10^{6}*

Given these maximum values, 24 integer bits are needed to represent *X* and *Y*. In addition, the maximum intermediate value for *X* is *20.32x10^{6}* and for *Y* is *20.32x10^{6}*, so we have to extend 2 bits to the MSB. This change will be implemented in the internal architecture. Thus, the format for *X* and *Y* remains [32 8] and the internal datapath is 34 bits. The format for *Z* ([32 27]) is a good choice, since with [32 26] we would need more than 32 integer bits for X and Y, which is impossible to implement. In conclusion, M=7 and N=16, the *X, Y* format is [32 8], and the internal datapath for *X, Y* is 34 bits.

**C. Exponential (e^{x})**

To obtain this function, we have to set *X_{0}=Y_{0}=1/A_{n}* in the rotation mode. Then, *Y_{N}* → *Sinh(Z_{0}) + Cosh(Z_{0})* and *X_{N}* → *Cosh(Z_{0}) + Sinh(Z_{0})*. As *e^{w} = Sinh(w) + Cosh(w)*, we can rewrite: *Y_{N}* → *e^{Z_{0}}* and *X_{N}* → *e^{Z_{0}}*. The domain of *e^{w}* is <-∞,+∞>, hence there is no input restriction. Our strategy will consist in fixing to [12 10] the input *Z* for the bit width of 12, and incrementing 1 integer bit for each larger bit width, so that by augmenting the bit width, the range of the function *e^{w}* is incremented. It is worth mentioning that the hardware is identical to the hardware that computes *Sinh* and *Cosh*. The critical case occurs when *Z_{0}* is the maximum value attainable at each format, from which the maximum values of *X_{N}* and *Y_{N}* are obtained. Then we use Table 1 to find the number of additional iterations (*M*) needed to represent *Z_{0}*. As *Z* tends to 0 (as shown in Eq. 4), there is no need to add more bits to *Z*. At each different format (12, 16, 24 and 32) we obtain an adequate format for the bits of *X* and *Y*.

The analysis is similar to the case of Section IV-B. Since the maximum value for X and Y will be X_{N}=Y_{N}≈2cosh(Z_{0}), 1 additional integer bit is needed. This means that the formats obtained in Section IV-B lose 1 fractional bit and X and Y must be read differently (12 bits: X, Y format = [12 8]; 16 bits: X, Y format = [16 9]; 24 bits: X, Y format = [24 11]; 32 bits: X, Y format = [32 7]).

**D. Results of FPGA implementation**

Note that the hardware for obtaining *exp and Sinh/Cosh* is exactly the same, though the results are interpreted differently.

The results, obtained with *Quartus II 5.0*, show that the hyperbolic CORDIC implementation is amenable to FPGA. The clock rates are relatively high and the resource effort is bearable for high-density FPGAs.

**V. ERROR ANALYSIS**

For the cases analyzed in Sections IV-A, IV-B and IV-C, an error analysis is performed. The results are contrasted with the ideal values obtained in MATLAB®. The error measure will be:

$$\text{relative error} = \left|\frac{\text{ideal value} - \text{CORDIC value}}{\text{ideal value}}\right| \tag{20}$$

The three cases will be tested. We have taken 1024 values equally spaced along the maximum domain of the functions obtained for each bit width analyzed. In the case of *Tanh^{-1}*, it has been necessary to add more values, because *Tanh^{-1}* grows dramatically as its argument nears ±1.

**Table 4**. Final Results - Stratix EP1S10F484C5
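The error measure and the sampling used in this section can be sketched in Python. Eq. 20 is taken here to be the relative error |ideal - measured| / |ideal|, which is an assumption based on the wording of the following subsections. The sketch models only the truncation of the output to the [12 9] format, not the CORDIC datapath itself, so it gives a floor for the error rather than the curves reported in the figures; the helper names are hypothetical.

```python
import math

def relative_error(ideal, measured):
    """Relative deviation of a measured value from the ideal one."""
    return abs(ideal - measured) / abs(ideal)

def truncate(value, frac_bits):
    """Model only the output quantization: keep frac_bits fractional bits (floor)."""
    return math.floor(value * (1 << frac_bits)) / (1 << frac_bits)

# 1024 equally spaced points in [0, 2), the 12-bit Cosh domain; outputs in [12 9].
points = [2.0 * k / 1024 for k in range(1024)]
errors = [relative_error(math.cosh(w), truncate(math.cosh(w), 9)) for w in points]
print(max(errors))
```

Since *Cosh(w)* is at least 1 and truncation loses less than 2^{-9}, the quantization-only relative error stays below 2^{-9}; functions whose values approach zero do not have such a bound, which matches the behavior described for *Tanh^{-1}* and *e^{w}* below.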

**A. Inverse Hyperbolic Tangent**

We will show the relative error for the hardware that implements the inverse hyperbolic tangent in its entire domain for 12, 16, 24 and 32 bits. Figures 4, 5 and 6 show the relative error performance for the function *Tanh^{-1}(w)* for 12, 16, 24 and 32 bits. Although the domain of *Tanh^{-1}* is <-1,+1>, we have just plotted for *w* ∈ [0,+1> since *Tanh^{-1}* is an odd function.

For *w* near *0*, all the curves exhibit high relative error values, because *Tanh^{-1}(w)* yields its smallest values for *w* near *0*, and the fixed-point hardware cannot represent those small values accurately.

The larger the bit width, the smaller the relative error. For example, for 12 bits (Fig. 4) nearly all the relative error values are below 10^{-2} (an error below 1%), and for 24 bits (Fig. 5), the relative error values are below 10^{-4} (an error below 0.01%).

**Figure 4**. In Curve A, 12 bits were used.

In Curve B, 16 bits were used.

**Figure 5**. In the curve, 24 bits were used.

The curve for 32 bits (Fig. 6) exhibits some irregularities due to the reduced number of basic iterations (16), but in general it provides the least relative error. However, it is unusual to have a *Tanh^{-1}* hardware with an input data width of 32 bits.

**Figure 6**. In the curve, 32 bits were used.

**B. Hyperbolic Sine and Hyperbolic Cosine**

We will show the relative error for the hardware that implements *Sinh* and *Cosh* in the maximum domain obtained in Section IV-B for 12, 16, 24 and 32 bits.

Figures 7 and 8 show the relative error for *Cosh(w)* for 12, 16, 24 and 32 bits. We have just plotted the positive domain. The negative domain is not plotted, since *Cosh* is an even function and will yield the same values.

**Figure 7**. In Curve A, 12 bits were used (w∈[0,2>).

In Curve B, 16 bits were used (w∈[0,4>).

**Figure 8**. In Curve A, 24 bits were used (*w*∈[0,8>).

In Curve B, 32 bits were used (*w*∈[0,16>).

Curve A (Fig. 8), for 24 bits, is very regular because this format uses a larger number of fractional bits than the other formats.

Curve B (Fig. 8), for 32 bits, exhibits some irregularities due to the reduced number of fractional bits (8) and the reduced number of basic iterations (16). However, it provides the greatest domain for the *Cosh(w)* function (*w* ∈ [-16,16>).

Figures 9 and 10 show the relative error performance for *Sinh(w)* for 12, 16, 24 and 32 bits. We have just plotted the positive domain. The negative domain is not plotted, since *Sinh* is an odd function and will yield the negatives of the values obtained in the positive domain.

**Figure 9**. In Curve A, 12 bits were used (*w*∈[0,2>).

In Curve B, 16 bits were used (*w*∈[0,4>).

**Figure 10**. In Curve A, 24 bits were used (*w*∈[0,8>).

In Curve B, 32 bits were used (*w*∈[0,16>).

Curve A (Fig. 10), for 24 bits, is very regular because this format uses a larger number of fractional bits than the other formats.

Curve B (Fig. 10), for 32 bits, exhibits some irregularities due to the reduced number of fractional bits (8) and the reduced number of basic iterations (16). However, it provides the greatest domain for the *Sinh(w)* function (*w* ∈ [-16,+16>).

**C. Exponential**

We will show the relative error for the hardware that implements *e^{x}* in the domain obtained in Section IV-C for 12, 16, 24 and 32 bits. Figures 11 and 12 show the relative error for *e^{w}* for 12, 16, 24 and 32 bits.

**Figure 11**. In Curve A, 12 bits were used (*w*∈[-2,2>).

In Curve B, 16 bits were used (*w*∈[-4,4>).

**Figure 12**. In Curve A, 24 bits were used (*w*∈[-8,8>).

In Curve B, 32 bits were used (*w*∈[-16,16>).

Note that, as *w* becomes more negative, the error increases and even becomes constant (as in Fig. 12). The reason is that *e^{w}* is very small for large negative values of *w*, and the fixed-point hardware cannot represent those small values accurately.

**VI. CONCLUSIONS**

The expansion scheme proposed by Hu *et al.* (1991), despite the additional hardware needed, has proved to be amenable to our FPGA implementation, as the clock rate and resource effort indicate. The function *Tanh^{-1}* is expanded to its entire domain and the functions *Cosh* and *Sinh* have a greater domain as the bit width increases.

The analysis for a unified CORDIC algorithm has not been performed in order not to lengthen this paper, but the analysis for this case is very similar to that of Section IV.

The error analysis shows certain irregularities in the relative error performance. These irregularities are due to the truncation of the fractional bits and the limited number of basic and additional iterations. We have tested the CORDIC algorithm in MATLAB® and have found that the error performance is uniform.

**REFERENCES**

1. Andraka, R., "A survey of CORDIC algorithm for FPGA based computers", *Proceedings of the ACM/SIGDA*, 191-200 (1998).

2. Hu, X., R. Huber and S. Bass, "Expanding the Range of Convergence of the CORDIC Algorithm", *IEEE Transactions on Computers*, **40**, 13-21 (1991).

3. Hu, Y., "CORDIC-Based VLSI Architectures for Digital Signal Processing", *IEEE Signal Processing Magazine*, **9**, 16-35 (1992).

4. Meyer-Baese, U., *Digital Signal Processing with Field Programmable Gate Arrays*, Springer-Verlag Berlin, Heidelberg (2001).

5. Walther, J.S., "A unified algorithm for elementary functions", *Proc. Spring Joint Comput. Conf.*, **38**, 379-385 (1971).

**Received: April 14, 2006. Accepted: September 8, 2006. Recommended by Special Issue Editors Hilda Larrondo, Gustavo Sutter.**