Latin American applied research

Print version ISSN 0327-0793

Lat. Am. appl. res. vol. 37 no. 1, Bahía Blanca, Jan. 2007


A Verilog HDL digital architecture for delay calculation

A. Chacón-Rodríguez1, F. N. Martín-Pirchio2, P. Julián2,3 and P. S. Mandolesi2

1 Laboratorio de Componentes Electrónicos, Universidad Nacional de Mar del Plata, Mar del Plata, Argentina,
on leave from Instituto Tecnológico de Costa Rica.

2 Departamento de Ingeniería Eléctrica y Computadores, Universidad Nacional del Sur, Bahía Blanca, Argentina
3 Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)

Abstract — A method for the calculation of the delay between two digital signals with central frequencies in the range [20, 300] Hz is presented. The method performs a delay calculation in order to determine the bearing angle of a sound source. Computing accuracy is tested against a previous implementation of the Cross Correlation Derivative method. A Verilog RTL model of the method has been tested on a Xilinx® FPGA in order to evaluate the real performance of the method. Simulations of an ASIC design on a standard CMOS technology predict a power saving of about 25 times per delay stage over previous implementations.

Keywords — Verilog. FPGA. Low Power. Digital CMOS VLSI.


I. INTRODUCTION

Methods for the detection of sound sources have been widely studied, including complex techniques such as Independent Component Analysis, cross-correlation analysis (Carter, 1987; Knapp and Carter, 1976; Riddle, 2004), gradient flow techniques (Stanacevic and Cauwenberghs, 2005), and the emulation of the human cochlea (Shamma et al., 1986; Lazzaro and Mead, 1989; Horiuchi, 1995). Some of these have been successfully implemented in analog and digital VLSI circuits (Lazzaro and Mead, 1989; Horiuchi, 1995; Harris et al., 1999; Grech et al., 1999; van Schaik and Shamma, 2004).

There are only a few cases in the literature where low power integrated circuits have been implemented for this task. Van Schaik and Shamma (2004) implemented an integrated circuit (IC) based on an analog cochlea using a 0.5 μm process in an area of 5 mm². In this case, the power dissipation depends strongly on the input signals. With no activity, the cochlear channels dissipate 400 μW; for a time delay of 100 μs (corresponding to an incoming signal at a 77° angle), the power dissipated is 1.85 mW. Stanacevic and Cauwenberghs (2005) implemented a 3 mm by 3 mm IC in 0.5 μm CMOS technology with a method based on analog processing at a sampling rate of 16 kHz, which resolves delays of 2 μs with a power consumption of 32 μW. A third implementation, proposed in Julián et al. (2004) and successfully implemented in Julián et al. (2006), is based on a cross-correlation derivative algorithm. This method features an accuracy of one degree for angles in the range [0, 50] ∪ [+130, +180] degrees for signals between [20 Hz, 200 Hz], using two detectors (i.e. four microphones) to cover the whole 360 degrees without accuracy loss. The IC designed by Julián et al. (2006) achieves this with less than 600 μW of power dissipation (for a full quadrant bearing estimation) on a 0.35 μm technology, with dimensions of 2 mm by 2.4 mm.

An alternative method for the estimation of this angle is proposed in this paper in order to reduce power dissipation still further while keeping calculation performance. The problem is to determine the direction of a sound source picked up by an array of microphones such as the one in Fig. 1. The digital signals are provided by this array after suitable conditioning. The cross-correlation derivative (CCD) approach is a variation of the standard time-domain cross-correlation between two signals. The CCD algorithm works with a one-bit quantization of the input signals and therefore drastically reduces the complexity of the resulting digital circuitry. Another feature is that the spatial derivative of the cross-correlation is calculated instead of the cross-correlation itself. Calculation of the CCD results in a thousandfold reduction of activity in the digital circuitry. In the standard cross-correlation approach, once the partial correlations are calculated, their maximum needs to be evaluated, which requires a dedicated stage. In the CCD, it is only necessary to locate a change in the output value of the partial correlations (which are either 1 or 0), making this task trivial: one only needs to detect transitions of the input signals.

Fig. 1. Microphone array to measure the bearing angle from a sound source
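The idea of working on one-bit signals and locating their transitions can be illustrated with a short behavioral sketch in Python. It is not the CCD algorithm itself: it only quantizes two sine waves to one bit and measures the spacing between their transitions. The 200 Hz test tone, the 65-sample delay and all function names are illustrative assumptions; the half-sample phase offset is added only so that no sample falls exactly on a zero crossing.

```python
import math

FS = 200_000          # sampling rate in Hz (value used later in the paper)
F_SIG = 200           # assumed test tone in Hz
DELAY = 65            # assumed delay of X2 with respect to X1, in samples

def one_bit(samples):
    """Quantize a signal to one bit: 1 for non-negative samples, 0 otherwise."""
    return [1 if s >= 0 else 0 for s in samples]

def transitions(bits):
    """Return the sample indices at which the one-bit signal changes value."""
    return [n for n in range(1, len(bits)) if bits[n] != bits[n - 1]]

def tone(n):
    """Sine sample at (possibly negative) index n, offset by half a sample."""
    return math.sin(2 * math.pi * F_SIG * (n + 0.5) / FS)

n_total = 3000
x1 = one_bit(tone(n) for n in range(n_total))
x2 = one_bit(tone(n - DELAY) for n in range(n_total))   # delayed copy of X1

t1, t2 = transitions(x1), transitions(x2)

# Pair every X1 transition with the first X2 transition that follows it.
delays = []
for t in t1:
    later = [u for u in t2 if u >= t]
    if later:
        delays.append(later[0] - t)

print(delays)                      # spacing in samples
print(delays[0] * 1e6 / FS, "us")  # the same spacing expressed in microseconds
```

Each transition pair is 65 samples apart, which corresponds to 325 μs at 200 kHz.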

The strategy proposed in this paper is to use a single counter for delay measurement together with an adaptive closed-loop system. The closed loop guarantees stability and the convergence of the counter value to the delay under measurement. The reduction in power consumption follows from the reduction in circuit size: just one counter is needed, as opposed to the 104 10-bit counters used in the implementation proposed by Julián et al. (2006).

The paper is organized as follows. Section II describes the Verilog HDL front-end implementation of the proposed structure for the detection of transitions and the calculation of the respective delay. Section III analyses simulations and test runs on a Xilinx® Spartan3 FPGA to provide data for contrast against previous results obtained by Julián et al. (2006). Section IV shows comparisons of functional and power performance between the implementation in Julián et al. (2006) and the simulation results of a 0.5 μm ASIC design of the proposed architecture.


II. VERILOG HDL FRONT-END IMPLEMENTATION

A. Basic Structure

The first objective is to obtain at least the same accuracy as in the previous implementation by Julián et al. (2004). Any degradation of the method's accuracy would render useless any improvement in the rest of the circuit's features. The front-end design was coded in Verilog HDL using the Xilinx® Integrated Software Environment (ISE) and implemented on a Spartan3 Digilent Inc. prototyping board. Simulations were run on the Mentor Graphics® ModelSim® HDL simulator. Figure 2 depicts the basic structure, composed of a block that captures and stores the signals being measured and a second block that calculates the delay. The output has a tri-state control (oe_L) to allow interfacing to a general data bus, with two extra signals providing information about the state of the unit, i.e., whether it is out of its measurement range (out_range) and whether data is available (data_rdy).

Fig. 2. Block diagram of front-end design

A frequency divider was implemented to generate the 200 kHz clock from the board's 50 MHz oscillator. Xilinx® Digital Clock Managers were not used in order to avoid technology-dependent code. Nonetheless, global clock buffers were instantiated from the Xilinx® Verilog primitive libraries, thus protecting the generated clock signals from excessive fan-out. These extra buffers can be easily removed if another programmable technology is to be used, or if the code is to be synthesized for a particular MOS technology.
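The divider ratio follows directly from the two frequencies: 50 MHz / 200 kHz = 250 input cycles per output period. The Python model below is a minimal behavioral sketch of a counter-based divider with a 50% duty cycle, not the authors' Verilog; the function name and the toggle-every-125-cycles structure are assumptions made for illustration.

```python
def divide_clock(n_input_cycles, f_in=50_000_000, f_out=200_000):
    """Model a counter-based clock divider; returns one output level per input cycle."""
    ratio = f_in // f_out      # 250 input cycles per output period
    half = ratio // 2          # toggle the output every 125 input cycles
    count = 0
    clk = 0
    out = []
    for _ in range(n_input_cycles):
        count += 1
        if count == half:      # half an output period has elapsed
            count = 0
            clk ^= 1
        out.append(clk)
    return out

# One millisecond of the 50 MHz input clock is 50,000 cycles.
wave = divide_clock(50_000)
rising_edges = sum(1 for a, b in zip(wave, wave[1:]) if a == 0 and b == 1)
print(rising_edges)  # rising edges of the divided clock within 1 ms
```

Counting 200 rising edges in one millisecond confirms a 200 kHz output for these parameters.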

B. Delay Chains

The first block, shown in Fig. 3, captures the signals at a 200 kHz rate. This rate has been chosen in order to attain an estimation accuracy of one degree for angles in the range [0, 50] ∪ [+130, +180] degrees for signals between [20 Hz, 200 Hz], as stated in Julián et al. (2004). Data is stored in two Serial In-Parallel Out (SIPO) registers that serve as delay chains. At such speeds, the proposed circuit allows for the measurement of ±620 μs of delay, thus increasing the range of the system implemented by Julián et al. (2006).

Fig. 3. Block diagram of the delay chain
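A serial-in, parallel-out delay chain can be modeled in a few lines of Python. The sketch below is only a behavioral illustration: the class name is hypothetical, and the chain length of 128 taps is an assumption inferred from the 256 registers and 128 tri-state buffers mentioned elsewhere in the paper. At 200 kHz each tap represents 5 μs of delay.

```python
TAP_PERIOD_US = 1e6 / 200_000   # 5 us per tap at a 200 kHz sampling rate

class DelayChain:
    """Serial-in, parallel-out shift register: tap k holds the sample from k clocks ago."""

    def __init__(self, length=128):      # assumed chain length
        self.taps = [0] * length

    def shift(self, bit):
        """Clock one new sample in; the oldest sample falls off the end."""
        self.taps = [bit] + self.taps[:-1]

    def tap(self, k):
        """Read the sample delayed by k clock periods."""
        return self.taps[k]

chain = DelayChain()
for bit in [1, 0, 1, 1, 0]:
    chain.shift(bit)

print([chain.tap(k) for k in range(5)])   # newest sample first
print(65 * TAP_PERIOD_US)                 # delay represented by tap 65, in us
```

Reading tap k is therefore equivalent to looking at the input k × 5 μs in the past, which is what the multiplexers select when the chain is swept.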

Registers and multiplexers are generated using Verilog's generic parameters so as to speed up post placement simulation times (one can instantiate shorter multiplexers and registers before a particular simulation that does not use the whole structure, and then return to their full size version for a test on the FPGA, all with a simple parameter overloading on the parent modules and a recompilation of the ISE's project).

For reference, it is always assumed that signal X1 leads X2. The first bit of one of the chains (by convention X2) is used as the base pointer, while the other chain is swept in search of transitions by the signal tao_index. This index is an 8-bit signed integer in two's complement. The sign bit switches the multiplexers when X1 is actually lagging X2 instead of leading it. In that case, the base pointer is switched to X1[0] and the index's magnitude is used to sweep X2. In order to do this using just one decoder, X2's selection signals are wired in reverse order (as can be seen in the module's code). This scheme introduces an error of minus one tap (minus 5 μs at a sampling rate of 200 kHz), because the base is actually displaced by minus one bit as a result of the switching. This error is considered negligible, and can be easily corrected by the software of the system receiving the final data.
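One possible reading of this index scheme is sketched below in Python, as a behavioral model and not as the authors' Verilog. It assumes 128-tap chains, a shared 7-bit selection value taken from the low bits of tao_index, and that "wired in reverse order" means that selection value s picks tap 127 - s on X2. Under these assumptions the two's complement encoding of -k leaves 128 - k in the low seven bits, so the reversed wiring lands on tap k - 1, which would account for the one-tap offset described above. The function name is hypothetical.

```python
def select_taps(tao_index):
    """Map a signed 8-bit index to (base pointer, swept chain, selected tap).

    Assumed behavior: the low 7 bits drive a single decoder; X1 is selected
    directly, X2 through reversed wiring (tap = 127 - selection).
    """
    raw = tao_index & 0xFF          # 8-bit two's complement pattern
    sign = raw >> 7                 # 1 when the index is negative
    sel = raw & 0x7F                # 7-bit value seen by the decoder
    if sign == 0:
        return ("X2[0]", "X1", sel)         # X1 leads: sweep X1
    return ("X1[0]", "X2", 127 - sel)       # X1 lags: sweep X2, wired backwards

print(select_taps(65))    # positive index: tap equals the index
print(select_taps(-3))    # negative index: tap is |index| - 1
```

For an index of -3 the model selects tap 2 of X2, one tap (5 μs) short of the magnitude, so the receiving software would add one tap back for negative readings.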

Due to the shortage of tri-state buffers in the chosen FPGA (they are only available in the IO blocks), the synthesizer was allowed to substitute the large set of 128 tri-state buffers with pull-up logic. This will not be the case in an ASIC implementation.

C. Calculation Unit

The calculation unit in Fig. 4 must discover valid transitions in the input signal to account for an increase or decrease of the index counter, depending on the index sign bit. Repeated application of the calculation will produce a monotonic estimation of the target delay. Since the circuit is designed to increase or decrease its count by one on each valid transition, the convergence time is determined by:

, (1)

In its worst case (maximum delay of 400 μs for a 200 Hz signal), this convergence time is still well within the proposed estimation period of one second. An out_range signal is provided to indicate the saturation of the index counter. This serves as an auxiliary signal to allow for the adaptive measurement of faster or slower signals via modification of the clock speed.
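As a rough numerical check of the worst case quoted above, the sketch below assumes that the counter moves one tap per valid transition and that two transitions are evaluated per signal period (one on each edge). The function name and this closed form are illustrative assumptions and are not necessarily the paper's expression (1).

```python
def convergence_time(delay_s, f_sample_hz, f_signal_hz):
    """Estimate the time needed for the index counter to reach a given delay.

    Assumptions: one count per valid transition, two transitions per period.
    """
    steps = round(delay_s * f_sample_hz)     # delay expressed in taps
    return steps / (2 * f_signal_hz)         # seconds

worst_case = convergence_time(400e-6, 200_000, 200)
print(worst_case)   # seconds needed to count 80 taps at 400 transitions per second
```

Under these assumptions a 400 μs delay is 80 taps, and a 200 Hz signal provides 400 transitions per second, giving about 0.2 s, which is consistent with the statement that convergence fits within the one-second estimation period.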

For the validation of the transitions, the signals pass through two FFs (Fig. 4). The decision logic determines whether to increase, decrease or leave the counter unchanged depending on the arrival order of the transitions and tao_index's current state. Transitions are checked on the rising and falling edges of the signals as seen in Fig. 5. Evaluation thus occurs at a speed twice as fast as the signal's frequency (on a noise-free signal). The decision logic is registered in order to eliminate the chance of falsely locking the circuit to the same transition. This avoids a run-up of the counter.

Fig. 4. Computing of the index for the delay chains (delay between X1 and X2)

Fig. 5. Signals fed to the valid transition detector and counting decision logic. In this case, a transition on Y1 is detected while Y2 remains constant. The counter is increased if the sign bit is 0, or decreased otherwise.
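The closed-loop behavior described in this section can be explored with a small sample-level simulation. The Python sketch below is a simplified model under stated assumptions, not the authors' decision logic: it covers only the case where X1 leads X2 (sign bit 0), uses ideal noise-free square waves, takes Y1 as X1 delayed by the current index and Y2 as X2, and decides only when the two signals go from equal to different, so the second transition of a pair cannot trigger another count. All names are hypothetical.

```python
def square(n, half_period):
    """One-bit square wave that toggles every half_period samples."""
    return (n // half_period) % 2

def track_delay(true_delay, half_period, n_samples, max_index=127):
    """Simulate the up/down index counter; returns the index after each sample."""
    k = 0                                        # current index (delay estimate in taps)
    prev_y1 = square(0 - k, half_period)
    prev_y2 = square(0 - true_delay, half_period)
    history = []
    for n in range(1, n_samples):
        y1 = square(n - k, half_period)          # X1 seen through tap k
        y2 = square(n - true_delay, half_period) # X2 is X1 delayed by true_delay
        if prev_y1 == prev_y2 and y1 != y2:      # first transition of a pair
            if y1 != prev_y1 and k < max_index:
                k += 1                           # Y1 still leads: add delay
            elif y2 != prev_y2 and k > 0:
                k -= 1                           # Y2 leads: remove delay
        prev_y1, prev_y2 = y1, y2
        history.append(k)
    return history

# 200 Hz square wave sampled at 200 kHz (500 samples per half period),
# with X2 delayed by 65 taps (325 us) with respect to X1.
history = track_delay(true_delay=65, half_period=500, n_samples=40_000)
n_converged = history.index(65) + 1
print(history[-1], n_converged / 200_000)
```

In this model the index climbs by one tap on every edge of the input and settles at 65, after roughly 65 half periods of the 200 Hz signal (about 0.16 s).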

Table 1 shows a summary of the resources used in the Spartan3.

Table 1. Resources Used on Selected Device 3s200ft256-5, reported by Xilinx® ISE


III. SIMULATIONS AND FPGA TESTS

A. Functional Simulation and FPGA Testing

Simulations were run at the RTL level and the gate level (with back annotation from the post placement and routing models written by ISE®). Results from the ModelSim® simulator were fed into Matlab® for a preliminary check of the accuracy of the method. A set of files with test signals was created in Matlab® to serve as stimulus signals to the simulator. In addition, simulations were also performed using real signals taken from previous experiments on the same system reported in Julián et al. (2006).

The final tests were executed on a Spartan 3 Digilent Inc. board, with the input signals fed from a programmable delay generator written in VHDL and implemented on another Spartan 3 board. The outputs were fed to the computer through a PMD-1608FS Measurement Computing® acquisition board.

The measured data was processed in Matlab® to produce an analysis of the signals. The convergence time was also measured when the output evolves from one steady state to another after a sudden change in the delay being measured. Figure 6 shows the case where the input delay is suddenly changed from 0 to 325 μs (65 delay units at a sampling speed of 200 kHz). The convergence time in the graph is equal to the time predicted by (1).

Fig. 6. System's calculation transient of a sudden change in the input delay (0 to 325 μs).

B. Improvement of Decision Logic and Delay Calculation

Boolean equations for the control of the delay counter, DN_UP (2) and CNT_CLK (3), were obtained using Berkeley's Espresso minimization algorithm. They were tested on the FPGA by introducing them directly into the RTL code in place of the high-level decision statements used in the original front-end implementation.


A similar performance was obtained with simulations executed on both the RTL and the Boolean generated code. Tests were also run on the FPGA, contrasting both sets of code using Matlab® for verification.

This part of the circuit is not as critical as the delay chains from the viewpoint of power dissipation because in the worst-case condition (maximum delay between X1 and X2), the counter operates at a maximum speed equal to twice the input signal frequency, which is a low frequency signal. However, efforts were made to keep logic and registers at a minimum.


IV. ASIC DESIGN AND PERFORMANCE COMPARISON

In order to compare the architecture's performance in terms of accuracy and power dissipation, a design using a standard CMOS 0.5 μm technology was simulated. Due to the lack of standard cells specifically made for low power purposes, a schematic based on the logical Verilog design was drawn in Tanner® S-Edit, including all the constraints regarding power consumption. Based on this schematic, a layout of the circuit was drawn using Tanner® L-Edit, from which a SPICE model was extracted for simulation with Mentor Graphics® Eldo and Mach-TA for analog timing checking, power estimation and digital verification.

A. Performance Comparison

As already seen, the Delay Chain unit must operate at 200 kHz and, considering its size and operation speed, it would be responsible for most of the power dissipation on an IC designed with the proposed architecture. In order to minimize power consumption, the SIPO delay chains were built using C²MOS registers. This master-slave edge-triggered register does not need feedback, as the data is stored in the internal node capacitances, and features a lower clock fan-in and a smaller area compared with the eighteen transistors required for a static register (Rabaey et al., 2003).

Eldo SPICE simulations were run with the whole unit of 256 registers connected plus the selection logic, all supplied with 3.3 V, to obtain a preliminary estimation of the power dissipation. Results from simulations and the average values calculated are shown in Fig. 7 and Table 2, compared against data from previous implementations by Julián et al. (2006), Stanacevic and Cauwenberghs (2005), and van Schaik and Shamma (2004). As shown, the power dissipation is significantly reduced, with savings of 49.5, 2.66 and 154 times respectively. All this is achieved with a notable improvement of the system's measurement range.

Fig. 7. Current drawn by a 256 C²MOS master-slave register delay chain.

Table 2. Comparison of total power consumption between systems
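As a rough cross-check of the savings quoted above, the short Python calculation below divides the power figures cited in the introduction for each previous implementation by the corresponding saving factor. The pairing of figures (600 μW, 32 μW and 1.85 mW) with factors is an assumption based on the order in which the text lists them, and the result is a back-calculation, not a value reported by the authors.

```python
# (reference power in microwatts as cited in the introduction, quoted saving factor)
reported = {
    "Julián et al. (2006)": (600.0, 49.5),
    "Stanacevic and Cauwenberghs (2005)": (32.0, 2.66),
    "van Schaik and Shamma (2004)": (1850.0, 154.0),
}

# Power implied for the proposed delay chain by each comparison.
implied = {name: power / factor for name, (power, factor) in reported.items()}

for name, value in implied.items():
    print(f"{name}: {value:.2f} uW")
```

The three comparisons point to the same order of magnitude, roughly 12 μW, so the quoted factors are mutually consistent under this pairing.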


V. CONCLUSIONS

An implementation of a method for the calculation of the delay between two broadband digital signals has been presented. Results of digital simulations and tests executed on an FPGA showed that the method is functional and efficient, and exhibits an extended range that allows for the measurement of delays up to ±640 μs with a sampling speed of 200 kHz. Simulation results of this method using a standard CMOS 0.5 μm technology were also presented. Results of analog simulations showed a significant improvement in total power dissipation over other implementations. The efficiency of C²MOS dynamic techniques is thus corroborated, with further improvements still possible by reducing the supply voltage in the critical stages. Future steps include the fabrication of the integrated circuit and its verification.

ACKNOWLEDGMENTS

The authors thank Martín Di Federico at Universidad Nacional del Sur for his help with the VHDL programmable delay generator.
This work is partially funded by "Desarrollo de tecnología de redes de sensores para aplicaciones en el medio social y productivo", PICT 2003 No. 14628, Agencia Nacional de Promoción Científica y Técnica; "Redes de Sensores", PGI 24/ZK12, Universidad Nacional del Sur; and "Desarrollo de Microdispositivos para Redes de Sensores Acústicos", No. 5048, PIP 2005-2006, CONICET.
A. Chacón-Rodríguez is on a scholarship funded by the Organization of American States and the Instituto Tecnológico de Costa Rica.

REFERENCES

1. Carter, G.C., "Coherence and time delay estimation", Proceedings of the IEEE, 75, 236-255 (1987).
2. Grech, I., J. Micallef and T. Vladimirova, "Experimental results obtained from analog chips used for extracting sound localization cues", Proc. 9th Int. Conf. Electronics, Circuits and Systems, 1, 247-251 (2002).
3. Harris, G.H., C.J. Pu and J.C. Principe, "A neuromorphic monaural sound localizer", Advances in Neural Information Processing Systems, II, 692-698 (1999).
4. Horiuchi, T., "An auditory localization and coordinate transform chip", Advances in Neural Information Processing Systems, 7, 787-794 (1995).
5. Julián, P., A.G. Andreou, G. Cauwenberghs, R. Riddle and A. Shamma, "A Comparative Study of Sound Localization Algorithms for Energy Aware Sensor Network Nodes", IEEE Trans. Circuits and Systems - I: Regular Papers, 51, 640-648 (2004).
6. Julián, P., A.G. Andreou and D.H. Goldberg, "A low power correlation-derivative CMOS VLSI circuit for bearing estimation", IEEE Trans. on VLSI Systems, 14, 207-212 (2006).
7. Knapp, C.H. and G.C. Carter, "The generalized correlation method for estimation of time delay", IEEE Trans. Acoustics, Speech, Signal Processing, ASSP-24, 320-327 (1976).
8. Lazzaro, J.P. and C. Mead, "Silicon models of auditory localization", Neural Computation, 1, 41-70 (1989).
9. Rabaey, J., A. Chandrakasan and B. Nikolic, Digital Integrated Circuits. A Design Perspective, Prentice Hall, New Jersey (2003).
10. Riddle, L., "VLSI acoustic surveillance unit", GOMAC-tech Conf., Government Microcircuit Applications Critical Technology Conference, Monterey, USA, 12-13 (2004).
11. Shamma, S., R. Chadwick, J. Wilbur, K. Moorish and J. Rinzel, "A biophysical model of cochlear processing: intensity dependence of pure tone responses", J. Acoust. Soc. Am., 80, 133-145 (1986).
12. Stanacevic, M. and G. Cauwenberghs, "Micropower gradient flow acoustic localizer", IEEE Trans. Circuits Syst. I, 52, 2148-2156 (2005).
13. Van Schaik, A. and S. Shamma, "A neuromorphic sound localizer for a smart MEMS system", Analog Integrated Circuits and Signal Processing, 39, 267-273 (2004).

Received: April 14, 2006.
Accepted: September 8, 2006.
Recommended by Special Issue Editors Hilda Larrondo, Gustavo Sutter.

All content of this journal, except where otherwise noted, is licensed under a Creative Commons License.