RISC Architectures For Embedded Applications

Introduction

The 166 CPU core makes extensive use of reduced instruction set computer (RISC) concepts to achieve its blend of very high performance at modest cost. To understand why RISC techniques are especially suited to high-speed real-time embedded systems, it is useful to examine how they grew out of the traditional complex instruction set computers (CISC) that reached their peak in the late 1980s and early 1990s.

Behind The 166’s Near-RISC Core

The reason behind the abandonment of traditional complex instruction set computers (CISC) has been the quest for ever greater throughput. The demands of workstations involved in CAD tasks and, latterly, advanced video games have been the real driving force behind this. Traditionally, microprocessors have been designed with assembler instruction sets geared towards making the assembler programmer’s life easier, through the extensive use of microcode to produce ever more powerful instructions. By providing single assembler instructions that perform, for instance, three-operand multiplication, the assembler programmer (and HLL compiler writer) has been relieved of the job of achieving the same result with simpler instructions.

The need for the CPU to recognise and act on (decode) many hundreds of different instructions requires complex silicon and many clock cycles. The greater the silicon area, the greater the cost of the device and the power consumed. With physical limitations restricting achievable clock speeds on silicon devices, the number of cycles per instruction is very significant in gaining higher performance.

RISCs tend to shift the burden of programming from the microcoder to the assembler programmers and compiler writers. Work both in academia and at commercial manufacturers has shown that a suitably programmed RISC machine can achieve a far higher throughput than a CISC for a given clock speed.

Strangely, the embedded world has been slow to question the suitability of the CISC-based microcontroller. Whilst at the very top end devices such as the i80960 have enjoyed some success, for more commonplace embedded tasks RISC is almost unknown. With the increasing complexity of modern control algorithms, the need for greater processing power is set to become an issue in anything but the simplest applications. In addition, here more than in the workstation world, the worst-case response time to non-deterministic events is crucial, an area where CISCs are especially poor.

Many current high-end microcontrollers are based on existing CISC architectures such as the 8086 and 68000 which, in common with 8-bit devices such as the 8051, have an internal structure that dates back up to 19 years. With the silicon vendor’s need to give existing users an upgrade path, apparently new designs are often based closely on the existing architecture and instruction set, so protecting the user’s investment in expensive assembler code.

Like workstations, microcontrollers are tending to be programmed in a high level language (HLL) to reduce coding times and enhance maintainability. Inevitably, even with the best compilers, some loss of performance is encountered, emphasising again the need for improved CPU performance.

In addition to straightforward data processing, microcontrollers must also handle real-world peripherals such as A/D converters, PWMs, timers, ports and PLLs, all of which require real-time processing.

Conventional CISC Bottle-necks

1. Long And Unpredictable Interrupt Latencies

Complicated “labour-saving” instructions must hold the CPU’s entire attention during execution, thus preventing real-world generated interrupts from being serviced. Unpredictable latency times result, which can cause serious problems in hard real-time systems. One approach to overcoming the CISC’s poor real-time response has been to bolt a secondary “time processor” onto the core to try to off-load the time-critical portions. However, this results in an awkward design and the need to use a very terse microcode to program it, in addition to the more usual C and assembler for the CISC core itself.

2. Vast Instruction Sets Give Slow Decoding

A loaded instruction must be recognised from potentially many hundreds or even thousands of possibilities. Decoding is thus complicated and lengthy.

3. Frequent Accesses To Slow Memory Devices

Data is typically fetched from off-chip memory and placed in accumulator-type registers. Mathematical or logical operations are performed and the result is then written back to memory. The value is likely to be required again in the course of the procedure, thus requiring further movements to and from off-chip memory.

4. Slow Procedure Calling

When calling subroutines with parameters (essential in good HLL programming), the parameters must be individually pushed onto the stack. They must then be moved through the accumulator register(s) for processing before being returned via the stack to the caller.

5. Strictly One Job At A Time

Each peripheral device or interrupt source must have a dedicated service routine which, at the least, will require the PSW and PC to be stacked and restored, and data to be removed from or fed to the peripheral device.

6. Software Has To Be Structured To Suit Architecture

Embedded systems frequently contain many separate real-time tasks which together form a complete system. Conventional CPUs make switching between tasks slow. Often, many registers have to be stacked to free them up for the incoming task. This problem is aggravated by the use of HLL compilers, which tend to use a large number of local variables in library functions, all of which must be preserved.

7. Redundant Instructions And Addressing Modes

With the move to HLLs, compilers are tending to dictate what instructions should be provided in silicon.

In practice, compilers tend to only make use of a small number of addressing modes. This results in a large number of unused addressing modes which serve only to complicate the opcode decoding process.

8. Inconsistent Instruction Sets

Instruction sets that have evolved tend to be difficult to use, due to the large number of different basic types and the inconsistent addressing modes allowed.

9. Bus Not Fully Utilised

Whilst complex instructions are being executed, the bus is idle.

The RISC Architecture For Embedded Control

To show how RISC design is used to improve microcontroller throughput, the 166 is used as an example.

Basic Definitions:

1 state time = 2 x (1 / oscillator frequency)

- fundamental unit of time recognised within processor system.

1 machine cycle = 2 x state time

- minimum time required to perform the simplest meaningful task within CPU.

The unit of state times is used when making comparisons between RISCs and CISCs as this removes any dependency on clock frequency.

- All state time counts are given in single chip operation mode for both 80C196 and 166.
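
The two definitions above can be turned into a small calculation. The following is only a sketch: the function names are invented for illustration, and the 40 MHz oscillator frequency is an assumption, chosen because it gives a 100ns machine cycle, which is consistent with the 400ns quoted later for four machine cycles.

```python
def state_time_ns(f_osc_hz: float) -> float:
    """One state time = 2 oscillator periods, returned in nanoseconds."""
    return 2 * 1e9 / f_osc_hz


def machine_cycle_ns(f_osc_hz: float) -> float:
    """One machine cycle = 2 state times, returned in nanoseconds."""
    return 2 * state_time_ns(f_osc_hz)


# Assumed oscillator frequency (not stated in the text).
F_OSC_HZ = 40e6

print("state time    :", state_time_ns(F_OSC_HZ), "ns")
print("machine cycle :", machine_cycle_ns(F_OSC_HZ), "ns")
print("4 machine cycles:", 4 * machine_cycle_ns(F_OSC_HZ), "ns")
```

Counting in state times rather than nanoseconds, as the text does, simply leaves the oscillator frequency out of the comparison.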

Bus Interface

To maximise the rate at which instructions are executed, RISC CPUs are very heavily pipelined. Here, on any given machine cycle, up to 4 instructions may be processed by overlapping the various steps thus:

FETCH:      - get opcode from program store
DECODE:     - identify opcode from a small list and fetch operands
EXECUTE:    - perform operation denoted by opcode
WRITE-BACK: - result returned to specified location

Thus, although each instruction takes four machine cycles to pass through the pipeline, one instruction completes on every machine cycle, so each is apparently executed in just one (2 state times). Pipelining has considerable benefits for speeding sequential code execution, as the bus is guaranteed to be fully occupied.
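
The overlap can be illustrated with a toy model of an ideal four-stage pipeline. This is a sketch only: it assumes no stalls, branches or multi-cycle instructions, and the function names are hypothetical rather than anything taken from the 166 documentation.

```python
STAGES = ["FETCH", "DECODE", "EXECUTE", "WRITE-BACK"]


def pipeline_schedule(n_instructions):
    """Map each machine cycle to the (instruction, stage) pairs active in it.

    Instruction i enters FETCH on cycle i and moves one stage per cycle.
    """
    schedule = {}
    for i in range(n_instructions):
        for offset, stage in enumerate(STAGES):
            schedule.setdefault(i + offset, []).append((i, stage))
    return schedule


def total_cycles(n_instructions):
    """Machine cycles needed to complete n instructions in the ideal pipeline."""
    if n_instructions == 0:
        return 0
    return n_instructions + len(STAGES) - 1


if __name__ == "__main__":
    for cycle, active in sorted(pipeline_schedule(6).items()):
        row = ", ".join(f"I{i}:{stage}" for i, stage in active)
        print(f"cycle {cycle}: {row}")
    print("cycles for 100 instructions:", total_cycles(100))
```

Once the pipeline is full, four instructions are in flight on every cycle, so a long run of sequential code costs close to one machine cycle per instruction.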

RISC Interrupt Response

In the 166, branches to interrupts make use of the injected instruction technique, so vectoring to a service routine is achieved in only 4 machine cycles (400ns). Complex but necessary instructions such as MUL and DIV (5 and 10 cycles respectively) can stretch this, but it is interesting to note that the 80C166 does provide these as interruptible instructions.

Very fast interrupt service is crucial in high-end applications such as engine management systems, servo drives and radar systems, where real-world timings are used in DSP-style calculations. As these normally form part of a larger closed control loop, erratic latency times manifest themselves as an undesirable jitter in the controlled variable.

Registers And Multi-Tasking

Traditional microcontrollers have one or more special registers which can be used for mathematical, logical or Boolean operations. In the 8051, there is a single “accumulator” with 8 other registers which may be used for handling local variables or intermediate results in complex calculations. These additional registers are also used to access memory locations via indirect and/or indexed addressing.

As pointed out in sections 3 and 4 above, conventional CPUs spend much time moving data from slow memory areas into active registers. A RISC offers a very large number of general purpose registers which may be used for locals, parameters and intermediates. The 166 provides 16 word-wide general purpose registers (GPRs), each of which is effectively an accumulator, indirect pointer and index. With such a large number of GPRs available, it becomes realistic to keep all locals and intermediates within the CPU throughout quite large procedures. This can yield a great increase in speed.

Further significant benefits are derived from the RISC technique of register windowing. As has been said, up to 16 registers are available for use by the program. However, by making the active register bank movable within a larger on-chip RAM, the job of real time multi-tasking is considerably eased.

Central to this is the concept of a “Context Pointer” (CP), which defines the current absolute base address of the active bank. Thus a reference to “R0” means the register at the address indicated by the CP. Thereafter, the 16 registers originating from CP are accessed by a fast 4-bit offset.

The best example of how the CP is exploited is perhaps a background task and a real-time interrupt co-existing. When the interrupt occurs, rather than pushing all GPRs onto the stack, the CP of the current register bank is stacked and simply switched to a new value, determined at link time, to yield a fresh register bank. This results in a complete context switch in just one machine cycle but, because the new bank sits at a fixed address, it does rule out re-entrant or recursive use of the routine.
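
The fixed-bank context switch can be modelled in a few lines. The sketch below is a simplification under stated assumptions: the on-chip RAM is a Python list of words, the CP is held as a word index rather than a real address, and the class and method names are hypothetical.

```python
class RegisterFile:
    """Toy model: the 16 GPRs are a window into on-chip RAM, based at CP."""

    def __init__(self, ram_words=64):
        self.ram = [0] * ram_words  # on-chip RAM, one entry per 16-bit word
        self.cp = 0                 # Context Pointer as a word index (assumption)
        self.stack = []             # system stack, modelled as a Python list

    def read(self, n):
        """Read Rn: the word at offset n (0..15) from the current CP."""
        assert 0 <= n < 16
        return self.ram[self.cp + n]

    def write(self, n, value):
        """Write Rn, truncating the value to 16 bits."""
        assert 0 <= n < 16
        self.ram[self.cp + n] = value & 0xFFFF

    def enter_interrupt(self, new_cp):
        """Stack the old CP and switch to a bank fixed in advance."""
        self.stack.append(self.cp)
        self.cp = new_cp

    def exit_interrupt(self):
        """Restore the interrupted task's register bank."""
        self.cp = self.stack.pop()


BACKGROUND_BANK = 0   # bank addresses chosen "at link time" (assumed values)
ISR_BANK = 16

regs = RegisterFile()
regs.cp = BACKGROUND_BANK
regs.write(0, 0x1234)            # background task's R0

regs.enter_interrupt(ISR_BANK)   # no GPRs are copied, only CP changes
regs.write(0, 0xBEEF)            # interrupt routine's own R0
isr_r0 = regs.read(0)
regs.exit_interrupt()

background_r0 = regs.read(0)     # background R0 is untouched
```

Because ISR_BANK is a constant, a second entry to the same routine before the first has finished would reuse the same words of RAM, which is why this scheme is not re-entrant.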

A hybrid method, which permits re-entrancy, uses the stack pointer to calculate the new CP dynamically. Here, on entering the interrupt, the number of registers now required is subtracted from the current SP and the result placed in CP, with the old CP stacked. Thus the new register bank is located at the top of the old stack, with the old CP and then the new stack following on immediately afterwards. On exiting the interrupt routine, the original register bank is restored by POPping the old CP from the stack. The SP is reinstated by adding the size of the new register bank onto the current SP.
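
The SP-relative scheme just described can be sketched as follows. This is an illustrative model, not 166 code: it assumes a descending stack held in the same RAM as the register banks, measures SP and CP in words, and uses hypothetical names throughout.

```python
class Cpu:
    """Toy model of SP-relative register bank allocation."""

    def __init__(self, ram_words=128):
        self.ram = [0] * ram_words
        self.sp = ram_words   # empty descending stack (assumption)
        self.cp = 0           # background register bank at the bottom of RAM

    def push(self, value):
        self.sp -= 1
        self.ram[self.sp] = value

    def pop(self):
        value = self.ram[self.sp]
        self.sp += 1
        return value

    def enter(self, n_regs):
        """Carve a new bank of n_regs registers out of the stack."""
        old_cp = self.cp
        self.sp -= n_regs     # reserve room for the new bank
        self.cp = self.sp     # new bank starts at the lowered SP
        self.push(old_cp)     # old CP sits just below the new bank

    def leave(self, n_regs):
        """Restore the previous bank and give the stack space back."""
        self.cp = self.pop()
        self.sp += n_regs

    def read(self, n):
        return self.ram[self.cp + n]

    def write(self, n, value):
        self.ram[self.cp + n] = value & 0xFFFF


cpu = Cpu()
sp_at_start, cp_at_start = cpu.sp, cpu.cp

cpu.enter(4)              # first activation of the routine
cpu.write(0, 111)
first_cp = cpu.cp

cpu.enter(4)              # re-entered before the first has finished
cpu.write(0, 222)
second_cp = cpu.cp
cpu.leave(4)

r0_after_nested = cpu.read(0)   # first activation's R0 survives
cpu.leave(4)
```

Each activation receives its own bank, so nested or re-entrant calls do not overwrite one another, at the cost of a little arithmetic on SP at entry and exit.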

A further RISC refinement is register window overlapping whereby when a new procedure is called, part of the new register bank defined by CP’ is coincident with the original at CP:

            R3'   ; Register for subroutine’s locals and intermediates
            R2'   ; Register for subroutine’s locals and intermediates
      R7    R1'   ; Common register, R7 == R1'
CP’   R6    R0'   ; Common register, R6 == R0'
      R5          ; Register for caller’s locals and intermediates
      R4          ; Register for caller’s locals and intermediates
      R3          ; Register for caller’s locals and intermediates
      R2          ; Register for caller’s locals and intermediates
      R1          ; Register for caller’s locals and intermediates
CP    R0          ; Register for caller’s locals and intermediates

MODULE 1
; *** Assignment Of GPRs To Local Variables - Caller ***
x_var   LIT 'R0'   ; Local variable
y_var   LIT 'R1'   ; Local variable
parm1   LIT 'R6'   ; Passed parameter 1
parm2   LIT 'R7'   ; Passed parameter 2
result  LIT 'R6'   ; Value returned from sub routine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MODULE 2
; *** Assignment Of GPRs To Local Variables - Sub Routine ***
a_var   LIT 'R2'   ; Local variable
b_var   LIT 'R3'   ; Local variable
input1  LIT 'R0'   ; Received parameter 1
input2  LIT 'R1'   ; Received parameter 2
ret1    LIT 'R0'   ; Final result returned in R0

Fig. A - Giving GPR’s Meaningful Names

With some forethought, the programmer should arrange for any value to be passed to the subroutine to be located in the common area, so that all the normal loading and unloading of parameters is avoided. This technique can be used in either absolute or SP-relative register bank modes.
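
The overlap in Fig. A can be modelled to show that parameters and the result never need to be copied. In this sketch the six-register shift between CP and CP’ is read off the figure (R6 == R0'), the register names follow the LIT assignments, and everything else (function names, the use of a Python list for on-chip RAM, the addition performed by the subroutine) is assumed purely for illustration.

```python
WINDOW_SHIFT = 6   # CP' = CP + 6 registers, so R6 == R0' and R7 == R1'

ram = [0] * 32     # on-chip RAM, one entry per word-wide register


def subroutine(ram, cp):
    """Callee: sees the caller's R6/R7 as its own R0'/R1'."""
    input1 = ram[cp + 0]          # received parameter 1 (R0')
    input2 = ram[cp + 1]          # received parameter 2 (R1')
    ram[cp + 2] = input1 + input2 # a_var, a local in R2'
    ram[cp + 0] = ram[cp + 2]     # ret1, final result returned in R0'


def caller(ram, cp):
    """Caller: places parameters in the common registers, then calls."""
    ram[cp + 6] = 20              # parm1 in R6
    ram[cp + 7] = 22              # parm2 in R7
    subroutine(ram, cp + WINDOW_SHIFT)   # only CP moves; no data is copied
    return ram[cp + 6]            # result read back from R6


answer = caller(ram, 0)
```

The caller's R0 to R5 lie below the new window and cannot be reached by the subroutine through its own R0' to R3', so the caller's locals are preserved without any stacking.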

To get the best from a RISC’s registers, the location of data needs close consideration: although the instruction set is highly orthogonal, the limited number of addressing modes provided for MUL and DIV, for example, can appear somewhat restrictive. Fortunately though, most operands involved will already be in registers, so eliminating the need for many addressing techniques. As might be expected, the instructions with the widest range of addressing modes are the simple data moves - the fact that RISCs are the result of very careful analysis of the requirements for fast execution becomes obvious after a short acquaintance!