RISC Architectures For Embedded Applications
Introduction
The 166 CPU core makes extensive use of reduced instruction set computer (RISC) concepts to achieve its blend of very high performance at modest cost. To understand why RISC techniques are especially suited to high-speed real time embedded systems, it is useful to examine in detail how they grew out of the traditional complex instruction set computers (CISC) that reached their peak in the late 1980s to early 1990s.
Behind The 166's Near-RISC Core
The reason behind the abandonment of traditional complex instruction set computers (CISC) has been the quest for ever greater throughput. The demands of workstations involved in CAD tasks and, latterly, advanced video games have been the real driving force behind this. Traditionally, microprocessors have been designed with assembler instruction sets geared towards making the assembler programmer's life easier, through the extensive use of microcode to produce ever more powerful instructions. By providing single assembler instructions that perform, for instance, three operand multiplication, the assembler programmer (and HLL compiler writer) has been relieved of the job of achieving the same result with simpler instructions.
The need for the CPU to recognise and act on (decode) many hundreds of different instructions requires complex silicon and many clock cycles. The greater the silicon area, the greater the cost of the device and the power consumed. With physical limitations restricting the clock speeds achievable on silicon devices, the number of cycles per instruction is very significant in gaining higher performance.
RISCs tend to shift the burden of programming from the microcoder to the assembler programmers and compiler writers. Work both within academia and at commercial manufacturers has shown that a suitably programmed RISC machine can achieve a far higher throughput than a CISC for a given clock speed.
Strangely, the embedded world has been slow to question the suitability of the CISC-based microcontroller. Whilst at the very top end devices such as the i80960 have enjoyed some success, for more commonplace embedded tasks RISC is almost unknown. With the increasing complexity of modern control algorithms, the need for greater processing power is set to become an issue in anything but the simplest applications. In addition, here more than in the workstation world, the worst-case response time to non-deterministic events is crucial, an area where CISCs are especially poor.
Many current high-end microcontrollers are based on existing CISC architectures such as the 8086, 68000 etc., which, in common with 8-bit devices such as the 8051, have an internal structure that dates back up to 19 years. With the silicon vendors' need to give existing users an upgrade path, apparently new designs are often based closely on the existing architecture and instruction set, so protecting the users' investment in expensive assembler code.
Like workstations, microcontrollers are tending to be programmed in a high level language (HLL) to reduce coding times and enhance maintainability. Inevitably, even with the best compilers, some loss of performance is encountered, emphasising again the need for improved CPU performance.
In addition to straightforward data processing, microcontrollers must also handle real-world peripherals such as A/D converters, PWMs, timers, ports, PLLs etc., all of which require real time processing.
Conventional CISC Bottle-necks
1. Long And Unpredictable Interrupt Latencies
Complicated labour-saving instructions must hold the CPU's entire attention during execution, thus preventing real-world generated interrupts from being serviced. Unpredictable latency times result, which can cause serious problems in hard real-time systems. One approach to overcoming the CISC's poor real-time response has been to bolt a secondary time processor onto the core to try to off-load the time-critical portions. However, this results in an awkward design and the need to use a very terse microcode to program it, in addition to the more usual C and assembler for the CISC core itself.
2. Vast Instruction Sets Give Slow Decoding
Each loaded instruction must be recognised from potentially many hundreds or even thousands of possibilities. Decoding is thus complicated and lengthy.
3. Frequent Accesses To Slow Memory Devices
Data is typically fetched from off-chip memory and placed in accumulator-type registers. Mathematical or logical operations are performed and the result is then written back to memory. The value is likely to be required again in the course of the procedure, thus requiring further movements to and from off-chip memory.
4. Slow Procedure Calling
When calling subroutines with parameters (essential in good HLL programming), the parameters must be individually pushed on to the stack. They must then be moved through the accumulator register(s) for processing before being returned via the stack to the caller.
5. Strictly One Job At A Time
Each peripheral device or interrupt source must have a dedicated service routine which, at the least, will require the PSW and PC to be stacked and restored and data to be removed from or fed to the peripheral device.
6. Software Has To Be Structured To Suit Architecture
Embedded systems frequently contain many separate real time tasks which together form a complete system. Conventional CPUs make switching between tasks slow. Often, many registers have to be stacked to free them up for the incoming task. This problem is aggravated by the use of HLL compilers, which tend to use a large number of local variables in library functions, all of which must be preserved.
7. Redundant Instructions And Addressing Modes
With the move to HLLs, compilers are tending to dictate what instructions should be provided in silicon.
In practice, compilers tend to only make use of a small number of addressing modes. This results in a large number of unused addressing modes which serve only to complicate the opcode decoding process.
8. Inconsistent Instruction Sets
Instruction sets that have evolved tend to be difficult to use, due to the large number of different basic types and the inconsistent addressing modes allowed.
9. Bus Not Fully Utilised
Whilst complex instructions are being executed, the bus is idle.
The RISC Architecture For Embedded Control
To show how RISC design is used to improve microcontroller throughput, the 166 is used as an example.
Basic Definitions:
1 state time = 2 x (1/oscillator frequency)
- the fundamental unit of time recognised within the processor system.
1 machine cycle = 2 x state time
- the minimum time required to perform the simplest meaningful task within the CPU.
State times are used as the unit when making comparisons between RISCs and CISCs, as this removes any dependency on clock frequency.
- All state time counts are given in single-chip operation mode for both the 80C196 and the 166.
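The two definitions above can be turned into a small worked calculation. The following is a minimal sketch in C; the helper names are illustrative only, and the 40 MHz oscillator used in the note afterwards is an assumption, chosen because it reproduces the 400ns quoted for four machine cycles in the interrupt response section.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helpers only; these names are not part of any 166 toolchain. */

/* 1 state time = 2 x (1/oscillator frequency), returned in picoseconds. */
uint64_t state_time_ps(uint64_t osc_hz)
{
    assert(osc_hz > 0);
    return 2u * 1000000000000ull / osc_hz;
}

/* 1 machine cycle = 2 x state time, returned in picoseconds. */
uint64_t machine_cycle_ps(uint64_t osc_hz)
{
    return 2u * state_time_ps(osc_hz);
}
```

With an assumed 40 MHz oscillator, `state_time_ps(40000000)` gives 50000 ps (50ns) and `machine_cycle_ps(40000000)` gives 100000 ps (100ns), so four machine cycles take 400ns.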
Bus Interface
To maximise the rate at which instructions are executed, RISC CPUs are very heavily pipelined. Here, on any given machine cycle, up to 4 instructions may be processed by overlapping the various steps thus:
FETCH: - get opcode from program store
DECODE: - identify opcode from a small list and fetch operands
EXECUTE: - perform operation denoted by opcode
WRITE-BACK: - result returned to specified location
Thus, although each instruction takes four machine cycles, it is apparently executed in just one (2 state times). Pipelining has considerable benefits for speeding sequential code execution, as the bus is guaranteed to be fully occupied.
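The effect of overlapping the four steps can be illustrated with a toy model. This is a sketch in C of an idealised pipeline with no stalls, branches or wait states, which is a simplifying assumption and not a description of the real 166 timing; the function names are invented for the illustration.

```c
#include <assert.h>

enum { PIPELINE_STAGES = 4 }; /* FETCH, DECODE, EXECUTE, WRITE-BACK */

/* Machine cycles needed if each instruction must pass through all four
   stages before the next one may start (no overlap). */
unsigned long cycles_unpipelined(unsigned long n)
{
    return n * PIPELINE_STAGES;
}

/* Machine cycles needed when a new instruction enters FETCH on every cycle.
   stage[s] is 1 while an instruction occupies stage s. */
unsigned long cycles_pipelined(unsigned long n)
{
    int stage[PIPELINE_STAGES] = {0};
    unsigned long issued = 0, retired = 0, cycles = 0;

    while (retired < n) {
        /* Every instruction in flight moves on by one stage. */
        for (int s = PIPELINE_STAGES - 1; s > 0; --s)
            stage[s] = stage[s - 1];
        /* FETCH takes the next instruction, if any remain. */
        stage[0] = (issued < n);
        if (stage[0])
            ++issued;
        /* An instruction in WRITE-BACK completes during this cycle. */
        if (stage[PIPELINE_STAGES - 1])
            ++retired;
        ++cycles;
    }
    return cycles;
}
```

In this model 100 sequential instructions need 400 machine cycles without overlap but only 103 with it: once the pipeline has filled, one instruction completes on every machine cycle.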
RISC Interrupt Response
In the 166, branches to interrupts make use of the injected instruction technique, so vectoring to a service routine is achieved in only 4 machine cycles (400ns). Complex but necessary instructions such as MUL and DIV (5 and 10 cycles respectively) stretch this, but it is interesting to note that the 80C166 does provide these as interruptible instructions.
Very fast interrupt service is crucial in high-end applications such as engine management systems, servo drives and radar systems, where real-world timings are used in DSP-style calculations. As these normally form part of a larger closed control loop, erratic latency times manifest themselves as an undesirable jitter in the controlled variable.
Registers And Multi-Tasking
Traditional microcontrollers have one or more special registers which can be used for mathematical, logical or Boolean operations. In the 8051, there is a single accumulator with 8 other registers which may be used for handling local variables or intermediate results in complex calculations. These additional registers are also used to access memory locations via indirect and/or indexed addressing.
As pointed out in points 3 and 4 above, conventional CPUs spend much time moving data from slow memory areas into active registers. The RISC offers a very large number of general purpose registers which may be used for locals, parameters and intermediates. The 166 provides 16 word-wide general purpose registers (GPRs), each of which is effectively an accumulator, indirect pointer and index. With such a large number of GPRs available, it becomes realistic to keep all locals and intermediates within the CPU throughout quite large procedures. This can yield a great increase in speed.
Further significant benefits are derived from the RISC technique of register windowing. As has been said, up to 16 registers are available for use by the program. However, by making the active register bank movable within a larger on-chip RAM, the job of real time multi-tasking is considerably eased.
Central to this is the concept of a Context Pointer (CP), which defines the current absolute base address of the active bank. Thus a reference to R0 means the register at the address indicated by the CP. Thereafter, the 16 registers originating from CP are accessed by a fast 4-bit offset.
The best example of how the CP is exploited is perhaps a background task and a real-time interrupt co-existing. When the interrupt occurs, rather than pushing all GPRs onto the stack, the CP of the current register bank is stacked and simply switched to a new value, determined at link time, to yield a fresh register bank. This results in a complete context switch in just one machine cycle but does rule out the use of recursion.
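The Context Pointer mechanism described above can be modelled in a few lines of C. This is a simplified sketch, not the real 166 programming model: the on-chip RAM is treated as an array of words, CP and SP are word indices, not byte addresses, and the type and function names are invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { RAM_WORDS = 512, BANK_WORDS = 16 };

typedef struct {
    uint16_t ram[RAM_WORDS]; /* on-chip RAM holding register banks and stack */
    uint16_t cp;             /* Context Pointer: index of R0 in the active bank */
    uint16_t sp;             /* Stack Pointer: the stack grows downwards */
} Cpu;

/* Rn is simply the word at CP plus a 4-bit offset. */
uint16_t *gpr(Cpu *cpu, unsigned n)
{
    assert(n < BANK_WORDS);
    return &cpu->ram[cpu->cp + n];
}

/* Interrupt entry: stack the old CP and switch to a bank fixed in advance
   (in the text, a value determined at link time). No GPR is copied. */
void enter_context(Cpu *cpu, uint16_t new_bank)
{
    assert(cpu->sp > 0 && new_bank + BANK_WORDS <= RAM_WORDS);
    cpu->ram[--cpu->sp] = cpu->cp;
    cpu->cp = new_bank;
}

/* Interrupt exit: restore the interrupted task's bank by popping its CP. */
void leave_context(Cpu *cpu)
{
    cpu->cp = cpu->ram[cpu->sp++];
}
```

Because the new bank address is a constant, a second entry to the same routine before the first has finished would reuse the same 16 words, which is why this form rules out recursion.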
A hybrid method, which permits re-entrancy, uses the stack pointer to calculate the new CP dynamically. Here, on entering the interrupt, the number of registers now required is subtracted from the current SP and the result placed in CP, with the old CP stacked. Thus the new register bank is located at the top of the old stack, with the old CP and then the new stack following on immediately afterwards. On exiting the interrupt routine, the original register bank is restored by POPping the old CP from the stack. The SP is reinstated by adding the size of the new register bank onto the current SP.
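The SP-relative scheme can be sketched in the same style. As before this is a simplified C model under stated assumptions: CP and SP are word indices into a single RAM array, the stack grows downwards, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { STACK_RAM_WORDS = 512 };

typedef struct {
    uint16_t ram[STACK_RAM_WORDS]; /* on-chip RAM: banks live on the stack */
    uint16_t cp;                   /* Context Pointer (word index of R0) */
    uint16_t sp;                   /* Stack Pointer, growing downwards */
} StackCpu;

/* Entry: carve nregs words off the top of the stack for the new bank,
   point CP at them, then stack the old CP just below the bank. */
void enter_dynamic(StackCpu *cpu, unsigned nregs)
{
    uint16_t old_cp = cpu->cp;

    assert(nregs >= 1 && nregs <= 16 && cpu->sp > nregs);
    cpu->sp -= nregs;             /* SP minus number of registers required */
    cpu->cp = cpu->sp;            /* result placed in CP */
    cpu->ram[--cpu->sp] = old_cp; /* old CP stacked; new stack follows it */
}

/* Exit: POP the old CP, then add the bank size back onto SP. */
void leave_dynamic(StackCpu *cpu, unsigned nregs)
{
    cpu->cp = cpu->ram[cpu->sp++];
    cpu->sp += nregs;
}
```

Each entry obtains a bank at a different address, so nested or re-entrant calls do not overwrite one another's registers, at the cost of a few extra operations compared with the fixed-bank switch.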
A further RISC refinement is register window overlapping, whereby when a new procedure is called, part of the new register bank defined by the new CP is coincident with the original bank at the old CP:
                R3'   ; Register for subroutine's locals and intermediates
                R2'   ; Register for subroutine's locals and intermediates
          R7    R1'   ; Common register, R7 == R1'
new CP    R6    R0'   ; Common register, R6 == R0'
          R5          ; Register for caller's locals and intermediates
          R4          ; Register for caller's locals and intermediates
          R3          ; Register for caller's locals and intermediates
          R2          ; Register for caller's locals and intermediates
          R1          ; Register for caller's locals and intermediates
old CP    R0          ; Register for caller's locals and intermediates

MODULE 1
; *** Assignment Of GPRs To Local Variables - Caller ***
x_var   LIT R0   ; Local variable
y_var   LIT R1   ; Local variable
parm1   LIT R6   ; Passed parameter 1
parm2   LIT R7   ; Passed parameter 2
result  LIT R6   ; Value returned from sub routine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MODULE 2
; *** Assignment Of GPRs To Local Variables - Sub Routine ***
a_var   LIT R2   ; Local variable
b_var   LIT R3   ; Local variable
input1  LIT R0   ; Received parameter 1
input2  LIT R1   ; Received parameter 2
ret1    LIT R0   ; Final result returned in R0

By using some forethought, the programmer should arrange for any value to be passed to the sub routine to be located in the common area, so that all the normal loading and unloading of parameters is avoided. This technique can be used in either absolute or SP-relative register bank modes.
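The overlap can also be expressed as a small C model. This sketch assumes the layout shown above, with the sub routine's bank starting six words above the caller's so that R6/R7 coincide with R0'/R1'. The addition performed by the sub routine is purely illustrative, CP is treated as a word index, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { WIN_RAM_WORDS = 64, WINDOW_SHIFT = 6 }; /* R6 of caller == R0' of callee */

typedef struct {
    uint16_t ram[WIN_RAM_WORDS]; /* on-chip RAM holding both register banks */
    uint16_t cp;                 /* Context Pointer (word index of R0) */
} WinCpu;

/* Rn of whichever bank is currently selected by CP. */
uint16_t *win_reg(WinCpu *cpu, unsigned n)
{
    assert(n < 16 && cpu->cp + n < WIN_RAM_WORDS);
    return &cpu->ram[cpu->cp + n];
}

/* Sub routine: finds its inputs already in R0'/R1', returns ret1 in R0'. */
void sub_routine(WinCpu *cpu)
{
    *win_reg(cpu, 2) = *win_reg(cpu, 0); /* a_var = input1 */
    *win_reg(cpu, 3) = *win_reg(cpu, 1); /* b_var = input2 */
    *win_reg(cpu, 0) = (uint16_t)(*win_reg(cpu, 2) + *win_reg(cpu, 3));
}

/* Caller: writes parm1/parm2 to R6/R7, slides the window, calls, slides back. */
uint16_t call_with_overlap(WinCpu *cpu, uint16_t parm1, uint16_t parm2)
{
    *win_reg(cpu, 6) = parm1;
    *win_reg(cpu, 7) = parm2;
    cpu->cp += WINDOW_SHIFT; /* new CP: callee's R0' is caller's R6 */
    sub_routine(cpu);
    cpu->cp -= WINDOW_SHIFT; /* old CP restored on return */
    return *win_reg(cpu, 6); /* result is already in the caller's R6 */
}
```

No parameter is copied at any point: the caller's writes to R6 and R7 are the sub routine's R0' and R1', and the returned value is read back from the same shared word.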
To get the best from a RISC's registers, the location of data needs close consideration: although the instruction set is highly orthogonal, the limited number of addressing modes provided for MUL and DIV, for example, can appear somewhat restrictive. Fortunately though, most operands involved will already be in registers, so eliminating the need for many addressing techniques. As might be expected, the instructions with the widest range of addressing modes are the simple data moves - the fact that RISCs are the result of very careful analysis of the requirements for fast execution becomes obvious after a short acquaintance!