Topic 07: Pipelined Processors
Lecture 32 (4/23)
-
Review of the clock steps used during the execution
of instructions by the Mano Basic Machine. Observe that memory access
instruction with indirect addressing access memory 3 times: to fetch the
instruction, to fetch the effective address, and third to fetch or store
the data.
-
The instruction format is 1-3-12: One bit to
indicate indirect addressing, 3 opcode bits, 12 address or function bits.
-
Address length is 12 bits, max memory size is 2^(12)*2
bytes= 8KB
-
Instruction set architecture of the MIPS computer:
(from CSI333, also review PH. Chap. 3)
-
MIPS Fields (p. 118), 6-5-5-5-5-6 and 6-5-5-16 formats.
-
3 register ALU instructions explained.
-
32 registers (instead of just one AC), 2^(32) byte
=4GB address space, instruction length is fixed at 32 bits.
-
"Good design demands good compromises": 32
bits of address plus an opcode cannot fit in a 32 bit instruction,
so memory data access is done with load and store instructions which use
base
addressing; the effective address is calculated by the CPU by adding
the 32 bit contents of a specified register with the sign extended value
in the bottom 16 bits of the instruction.
-
Various addressing modes used in the MIPS instruction
set architecture (Figure 3.17). Load and Store instructions come
in 3 varieties: to access bytes, halfwords, and words.
-
Glimpse at the hardware that implements a MIPS processor
(Figure 6.12)
-
EVERY MIPS instruction can be executed by performing
these 5 stages in sequence:
-
Instruction Fetch (FETCH)
-
DECODE (activate the register file to supply the
contents of the two registers specified by the first two 5-bit fields of
the instruction register.) The data to be stored is supplied in the
case of store instructions.
-
Execute (ALU) Use the ALU to do one arithmetic or
logic operation: It could be either a data processing operation required
for a 3 register ALU instruction or the calculation of an effective address
for a load or store instruction.
-
MEMORY: Read or write memory only for load or store
instructions, do nothing for the others.
-
WRITEBACK: For load or ALU instructions, write
the loaded data or arithmetic result to the destination register.
-
These 5 steps are THE SAME for all types of instructions.
An analogous situation is in a laundry. Regardless of the kind of
clothes, (jeans, sheets, shirts, sweaters), every laundry load can be washed
by doing four tasks in sequence: WASH, DRY, FOLD and PUT-AWAY.
(See PH figure 6.1)
Lecture 33 (4/25)
-
RISC (Reduced Instruction Set Computers) : Innovation of the mid-1980's
at Berkeley and Stanford; the CDC 6600 of the 60's used what is now called
RISC; also some research never made it out of IBM labs.
-
Purposely omit instructions which make assembly programming easier but
make it harder to implement the CPU in a pipelined way. For example,
unlike Mano Basic, MIPS has no Indirect Memory addressing. To access
data pointed to by an in memory pointer, two MIPS instructions are needed
instead of 1 Basic instruction with the I bit set. (Application of
indirect addressing: p = p->next; where p is a
pointer to a structure.)
-
Speedup = (performance of faster system)/(performance of slower system)
-
Throughput = (Total number of jobs done in a time period)/(Total time of
the time period)
-
The throughput of the non-pipelined laundry is (1 job)/( 2 hours) = 0.5
jobs/hour.
-
The throughput of the pipelined laundry in steady state (neglecting time
when all 4 stations were not in use) is
(1 job completed every 1/2 hour)/( 1/2 hour period between completion
of jobs) = 2 jobs/hour.
-
The ideal speedup of pipelining the laundry is (2 jobs/hour)/(0.5 jobs/hour)
= 4.
-
The ideal speedup for a pipeline is the number of stages.
-
DESPITE the fact the pipelined laundry has almost 4 times as much throughput
as the non-pipelined laundry, in BOTH cases, the response time (time
from start to finish of one laundry load) is the same and might be longer
for the pipelined laundry. It might be longer because the students
might spend extra time coordinating their activity.
-
Timing of non-pipelined and pipelined MIPS CPU compared: PH figure 6.3.
The reg stage might be faster for the non-pipelined CPU than for the pipelined
CPU.
-
Hardware organization for the pipelined MIPS CPU: PH figure 6.12.
Different hardware is provided for each of the 5 different stages of IF,
ID, EX, MEM and WB.
Lecture 34 (4/27)
Emphasis on data transfer, storage and processing
for this CSI404 topic
-
Understanding what to do with data is prior
to understanding and design of control structures that will control system
operation to make the system do what is is supposed to.
-
Main principle: Data cannot be processed until
a copy of that data is available:
-
Data cannot be transferred from the future: It
is impossible to calculate the Dow Jones stock average for tomorrow today,
because the data that is averaged are not available.
The hardware and its roles for one single clock step
of processing for the EX stage of executing a SW instruction.
-
The pipeline registers.
-
Shown in color on PH figure 6.12 (vertical bars)
-
Named for the stages they separate, ID/EX for example.
-
The pipeline registers are all clocked (flip-flops).
The circuitry between the pipeline registers is (mostly) combinational.
(Exceptions: the register file, fetch stage PC, instruction and data memory).
-
Analogous to laundry bins for a Berkeley student
to hold a laundry load while his classmate removes her load from the next
machine.
-
Most of the time between clock ticks is needed for
the combinational circuits between the pipeline registers to produce stable
signals.
-
When the clock ticks, the new data (or state) appears
on the output wires of pipeline registers.
-
Setup for the "bracketed discussion" of the EX processing
for the SW instruction:
-
SW $2, 100($3)
-
sign extend the 100(Ten) and store it in the Immediate
ID/EX pipeline register.
-
read register $3 and store contents (0xA0B4) in Read
Data 1 ID/EX pipeline register.
-
read register $2 and store contents (39(Ten)) in
Read Data 2 ID/EX pipeline register.
-
What is done during the EX step:
-
ALU adds (0xA0B4) with (100(Ten)) and result is stored
in the ADDRESS register of the EX/MEM pipeline register group.
-
Wires transmit the 39(Ten) data and it is stored
in the DATA register of the EX/MEM pipeline register group.
-
During the next step (MEM for the SW instruction),
the data 39 is stored at address (0xA0B4)+(100(Ten))
-
See PH pages 457-460 for more details; earlier pages
detail operation with an LW instuction.
Pipeline operation analysis, use of pipeline diagrams
and hazards
-
Pipeline timing diagram like PH figure 6.20
-
Horizontal axis represents TIME. One horizontal
row represents the time sequence of operations used for processing one
instruction with the hardware that does each step.
-
Vertical axis represents SPACE. One column
represents all the operations done by hardware at the same time by
different hardware, in different locations in space.
-
The pipeline hardware schematic on the horizontal
rows of figures like 6.20 shows how one instruction encounters different
hardware in different clock steps. (It's somewhat misleading since
the horizontal minature pipeline is not present at one time in space.)
-
A structrural hazard is a situation of correct
data processing being impossible during one clock step because of a resource
limitation. The Mano computer had a single and single ported
memory, so the memory operations cannot be performed in parallel.
If the MIPS computer had a single memory with a single port, the MEM stage
of a Load or Store instruction cannot be done in parallel with the IF stage
of the instuction 3 instructions later.
-
Two ways to deal with hazards:
-
Make the processing of one of the instructions stall,
so
it doesn't proceed until the resource is available.
-
Design more or different hardware so resource or
communication paths are available to prevent the hazard.
-
Consequences of one stall:
-
One step of processing of one instruction is delayed
by 1 clock step.
-
The completion of the stalled instruction AND
EVERY INSTRUCTION AFTER IT is delayed by 1 clock step.
Therefore, the total execution time is increased by 1 clock step.
-
The stuctural hazard from memory contention can be
removed by using separate instruction and data memories.
-
Contemporary computers use a single memory for instructions
and data, but have separate instruction and data caches. Most
of the time (very roughly, 90%) the needed instuctions and data are
in both caches, so accesses can be done in parallel and there are no stalls
due to this.
-
Pipeline diagram with the memory hazard:
| load |
IF |
ID |
EX |
MEM |
WB |
|
|
|
| instr 1 |
|
IF |
ID |
EX |
MEM |
WB |
|
|
| instr 2 |
|
|
IF |
ID |
EX |
MEM |
WB |
|
| instr 3 |
|
|
|
IF |
ID |
EX |
MEM |
WB |
| instr 4 |
|
|
|
IF(hazard) |
|
|
|
|
-
Pipeline diagram with the structural hazard removed by a stall:
| load |
IF |
ID |
EX |
MEM |
WB |
|
|
|
| instr 1 |
|
IF |
ID |
EX |
MEM |
WB |
|
|
| instr 2 |
|
|
IF |
ID |
EX |
MEM |
WB |
|
| instr 3 |
|
|
|
IF |
ID |
EX |
MEM |
WB |
| instr 4 |
|
|
|
(stall) |
IF |
ID |
EX |
MEM |
| instr 5 |
|
|
|
|
(delayed) |
IF |
ID |
EX |
Design process for pipelined CPUs:
LOOP:
-
Design
-
Analyze for Hazards. If there are no hazards, exit (go to detailed
design)
-
Propose how to remove hazards.
-
Evaluate cost (additional hardware cost AND design cost) and performance
(effect on computer speed) of proposals
-
Go back to step 1.
Lecture 35 (4/30)
-
The notion of dependency: One instruction
is dependent on another instruction when the dependent instruction requires
the results of the other instruction for correct execution.
-
The machine language (instruction set architecture)
rules for contemporary computers say that every instruction must be executed
as if all the preceeding instructions (in the dynamic program execution)
have been completed.
-
Non-pipelined CPU implementations meet this requirement
easily, since each instruction execution is completely finished before
the next instruction is fetched.
-
Pipelined CPU implementations do not meet this requirement
automatically, because several different instructions are having their
work done by different pipeline stages simultaneously, in other
words, in parallel.
-
Data hazard situations and their removal: Figure
6.37
-
Removal of the hazard by stalling until the earlier
instuction's result is actually stored in the register file.
-
Removal of the hazard by forwarding hardware, which
transmits the ALU output pipeline register contents directly to an
ALU input during the very next clock step. (see Figure 6.37)
-
The hardware that implements the forwarding: Figure
6.38.
-
A data hazard whose removal requires stalling:
Figure 6.44
-
Data cannot be processed until a copy of it exists
in the hardware to do the processing!
-
Data that is needed by the ALU at the beginning of
clock step is not read out of memory until the end of the same clock step.
At least one stall cycle is needed to obtain the data.
-
One stall cycle is sufficient because forwarding
hardware from the memory output pipeline register to the ALU inputs is
used
-
The way stalls are actually inserted in the pipeline:
Some stage operations are repeated (figure 6.45)
Lecture 36 (5/2)
-
The design/hazard analysis/proposal/evaluation CPU engineering
process flowchart (kindergarten level Intel-like company) again.
-
The list below presents ideas the computer architects
can use to make proposals.
-
Hardware to make the pipeline stall is called interlocking
hardware. (Background: "interlocking" was used by railware
signal and switch engineers to mean hardware (cams, cranks, levers, etc.)
that physically prevented switches and signals from being set in a way
to cause a train collision.)
-
Some ways to remove hazards besides stalling until
the condition or resource needed for the stalled operation becomes available:
-
Add more functional units
-
get more parallelism to remove structural hazards
(e.g., 2-ported or separate instruction and data caches to remove the conflict
between IF and MEM)
-
do some things earlier, e.g., make branch decisions
earlier to remove control hazards by using a comparitor to test register
content equality in the ID stage that is separate hardware from the
main ALU.
-
Add forwarding or bypassing hardware to copy, as
soon as possible, existing data to where it is needed.
-
Define Instruction Set Architecture so some instructions
take effect later than usual:
-
Delayed loads (obsolete). A MIPS load is effective
immediatly, but a stall occurs if the next instruction needs its result.
-
Delayed branches (used in RISC ISAs like MIPS
and SPARC)
-
Branch hazard (example of control hazard in MIPS)
see figure 6.50.
-
The instruction in memory after a branch or jump
is ALWAYS EXECUTED, independent of whether the control transfer is
taken. (Background: The full ISA has some variants of branches for
which this is not true.)
-
MIPS implementation from PH uses early branch decision
hardware PLUS delayed branchs.
See PH figure 6.51, and related text.
-
Newer ISAs (and the i386 "Pentium") don't use delayed
branches since dynamic scheduling and out of order and speculative execution
are architectural techniques that today achieve better instruction level
parallelism without the need for delayed branches.
-
(Obsolete) Define the ISA so programs with hazards
are illegal. Instead, we build interlocks.
-
Write better programs!
-
Compiler scheduling - optimizer reorders instructions
to minimize stalls. Example:
-
Source:
int A,B,C,D,E,F;
A = B + C;
D = E - F;
-
Original assembly language code:
LW Rb, B
LW Rc, C
ADD Ra, Rb, Rc ;stall!!
SW Ra, A
LW Re, E
LW Rf, F
SUB Rd, Re, Rf ;another stall!!
SW Rd, D
-
Optimized assembly language code:
LW Rb, B
LW Rc, C
LW Re, E ;begin work
on next C statement early.
ADD Ra, Rb, Rc ;no more stall!!
LW Rf, F
SW Ra, A
SUB Rd, Re, Rf ;no more stall!!
SW Rd, D
-
Other compiler optimizations: Take better advantage
of registers in lieu of memory, and many others.
Try out gcc -S something.c and gcc
-S -O3 something.c and compare the something.s assembly language
output files.
-
Advanced methods:
-
Out of order execution
-
Hardware scheduling
-
Speculative execution
-
Dynamic branch prediction
-
Fetch more than one instruction at a time (superscalar).
End of pipelined processor topic
Copyright (C) S. Chaiken, 2001,
eccept for quotations from PH. All rights reserved.
$Id: Topic06.html,v 1.1 2001/04/24
15:06:45 sdc Exp sdc $