Ch4. The Processor

Haram Lee
2026-05-25

Big Picture

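Chapter 4 covers how MIPS instructions actually execute inside the CPU.
Connection to earlier chapters
- Chapter 2: what instructions look like.
- Chapter 3: how arithmetic is performed.
- Chapter 4: combine instructions and arithmetic hardware into the datapath and control of a real processor.
Chapter 4 has two main parts.
- single-cycle MIPS processor
  - Each instruction completes fetch, decode, execute, memory, and write-back within one clock cycle.
  - Simple, but limited in performance.
- pipelined MIPS processor
  - Overlaps the execution of multiple instructions to raise throughput.
  - A more realistic implementation.
Instruction subset covered
- memory reference: lw, sw
- arithmetic/logical: add, sub, and, or, slt
- control transfer: beq, j
CPU performance equation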

text

CPU time = Instruction Count × CPI × Clock cycle time
         = Instruction Count × CPI / Clock rate

Key points
- Instruction count is determined by the ISA and the compiler.
- CPI and cycle time are determined by the CPU hardware.
- Chapter 4 is about how the hardware determines CPI and cycle time.
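The performance equation above can be evaluated directly. The following is a minimal sketch; the function name `cpu_time` and the workload numbers are hypothetical and chosen only for illustration.

```python
def cpu_time(instruction_count: int, cpi: float, clock_rate_hz: float) -> float:
    """Return CPU time in seconds: Instruction Count x CPI / Clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical workload: 1 million instructions, CPI = 1.0,
# 1.25 GHz clock (which corresponds to an 800 ps cycle time).
example_seconds = cpu_time(1_000_000, 1.0, 1.25e9)
```

Halving the CPI or doubling the clock rate each halves the result, which is why the hardware side of the equation is the focus here.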

Basic Flow of Instruction Execution

The overall flow for executing one instruction is almost always the same.
- Use the PC to fetch the instruction from instruction memory.
- Read the register file using the register numbers in the instruction.
- Use the ALU according to the instruction type.
- For load/store, access data memory.
- Write the result back to a register.
- Update the PC to the address of the next instruction.
In more detail

text

PC → instruction memory → instruction fetch
register numbers → register file → read registers

ALU computes:
- arithmetic result
- memory address for load/store
- branch target address

data memory access:
- lw: read from memory
- sw: write to memory

PC update:
- ordinary instruction: PC + 4
- branch/jump: target address

This flow is the backbone of all of Chapter 4.
Every datapath element and control signal exists to make this flow possible.

Datapath vs Control

When studying a CPU, always separate two things.
- Datapath: the paths that data actually flows along.
- Control: the signals that decide which of those paths are opened or closed.
Components in the datapath
- PC
- instruction memory
- register file
- ALU
- data memory
- adder
- sign extender
- shift-left-2 unit
- multiplexer
- pipeline register
Questions the control must answer
- Does this instruction write to a register?
- Is the second ALU input a register value or an immediate?
- Does it read memory?
- Does it write memory?
- Is the write-back value the ALU result or memory data?
- Does the PC go to PC+4, the branch target, or the jump target?
Key point
- The datapath is the "road" and the control is the "traffic light".

Multiplexer

Key sentence from the slides
- "Can't just join wires together. Use multiplexers."
Multiple data paths cannot simply be tied onto one wire.
- Example: the second ALU input is a register value for some instructions and a sign-extended immediate for others.
So a mux is used.
- It selects one of several inputs according to a control signal.
Example: ALUSrc

text

ALUSrc = 0 → ALU input 2 = Read data 2 from the register file
ALUSrc = 1 → ALU input 2 = sign-extended immediate

This lets the datapath behave flexibly depending on the instruction type.

Logic Design Basics

A CPU is digital hardware. Information is represented in binary.
- low voltage = 0
- high voltage = 1
- one wire per bit
- multi-bit data = multi-wire bus
Hardware elements fall into two broad kinds.

Combinational element

A circuit whose output is determined by its current inputs alone.
Examples
- AND gate
- multiplexer
- adder
- ALU
Behavior

text

AND gate: Y = A & B
Mux:      Y = S ? I1 : I0
Adder:    Y = A + B
ALU:      Y = F(A, B)

State (Sequential) element

A circuit that stores a value.
Examples
- PC
- register file
- pipeline register
- memory
Summary
- Combinational logic "computes"; state elements "remember".
- The CPU combines the two and updates its state on every clock.

Datapath 1: Instruction Fetch

Every instruction starts with fetch.
- The PC holds the address of the instruction to execute.
- The PC value is sent to the read address of instruction memory.
- Instruction memory outputs that instruction.
- PC + 4 is computed in parallel.
- The PC is updated at the end of the cycle.
MIPS instructions are 32 bits (4 bytes), so normally

text

next PC = PC + 4

The fetch datapath needs at least the following.
- PC register
- instruction memory
- adder for PC + 4

Datapath 2: R-type Instructions

Example

mips

add $t1, $s1, $s2

R-type format

text

op | rs | rt | rd | shamt | funct

Main fields
- rs: first source register
- rt: second source register
- rd: destination register
- funct: selects the actual ALU operation
Execution flow
- instruction[25:21] = rs → register file read register 1
- instruction[20:16] = rt → register file read register 2
- Both register values go to the ALU.
- The funct field determines the ALU operation.
- The ALU result is written back to rd.
R-type instructions neither read nor write memory. The point is that the ALU result is written to a register.
Control signals

text

RegWrite = 1
RegDst   = 1   # destination is rd
ALUSrc   = 0   # second ALU input is a register
MemRead  = 0
MemWrite = 0
MemtoReg = 0   # write-back value is the ALU result

Datapath 3: Immediate Instructions

Example

mips

addi $t0, $s0, 4

I-type format

text

op | rs | rt | immediate

Operation: $t0 = $s0 + 4
Problem
- The second ALU input is the immediate 4, not a register.
- The immediate is 16 bits, but the ALU takes 32-bit inputs.
Solution
- Sign-extend: 16-bit immediate → 32-bit value.
- A mux is needed to select the second ALU input.

text

ALUSrc = 1 → use the sign-extended immediate as ALU input 2
RegDst = 0 → destination is rt

Datapath 4: Load Instruction

Example

mips

lw $s0, 8($t0)

Meaning: $s0 = Memory[$t0 + 8]
Execution flow
- Read the base register $t0.
- Sign-extend the offset 8.
- The ALU computes base + offset.
- The result is sent to data memory as the address.
- Data memory is read.
- The value read is written back to register $s0.
Key points
- For a load, the ALU result is a memory address, not the final value.
- The final write-back value is the data read from data memory.
Control signals

text

ALUSrc   = 1   # use the offset immediate
MemRead  = 1   # read memory
MemtoReg = 1   # write the memory output to the register
RegWrite = 1   # register write
RegDst   = 0   # destination is rt

Datapath 5: Store Instruction

Example

mips

sw $a0, 8($sp)

Meaning: Memory[$sp + 8] = $a0
Execution flow
- Read the base register $sp.
- Also read $a0, the register holding the data to store.
- Sign-extend the offset 8.
- The ALU computes base + offset.
- The ALU result is used as the data memory address.
- The value of $a0 is written to data memory.
A store does not write a result to a register.
Control signals

text

RegWrite = 0
MemWrite = 1
MemRead  = 0
ALUSrc   = 1

RegDst and MemtoReg are don't care (d), since they do not affect the result.

Datapath 6: Branch Instruction

Example

mips

beq $t0, $s0, offset

Meaning

text

if ($t0 == $s0) PC = branch target
else            PC = PC + 4

branch target address

text

branch target = PC + 4 + (sign-extended offset << 2)

Why << 2
- MIPS instruction addresses are aligned to 4-byte boundaries.
- The offset counts instructions (words), so it must be multiplied by 4 to get a byte offset.
Execution flow
- Read the rs and rt register values.
- The ALU subtracts the two values.
- A result of 0 means the two values are equal.
- Shift the sign-extended offset left by 2 bits.
- Add PC + 4 and the shifted offset to form the branch target.
- PCSrc is determined from the Branch signal and the ALU Zero signal.
Key control

text

ALUOp  = 01       # subtract
Branch = 1
PCSrc  = Branch AND Zero

Actual branch-taken condition for beq
- The PC changes to the branch target when Branch = 1 and Zero = 1.
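The branch target formula and the PCSrc = Branch AND Zero rule can be checked with a small model. This is a sketch only; the helper names `sign_extend16` and `next_pc_beq` are hypothetical, and registers and the PC are modeled as plain 32-bit integers.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc_beq(pc: int, imm16: int, rs_val: int, rt_val: int) -> int:
    """Return the next PC after a beq at address pc."""
    pc_plus_4 = (pc + 4) & MASK32
    zero = ((rs_val - rt_val) & MASK32) == 0   # ALU subtract sets Zero
    branch = True                              # Branch = 1 for beq
    target = (pc_plus_4 + (sign_extend16(imm16) << 2)) & MASK32
    # PCSrc = Branch AND Zero selects between the target and PC + 4.
    return target if (branch and zero) else pc_plus_4
```

For example, a beq at 0x1000 with offset 3 and equal registers goes to 0x1004 + 12 = 0x1010, while an offset field of 0xFFFF (that is, -1) branches back to 0x1000.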

Datapath 7: Jump Instruction

Example

mips

j target

jump address

text

jump address = PC+4[31:28] : (instruction[25:0] << 2)

Procedure
- Take the 26-bit address field of the instruction.
- Shift it left by 2 bits.
- Concatenate it with the upper 4 bits of PC+4.
- If the Jump signal is 1, the PC is set to the jump address.
There are several candidate values for the PC, so a mux is needed.
- PC + 4
- branch target
- jump target
So the final datapath adds a PCSrc mux for branches and another mux for jumps.
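The concatenation in the jump address formula is easy to get wrong by hand, so here is a minimal sketch of it. The function name `jump_target` is hypothetical; the instruction is passed as a raw 32-bit word.

```python
def jump_target(pc: int, instruction: int) -> int:
    """Return the jump address: PC+4[31:28] : (instruction[25:0] << 2)."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    addr26 = instruction & 0x03FFFFFF          # instruction[25:0]
    upper4 = pc_plus_4 & 0xF0000000            # PC+4[31:28], kept in place
    return upper4 | (addr26 << 2)              # 4 + 26 + 2 = 32 bits
```

Because only the top 4 bits come from the PC, a `j` can reach any word-aligned address inside the current 256 MB region but not outside it.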

ALU Control

The ALU is used by many instructions.
- lw/sw: address calculation → add
- beq: compare two registers → subtract
- R-type: add/sub/and/or/slt depending on the funct field
ALU control codes

text

0000 → AND
0001 → OR
0010 → add
0110 → subtract
0111 → set-on-less-than
1100 → NOR

The main control does not generate the full ALU control code directly.
- It first looks at the opcode and produces a 2-bit ALUOp.

text

ALUOp = 00 → add      # lw, sw, addi
ALUOp = 01 → subtract # beq
ALUOp = 10 → decided by the funct field # R-type

For R-type, the funct field must also be examined.

text

funct 100000 → add
funct 100010 → subtract
funct 100100 → AND
funct 100101 → OR
funct 101010 → set-on-less-than

So control is two-level.
- Main control: opcode → ALUOp
- ALU control: ALUOp + funct → ALU control lines
This structure keeps the control logic simple.
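The second level of this scheme (ALUOp + funct → ALU control lines) can be written as a lookup. This is a sketch built only from the tables above; the names `FUNCT_TO_ALU` and `alu_control` are hypothetical.

```python
# funct field → 4-bit ALU control code, for the R-type subset in these notes.
FUNCT_TO_ALU = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set-on-less-than
}


def alu_control(alu_op: int, funct: int = 0) -> int:
    """Map the 2-bit ALUOp (plus funct for R-type) to the ALU control code."""
    if alu_op == 0b00:          # lw, sw, addi: address/immediate add
        return 0b0010
    if alu_op == 0b01:          # beq: subtract and test Zero
        return 0b0110
    if alu_op == 0b10:          # R-type: funct decides
        return FUNCT_TO_ALU[funct]
    raise ValueError(f"unsupported ALUOp: {alu_op:02b}")
```

Note that the funct field is ignored unless ALUOp is 10, which is what lets the main control stay unaware of individual R-type operations.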

Main Control Unit

The main control unit generates the control signals from instruction[31:26], the opcode.
Main control signals
- RegDst
- RegWrite
- ALUSrc
- ALUOp
- MemRead
- MemWrite
- MemtoReg
- Branch
- Jump
Meaning of each signal
- RegDst: selects whether the destination register is rt or rd.
- RegWrite: whether to write the register file.
- ALUSrc: selects whether the second ALU input is a register value or an immediate.
- ALUOp: specifies the ALU operation at a coarse level.
- MemRead: whether to read data memory.
- MemWrite: whether to write data memory.
- MemtoReg: selects whether the value written to the register is the ALU result or memory data.
- Branch: marks a branch instruction.
- Jump: marks a jump instruction.
Control signal table (comes up often on exams)

Instruction	Opcode	RegDst	RegWrite	ALUSrc	ALUOp	MemRead	MemWrite	MemtoReg	Branch	Jump
R-type	000000	1	1	0	10	0	0	0	0	0
addi	001000	0	1	1	00	0	0	0	0	0
lw	100011	0	1	1	00	1	0	1	0	0
sw	101011	d	0	1	00	0	1	d	0	0
beq	000100	d	0	0	01	0	0	d	1	0
j	000010	d	0	d	d	0	0	d	d	1

d means don't care: the value of that signal does not affect the result for that instruction.

Problems with the Single-cycle Processor

In a single-cycle design, every instruction must finish within one cycle.
The clock period must be set by the slowest instruction.
The critical path is the load instruction.

text

Instruction memory
→ register file
→ ALU
→ data memory
→ register file

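That is, lw has the longest path.
Problem
- Instructions that could finish sooner, such as add, beq, and sw, must still use the long clock cycle sized for lw.
Could the clock period vary per instruction?
- Not practical: the design becomes too complex.
- Meanwhile, a single clock sized for lw runs against the design principle "make the common case fast".
So pipelining is introduced to improve performance.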

Pipelining

The slides explain it with a laundry analogy.
Steps for one person's laundry
- Washing: 30 min
- Drying: 30 min
- Folding: 30 min
- Packing: 30 min
If each person finishes before the next one starts

text

Without pipelining:
4 people × 2 hours = 8 hours

With overlap, as in a pipeline, the second person can start washing as soon as the first person moves on to drying.

text

With pipelining:
4 people = 3.5 hours

speedup = 8 / 3.5 ≈ 2.3
When work keeps arriving, the speedup approaches the number of stages.
General formula

text

S  = number of stages
ts = time per stage
N  = number of tasks

without pipelining = N × S × ts
with pipelining    = (N + S - 1) × ts

for large N, speedup ≈ S

Key points
- Pipelining does not reduce the latency of a single instruction.
- It raises overall instruction throughput.
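The general formula can be checked against the laundry numbers with a few lines of code. This is a sketch; `pipeline_speedup` is a hypothetical name, and it assumes every stage takes the same time.

```python
def pipeline_speedup(n_tasks: int, n_stages: int, stage_time: float) -> float:
    """Speedup of an ideal pipeline with equal-length stages."""
    without_pipelining = n_tasks * n_stages * stage_time
    with_pipelining = (n_tasks + n_stages - 1) * stage_time
    return without_pipelining / with_pipelining


# Laundry example: 4 people, 4 stages of 0.5 hours each → 8 h vs 3.5 h.
laundry = pipeline_speedup(4, 4, 0.5)

# With many tasks the speedup approaches the stage count (4 here).
many = pipeline_speedup(1_000_000, 4, 0.5)
```

The `S - 1` term is the time spent filling the pipeline, so its effect fades as N grows.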

MIPS 5-stage Pipeline

The MIPS pipeline is divided into five stages.
- IF: Instruction Fetch
- ID: Instruction Decode & Register Read
- EX: Execute operation or calculate address
- MEM: Access memory operand
- WB: Write result back to register
Role of each stage
- IF: fetch the instruction from instruction memory using the PC; compute PC + 4.
- ID: decode the instruction, read the register file, generate control signals.
- EX: perform the ALU operation, compute the load/store address, compute the branch target.
- MEM: data memory access for lw/sw.
- WB: write the ALU result or memory data back to the register file.

Pipeline Performance Calculation

Assumed stage times

text

register read/write = 100ps
other stages        = 200ps

Single-cycle execution time per instruction

text

lw       = 200 + 100 + 200 + 200 + 100 = 800ps
sw       = 200 + 100 + 200 + 200       = 700ps
R-format = 200 + 100 + 200 + 100       = 600ps
beq      = 200 + 100 + 200             = 500ps

In the single-cycle processor the clock is 800ps because of lw, the slowest instruction.
In the pipeline the clock period is the time of the slowest stage.
- pipeline clock period = 200ps
Speedup

text

without pipelining = N × 800ps
with pipelining    = (N + 5 - 1) × 200ps

for large N, speedup ≈ 800 / 200 = 4

Why the speedup approaches 4 rather than 5 for a 5-stage pipeline
- The stages are not perfectly balanced.
- The single-cycle baseline is 800ps, not 5 × 200ps.
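The numbers in this section can be reproduced from the stage times alone. The sketch below uses hypothetical names (`STAGE_PS`, `STAGES_USED`, `total_time_ps`) and assumes the stage usage per instruction shown in the table above.

```python
# Stage times in picoseconds, as assumed in this section.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Which stages each instruction class actually needs.
STAGES_USED = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}

latency_ps = {
    name: sum(STAGE_PS[s] for s in stages) for name, stages in STAGES_USED.items()
}
single_cycle_clock_ps = max(latency_ps.values())   # set by the slowest instruction
pipeline_clock_ps = max(STAGE_PS.values())         # set by the slowest stage


def total_time_ps(n_instructions: int, pipelined: bool) -> int:
    """Total execution time for n instructions with no hazards."""
    if pipelined:
        return (n_instructions + len(STAGE_PS) - 1) * pipeline_clock_ps
    return n_instructions * single_cycle_clock_ps
```

For three instructions this gives 2400ps without pipelining and 1400ps with it; the ratio only approaches 4 as the instruction count grows.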

Hazard

A pipeline does not always flow ideally.
A hazard is a situation in which an instruction cannot execute its next pipeline stage in the following cycle.
There are three kinds of hazard.
- Data hazard
- Structural hazard
- Control hazard

Data hazard

The next instruction needs a result that the previous instruction has not yet produced.

mips

add $s0, $t0, $t1
sub $t2, $s0, $t3

sub needs the value of $s0, but $s0 is produced by the add immediately before it.

Structural hazard

Two instructions try to use the same hardware resource at the same time.
Example: instruction fetch and data memory access conflict if they share a single memory.

Control hazard

Because of a branch or jump, the instruction to fetch next is not yet known.
Until beq is resolved as taken or not taken, the next PC cannot be known for certain.

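Data Hazard Solution 1: Stalling

The simplest solution is to wait.
- stall: hold the pipeline until the needed value is ready.
- bubble: an empty slot inserted into the pipeline.
Example: make sub wait until the result of the preceding add has been written back to the register file.
Drawback
- The more stalls, the higher the CPI.
So other methods are used to reduce stalls wherever possible.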

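Data Hazard Solution 2: WB/ID Timing Optimization

A cycle can be thought of as split into two halves.
- first half of the cycle: WB writes the register file.
- second half of the cycle: ID reads the register file.
One instruction can write the register file in the first half of a cycle, and another can read it in the second half of the same cycle.
This removes the hazard when the producer's WB and the consumer's ID fall in the same cycle.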

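Data Hazard Solution 3: Forwarding (Bypassing)

The most important solution.
The idea of forwarding
- Instead of waiting for the result to be written back to the register file, pass the computed value straight to the instruction that needs it.
- That is, deliver it directly to the ALU input before the register file write.
Example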

mips

add $s0, $t0, $t1
sub $t2, $s0, $t3

Instead of sub reading add's result from the register file, add's ALU result is sent directly to sub's ALU input.
Advantage
- Greatly reduces stalls.
Drawback
- Requires extra connections, muxes, and a forwarding unit in the datapath.

Load-use Data Hazard

Forwarding is not a cure-all. The classic exception is the load-use hazard.

mips

lw  $s0, 0($t0)
sub $t2, $s0, $t3

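The result of lw comes out only after data memory has been read, i.e. it is ready at the end of the MEM stage.
But the very next instruction, sub, needs $s0 in its EX stage.
So the value is needed before it is available.
As the slides put it
- "Can't forward backward in time!"
So a load-use hazard normally requires a 1-cycle stall.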

Compiler Code Scheduling

The hazard need not be handled by hardware alone; the compiler can help.
The compiler reorders instructions so that a load result is not used by the very next instruction.
Original code

mips
lw  $t1, 0($t0)
lw  $t2, 4($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
lw  $t4, 8($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)

Here load-use dependences cause two stalls: each add uses the result of the lw immediately before it.
Reordered code

mips
lw  $t1, 0($t0)
lw  $t2, 4($t0)
lw  $t4, 8($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)

The independent lw $t4, 8($t0) is moved up, so the time spent waiting for load results does useful work.
Result

text

before scheduling: 13 cycles
after scheduling:  11 cycles

This is the basic idea of compiler instruction scheduling.
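The 13 and 11 cycle counts can be reproduced with a small model. This is a sketch under stated assumptions: a 5-stage pipeline with full forwarding, where the only stall is one cycle whenever an instruction reads the register loaded by the lw right before it. The tuple encoding and the name `count_cycles` are hypothetical.

```python
PIPELINE_DEPTH = 5


def count_cycles(program):
    """program: list of (opcode, destination or None, list of source registers)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        # Load-use: the previous instruction is a load whose result is needed now.
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    # First instruction takes 5 cycles, each later one adds 1, plus stalls.
    return len(program) + (PIPELINE_DEPTH - 1) + stalls


original = [
    ("lw", "$t1", ["$t0"]),
    ("lw", "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw", None, ["$t3", "$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw", None, ["$t5", "$t0"]),
]

scheduled = [
    ("lw", "$t1", ["$t0"]),
    ("lw", "$t2", ["$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw", None, ["$t3", "$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw", None, ["$t5", "$t0"]),
]
```

Both versions execute the same seven instructions; the reordering only removes the two load-use stalls.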

Structural Hazard

A structural hazard arises when two instructions try to use the same hardware at the same time.
Suppose instruction memory and data memory were merged into one in the MIPS pipeline.
- IF stage: needs memory for instruction fetch.
- MEM stage: lw/sw needs memory for data access.
If both request the same memory in the same cycle, they conflict.
So a pipelined datapath is normally designed with one of the following.
- Separate instruction memory and data memory.
- Or separate instruction cache and data cache.
This allows instruction fetch and data access to happen in the same cycle.

Control Hazard

A branch changes control flow.

mips

beq $t0, $s0, target

If taken, the PC goes to target; if not taken, to PC + 4.
Problem
- The pipeline tries to fetch the next instruction before the branch outcome is known.
Simplest solution
- Stall until the branch outcome is determined.
Drawback
- If branches are frequent, the stall penalty becomes too large.
So branch prediction is used.

Branch Prediction

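branch prediction = guess the branch outcome in advance.
- If the prediction is right, proceed without a stall.
- If it is wrong, discard the wrongly fetched instructions and fetch again.
A simple strategy for the MIPS pipeline
- Predict not taken: assume the branch is not taken and just fetch the next sequential instruction.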

Static Branch Prediction

Predict from typical patterns, without looking at run-time history.
Rules
- Backward branch → predict taken.
- Forward branch → predict not taken.
Reason
- Loops usually end in a backward branch, which is likely to be taken.

Dynamic Branch Prediction

Hardware records the recent behavior of each branch.
- If it was mostly taken recently, predict taken.
- If it was mostly not taken recently, predict not taken.
On a misprediction, stall while re-fetching, and update the history.

Pipeline Summary

The essence of pipelining
- Overlap multiple instructions to raise instruction throughput.
Caveats
- The latency of each individual instruction is not necessarily reduced.
- Once pipeline register overhead and the like are considered, individual instruction latency stays about the same or gets slightly longer.
For a pipeline to work well, hazards must be handled.
- Data hazard → forwarding, stalling, scheduling
- Structural hazard → separate resources
- Control hazard → branch prediction, early branch resolution
Instruction set design also affects how hard a pipeline is to implement.
- MIPS has regular instruction formats and is a load/store architecture, which suits pipelining.

Pipelined Datapath

Turning the single-cycle datapath into a pipeline requires storing values between stages.
So pipeline registers are inserted.
- IF/ID
- ID/EX
- EX/MEM
- MEM/WB
Role of each pipeline register
- IF/ID: passes the fetched instruction and PC+4 to the ID stage.
- ID/EX: passes the register read values, sign-extended immediate, and control signals to the EX stage.
- EX/MEM: passes the ALU result, branch-related values, and memory control signals to the MEM stage.
- MEM/WB: passes the memory read data or ALU result and the write-back control signals to the WB stage.
A pipeline register holds each stage's inputs stable for one cycle and is updated at the next cycle.

The Write-back Destination Problem

A point that needs care when building the pipeline.
The instruction writing back in the WB stage is not the instruction currently in the ID stage.
- Using the destination of the ID-stage instruction could write to the wrong register.
So the destination register number must also be carried through the pipeline registers to the end.

text

instruction[20:16] = rt
instruction[15:11] = rd
→ ID/EX
→ RegDst mux (EX stage) selects rt or rd
→ EX/MEM
→ MEM/WB
→ used as the register file write register in the WB stage

The slides flag this as "Wrong write register!" and modify the datapath to carry the write register to the WB stage.

Reading Pipeline Diagrams

The slides distinguish two kinds of pipeline diagram.

Multi-cycle Pipeline Diagram

Shows which stage each instruction passes through across multiple cycles.
A table of the time flow per instruction.

text

cycle 1: I1 IF
cycle 2: I1 ID, I2 IF
cycle 3: I1 EX, I2 ID, I3 IF
...

Single-cycle Pipeline Diagram

Takes a slice at one particular cycle and shows which instructions are in which stage at that moment.
A snapshot of one clock cycle.
On the exam, the two kinds of diagram must be told apart.

Pipelined Control

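Control signals are derived from the instruction, just as in the single-cycle implementation.
In a pipeline, however, the control signals must travel through the stages together with their instruction.
- If control generated in one place were applied to all stages at once, signals from different instructions would get mixed.
So the control signals are grouped by stage and stored in the pipeline registers.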

text

EX control:  RegDst, ALUOp, ALUSrc
M control:   Branch, MemRead, MemWrite
WB control:  RegWrite, MemtoReg

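Flow
- Control signals are generated in the ID stage.
- They are stored in the ID/EX register.
- The EX signals are used in the EX stage.
- The M signals are passed on to EX/MEM.
- The WB signals are passed all the way to MEM/WB.
So control signals flow down the pipeline just like data.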

Forwarding Unit

To forward, the CPU must detect dependences.
Example sequence

mips
sub $2,  $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

sub produces $2, and several later instructions use $2.
The value can be forwarded instead of waiting for the write-back to the register file.
Information the forwarding unit looks at

text

source registers of the instruction currently in EX:
  ID/EX.RegisterRs
  ID/EX.RegisterRt

destination register of the instruction in MEM:
  EX/MEM.RegisterRd
  EX/MEM.RegWrite

destination register of the instruction in WB:
  MEM/WB.RegisterRd
  MEM/WB.RegWrite

A mux in front of each ALU input selects one of the following.
- the original register value from ID/EX
- the value forwarded from the EX/MEM stage
- the value forwarded from the MEM/WB stage

Forwarding Condition: Forward from EX/MEM

The case where the nearest preceding instruction has produced the result.
ALU input A (Rs)

text

if (EX/MEM.RegWrite == 1)
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd == ID/EX.RegisterRs)
then ForwardA = 10

ALU input B (Rt)

text

if (EX/MEM.RegWrite == 1)
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd == ID/EX.RegisterRt)
then ForwardB = 10

Why the RegisterRd != 0 condition matters
- The MIPS $zero register does not change even when written.
- If the destination is register 0, the value must not be forwarded.

Forwarding Condition: Forward from MEM/WB

Forward the result of a slightly older instruction from the WB stage.
ALU input A

text

if (MEM/WB.RegWrite == 1)
and (MEM/WB.RegisterRd != 0)
and (MEM/WB.RegisterRd == ID/EX.RegisterRs)
then ForwardA = 01

ALU input B

text

if (MEM/WB.RegWrite == 1)
and (MEM/WB.RegisterRd != 0)
and (MEM/WB.RegisterRd == ID/EX.RegisterRt)
then ForwardB = 01

However, care is needed when there is a double data hazard.

Double Data Hazard

Example sequence

mips
add $1, $2, $3
sub $1, $4, $5
and $6, $1, $7

and uses $1, but two instructions produce $1.
- add produces $1.
- sub also produces $1.
The value that and must use is the more recent sub result, not the older add result.
So when an EX/MEM hazard and a MEM/WB hazard occur together, EX/MEM forwarding takes priority.
The MEM/WB forwarding condition needs the following added.
- Forward from MEM/WB only when there is no EX/MEM hazard on the same source register.
That is, the most recent result must be used first.
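The EX/MEM and MEM/WB conditions, including the priority rule for the double data hazard, fit into one small function. This is a sketch; the function name `forwarding_unit` and its argument names are hypothetical, and register numbers are plain integers.

```python
def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_regwrite, ex_mem_rd,
                    mem_wb_regwrite, mem_wb_rd):
    """Return (ForwardA, ForwardB) as 2-bit mux selects."""

    def select(source_reg):
        # EX/MEM hazard is checked first so the most recent result wins.
        if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == source_reg:
            return 0b10
        # MEM/WB hazard applies only when there was no EX/MEM hazard.
        if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == source_reg:
            return 0b01
        return 0b00  # no forwarding: use the value read from the register file

    return select(id_ex_rs), select(id_ex_rt)
```

For the sequence above, when `and $6, $1, $7` is in EX, sub is in MEM and add is in WB, both writing $1, so the call `forwarding_unit(1, 7, True, 1, True, 1)` selects the EX/MEM value for input A.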

Hazard Detection Unit

For a load-use hazard, which forwarding cannot resolve, the hazard detection unit inserts a stall.
Typical condition

text

if ID/EX.MemRead
and (
    ID/EX.Rt == IF/ID.Rs
    or
    ID/EX.Rt == IF/ID.Rt
)
then stall

Meaning
- If the instruction currently in EX is a load,
- and the destination register that load will write
- matches a source register of the instruction currently in ID,
- then the very next instruction needs the load result, so a stall is required.
Example

mips

lw  $2, 20($1)
and $9, $2, $5

and needs $2, but $2 must first be read from memory by the lw immediately before it.
This case needs a 1-cycle stall.

How a Stall Is Actually Inserted

A stall does not happen just by saying "wait"; the datapath and control have to be manipulated.
For a load-use hazard the CPU normally does three things.

1. PCWrite = 0

The PC is not updated.
The IF stage fetches the same instruction again.

2. IF/IDWrite = 0

The IF/ID pipeline register is not updated.
The instruction in the ID stage stays where it is.

3. Zero the control signals to insert a bubble

Forcing the control signals entering ID/EX to 0 puts a bubble into the EX stage that behaves like a nop.
As a result the load moves forward while the dependent instruction waits one cycle.
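The load-use detection condition and the three stall actions can be sketched together. The function names and the `ZeroIDEXControl` key are hypothetical labels for illustration, not signal names from the slides.

```python
def load_use_stall(id_ex_mem_read: bool, id_ex_rt: int,
                   if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in ID needs the result of a load in EX."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


def stall_controls(stall: bool) -> dict:
    """Signals driven by the hazard detection unit for one cycle."""
    return {
        "PCWrite": 0 if stall else 1,          # hold the PC
        "IF/IDWrite": 0 if stall else 1,       # hold the instruction in ID
        "ZeroIDEXControl": 1 if stall else 0,  # inject a bubble into EX
    }


# lw $2, 20($1) in EX (rt = 2), and $9, $2, $5 in ID (rs = 2, rt = 5).
example = stall_controls(load_use_stall(True, 2, 2, 5))
```

After the one-cycle bubble, the loaded value is in MEM/WB and can be forwarded to the dependent instruction's EX stage.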

Reducing Branch Hazards: Resolving the Branch Early in the ID Stage

If the branch outcome is determined in the MEM stage, a misprediction can waste 3 cycles.
Solution: decide the branch earlier.
- Resolve branch at the end of ID stage.
Extra work needed in the ID stage
- compute the branch target address
- compare the registers
Effect

text

resolved in MEM → 3 cycles wasted on a misprediction
resolved in ID  → 1 cycle lost on a misprediction

Branch Data Hazard

Resolving the branch early in the ID stage helps, but it creates a new data hazard.
Reason: the branch must compare register values in the ID stage, and those values may still be being computed by earlier instructions.

mips

add $1, $2, $3
beq $1, $4, target

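beq wants to compare $1 and $4 in the ID stage, but $1 is produced by the add immediately before it.
The slides split branch data hazards into three cases.

Case 1. A comparison register is the destination of the 2nd or 3rd preceding ALU instruction

mips
add $1, $2, $3
add $4, $5, $6
...                  # one unrelated instruction
beq $1, $4, target

Can be resolved with forwarding.

Case 2. A comparison register is the destination of the immediately preceding ALU instruction or the 2nd preceding load

1 stall cycle needed.

Case 3. A comparison register is the destination of the immediately preceding load instruction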

mips

lw  $1, addr
beq $1, $0, target

The load data arrives too late, so 2 stall cycles are needed.
So resolving the branch early in ID reduces the control hazard penalty, but handling the data hazards on the registers the branch reads becomes more important.
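The three cases can be summarized as a function of how far back the producing instruction is and whether it is a load. This is a sketch of the case analysis above only; the name `branch_stall_cycles` and the `distance` encoding (1 = immediately preceding) are hypothetical.

```python
def branch_stall_cycles(distance: int, producer_is_load: bool) -> int:
    """Stall cycles for a beq resolved in ID that reads a recently written register.

    distance = 1 means the producer is the instruction right before the beq.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    if producer_is_load:
        # Load data is ready one stage later than an ALU result:
        # distance 1 → 2 stalls, distance 2 → 1 stall, otherwise none.
        return max(0, 3 - distance)
    # ALU producer: distance 1 → 1 stall, distance 2 or more → forwarding suffices.
    return max(0, 2 - distance)
```

The pattern is that each extra instruction between producer and branch removes one stall cycle, and a load starts one cycle worse than an ALU instruction.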

Summary

Chapter 4 covers how MIPS instructions pass through the CPU's datapath, how control signals steer that path, and how hazards are handled when pipelining is used to raise performance.
The single-cycle datapath is easy to understand, but the slowest instruction sets the clock period, which limits performance.
Pipelining overlaps multiple instructions to raise throughput.
Pipelines suffer data hazards, structural hazards, and control hazards.
Data hazards are handled with forwarding, stalling, and compiler scheduling.
Structural hazards are handled by separating instruction memory and data memory.
Control hazards are reduced with branch prediction and early branch resolution.
The extra hardware needed includes a forwarding unit, a hazard detection unit, pipeline registers, and extra muxes.
In the end, Chapter 4 starts from a simple datapath and extends it into a realistic pipelined processor, teaching the trade-off between performance and complexity.

Exam/Review Points

CPU execution flow: PC → instruction memory → register file → ALU → data memory → register write-back → PC update.
MIPS 5-stage pipeline: IF, ID, EX, MEM, WB.
ALU control: ALUOp 00 → add (lw/sw/addi), 01 → sub (beq), 10 → decided by funct (R-type).
Branch target: PC + 4 + (sign-extended offset << 2).
Jump target: PC+4[31:28] : (instruction[25:0] << 2).
Memorize the control signal table per instruction (R-type, addi, lw, sw, beq, j).
Pipeline speedup approaches the number of stages for large N, provided the stages are balanced.
Three kinds of hazard: data, structural, control.
Forwarding delivers a value from EX/MEM or MEM/WB directly to an ALU input.
The RegisterRd != 0 condition is mandatory in the forwarding conditions (because of $zero).
In a double data hazard, EX/MEM forwarding takes priority over MEM/WB.
A load-use hazard cannot be resolved by forwarding alone; a 1-cycle stall is needed.
Stall implementation: PCWrite = 0, IF/IDWrite = 0, ID/EX control signals = 0.
Branch hazard: resolving the branch in the ID stage makes the misprediction penalty 1 cycle.
Branch data hazard: distinguish three cases (resolvable by forwarding, 1 stall, 2 stalls).
Pipelining raises throughput; it does not reduce latency.
