Comparison of CPUs in DSM systems at NCAR/CISL

Last update: 6/11/08

frost: IBM (PowerPC-440)
bluevista: IBM (POWER5)
blueice: IBM (POWER5+)
bluefire: IBM (POWER6)
lightning: AMD (Opteron 248)
 
Chip speed (clock)
frost: 700 MHz
bluevista: 1,900 MHz
blueice: 1,900 MHz
bluefire: 4,700 MHz
lightning: 2,200 MHz
 
Cycle time
frost: 1.43 ns
bluevista: 0.5263 ns
blueice: 0.5263 ns
bluefire: 0.2128 ns
lightning: 0.45 ns
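The cycle time is simply the reciprocal of the clock frequency, so the figures in this row can be rechecked from the "Chip speed (clock)" row. The following is a minimal Python sketch of that arithmetic; the names `CLOCK_MHZ` and `cycle_time_ns` are illustrative only and do not come from the original page.

```python
# Clock speeds in MHz, copied from the "Chip speed (clock)" row.
CLOCK_MHZ = {
    "frost": 700,
    "bluevista": 1900,
    "blueice": 1900,
    "bluefire": 4700,
    "lightning": 2200,
}


def cycle_time_ns(clock_mhz):
    """Return the duration of one clock cycle in nanoseconds."""
    # 1 MHz is 1e6 cycles per second, so one cycle lasts 1000 / MHz nanoseconds.
    return 1000.0 / clock_mhz


for system, mhz in CLOCK_MHZ.items():
    print(f"{system:10s} {mhz:5d} MHz  {cycle_time_ns(mhz):.4f} ns")
```

Printing four decimal places makes it easy to compare against the values listed for each system.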
 
Die type
frost: Dual core
bluevista: Single core
blueice: Dual core
bluefire: Dual core
lightning: Single core
 
Cache size

frost:

L1 -- 32-KB data cache, 32-KB instruction cache, 32-B cache line, 64-way set-associative. L1 is not coherent, so only 1 CPU is used.

L2 -- 256 KB, shared, 128-B line, fully associative.

L3 -- 4 MB, 8-way set-associative, 128-B cache line.

bluevista:

L1 -- 96 KB per core (32-KB data cache, 64-KB instruction cache), 128-B cache line, 4-way associative.

L2 -- 1.92 MB, 128-B line, 10-way set-associative.

L3 -- 36 MB, 12-way set-associative, 128-B cache line.

blueice:

L1 -- 96 KB per core (32-KB data cache, 64-KB instruction cache), 128-B cache line, 4-way associative.

L2 -- 1.92 MB per processor, 128-B line, 10-way set-associative.

L3 -- 36 MB, 12-way set-associative, 128-B cache line.

bluefire:

L1 -- 128 KB (64-KB data + 64-KB instruction) per processor

L2 -- 4 MB per processor

L3 -- (off chip) 32 MB per chip, shared by the two processors

lightning:

L1 -- 64 KB, 2-way set-associative

L2 -- 1024 KB, 16-way associative

64-B cache lines on both

8-cycle latency (to transfer a whole cache line)
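Capacity, line size and associativity together determine how many sets a cache has: sets = capacity / (line size x ways). The sketch below applies that relation to the entries above that give all three numbers. It assumes 1 KB = 1024 bytes and treats each quoted capacity as a single cache; the helper name `cache_sets` and the `CACHES` list are illustrative, not part of the original comparison.

```python
KB = 1024  # assumption: the table's "KB" means 1024 bytes


def cache_sets(size_bytes, line_bytes, ways):
    """Number of sets in a set-associative cache."""
    sets, remainder = divmod(size_bytes, line_bytes * ways)
    if remainder:
        raise ValueError("capacity is not a multiple of line size x ways")
    return sets


# (label, capacity in bytes, line size in bytes, associativity)
CACHES = [
    ("frost L1 data", 32 * KB, 32, 64),
    ("bluevista/blueice L1 data", 32 * KB, 128, 4),
    ("lightning L1", 64 * KB, 64, 2),
    ("lightning L2", 1024 * KB, 64, 16),
]

for label, size, line, ways in CACHES:
    print(f"{label:28s} {cache_sets(size, line, ways):5d} sets")
```

Entries that omit the line size or associativity (for example the bluefire caches) are left out rather than guessed.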

 
Cache latencies

frost:

L1 miss: 3 cycles

L2 miss: 11 cycles

L3 miss: ~35 cycles; external DRAM 75 cycles

2 instructions/cycle possible

bluevista:

L1 miss: 4 cycles

L2 miss: 14 cycles

L3 miss: L3 is on-chip, so operates at half CPU speed

7 instructions/cycle possible

blueice:

L1 miss: 4 cycles

L2 miss: 14 cycles

L3 miss: L3 is on-chip, so operates at half CPU speed

7 instructions/cycle possible

bluefire:

L1 miss:

L2 miss:

4 floating-point operations/cycle per processor possible

lightning:

L1 miss: 2 cycles

L2 miss (to local memory): 19 cycles (includes L1 latency)

3 instructions/cycle possible
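Latencies quoted in cycles are hard to compare across chips with different clocks, so it can help to convert them to nanoseconds. The sketch below does that for the systems with numeric entries. It assumes the cycle counts are measured in core clock cycles at the clock speeds listed earlier, which the source does not state explicitly; `cycles_to_ns` and `LATENCIES` are illustrative names.

```python
def cycles_to_ns(cycles, clock_mhz):
    """Convert a latency in clock cycles to nanoseconds."""
    # One cycle lasts 1000 / MHz nanoseconds.
    return cycles * 1000.0 / clock_mhz


# (system, clock in MHz, {event: latency in cycles}), figures from this comparison.
# Assumption: cycles are core clock cycles.
LATENCIES = [
    ("frost", 700, {"L1 miss": 3, "L2 miss": 11, "external DRAM": 75}),
    ("bluevista", 1900, {"L1 miss": 4, "L2 miss": 14}),
    ("blueice", 1900, {"L1 miss": 4, "L2 miss": 14}),
    ("lightning", 2200, {"L1 miss": 2, "L2 miss (local memory)": 19}),
]

for system, mhz, events in LATENCIES:
    for event, cycles in events.items():
        ns = cycles_to_ns(cycles, mhz)
        print(f"{system:10s} {event:24s} {cycles:3d} cycles = {ns:7.2f} ns")
```

bluefire is omitted because its miss latencies are not given.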

 
Translation Lookaside Buffer (TLB)
frost: CPU memory management unit is a 64-entry, fully associative unified TLB supporting variable page sizes.
bluevista: TLB holds 1024 entries, 4-way set-associative; pages can be 4 KB or 16 MB. Also has 2 ERATs with 128 entries each.
blueice: TLB holds 1024 entries, 4-way set-associative; pages can be 4 KB or 64 KB (settable by user). 16-MB pages possible (requires system reset). Also has 2 ERATs with 128 entries each.
bluefire:
lightning: 2-level TLB:
L1 TLB holds 32 entries to 4 KB pages, fully associative
L2 TLB holds 512 entries, 4-way associative
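A useful derived number is the TLB "reach", the amount of memory that can be mapped without a TLB miss: entries x page size. The sketch below works this out for the entry counts and page sizes quoted above. It assumes 1 KB = 1024 bytes and, for lightning's L2 TLB, assumes 4 KB pages (the source states the page size only for the L1 TLB); the function names are illustrative.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def tlb_reach_bytes(entries, page_bytes):
    """Memory covered by a TLB when every entry maps one page."""
    return entries * page_bytes


def fmt_bytes(n):
    """Format a byte count using the largest binary unit that fits."""
    for unit, size in (("GB", GB), ("MB", MB), ("KB", KB)):
        if n >= size:
            return f"{n / size:g} {unit}"
    return f"{n} B"


# (label, entries, page size in bytes)
CASES = [
    ("bluevista/blueice, 4 KB pages", 1024, 4 * KB),
    ("blueice, 64 KB pages", 1024, 64 * KB),
    ("bluevista/blueice, 16 MB pages", 1024, 16 * MB),
    ("lightning L1 TLB, 4 KB pages", 32, 4 * KB),
    ("lightning L2 TLB, 4 KB pages (assumed)", 512, 4 * KB),
]

for label, entries, page in CASES:
    print(f"{label:40s} {fmt_bytes(tlb_reach_bytes(entries, page))}")
```

The spread between 4 KB and 16 MB pages illustrates why larger pages can reduce TLB misses for codes with large working sets.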
 
TLB latencies
frost: Very low latency if in L2
bluevista: Very low latency if in L2
blueice: Very low latency if in L2
bluefire: Very low latency if in L2
lightning: Similar to L2 cache miss
 
Registers
frost: Double FPU has 32 primary f.p. registers, 32 secondary f.p. registers
bluevista: 120 GPRs, 120 FPRs
blueice: 120 GPRs, 120 FPRs
bluefire:
lightning: 16 general-purpose (X86 integer) registers, 64 f.p. (128-bit media, 64-bit media, and X87 f.p.) registers
 
Functional units

frost:

3 32-bit integer pipelines:
"L" pipe: loads/stores
"I" pipe or "J" pipe: add/neg, log; "I" pipe only: branches/mult/div

7-stage pipeline

No support for f.p. in the processor core

Floating-point pipeline: 5 cycles; floating-point load-to-use latency: 4 cycles

Can do 2 instructions/cycle

bluevista:

7 functional units:

2 floating-point even units
Separate branch and conditional units
3 fixed-point units
2 load/store units

Can fetch 2 groups of 5 instructions/cycle and complete 10 instructions/cycle

blueice:

7 functional units:

2 floating-point even units
Separate branch and conditional units
3 fixed-point units
2 load/store units

Can fetch 2 groups of 5 instructions/cycle and complete 10 instructions/cycle

bluefire:

7 functional units:

2 floating-point even units
Separate branch and conditional units
3 fixed-point units
2 load/store units

lightning:

7 pipelined functional units (1 f.p. FMA, 3 integer, 3 AGU)

Pipeline depths: 12 (int), 17 (f.p.)

Integer:
Double clocked
6 integer results/clock

Floating-point:
Double clocked
FMA unit
2 x 64-bit results/clock peak, or 4 x 32-bit results/clock peak

Can do 3 instructions/cycle

Max of 72 instructions in flight at once

 
Prefetching?
frost: Yes, configurable
bluevista: Yes, automatic
blueice: Yes, automatic
bluefire: Yes
lightning: Yes
 
Simultaneous Multi-Threading (SMT)?
frost: No
bluevista: Yes. SMT appears to the OS as multiple CPUs. Threaded applications may take advantage of this (e.g., "ptile=16").
blueice: Yes. SMT appears to the OS as multiple CPUs. Threaded applications may take advantage of this (e.g., "ptile=32").
bluefire: Yes. SMT appears to the OS as multiple CPUs. Threaded applications may take advantage of this (e.g., "ptile=64").
lightning: No
 
Peak rates (SPECfp2000)
frost: 946
bluevista: 2702
blueice: 2702
bluefire: SPECfp2006 - 22.3
lightning: 1691
 
Page sizes (settable by user)
frost: 8 possible page sizes
bluevista: 4 KB
blueice: 4 KB or 64 KB
bluefire: 4 KB or 64 KB
lightning: 4 KB