ARM Cortex A8 Pipeline EE126 Wei Wang. Cortex A8 is a processor core designed by ARM Holdings....


ARM Cortex A8 Pipeline

EE126 Wei Wang


• Cortex A8 is a processor core designed by ARM Holdings.
• Applications: Apple A4, Samsung Exynos 3110.

What is the pipeline architecture of the Cortex A8?

A deeper pipeline combined with a superscalar pipeline.


Deeper Pipeline

Why does it break one stage into several cycles?

In a pipeline, the clock speed is limited by the longest stage: the cycle time is set to the delay of that stage. In a deeper pipeline each stage is split into shorter sub-stages, so the cycle time becomes smaller. Once the pipeline is full, an instruction completes every cycle, so a shorter cycle means instructions complete at a faster rate.

[Figure: the IF, ID and EXE stages split into the sub-stages F0 F1 F2, D0 D1 D2 D3 D4 and E0 E1 E2 E3 E4 E5.]
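The effect of splitting stages can be checked with a small calculation. The sketch below is a toy model, not a measurement of the Cortex A8: the stage delays are made-up numbers, the function names are hypothetical, and it assumes an ideal pipeline with no stalls.

```python
def cycle_time(stage_delays_ns):
    """The clock period is set by the slowest stage."""
    return max(stage_delays_ns)


def total_time_ns(stage_delays_ns, n_instructions):
    """Time to run n instructions on an ideal pipeline with no stalls.

    The first instruction needs one cycle per stage; every later
    instruction completes one cycle after the previous one.
    """
    n_stages = len(stage_delays_ns)
    return (n_stages + n_instructions - 1) * cycle_time(stage_delays_ns)


# Hypothetical delays: 3 long stages (IF, ID, EXE) versus the same total
# work split into 14 sub-stages (F0-F2, D0-D4, E0-E5).
shallow = [3.0, 5.0, 6.0]
deep = [1.0] * 14

shallow_total = total_time_ns(shallow, 1000)
deep_total = total_time_ns(deep, 1000)
print(shallow_total, deep_total)  # 6012.0 1013.0
```

With these assumed numbers the deep pipeline is limited by a 1 ns stage instead of a 6 ns stage, which is where the speedup comes from.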


Superscalar Pipeline

• It is a form of instruction-level parallelism: two instructions are executed at the same time, which makes it faster than a simple pipeline.

[Figure: timing diagram over cycles 0 to 9 comparing a simple 4-stage pipeline (IF ID EX WB) with a superscalar pipeline.]
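The gain from issuing two instructions at once can be illustrated with a cycle count. This is a minimal sketch with a hypothetical function name, assuming the 4-stage pipeline of the figure, fully independent instructions and no stalls.

```python
import math


def cycles_needed(n_instructions, n_stages=4, issue_width=1):
    """Cycles to finish n independent instructions with no stalls.

    issue_width=1 models the simple pipeline; issue_width=2 models a
    superscalar pipeline that starts two instructions per cycle.
    """
    issue_groups = math.ceil(n_instructions / issue_width)
    return n_stages + issue_groups - 1


print(cycles_needed(6, issue_width=1))  # simple 4-stage pipeline: 9
print(cycles_needed(6, issue_width=2))  # dual-issue superscalar: 6
```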


Cortex A8 Pipeline Main Architecture:

14-Stage Integer Pipeline

• Instruction Fetch: F0 F1 F2
• Instruction Decode: D0 D1 D2 D3 D4
• Instruction Execute and Load/Store: E0 E1 E2 E3 E4 E5
• Blocks: Architecture Register File, ALU Pipeline 0, MUL Pipeline 0, ALU Pipeline 1, Load/Store Pipeline 0/1, Integer register writeback

10-Stage NEON Pipeline

• Stages: M0 M1 M2 M3 N1 N2 N3 N4 N5 N6
• Blocks: NEON Instruction Decode, NEON Register File, Integer ALU Pipeline, Integer MUL Pipeline, Integer shift Pipeline, Non-IEEE FP ADD Pipeline, Non-IEEE FP MUL Pipeline, IEEE FP Engine, Load/Store Permute Pipeline, NEON register writeback
• Load/Store Data Queue, NEON Store Data


• Execution stages: 6-stage pipeline (E0 E1 E2 E3 E4 E5).

[Figure: Instruction Execute and Load/Store. The Architecture Register File feeds four pipelines that end in the integer register writeback:
ALU Pipeline: Shift, ALU+Flags, Sat, BP Update, WB
Multiply Pipeline: MUL1, MUL2, MUL3, ACC, WB
ALU Pipeline: Shift, ALU+Flags, Sat, BP Update, WB
Load/Store Pipeline: AGU, WB]

• It has extensive support for key forwarding paths. Result data is forwarded from the outputs of the shift, ALU and MUL stages as soon as it is produced, so intermediate execution-stage results can be forwarded. In a simple pipeline, by contrast, only the final execution-stage result can be forwarded.

Two symmetric ALU pipelines, a multiply pipeline and an address generator for load and store.

1. For the ALU pipeline:

E0: access the register file;

E1: shift if needed;

E2: ALU function;

E3: complete saturation if needed;

E4: resolve changes in control flow (branch prediction update);

E5: write back to the register file.

2. For the MUL pipeline:

E1-E3: implement the multiply;

E4: perform the accumulate addition;

E5: write back.
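The benefit of forwarding intermediate results can be sketched with the stage lists above. This is a toy timing model, not the documented Cortex A8 timing: the stall formula, the function name and the choice of E2 as the stage where the consumer needs its operand are all assumptions made for illustration.

```python
# Stage index (E0..E5) at whose end each result exists, taken from the
# stage lists above; "final" is the last execution stage, E5.
RESULT_READY = {"shift": 1, "alu": 2, "mul": 4, "final": 5}


def stall_cycles(ready_stage, need_stage, distance=1):
    """Bubbles needed before a dependent instruction can proceed.

    ready_stage: producer stage whose output holds the value.
    need_stage:  consumer stage that must see the value at its start.
    distance:    how many cycles after the producer the consumer issues.
    """
    return max(0, ready_stage + 1 - need_stage - distance)


# Intermediate forwarding: an ALU result (end of E2) feeds the next
# instruction's ALU stage (start of E2).
alu_to_alu = stall_cycles(RESULT_READY["alu"], need_stage=2)

# A multiply result exists only after the accumulate in E4.
mul_to_alu = stall_cycles(RESULT_READY["mul"], need_stage=2)

# Simple pipeline (assumption): only the final-stage result is forwarded.
final_only = stall_cycles(RESULT_READY["final"], need_stage=2)

print(alu_to_alu, mul_to_alu, final_only)  # 0 2 3
```

Under these assumptions, forwarding from the stage that produces the value removes the bubbles that forwarding only from the final stage would need.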


• Deep pipelines and superscalar pipelines perform well. Why not keep increasing the number of sub-stages and the number of parallel instructions?

• What are the limitations?


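• Data Dependency

Data independency (the ADD does not read the result of the MUL):

MUL t3,t2,t1
ADD t6,t5,t4

Data dependency (the ADD reads t3, which the MUL writes):

MUL t3,t2,t1
ADD t6,t3,t4

Solution: Stall the adder until the multiplier has finished.

[Figure: timing diagram over cycles 0 to 5 in which BUBBLE slots delay the dependent ADD.]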
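The check behind the stall can be written down in a few lines. This is a minimal sketch: the `Instr` type, the helper names and the default of two bubbles are assumptions for illustration, not the Cortex A8's actual interlock logic.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def parse(text):
    """Parse 'MUL t3,t2,t1' into Instr('MUL', 't3', ('t2', 't1'))."""
    op, operands = text.split(maxsplit=1)
    regs = [r.strip() for r in operands.split(",")]
    return Instr(op, regs[0], tuple(regs[1:]))


def has_data_dependency(first, second):
    """True if the second instruction reads what the first one writes."""
    return first.dest in second.srcs


def schedule(first, second, bubbles_if_dependent=2):
    """Issue order, with BUBBLE slots stalling a dependent second instruction."""
    stall = bubbles_if_dependent if has_data_dependency(first, second) else 0
    return [first.op] + ["BUBBLE"] * stall + [second.op]


mul = parse("MUL t3,t2,t1")
print(schedule(mul, parse("ADD t6,t5,t4")))  # ['MUL', 'ADD']
print(schedule(mul, parse("ADD t6,t3,t4")))  # ['MUL', 'BUBBLE', 'BUBBLE', 'ADD']
```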


• Output dependency:

• An output dependency occurs if two parallel instructions write to the same location. An error occurs if the second instruction writes its result before the first one, because the first instruction's write then overwrites it and leaves the wrong final value.

MUL t3,t2,t1;
ADD t3,t4,t5;


• Antidependency:

• An antidependency exists if an instruction uses a location as an operand while a following one is writing into that location; if the first one is still using the location when the second one writes into it, an error occurs.

MUL t3,t2,t1;
ADD t2,t4,t5;
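The three kinds of dependency differ only in which registers overlap, which a short classifier makes explicit. This is an illustrative sketch: the function name and the (dest, sources) tuple layout are assumptions chosen for brevity.

```python
def classify(first, second):
    """Return the set of dependencies of `second` on `first`.

    Each instruction is a (dest, sources) pair of register names.
    """
    first_dest, first_srcs = first
    second_dest, second_srcs = second
    found = set()
    if first_dest in second_srcs:
        found.add("data")    # second reads what first writes
    if first_dest == second_dest:
        found.add("output")  # both write the same location
    if second_dest in first_srcs:
        found.add("anti")    # second overwrites what first reads
    return found


mul = ("t3", ("t2", "t1"))                  # MUL t3,t2,t1
print(classify(mul, ("t6", ("t3", "t4"))))  # ADD t6,t3,t4 -> {'data'}
print(classify(mul, ("t3", ("t4", "t5"))))  # ADD t3,t4,t5 -> {'output'}
print(classify(mul, ("t2", ("t4", "t5"))))  # ADD t2,t4,t5 -> {'anti'}
```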


• Solution for output dependency and antidependency: use another register.

Output dependency:

MUL t3,t2,t1;
ADD t3,t4,t5;

becomes

MUL t3,t2,t1;
ADD t6,t4,t5;

Antidependency:

MUL t3,t2,t1;
ADD t2,t4,t5;

becomes

MUL t3,t2,t1;
ADD t6,t4,t5;

Alternative way to handle dependencies: the compiler can generate instructions with fewer dependencies.
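Switching to another register can be sketched as a small renaming step. This is a minimal illustration, whether it is done by a compiler or by hand: the function name, the tuple layout and the list of free registers are assumptions.

```python
def rename_dest(first, second, free_registers):
    """Give `second` a fresh destination if it clashes with `first`.

    Instructions are (op, dest, sources) tuples. A clash is an output
    dependency (same destination) or an antidependency (second writes a
    register that first still reads).
    """
    _, first_dest, first_srcs = first
    op, dest, srcs = second
    if dest == first_dest or dest in first_srcs:
        dest = free_registers.pop(0)
    return (op, dest, srcs)


mul = ("MUL", "t3", ("t2", "t1"))
print(rename_dest(mul, ("ADD", "t3", ("t4", "t5")), ["t6"]))  # ('ADD', 't6', ('t4', 't5'))
print(rename_dest(mul, ("ADD", "t2", ("t4", "t5")), ["t6"]))  # ('ADD', 't6', ('t4', 't5'))
```

In a longer program, any later instruction that reads the ADD result would also have to read the new register; the sketch leaves that step out.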


• Summary: The Cortex A8 achieves high speed by combining a deeper pipeline with a superscalar pipeline.


Thank you