2 April, 2024

Implementing registers and instructions for the Owl-2820 CPU.

Part 1 - Registers and instructions

In the previous post in this series, we looked at a C implementation of the Fibonacci sequence, then examined what it looked like when compiled to RISC-V assembly language on an RV32I CPU. We discovered that a CPU capable of running that code would need implementations for a mere handful of opcodes.

In this post we’ll get into some details and look at how we can implement that CPU. First, we’ll see how we can implement registers. Next, we’ll take a look at how the Owl CPU represents instructions in memory. And finally, we’ll see what each instruction does when it is executed.

The Owl-2820 CPU

We’ll base the Owl-2820 CPU on RV32I, as it’s a good example of a simple RISC CPU. In other words, the Owl CPU will have the same thirty two integer registers as RV32I, and it will eventually support the same instructions as RV32I.

However, to start, we’ll implement only the instructions necessary for implementing the Fibonacci sequence that we saw in the previous post.

Implementing registers

The integer registers

In the last post we learned that an RV32I CPU has thirty two integer registers, x0 - x31, and that they also have symbolic names such as sp for the stack pointer, and ra for the return address.

The Owl-2820 CPU will have thirty two integer registers, x0 - x31, just like RV32I. It will use the same register names and usages as RV32I, as that will make it much easier for us to write our code. We’ll ensure that register x0 always contains zero, as this is also the case for RISC-V.

Register Name Description
x0 zero always contains zero (writing to x0 has no effect)
x1 ra return address
x2 sp stack pointer
x3 gp global pointer
x4 tp thread pointer
x5 - x7 t0 - t2 temporary registers
x8 - x9 s0 - s1 callee-saved registers
x10 - x17 a0 - a7 argument registers
x18 - x27 s2 - s11 callee-saved registers
x28 - x31 t3 - t6 temporary registers

As there are thirty two integer registers in the Owl CPU that are all 32-bit, we can represent them in code as a C-style array of uint32_t.

uint32_t x[32]; // The integer registers.

We can refer to any of the x-registers with a simple array lookup. For example, x[0] refers to the zero register, x[1] refers to the return address register, ra, and x[2] refers to the stack pointer register, sp.

It will be easier for us if we can use symbolic register names such as sp. To get a step closer to this, we’ll define the following enum so that, for example, we can write x[sp] rather than x[2].

// Symbolic register names.
enum { zero, ra, sp, gp, tp, t0, t1, t2, s0, s1, a0, a1, a2, a3, a4, a5, a6, a7,
        s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, t3, t4, t5, t6 };

The program counter

There’s one important register that I haven’t mentioned yet, and that’s the program counter register, pc. This 32-bit integer register tells the CPU where to read the next instruction from in memory. Some instruction set architectures use the name instruction pointer, ip, instead of pc. This is arguably a better name, but I’ll stick with pc for consistency with RV32I.

We can represent the program counter in code as a uint32_t.

uint32_t pc; // The program counter.

Encoding and decoding instructions

One thing that we’ll need to decide is how to represent instructions in Owl’s memory. In other words, how does an Owl CPU determine which instruction is which, and how does it know which operands are present?

Opcodes and operands

Each instruction is divided into an opcode, and zero or more operands. The opcode dictates what the instruction does. It also dictates which operands are present in an instruction, and this will be important when it comes to decoding, as we’ll need to know the opcode before we know which operands are present.

When written in text form, instructions consist of an opcode such as add, or bltu, followed by zero or more operands, as shown in the following table.

Instruction Description
add r0, r1, r2 add the values in registers r1 and r2 and stores the result into r0
addi r0, r1, imm12 adds the value in register r1 to an immediate value and stores the result into r0
beq r0, r1, offs12 compares the values in registers r0 and r1 and branches to offs12 if they’re equal
bltu r0, r1, offs12 compares the values in registers r0 and r1 and branches to offs12 if r0 is less than r1
call offs20 calls a function at offs20
j offs20 jumps to offs20
li r0, imm12 loads an immediate value into register r0
lui r0, uimm20 loads an immediate value into the upper bits of register r0
mv r0, r1 copies the value in r1 into r0

The next table describes the meaning of each operand.

Operand Meaning
r0 an integer register when at least one register is used by an instruction
r1 the second integer register when at least two registers are used by an instruction
r2 the third integer register when at least three registers are used by an instruction
imm12 a sign-extended 12-bit immediate value
offs12 a sign-extended 12-bit offset from pc in multiples of two bytes
offs20 a sign-extended 20-bit offset from pc in multiples of two bytes
uimm20 a 20-bit unsigned immediate value

Using a fixed word size

On some architectures, instructions can be variable length. For example, on x86 an instruction can be as short as a single byte, or it can span multiple bytes. On RV32I, and therefore on Owl-2820, instructions are always encoded into a 32-bit word.

We can use this fact, plus our knowledge of which operands will be present for any given opcode, to figure out how to encode and decode each part of an instruction.

Registers are easy. There are thirty two integer registers, so we need five bits to determine which register is being used for each of r0, r1, and r2.

The immediate values and offsets have the number of bits that they require built into their names. For example, uimm20 occupies twenty bits.

So that leaves the opcode. Given that the instruction with the largest number of bits taken up by operands is lui, which uses uimm20 (twenty bits) and r0 (five bits), that leaves us with seven bits to play with for the opcode. That gives us the possibility of 128 unique opcodes. Right now we only need nine, so I think we have sufficient headroom for expansion.

Opcodes

The 7-bit opcode field occupies bits 0 thru 6 of an instruction. The opcode field is always present.

In the table below, the header row contains each field’s name where relevant, followed by the bit positions that it occupies from high to low in square brackets. I’ll use that convention throughout this series of posts.

other [31:7] opcode [6:0]
25 bits 7 bits

Decoding the opcode is a single AND operation.

    opcode = ins & 0x7f;

We can represent each opcode as an enum class. The opcode Illegal, whose value is zero represents an illegal instruction. I’ve chosen this value so that a newly created block of memory will contain nothing but illegal instructions.

// Opcodes.
enum class Opcode : uint32_t
{
    Illegal = 0,
    Add,
    Addi,
    Beq,
    Bltu,
    Call,
    J,
    Li,
    Lui,
    Mv
};

Register operands

There are thirty two registers, so determining which register to use requires five bits per register operand.

For ease of decoding, each register operand occupies the same bit positions in an instruction. For example, operand r0 is used whenever there is at least one register present, and it always occupies bits 7-11 of an instruction.

r0

The 5-bit register operand r0, when present, occupies bits 7-11 of an instruction. It is always present for an instruction that uses one or more registers.

other [31:12] r0 [11:7] opcode [6:0]
20 bits 5 bits 7 bits

Decoding this is a single AND operation and a single shift operation.

    r0 = (ins >> 7) & 0x1f;

r1

The 5-bit register operand r1, when present, occupies bits 12-16 of an instruction. It is always present for an instruction that uses two or more registers.

other [31:17] r1 [16:12] r0 [11:7] opcode [6:0]
15 bits 5 bits 5 bits 7 bits

Decoding this is a single AND operation and a single shift operation.

    r1 = (ins >> 12) & 0x1f;

r2

The 5-bit register operand r2, when present, occupies bits 17-21 of an instruction. It is always present for an instruction that uses three registers.

other [31:17] r2 [21:17] r1 [16:12] r0 [11:7] opcode [6:0]
10 bits 5 bits 5 bits 5 bits 7 bits

Decoding this is a single AND operation and a single shift operation.

    r2 = (ins >> 17) & 0x1f;

Comparing RISC-V encoding with Owl encoding

RISC-V has a particular way of encoding instructions which, as far as I can tell, minimises the number of logic gates required to decode them. I’m not a hardware person, unless you count playing around with 7400 series TTL in my teens or doing the amazing Nand2Tetris course a few years ago, so take that with a pinch of salt. But whatever the reason, from a programmer’s perspective sometimes RISC-V’s encodings for immediate operands and offsets can seem a little weird, with bits being spread all over the place within the instruction.

RISC-V encoding

To illustrate this, let’s look at how RISC-V encodes branch instructions such as beq. These instruction use what RISC-V calls a B-type instruction format. Part of that instruction format includes twelve bits equivalent to Owl’s offs12.

Here’s the B-type instruction format, from the RISC-V User ISA Manual.

imm[12] [31] imm[10:5] [30:25] rs2 [24:20] rs1 [19:15] funct3 [14:12] imm[4:1] [11:8] imm[11] [7] opcode [6:0]
1 bits 6 bits 5 bits 5 bits 3 bits 5 bits 1 bit 7 bits

The fields for the 12-bit offset imm are scattered throughout the instruction. I suspect that decoding this in hardware is not a problem. However, in software, extracting those fields from the instruction and reassembling them into a sign-extended value that we can use takes several operations.

    imm12   = static_cast<int32_t>(ins & 0x80000000) >> 19;   // ins[31] -> sext(imm[12])
    imm11   = static_cast<int32_t>((ins & 0x00000080) << 4);  // ins[7] -> imm[11]
    imm10_5 = static_cast<int32_t>((ins & 0x7e000000) >> 20); // ins[30:25] -> imm[10:5]
    imm4_1  = static_cast<int32_t>((ins & 0x00000f00) >> 7);  // ins[11:8]  -> imm[4:1]
    offs12  = static_cast<uint32_t>(imm12 | imm11 | imm10_5 | imm4_1);

That’s four ANDs, four shifts, and three ORs. Eleven operations. I won’t even try to describe them because it’s just too complicated.

Owl encoding and sign extension

For Owl encoding, I’ve chosen to place all immediate values and offset values in the upper bits of the instruction, as this makes it very easy to sign-extend them simply by doing an arithmetic shift right.

Here’s how Owl encodes offs12.

offs12 [31:20] other [19:7] opcode [6:0]
12 bits 13 bits 7 bits

The fields for the 12-bit offset offs12 are in the upper twelve bits of the instruction.

    offs12 = static_cast<uint32_t>(static_cast<int32_t>(ins & 0xfff00000) >> 19);

This is far simpler. It’s one AND and one arithmetic shift. Two operations.

We use AND to mask out the lower twenty bits, then we shift the signed value right by nineteen positions rather than twenty so that we end up with a sign-extended offset given in bytes rather than in multiples of two bytes.

Immediate and offset operands

imm12

This sign-extended 12-bit immediate value, when present, occupies bits 20-31 of an instruction.

imm12 [31:20] other [19:7] opcode [6:0]
12 bits 13 bits 7 bits

Decoding this is a single AND operation and a single arithmetic shift operation.

    imm12 = static_cast<uint32_t>(static_cast<int32_t>(ins & 0xfff00000) >> 20);

offs12

This sign-extended 12-bit value represents an offset from the program counter, pc, in multiples of two bytes. When present, it occupies bits 20-31 of an instruction.

offs12 [31:20] other [19:7] opcode [6:0]
12 bits 13 bits 7 bits

At first glance, it would appear to be identical to imm12. However, it differs in the way that it is used, as imm12 represents a signed number, whereas offs12 represents multiples of two bytes.

It is extracted by shifting by nineteen bits rather than twenty to convert it from a 12-bit offset in multiples of two bytes into a 13-bit byte offset that always happens to be even.

Decoding this is a single AND operation and a single arithmetic shift operation.

    offs12 = static_cast<uint32_t>(static_cast<int32_t>(ins & 0xfff00000) >> 19);

offs20

This sign-extended 20-bit value represents an offset from the program counter, pc, in multiples of two bytes. When present, it occupies bits 12-31 of an instruction.

offs20 [31:12] other [11:7] opcode [6:0]
20 bits 5 bits 7 bits

It is extracted by shifting by eleven bits rather than twelve to convert it into a 21-bit byte offset that always happens to be even.

Decoding this is a single AND operation and a single arithmetic shift operation.

    offs20 = static_cast<uint32_t>(static_cast<int32_t>(ins & 0xfffff000) >> 11);

uimm20

This an unsigned 20-bit value represents the upper twenty bits of a 32-bit value. When present, it occupies bits 12-31 of an instruction.

uimm20 [31:12] other [11:7] opcode [6:0]
20 bits 5 bits 7 bits

As it represents the upper twenty bits of a 32-bit value, it is extracted by masking out the lower twelve bits of the instruction.

Decoding this is a single AND operation.

    uimm20 = ins & 0xfffff000;

Implementing instructions

In the previous post I said that we’d only need to implement nine opcodes to be able to run the program that outputs the Fibonacci sequence. As a reminder, here they are.

Opcode Meaning Description
add add registers adds the values in two registers and stores the result into another register
addi add immediate adds an immediate value and a register and stores the result into another register
beq branch if equal compares the content of two registers and jumps to a different location if they’re equal
bltu branch if less than (unsigned) compares the unsigned values in two registers and jumps to a different location if the first is less than the second
call call calls a function
j jump jumps to a location
li load immediate loads an immediate value into a register
lui load upper immediate loads an immediate value into the upper bits of a register
mv move copies the value in one register into another register

Instructions, as we’ve learned, consist of an opcode and zero or more operands. Each instruction performs a very simple operation on its operands.

The nature of that operation is dictated by the opcode, and what it operates on is dictated by the operands. The one key point that I’d like to emphasise is that the operation itself is very simple - I can’t stress this enough.

In most cases operands are given explicitly, for example, as register names or immediate values. However, some opcodes have implicit operands. For example, the J opcode implicitly changes the value in the program counter, pc.

Instruction descriptions

Before we can implement each instruction, we will need to know what they do. Let’s look at them in turn. For each instruction, we’ll describe the following:

Format

How the instruction is written in source form. For example: add r0, r1, r2.

Encoding

A table showing how the opcode and its operands are encoded in a 32-bit instruction. We’ve already looked at how opcodes and individual operands are encoded, so this should be reasonably easy to follow.

The header row contains each field’s name, followed by the bit positions that it occupies from high to low in square brackets.

The second row contains each field’s value (where appropriate) followed by the number of bits that it occupies.

For example, here’s how add is encoded.

unused [31:22] r2 [21:17] r1 [16:12] r0 [11:7] opcode [6:0]
10 bits 5 bits 5 bits 5 bits Add : 7 bits

Description

A short description of what the instruction does, with reference to the format.

Possible implementation

A short code fragment describing how the instruction might be implemented, written in terms of the uint32_t representations of the x-registers and the program counter, pc, that we saw earlier; and in terms of operand encodings such as uimm20 that we’ve just been looking at.

Given that I’ve emphasised that each instruction performs a simple operation on its operands, I hope that you won’t be too surprised by the fact that the implementation of each instruction is typically only a few lines of C++.

add - add registers

Format

add r0, r1, r2

Encoding

unused [31:22] r2 [21:17] r1 [16:12] r0 [11:7] opcode [6:0]
10 bits 5 bits 5 bits 5 bits Add : 7 bits

Description

Adds the value in register r1 to the value in register r2, storing the result into register r0.

Possible implementation

As register x0 must always contain zero, we make sure that we clear it after any assignment that could potentially change its value.

    x[r0] = x[r1] + x[r2];
    x[0] = 0;

addi - add immediate

Format

addi r0, r1, imm12

Encoding

imm12 [31:20] unused [19:17] r1 [16:12] r0 [11:7] opcode [6:0]
12 bits 3 bits 5 bits 5 bits Addi : 7 bits

Description

Adds the value in register r1 to the sign-extended 12-bit immediate value imm12, storing the result into register r0.

Possible implementation

    x[r0] = x[r1] + imm12;
    x[0] = 0;

beq - branch if equal

Format

beq r0, r1, offs12

Encoding

offs12 [31:20] unused [19:17] r1 [16:12] r0 [11:7] opcode [6:0]
12 bits 3 bits 5 bits 5 bits Beq : 7 bits

Description

This is a conditional branch instruction which takes the branch if the value in register r0 is equal to the value in register r1. The branch is given as offs12, which is a sign-extended 12-bit offset in multiples of two bytes relative to the address of the branch instruction.

Possible implementation

    if (x[r0] == x[r1])
    {
        nextPc = pc + offs12;
    }

bltu - branch if less than (unsigned)

Format

bltu r0, r1, offs12

Encoding

offs12 [31:20] unused [19:17] r1 [16:12] r0 [11:7] opcode [6:0]
12 bits 3 bits 5 bits 5 bits Bltu : 7 bits

Description

This is a conditional branch instruction which takes the branch if the value in register r0 is less than the value in register r1 when using an unsigned comparison. The branch is given as offs12, which is a sign-extended 12-bit offset in multiples of two bytes relative to the address of the branch instruction.

Possible implementation

    if (x[r0] < x[r1])
    {
        nextPc = pc + offs12;
    }

call - call a subroutine

Format

call offs20

Encoding

offs20 [31:12] unused [11:7] opcode [6:0]
20 bits 5 bits Call : 7 bits

Description

Sets the value in the return address register ra to the address of the instruction immediately following this one, then branches by the sign-extended offset offs20, which is a sign-extended 20-bit offset in multiples of two bytes relative to the address of the call instruction.

Possible implementation

    x[ra] = pc + 4;
    nextPc = pc + offs20;

j - jump

Format

j offs20

Encoding

offs20 [31:12] unused [11:7] opcode [6:0]
20 bits 5 bits J : 7 bits

Description

Branches by the sign-extended offset offs20, which is a sign-extended 20-bit offset in multiples of two bytes relative to the address of the jump instruction.

Possible implementation

    nextPc = pc + offs20;

li - load immediate

Format

li r0, imm12

Encoding

imm12 [31:20] unused [19:12] r0 [11:7] opcode [6:0]
12 bits 8 bits 5 bits Li : 7 bits

Description

Stores the sign-extended 12-bit value imm12 into register r0.

Possible implementation

    x[r0] = imm12;
    x[0] = 0;

lui - load upper immediate

Format

lui r0, uimm20

Encoding

uimm20 [31:12] r0 [11:7] opcode [6:0]
20 bits 5 bits Lui : 7 bits

Description

Stores the 20-bit immediate value uimm20 into the upper twenty bits of register r0, filling its lowest twelve bits with zeros.

Possible implementation

    x[r0] = uimm20;
    x[0] = 0;

mv - move

Format

mv r0, r1

Encoding

unused [31:17] r1 [16:12] r0 [11:7] opcode [6:0]
15 bits 5 bits 5 bits Mv : 7 bits

Description

Copies the value in register r1 into register r0.

Possible implementation

    x[r0] = x[r1];
    x[0] = 0;

And finally…

This was a long post in which we covered a lot of ground.

We learned that the Owl-2820 CPU is based on RV32I and that it has the same integer registers, with the same uses.

We learned how Owl represents instructions in memory, and that Owl instructions have a different encoding than RV32I instructions, because Owl’s encoding is easier to deal with in software.

And finally, we learned some possible implementations for each instruction.

This puts us in a good position for the next post in which we’ll apply what we’ve learned to implementing a main loop for the Owl-2820 CPU.