7 April, 2024

Implementing just enough of the Owl-2820 CPU to run Fibonacci.

Part 2 Implementing the main loop and an assembler

In Part 0 of this series, we looked at a C implementation of the Fibonacci sequence compiled to RISC-V assembly language and saw that it only used nine opcodes.

In Part 1 we were introduced to the Owl-2820 CPU. We saw how it implements registers, how it represents instructions in memory, and how each individual opcode is implemented.

By now we’ve learned:

what the Fibonacci sequence looks like when compiled to assembly language
how to represent the Owl-2820 CPU’s registers in code
how to encode and decode Owl instructions into opcodes and operands
how to implement the nine opcodes required to implement the Fibonacci sequence

Now we’re going to put it all together and implement a main loop that will read instructions from memory, decode them into an opcode and its operands, then execute them.

We will also implement a mini assembler so that we can write code for the Owl CPU to execute.

The main loop

When a CPU fetches an instruction from memory, it reads it from the address pointed to by the program counter, then it decodes it and executes it. When it has finished executing the instruction then it sets the program counter to the address of the next instruction, and continues.

In our initial implementation of the Owl CPU, we’re going to have it stop when it reaches an illegal instruction. Later, we’ll expand the code so that it’ll stop for other reasons, for example, after it has run a given number of instructions.

By the end of this post, we’ll have implemented an Owl CPU that can read Owl instructions from memory, decode the nine opcodes required to implement Fibonacci, and execute them.

Implementing only nine opcodes may not seem like much, but as we will expand on this over the course of this series, it will be worth taking your time to understand how it works.

However, if you’re pressed for time, then you can see the entire program and its output on Compiler Explorer.

Main loop - outline

We can write the main loop as a Run() function that contains a while-loop that terminates when it hits an illegal instruction.

void Run(const uint32_t* code)
{
    bool done = false;

    while (!done)
    {
        // Fetch a 32-bit word from memory at the address pointed to by the program counter.

        // Decode the word to extract the opcode.

        // Dispatch it and execute it.

        // Stop the loop if we hit an illegal instruction.
    }
}

The code parameter represents a view of memory as an array of 32-bit words. Memory in Owl is byte-addressable, but for simplicity, we’re going to represent memory that contains code as word-addressable.

Fetching

We need a representation of the program counter, pc. This register is typically incremented after each instruction to go to the next instruction, but some instructions such as beq, bltu, call and j can set it to another value. To accommodate this, we’ll use another 32-bit value, nextPc, to represent the address that pc will be set to at the start of each instruction.

When our Owl CPU goes to fetch another instruction from memory, it first copies the value in nextPc into pc then immediately updates nextPc to point to the next instruction. This value will be used next time around the loop, unless it is modified by an instruction such as a conditional branch, a call, or a jump. This implementation means that pc is only ever updated at the start of the loop, so that its value can be used by any instructions that need to read it.

After implementing fetch, Run() looks like this.

void Run(const uint32_t* code)
{
    // Set pc and nextPc to their initial values.
    uint32_t pc = 0;     // The program counter.
    uint32_t nextPc = 0; // The address of the next instruction.

    bool done = false;

    while (!done)
    {
        // Fetch a 32-bit word from memory at the address pointed to by the program counter.
        pc = nextPc;
        nextPc += wordSize;
        const uint32_t ins = code[pc / wordSize];

        // Decode the word to extract the opcode.

        // Dispatch it and execute it.

        // Stop the loop if we hit an illegal instruction.
    }
}

We’ve declared pc and nextPc, and added the code to fetch a 32-bit word from memory. As nextPc represents a register which points to byte-addressable memory, and code is addressed by 32-bit words, we have to increment nextPc by the word size. Likewise, we need to divide pc by the word size when reading from code.

Decoding and dispatching

Once the word containing the instruction has been fetched from memory, we need to decode it, then dispatch it to an instruction handler for the opcode.

The very first part of this is to extract the opcode which, as we saw in Part 1, is a simple operation.

    Opcode opcode = Opcode(ins & 0x7f);

We can use a switch statement to dispatch on the opcode to an instruction handler. Each instruction handler will decode the operands required by the opcode, then perform a simple opcode-specific operation on them, for example, adding them together.

After adding the code to decode and dispatch, Run() looks like this.

void Run(const uint32_t* code)
{
    // Set pc and nextPc to their initial values.
    uint32_t pc = 0;     // The program counter.
    uint32_t nextPc = 0; // The address of the next instruction.

    bool done = false;

    while (!done)
    {
        // Fetch a 32-bit word from memory at the address pointed to by the program counter.
        pc = nextPc;
        nextPc += wordSize;
        const uint32_t ins = code[pc / wordSize];

        // Decode the word to extract the opcode.
        const Opcode opcode = Opcode(ins & 0x7f);

        // Dispatch it and execute it.
        switch (opcode)
        {
        case Opcode::Add:
            // Implement add.
            break;            

        case Opcode::Addi
            // Implement addi.
            break;            

            // ... etc

        // Stop the loop if we hit an illegal instruction.
        default:
            // If we don't recognise the opcode then by default we have an illegal instruction.
            done = true;
        }
    }
}

Registers

We saw in Part 1 that many of the instructions operate on Owl’s integer registers and that we can represent registers with a C-style array.

    uint32_t x[32] = {}; // The integer registers.

Let’s introduce them before the start of the loop.

void Run(const uint32_t* code)
{
    // Set pc and nextPc to their initial values.
    uint32_t pc = 0;     // The program counter.
    uint32_t nextPc = 0; // The address of the next instruction.

    // Set all the integer registers to zero.
    uint32_t x[32] = {}; // The integer registers.

Instructions

Implementing each instruction is straightforward. We already know how to decode the operands, and we saw in Part 1 how each instruction is implemented.

To illustrate just how straightforward, here is the handler for add in its entirety. It decodes the operands r0, r1, and r2, then it uses them as indexes into the x-registers array, adding x[r1] to x[r2], then storing the result in x[r0].

    case Opcode::Add: {
        // Implement `Add`.
        const uint32_t r0 = (ins >> 7) & 0x1f;  // Decode r0.
        const uint32_t r1 = (ins >> 12) & 0x1f; // Decode r1.
        const uint32_t r2 = (ins >> 17) & 0x1f; // Decode r2.
        x[r0] = x[r1] + x[r2];                  // Add the two registers.
        x[0] = 0;                               // Ensure x0 is always zero.
        break;
    }

Handlers for other opcodes are similar. For example, here’s the code to implement mv, which copies the value in register r1 into register r0.

    case Opcode::Mv: {
        const uint32_t r0 = (ins >> 7) & 0x1f;  // Decode r0.
        const uint32_t r1 = (ins >> 12) & 0x1f; // Decode r1.
        x[r0] = x[r1];                          // Copy r1 into r0.
        x[0] = 0;                               // Ensure x0 is always zero. 
        break;
    }

And here’s the code for lui, which loads a 20-bit immediate value into the upper bits of register r0.

    case Opcode::Lui: {
        const uint32_t r0 = (ins >> 7) & 0x1f;    // Decode r0.
        const uint32_t uimm20 = ins & 0xfffff000; // Decode uimm20;                                  
        x[r0] = uimm20;                           // Load uimm20 into r0.
        x[0] = 0;                                 // Ensure x0 is always zero.
        break;
    }

Decoding operands

After only three opcodes, you can probably see that there’s a lot of unnecessary, error-prone repetition. We’ve seen the code to extract operand r0 three times already.

For the most part, the other instructions are also handled by following the pattern of decoding the same handful of operands before executing the instruction. There is so much similarity that it’s worth taking a few minutes to extract the task of decoding operands into short functions - one per operand - that the compiler will inline.

I’ve put these functions into the decode namespace.

namespace decode
{
    uint32_t r0(const uint32_t ins)
    {
        return (ins >> 7) & 0x1f;
    }

    uint32_t r1(const uint32_t ins)
    {
        return (ins >> 12) & 0x1f;
    }

    uint32_t r2(const uint32_t ins)
    {
        return (ins >> 17) & 0x1f;
    }

    uint32_t imm12(const uint32_t ins)
    {
        return static_cast<uint32_t>(static_cast<int32_t>(ins & 0xfff00000) >> 20);
    }

    uint32_t offs12(const uint32_t ins)
    {
        return static_cast<uint32_t>(static_cast<int32_t>(ins & 0xfff00000) >> 19);
    }

    uint32_t offs20(const uint32_t ins)
    {
        return static_cast<uint32_t>(static_cast<int32_t>(ins & 0xfffff000) >> 11);
    }

    uint32_t uimm20(const uint32_t ins)
    {
        return ins & 0xfffff000;
    }
}

We’ll bring in the decode namespace at the top of Run() so that we don’t have to prefix calls to these functions with decode. This lets us write things like r0(ins) instead of decode::r0(ins) throughout Run()’s body.

void Run(const uint32_t* code)
{
    using namespace decode;

Refactored instructions

After refactoring to use the new decode functions, the handler for add simplifies to this.

    case Opcode::Add: {
        x[r0(ins)] = x[r1(ins)] + x[r2(ins)];  // Add the two registers.
        x[0] = 0;                              // Ensure x0 is always zero.
        break;
    }

Similarly, here’s mv.

    case Opcode::Mv: {
        x[r0(ins)] = x[r1(ins)]; // Copy r1 into r0.
        x[0] = 0;                // Ensure x0 is always zero.
        break;
    }

And here’s lui.

    case Opcode::Lui: {
        x[r0(ins)] = uimm20(ins); // Load uimm20 into r0.
        x[0] = 0;                 // Ensure x0 is always zero.
        break;
    }

Most of the other instructions are very similar, so in theory we can now easily implement the rest of the main loop.

Implementing printf()

However, there is a big gotcha. Our implementation of Fibonacci from Part 0 involved a call to printf(). That calls into a subroutine that we haven’t seen yet. Clearly implementing printf() in Owl machine code that runs on the Owl VM will need implementations for more than the nine opcodes that we’ve seen so far.

To keep things simple at this stage, I’m going to propose that we cheat, and take advantage of the fact that as there’s only one call instruction, we can temporarily hard-code it to do a printf().

Here’s that implementation of call. It has nothing to do with calling a subroutine, but instead uses the values in the two argument registers a1 and a2 as arguments to std::format as shown below.

    case Opcode::Call: {
        std::cout << std::format("fib({}) = {}\n", x[a1], x[a2]);
        break;
    }

Tradeoffs

This may seem somewhat specialized, but surprisingly this isn’t without precedent. Some old computers had unusually specialized instruction sets. For example, this British computer, the EMIDEC, had an instruction set that included pounds sterling input and output. Bear in mind that this was from the 1960s when there were twenty shillings in a pound and twelve pence in a shilling and you can get an idea just how specialized this machine was.

There are tradeoffs between specialized and general purpose instructions. A specialized instruction set is tempting, but I suspect that the EMIDEC probably became somewhat useless after 1971 when Britain switched to a decimal pound.

On the other hand, I feel that a temporary hack to implement printf() is a tradeoff that I’m willing to make.

Run()

Our final implementation of Run() looks like this.

void Run(const uint32_t* code)
{
    using namespace decode;
    
    // Set pc and nextPc to their initial values.
    uint32_t pc = 0;     // The program counter.
    uint32_t nextPc = 0; // The address of the next instruction.

    // Set all the integer registers to zero.
    uint32_t x[32] = {}; // The integer registers.

    constexpr uint32_t wordSize = sizeof(uint32_t);    

    bool done = false;

    while (!done)
    {
        // Fetch a 32-bit word from memory at the address pointed to by the program counter.
        pc = nextPc;
        nextPc += wordSize;
        const uint32_t ins = code[pc / wordSize];

        // Decode the word to extract the opcode.
        const Opcode opcode = Opcode(ins & 0x7f);

        // Dispatch it and execute it.
        switch (opcode)
        {
        case Opcode::Add: {
            x[r0(ins)] = x[r1(ins)] + x[r2(ins)]; // Add the two registers.
            x[0] = 0;                             // Ensure x0 is always zero.
            break;
        }

        case Opcode::Addi: {
            x[r0(ins)] = x[r1(ins)] + imm12(ins); // Perform the addition.
            x[0] = 0;                             // Ensure x0 is always zero.
            break;
        }

        case Opcode::Beq: {
            // Perform the comparison.
            if (x[r0(ins)] == x[r1(ins)])
            {
                nextPc = pc + offs12(ins); // Take the branch.
            }
            break;
        }

        case Opcode::Bltu: {
            // Perform the comparison.
            if (x[r0(ins)] < x[r1(ins)])
            {
                nextPc = pc + offs12(ins); // Take the branch.
            }
            break;
        }

        case Opcode::Call: {
            // Do a hard-coded printf().
            std::cout << std::format("fib({}) = {}\n", x[a1], x[a2]);
            break;
        }

        case Opcode::J: {
            nextPc = pc + offs20(ins); // Branch.
            break;
        }

        case Opcode::Li: {
            x[r0(ins)] = imm12(ins);
            x[0] = 0; // Ensure x0 is always zero.
            break;
        }

        case Opcode::Lui: {
            x[r0(ins)] = uimm20(ins);
            x[0] = 0; // Ensure x0 is always zero.
            break;
        }

        case Opcode::Mv: {
            x[r0(ins)] = x[r1(ins)];
            x[0] = 0; // Ensure x0 is always zero.
            break;
        }

        // Stop the loop if we hit an illegal instruction.
        default:
            // If we don't recognise the opcode then by default we have an illegal instruction.
            done = true;
        }
    }
}

An assembler

If you cast your mind back to Part 0, you may remember that the RISC-V assembly language for the Fibonacci program looked like this.

        li      s0, 0                       ; i = 0
        li      s2, 2                       ; s2 = 2
        lui     a0, %hi(format_str)
        addi    s1, a0, %lo(format_str)     ; s1 = the address of the printf format string
        li      s3, 48                      ; s3 = 48
        li      s4, 1                       ; s4 = 1
        j       fib                         ; go to fib
print_loop:
        mv      a0, s1                      ; arg0 = the address of the printf format string
        mv      a1, s0                      ; arg1 = i
                                            ; arg2 is already set to current
        call    printf                      ; call printf
        addi    s0, s0, 1                   ; i = i + 1
        beq     s0, s3, done                ; if i == 48 go to done
fib:
        mv      a2, s0                      ; current = i
        bltu    s0, s2, print_loop          ; if i < 2 go to print_loop
        li      a0, 0                       ; previous = 0
        li      a2, 1                       ; current = 1
        mv      a1, s0                      ; n = i
fib_loop:
        mv      a3, a2                      ; tmp = current
        addi    a1, a1, -1                  ; n = n - 1
        add     a2, a0, a2                  ; current = current + prev
        mv      a0, a3                      ; previous = tmp
        bltu    s4, a1, fib_loop            ; if n > 1 go to fib_loop
        j       print_loop                  ; go to print_loop
done:
        li      a0, 0                       ; set the return value of main() to 0

We need to convert this from text into instructions that will be understood by the Owl-2820 CPU. That sounds like the job for an assembler.

Rather than going the whole hog and writing a full-blown assembler, we can create a simple Assembler class whose member functions are named after opcodes. Each of these member functions will encode the corresponding opcode and its operands into a instruction, then emit the instruction as Owl-2820 machine code.

That would allow us to write the assembly language program in C++ like this.

    Assembler a;

// main:
    a.Li(s0, 0);                // li   s0, 0                   ; i = 0
    a.Li(s2, 2);                // li   s2, 2                   ; s2 = 2
    a.Lui(a0, 1);               // lui  a0, %hi(format_str)
    a.Addi(s1, a0, -548);       // addi s1, a0, %lo(format_str) ; s1 = the address of the printf format string
    a.Li(s3, 48);               // li   s3, 48                  ; s3 = 48
    a.Li(s4, 1);                // li   s4, 1                   ; s4 = 1
    a.J(fib);                   // j    fib                     ; go to fib
// print_loop:
    a.Mv(a0, s1);               // mv   a0, s1                  ; arg0 = the address of the printf format string
    a.Mv(a1, s0);               // mv   a1, s0                  ; arg1 = i (arg2 contains current)
    a.Call(printf);             // call printf                  ; call printf
    a.Addi(s0, s0, 1);          // addi s0, s0, 1               ; i = i + 1
    a.Beq(s0, s3, done);        // beq  s0, s3, done            ; if i == 48 go to done
// fib:
    a.Mv(a2, s0);               // mv   a2, s0                  ; current = i
    a.Bltu(s0, s2, print_loop1);// bltu s0, s2, print_loop      ; if i < 2 go to print_loop
    a.Li(a0, 0);                // li   a0, 0                   ; previous = 0
    a.Li(a2, 1);                // li   a2, 1                   ; current = 1
    a.Mv(a1, s0);               // mv   a1, s0                  ; n = i
// fib_loop:
    a.Mv(a3, a2);               // mv   a3, a2                  ; tmp = current
    a.Addi(a1, a1, -1);         // addi a1, a1, -1              ; n = n - 1
    a.Add(a2, a0, a2);          // add  a2, a0, a2              ; current = current + prev
    a.Mv(a0, a3);               // mv   a0, a3                  ; previous = tmp
    a.Bltu(s4, a1, fib_loop);   // bltu s4, a1, fib_loop        ; if n > 1 go to fib_loop
    a.J(print_loop2);           // j    print_loop              ; go to print_loop
// done:
    a.Li(a0, 0);                // li   a0, 0                   ; set the return value of main() to 0

This uses the Assembler class to encode one instruction at a time.

Encoding instructions

In the Assembler class, each opcode is represented by a member function that encodes it along with its operands. The individually encoded components are then combined by ORing them together before the resulting instruction is emitted as code.

Here’s the code that encodes each instruction. I’ll explain the encode:: functions and Emit() a little later.

    void Add(uint32_t r0, uint32_t r1, uint32_t r2)
    {
        // add r0, r1, r2
        Emit(encode::opc(Opcode::Add) | encode::r0(r0) | encode::r1(r1) | encode::r2(r2));
    }

    void Addi(uint32_t r0, uint32_t r1, int32_t imm12)
    {
        // addi r0, r1, imm12
        Emit(encode::opc(Opcode::Addi) | encode::r0(r0) | encode::r1(r1) | encode::imm12(imm12));
    }

    void Beq(uint32_t r0, uint32_t r1, int32_t offs12)
    {
        // beq r0, r1, offs12
        Emit(encode::opc(Opcode::Beq) | encode::r0(r0) | encode::r1(r1) | encode::offs12(offs12));
    }

    void Bltu(uint32_t r0, uint32_t r1, int32_t offs12)
    {
        // bltu r0, r1, offs12
        Emit(encode::opc(Opcode::Bltu) | encode::r0(r0) | encode::r1(r1) | encode::offs12(offs12));
    }

    void Call(int32_t offs20)
    {
        // call offs20
        Emit(encode::opc(Opcode::Call) | encode::offs20(offs20));
    }

    void J(int32_t offs20)
    {
        // j offs20
        Emit(encode::opc(Opcode::J) | encode::offs20(offs20));
    }

    void Li(uint32_t r0, int32_t imm12)
    {
        // li r0, imm12
        Emit(encode::opc(Opcode::Li) | encode::r0(r0) | encode::imm12(imm12));
    }

    void Lui(uint32_t r0, uint32_t uimm20)
    {
        // lui r0, uimm20
        Emit(encode::opc(Opcode::Lui) | encode::r0(r0) | encode::uimm20(uimm20));
    }

    void Mv(uint32_t r0, uint32_t r1)
    {
        // mv r0, r1
        Emit(encode::opc(Opcode::Mv) | encode::r0(r0) | encode::r1(r1));
    }

Encoding opcodes and operands

The opcode and operands are encoded by standalone functions in the encode namespace. Each of these functions encodes its namesake, so opc() encodes an opcode, r0() encodes register r0, and so on.

Encoding operands is the inverse of decoding them, so there’s not much to explain here.

namespace encode
{
    uint32_t opc(Opcode opcode)
    {
        return uint32_t(opcode);
    }

    uint32_t r0(uint32_t r)
    {
        return (r & 0x1f) << 7;
    }

    uint32_t r1(uint32_t r)
    {
        return (r & 0x1f) << 12;
    }

    uint32_t r2(uint32_t r)
    {
        return (r & 0x1f) << 17;
    }

    uint32_t imm12(int32_t imm12)
    {
        return static_cast<uint32_t>(imm12 << 20);
    }

    uint32_t offs12(int32_t offs12)
    {
        // Assumes that offs12 is pre-multiplied to a byte offset.
        return static_cast<uint32_t>(offs12 << 19) & 0xfff00000;
    }

    uint32_t offs20(int32_t offs20)
    {
        // Assumes that offs20 is pre-multiplied to a byte offset.
        return static_cast<uint32_t>(offs20 << 11) & 0xfffff000;
    }

    uint32_t uimm20(uint32_t uimm20)
    {
        return (uimm20 << 12) & 0xfffff000;
    }
}

Emitting code

The Emit() member function pushes a uint32_t representing an encoded instruction, onto the end of a vector representing the code.

private:
    std::vector<uint32_t> code_;

    // ...

public:    
    void Emit(uint32_t u)
    {
        code_.push_back(u);
    }

The assembled Owl-2820 machine code is accessed with the Code() member function.

    const std::vector<uint32_t>& Code() const
    {
        return code_;
    }

Running the Fibonacci sequence on the Owl-2820 CPU

Here’s the end result. In a little more than 300 lines of C++ we’ve implemented a simple assembler that assembles the Fibonacci program into Owl-2820 machine code, and an emulator that runs that machine code on an emulated Owl-2820 CPU.

You can see the results of running this on Compiler Explorer.

#include <cstdint>
#include <format>
#include <iostream>
#include <string>
#include <vector>

// Symbolic register names.
enum { zero, ra, sp, gp, tp, t0, t1, t2, s0, s1, a0, a1, a2, a3, a4, a5, a6, a7,
       s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, t3, t4, t5, t6 };

// Opcodes.
enum class Opcode : uint32_t
{
    Illegal = 0,
    Add,
    Addi,
    Beq,
    Bltu,
    Call,
    J,
    Li,
    Lui,
    Mv
};

namespace decode
{
    uint32_t r0(const uint32_t ins)
    {
        return (ins >> 7) & 0x1f;
    }

    uint32_t r1(const uint32_t ins)
    {
        return (ins >> 12) & 0x1f;
    }

    uint32_t r2(const uint32_t ins)
    {
        return (ins >> 17) & 0x1f;
    }

    uint32_t imm12(const uint32_t ins)
    {
        return static_cast<uint32_t>(static_cast<int32_t>(ins & 0xfff00000) >> 20);
    }

    uint32_t offs12(const uint32_t ins)
    {
        return static_cast<uint32_t>(static_cast<int32_t>(ins & 0xfff00000) >> 19);
    }

    uint32_t offs20(const uint32_t ins)
    {
        return static_cast<uint32_t>(static_cast<int32_t>(ins & 0xfffff000) >> 11);
    }

    uint32_t uimm20(const uint32_t ins)
    {
        return ins & 0xfffff000;
    }
}

void Run(const uint32_t* code)
{
    using namespace decode;
    
    // Set pc and nextPc to their initial values.
    uint32_t pc = 0;     // The program counter.
    uint32_t nextPc = 0; // The address of the next instruction.


    // Set all the integer registers to zero.
    uint32_t x[32] = {}; // The integer registers.

    constexpr uint32_t wordSize = sizeof(uint32_t);    

    bool done = false;

    while (!done)
    {
        // Fetch a 32-bit word from memory at the address pointed to by the program counter.
        pc = nextPc;
        nextPc += wordSize;
        const uint32_t ins = code[pc / wordSize];

        // Decode the word to extract the opcode.
        const Opcode opcode = Opcode(ins & 0x7f);

        // Dispatch it and execute it.
        switch (opcode)
        {
        case Opcode::Add: {
            x[r0(ins)] = x[r1(ins)] + x[r2(ins)];  // Add the two registers.
            x[0] = 0;                              // Ensure x0 is always zero.
            break;
        }

        case Opcode::Addi: {
            x[r0(ins)] = x[r1(ins)] + imm12(ins); // Perform the addition.
            x[0] = 0;                             // Ensure x0 is always zero.
            break;
        }

        case Opcode::Beq: {
            // Perform the comparison.
            if (x[r0(ins)] == x[r1(ins)])
            {
                nextPc = pc + offs12(ins); // Take the branch.
            }
            break;
        }

        case Opcode::Bltu: {
            // Perform the comparison.
            if (x[r0(ins)] < x[r1(ins)])
            {
                nextPc = pc + offs12(ins); // Take the branch.
            }
            break;
        }

        case Opcode::Call: {
            // Do a hard-coded printf().
            std::cout << std::format("fib({}) = {}\n", x[a1], x[a2]);
            break;
        }

        case Opcode::J: {
            nextPc = pc + offs20(ins); // Branch.
            break;
        }

        case Opcode::Li: {
            x[r0(ins)] = imm12(ins);
            x[0] = 0; // Ensure x0 is always zero.
            break;
        }

        case Opcode::Lui: {
            x[r0(ins)] = uimm20(ins);
            x[0] = 0; // Ensure x0 is always zero.
            break;
        }

        case Opcode::Mv: {
            x[r0(ins)] = x[r1(ins)];
            x[0] = 0; // Ensure x0 is always zero.
            break;
        }

        // Stop the loop if we hit an illegal instruction.
        default:
            // If we don't recognise the opcode then by default we have an illegal instruction.
            done = true;
        }
    }
}

namespace encode
{
    uint32_t opc(Opcode opcode)
    {
        return uint32_t(opcode);
    }

    uint32_t r0(uint32_t r)
    {
        return (r & 0x1f) << 7;
    }

    uint32_t r1(uint32_t r)
    {
        return (r & 0x1f) << 12;
    }

    uint32_t r2(uint32_t r)
    {
        return (r & 0x1f) << 17;
    }

    uint32_t imm12(int32_t imm12)
    {
        return static_cast<uint32_t>(imm12 << 20);
    }

    uint32_t offs12(int32_t offs12)
    {
        // Assumes that offs12 is pre-multiplied to a byte offset.
        return static_cast<uint32_t>(offs12 << 19) & 0xfff00000;
    }

    uint32_t offs20(int32_t offs20)
    {
        // Assumes that offs20 is pre-multiplied to a byte offset.
        return static_cast<uint32_t>(offs20 << 11) & 0xfffff000;
    }

    uint32_t uimm20(uint32_t uimm20)
    {
        return (uimm20 << 12) & 0xfffff000;
    }
}

class Assembler
{
    std::vector<uint32_t> code_;

public:

    const std::vector<uint32_t>& Code() const { return code_; }

    void Emit(uint32_t u)
    {
        code_.push_back(u);
    }

    void Add(uint32_t r0, uint32_t r1, uint32_t r2)
    {
        // add r0, r1, r2
        Emit(encode::opc(Opcode::Add) | encode::r0(r0) | encode::r1(r1) | encode::r2(r2));
    }

    void Addi(uint32_t r0, uint32_t r1, int32_t imm12)
    {
        // addi r0, r1, imm12
        Emit(encode::opc(Opcode::Addi) | encode::r0(r0) | encode::r1(r1) | encode::imm12(imm12));
    }

    void Beq(uint32_t r0, uint32_t r1, int32_t offs12)
    {
        // beq r0, r1, offs12
        Emit(encode::opc(Opcode::Beq) | encode::r0(r0) | encode::r1(r1) | encode::offs12(offs12));
    }

    void Bltu(uint32_t r0, uint32_t r1, int32_t offs12)
    {
        // bltu r0, r1, offs12
        Emit(encode::opc(Opcode::Bltu) | encode::r0(r0) | encode::r1(r1) | encode::offs12(offs12));
    }

    void Call(int32_t offs20)
    {
        // call offs20
        Emit(encode::opc(Opcode::Call) | encode::offs20(offs20));
    }

    void J(int32_t offs20)
    {
        // j offs20
        Emit(encode::opc(Opcode::J) | encode::offs20(offs20));
    }

    void Li(uint32_t r0, int32_t imm12)
    {
        // li r0, imm12
        Emit(encode::opc(Opcode::Li) | encode::r0(r0) | encode::imm12(imm12));
    }

    void Lui(uint32_t r0, uint32_t uimm20)
    {
        // lui r0, uimm20
        Emit(encode::opc(Opcode::Lui) | encode::r0(r0) | encode::uimm20(uimm20));
    }

    void Mv(uint32_t r0, uint32_t r1)
    {
        // mv r0, r1
        Emit(encode::opc(Opcode::Mv) | encode::r0(r0) | encode::r1(r1));
    }
};

std::vector<uint32_t> Assemble()
{
    Assembler a;

    // Offsets to labels.
    const int32_t fib = 24;
    const int32_t print_loop1 = -24;
    const int32_t print_loop2 = -60;
    const int32_t printf = 0; // No value, because we're going to cheat.
    const int32_t done = 48;
    const int32_t fib_loop = -16;

// main:
    a.Li(s0, 0);                // li   s0, 0                   ; i = 0
    a.Li(s2, 2);                // li   s2, 2                   ; s2 = 2
    a.Lui(a0, 1);               // lui  a0, %hi(format_str)
    a.Addi(s1, a0, -548);       // addi s1, a0, %lo(format_str) ; s1 = the address of the printf format string
    a.Li(s3, 48);               // li   s3, 48                  ; s3 = 48
    a.Li(s4, 1);                // li   s4, 1                   ; s4 = 1
    a.J(fib);                   // j    fib                     ; go to fib
// print_loop:
    a.Mv(a0, s1);               // mv   a0, s1                  ; arg0 = the address of the printf format string
    a.Mv(a1, s0);               // mv   a1, s0                  ; arg1 = i (arg2 contains current)
    a.Call(printf);             // call printf                  ; call printf
    a.Addi(s0, s0, 1);          // addi s0, s0, 1               ; i = i + 1
    a.Beq(s0, s3, done);        // beq  s0, s3, done            ; if i == 48 go to done
// fib:
    a.Mv(a2, s0);               // mv   a2, s0                  ; current = i
    a.Bltu(s0, s2, print_loop1);// bltu s0, s2, print_loop      ; if i < 2 go to print_loop
    a.Li(a0, 0);                // li   a0, 0                   ; previous = 0
    a.Li(a2, 1);                // li   a2, 1                   ; current = 1
    a.Mv(a1, s0);               // mv   a1, s0                  ; n = i
// fib_loop:
    a.Mv(a3, a2);               // mv   a3, a2                  ; tmp = current
    a.Addi(a1, a1, -1);         // addi a1, a1, -1              ; n = n - 1
    a.Add(a2, a0, a2);          // add  a2, a0, a2              ; current = current + prev
    a.Mv(a0, a3);               // mv   a0, a3                  ; previous = tmp
    a.Bltu(s4, a1, fib_loop);   // bltu s4, a1, fib_loop        ; if n > 1 go to fib_loop
    a.J(print_loop2);           // j    print_loop              ; go to print_loop
// done:
    a.Li(a0, 0);                // li   a0, 0                   ; set the return value of main() to 0

    // Emit an illegal instruction so that we have something to stop us.
    a.Emit(0);

    return a.Code();
}

int main()
{
    auto code = Assemble();
    Run(code.data());
}

Here’s the (shortened) output.

fib(0) = 0
fib(1) = 1
fib(2) = 1
fib(3) = 2
fib(4) = 3
fib(5) = 5
fib(6) = 8
fib(7) = 13
...
fib(46) = 1836311903
fib(47) = 2971215073

And finally…

Again, this was a long post. The payoff is that we now have beginnings of an emulator for the Owl-2820 CPU. We have also developed a simple assembler so that we can assemble the program that outputs the Fibonacci sequence, and get the expected results when we run it on our emulator.

We can even run it in Compiler Explorer.

Over the course of the next few posts we’ll bring our implementation of the Owl-2820 CPU up to parity with a RISC-V RV32I CPU so that we can use it to do far more than just generating the Fibonacci sequence.