8 September, 2024

Modfiying the Owl-2820 VM to decode RISC-V instructions directly.

Part 10 Running RISC-V instructions directly

In the last post we were finally able to run a compiled C program on the Owl-2820 VM. We did this by converting an image from RISC-V instruction encoding to Owl-2820 instruction encoding so that the Owl could run it.

In this post, we’re going to modify the Owl-2820 VM so that it can decode RISC-V instruction encoding and run the image directly.

Dispatchers and instruction handlers

We’re already quite close to being able to do that. Take a look at this snippet from Transcode() which converts from RISC-V instruction encoding to Owl-2820 instruction encoding.

void Transcode(Assembler& a, uint32_t code)
{
    DecodeRv32 rv(code);

    switch (code) {
        case 0x00000073: return a.Ecall();
        case 0x00100073: return a.Ebreak();
    }
    switch (code & 0xfe00707f) {
        case 0x00000033: return a.Add(rv.Rd(), rv.Rs1(), rv.Rs2());
        case 0x40000033: return a.Sub(rv.Rd(), rv.Rs1(), rv.Rs2());
        // ...

In essence, Transcode() decodes a RISC-V instruction into an opcode and dispatches it, along with its operands, to a member function of the Assembler to encode it as an Owl-2820 instruction. The Assembler has other member functions that do things such as resolving labels and offsets, but these functions are not used by Transcode(). In this context, the Assembler is playing the role of a handler for Owl-2820 instructions, and the other functions are not required for this role.

The role of an instruction handler

So, if the Assembler can play the role of an Owl-2820 instruction handler, then other classes can play this role also. We can think of the handler functions as a kind of interface or protocol that must be implemented by any class which fulfills that role.

For instance, if we had an OwlCpu class that played the role of an Owl-2820 instruction handler, then we could write a dispatcher function that decodes a RISC-V instruction into an opcode and dispatches it, along with its operands, to a member function of the OwlCpu to execute it directly on the Owl-2820 CPU.

The DispatchRv32i() function below does just that.

void DispatchRv32i(OwlCpu& cpu, uint32_t code)
{
    DecodeRv32 rv(code);

    switch (code) {
        case 0x00000073: return cpu.Ecall();
        case 0x00100073: return cpu.Ebreak();
    }
    switch (code & 0xfe00707f) {
        case 0x00000033: return cpu.Add(rv.Rd(), rv.Rs1(), rv.Rs2());
        case 0x40000033: return cpu.Sub(rv.Rd(), rv.Rs1(), rv.Rs2());
        // ...

Its basic structure is identical to that of Transcode(). We’ve merely changed the type, from Assembler to OwlCpu, yet we’ve already got something that looks a lot like it could run RISC-V instructions directly.

Admittedly we don’t have an OwlCpu class yet, but there’s something that Assembler and OwlCpu have in common. They can both be driven by what amounts to the same dispatcher function. From the perspective of the dispatcher function there is no difference between an Assembler and an OwlCpu.

Clearly we’re onto something here, and at this point you might consider extracting either a C++ interface consisting of pure virtual functions, or perhaps an abstract base class. However, let’s not rush into writing an IOwlHandler or an AbstractInstructionHandler just yet, as we may not want the overhead of runtime polymorphism.

No matter how we choose to implement it, with a C++ interface or otherwise, we have established that there is a pattern emerging here, in the shape of an instruction handler protocol that consists of handler functions for Owl-2820 instructions.

Let’s continue to explore our options, and what we might do with this protocol.

Substituting dispatchers

Using this protocol, we could substitute the dispatcher for RISC-V encoded instructions with a dispatcher for Owl-2820 encoded instructions. This new dispatcher would dispatch Owl-2820 encoded instructions for execution on the same OwlCpu class.

In this case, we’ve substituted the dispatcher and left the instruction handler unchanged.

Same instruction handler. Different dispatcher.

void DispatchOwl(OwlCpu& cpu, uint32_t ins)
{
    using namespace decode;

    // Decode the word to extract the opcode.
    const Opcode opcode = Opcode(ins & 0x7f);

    // Dispatch it and execute it.
    switch (opcode)
    {
        case Opcode::Ecall: return cpu.Ecall();
        case Opcode::Ebreak: return cpu.Ebreak();
        case Opcode::Add: return cpu.Add(r0(ins), r1(ins), r2(ins));
        case Opcode::Sub: return cpu.Sub(r0(ins), r1(ins), r2(ins));
        // ...

Substituting instruction handlers

Similarly, we could substitute the instruction handler with a disassembler. Here’s a dissassembler for Owl-2820 encoded instructions.

This time, we’ve substituted the instruction handler and left the dispatcher unchanged.

Same dispatcher. Different instruction handler.

void DisassembleOwl(OwlDisassembler& d, uint32_t ins)
{
    using namespace decode;

    // Decode the word to extract the opcode.
    const Opcode opcode = Opcode(ins & 0x7f);

    // Dispatch it and display it.
    switch (opcode)
    {
        case Opcode::Ecall: return d.Ecall();
        case Opcode::Ebreak: return d.Ebreak();
        case Opcode::Add: return d.Add(r0(ins), r1(ins), r2(ins));
        case Opcode::Sub: return d.Sub(r0(ins), r1(ins), r2(ins));
        // ...

Dispatchers require instruction handlers

The dispatchers know nothing about the way that the instruction handlers are implemented. All that a dispatcher needs is that the thing that it is dispatching to conforms to the requirements of an instruction handler. Behind the scenes, the instruction handler could be implemented by a flock of flying squirrels but the dispatcher would neither know nor care. For a dispatcher to work, its instruction handler has to meet the requirements of the protocol, nothing more.

Going the other way, there is no requirement for instruction handlers to know anything about dispatchers.

Being able to vary both dispatcher and instruction handler implementations gives us a lot of flexibility. By substituting one dispatcher with another, we can deal with both Owl-2820 encoded instructions and RISC-V encoded instructions. By changing the instruction handler, we can assemble those instructions, or execute them, or disassemble them. In a more distant future perhaps we could reach the point of transpiling those instructions to native machine code.

Being able to think of the problem conceptually is very promising, but we’ve yet to implement a single instruction handler other than the Assembler. Let’s rectify that by implementing an instruction handler that executes Owl-2820 CPU instructions.

An instruction handler for the Owl-2820 CPU

The Owl-2820 CPU implementation exists only inside the Run() function. in which its state is implemented by a handful of local variables that represent its registers, its memory, and whether it is done or not. Its behaviour is implemented by a while loop around a switch statement in which each instruction is handled by a different case.

Let’s convert that to an OwlCpu class whose member functions have the same names as those called by Transcode(). In other words, let’s implement an OwlCpu that meets the requirements of the instruction handler role.

This will allow us to implement both DispatchOwl() and DispatchRv32i() in terms of OwlCpu.

Implementing state

Here’s how state is represented in the Run() function:

code and memory are word-addressable and byte-addressable views of the image’s memory
pc and nextPc are the program counter and the address of the next instruction
x[32] is the state of the integer registers
done is a flag that determines if the CPU has finished

At startup, Run() sets these to their initial values prior to entering the while loop. Most values are set to zero, other than code and memory which point to the image, and the stack pointer which is set to the end of memory.

void Run(std::span<uint32_t> image)
{
    using namespace decode;

    // Get a read-only, word addressable view of the image for fetching instructions.
    const auto code = image;

    // Get a byte-addressable view of the image for memory accesses.
    auto memory = std::as_writable_bytes(image);

    // Set pc and nextPc to their initial values.
    uint32_t pc = 0;     // The program counter.
    uint32_t nextPc = 0; // The address of the next instruction.

    // Set all the integer registers to zero.
    uint32_t x[32] = {}; // The integer registers.

    // Set the stack pointer to the end of memory.
    x[sp] = uint32_t(memory.size());

    constexpr uint32_t wordSize = sizeof(uint32_t);

    bool done = false;

    while (!done)
    {
        // ...

In the OwlCpu class, state is represented by member variables whose initial values are set as shown in the code below. In this case, code and memory are set by the constructor’s initialization list, and the stack pointer is set to the end of memory in the constructor’s body.

class OwlCpu
{
    uint32_t pc = 0;             // The program counter.
    uint32_t nextPc = 0;         // The address of the next instruction.
    uint32_t x[32] = {};         // The integer registers.
    bool done = false;           // Are we there yet?
    std::span<uint32_t> code;    // Non-owning.
    std::span<std::byte> memory; // Non-owning.

public:
    OwlCpu(std::span<std::uint32_t> image) : code{image}, memory{std::as_writable_bytes(image)}
    {
        // Set the stack pointer to the end of memory.
        x[sp] = uint32_t(memory.size());
    }
    // ...

Implementing behaviour

Fetching instructions

After initializing the CPU’s state, Run() enters a while loop that repeatedly fetches the next instruction and dispatches it until the CPU is told to stop by an instruction or syscall that sets done to true.

    // ...
    while (!done)
    {
        // Fetch a 32-bit word from the address pointed to by the program counter.
        pc = nextPc;
        nextPc += wordSize;
        const uint32_t ins = AsLE(code[pc / wordSize]);
        // ...

In the OwlCpu class, done, pc and nextPc are private member variables. We’d rather not expose OwlCpu’s innards as we’d like to maintain its invariants, so we’ll give the OwlCpu class some member functions, Done() and Fetch().

class OwlCpu
{
    // ...
    bool Done() const
    {
        return done;
    }

    uint32_t Fetch()
    {
        constexpr uint32_t wordSize = sizeof(uint32_t);
        pc = nextPc;
        nextPc += wordSize;
        return AsLE(code[pc / wordSize]);
    }
    // ...

Executing instructions

Inside the while loop, Run() uses a switch statement to dispatch each instruction to a case statement that implements its behaviour.

        // ...
        // Decode the word to extract the opcode.
        const Opcode opcode = Opcode(ins & 0x7f);

        // Dispatch it and execute it.
        switch (opcode)
        {
            // System instructions.

        case Opcode::Ecall: {
            const auto syscall = Syscall(x[a7]);
            switch (syscall)
            {
            case Syscall::Exit:
                std::cout << "Exiting with status " << x[a0] << '\n';
                done = true;
                break;
        // ...

For the OwlCpu, rather than writing a case statement for each instruction, we will write a member function that does the same thing. Taken as a whole, these member functions allow the OwlCpu to meet the requirements of the instruction handler role.

There are about forty instructions. The conversion from case statement to member function is fairly mechanical, so I’ll illustrate this with a couple of examples rather than showing them all.

Here’s how Ecall is implemented as a case statement inside Run().

        // ...
        case Opcode::Ecall: {
            const auto syscall = Syscall(x[a7]);
            switch (syscall)
            {
            case Syscall::Exit:
                std::cout << "Exiting with status " << x[a0] << '\n';
                done = true;
                break;

            case Syscall::PrintFib:
                std::cout << "fib(" << x[a0] << ") = " << x[a1] << '\n';
                break;
            }
            break;
        }
        // ...

Here’s Ecall reimplemented as a member function of OwlCpu.

class OwlCpu
{
    // ...
    void Ecall()
    {
        const auto syscall = Syscall(x[a7]);
        switch (syscall)
        {
        case Syscall::Exit:
            std::cout << "Exiting with status " << x[a0] << '\n';
            done = true;
            break;

        case Syscall::PrintFib:
            std::cout << "fib(" << x[a0] << ") = " << x[a1] << '\n';
            break;
        }
    }
    // ...

Similarly, here’s Add inside of Run().

        // ...
        case Opcode::Add: {
            // r0 <- r1 + r2
            x[r0(ins)] = x[r1(ins)] + x[r2(ins)];
            x[0] = 0; // Ensure x0 is always zero.
            break;
        }
        // ...

And here’s Add() as a member function of OwlCpu.

class OwlCpu
{
    // ...
    void Add(uint32_t r0, uint32_t r1, uint32_t r2)
    {
        // r0 <- r1 + r2
        x[r0] = x[r1] + x[r2];
        x[0] = 0; // Ensure x0 is always zero.
    }
    // ...

The other instructions can be reimplemented in much the same way.

Dispatching instructions

We can replace each case of the switch statement with code that calls the equivalent member functions of OwlCpu, and put it inside a DispatchOwl() function like this.

void DispatchOwl(OwlCpu& cpu, uint32_t ins)
{
    using namespace decode;

    // Decode the word to extract the opcode.
    const Opcode opcode = Opcode(ins & 0x7f);

    // Dispatch it and execute it.
    // clang-format off
    switch (opcode)
    {
        case Opcode::Ecall: return cpu.Ecall();
        case Opcode::Ebreak: return cpu.Ebreak();
        case Opcode::Add: return cpu.Add(r0(ins), r1(ins), r2(ins));
        case Opcode::Sub: return cpu.Sub(r0(ins), r1(ins), r2(ins));
        case Opcode::Sll: return cpu.Sll(r0(ins), r1(ins), r2(ins));
        case Opcode::Slt: return cpu.Slt(r0(ins), r1(ins), r2(ins));
        case Opcode::Sltu: return cpu.Sltu(r0(ins), r1(ins), r2(ins));
        case Opcode::Xor: return cpu.Xor(r0(ins), r1(ins), r2(ins));
        case Opcode::Srl: return cpu.Srl(r0(ins), r1(ins), r2(ins));
        case Opcode::Sra: return cpu.Sra(r0(ins), r1(ins), r2(ins));
        case Opcode::Or: return cpu.Or(r0(ins), r1(ins), r2(ins));
        case Opcode::And: return cpu.And(r0(ins), r1(ins), r2(ins));
        case Opcode::Slli: return cpu.Slli(r0(ins), r1(ins), shift(ins));
        case Opcode::Srli: return cpu.Srli(r0(ins), r1(ins), shift(ins));
        case Opcode::Srai: return cpu.Srai(r0(ins), r1(ins), shift(ins));
        case Opcode::Beq: return cpu.Beq(r0(ins), r1(ins), offs12(ins));
        case Opcode::Bne: return cpu.Bne(r0(ins), r1(ins), offs12(ins));
        case Opcode::Blt: return cpu.Blt(r0(ins), r1(ins), offs12(ins));
        case Opcode::Bge: return cpu.Bge(r0(ins), r1(ins), offs12(ins));
        case Opcode::Bltu: return cpu.Bltu(r0(ins), r1(ins), offs12(ins));
        case Opcode::Bgeu: return cpu.Bgeu(r0(ins), r1(ins), offs12(ins));
        case Opcode::Addi: return cpu.Addi(r0(ins), r1(ins), imm12(ins));
        case Opcode::Slti: return cpu.Slti(r0(ins), r1(ins), imm12(ins));
        case Opcode::Sltiu: return cpu.Sltiu(r0(ins), r1(ins), imm12(ins));
        case Opcode::Xori: return cpu.Xori(r0(ins), r1(ins), imm12(ins));
        case Opcode::Ori: return cpu.Ori(r0(ins), r1(ins), imm12(ins));
        case Opcode::Andi: return cpu.Andi(r0(ins), r1(ins), imm12(ins));
        case Opcode::Lb: return cpu.Lb(r0(ins), imm12(ins), r1(ins));
        case Opcode::Lbu: return cpu.Lbu(r0(ins), imm12(ins), r1(ins));
        case Opcode::Lh: return cpu.Lh(r0(ins), imm12(ins), r1(ins));
        case Opcode::Lhu: return cpu.Lhu(r0(ins), imm12(ins), r1(ins));
        case Opcode::Lw: return cpu.Lw(r0(ins), imm12(ins), r1(ins));
        case Opcode::Sb: return cpu.Sb(r0(ins), imm12(ins), r1(ins));
        case Opcode::Sh: return cpu.Sh(r0(ins), imm12(ins), r1(ins));
        case Opcode::Sw: return cpu.Sw(r0(ins), imm12(ins), r1(ins));
        case Opcode::Fence: return cpu.Fence();
        case Opcode::Jalr: return cpu.Jalr(r0(ins), offs12(ins), r1(ins));
        case Opcode::Jal: return cpu.Jal(r0(ins), offs20(ins));
        case Opcode::Lui: return cpu.Lui(r0(ins), uimm20(ins));
        case Opcode::Auipc: return cpu.Auipc(r0(ins), uimm20(ins));
        case Opcode::J: return cpu.J(offs20(ins));
        case Opcode::Call: return cpu.Call(offs20(ins));
        case Opcode::Ret: return cpu.Ret();
        case Opcode::Li: return cpu.Li(r0(ins), imm12(ins));
        case Opcode::Mv: return cpu.Mv(r0(ins), r1(ins));
        default: return cpu.Illegal(ins);
    }
    // clang-format on
}

Rewriting Run()

Now that we have an Owl-2820 instruction handler, OwlCpu, we can drive it with a dispatcher.

We have a dispatcher function for Owl-2820 encoded instructions in the form of DispatchOwl() from the previous section. This will let us run Owl-2820 encoded instructions on the OwlCpu.

Similarly, we have a dispatcher function for RISC-V encoded instructions in the form of the DispatchRv32i() function that we saw at the top of this post. This will let us run RISC-V encoded instructions on the OwlCpu.

Running Owl-2820 encoded instructions on the Owl CPU

To run Owl-2820 encoded instructions, we can replace the Run() function with the following code that uses an OwlCpu instruction handler with the DispatchOwl() dispatcher function.

void Run(std::span<uint32_t> image)
{
    OwlCpu cpu(image);
    while (!cpu.Done())
    {
        const uint32_t ins = cpu.Fetch();
        DispatchOwl(cpu, ins);
    }
}

Running RISC-V encoded instructions on the Owl CPU

If we simply change the dispatcher from DispatchOwl() to DispatchRv32i(), then we can write a RunRv32i() function that dispatches RISC-V encoded instructions directly onto the Owl-2820 CPU.

void RunRv32i(std::span<uint32_t> image)
{
    OwlCpu cpu(image);
    while (!cpu.Done())
    {
        const uint32_t ins = cpu.Fetch();
        DispatchRv32i(cpu, ins);
    }
}

Show me the code

Here’s a link to the fully runnable code on Compiler Explorer.

Its main() function loads the program image and runs it twice: once as a RISC-V encoded image using RunRv32i(), and once as an Owl-2820 encoded image using Run().

int main()
{
    try
    {
        // Create a 4K memory image.
        constexpr size_t memorySize = 4096;
        std::vector<uint32_t> image(memorySize / sizeof(uint32_t));

        auto rv32iImage = LoadRv32iImage();

        // Copy the result into our VM image to run it directly.
        std::ranges::copy(rv32iImage, image.begin());

        std::cout << "Running RISC-V encoded instructions...\n";
        RunRv32i(image);

        // Transcode it to Owl-2820 and copy the result into our VM image.
        auto owlImage = Rv32iToOwl(rv32iImage);
        std::ranges::copy(owlImage, image.begin());

        std::cout << "\nRunning Owl-2820 encoded instructions...\n";
        Run(image);
    }
    catch (const std::exception& e)
    {
        std::cerr << e.what() << '\n';
    }
}

Benchmarking

If we can run RISC-V encoded instructions directly, why wouldn’t we do that by default? Why go to the effort of transcoding an image to the Owl-2820 instruction encoding? It’s a good question.

I’ve stated previously that it takes more work to decode the RISC-V instruction encoding in software than it does to decode the Owl-2820 instruction encoding. Until now, you’ve had to take my word for it.

Let’s modify the program slightly so that we can benchmark it. This will show how the difference in encoding translates to a difference in timing.

Disabling output

We have to disable the output from Exit and PrintFib. If we don’t do this, then the timings will be dominated by how quickly the terminal can scroll.

    // System instructions.

    void Ecall()
    {
        const auto syscall = Syscall(x[a7]);
        switch (syscall)
        {
        case Syscall::Exit:
            // std::cout << "Exiting with status " << x[a0] << '\n';
            done = true;
            break;

        case Syscall::PrintFib:
            // std::cout << "fib(" << x[a0] << ") = " << x[a1] << '\n';
            break;
        }
    }

Measuring how long it takes to run each image

We’d like to know how long it takes to run each image. However, if we only run an image for a small number of iterations, then the timings will vary wildly due to the granularity of the timing mechanism.

To get a meaningful measurement then we need to run each image several times. And by several times, I don’t just mean ten iterations, or even one hundred iterations. I mean at least one hundred thousand iterations, or better still, one million iterations.

Here’s the code. It measures the time that it takes to run one million iterations of each image, then it outputs the results.

int main()
{
    try
    {
        // Create a 4K memory image.
        constexpr size_t memorySize = 4096;
        std::vector<uint32_t> image(memorySize / sizeof(uint32_t));

        auto rv32iImage = LoadRv32iImage();

        // Crude microbenchmark.
        constexpr int attempts = 1'000'000;

        // Copy the result into our VM image to run it directly.
        std::ranges::copy(rv32iImage, image.begin());
        const auto startRv32i{std::chrono::steady_clock::now()};
        for (int i = 0; i < attempts; i++)
        {
            RunRv32i(image);
        }
        const auto endRv32i{std::chrono::steady_clock::now()};

        // Transcode it to Owl-2820 and copy the result into our VM image.
        auto owlImage = Rv32iToOwl(rv32iImage);
        std::ranges::copy(owlImage, image.begin());

        // Run it as Owl-2820.
        const auto startOwl{std::chrono::steady_clock::now()};
        for (int i = 0; i < attempts; i++)
        {
            Run(image);
        }
        const auto endOwl{std::chrono::steady_clock::now()};

        const std::chrono::duration<double> elapsedRv32i{endRv32i - startRv32i};
        const std::chrono::duration<double> elapsedOwl{endOwl - startOwl};

        std::cout << "Elapsed Rv32i: " << elapsedRv32i << '\n';
        std::cout << "Elapsed   Owl: " << elapsedOwl << "\n\n";
        std::cout << "RV32I timing as percentage of Owl: " << 100.0 * (elapsedRv32i / elapsedOwl) << '\n';
        std::cout << "Owl timing as percentage of RV32I: " << 100.0 * (elapsedOwl / elapsedRv32i) << '\n';
    }
    catch (const std::exception& e)
    {
        std::cerr << e.what() << '\n';
    }
}

The results

Here’s some typical output from my development machine.

Elapsed Rv32i: 12.5614s
Elapsed   Owl: 7.21679s

RV32I timing as percentage of Owl: 174.058
Owl timing as percentage of RV32I: 57.452

This may be a crude microbenchmark, but it is very clear that running Owl-2820 encoded instructions is quicker than running RISC-V encoded instructions directly.

We saw in Part 1 that decoding RISC-V encoded instructions in software is more complicated than decoding Owl-2820 instructions, but now we have numbers to back that up.

Summary

In this post we finally managed to run a RISC-V image directly on the Owl-2820 VM.

We did so by identifying the roles of dispatchers and instruction handlers. We wrote an OwlCpu instruction handler that executes instructions on an Owl-2820 CPU, and we wrote dispatchers for both RISC-V encoded instructions and Owl-2820 encoded instructions. We also saw that we could use the same approach to write other instructions handlers, such as a disassembler.

Finally, we benchmarked the difference in performance between RISC-V encoded instructions and Owl-2820 encoded instructions, and discovered that Owl-2820 encoded instructions are measurably quicker to dispatch.

Show me the code

Here’s a link to the fully runnable code on Compiler Explorer.

It doesn’t include the benchmarking code, as Compiler Explorer understandably has a limit on how long anything can run for.