The YASEP multiply instructions

Introduction

The current YASEP architecture defines partial multiply instructions. They were not planned at the beginning, because an optional multi-cycle hardware multiplier was meant to be accessed through the Special Registers. However, this would have made interrupts and multi-processing difficult, because context switches would have become too complex.

The chosen method computes only parts of a multiply. It is a compromise between speed, context-switching complexity and hardware complexity. The hardware reuses the existing 16-bit adder as the final stage, after two 12-bit partial adders, each adding the 8-bit results of two 4×4 tables.
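As a rough illustration of this datapath, the Python sketch below models an 8×8 unsigned multiply built from four 4×4 products. The function name and the exact pairing of tables to partial adders are assumptions made for the example; the text above only gives the widths.

```python
def mul8x8_from_nibbles(a, b):
    """Model an 8x8 unsigned multiply built from four 4x4 lookups (assumed wiring)."""
    a_lo, a_hi = a & 0xF, (a >> 4) & 0xF
    b_lo, b_hi = b & 0xF, (b >> 4) & 0xF
    # Four 4x4 table lookups, each giving an 8-bit result.
    ll, hl = a_lo * b_lo, a_hi * b_lo
    lh, hh = a_lo * b_hi, a_hi * b_hi
    # Two partial adders: each combines two table outputs into 12 bits.
    p0 = ll + (hl << 4)  # a * b_lo
    p1 = lh + (hh << 4)  # a * b_hi
    assert p0 < (1 << 12) and p1 < (1 << 12)
    # Final stage: the existing 16-bit adder.
    return (p0 + (p1 << 4)) & 0xFFFF
```

Each partial sum is at most 255 × 15 = 3825, which is why 12 bits are enough before the final 16-bit addition.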

This method adds one stage to the pipeline whenever MUL8L or MUL8H is executed. As a consequence, one "bubble" may be inserted in the pipeline the first time a non-multiply instruction is executed after a multiply instruction.

There are two multiply instructions: MUL8L and MUL8H. They are identical, except that MUL8H takes the high byte of the SND operand instead of the low byte used by MUL8L. This simplifies the algorithms that multiply more than one byte, sparing a few instructions (see the R16×R16 example below).
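The difference between the two instructions can be summarised with a small Python model. It is only a sketch: the function names are hypothetical, and it assumes that the first operand always contributes its low byte, which is how the examples in this page use it.

```python
def mul8l(first, snd):
    """Low byte of the first operand times the LOW byte of SND (assumed semantics)."""
    return (first & 0xFF) * (snd & 0xFF)

def mul8h(first, snd):
    """Low byte of the first operand times the HIGH byte of SND (assumed semantics)."""
    return (first & 0xFF) * ((snd >> 8) & 0xFF)

# Example with first = 0012h and SND = 3456h:
#   mul8l -> 12h * 56h = 060Ch
#   mul8h -> 12h * 34h = 03A8h
```

Both results always fit in 16 bits, since 255 × 255 = 65025.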

Only unsigned multiplies are supported for now. Signed operands must be adjusted manually before and after the instruction sequence.
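One standard way to perform this adjustment is the usual two's-complement correction: multiply the raw bytes as unsigned values, then subtract the other operand (shifted left by 8) for each operand that was negative. The Python sketch below models that arithmetic; it is not YASEP code, its function name is hypothetical, and it is not necessarily the sequence the author had in mind.

```python
def signed_mul8_via_unsigned(a, b):
    """Signed 8x8 -> 16-bit product using only an unsigned 8x8 multiply."""
    ua, ub = a & 0xFF, b & 0xFF  # raw two's-complement bytes
    p = ua * ub                  # what an unsigned multiply returns
    if a < 0:
        p -= ub << 8             # undo the +256 bias of a
    if b < 0:
        p -= ua << 8             # undo the +256 bias of b
    p &= 0xFFFF                  # keep 16 bits
    # Reinterpret the 16-bit pattern as a signed value.
    return p - 0x10000 if p & 0x8000 else p
```

The correction works because a negative byte read as unsigned is its value plus 256, so the unsigned product contains an extra 256 × (other operand) term that must be removed.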

Some multiply examples

R8 × R8 multiply

; R0(8 bits) × R1(8 bits) => R2(16 bits)
  MUL8L R0 R1 R2 ; 4 bytes
; R0(8 bits) × R1(8 bits) => R1(16 bits)
  MUL8L R0 R1 ; 2 bytes

R8 × I8 multiply

; R0(8 bits) × Imm16(8 bits) => R1(16 bits)
  MUL8L 123 R0 R1 ; 4 bytes
; R0(8 bits) × Imm16(8 bits) => R0(16 bits)
  MUL8L 123 R0 ; is an alias to :
  MUL8L 123 R0 R0 ; 4 bytes anyway
; R0(8 bits) × Imm4(4 bits) => R0(12 bits)
  MUL8L 12 R0 ; 2 bytes

R8 × R16 multiply

YASEP16 : 6 instructions

; R0(8 bits) × R1(16 bits) => R2(16 lower bits)-R3(8 higher bits)
; R4 = scratch

; higher half
  MUL8H R0 R1 R2
  SHR 8 R2 R3 ; split the result between R2 and R3
  SHL 8 R2
; lower half
  MUL8L R0 R1 R4
  ADD R4 R2
    ADD 1 R3 R3 carry
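The sequence above can be checked against a plain multiplication with a short Python model. This is a sketch under stated assumptions: registers are modelled as 16-bit integers, the first operand of MUL8L/MUL8H contributes its low byte, and the conditional ADD executes only when the previous ADD produced a carry. The function name is hypothetical.

```python
MASK = 0xFFFF

def mul_r8_r16(r0, r1):
    """Mirror the six-instruction R8 x R16 sequence; returns (R2, R3)."""
    r2 = (r0 & 0xFF) * (r1 >> 8)    # MUL8H R0 R1 R2
    r3 = r2 >> 8                    # SHR 8 R2 R3
    r2 = (r2 << 8) & MASK           # SHL 8 R2
    r4 = (r0 & 0xFF) * (r1 & 0xFF)  # MUL8L R0 R1 R4
    total = r2 + r4                 # ADD R4 R2
    r2 = total & MASK
    if total > MASK:                # ADD 1 R3 R3 carry
        r3 += 1
    return r2, r3
```

For any 8-bit R0 and 16-bit R1, `(r3 << 16) | r2` should equal `r0 * r1`, with R3 never exceeding 8 bits.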

R8 × I16 multiply

; R0(8 bits) × Imm16 => R2(16 lower bits)-R3(8 higher bits)
; R1 = scratch
; Imm16=1234h in this example

; higher half
  MUL8L 12h R0 R2 ; 12h = 1234h>>8 (the immediate comes first, as in the other examples)
  SHR 8 R2 R3 ; split the result between R2 and R3
  SHL 8 R2

; lower half
  MUL8L 34h R0 R1 ; 34h = 1234h & 0xFF
  ADD R1 R2
    ADD 1 R3 R3 carry

I8 × R16 multiply

; Imm8(8 bits) × R1(16 bits) => R2(16 lower bits)-R3(8 higher bits)
; R4 = scratch
; Imm8 = 12h in this example

; higher half
  MUL8H 12h R1 R2
  SHR 8 R2 R3 ; split the result between R2 and R3
  SHL 8 R2
; lower half
  MUL8L 12h R1 R4
  ADD R4 R2
    ADD 1 R3 R3 carry

R16 × R16 multiply

YASEP16 : 13 instructions (14 with the optional final ROR)

; R0 × R1 => R2(16 lower bits)-R3(16 higher bits) (R4=scratch)
; R1 is left rotated by 8 bits unless the final ROR is executed

; the 2 middle bytes are computed together
  MUL8H R0 R1 R2
  MUL8H R1 R0 R3 ; Notice the exchange of operands
  ADD R3 R2 ; carry reused later

  SHR 8 R2 R3 ; adjust between R2 and R3,
  SHL 8 R2
    MOV 100h R4 ; speculative carry
    OR R3 R4 R3 carry ; and put the carry back into R3
; lower byte
  MUL8L R1 R0 R4
  ADD R4 R2
    ADD 1 R3 R3 carry
; higher byte
  ROL 8 R1
  MUL8H R1 R0 R4
  ADD R4 R3
; optionally (restores R1) :
  ROR 8 R1
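The 16×16 sequence is easier to follow with a Python model that mirrors it step by step. As before this is only a sketch: registers are 16-bit integers, the first operand of MUL8L/MUL8H contributes its low byte, and the shifts are assumed to leave the carry of the preceding ADD untouched, as the "carry reused later" comment implies. The function name is hypothetical.

```python
MASK = 0xFFFF

def mul_r16_r16(r0, r1):
    """Mirror the R16 x R16 sequence; returns (R2, R3) = (low word, high word)."""
    lo = lambda x: x & 0xFF
    hi = lambda x: (x >> 8) & 0xFF
    # the 2 middle bytes
    r2 = lo(r0) * hi(r1)                # MUL8H R0 R1 R2
    r3 = lo(r1) * hi(r0)                # MUL8H R1 R0 R3
    mid = r2 + r3                       # ADD R3 R2 (result may need 17 bits)
    carry, r2 = mid > MASK, mid & MASK
    r3 = r2 >> 8                        # SHR 8 R2 R3
    r2 = (r2 << 8) & MASK               # SHL 8 R2
    if carry:                           # MOV 100h R4 / OR R3 R4 R3 carry
        r3 |= 0x100
    # lower byte
    low = r2 + lo(r1) * lo(r0)          # MUL8L R1 R0 R4 / ADD R4 R2
    r2 = low & MASK
    if low > MASK:                      # ADD 1 R3 R3 carry
        r3 += 1
    # higher byte
    r3 = (r3 + hi(r1) * hi(r0)) & MASK  # ROL 8 R1 / MUL8H R1 R0 R4 / ADD R4 R3
    return r2, r3
```

The two cross products are weighted by 256, so their sum is split across both result words, and the carry out of that sum lands on bit 8 of the high word.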

How to initialise the multiply lookup table

Some FPGA implementations might use small SRAM blocks as a multiply accelerator. An 8×8-bit multiply uses 4 lookup tables (LUTs), each mapping two 4-bit operands to an 8-bit result. At least two dual-port 256-byte SRAM blocks are necessary. The 4 LUTs can be initialised with the same values, so there are only 256 values to write. The following code provides a short routine that does this:

; LUT initialisation code :
; R1 : SND (first address + outer loop counter)
; R2 : SI4 (second address (lower byte) + init value (higher byte))
; R3 : accumulator : second address increment + init increment (higher byte)
; R4 : inner loop counter, 1-hot encoded
; A0 : inner loop address
; A1 : outer loop address

 mov 0 R1
 mov 11h R3
 mov 1 R4
 mov NPC A1
 ; outer loop 16 times :
   mov 0 R2
   mov NPC A0
   ; inner loop 16 times :
     ADD R3 R2
     MULI R2 R1
     ror 1 R4
     mov A0 NPC LSB0 R4

   add  100h R3
   add 1011h R1
   mov A1 npc no_carry
 ; the end
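For comparison, the table contents implied by the description (256 products of two 4-bit values) can be generated with a few lines of Python. This is a reference sketch only: the function name is hypothetical, and the address layout (one operand in the high nibble, the other in the low nibble) is an assumption, not something stated above.

```python
def build_mul4_lut():
    """Return 256 bytes: entry (a << 4) | b holds a * b (assumed address layout)."""
    return [(addr >> 4) * (addr & 0xF) for addr in range(256)]

lut = build_mul4_lut()
# The largest entry is 15 * 15 = 225, so every product fits in one byte.
```

A table like this could serve as a check when simulating the initialisation routine, provided the address layout matches the actual hardware.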