init_MUL8.txt
20090814 : init yg

The MUL8L and MUL8H instructions use 2 blocks of 512 bytes with two read ports.
Each block performs two 4x4 multiplies which are then combined by two 12-bit adders
then by the 16-bit ASU.
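
As an illustration only, here is a small Python model of how four 4x4 lookups
can be combined into an 8x8 product. The pairing of the partial products (which
two go to which 12-bit adder) is my assumption, as are the names LUT and mul8;
the real datapath may group them differently.

```python
# Hypothetical software model of an 8x8 multiply built from a 4x4 lookup table.
# The pairing of partial products is an assumption, not taken from the hardware.

# 16x16 table of 4-bit x 4-bit products, indexed by (a << 4) | b
LUT = [(i >> 4) * (i & 0xF) for i in range(256)]

def mul8(a, b):
    """Return the 16-bit product of two unsigned bytes using four LUT reads."""
    al, ah = a & 0xF, a >> 4
    bl, bh = b & 0xF, b >> 4
    # First pair: both nibbles of a times the low nibble of b (fits in 12 bits)
    s0 = LUT[(al << 4) | bl] + (LUT[(ah << 4) | bl] << 4)
    # Second pair: both nibbles of a times the high nibble of b (fits in 12 bits)
    s1 = LUT[(al << 4) | bh] + (LUT[(ah << 4) | bh] << 4)
    # Final 16-bit addition (the role played by the ASU in the text)
    return (s0 + (s1 << 4)) & 0xFFFF
```

Each 12-bit sum is at most 255*15 = 3825, so it does not overflow 12 bits.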

Before MUL8L and MUL8H can be used, the MULI instruction must be used to initialise the lookup table
(this only applies when the FPGA implements the table in SRAM blocks; otherwise MULI serves no purpose).
Note that the available blocks in A3P FPGAs have 9 bits of address and 9 bits of data but we use only 8 of each,
so one half of each 512-entry block is unused. The remaining space could be used
for signed operations but I have not yet figured out how to implement them.

All 4x4 blocks are initialised in parallel with the same value (the MUL8 are unsigned).
MULI must be executed 256 times to fill all the 16*16 table entries.
The format of the operands of MULI might change in the future but it should accept :
 - 1 8-bit value (the contents of the multiply table)
 - 2 8-bit addresses (identical to the input of the MUL8x instructions)

The MULI's operands should look as much as possible like those of the MUL8L instruction, so
 - 1 operand (SND ?) is the first byte of the address
 - 1 operand (SI4 ?) is the second byte of the address (in its lower byte),
      and its higher byte contains the value that must be written.
Note that each address byte contains 2 occurrences of the same 4-bit value :
 SND :   xxxxxxxx   ABCDABCD
 SI4 :  (ABCD*EFGH) EFGHEFGH
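
A minimal Python sketch of this packing, assuming the layout drawn above
(the function name muli_operands is hypothetical, and the ignored upper byte
of SND is simply left at zero here) :

```python
def muli_operands(a, b):
    """Build the (SND, SI4) pair for one table entry; a and b are 4-bit values."""
    snd = a * 0x11                     # ABCDABCD in the lower byte
    si4 = ((a * b) << 8) | (b * 0x11)  # product in the upper byte, EFGHEFGH below
    return snd, si4
```

For example, a=3 and b=5 give SND=0033h and SI4=0F55h.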

The multiply table can be initialised sequentially, with 2 nested loops of 16 steps.
Each counter, say in R1 for SND and R2 for SI4, can be incremented by :
 ADD 11h, R1
or
 ADD 11h, R2
This is simpler than a sequence of instructions in the inner loop that does the replication :
 SHL 4 R1, R3
 OR  R1, R3
 SHL 4 R2, R4
 OR  R2,R4
(that's 4 instructions and 2 temporary registers).
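
To check that counting by 11h gives the same byte as the shift-and-OR
replication, here is a small Python comparison (both function names are
made up for this note) :

```python
def replicate_shift_or(n):
    """Replicate a 4-bit value into both nibbles of a byte with shift and OR."""
    return ((n << 4) | n) & 0xFF

def replicate_by_steps(n):
    """Reach the same byte by adding 11h n times, as the loop counters do."""
    acc = 0
    for _ in range(n):
        acc += 0x11
    return acc
```

Both agree for n = 0 to 15, because 15*11h = FFh still fits in one byte.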

In the inner loop, the code needs an increment and an accumulator.
The contents of the accumulator can then be written by MULI.
The accumulator is the SI4 operand of MULI and it contains
both the address (incremented by 11h) and the value (incremented by ABCD << 8,
where ABCD is the 4-bit value replicated in SND).
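
The accumulator idea can be checked with a few lines of Python. This is only
a sketch : the name inner_loop_values is hypothetical, and the value is
recorded before each addition so that the entry for 0 is included.

```python
def inner_loop_values(a):
    """Return the 16 successive SI4 values for a fixed 4-bit multiplicand a."""
    step = 0x11 + (a << 8)  # address step in the lower byte, value step above
    acc = 0                 # b = 0 : address 00h, product 0
    values = []
    for _ in range(16):
        values.append(acc)  # what MULI would receive for this step
        acc = (acc + step) & 0xFFFF
    return values
```

The lower byte never carries into the upper byte during these 16 values,
because the largest address is 15*11h = FFh.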

A simple algorithm with only 3 registers appears :
R1 contains SND
R2 contains SI4
R3 contains an increment.

R1 is cleared, R3 is set to 11h.
 MOV 0 R1
 MOV 11h R3

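Before the inner loop, R2 is cleared (multiply by 0).
 MOV 0 R2

Repeat 16 times :
  R2 is written to the LUT then R3 is added to R2
  (writing first stores the entry for 0 and avoids a write
   when the lower byte has overflowed to 10h) :
   MULI R2,R1
   ADD R3, R2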

 at the end of the inner loop, R1 and R3 are "incremented" :
  ADD 11h,R1
  ADD 100h,R3 ; note that it does not affect the lower byte that still contains 11h

The outer loop is repeated 16 times. The MSB of R1 can serve as an indicator of the end
of the loop. In fact, since the higher byte of SND is ignored, an additional counter can be
put there. It is incremented in parallel with the others, so
  ADD 811h,R1 (instead of 11h)
does the trick : 800h*16 = 8000h (128*256) so the MSB is set after 16 loops.
(Note : for YASEP32, it's not so easy because there is no immediate 32-bit add).
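
A quick Python check of this counter trick, assuming a 16-bit R1 that starts
at 0 (the function name is made up for the example) :

```python
def outer_counter_trace():
    """Record R1 at the start of each outer pass until its MSB becomes set."""
    r1 = 0
    trace = []
    while True:
        trace.append(r1)
        r1 = (r1 + 0x811) & 0xFFFF  # ADD 811h,R1
        if r1 & 0x8000:             # MSB set : leave the loop
            break
    return trace, r1
```

The loop runs 16 times, the lower byte goes through 00h, 11h, ... FFh,
and R1 ends at 8110h.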

The inner loop counter can be implemented with a 1-hot coded register in R4 :
  MOV 1 R4 ; init
  ROR 1 R4 ; move the bit
  MOV A0 NPC LSB0 R4 ; loop
Other methods are possible, for example : detecting that the lower byte
is equal to 10h but it uses another instruction and another register anyway.
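
The 1-hot counter can be modelled as follows. I assume here that ROR rotates
the whole 16-bit register and that the LSB0 condition means "jump while the
LSB is 0"; ror16 and one_hot_iterations are names invented for this sketch.

```python
def ror16(x, n=1):
    """Rotate a 16-bit value right by n bits (0 < n < 16)."""
    return ((x >> n) | (x << (16 - n))) & 0xFFFF

def one_hot_iterations():
    """Count the passes of a loop controlled by a rotating single bit."""
    r4 = 1                 # MOV 1 R4
    count = 0
    while True:
        count += 1         # loop body would go here
        r4 = ror16(r4)     # ROR 1 R4
        if r4 & 1:         # LSB set again : fall through
            return count, r4
```

The bit comes back to position 0 after 16 rotations, so R4 holds 1 again
and does not need to be reloaded before the next inner loop.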


=========================================================================
Result : Initialisation code for the multiply lookup table for YASEP16 :
  MOV 0 R1
  MOV 11h R3  

  MOV NPC A1
    MOV 0 R2

    MOV NPC A0 ; loop entry
      MULI R2,R1 ; puts accumulator in LUT
      ADD R3, R2 ; increment
      ROR 1 R4 ; move the bit (R4 must contain 1 before the first pass)
      MOV A0 NPC LSB0 R4 ; loop

    ADD 100h, R3
    ADD 811h, R1
    MOV A1 NPC MSB0 R1

total :
 12 instructions (not counting the MOV 1 R4 that initialises the inner loop counter)
 A0 A1 R1 R2 R3 R4
 34 bytes (pending instruction set changes,
  like no loopentry and relative conditional jumps)
=========================================================================
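
To convince myself that the register arithmetic works, here is a rough Python
model of the whole initialisation. Several things are assumptions and not
taken from the hardware : MULI is modelled as one write into a 256-entry list
indexed by the low nibbles of SND and SI4, R4 starts at 1, and each entry is
written before the increment is added so that the entries for 0 are covered.

```python
def init_multiply_table():
    """Simulate the two nested loops; return the modelled LUT and MULI count."""
    lut = [None] * 256       # modelled table, indexed by (a << 4) | b
    muli_count = 0
    r1, r3, r4 = 0, 0x11, 1  # SND, increment, 1-hot inner counter
    while True:              # outer loop (A1)
        r2 = 0               # MOV 0 R2 : SI4 accumulator
        while True:          # inner loop (A0)
            # MULI R2,R1 (assumed model : address nibbles from the low bytes,
            # value from the upper byte of SI4)
            a, b = r1 & 0xF, r2 & 0xF
            lut[(a << 4) | b] = r2 >> 8
            muli_count += 1
            r2 = (r2 + r3) & 0xFFFF                 # ADD R3, R2
            r4 = ((r4 >> 1) | (r4 << 15)) & 0xFFFF  # ROR 1 R4
            if r4 & 1:                              # LSB0 false : exit
                break
        r3 = (r3 + 0x100) & 0xFFFF  # ADD 100h, R3
        r1 = (r1 + 0x811) & 0xFFFF  # ADD 811h, R1
        if r1 & 0x8000:             # MSB0 false : exit
            break
    return lut, muli_count
```

In this model MULI is executed 256 times and every entry (a,b) ends up
holding a*b.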