init_MUL8.txt
20090814 : init
yg

The MUL8L and MUL8H instructions use 2 blocks of 512 bytes with
two read ports. Each block performs two 4x4 multiplies, which are
then combined by two 12-bit adders, then by the 16-bit ASU.

Before MUL8L and MUL8H can be used, the MULI instruction must be
used to initialise the lookup table (this applies when the FPGA uses
SRAM blocks; otherwise MULI serves no purpose).

Note that the available blocks in A3P FPGAs have 9 bits of address
and 9 bits of data, but we use only 8 of each, so one half of each
512-entry block is unused. The remaining space could be used for
signed operations but I have not yet figured out how to do them.

All 4x4 blocks are initialised in parallel with the same value
(the MUL8 instructions are unsigned). MULI must be executed 256 times
to fill all the 16*16 table entries.

The format of the operands of MULI might change in the future
but it should accept :
 - 1 8-bit value (the contents of the multiply table)
 - 2 8-bit addresses (identical to the input of the MUL8x instructions)

The MULI operands should look as much as possible like those of
the MUL8L instruction, so
 - 1 operand (SND ?) is the first byte of the address
 - 1 operand (SI4 ?) is the second byte of the address (LSB),
   and the higher byte contains the value that must be written.

Note that each address byte contains 2 occurrences of the same
4-bit value :

   SND :  xxxxxxxx    ABCDABCD
   SI4 : (ABCD*EFGH)  EFGHEFGH

The multiply table can be initialised sequentially, with 2 nested
loops of 16 steps. Each counter, say in R1 for SND and R2 for SI4,
could be incremented by :
   ADD 11h, R1
or
   ADD 11h R2
This is simpler than a sequence of instructions in the inner loop
that does the replication :
   SHL 4 R1, R3
   OR R1, R3
   SHL 4 R2, R4
   OR R2, R4
(that's 4 instructions and 2 temporary registers).

In the inner loop, the code needs an increment and an accumulator.
The contents of the accumulator can then be written by MULI.
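The replication trick above can be checked numerically. The following is only a minimal Python sketch, not YASEP code: the helper names replicate and muli_operands are hypothetical, and the operand packing simply follows the SND / SI4 layout drawn above (the ignored high byte of SND is left at 0 here).

```python
def replicate(nibble):
    """Copy a 4-bit value into both halves of a byte (the SHL 4 / OR sequence)."""
    assert 0 <= nibble <= 0xF
    return ((nibble << 4) | nibble) & 0xFF


def muli_operands(a, b):
    """Pack the two MULI operands for the table entry a*b (hypothetical helper).

    snd: low byte ABCDABCD, high byte ignored by MULI (0 in this sketch)
    si4: high byte holds the value a*b, low byte holds EFGHEFGH
    """
    snd = replicate(a)
    si4 = ((a * b) << 8) | replicate(b)
    return snd, si4


# Stepping a counter by 11h gives the same byte pattern as shift + OR,
# without the extra instructions or temporary registers.
counter = 0
stepped = []
for _ in range(16):
    stepped.append(counter)
    counter += 0x11
```

For instance, muli_operands(3, 5) gives SND = 33h and SI4 = 0F55h: the product 15 sits in the high byte of SI4 and each 4-bit index is doubled in its low byte.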
The accumulator is the SI4 operand of MULI and it contains both
the address (incremented by 11h) and the value (incremented by
ABCD << 8, where ABCD is the 4-bit index held in SND).

A simple algorithm with only 3 registers appears :
   R1 contains SND
   R2 contains SI4
   R3 contains an increment.

R1 is cleared, R3 is set to 11h.
   MOV 0 R1
   MOV 11h R3

Before the inner loop, R2 is cleared (multiply by 0).
   MOV 0 R2
Repeat 16 times : R2 is written to the LUT, then R3 is added to R2 :
   MULI R2,R1
   ADD R3, R2
The write comes first so that the "multiply by 0" entry is stored
and the address byte stays within 00h..FFh for all 16 writes.

At the end of the inner loop, R1 and R3 are "incremented" :
   ADD 11h,R1
   ADD 100h,R3 ; note that it does not affect the lower byte that still contains 11h

The outer loop is repeated 16 times. The MSB of R1 can serve as an
indicator of the end of the loop. In fact, since the higher byte of
SND is ignored, an additional counter can be put there. It is
incremented in parallel with the others, so
   ADD 811h,R1
(instead of 11h) does the trick : 800h*16=128*256=8000h so the MSB
is set after 16 loops.
(Note : for YASEP32, it's not so easy because there is no immediate
32-bit add).

The inner loop counter can be implemented with a 1-hot coded
register in R4 :
   MOV 1 R4 ; init
   ROR 1 R4 ; move the bit
   MOV A0 NPC LSB0 R4 ; loop
With a 16-bit register, the bit is back in the LSB after 16 rotations,
so the loop exits and R4 is ready for the next pass.

Other methods are possible, for example : detecting that the lower
byte is equal to 10h, but this uses another instruction and another
register anyway.

=========================================================================

Result :

Initialisation code for the multiply lookup table for YASEP16 :

 ; R4 must contain 1 on entry (MOV 1 R4, see above)
 MOV 0 R1
 MOV 11h R3
 MOV NPC A1
   MOV 0 R2
   MOV NPC A0 ; loop entry
     MULI R2,R1 ; puts accumulator in LUT
     ADD R3, R2 ; increment
     ROR 1 R4 ; move the bit
   MOV A0 NPC LSB0 R4 ; loop
   ADD 100h, R3
   ADD 811h, R1
 MOV A1 NPC MSB0 R1

total :
   12 instructions
   A0 A1 R1 R2 R3 R4
   34 bytes (pending instruction set changes, like no loopentry
   and relative conditional jumps)

=========================================================================
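
The register arithmetic of the loop can be checked with a small model. This is a Python sketch under stated assumptions, not a YASEP simulator: registers are taken to be 16 bits wide, the LUT is modelled as a dictionary keyed by the two address bytes, R4 is assumed to start at 1, and each entry is stored before the increment is added so that the index 0 entry is covered. The names ror16 and init_table are hypothetical.

```python
MASK16 = 0xFFFF


def ror16(x, n=1):
    """Rotate a 16-bit value right by n bits."""
    n %= 16
    return ((x >> n) | (x << (16 - n))) & MASK16


def init_table():
    """Model of the two nested loops; returns (lut, number of MULI writes)."""
    lut = {}       # (SND low byte, SI4 low byte) -> value byte (assumed LUT model)
    writes = 0
    r1 = 0         # SND: replicated index in the low byte, outer counter in the high byte
    r3 = 0x11      # increment: 11h for the address, outer index in the high byte
    r4 = 1         # one-hot inner loop counter (assumed initialised to 1)
    while True:                                   # outer loop (A1)
        r2 = 0                                    # MOV 0 R2
        while True:                               # inner loop (A0)
            lut[(r1 & 0xFF, r2 & 0xFF)] = r2 >> 8   # MULI R2,R1
            writes += 1
            r2 = (r2 + r3) & MASK16               # ADD R3, R2
            r4 = ror16(r4)                        # ROR 1 R4
            if r4 & 1:                            # LSB set: leave the inner loop
                break
        r3 = (r3 + 0x100) & MASK16                # ADD 100h, R3
        r1 = (r1 + 0x811) & MASK16                # ADD 811h, R1
        if r1 & 0x8000:                           # MSB set: leave the outer loop
            break
    return lut, writes


lut, writes = init_table()
```

In this model the 256 writes land on 256 distinct addresses, and the entry at (a*11h, b*11h) holds a*b for every pair of 4-bit indices a and b.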