vsp/docs/vsp05.txt
created Tue Oct 15 17:28:49 CEST 2002 by whygee@f-cpu.org
version Sat Oct 19 2002
version Sun Oct 27 02:48:09 CET 2002 (d'oh, it's now winter time...)
version Sun Jul 25 04:06:28 CEST 2004 : a few additions, PFQ conflicts and other details.
version Tue Jul 27 00:41:41 CEST 2004 : 6xPFQ + update flags
version Sun Aug 8 01:27:22 CEST 2004 : instruction bit order inverted
version Tue Aug 31 04:16:58 CEST 2004 : added the I/O instruction range, padding NOP optimisation, SMT, a page of history...
version Thu Sep 9 10:08:24 CEST 2004 : VSP vs F-CPU

-------------------------------------------------
!!!!!!!!!!!!!!!!!!!! WARNING !!!!!!!!!!!!!!!!!!!!
This text is a work draft and is subject to arbitrary changes for whatever reason that pleases me. It is by definition highly incomplete and inaccurate. In fact, it only gives a rough idea of what this stuff is; please read the source files for a more accurate and up-to-date definition. Don't dare complain about anything.
!!!!!!!!!!!!!!!!!!!! WARNING !!!!!!!!!!!!!!!!!!!!
-------------------------------------------------

Introduction

The goal of this small project is to specify, design and implement a "Very Simple Processor", a kind of 32-bit RISC microcontroller, with tight constraints on core size (as small as reasonably possible) and power (ultra-low power consumption) for tiny embedded, battery-operated consumer devices. It started as a serious project but almost immediately degenerated into a weird but fun and instructive experience. F-CPU was slowly losing ground and I needed new toys so I could test ideas. There are many concepts in common between F-CPU and VSP and I believe they both benefit from each other. However, some crucial characteristics differ radically : F-CPU is wide and scalable while VSP is strictly limited to 32-bit registers with scarce support for 8-bit and 16-bit data.
F-CPU is designed for speed and raw performance while VSP is aimed at low power and dirty I/O background tasks. I believe they complement each other well and could be used in the same system.

Target

This small processor shall be used as a "SoC Area Controller" : it schedules hard real-time events and resources but it is not in charge of the raw computational parts, which must be performed by specialized processors that operate in parallel. It can however handle some low-CPU background tasks (system monitoring, garbage collection...) or simply sit idle between two interrupts. This is why a specific processor, which consumes very few resources, is needed : the overall system can be more efficient and smaller if the tasks are processed by specialized processors. Relieving the other processors from the management chores reduces the constraints and simplifies their design. The VSP is not a multi-purpose general processor; the coprocessors, in turn, can get rid of any interrupt management. This is a particularly important point in the case of a DSP, where most registers are usually doubled to keep the interrupt response time short. The target performance of the VSP is between a baseline ARM and a PIC or AVR microcontroller, but these are proprietary architectures. Jürgen proposed naming this "new" processor "LEG" but since it is not a direct replacement for ARM, I still prefer the name "VSP". I'll maybe use this name if someone finds a good meaning for the acronym "LEG", but it misses the main point : it's something completely "different" which brings (and also throws away) several ideas. It's not a clone; it has been designed from scratch for a very particular application.

Rationale

It is highly questionable whether rewriting a core from scratch is a good idea. The software development tools must be rewritten from scratch and the "operating system" must be created for this specific architecture. It is even possible that no compiler will be available for a while.
However, a highly application-specific processor has some interesting advantages. For example, it is possible to tightly integrate peripherals, instead of "glueing" "IP cores" together. Another reason and motivation for not using cores like those from Tensilica and others is that the VSP is implemented in full compliance with the Copyleft (GPL) world, instead of being a simple "open sourced" proprietary product that is bound by additional licences and patents. Finally, not everything is rewritten "from scratch" because it reuses many aspects that were developed for the Freedom CPU project. In many ways, it is not simply "yet another me-too architecture". Well, nobody would have thought of or dared doing this crazy stuff, but it is possible *here* because it's a "quick and dirty" project.

Historical background

One computer in particular has amazed and influenced me : the Control Data Corp. CDC6600 designed by Seymour Cray in the '60s, the first machine that was called a supercomputer. Basically, the system had just a few main (large) parts : ample core memory (8 banks accessed in round-robin fashion to increase the memory bandwidth), one (or two if you were really really really rich) main processor (doing the math in 60-bit FP format) and a set of small processors (accumulator-based with 15-bit instructions) that interfaced to dedicated I/O channels like discs, tapes, typewriters... The most remarkable trick is that these PPUs ("Peripheral Processing Units", which reminds us of the "Peripheral Interface Controller" made by another company, later spun off as Microchip) were in fact one single computer with 10 contexts, each dedicated to one PPU and one I/O channel (though the channels were SW-selectable IIRC). Yep, that was a Simultaneous Multithreading computer that did many simple low-level things at the same time. The comparison with F-CPU is easy : the long and painful context switches have been widely discussed.
If a small CPU, running only "trusted code", can help reduce the number of IRQs sent to the CPU, this could simplify the OS, reduce the burden on device drivers, etc. But here we deal only with "slow" I/O like keyboards, mice, serial I/O, USB1... or just buttons, BIOS, front LCD display, power management, hotplug, power sequencing... But that is only one application : the VSP can sustain itself without coprocessors when only simple 32-bit operations are needed.

External architecture

The working environment is a consumer-class SoC with several data streams and interfaces : SDRAM interface, digital sound in and out, video streams, hard disk and/or DVD, LCD display, user input interface, dedicated coprocessors, communication links, power management... The purpose of this VSP is to control and configure the interfaces and possibly handle some simple transfer protocols (mass storage or peripherals), but not much more. There are things that the VSP is not meant to do or be : some additional functions and tasks are ruled out because they don't fit the profile of a microcontroller. The goal is not to crunch a lot of computations, because this must be performed by "coprocessors" (other processors with their own instruction stream that run in parallel on a specific task). Similarly, the goal is not to run Linux either : virtual memory or protection rings are completely useless for the very specific tasks that the microcontroller executes. There is a risk that a virus could appear, but since the architecture is "open" and user-modifiable, the potential lifespan of such malicious code is quite short... The VSP runs a small-footprint real-time chip management software. It can communicate with the other cores in the chip but must also access data in main memory. This is usually implemented with a single SDRAM chip today. There are two consequences :
 - this must be a 32-bit core, as pointers can be quite large.
24-bit is not enough because 16MB of SDRAM can be much too small in the near future. However, it is not likely that 4GB will be used soon, so the MSB can contain some flags. It is also practical to have 32-bit integers to reduce the register pressure when handling such large numbers, as they are very commonly used today.
 - SDRAM chips work with 4 simultaneous "banks" and this must be reflected in the architecture. VSP does not use a classical cache but rather a set of direct-mapped buffers of the SDRAM lines. These buffers are also directly accessed by other devices to reduce cache coherency problems.

From these points, it is obvious that :
 - no VM or protection (supervisor mode) is necessary. The VSP accesses and controls everything but it does not run user applications and no swapping is necessary.
 - fast interrupt response is needed (a few cycles).
 - it's not going to run Linux or anything like that. In fact, it is designed in such a way that it is not possible to use it "as is", hehe...
 - the VSP must transparently but directly interface to the SDRAM chip (to a certain extent) to reduce the buffer sizes and response time.
 - it is not necessary to run *really* fast. The chosen target frequency is 10MHz to 20MHz. Faster clocks are only possible through faster circuits, not with architectural changes. If more power is needed, then another core must be used or designed instead, or a coprocessor must do the work.
 - it doesn't need to be complex or loaded with resources : a very small pipeline (if any) is a good choice, given the low operating frequency. This also reduces the decoding logic's complexity.
 - software size (code density) is not an issue, but core size is much more critical. SDRAM and FLASH capacity is much cheaper than FPGA cells.
Feeding the core with instructions or data is easy, considering the available memory bandwidth (4 or 8 32-bit words per burst at around 50 or 60MHz, or 240MB/s peak) and the low core speed (maximum theoretical throughput is 4x32-bit words per 100ns cycle, or 160 Mbytes/sec) ==> there is a comfortable margin for other applications.

Register organisation

The data types are :
 - "byte" (8 bits)
 - "half word" (16 bits)
 - "word" (32 bits)
These data are right-aligned in the registers and stored in memory in little-endian order (but this could be changed in the VHDL source code if needed). Given the SDRAM structure, it becomes obvious that a PFQ-based architecture ("PFQ" means "PreFetch Queue") is certainly desired, as it manages blocks of data easily through a pair of registers (without load and store instructions), thus emphasising the communication side of the targeted use. One of the main tasks, besides answering IRQs, would be to scan incoming blocks from mass storage and parse MPEG streams in search of block delimiters, in order to hand the decoding job to a DSP. Or to display data on a raster LCD screen with fonts or sprites. The rest consumes so few instructions and cycles that it's not worth "optimising" them. There must only be a "cheap" way (in terms of time and space) to assert IRQs and manage the integrated peripherals. The first idea was to use a 16-register architecture with 8 normal registers and 4 PFQs (8 registers in total). It seemed too tight and it evolved into 5 PFQs and 6 registers that are all mapped into the 16-register range. The added PFQ (#4) was reserved for the stack and was hardwired to read pre-increment and write post-decrement (or the reverse, if you like). But it still seemed too tight, so a 6th PFQ was added, and all of them are programmable for post-increment, post-decrement and pre-decrement. So multi-stack algorithms can be implemented and there is some room for nested loops.
On top of that, the bandwidth increases and there are at most only 10 registers to save when an IRQ or trap occurs.

Table 1 : register map

  #   name   function
  0   A0     default PC
  1   D0     default instruction register
  2   A1     PFQ1
  3   D1     PFQ1
  4   A2     PFQ2
  5   D2     PFQ2
  6   A3     PFQ3
  7   D3     PFQ3
  8   A4     PFQ4
  9   D4     PFQ4
 10   A5     PFQ5/Stack pointer
 11   D5     PFQ5/Stack top
 12   R0
 13   R1
 14   R2
 15   R3

Register decoding is as follows :
  D = 0 and /(3 and 2)
  A = /0 and /(3 and 2)
  R = 3 and 2
(a register number selects a Dx when bit 0 is set and bits 3 and 2 are not both set, an Ax when bit 0 is clear and bits 3 and 2 are not both set, and an Rx when bits 3 and 2 are both set). Of course this can be simplified a lot or even changed (this draft must be considered as highly preliminary !).

-------------------------------------
Not a "Load/Store" architecture :
the principles of the Prefetch Queues
-------------------------------------

The VSP is the smallest possible implementation of the PFQ concept : it is something looking like a DSP with several simple address generators, or a modified CDC6600 computer, or even a processor with several register windows. Or none of them. This kind of architecture decouples the computer into two parts : the operating (control, decoding and execution) part which "computes", and the memory interface which contains buffers that are transferred in short but efficient bursts. The interface between the two parts can be more or less sophisticated but it is efficient when several simultaneous data streams are processed. The principle is simple : a PFQ models a buffer of several words that are accessed through a pair of user-visible registers. The size of the buffer depends on the implementation but does not matter much. One register (the "Data register" or Dx) contains the data pointed to by the address stored in the associated register (the "Address register" or Ax for the x'th PFQ). Data moves to and from main memory by accessing the A and D registers :
 - When data is written to the data register, a store cycle is started, using the pointer register for the address.
 - When the pointer register is changed, a load cycle is started and the corresponding data register is loaded with a new value.
This is basically the principle used on the CDC6600, with some modifications (the number and use of the register pairs are a bit different). There is no classical "load" or "store" instruction, and pointer arithmetic is pretty straightforward, even though it is rather unusual for people accustomed to classical CISC or RISC computers. Now, here are two important aspects of this principle :
 - a PFQ pair of registers can be used either for handling data or instructions. This means that a pointer can point to code or data. A jump is performed by writing to the pointer register or (preferred solution) by prefetching code and then changing the "current queue" (CQ). A branch instruction will simply copy the "contents" of a specified queue.
 - the pointer can be automatically updated when the corresponding data is accessed. The typical mode is auto-increment on write and auto-decrement on read, for implementing a stack. Other, more complex access patterns are of course possible but are unused in order to keep the core simple. The pointer update bits require several flag bits per PFQ and they are stored in the MSBs of the A registers, so they are saved and restored automatically between function calls or IRQs.
From a programming point of view, accessing a whole block is as simple as reading a register as many times as needed. The memory will try to prefetch as much data as possible, but if the main memory is not ready, the core will simply stall. The goal is to interleave as many instructions as possible between two PFQ accesses, and reduce the stall cycles. Fortunately, the VSP is slow enough that it is not hurt by "dirty code" as much as a more sophisticated and pipelined processor would be. If a 60MHz SDRAM chip is used with bursts of four 32-bit words and 2 cycles of latency, a 10MHz VSP will only wait one cycle.
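As a concrete illustration, here is a minimal behavioural sketch of one PFQ register pair in Python. It is NOT the actual implementation : memory is modelled as a flat word array, bursts, line buffers and stall cycles are ignored, and the update policy is reduced to a single signed step.

```python
class PFQ:
    """Behavioural model of one PreFetch Queue register pair (Ax, Dx)."""

    def __init__(self, memory, step=0):
        self.mem = memory   # word-addressed backing store (no bursts modelled)
        self.a = 0          # Ax : address register (a word index in this sketch)
        self.d = 0          # Dx : data register
        self.step = step    # +1 post-increment, -1 post-decrement, 0 no update

    def write_a(self, addr):
        # Writing the pointer starts a load cycle : D is refilled from memory.
        self.a = addr
        self.d = self.mem[self.a]

    def read_d(self):
        # Reading D consumes the prefetched word; the pointer then advances
        # and D is refilled (the real core would prefetch in the background).
        val = self.d
        self.a += self.step
        self.d = self.mem[self.a]
        return val

    def write_d(self, val):
        # Writing D starts a store cycle at the current pointer, then updates it.
        self.mem[self.a] = val
        self.a += self.step

# Streaming a block needs no load/store instructions, only register accesses :
mem = list(range(16))
q = PFQ(mem, step=1)     # post-increment queue
q.write_a(4)             # point into the block -> D now holds mem[4]
first, second = q.read_d(), q.read_d()
```

Note how scanning a block is just reading the same D register repeatedly, which is exactly the "communication-oriented" behaviour described above.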
However, note that the pointer update is performed in parallel with the current computation. The difference between post- and pre-update only applies to PFQ4, which is hardwired as a stack with a trick for performing the pre-update. The other queues can only post-increment, post-decrement or do nothing (2 bits).

Table 2 : pointer format

bit 0 : byte selection
 - It is always cleared when pointing to instructions because instructions are 16-bit or 32-bit wide. However, in the case of a trap or a severe event, it can contain the IRQ enable flag.
 - Only the instructions "LSB", "LZB" and "SB" can use the associated data register if this bit is set (otherwise there is an alignment trap).

bit 1 : half-word selection
 - Only the instructions "LSB", "LZB", "LSW", "LZW", "SB" and "SW" can use the associated data register if this bit is set (otherwise there is an alignment trap).
 - This also selects the "current instruction half-word" if this is the current queue.

These first two bits are "seen" only by the internal datapath, which has a simple 2-bit counter per PFQ. When this counter under- or overflows, it also triggers the increment of the next counter and Dx is loaded with the next word from the memory buffer.

bits 2 and 3 : word selection
This is another counter which selects one word out of four in the memory buffer's line associated with the PFQ. This is for the case where 4 words are stored per line; otherwise it would be "bits 2 to 4".

bits 4 to 27 : line selection
This is mapped to whatever structure, usually SDRAM and FLASH memories. When the word selection counter overflows, the line's address is incremented or decremented accordingly and a new burst is enqueued. Usually, prefetch is implemented by enqueuing a new request as soon as a line is used. The newly read line is stored in another line and double-buffering is implemented.

bits 28 and 29 : Current Queue (instruction pointers only)
The CQ bits indicate which queue is currently active.
This is saved when there is a trap or an interrupt, just like the IRQ enable in the LSB. This is valid only for instruction pointers.

bits 28 to 31 : increment bits (data pointers)
 bits 28, 29 : read update
 bits 30, 31 : write update
These fields have the following meaning :
 0 0 : no update (nop)
 0 1 : post-decrement
 1 0 : post-increment
 1 1 : pre-decrement -> allows stacks

As you can see, the format of the pointer reveals a lot of architectural choices :
 - traps and interrupts store a few bits in the unused parts of the pointers; this helps keep the response time short and spares some cycles here and there.
 - the memory hierarchy has 3 levels, each with its own management and strategies. The execution datapath is decoupled from the rest and each level handles data with a different granularity :
   - sub-words for the datapath,
   - lines of words (4 or 8) in the memory interface,
   - millions of lines in the main memory chip.
However, thanks to prefetching and double-buffering, it is relatively easy to reach high bandwidths when data is accessed linearly, for example.

Instruction format :

The instructions have 16 bits for the register-to-register form, called "RR" or "short instruction format", and 32 bits with a 16-bit sign-extended immediate, called "RRI" or "long instruction format". In fact, most instructions can use both forms, so the 16/32-bit format is simply indicated by the MSB of the instruction. Remark : to simplify things, just as on the plain old CDC6600, 32-bit (RRI) forms must be correctly aligned in a 4-byte word, otherwise the core will trap or show other signs of unhappiness. Code density is not wonderful but rather satisfying anyway, and the decoding logic is still quite simple, due to the absence of pipeline and hazard detection logic. Here is a description of the instruction word format :

Table 3:
bit 0 : Imm16 flag --> 16 bits follow.
bits 1,2 : instruction class number
bits 3,4 : Execution unit number
bits 5-7 : OP number
bits 8-11 : source/dest register number
bits 12-15 : source register number
bits 16-31 : (optional) immediate data

There is no operand size flag because this would make the PFQ more complex. This also saves a couple of useful bits but increases the coding complexity a bit, as specific instructions must handle bytes and half-words (at least these operations do not consume many resources). The register-register ("RR") form takes the 2 indicated registers as sources and writes the result to the first register (à la x86) to spare instruction bits. For example : add r1, r2 computes r1=r1+r2. However, the register-register-immediate ("RRI") form reads only the second source and the immediate data, and the result is put into the destination register. The immediate data "replaces" the first register source; for example : add r1, r2, 123 computes r1=r2+123 (the immediate operand is ALWAYS sign-extended so be VERY careful). There are 4 instruction "classes" :
 - 00 control
 - 01 ALU operations
 - 10 jump
 - 11 I/O registers
Each "class" can have a specific instruction format.

1) The simplest ones are the control instructions :
 NOP (encoded 0x0000)
 HALT (stop the core and wait for an IRQ)
 MOV (copy a register to another)
 GET (used for reading the configuration space, IRQ controller, DMA, ...)
 PUT (idem, for writes)
 RETI (but I'm not sure yet)
 CPQ : copy queue (loop entry)
 LOOP : copy a queue to another conditionally (same as a jump but the current IP is not saved; there are some 4 bits left to specify :
  - 2 bits -> PFQ
  - 2 bits -> condition (?))
The other opcodes are reserved for the future and will appear later.
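To make Table 3 more concrete, here is a hedged Python sketch of the field extraction. It assumes that bit 0 of the table is the MSB of the 16-bit halfword (the text says the Imm16 flag is the MSB of the instruction); the exact wiring may differ in the VHDL source.

```python
def decode(word32):
    """Split a 32-bit fetch word into the Table 3 fields (assumed layout)."""
    insn = (word32 >> 16) & 0xFFFF     # first halfword holds the instruction
    fields = {
        "imm16": (insn >> 15) & 0x1,   # bit 0   : long (RRI) form flag
        "class": (insn >> 13) & 0x3,   # bits 1,2: instruction class
        "eu":    (insn >> 11) & 0x3,   # bits 3,4: execution unit
        "op":    (insn >> 8)  & 0x7,   # bits 5-7: OP number
        "rd":    (insn >> 4)  & 0xF,   # bits 8-11 : source/dest register
        "rs":    insn         & 0xF,   # bits 12-15: source register
    }
    if fields["imm16"]:
        imm = word32 & 0xFFFF          # bits 16-31 : immediate halfword
        # the immediate is ALWAYS sign-extended, as the text warns
        fields["imm"] = imm - 0x10000 if imm & 0x8000 else imm
    return fields
```

Under this layout, the RR form `add r1, r2` uses rd as both source and destination (r1=r1+r2), while the RRI form `add r1, r2, 123` reads rs and the sign-extended immediate (r1=r2+123).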
2) The next class is the jump class :
 bits 11-12 : [Q] queue where to branch
 bit 10 : [L] "link" bit (push the next IP if set, emulates a subroutine call)
 bit 9 : [N] condition negation
 bit 8 : [Z/O] Zero or Odd (LSB)
The first register field [8..11] indicates the register where the next IP is written, usually D4 or D5 (stack top). The other register field [12..15] indicates the register that must be tested, either for zero or for the LSB. The immediate field gives a sign-extended 15-bit word that is XORed with the tested register. It defaults to zero if no immediate is given. These codes can be "encoded" as :

 bit : 10 9 8
        0 0 0  JZ   "jump if zero"
        0 0 1  JO   "jump if odd"
        0 1 0  JNZ  "jump if not zero"
        0 1 1  JEV  "jump if even"
        1 0 0  JLZ  "jump and link if zero"
        1 0 1  JLO  "jump and link if odd"
        1 1 0  JLNZ "jump and link if not zero"
        1 1 1  JLEV "jump and link if even"

The other flag is the 2-bit field that encodes which queue is used next.

*** !!! Important remark !!! ***
There are 3 ways to "jump" :
 - write to the Address register of the currently used queue (CQ)
 - modify the CQ
 - use the "jump" instruction
However, the last way is the only recommended one, even though the others can be enabled or disabled by a specific version of the core. This ensures that no problem will occur with timing (the other ways may create delay slots) and maximum performance is achieved through the use of "split branches" (the queue must first be prefetched, which gives some time to load the new instruction stream from slow memory). Another detail is that jump instructions are the only conditional instructions here. Unconditional jumps can be done by testing a register that is known not to be zero, or by testing the LSB of a pointer that is used for instructions (-> always zero because it is aligned on 2-byte boundaries at least). It is quite possible that "hacks" with the CQ will be rendered impossible by some hardware, or maybe a trap.
*** end of !!! Important remark !!!
***

Another instruction : "cpq" ("copy queue"), used for loops. Operands :
 - source queue (2 or 3 bits ?)
 - source register (counter)
 - limit (src reg or imm)

3) The next class of instructions is the "computational operations". There are 4 "units" that perform operations on the operands :
 00 [ASU] adder
 01 [SHL] shifter
 10 [ROP2] logic
 11 [IE] insert/extract fields
As you can see, it's as simple as a 4-to-1 MUX controlled by the current instruction word. The corresponding instructions are :

[ASU]
 add
 sub
 carry
 borrow
[SHL]
 SHR shift logical right
 SHL shift left
 SAR shift arithmetic right
 ROL
 ROR
 (mask generation ? bitfield ins/extract ?)
[ROP2]
 AND OR XOR NAND NOR XNOR ORN ANDN
[IE]
 LZB (load zero-extended byte)
 LSB (load sign-extended byte)
 LZW (load zero-extended word)
 LSW (load sign-extended word)
 SB (store byte)
 SW (store word)
 IHB (shift left 8 bits and insert a byte)
 IHW (shift left 16 bits and insert a word)

This last IE unit is meant to support pointer-oriented byte and word access to the PFQs, helping emulate "load" and "store" instructions. These instructions are here only to provide dynamic 16-bit and 8-bit accesses to and from a PFQ data register, though other registers can be used instead, with different results. This works as a byte and word "insert" (for store) and "extract" (for load) instruction that uses only 2 operands (the immediate field can be ignored) as well as the _implicit_ LSBs of the associated pointer. Because the data and address registers of a PFQ are linked, the address can be found easily and it can feed the alignment logic. Furthermore, if the prefetch queue in use has read or write increment/decrement enabled, the pointer can be updated on demand by the correct number of bytes. If a non-data register is used, then the "pointer" defaults to 0, so these instructions can be used for sign extension or other purposes.

4) The I/O register instructions :

This class is optional and appeared later in the design.
It further stresses the fact that the VSP is a microcontroller (a high-class one, but still a microcontroller). A new range of 32-bit registers is added :
 - 8 registers deal with data : the P registers
 - 8 registers control the direction of every bit : the R registers
Note : this space is not part of the context of a running program, so it will not be saved/restored upon IRQ. Besides the obvious "move to io" and "move from io" instructions (they do not accept immediate data), specific instructions can support hardwired bit-to-bit operations for bit-banging, or complex and fast interfaces (SPI, I²C ...). 8 ports provide 256 bits of I/O : that is enough to control many simple surrounding devices like LEDs, buttons, LCD screens, or, why not, an integrated FPGA or sea-of-gates for specific coprocessing functions. These registers are not in the SR space for several reasons : SRs only support GET and PUT, they can have a long latency and they are meant for slow or one-time device configuration. I/O as processed by microcontrollers needs more bandwidth (single-cycle operations) and more specific, direct processing. Enabling access to the main data pipeline can save precious cycles in time-critical functions, as well as precious program space for I/O intensive code.

There is a total of 5+8+29+2=44 opcodes. The whole opcode space is not used, otherwise the control logic's size could explode. But up to now, the instruction set is still nicely orthogonal, even though the SHL operations need a reversed operand order when the immediate form is used.

-----------------------------
Internal architecture
-----------------------------

The goal is to keep the processor as simple, as small and as power-saving as possible. Pipelining is not used (well, not yet), as it increases the complexity of the control logic and may add more silicon area for the pipeline barriers.
However, the core is split in two parts : one performs the operations and the instruction flow control, the other manages the memory accesses and buffers. These parts are separated by a set of registers which are the only timing barriers. At first glance, the operating core is simpler and easier to implement than the memory buffer, which is also the part that makes the VSP so interesting to design (data and control paths are common skills, but memory interfaces and buffer coherency are still some kind of black magic). Let's start with the easy and "user visible" portion of the core : the instruction decode and execute stage. It takes a whole clock cycle to read the input registers, decode and execute the current instruction and write the result back to the register. However, the control logic requires great care and multiplexing the signals can consume a lot of resources. Another important point is the instruction fetch and decode logic. The PFQ registers and pointers operate only on 32-bit words but the instructions can be 16-bit wide. However, the 2 LSBs of each pointer are stored inside the execution part of the core so that 8-bit and 16-bit operations can be performed. These LSBs are not seen outside of this half of the core but can be used transparently by all instructions. The instruction is loaded from one of 4 D-regs (D0 to D3) and one half of the word is selected as the current instruction, depending on the current pointer's LSBs. After this first selection, the instruction word indicates which registers to read and write, and selects the immediate operand when needed. This is performed by a simple multiplexer and sign extension of the immediate word. The multiplexer also selects the 2 LSBs of one of the A registers if a D register is accessed, in case there is an alignment instruction for the IE unit.
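The current-instruction selection described above can be sketched in a few lines of Python. Little-endian halfword order inside the 32-bit D register is an assumption of this sketch (memory is little-endian per the text, but the actual mux wiring may differ).

```python
def fetch(d_regs, cq, pointer):
    """Pick the current 16-bit instruction out of the current queue.

    d_regs  : the four instruction data registers D0..D3
    cq      : 2-bit Current Queue field selecting one of them
    pointer : active instruction pointer; bit 1 selects the halfword
    """
    word = d_regs[cq] & 0xFFFFFFFF
    if pointer & 2:                    # bit 1 set -> upper halfword
        return (word >> 16) & 0xFFFF
    return word & 0xFFFF               # bit 1 clear -> lower halfword
```

A long (RRI) instruction would then take its immediate from the other halfword of the same 32-bit word, which is one reason why it must be 4-byte aligned.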
Then come the 4 operation units and the read and write ports of the GET and PUT instructions (used to access other devices : I/O ports, the timers, the IRQ controller, the DMA, etc.). The 4 units operate in parallel and the result is selected according to the 2-bit EU field of the instruction. The GET and PUT instructions are a bit less constraining in timing and the result of GET overrides the EU multiplexer. The results are written back to the register that is selected by the instruction, after some decoding. Concerning the EUs :
 - The ASU is a classical 32-bit adder with the usual XOR and carry-in controlled by the instruction word, in order to perform SUB as well. There is another operation type that only generates the carry and borrow. There is no "status register" or "carry/borrow bit", or multiple write ports, so a specific instruction is the only solution.
 - The ROP2 unit works just as on F-CPU, except that there is no Combine or MUX mode. It's implemented with a simple 4->1 multiplexer per bit, and a small lookup table in front.
 - The SHL is probably the most complex unit but it's a classical barrel shifter that performs shifts and rotations. The only potential worry is its space and time. A couple of instructions can be added to support bit field insertion and extraction.
 - The IE unit shifts 8-bit and 16-bit words and possibly multiplexes them for emulating byte- and word-wide load and store instructions. The shift depends on the 2 least significant bits of the associated A register in the selected PFQ. If the source register is not a D register, the shift defaults to 0. The control logic must also update the pointers according to their (read/write inc/dec) settings and the size of the moved word. Note that there is also a "bypass" path used for MOV and the implicit pushes in the JLxx instructions.
Concerning the control logic, it's where most of the complexity hides.
Most fields of the instruction can be reused "as is" but some signals must be extracted from different points. First problem : data from the PFQ might not be ready (the transaction has been sent to the SDRAM but the program wants the result immediately). There must be a way to "stall" the core just as in a normal pipelined core. A stall occurs either when the instruction is not ready, or when one of the input operands is not. The latter can only be known once the instruction is fetched, which makes timing tighter. One obvious way to implement the "PFQ not ready" flag is to multiplex this flag (coming from the PFQ) along with the data, the result being fed to the stall logic. The two parts of the core have to communicate with handshake signals to indicate that the decode logic has updated a register, or that the memory interface has not finished transferring data. The stall signal is also used "in software" for the HALT instruction. Another problem : getting the next instruction. Here we are relieved from the burden of computing the "next IP" because it's done automatically by the memory interface unit, but the instruction must be chosen. The key is the instruction multiplexer and the second LSB of the current instruction pointer. The immediate field is simply extracted from the instruction word because it is allowed only when the instruction is correctly aligned. One optimisation to consider : the "next instruction" increment is either 0 (stop the core when an error occurred), 2 (when the current instruction is "short") or 4 (for a long instruction). But there is a big chance that a short instruction preceding a long one is simply a "nop" inserted to align the long instruction. So one optimisation is to detect a NOP in the odd position in parallel with the rest, in order to increment the IP by 4. Yet another key problem : the decoder must send the signal that a PFQ has been "touched" and must be updated.
This PFQ update signalling is probably the most complex problem because the pointers (and the data) must be refreshed if a PFQ data register is read or written, or if a pointer register is written to. Finally, the current queue must be advanced if the instruction word is exhausted, and this depends on the current instruction and whether it is stalled... The challenge is to decode these conditions as fast as possible so that the command can be sent to the memory stage soon enough and the data can be used in the next cycle, without creating a stall.

The "registers" have several implementations that depend on their purpose :
- The "normal" register set is a classical 2R1W array (though i believe it can be simply implemented with multiplexers).
- The "Ax" registers are also similar but :
  - the LSB is implemented as a counter or something more complex (see later)
  - any write triggers a read transaction
- The "Dx" registers can be read and written both from the memory controller and from the core. That one is also quite tough to design.

[... to be continued ...]

List of traps :
- fatal error
- invalid opcode (when the opcode is unknown)
- 1R1W in long instruction : happens when a register-to-register move is found in a long instruction (for example)
- unaligned instruction : when bit 15 of the instruction and bit 1 of IP are set, which should not happen (long instructions can't appear in the second half of the word). Maybe this could "open" another instruction set later.
- jump to middle of instruction : when bit 31 of the target instruction and bit 1 of IP are set, meaning that the jump lands in the middle of a long instruction
- unaligned pointer access : read or write to a Dx when the corresponding Ax has its LSB set.
- CQ not accessed through jump (??)
- invalid pointer : accessing a Dx register for which the corresponding Ax does not point to a valid address

maybe later :
- protection error (if protection is ever implemented)

and maybe (but separated from the rest) :
- reset
- (re)init task

[and more in the future]

The trap base address is determined by a SPR. Each entry is separated by a gap of 16 words (32 short instructions). The base address is aligned to a power of two, corresponding to the number of traps supported by the core. To manage this, a certain number of LSBs are not implemented (they read as zero) in the SPR, which is cleared after reset.

External Interrupts :

VSP can manage up to 32 interrupt sources, because the registers are 32 bits wide and that's already enough. First versions will implement 8 or 16 channels to keep the circuit small, but there is still some room left. A fully working embedded system can get by with 16 channels, though this may be "compressed" further down to 8 lines. Just like the traps, the IRQ routines are managed with a "base address SPR". All interrupts are individually prioritized. The priority encoder circuits can become quite large and maybe slow, so don't send 5MHz signals there :-)

Whether an interrupt is level-triggered or edge-triggered is an annoying detail, but its importance is reduced by the fact that "handshaking" is preferred, so the trigger type matters less. Maybe an edge could even be converted to a level with the help of a Set/Reset latch, if needed. Interrupts can be nested and masked, so a channel is inhibited when it is either in use or masked (these are 2 separate registers that are ORed together). The interrupt enable flag is stored on the stack with the return address, in the LSB (which is usually zero because instructions use 2 or 4 bytes).

When an interrupt occurs, the core saves the current instruction pointer on the stack (???) and starts executing a small block of code that is stored in on-chip SRAM, in order to decrease the response time.
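Two of the mechanisms above lend themselves to a short behavioural sketch : the trap entry address (16-word gaps, hardwired-zero LSBs in the base SPR) and the channel inhibition (the "in use" and "mask" registers simply ORed together). The function names and the 16-trap default are illustrative, not part of the spec.

```python
def trap_entry(base_spr, trap_number, num_traps=16):
    # Entries are 16 words (64 bytes) apart. The SPR's low bits are
    # hardwired to zero, so base and offset simply concatenate.
    gap = 16 * 4
    base = base_spr & ~(num_traps * gap - 1)   # unimplemented LSBs read as 0
    return base | (trap_number * gap)

def irq_pending(requests, in_use, masked):
    # A channel is inhibited when it is either in use or masked :
    # the two registers are ORed together, then veto the requests.
    return requests & ~(in_use | masked) & 0xFFFFFFFF
```

The result of irq_pending would feed the priority encoder, which picks the highest-priority remaining channel.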
If the IRQ entry point was in SDRAM or FLASH, there would be a potential conflict with ongoing transfers and the VSP would have to wait for its turn before using the shared memory bus.

Errata :
problem : define the conventions for word names and their sizes ; Christophe proposes "h" for "half-word", the file must be updated.
problem 2 : add a new version of the "branch" instructions that does not use the PFQ number but the register previously used for the "link", to indicate the branch target.
problem 3 : clock frequency difference with the memory controller.
problem 4 : put the PFQ flags in the MSBs of the pointers -> 256 MB addressable max.

PFQ priorities :

Imagine that CQ=0 and an instruction such as add D0, D0 is decoded : the effect is hard to predict and difficult to justify. So there must be some conventions on the core's behaviour.
* The CQ (current queue) gives priority to instruction fetch (over data reads and writes). The user can't modify A(cq) or D(cq). Reads work normally but the pointer update can't happen.
* Write has priority for pointer updates : if a read and a write occur on the same PFQ, the pointer is updated with the parameters of the write.
* If two source PFQs are identical, the update is performed only once. But this rule is not very meaningful because the case can't happen with the RR or RRI forms :
  - RR writes to this same queue and the write has priority.
  - RRI only reads one register so no priority is needed.
These rules could change in the future, adding more possibilities and better exploiting this domain. For example, accessing D(cq) could bring some useful data.

Multithreading :

So the old CDC design is haunting us. It can be resurrected in the VSP however :-) The architecture does not allow for a classic pipeline because the burden of checking hazards and bypasses is too high. The execution core is however split into 4 "stages", so there is a good potential for advanced techniques.
SMT is the simple way : 4 contexts made of 4 rolling sets of 4+12 registers can fit almost easily. The MOPS/MHz ratio is potentially quadrupled, compared to the roughly doubled performance of a simple pipelined core. Furthermore, the lack of "normal" registers would make coding more difficult.

A further extension of this idea considers the fact that some kind of prioritisation might need to appear, and must take the latencies (memory and GET/PUT) into account. A faulty thread also requires cleanup code. IRQs require the selection of a new thread. One answer is to define more than 4 physical thread contexts. The core then needs to select a new instruction cycle by cycle, based on factors like :
- is the instruction ready ?
- is it the most prioritised thread ?
- is there a free thread available to handle an incoming IRQ ?
- has the last instruction from this thread completed ?
This is getting rather complex and might extend the pipeline by one cycle, but instead of one processor running at 10MHz, we would get the equivalent of 5 or 6 CPUs running 16 threads.

Now, there are not many reasons to make something that complex now. There is the "fun factor" but few applications where so many threads are needed. Except maybe for a handheld gaming console, but i don't focus on this market.

--------------------------------------------------------------

VSP vs F-CPU

Both projects share some characteristics while they differ on others.
- F-CPU was started by other people than me, while VSP is my idea (though instilled by someone else's needs).
- Same license, same tools, but different targets.
- F-CPU is designed for infinite scalability, VSP is meant to stay in its small ballpark.
- Corollary : F-CPU is difficult and long to design, VSP should be easier and require only one guy (me).
- VSP's addressing range is limited to 28 bits (256MBytes) but F-CPU's is virtually unlimited.
- The instruction set designs are quite similar : most important opcodes are common and the "no status register" idea remains (even though it causes fewer problems here, but who knows).
- The instruction format is different : VSP has 2 forms while F-CPU enforces strict 32-bit instructions.
- VSP has 2/3-address instructions (RR and RRI) but F-CPU has more powerful ones : 2R2W and 3R1W.
- The F-CPU pipeline is designed for extreme performance but VSP is designed for frugality and simplicity.
- Hence the exception handling and scheduling : a headache for F-CPU, hardly a trouble for VSP.
- Same idea with the GET/PUT instructions and the SRs : complex and multicycle stuff is taken out of the execution core.
- Same "decoupled", bipolar architecture where the memory interface plays a critical role and is more challenging than the obvious execution core.
- VSP implements a radically different method to access memory, while F-CPU mimics a "classical CPU".
--------------------------------------------------------------