Short version : the VSP stands for "Very Simple Processor", but "Very Silly Processor" fits well too. It is a microcontroller core with 16/32-bit instructions and 16×32-bit registers, with emphasis on simplicity, small size and memory bandwidth.
It is also an experimental core that uses very unusual techniques. The VSP introduces and tests new methods in many domains : ISA design methodology, architecture, software development environment...
Being a microcontroller, there is no need for sophisticated instructions such as multiply/divide. The VSP's intended job is (hardware and environment) management : it moves data around a SoC, keeping other specialised IP blocks fed with data, answering user requests through a keyboard/LCD screen interface, and overseeing the system's health (power management and system configuration / hotswap).
So it's an embedded core. Target speed is in the range of 10M instructions per second with an old SiO2 process. Every instruction is single-cycle and not pipelined. It's rather simple, until you look at the memory interface, which (like in FC0) plays a critical role in the system's efficiency.
When there is no need for horsepower, implementing complex stuff is a waste of resources, time, money, efforts, silicon, energy...
So it naturally takes the barebone RISC route, with a very clean instruction format, but it further simplifies this by merging some of the instruction fetch mechanism into the data fetch machinery (or is it the reverse ???). Both are software controlled and this spares more instructions.
This comes at some price : it is furiously uncommon and might create new issues when programming, such as register usage through function calls. But you don't want to use C with the VSP anyway (unless the compiler is designed specifically for this architecture).
However, the design shines with its simplicity and low resource consumption, as you can see from the draft below. It has evolved a bit since, but not radically.
Because I needed something like this at one time (around the end of 2002). Well, now I don't need it anymore for a "real industrial project" but the idea sounded so coooool and there was no reason to stop, even if the original project was cancelled.
And because F-CPU development is almost stalled (huh .....). It is a good way to develop something fun that will be helpful later for FC0, which has many many implementation problems. I'll probably be able to solve them once I've put the VSP together.
Like other projects, you may be interested because it is fun, it is very instructive and contributes to the Free Hardware movement.
Another better reason is that it is weird and you may want to scratch itches here and there. Help yourself.
Still better reason : you may need something like that. ARM's license costs may annoy you, and other "free" cores don't suit your needs. In fact, most fall in the same category of the 16/32-bit small processors for low power and low-resource applications. But VSP has an edge when it is about memory bandwidth.
Finally, because i will reinvest all the efforts from VSP into F-CPU/FC0, you may want to see how F-CPU works (well, roughly) by examining a small-scale loosely related core.
The VSP is NOT a workstation processor meant to run UNIX (or whatever). It is NOT meant to have protected or paged memory and it is specifically designed to NOT run fast. You are even discouraged to make it fast or run *nux. If you want horsepower, try F-CPU.
If you're brave enough, you can attempt to create a C compiler for the VSP. Everyone has their own vices. If you want to waste time for nothing, port GCC. Learning C compiler structures would be faster, though !
VSP's architecture does not suit C nor GCC well. In fact it is designed to be programmed in assembly language, which is quite simple thanks to its barebone RISC background. Hey, it's a microcontroller, after all.
If you're feeling fit and lucky, you'll certainly bounce against weird problems like pointer comparisons (mask the 4 MSB !) or PFQ optimisations...
However, one side effect of VSP's structure is that it is quite easy to use stack-based languages, particularly FORTH, and maybe Java. Porting FORTH is probably the easiest entry point and the best suited environment for such a microcontroller. Everything is written mostly from scratch, with much room for improvement, no bloat, and flat access to all resources...
Another "solution" is the GNL project that I have also restarted at http://f-cpu.seul.org/whygee/gnl/ ; I will certainly "link" the online architecture simulator with the GNL code generation backend. Some early interface experiments are available here.
The VSP reuses some parts and concepts developed for FC0. It also inherits some common ideas developed in the 90's for my personal projects.
The instruction set is very orthogonal, making decoding particularly straightforward. Several opcodes may seem redundant or congruent, but eliminating them would make the decoder too complex.
Even more than in FC0, the instructions are so simple that they all take the same amount of time to complete, so there is no scheduling problem. The only possible points of stall are in the memory interface and the buffers (instruction or data not ready). Furthermore, because the core is not pipelined, invalid instructions are detected and handled without needing specific synchronisation circuitry.
The same separation of the configuration space (the Special Registers and the get/put instructions) from the memory space makes VSP and FC0 close cousins.
The VSP reuses the same development methodology and tools as F-CPU, not being targeted at a specific industrial process. Being smaller, VSP will fit in a cheaper FPGA, though ;-)
However, the VSP is not designed to be scalable, fast, or to run Linux. It will always remain 32-bit, running an ad-hoc monitoring program at around 10MHz. Heavy processing tasks are handled by adapted coprocessors.
Please note that I make a distinction between the execution core and the whole VSP "IP core" because the current analysis lacks the memory interface and controller, which is quite complex. The execution core runs almost separately from the memory controller thanks to small cache buffers that are not yet designed.
A look at the instruction set of a computer tells a lot about its structure, features and inherent complexity.
VSP was designed for economy and code compactness so a mixed 16-bit and 32-bit format was chosen. Well, in fact, "compactness" does not matter that much, and extreme code compression would make the instructions too complex and difficult to decode with as few logic gates as possible, which goes against the primary requirement of economy. Memory is dirt cheap today, and the bandwidth/speed ratio is comfortable. So simplicity (and orthogonality) is more important than feature creep.
There are no 16/32-bit "modes", only a single bit in the instruction that tells the decoder how to behave. And even then, the instruction is almost identical. The only difference between 16-bit and 32-bit modes is the 16-bit immediate field (often called Imm16). The rest of the instruction fits in 16 bits, including the 16/32-bit flag.
The instructions generally consist of the "immediate flag", followed by a 7-bit opcode and two 4-bit register addresses. So in 16-bit mode, the core is a "2-address machine", where one address points to both one source and the destination (think x86...). In the presence of the "immediate flag", a 3rd immediate operand is provided so the 2 register addresses point to one source and one destination.
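The field layout above can be sketched in a few lines of Python. The exact bit positions are an assumption (immediate flag in the MSB, then the opcode, then the two register addresses) ; only the field widths come from the text above.

```python
# Toy decoder for the VSP instruction format described above.
# ASSUMED layout : bit 15 = immediate flag, bits 14..8 = 7-bit opcode,
# bits 7..4 and 3..0 = the two 4-bit register addresses.

def decode(halfword, imm16=None):
    imm_flag = (halfword >> 15) & 1
    opcode   = (halfword >> 8) & 0x7F
    reg_a    = (halfword >> 4) & 0xF   # source
    reg_b    = halfword & 0xF          # source AND destination in short form
    if imm_flag and imm16 is None:
        raise ValueError("long form : the Imm16 field must follow")
    return {"imm": bool(imm_flag), "opcode": opcode,
            "src": reg_a, "dst": reg_b,
            "imm16": imm16 if imm_flag else None}

print(decode(0x1234))              # short form : 2-address instruction
print(decode(0x9234, imm16=123))   # long form : Imm16 is the 3rd operand
```

Note how the long form only differs by the flag and the trailing Imm16 field : the first 16 bits decode identically in both cases.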
The instructions are aligned on natural boundaries : short instructions are aligned on a 2-byte boundary and long instructions (with immediate data) are aligned on a 4-byte boundary.
This is quite efficient because most instructions benefit from both forms. Some exceptions will trigger an error, but most instructions that don't need Imm16 will silently ignore it (often for upward compatibility and alignment purposes, see below). There is also the case where the Imm/Reg flag is set to "immediate" while the instruction is in an odd position (unaligned). These errors are easily detected and may eventually open new opcodes in the far future.
The alignment requirement imposes that a padding NOP is sometimes (with 1/3 probability) inserted before a long instruction. Or the preceding short instruction can be extended to long form if it ignores the Imm16 field. This removes some benefits of the instruction format, unless unaligned long instructions become allowed later... That would make the core a bit more complex, but given today's memory sizes, it is less of a concern than 20 to 40 years ago.
Furthermore, a decoding mechanism skips the "padding nops" by detecting them at the odd locations, so the execution time is not impacted.
The opcode map shows that the opcodes are grouped into 16 groups of 8 (or less) closely related functions. Currently defined groups are :
The VSP handles integers only. The natural data format is the 32-bit word, used by most operations. Some operations can also process 16-bit half-words and 8-bit bytes (in the IE unit). Bit-granularity operations are provided by the ROP2 and SHL units.
OK so we have 16 register addresses but no load/store instruction nor relative branches, or even absolute jumps. That's where it starts getting weird. Beware, you are entering another dimension.
Only registers #12 to #15 are "normal registers". There is also a "scratch area" in the SR space, but that's all.
The 12 remaining registers are a set of 6 pairs of data/address registers. Each register in the pair is linked to the other : when the address register is written to, the data register is updated with the value (if any) of the 32-bit word stored at the new address. That's a "read" operation. The write is performed by writing to the data register while the address register points to the memory location that you want to update.
The address/data pair is often named a "queue" or "prefetch queue" ("Q" or "PFQ").
This is quite similar to how the CDC6600 Central Processing Unit worked in the 60's. Except that the VSP has another twist : the 4 MSB of the pointer contain "update flags" which indicate how to update the pointer after the data register is read or written. Depending on the configuration, the pointer can be post-incremented or post-decremented, or not changed at all (so the data register becomes a simple register).
Stacks are emulated with the 4th option : pre-increment. The address register of the pair thus becomes the "stack pointer" and the data register is the "stack top". Of course, the stack could extend down or up, at will, but having only a pre-increment option limits the direction somewhat...
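The pointer-update mechanics above can be modelled with a small Python sketch. The flag encoding in the 4 MSB is an assumption for illustration (0 = no update, 1 = post-increment, 2 = post-decrement, 3 = pre-increment/stack), as is the word-sized (+4) increment :

```python
ADDR_MASK = 0x0FFFFFFF          # 28-bit address space ; the 4 MSB hold the flags

class Queue:
    """Toy model of one address/data register pair ("PFQ")."""
    def __init__(self, mem, pointer):
        self.mem = mem           # word-addressed dict standing in for memory
        self.a = pointer         # address register, update flags in the 4 MSB
        self.d = mem.get(pointer & ADDR_MASK, 0)   # data register mirrors memory

    def _update(self, delta):
        addr = ((self.a & ADDR_MASK) + delta) & ADDR_MASK   # 28-bit wraparound
        self.a = (self.a & ~ADDR_MASK) | addr               # keep the flags
        self.d = self.mem.get(addr, 0)

    def read(self):
        value = self.d
        flags = self.a >> 28
        if flags == 1:
            self._update(4)      # post-increment (one 32-bit word)
        elif flags == 2:
            self._update(-4)     # post-decrement
        return value             # flags == 0 : plain register, no update

    def push(self, value):       # flags == 3 : pre-increment, stack emulation
        self._update(4)
        self.mem[self.a & ADDR_MASK] = value
        self.d = value           # the data register is the "stack top"

mem = {0x100: 11, 0x104: 22}
q = Queue(mem, (1 << 28) | 0x100)    # pointer at 100h, post-increment mode
print(q.read(), q.read())            # walks through memory : prints "11 22"
```

With flags set to "pre-increment", `push()` behaves like a stack whose pointer is the address register and whose top is the data register, as described above.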
This can be used on all six "queues" so up to six stacks can be used at the same time. Or almost so because in practice, one must fetch instructions somewhere, and in the VSP, guess where they come from ?
Instruction fetch is shared with data access. The decoder takes instructions out of one of the four first queues, leaving the two others to stacks. When the instruction stream is linear, a couple of queues are enough, so four queues are used for data moves. When the code becomes more complex, the four queues can contain pointers and instructions from several entry points in loops or functions, leaving one stack (or none) and one (or two) queue(s) for everyday data moves.
Out of the four 'instruction' queues, only one is used to fetch the current instruction at any time. It is indicated by the "CQ" (Current Queue) two-bit register.
Like F-CPU, jumps usually require the destination to be prepared first ; then the right destination is chosen/selected according to the outcome of a comparison instruction, which contains the candidate new queue number. If the test succeeds, the number of the new queue is copied into CQ. During the next cycle, the CQ will select the new instruction stream.
The difference with F-CPU is the use of an explicit queue number, while the FC0, which emulates a more "standard" architecture, needs to maintain lookup tables for associating a register (containing a pointer to the instruction) with an address in the cache memory. This is all simplified in VSP !
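The queue-select branch boils down to a conditional copy into CQ. A minimal sketch, assuming a "branch if zero" style test (the actual condition encodings are listed further down) :

```python
# Toy model of the VSP queue-select branch : a successful test copies
# the candidate queue number into the 2-bit CQ register.

def select_queue(cq, candidate, test_value):
    """Switch instruction streams only when the tested value is zero."""
    return candidate if test_value == 0 else cq

cq = 0                          # currently fetching from queue #0
cq = select_queue(cq, 2, 0)     # the comparison succeeds...
print(cq)                       # ...so the next cycle fetches from queue #2
```

No lookup table, no pointer-to-cache association : the queue number IS the branch target, which is the simplification over FC0 mentioned above.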
The VSP now provides 4 different conditional (and unconditional) groups of instructions, with different modes of operation :
On top of that, because of the absence of a condition code register (hence the impossibility to store a carry/borrow bit), the add and subtract instructions can skip 1, 2 or 3 half-words if the operation generated no carry or borrow.
The branches or skips can occur under some (homogeneous) conditions, usually requiring the read of a register.
Except for the unconditional case, the conditions can be negated, so each group of conditional instructions has 1+(2×3)=7 sub-functions.
The VSP can address 256MiB of memory.
The main reason comes from the 4 MSB of the data pointers that hold the auto-update flags. 32 bits − 4 = 28 bits for the addresses. This is well enough for embedded devices.
Programming with this processor requires a lot of caution to avoid accidental changes of the update flags. Alterations of the 4 MSB as the result of additions (for example) could trap the processor to signal a pointer overflow. This can also occur when auto-increment/decrement wraps around (but then, only 28-bit address calculations are performed).
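This is also why pointer comparisons need the masking mentioned earlier ("mask the 4 MSB !"). A minimal sketch :

```python
# Comparing two VSP pointers : the 4 flag bits must be masked out first.

ADDR_MASK = 0x0FFFFFFF

def same_address(p1, p2):
    """True when both pointers designate the same 28-bit address."""
    return (p1 & ADDR_MASK) == (p2 & ADDR_MASK)

print(same_address(0x10001BAD, 0x20001BAD))   # True : only the flags differ
print(same_address(0x10001BAD, 0x10001BAE))   # False : different addresses
```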
These 4 bits are less used in the instruction pointer, but the 2 MSB store the CQ when a trap occurs. The processor can thus restart at the right address with the right CQ, and needs only to save/restore 10 other registers (9 if it is smart) :
|0||CQ:IP||Address of the trapped instruction, with the CQ in the 2 MSB|
|1||A?||Address register of the 1st other instruction queue (order deduced from CQ)|
|2||A?||Address register of the 2nd other instruction queue|
|3||A?||Address register of the 3rd other instruction queue|
|4||A4||Queue #4's address register|
|5||A5||Queue #5's address register|
|6||R0||Standard register #0|
|7||R1||Standard register #1|
|8||R2||Standard register #2|
|9||R3||Standard register #3|
So you see how fast a context swap can be : because a queue's data just comes out of the memory, there is no need to save it. Only the pointer has to be saved, sparing six items from being transferred :-)
The catch is that the position of the 3 other queue numbers must be deduced from the indicated CQ, otherwise it would need 11 elements instead of 10. Well, that's just a few logic gates.
I have not yet determined how the address computations are performed. That might be inside or outside the execution core. Every cycle can perform 4 memory accesses (3 reads and 1 write, including instruction fetch) but only 3 pointer updates (convention says that writes have precedence over reads for pointer updates). This means that 3 26-bit adders are needed in parallel with the core.
For ease of design, i have split the core into four stages as shown below :
This does not mean, however, that the core is pipelinable, despite some potential. It's not even worth considering because it would bring little performance boost (50% ?) at the price of a lot of circuitry. The core works well, slowly but simply, without pipeline gates (here, the registers act as clock boundaries).
However, there might be another possibility later : transforming it into an SMT core (Simultaneous MultiThreading) to execute 2 or 4 threads, or more, concurrently. There, all the latencies are hidden from each other and interrupt latency might be greatly reduced. There are relatively few hazards to check and the performance boost is much better than with simple pipelining. Each thread runs as slowly as in a single-threaded processor, but four threads would be able to run at the same time, boosting the instructions per second rating. The only practical limitation would be the memory bandwidth, and some 32KB of on-chip cache or SRAM would be helpful.
But that's for waaayyy later.
The VSP can do things other processors can do. It's just done differently.
First thing to know : the CQ's value. It is inherent, implicit. It tells you which queue you're running on. Usually, when you start, it is set to 0 so you (generally) have Q1 to Q5 available for other purposes.
Without careful planning and smart resource allocation, the queues will easily be saturated, leaving no room for further execution. But hey, the VSP is a "small" microcontroller and if there were idle, unused or wasted resources, they would be removed ASAP. So bear with the tight constraints and remember that some people have written webservers on much, much, much more reduced processors.
So the register allocation is usually like this :
|0||A0||Queue #0's address register, default instruction pointer||Executable|
|1||D0||Queue #0's data register, default current instruction|
|2||A1||Queue #1's address register|
|3||D1||Queue #1's data register|
|4||A2||Queue #2's address register|
|5||D2||Queue #2's data register|
|6||A3||Queue #3's address register|
|7||D3||Queue #3's data register|
|8||A4||Queue #4's address register, stack pointer||Data-only|
|9||D4||Queue #4's data register, stack top|
|10||A5||Queue #5's address register, alternate stack pointer|
|11||D5||Queue #5's data register, alternate stack top|
|12||R0||Standard register #0||Static|
|13||R1||Standard register #1|
|14||R2||Standard register #2|
|15||R3||Standard register #3|
Having two programmable stacks makes it easy to implement the basic FORTH functions, though it was not the original intention. The first VSP draft included only 4 queues but that was certainly a limitation, so it was increased to 5 queues with the fifth being able to act as a stack.
However, static registers are not the most precious resource in a microcontroller, particularly when a queue can emulate one static register (you just have to point the address register to a scratch area). Complex or heavy algorithms could even have a virtually infinite number of registers when playing with the autoincrement features. So lately I decided to add a sixth queue.
Let's start easy. Computational examples here use the static registers so there is no side-effect. Furthermore, the operations use both the short and long instruction forms without raising many issues.
You can familiarize yourself and practice with the interactive Execution Units testbench or interactive assembler, and better understand the issues raised by the operand orders.
The first execution unit is ASU : the "Add and Subtract Unit".
add r0 r1      ; adds r0 and r1, then puts the result in r1
add r0 123 r1  ; adds r0 and immediate data 123, then puts the result in r1

Is that simple enough ?
Now if you want to perform multi-precision computations, the VSP has no carry flag nor dual-write capability. I used a special instruction to create the carry in a register but it was not very clever in a 2-address processor. So I recently (4/2007) changed this system to a simpler and more powerful one : the instruction skips 0 to 3 half-words when no carry is generated.
; Addition of R2 to the 64-bit value R0:R1
adds2 r2 r0  ; r0 = r0+r2
             ; the next instruction is skipped if no carry was generated
add 1 r1     ; carry : r1 = r1+1 (long form : 2 half-words)
The same works for subtraction too. The SUB instruction has 3 "skipping" versions that operate the same way. All the available opcodes are summarised below :
ADD    ADDition
ADDS1  ADD and Skip 1 half-word if no carry
ADDS2  ADD and Skip 2 half-words if no carry
ADDS3  ADD and Skip 3 half-words if no carry
SUB    SUBtract
SUBS1  SUBtract and Skip 1 half-word if no borrow
SUBS2  SUBtract and Skip 2 half-words if no borrow
SUBS3  SUBtract and Skip 3 half-words if no borrow
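The ADDS2/ADD pair used for 64-bit addition can be modelled in a few lines of Python (a toy model of the skip mechanism, not of the encoding) :

```python
MASK32 = 0xFFFFFFFF

def adds(a, b):
    """One 32-bit addition, returning the result and the carry out."""
    s = a + b
    return s & MASK32, s > MASK32

def add64(r0, r1, r2):
    """r0:r1 (low:high) += r2, the way the ADDS2/ADD pair works."""
    r0, carry = adds(r0, r2)       # adds2 r2 r0
    if carry:                      # the next instruction is skipped otherwise
        r1 = (r1 + 1) & MASK32     # add 1 r1 (long form : 2 half-words)
    return r0, r1

print(add64(0xFFFFFFFF, 0, 1))     # (0, 1) : the carry propagates
print(add64(5, 0, 1))              # (6, 0) : the increment was skipped
```

The "skip 2 half-words" variant is used here because the following `add 1 r1` is a long instruction, hence 2 half-words wide.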
This unit is called "ROP2" because it performs all useful Raster OPerations with 2 operands. 16 possibilities exist but several are congruent so 8 boolean operations are available :
Because of the core's limitations, there is no MUX instruction (unlike F-CPU, it must be emulated with three instructions and a couple of temporary registers).
Note : in the OPN operations (ANDN and ORN), the source that is inverted is the first one, not the one at the same address as the destination nor the immediate field (when any). This should make it easier to code real stuff. Notice how it impacts the order of the operands :
ANDN r1 r2      ; r2 = r2 & ~r1
ANDN r1 123 r2  ; r2 = 123 & ~r1
ORN r1 r2       ; r2 = r2 | ~r1
ORN r1 123 r2   ; r2 = 123 | ~r1
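The operand-order convention is easy to get wrong, so here is a small Python sketch of it (the function names mirror the mnemonics, nothing more) :

```python
MASK32 = 0xFFFFFFFF

def andn(src1, src2):
    """ANDN : the inverted source is the FIRST operand."""
    return (~src1 & MASK32) & src2

def orn(src1, src2):
    """ORN : same convention."""
    return (~src1 & MASK32) | src2

# ANDN r1 r2  ->  r2 = r2 & ~r1
r1, r2 = 0x0F, 0xFF
print(hex(andn(r1, r2)))   # 0xf0 : r2 with the 4 LSB cleared
```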
The "shuffle" unit (SHL) moves bits around the register. It's a very stripped-down version of the corresponding F-CPU unit and it does only rotations and shifts on 32-bit data. These are the 5 basic, unavoidable operations :
SHR  logic SHift Right
SAR  SHift Arithmetic Right
SHL  logic SHift Left
ROL  ROtate Left
ROR  ROtate Right
Note : the ROR and ROL instructions are congruent, but having a single ROT instruction would create an ambiguity : should ROT be ROR or ROL for the first 15 positions ? The assembler can emulate ROL with ROR (and vice versa) by negating the immediate operand. However, this is a more complex problem when the "shift amount" is given by a register (it impacts the algorithm).
At the binary level, the order of the operands reflects the architectural constraints. In order to keep things useful and practical, the assembler hides these details (remember that the destination/result register is always the last operand). Here comes an example of the syntax :
SHR r1 r2     ; r2 = r2 >> r1
SHR r1 12 r2  ; r2 = r1 >> 12 (the immediate could be larger but only the 5 LSB are used)
The VSP contains no load/store unit and treats only 32-bit words. The Insert/Extract unit eases access to 8-bit and 16-bit quantities by shifting words appropriately, with direct communication with the PFQs' pointers. It is thus possible to have the equivalent of "load" and "store" operations, with the added benefit of pointers that are auto-incremented with the right values. And maybe more in the future.
LSB  Load Sign-extended Byte (and inc ptr)
LZB  Load Zero-extended Byte (and inc ptr)
LSH  Load Sign-extended 16-bit Half-word (and inc ptr)
LZH  Load Zero-extended 16-bit Half-word (and inc ptr)
SB   Store Byte (and inc ptr)
SH   Store 16-bit Half-word (and inc ptr)
SHH  Store 16-bit Half-word High (shift it and ignore the pointer)
MOV  Copy the register or the sign-extended Imm16 field to the destination register.
The shift unit can load bytes from any position. However, because the unit can't cross word boundaries, it can't shift 16-bit half-words to any position (only offsets 0, 1 and 2 are possible). A trap should be triggered if a pointer offset of 3 is found.
The MOV instruction was moved here because the IE instructions are the most similar.
"Load" specifics :
These operations extract one byte or one half-word from a given register. The data is shifted right, according to the implicit pointer associated with this register. If the register is a static register, then the offset is zero. If it is the data register of a PFQ, the 2 LSB of the associated pointer are used as the offset.
; D4 = 12345678h
; A4 = 00001BADh (offset : A4 & 3 = 1)
LSB D4 R1  ; => R1=56h, A4+=1
; now : A4 = 00001BAEh (offset : A4 & 3 = 2)
LZB D4 R2  ; => R2=34h, A4+=1
Note that the long instruction form is not used, because the added immediate is useless. It may be used in the future to extend the offset, using the ASU in parallel to compute a new pointer. But it's too early now.
Note also that the offset in question is only the 2 LSB of said pointer. When an overflow occurs, the PFQ hardware will increment its own counters to provide the next/previous word from memory. And the pointer's inc/dec flags must also be taken into account...
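The extraction itself is easy to model. A Python sketch, assuming little-endian byte order (as stated later for BSWAP) and the offset coming from the 2 LSB of the pointer :

```python
# Toy model of the LZB/LSB byte extraction from a 32-bit word.

def lzb(data, offset):
    """LZB : Load Zero-extended Byte."""
    return (data >> (8 * offset)) & 0xFF

def lsb(data, offset):
    """LSB : same extraction, then sign extension."""
    b = lzb(data, offset)
    return b - 0x100 if b & 0x80 else b

d4, a4 = 0x12345678, 0x00001BAD
print(hex(lzb(d4, a4 & 3)))        # 0x56, as in the example above
```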
"Store" specifics :
This is quite similar to the "load" instructions, except that here, the immediate field makes sense in the long instruction form. But then, we have an excess register, or (the other way around) we can't use the extended pointer increment. The chosen approach (today) uses the same form as the load : either the 2nd operand is used for the stored data (in the short/RR form), or the immediate field is used (and the 2nd operand is left unused, so it is not written).
; D4 = 12345678h
; A4 = 00001BADh (offset : A4 & 3 = 1)
; R1 = 9ABCDEF0h
SB R1 D4     ; => D4=1234F078h, A4+=1
; now : A4 = 00001BAEh (offset : A4 & 3 = 2)
SH R1 D4     ; => D4=DEF0F078h, A4+=2
; now : A4 = 00001BB0h (offset : A4 & 3 = 0), a new word is loaded in D4=89ABCDEFh
SB 0123h D4  ; => D4=89ABCD23h, A4+=1
; now : A4 = 00001BB1h (offset : A4 & 3 = 1)
SH 4567h D4  ; => D4=89456723h, A4+=2
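The insertion is the mirror operation of the extraction. A Python sketch, with the same little-endian and pointer-offset assumptions as for the loads :

```python
# Toy model of the SB/SH insertion into a 32-bit word.
MASK32 = 0xFFFFFFFF

def sb(word, value, offset):
    """SB : insert the 8 LSB of value at the given byte offset."""
    shift = 8 * offset
    return (word & ~(0xFF << shift) & MASK32) | ((value & 0xFF) << shift)

def sh(word, value, offset):
    """SH : same with 16 bits ; offset 3 would cross the word boundary."""
    assert offset <= 2, "offset 3 should trap"
    shift = 8 * offset
    return (word & ~(0xFFFF << shift) & MASK32) | ((value & 0xFFFF) << shift)

d4 = sb(0x12345678, 0x9ABCDEF0, 1)   # first line of the example above
print(hex(d4))                       # 0x1234f078
```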
The "Store Half-word High" instruction is derived from "Store Half-word", but without checking/using the pointer : it unconditionally shifts the imm/reg's LSB by 16 bits to replace the destination's MSB. The main use is loading 32-bit immediate data into a register, when preceded by a simple "SH" :
SH 5678h R1   ; => R1=00005678h
SHH 1234h R1  ; => R1=12345678h
Note that SH must come before SHH because SH sign-extends Imm16. The MSB must be corrected by the following SHH :
SH 89ABh R1   ; => R1=FFFF89ABh (constant is sign-extended)
SHH CDEFh R1  ; => R1=CDEF89ABh
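The SH/SHH constant-building trick can be checked with a short Python sketch (offset 0 assumed for the SH, as in the examples above) :

```python
# Toy model of building a 32-bit constant with SH then SHH.
MASK32 = 0xFFFFFFFF

def sh_imm(imm16):
    """SH with an immediate into a fresh register : Imm16 is sign-extended."""
    return (imm16 - 0x10000) & MASK32 if imm16 & 0x8000 else imm16

def shh_imm(reg, imm16):
    """SHH : shift Imm16 by 16 bits and replace the destination's MSB."""
    return ((imm16 & 0xFFFF) << 16) | (reg & 0xFFFF)

r1 = sh_imm(0x89AB)
print(hex(r1))               # 0xffff89ab (sign-extended)
r1 = shh_imm(r1, 0xCDEF)
print(hex(r1))               # 0xcdef89ab
```

This makes it clear why SHH must come second : it overwrites the FFFFh left behind by the sign extension.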
Not all opcodes are used up by the above descriptions. With a little added hardware, it is possible to perform several other operations :
EXPND  the 4 LSB of the first operand are "expanded" to 4 byte masks (0 or FFh) and ANDed with the 2nd operand
MATCH  each byte in src1 is checked for equality with the corresponding byte in src2, creating a bitfield
BMASK  each byte in src1 is checked for equality with the corresponding byte in src2, creating a byte mask
BSWAP  reverse the word's endianness
Because the VSP is meant to manage byte streams, it must be able to scan through them. A specific operation is provided that detects byte patterns : the MATCH instruction XORs both operands and ANDs the resulting 4 bytes, generating a 4-bit field. This can be used in the detection of byte patterns, the loop running while the result is zero. When it becomes non-zero, the bitfield is useful as an index for computed jumps or calls, to functions that deal with alignment for example.
BMASK is similar but creates a byte mask, instead of a bit field. Like the previous instruction, the immediate field can be used as input for the XOR.
EXPND "expands" the 4 LSB of the first operand to create a byte mask, too. The result is ANDed with the 2nd operand (register or immediate) to add some flexibility. This is the kind of operation that is useful for bitmap graphics, like writing a bitmap font to a byte-map raster, for example.
Do I need to explain why BSWAP is useful ? The VSP is a little-endian machine and might appreciate communicating with other "kinds" of computers. Note that the immediate field is useless for this instruction, so it is simply ignored.
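Toy versions of these four byte operations, in Python. One assumption : a set bit (or FFh byte) marks EQUAL byte pairs, so that a scanning loop runs while the result is zero, as described above.

```python
def _bytes(w):
    """The 4 bytes of a 32-bit word, little-endian order."""
    return [(w >> (8 * i)) & 0xFF for i in range(4)]

def match(src1, src2):
    """MATCH : one bit per equal byte pair."""
    return sum(1 << i for i, (a, b)
               in enumerate(zip(_bytes(src1), _bytes(src2))) if a == b)

def bmask(src1, src2):
    """BMASK : same test, producing a byte mask (00h or FFh)."""
    return sum(0xFF << (8 * i) for i, (a, b)
               in enumerate(zip(_bytes(src1), _bytes(src2))) if a == b)

def expnd(src1, src2):
    """EXPND : the 4 LSB of src1 become byte masks, ANDed with src2."""
    mask = sum(0xFF << (8 * i) for i in range(4) if (src1 >> i) & 1)
    return mask & src2

def bswap(w):
    """BSWAP : reverse the word's endianness."""
    b = _bytes(w)
    return (b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3]

# Scanning a word for the byte 20h (ASCII space) :
print(bin(match(0x20202020, 0x41422043)))   # 0b10 : byte #1 matches
print(hex(bswap(0x12345678)))               # 0x78563412
```

The non-zero MATCH result can then feed a computed jump, exactly as the scanning-loop description suggests.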
Four groups of instructions provide the developer with different granularities of instruction flow control.
The operands and fields are :
|bit 4||bit 3||mode|
|bit 7||bit 6||type|
|1||0||Odd (Not Even) [LSB]|
|CMOV||destination register (written)|
|JMP||register containing the target address (read)|
|SKIP||skip length (1 to 16 half-words)|
|Q||target queue (bits 8 and 9 only)|
The different combinations create the 28 following opcodes :
JMP, JZ, JNZ, JO, JNO, JS, JNS,
SKIP, SZ, SNZ, SO, SNO, SS, SNS,
Q, QZ, QNZ, QO, QNO, QS, QNS,