And now you may ask :
Short version : VSP stands for "Very Simple Processor", but "Very Silly Processor" fits well too. It is a microcontroller core with 16/32-bit instructions and sixteen 32-bit registers, with an emphasis on memory bandwidth and on easy, fast interrupt handling and context switching.
Being a microcontroller, there is no need for sophisticated instructions such as multiply/divide. The main job is to move data around a SoC, keeping other specialised IP blocks fed with data, answering user requests through a keyboard/LCD screen interface, and managing the system's health (power management and system configuration / hotswap).
So it's an embedded core. Target speed is in the range of 10M instructions per second on a moderately cheap process. Every instruction is single-cycle and not pipelined. It's rather simple, until you look at the memory interface, which (like in FC0) plays a critical role in the system's efficiency.
When there is no need for horsepower, implementing complex stuff is a waste of resources, time, money, effort, silicon, energy...
So it naturally takes the barebone RISC route, with a very clean instruction format, but it further simplifies this by merging some of the instruction fetch mechanism into the data fetch machinery. Both are software controlled and this spares more instructions.
This comes at some price : it is furiously uncommon and might create new issues when programming, such as register usage across function calls. But you don't want to use C with VSP (unless the compiler is designed specifically for this architecture).
Because I needed something like this at one time (around the end of 2002).
Well, now I don't need it anymore for a "practical project", but the idea sounded so coooool that there was no reason to stop, even if the original reasons were cancelled.
And because F-CPU development is almost stalled (huh .....) it is a good way to develop something fun and helpful later for FC0, which has many, many implementation problems. I'll probably be able to solve them once I've put VSP together.
Like other projects, you may be interested because it is fun, it is very instructive and contributes to the Free Hardware movement.
Another better reason is that it is weird and you may want to scratch itches here and there. Help yourself.
Still better reason : you may need something like that. ARM's license costs may annoy you, and other "free" cores don't suit your needs. In fact, most of them fall into the same category of small 16/32-bit processors for low-power and low-resource applications. But VSP has an edge when it comes to memory bandwidth.
Finally, because I will reinvest all the efforts from VSP into F-CPU/FC0, you may want to see how F-CPU works (well, roughly) by examining a small-scale, loosely related core.
This is a hard fact : the VSP is NOT a workstation processor meant to run UNIX (or whatever). It is NOT meant to have protected or paged memory and it is specifically designed to NOT run fast. You are even discouraged from making it fast or running *nux on it.
If you're brave enough, you can attempt to create a C compiler for VSP. Everyone has their own vices. If you want to lose time for nothing, port GCC.
VSP's architecture suits neither C nor GCC well. In fact it is designed to be programmed in assembly language, which is quite simple thanks to its barebone RISC background. Hey, it's a microcontroller, after all.
If you're feeling fit and lucky, you'll certainly bounce against weird problems like pointer comparisons (mask the 4 MSB !) or PFQ optimisations...
However, one side effect of VSP's structure is that it is quite easy to use stack-based languages, particularly FORTH, and maybe JAVA. Porting FORTH is probably the easiest entry point and the best-suited environment for such a microcontroller : everything is written mostly from scratch, with much room for improvement, no bloat, flat access to all resources...
Another "solution" is the GNL project that I have also restarted at http://f-cpu.seul.org/whygee/gnl/. I will certainly "link" the online architecture simulator with the GNL code generation backend.
The VSP reuses some parts and concepts developed for FC0. It also inherits some common ideas developed in the 90's for my personal meditations and projects.
The instruction set is very orthogonal, making decoding particularly straightforward.
Even more than in FC0, instructions are so simple that they all take the same amount of time to complete, so there is no scheduling problem.
The same separation of the configuration space (the Special Registers and the get/put instructions) from the memory space makes VSP and FC0 close cousins.
The VSP reuses the same development methodology and tools as F-CPU, not being targeted at a specific industrial process. Being smaller, VSP will fit in a cheaper FPGA, though ;-)
However, the VSP is not designed to be scalable or fast, or to run Linux. It will always remain 32-bit, running an ad-hoc monitoring program at 10MHz.
Here I make a distinction between the execution core and the whole VSP "IP core" because the analysis lacks the memory interface and controller, which is more complex. The execution core runs almost separately from the memory controller thanks to small cache buffers.
A look at the instruction set of a computer tells a lot about its structure, features and inherent complexity.
VSP was designed for economy and code compactness so a mixed 16-bit and 32-bit format was chosen. Well, in fact, "compactness" does not matter that much, and extreme code compression would make the instructions too complex and difficult to decode with as few logic gates as possible, which goes against the primary requirement of economy.
There are no 16/32-bit "modes", only a single bit in the instruction that tells the decoder how to behave. And even then, the instruction is almost identical.
The only difference between the 16-bit and 32-bit forms is the 16-bit immediate field. The rest of the instruction fits in 16 bits, including the 16/32-bit flag.
The instruction consists of the "immediate flag", followed by a 7-bit opcode and two 4-bit register addresses. So in the 16-bit form, the core is a "2-address machine", where one address points to both one source and the destination (think x86...). When the "immediate flag" is set, a 3rd, immediate operand is provided, so the 2 register addresses point to one source and one destination.
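To make the format concrete, here is a minimal Python sketch of a decoder for it. The field order follows the description above (flag, opcode, two register addresses), but the exact bit positions and the names `Decoded`, `decode`, `reg_a` and `reg_b` are assumptions made for illustration, not something taken from the VSP sources.

```python
from typing import NamedTuple, Optional


class Decoded(NamedTuple):
    has_imm: bool        # the "immediate flag"
    opcode: int          # 7-bit opcode
    reg_a: int           # first 4-bit register address
    reg_b: int           # second 4-bit register address
    imm: Optional[int]   # 16-bit immediate, present only in the long form


def decode(half: int, imm_half: Optional[int] = None) -> Decoded:
    """Decode one instruction.

    Assumed layout (hypothetical): bit 15 = immediate flag,
    bits 14..8 = opcode, bits 7..4 = reg_a, bits 3..0 = reg_b.
    """
    if not 0 <= half <= 0xFFFF:
        raise ValueError("the first halfword must fit in 16 bits")
    has_imm = bool(half >> 15)
    opcode = (half >> 8) & 0x7F
    reg_a = (half >> 4) & 0xF
    reg_b = half & 0xF
    if has_imm:
        if imm_half is None:
            raise ValueError("a long instruction needs a 16-bit immediate")
        return Decoded(True, opcode, reg_a, reg_b, imm_half & 0xFFFF)
    return Decoded(False, opcode, reg_a, reg_b, None)
```

Whatever the real bit positions turn out to be, the point is that both forms share the same 16-bit decoding path and the flag only decides whether a second halfword is consumed.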
The instructions are aligned on natural boundaries : short instructions are aligned on a 2-byte boundary and long instructions (with immediate data) are aligned on a 4-byte boundary.
This is quite efficient because most instructions benefit from the dual forms. There are a few exceptions that will trigger an error. There is also the case of the flag set to "immediate" when the instruction is in an odd position. These errors are easily detected and may eventually, maybe, open new opcodes in the far future.
The alignment requirement imposes that a padding NOP is present before a long instruction that would otherwise be misaligned. This removes some benefits from the instruction format, unless unaligned long instructions become allowed later... This would make the core a bit more complex but given today's memory sizes, it is less of a concern than 20 years ago.
Furthermore, a decoding mechanism skips the "padding nops" by detecting them at the odd locations, so the execution time is not impacted.
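The alignment rule and the padding NOPs can be sketched as a tiny layout pass, the kind of thing an assembler would do. This is only an illustration under the stated rules (2-byte short instructions, 4-byte long instructions on 4-byte boundaries); the function name `layout` and the `"pad-nop"` label are hypothetical.

```python
def layout(kinds):
    """Assign byte offsets to a list of "short" / "long" instructions.

    A 2-byte padding NOP is emitted before any long instruction that
    would otherwise start on an odd halfword position.
    """
    placed = []
    offset = 0
    for kind in kinds:
        if kind not in ("short", "long"):
            raise ValueError("unknown instruction kind: %r" % (kind,))
        if kind == "long" and offset % 4 != 0:
            placed.append((offset, "pad-nop"))
            offset += 2
        placed.append((offset, kind))
        offset += 2 if kind == "short" else 4
    return placed
```

For example, `layout(["short", "long"])` places the short instruction at offset 0, a padding NOP at offset 2 and the long instruction at offset 4. In this model a padding NOP always sits in the odd halfword of a 32-bit word, which is where the decoder is said to detect and skip it.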
What is missing from this description ? Well, there are 4 groups of opcodes : one deals with the usual ALU operations on 32-bit numbers, another performs the jumps, and a third does the rest. The last group is still undefined and will probably be used for looping.
The ALU group is divided into four units, each with eight possible operations. There are the usual ASU, SHL and ROP2 (taken and stripped down from FC0) and another unit performing short and char insertions and extractions from words (because the core only operates on 32-bit values, thus sparing precious bits in the opcode).
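As an illustration of what the insertion/extraction unit has to do, here is a small Python sketch of char (8-bit) and short (16-bit) field operations on a 32-bit word. The numbering of fields (index 0 is the least significant one) and the zero-extension on extraction are assumptions; the text does not specify them.

```python
MASK32 = 0xFFFFFFFF


def extract(word, index, width):
    """Return the zero-extended field of `width` bits (8 or 16) at `index`."""
    if width not in (8, 16) or not 0 <= index < 32 // width:
        raise ValueError("bad field selection")
    shift = index * width
    return (word >> shift) & ((1 << width) - 1)


def insert(word, index, width, value):
    """Return `word` with the selected field replaced by `value`."""
    if width not in (8, 16) or not 0 <= index < 32 // width:
        raise ValueError("bad field selection")
    shift = index * width
    field = ((1 << width) - 1) << shift
    return ((word & ~field) | ((value << shift) & field)) & MASK32
```

With these two operations, byte and halfword data can be handled even though every register and every memory access is 32 bits wide.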
OK so we have 16 register addresses but no load/store instructions, no relative branches, not even absolute jumps. That's where it gets weird. Beware, you are entering another dimension.
Only registers #12 to #15 are "normal registers". There is probably also a "scratch area" in the SR space, but that's all.
The remaining registers form a set of 6 pairs of data/address registers. Each register in the pair is linked to the other : when the address register is written to, the data register is updated with the value (if any) of the 32-bit word stored at the new address. That's a "read" operation. A write is performed by writing to the data register while the address register points to the memory location that you want to update.
The address/data pair is also often named a "queue".
This is quite similar to how the CDC6600 Central Processing Unit worked in the 60's. Except that the VSP has another twist : the 4 MSB of the pointer contain "update flags" which indicate how to update the pointer after the data register is read or written. Depending on the configuration, the pointer can be post-incremented or post-decremented, or not changed at all (so the data register becomes a simple register).
Stacks are emulated with the 4th option : pre-increment. The address register of the pair thus becomes the "stack pointer" and the data register is the "stack top". Of course, the stack could extend down or up, at will, but having only a pre-increment option limits the direction somewhat...
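The behaviour of one address/data pair can be modelled in a few lines of Python. This is a rough sketch under several assumptions that are not in the text : the numeric encoding of the update flags (`NONE`, `POST_INC`, `POST_DEC`), a pointer step of 4 bytes (one 32-bit word), and silent wrap-around inside the 28-bit address. The pre-increment (stack) option is left out because its exact read/write behaviour is not detailed here.

```python
ADDR_BITS = 28
ADDR_MASK = (1 << ADDR_BITS) - 1

# Hypothetical flag values; the real encoding is not given in the text.
NONE, POST_INC, POST_DEC = 0, 1, 2


class Queue:
    """One address/data register pair on top of a dict used as memory."""

    def __init__(self, memory, step=4):
        self.memory = memory   # maps byte address -> 32-bit word
        self.step = step       # assumed pointer step: one 32-bit word
        self.pointer = 0       # 4 flag bits (MSB) + 28 address bits
        self.data = 0

    @property
    def address(self):
        return self.pointer & ADDR_MASK

    @property
    def mode(self):
        return self.pointer >> ADDR_BITS

    def _load(self):
        # Missing locations read as 0 in this model.
        self.data = self.memory.get(self.address, 0)

    def _update(self):
        if self.mode == POST_INC:
            delta = self.step
        elif self.mode == POST_DEC:
            delta = -self.step
        else:
            delta = 0
        new_address = (self.address + delta) & ADDR_MASK  # wraps silently here
        self.pointer = (self.mode << ADDR_BITS) | new_address
        self._load()

    def write_address(self, value):
        """Writing the address register triggers a read of the new location."""
        self.pointer = value & 0xFFFFFFFF
        self._load()

    def read_data(self):
        value = self.data
        self._update()
        return value

    def write_data(self, value):
        self.memory[self.address] = value & 0xFFFFFFFF
        self._update()
```

With the flags set to `NONE`, the pair behaves like a plain register backed by one memory word; with `POST_INC`, successive reads of the data register walk through memory without any explicit load instruction.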
This can be used on all six "queues" so up to six stacks can be used at the same time. Or almost so because in practice, one must fetch instructions somewhere, and in the VSP, guess where they come from ?
Instruction fetch is shared with data access. The decoder takes instructions out of the first four queues, leaving the two others to stacks. When the instruction stream is linear, a couple of queues are enough, so four queues are used for data moves. When the code becomes more complex, the four queues can contain pointers and instructions from several entry points in loops or functions, leaving one stack (or none) and one (or two) queue(s) for everyday data moves.
Out of the four 'instruction' queues, only one is used to fetch the current instruction at any time. It is indicated by the "CQ" (Current Queue) two-bit register.
Like F-CPU, jumps require the destination to be prepared; the right destination is then selected according to the outcome of a comparison instruction, which contains the candidate new queue. If the test succeeds, the number of the new queue is copied into CQ. During the next cycle, the CQ will choose the new instruction stream :-D
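The jump mechanism boils down to a conditional update of a two-bit register. The following Python sketch shows just that; the class name `Sequencer` and the method `conditional_switch` are hypothetical, and the comparison itself is reduced to a boolean.

```python
class Sequencer:
    """Tracks which of the four instruction queues feeds the decoder."""

    def __init__(self):
        self.cq = 0  # two-bit Current Queue register

    def conditional_switch(self, condition, candidate):
        """Copy `candidate` into CQ when the comparison succeeded."""
        if not 0 <= candidate <= 3:
            raise ValueError("only queues 0 to 3 can hold instructions")
        if condition:
            self.cq = candidate
        return self.cq
```

A taken branch is therefore a two-step affair : some earlier instruction points the candidate queue's address register at the target, and the comparison later selects that queue. A branch that is not taken simply keeps fetching from the current queue.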
The difference with F-CPU is the use of an explicit queue number, while the FC0, which emulates a more "standard" architecture, needs to maintain lookup tables for associating a register (containing a pointer to the instruction) with an address in the cache memory. This is all simplified in VSP :-)
The VSP can address 256MiB of memory.
The main reason comes from the 4 MSB of the data pointers that hold the auto-update flag. 32 bits - 4 = 28 bits for the addresses. This is well enough for today's handhelds.
Programming with this processor requires a lot of caution to avoid accidental changes to the update flags. Alterations of the 4 MSB as the result of additions (for example) could trap the processor to signal a pointer overflow. This can also occur when auto-inc/decrement wraps around (but then, only 28-bit address calculations are performed).
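The arithmetic behind these constraints is short enough to write down. The Python sketch below checks the 256MiB figure, compares two pointers with their 4 MSB masked off, and models the pointer-overflow trap as an exception. The helper names (`same_address`, `pointer_add`, `PointerOverflow`) are made up for the example, and the trap condition is an interpretation of the text : any addition that changes the flag bits is refused.

```python
ADDR_BITS = 28
ADDR_MASK = (1 << ADDR_BITS) - 1   # 28 address bits -> 2**28 bytes = 256 MiB
FLAG_MASK = 0xF << ADDR_BITS       # the 4 MSB holding the update flags


class PointerOverflow(Exception):
    """Stands in for the trap raised on a pointer overflow."""


def same_address(p, q):
    """Compare two pointers while ignoring their update flags."""
    return (p & ADDR_MASK) == (q & ADDR_MASK)


def pointer_add(pointer, offset):
    """Add an offset to a pointer, trapping if the flag bits would change."""
    total = (pointer + offset) & 0xFFFFFFFF
    if (total & FLAG_MASK) != (pointer & FLAG_MASK):
        raise PointerOverflow(hex(total))
    return total
```

Two pointers to the same word with different update flags compare as different raw 32-bit values, which is why a comparison has to mask the 4 MSB first.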
These 4 bits are less used in the instruction pointer, but the 2 MSB store the CQ when a trap occurs. The processor can thus restart at the right address with the right CQ, and needs only to save/restore 10 other registers (9 if it is smart) :
Context buffer
Pos. | Data | Description |
0 | CQ|IP | Address of the instruction with the CQ |
1 | A(CQ+1) | |
2 | A(CQ+2) | |
3 | A(CQ+3) | |
4 | A4 | Queue #4's address register |
5 | A5 | Queue #5's address register |
6 | R0 | Standard register #0 |
7 | R1 | Standard register #1 |
8 | R2 | Standard register #2 |
9 | R3 | Standard register #3 |
So you see how fast a context swap can be : because a queue's data just comes out of the memory, there is no need to save it. Only the pointer has to be saved, sparing six items from being transferred :-)
The catch is that the position of the 3 other queue numbers must be deduced from the indicated CQ, otherwise it would need 11 elements instead of 10. Well, that's just a few logic gates. Adding might not be the best solution, maybe XORing is better.
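Both candidate schemes for deducing the three other queue numbers can be compared in a few lines of Python. This sketch also packs the CQ into the 2 MSB of the saved instruction pointer, as described above; the function names are hypothetical.

```python
ADDR_MASK = (1 << 28) - 1


def others_by_add(cq):
    """Queue numbers stored at positions 1..3 when using (CQ + k) mod 4."""
    return [(cq + k) % 4 for k in (1, 2, 3)]


def others_by_xor(cq):
    """Queue numbers stored at positions 1..3 when using CQ xor k."""
    return [cq ^ k for k in (1, 2, 3)]


def pack_entry0(cq, ip):
    """Build context entry 0: CQ in the 2 MSB, instruction address below."""
    return ((cq & 3) << 30) | (ip & ADDR_MASK)


def unpack_entry0(word):
    """Recover (CQ, instruction address) from context entry 0."""
    return word >> 30, word & ADDR_MASK
```

Either scheme visits each of the three remaining executable queues exactly once, so both work; they only differ in the order of the saved pointers. XOR needs no carry chain at all, which fits the "just a few logic gates" remark.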
I have not yet determined how the address computations are performed. That might be inside or outside the execution core. Every cycle can perform 4 memory accesses (3 reads and 1 write, including instruction fetch) but only 3 pointer updates (convention says that write has precedence over read for pointer updates). This means that three 26-bit adders are needed in parallel with the core.
For ease of design, I have split the core into four stages as shown below :
This does not mean however that the core is pipelinable, despite some potential. It's not even worth considering because it would bring little performance boost (50% ?) at the price of a lot of circuitry. The core works well, slowly but simply, without pipeline gates.
However, there might be another possibility later : transforming it into an SMT (Simultaneous MultiThreading) core to execute 2 or 4 threads, or more, concurrently. There, all the latencies are hidden from each other and interrupt latency might be greatly reduced. There are relatively few hazards to check and the performance boost is much better than with simple pipelining. Each thread runs as slowly as in a single-thread processor, but four threads would be able to run at the same time, boosting the instructions-per-second rating.
But that's for waaayyy later.
Being uncommon, the VSP can still do most things another processor can do. It's just done differently.
First thing to know : the CQ's value. It is inherent, implicit. It tells you which queue you're running on. Usually, when you start, it is set to 0 so you have (generally) Q1 to Q5 available for other purposes.
Without careful planning and a smart resource allocation, the queues will easily be saturated, leaving no room for further execution. But hey, VSP is a "small" microcontroller and if there were idle, unused or wasted resources, they would be removed ASAP. So bear with the tight constraints and remember that some people have written webservers on much, much, much more reduced processors.
So the register allocation is usually like this :
# | Name | Description | Type |
0 | A0 | Queue #0's address register, default instruction pointer | Executable queues |
1 | D0 | Queue #0's data register, default current instruction | Executable queues |
2 | A1 | Queue #1's address register | Executable queues |
3 | D1 | Queue #1's data register | Executable queues |
4 | A2 | Queue #2's address register | Executable queues |
5 | D2 | Queue #2's data register | Executable queues |
6 | A3 | Queue #3's address register | Executable queues |
7 | D3 | Queue #3's data register | Executable queues |
8 | A4 | Queue #4's address register, stack pointer | Data-only queues |
9 | D4 | Queue #4's data register, stack top | Data-only queues |
10 | A5 | Queue #5's address register, alternate stack pointer | Data-only queues |
11 | D5 | Queue #5's data register, alternate stack top | Data-only queues |
12 | R0 | Standard register #0 | Static registers |
13 | R1 | Standard register #1 | Static registers |
14 | R2 | Standard register #2 | Static registers |
15 | R3 | Standard register #3 | Static registers |
Having two programmable stacks makes it easy to implement the basic FORTH functions, though it was not the original intention. The first VSP draft included only 4 queues but that was certainly a limitation, so it was increased to 5 queues with the fifth being able to act as a stack.
However, static registers are not the most precious resource in a microcontroller, particularly when a queue can emulate one static register (you have to point the address register to a scratch area). Complex or heavy algorithms could even have a virtually infinite number of registers when playing with the autoincrement features. So lately I decided to add a sixth queue.
Let's start easy. Computational examples here use the static registers so there is no side-effect.
The first execution unit is ASU : it is the "Add and Subtract Unit".

    add r0, r1       ; adds r0 and r1, then puts the result in r1
    add r0, r1, 123  ; adds r0 and immediate data 123, then puts the result in r1

Is that simple enough ?
Now if you want to perform multi-precision computations, there is a problem because the VSP has no carry flag nor dual-write capability. So a specific operation is needed to generate the carry and put it into a register :
Adding a constant (123) to the 96-bit value r0:r1:r2 (r3 is a carry register) :

    add  r0, r0, 123  ; r0 = r0 + 123
    addc r0, r3, 123  ; r3 = carry(r0+123) [generate the carry : r3 becomes 0 or 1]
    add  r3, r1       ; r1 = r1 + carry
    addc r1, r3       ; r3 = carry(carry+r1)
    add  r3, r2       ; r2 = r2 + carry

It's not optimally compact (another architecture would require only 3 instructions) but there is no carry register to store, manage etc. so it reduces the instruction count and removes the related headaches. And multi-precision is not so common for a microcontroller, especially when the registers are 32-bit wide.
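The idea of a separate carry-generating operation can be checked with a small Python model. It assumes that `add` returns the low 32 bits of the sum and that `addc` returns only the carry out of the same sum (0 or 1), with r0 as the least significant word; these semantics are inferred from the comments in the listing, not from a specification. In the model each carry is computed from the operand values before the sum overwrites them, and plain Python variables stand in for registers.

```python
MASK32 = 0xFFFFFFFF


def add(a, b):
    """32-bit sum, carry discarded."""
    return (a + b) & MASK32


def addc(a, b):
    """Carry out of the 32-bit sum: 0 or 1."""
    return ((a & MASK32) + (b & MASK32)) >> 32


def add96_const(r0, r1, r2, const):
    """Add a 32-bit constant to a 96-bit value; r0 is the low word."""
    carry = addc(r0, const)        # carry out of the low word
    r0 = add(r0, const)
    carry_next = addc(r1, carry)   # carry out of the middle word
    r1 = add(r1, carry)
    r2 = add(r2, carry_next)
    return r0, r1, r2
```

For example, `add96_const(0xFFFFFFFF, 0xFFFFFFFF, 0, 1)` ripples the carry through both lower words and returns `(0, 0, 1)`.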
The same works for subtraction too. There are the sub and subb instructions that operate the same way.
The whole spectrum of boolean operations is also available :

    AND ANDN OR ORN NAND NOR XOR XORN

but because of the core's limitations, there is no MUX instruction (it must be emulated with three instructions and a couple of temporary registers).
Note : in the OPN operations (ANDN and ORN), the source that is inverted is the first one, not the one sharing its address with the destination nor the immediate field (if any). This should make it easier to code real stuff.
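One possible three-instruction MUX emulation uses AND, ANDN and OR with two temporaries. The Python sketch below models the operations on 32-bit values, with ANDN inverting its first source as stated in the note; the function names and the choice of this particular sequence are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF


def and_(a, b):
    return a & b & MASK32


def andn(a, b):
    """AND with the first source inverted."""
    return (~a) & b & MASK32


def or_(a, b):
    return (a | b) & MASK32


def mux(sel, x, y):
    """Bitwise select: take x where sel has a 1 bit, y where it has a 0."""
    t0 = and_(sel, x)    # first temporary: bits taken from x
    t1 = andn(sel, y)    # second temporary: bits taken from y
    return or_(t0, t1)
```

For example, `mux(0xFFFF0000, 0x12345678, 0x9ABCDEF0)` gives `0x1234DEF0`. Having the inverted operand be the first source means the selector can stay untouched in its register for both the AND and the ANDN.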
The "shuffle" unit (SHL) moves bits around the register. It's a very stripped-down version of the corresponding F-CPU unit and until more code is written, it only does rotations and shifts on 32-bit data.
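For reference, here is what shifts and rotations on 32-bit data amount to, as a minimal Python sketch. Taking the shift count modulo 32 and using logical (not arithmetic) right shifts are assumptions; the text does not say how the unit treats large counts or signs.

```python
MASK32 = 0xFFFFFFFF


def shl(x, n):
    """Logical shift left, count taken modulo 32 (assumed)."""
    return (x << (n & 31)) & MASK32


def shr(x, n):
    """Logical shift right, count taken modulo 32 (assumed)."""
    return (x & MASK32) >> (n & 31)


def rotl(x, n):
    """Rotate left by n bits within a 32-bit word."""
    n &= 31
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def rotr(x, n):
    """Rotate right, expressed as a left rotation by the complement."""
    return rotl(x, -n & 31)
```

A rotation keeps every bit, so rotating left then right by the same count gives back the original word, while a shift drops the bits pushed out of the register.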