vsp/docs/vsp05.txt
created Tue Oct 15 17:28:49 CEST 2002 by whygee@f-cpu.org
version Sat Oct 19 2002
version Sun Oct 27 02:48:09 CET 2002 (d'oh, it's now winter time...)
version Sun Jul 25 04:06:28 CEST 2004 : a few additions, PFQ conflicts and other details.
version Tue Jul 27 00:41:41 CEST 2004 : 6xPFQ + update flags
version Sun Aug 8 01:27:22 CEST 2004 : instruction bit order inverted
version Tue Aug 31 04:16:58 CEST 2004 : added the I/O instruction range, padding NOP optimisation, SMT, a page of history...
version Thu Sep 9 10:08:24 CEST 2004 : VSP vs F-CPU

-------------------------------------------------
!!!!!!!!!!!!!!!!!!!! WARNING !!!!!!!!!!!!!!!!!!!!
This text is a work draft and is subject to arbitrary changes for whatever reason that pleases me. It is by definition highly incomplete and inaccurate. In fact, it only gives a rough idea of what this stuff is; please read the source files for a more accurate and up-to-date definition. Don't dare complain about anything.
!!!!!!!!!!!!!!!!!!!! WARNING !!!!!!!!!!!!!!!!!!!!
-------------------------------------------------

Introduction

The goal of this small project is to specify, design and implement a "Very Simple Processor", a kind of 32-bit RISC microcontroller, with tight constraints on core size (as small as reasonably possible) and power (ultra-low power consumption) for tiny embedded, battery-operated consumer devices. It started as a serious project but almost immediately degenerated into a weird but fun and instructive experience. F-CPU was slowly losing ground and I needed new toys so I could test ideas. There are many concepts in common between F-CPU and VSP and I believe they both benefit from each other. However, some crucial characteristics differ radically : F-CPU is wide and scalable while VSP is strictly limited to 32-bit registers with scarce support for 8-bit and 16-bit data.
F-CPU is designed for speed and raw performance while VSP is aimed at low power and dirty I/O background tasks. I believe they complement each other well and could be used in the same system.

Target

This small processor shall be used as a "SoC Area Controller" : it schedules hard real-time events and resources but it is not in charge of the raw computational parts, which must be performed by specialized processors that operate in parallel. It can however handle some low-CPU background tasks (system monitoring, garbage collection...) or simply sit idle between two interrupts. This is why a specific processor, which consumes very few resources, is needed : the overall system can be more efficient and smaller if the tasks are processed by specialized processors. Relieving the other processors from the management chores reduces the constraints and simplifies their design. The VSP is not a multi-purpose general processor; the coprocessors, in turn, can get rid of any interrupt management. This is a particularly important point in the case of a DSP, where most registers are usually doubled to keep the interrupt response time short. The target performance of the VSP is between a baseline ARM and a PIC or AVR microcontroller, but these are proprietary architectures. Jürgen proposed naming this "new" processor "LEG" but since it is not a direct replacement for ARM, I still prefer the name "VSP". I'll maybe use this name if someone finds a good meaning for the acronym "LEG", but it misses the main point : it's something completely "different" which brings (and also throws away) several ideas. It's not a clone; it has been designed from scratch for a very particular application.

Rationale

It is highly questionable whether rewriting a core from scratch is a good idea. The software development tools must be rewritten from scratch and the "operating system" must be created for this specific architecture. It is even possible that no compiler will be available for a while.
However, a highly application-specific processor has some interesting advantages. For example, it is possible to tightly integrate peripherals, instead of "glueing" "IP cores" together. Another reason and motivation for not using cores like those from Tensilica and others is that the VSP is implemented in full compliance with the Copyleft (GPL) world, instead of being a simple "open sourced" proprietary product that is bound by additional licences and patents. Finally, not everything is rewritten "from scratch" because it reuses many aspects that were developed for the Freedom CPU project. In many ways, it is not simply "yet another me-too architecture". Well, nobody would have thought of or dared doing this crazy stuff, but it is possible *here* because it's a "quick and dirty" project.

Historical background

One computer in particular has amazed and influenced me : the Control Data Corp. CDC6600 designed by Seymour Cray in the '60s, the first machine that was called a supercomputer. Basically, the system had just a few main (large) parts : ample core memory (8 banks accessed in round-robin fashion to increase the memory bandwidth), one (or two if you were really really really rich) main processor (doing the math in 60-bit FP format) and a set of small processors (accumulator-based with 15-bit instructions) that interfaced to dedicated I/O channels like discs, tapes, typewriters... The most remarkable trick is that these PPUs ("Peripheral Processing Units", which reminds us of the "Peripheral Interface Controller" made by another company, later spun off as Microchip) were in fact one single computer with 10 contexts, each dedicated to one PPU and one I/O channel (though the channels were SW-selectable IIRC). Yep, that was a Simultaneous Multithreading computer that did many simple low-level things at the same time. The comparison with F-CPU is easy : the long and painful context switches have been widely discussed.
If a small CPU, running only "trusted code", can help reduce the number of IRQs sent to the CPU, this could simplify the OS, reduce the burden on device drivers, etc. But here we deal only with "slow" I/O like keyboards, mice, serial I/O, USB1... or just buttons, BIOS, front LCD display, power management, hotplug, power sequencing... But that is only one application : the VSP can sustain itself without coprocessors when only simple 32-bit operations are needed.

External architecture

The working environment is a consumer-class SoC with several data streams and interfaces : SDRAM interface, digital sound in and out, video streams, hard disk and/or DVD, LCD display, user input interface, dedicated coprocessors, communication links, power management... The purpose of this VSP is to control and configure the interfaces and possibly handle some simple transfer protocols (mass storage or peripherals), but not much more. There are things that the VSP is not meant to do or be : some additional functions and tasks are ruled out because they don't fit the profile of a microcontroller. The goal is not to crunch a lot of computations, because this must be performed by "coprocessors" (other processors with their own instruction stream that run in parallel on a specific task). Similarly, the goal is not to run Linux either : virtual memory or protection rings are completely useless for the very specific tasks that the microcontroller executes. There is a risk that a virus could appear, but since the architecture is "open" and user-modifiable, the potential lifespan of such malicious code is quite short... The VSP runs a small-footprint real-time chip management software. It can communicate with the other cores in the chip but must also access data in main memory. This is usually implemented with a single SDRAM chip today. There are two consequences :
 - this must be a 32-bit core, as pointers can be quite large.
24-bit is not enough because 16MB of SDRAM can be much too small in the near future. However, it is not likely that 4GB will be used soon, so the MSB can contain some flags. It is also practical to have 32-bit integers to reduce the register pressure when handling such large numbers, as they are very commonly used today.
 - SDRAM chips work with 4 simultaneous "banks" and this must be reflected in the architecture. VSP does not use a classical cache but rather a set of direct-mapped buffers of the SDRAM lines. These buffers are also directly accessed by other devices to reduce cache coherency problems.

From these points, it is obvious that :
 - no VM or protection (supervisor mode) is necessary. The VSP accesses and controls everything but it does not run user applications and no swapping is necessary.
 - fast interrupt response is needed (a few cycles).
 - it's not going to run Linux or anything like that. In fact, it is designed in such a way that it is not possible to use it "as is", hehe...
 - the VSP must transparently but directly interface to the SDRAM chip (to a certain extent) to reduce the buffer sizes and response time.
 - it is not necessary to run *really* fast. The chosen target frequency is 10MHz to 20MHz. Faster clocks are only possible through faster circuits, not with architectural changes. If more power is needed, then another core must be used or designed instead, or a coprocessor must do the work.
 - it doesn't need to be complex or loaded with resources : a very small pipeline (if any) is a good choice, given the low operating frequency. This also reduces the decoding logic's complexity.
 - software size (code density) is not an issue, but core size is much more critical. SDRAM and FLASH capacity is much cheaper than FPGA cells.
Feeding the core with instructions or data is easy, considering the available memory bandwidth (4 or 8 32-bit words per burst at around 50 or 60MHz, or 240MB/s peak) and the low core speed (maximum theoretical throughput is 4x32-bit words per 100ns cycle, or 160 Mbytes/sec) ==> there is a comfortable margin for other applications.

Register organisation

The data types are :
 - "byte" (8 bits)
 - "half word" (16 bits)
 - "word" (32 bits)
These data are right-aligned in the registers and stored in memory in little-endian order (but this could be changed in the VHDL source code if needed). Given the SDRAM structure, it becomes obvious that a PFQ-based architecture ("PFQ" means "PreFetch Queue") is certainly desired, as it manages blocks of data easily through a pair of registers (without load and store instructions), thus emphasising the communication side of the targeted use. One of the main tasks, besides answering IRQs, would be to scan incoming blocks from mass storage and parse MPEG streams in search of block delimiters, in order to hand the decoding job to a DSP. Or to display data on a raster LCD screen with fonts or sprites. The rest consumes so few instructions and cycles that it's not worth "optimising" them. There must only be a "cheap" way (in terms of time and space) to assert IRQs and manage the integrated peripherals. The first idea was to use a 16-register architecture with 8 normal registers and 4 PFQs (8 registers in total). It seemed too tight and it evolved into 5 PFQs and 6 registers that are all mapped into the 16-register range. The added PFQ (#4) was reserved for the stack and was hardwired to read pre-increment and write post-decrement (or the reverse, if you like). But it still seemed too tight, so a 6th PFQ was added, and all of them are programmable for post-increment, post-decrement and pre-decrement. So multi-stack algorithms can be implemented and there is some room for nested loops.
On top of that, the bandwidth increases and there are at most only 10 registers to save when an IRQ or trap occurs.

Table 1 : register map

  #   name   function
  0   A0     default PC
  1   D0     default instruction register
  2   A1     PFQ1
  3   D1     PFQ1
  4   A2     PFQ2
  5   D2     PFQ2
  6   A3     PFQ3
  7   D3     PFQ3
  8   A4     PFQ4
  9   D4     PFQ4
 10   A5     PFQ5/Stack pointer
 11   D5     PFQ5/Stack top
 12   R0
 13   R1
 14   R2
 15   R3

Register decoding is as follows :
  D = 0 and /(3 and 2)
  A = /0 and /(3 and 2)
  R = 3 and 2
(a register number selects a Dx when bit 0 is set and bits 3 and 2 are not both set, an Ax when bit 0 is clear and bits 3 and 2 are not both set, and an Rx when bits 3 and 2 are both set). Of course this can be simplified a lot or even changed (this draft must be considered as highly preliminary !).

-------------------------------------
Not a "Load/Store" architecture :
the principles of the Prefetch Queues
-------------------------------------

The VSP is the smallest possible implementation of the PFQ concept : it is something looking like a DSP with several simple address generators, or a modified CDC6600 computer, or even a processor with several register windows. Or none of them. This kind of architecture decouples the computer into two parts : the operating (control, decoding and execution) part which "computes", and the memory interface which contains buffers that are transferred in short but efficient bursts. The interface between the two parts can be more or less sophisticated but it is efficient when several simultaneous data streams are processed. The principle is simple : a PFQ models a buffer of several words that are accessed through a pair of user-visible registers. The size of the buffer depends on the implementation but does not matter much. One register (the "Data register" or Dx) contains the data pointed to by the address stored in the associated register (the "Address register" or Ax for the x'th PFQ). Data moves to and from main memory by accessing the A and D registers :
 - When data is written to the data register, a store cycle is started, using the pointer register for the address.
 - When the pointer register is changed, a load cycle is started and the corresponding data register is loaded with a new value.
This is basically the principle used on the CDC6600, with some modifications (the number and use of the register pairs are a bit different). There is no classical "load" or "store" instruction, and pointer arithmetic is pretty straightforward, even though it is rather unusual for people accustomed to classical CISC or RISC computers. Now, here are two important aspects of this principle :
 - a PFQ pair of registers can be used either for handling data or instructions. This means that a pointer can point to code or data. A jump is performed by writing to the pointer register or (preferred solution) by prefetching code and then changing the "current queue" (CQ). A branch instruction will simply copy the "contents" of a specified queue.
 - the pointer can be automatically updated when the corresponding data is accessed. The typical mode is auto-increment on write and auto-decrement on read, for implementing a stack. Other, more complex access patterns are of course possible but are unused in order to keep the core simple. The pointer update bits require several flag bits per PFQ and they are stored in the MSBs of the A registers, so they are saved and restored automatically between function calls or IRQs.
From a programming point of view, accessing a whole block is as simple as reading a register as many times as needed. The memory will try to prefetch as much data as possible, but if the main memory is not ready, the core will simply stall. The goal is to interleave as many instructions as possible between two PFQ accesses, and reduce the stall cycles. Fortunately, the VSP is slow enough that it is not hurt by "dirty code" as much as a more sophisticated and pipelined processor would be. If a 60MHz SDRAM chip is used with bursts of four 32-bit words and 2 cycles of latency, a 10MHz VSP will only wait one cycle.
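As a concrete illustration, here is a minimal behavioural sketch of one PFQ register pair in Python. It is NOT the actual implementation : memory is modelled as a flat word array, bursts, line buffers and stall cycles are ignored, and the update policy is reduced to a single signed step.

```python
class PFQ:
    """Behavioural model of one PreFetch Queue register pair (Ax, Dx)."""

    def __init__(self, memory, step=0):
        self.mem = memory   # word-addressed backing store (no bursts modelled)
        self.a = 0          # Ax : address register (a word index in this sketch)
        self.d = 0          # Dx : data register
        self.step = step    # +1 post-increment, -1 post-decrement, 0 no update

    def write_a(self, addr):
        # Writing the pointer starts a load cycle : D is refilled from memory.
        self.a = addr
        self.d = self.mem[self.a]

    def read_d(self):
        # Reading D consumes the prefetched word; the pointer then advances
        # and D is refilled (the real core would prefetch in the background).
        val = self.d
        self.a += self.step
        self.d = self.mem[self.a]
        return val

    def write_d(self, val):
        # Writing D starts a store cycle at the current pointer, then updates it.
        self.mem[self.a] = val
        self.a += self.step

# Streaming a block needs no load/store instructions, only register accesses :
mem = list(range(16))
q = PFQ(mem, step=1)     # post-increment queue
q.write_a(4)             # point into the block -> D now holds mem[4]
first, second = q.read_d(), q.read_d()
```

Note how scanning a block is just reading the same D register repeatedly, which is exactly the "communication-oriented" behaviour described above.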
However, note that the pointer update is performed in parallel with the current computation. The difference between post- and pre-update only applies to PFQ4, which is hardwired as a stack with a trick for performing the pre-update. The other queues can only post-increment, post-decrement or do nothing (2 bits).

Table 2 : pointer format

bit 0 : byte selection
 - It is always cleared when pointing to instructions because instructions are 16-bit or 32-bit wide. However, in the case of a trap or a severe event, it can contain the IRQ enable flag.
 - Only the instructions "LSB", "LZB" and "SB" can use the associated data register if this bit is set (otherwise there is an alignment trap).

bit 1 : half-word selection
 - Only the instructions "LSB", "LZB", "LSW", "LZW", "SB" and "SW" can use the associated data register if this bit is set (otherwise there is an alignment trap).
 - This also selects the "current instruction half-word" if this is the current queue.

These first two bits are "seen" only by the internal datapath, which has a simple 2-bit counter per PFQ. When this counter under- or overflows, it also triggers the increment of the next counter and Dx is loaded with the next word from the memory buffer.

bits 2 and 3 : word selection
This is another counter which selects one word out of four in the memory buffer's line associated with the PFQ. This is for the case where 4 words are stored per line; otherwise it would be "bits 2 to 4".

bits 4 to 27 : line selection
This is mapped to whatever structure, usually SDRAM and FLASH memories. When the word selection counter overflows, the line's address is incremented or decremented accordingly and a new burst is enqueued. Usually, prefetch is implemented by enqueuing a new request as soon as a line is used. The newly read line is stored in another line and double-buffering is implemented.

bits 28 and 29 : Current Queue (instruction pointers only)
The CQ bits indicate which queue is currently active.
This is saved when there is a trap or an interrupt, just like the IRQ enable in the LSB. This is valid only for instruction pointers.

bits 28 to 31 : increment bits (data pointers)
 bits 28, 29 : read update
 bits 30, 31 : write update
These fields have the following meaning :
 0 0 : no update (nop)
 0 1 : post-decrement
 1 0 : post-increment
 1 1 : pre-decrement -> allows stacks

As you can see, the format of the pointer reveals a lot of architectural choices :
 - traps and interrupts store a few bits in the unused parts of the pointers; this helps keep the response time short and spares some cycles here and there.
 - the memory hierarchy has 3 levels, each with its own management and strategies. The execution datapath is decoupled from the rest and each level handles data with a different granularity :
   - sub-words for the datapath,
   - lines of words (4 or 8) in the memory interface,
   - millions of lines in the main memory chip.
However, thanks to prefetching and double-buffering, it is relatively easy to reach high bandwidths when data is accessed linearly, for example.

Instruction format :

The instructions have 16 bits for the register-to-register form, called "RR" or "short instruction format", and 32 bits with a 16-bit sign-extended immediate, called "RRI" or "long instruction format". In fact, most instructions can use both forms, so the 16/32-bit format is simply indicated by the MSB of the instruction. Remark : to simplify things, just as on the plain old CDC6600, 32-bit (RRI) forms must be correctly aligned in a 4-byte word, otherwise the core will trap or show other signs of unhappiness. Code density is not wonderful but rather satisfying anyway, and the decoding logic is still quite simple, due to the absence of pipeline and hazard detection logic. Here is a description of the instruction word format :

Table 3:
bit 0 : Imm16 flag --> 16 bits follow.
bits 1,2 : instruction class number
bits 3,4 : Execution unit number
bits 5-7 : OP number
bits 8-11 : source/dest register number
bits 12-15 : source register number
bits 16-31 : (optional) immediate data

There is no operand size flag because this would make the PFQ more complex. This also saves a couple of useful bits but increases the coding complexity a bit, as specific instructions must handle bytes and half-words (at least these operations do not consume many resources). The register-register ("RR") form takes the 2 indicated registers as sources and writes the result to the first register (à la x86) to spare instruction bits. For example : add r1, r2 computes r1=r1+r2. However, the register-register-immediate ("RRI") form reads only the second source and the immediate data, and the result is put into the destination register. The immediate data "replaces" the first register source; for example : add r1, r2, 123 computes r1=r2+123 (the immediate operand is ALWAYS sign-extended so be VERY careful). There are 4 instruction "classes" :
 - 00 control
 - 01 ALU operations
 - 10 jump
 - 11 I/O registers
Each "class" can have a specific instruction format.

1) The simplest ones are the control instructions :
 NOP (encoded 0x0000)
 HALT (stop the core and wait for an IRQ)
 MOV (copy a register to another)
 GET (used for reading the configuration space, IRQ controller, DMA, ...)
 PUT (idem, for writes)
 RETI (but I'm not sure yet)
 CPQ : copy queue (loop entry)
 LOOP : copy a queue to another conditionally (same as a jump but the current IP is not saved; there are some 4 bits left to specify :
  - 2 bits -> PFQ
  - 2 bits -> condition (?))
The other opcodes are reserved for the future and will appear later.
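To make Table 3 more concrete, here is a hedged Python sketch of the field extraction. It assumes that bit 0 of the table is the MSB of the 16-bit halfword (the text says the Imm16 flag is the MSB of the instruction); the exact wiring may differ in the VHDL source.

```python
def decode(word32):
    """Split a 32-bit fetch word into the Table 3 fields (assumed layout)."""
    insn = (word32 >> 16) & 0xFFFF     # first halfword holds the instruction
    fields = {
        "imm16": (insn >> 15) & 0x1,   # bit 0   : long (RRI) form flag
        "class": (insn >> 13) & 0x3,   # bits 1,2: instruction class
        "eu":    (insn >> 11) & 0x3,   # bits 3,4: execution unit
        "op":    (insn >> 8)  & 0x7,   # bits 5-7: OP number
        "rd":    (insn >> 4)  & 0xF,   # bits 8-11 : source/dest register
        "rs":    insn         & 0xF,   # bits 12-15: source register
    }
    if fields["imm16"]:
        imm = word32 & 0xFFFF          # bits 16-31 : immediate halfword
        # the immediate is ALWAYS sign-extended, as the text warns
        fields["imm"] = imm - 0x10000 if imm & 0x8000 else imm
    return fields
```

Under this layout, the RR form `add r1, r2` uses rd as both source and destination (r1=r1+r2), while the RRI form `add r1, r2, 123` reads rs and the sign-extended immediate (r1=r2+123).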
2) The next class is the jump class :
 bits 11-12 : [Q] queue where to branch
 bit 10 : [L] "link" bit (push the next IP if set, emulates a subroutine call)
 bit 9 : [N] condition negation
 bit 8 : [Z/O] Zero or Odd (LSB)
The first register field [8..11] indicates the register where the next IP is written, usually D4 or D5 (stack top). The other register field [12..15] indicates the register that must be tested, either for zero or for the LSB. The immediate field gives a sign-extended 15-bit word that is XORed with the tested register. It defaults to zero if no immediate is given. These codes can be "encoded" as :

 bit : 10 9 8
        0 0 0  JZ   "jump if zero"
        0 0 1  JO   "jump if odd"
        0 1 0  JNZ  "jump if not zero"
        0 1 1  JEV  "jump if even"
        1 0 0  JLZ  "jump and link if zero"
        1 0 1  JLO  "jump and link if odd"
        1 1 0  JLNZ "jump and link if not zero"
        1 1 1  JLEV "jump and link if even"

The other flag is the 2-bit field that encodes which queue is used next.

*** !!! Important remark !!! ***
There are 3 ways to "jump" :
 - write to the Address register of the currently used queue (CQ)
 - modify the CQ
 - use the "jump" instruction
However, the last way is the only recommended one, even though the others can be enabled or disabled by a specific version of the core. This ensures that no problem will occur with timing (the other ways may create delay slots) and maximum performance is achieved through the use of "split branches" (the queue must first be prefetched, which gives some time to load the new instruction stream from slow memory). Another detail is that jump instructions are the only conditional instructions here. Unconditional jumps can be done by testing a register that is known not to be zero, or by testing the LSB of a pointer that is used for instructions (-> always zero because it is aligned on 2-byte boundaries at least). It is quite possible that "hacks" with the CQ will be rendered impossible by some hardware, or maybe a trap.
*** end of !!! Important remark !!!
***

Another instruction : "cpq" ("copy queue"), used for loops. Operands :
 - source queue (2 or 3 bits ?)
 - source register (counter)
 - limit (src reg or imm)

3) The next class of instructions is the "computational operations". There are 4 "units" that perform operations on the operands :
 00 [ASU] adder
 01 [SHL] shifter
 10 [ROP2] logic
 11 [IE] insert/extract fields
As you can see, it's as simple as a 4-to-1 MUX controlled by the current instruction word. The corresponding instructions are :

[ASU]
 add
 sub
 carry
 borrow
[SHL]
 SHR shift logical right
 SHL shift left
 SAR shift arithmetic right
 ROL
 ROR
 (mask generation ? bitfield ins/extract ?)
[ROP2]
 AND OR XOR NAND NOR XNOR ORN ANDN
[IE]
 LZB (load zero-extended byte)
 LSB (load sign-extended byte)
 LZW (load zero-extended word)
 LSW (load sign-extended word)
 SB (store byte)
 SW (store word)
 IHB (shift left 8 bits and insert a byte)
 IHW (shift left 16 bits and insert a word)

This last IE unit is meant to support pointer-oriented byte and word access to the PFQs, helping emulate "load" and "store" instructions. These instructions are here only to provide dynamic 16-bit and 8-bit accesses to and from a PFQ data register, though other registers can be used instead, with different results. This works as a byte and word "insert" (for store) and "extract" (for load) instruction that uses only 2 operands (the immediate field can be ignored) as well as the _implicit_ LSBs of the associated pointer. Because the data and address registers of a PFQ are linked, the address can be found easily and it can feed the alignment logic. Furthermore, if the prefetch queue in use has read or write increment/decrement enabled, the pointer can be updated on demand by the correct number of bytes. If a non-data register is used, then the "pointer" defaults to 0, so these instructions can be used for sign extension or other purposes.

4) The I/O register instructions :

This class is optional and appeared later in the design.
It further stresses the fact that the VSP is a microcontroller (a high-class one, but still a microcontroller). A new range of 32-bit registers is added :
 - 8 registers deal with data : the P registers
 - 8 registers control the direction of every bit : the R registers
Note : this space is not part of the context of a running program, so it will not be saved/restored upon IRQ. Besides the obvious "move to io" and "move from io" instructions (they do not accept immediate data), specific instructions can support hardwired bit-to-bit operations for bit-banging, or complex and fast interfaces (SPI, I²C ...). 8 ports provide 256 bits of I/O : that is enough to control many simple surrounding devices like LEDs, buttons, LCD screens, or, why not, an integrated FPGA or sea-of-gates for specific coprocessing functions. These registers are not in the SR space for several reasons : SRs only support GET and PUT, they can have a long latency and they are meant for slow or one-time device configuration. I/O as processed by microcontrollers needs more bandwidth (single-cycle operations) and more specific, direct processing. Enabling access to the main data pipeline can save precious cycles in time-critical functions, as well as precious program space for I/O intensive code.

There is a total of 5+8+29+2=44 opcodes. The whole opcode space is not used, otherwise the control logic's size could explode. But up to now, the instruction set is still nicely orthogonal, even though the SHL operations need a reversed operand order when the immediate form is used.

-----------------------------
Internal architecture
-----------------------------

The goal is to keep the processor as simple, as small and as power-saving as possible. Pipelining is not used (well, not yet), as it increases the complexity of the control logic and may add more silicon area for the pipeline barriers.
However, the core is split in two parts : one performs the operations and the instruction flow control, the other manages the memory accesses and buffers. These parts are separated by a set of registers which are the only timing barriers. At first glance, the operating core is simpler and easier to implement than the memory buffer, which is also the part that makes the VSP so interesting to design (data and control paths are common skills, but memory interfaces and buffer coherency are still some kind of black magic). Let's start with the easy and "user visible" portion of the core : the instruction decode and execute stage. It takes a whole clock cycle to read the input registers, decode and execute the current instruction and write the result back to the register. However, the control logic requires great care and multiplexing the signals can consume a lot of resources. Another important point is the instruction fetch and decode logic. The PFQ registers and pointers operate only on 32-bit words but the instructions can be 16-bit wide. However, the 2 LSBs of each pointer are stored inside the execution part of the core so that 8-bit and 16-bit operations can be performed. These LSBs are not seen outside of this half of the core but can be used transparently by all instructions. The instruction is loaded from one of 4 D-regs (D0 to D3) and one half of the word is selected as the current instruction, depending on the current pointer's LSBs. After this first selection, the instruction word indicates which registers to read and write, and selects the immediate operand when needed. This is performed by a simple multiplexer and sign extension of the immediate word. The multiplexer also selects the 2 LSBs of one of the A registers if a D register is accessed, in case there is an alignment instruction for the IE unit.
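The current-instruction selection described above can be sketched in a few lines of Python. Little-endian halfword order inside the 32-bit D register is an assumption of this sketch (memory is little-endian per the text, but the actual mux wiring may differ).

```python
def fetch(d_regs, cq, pointer):
    """Pick the current 16-bit instruction out of the current queue.

    d_regs  : the four instruction data registers D0..D3
    cq      : 2-bit Current Queue field selecting one of them
    pointer : active instruction pointer; bit 1 selects the halfword
    """
    word = d_regs[cq] & 0xFFFFFFFF
    if pointer & 2:                    # bit 1 set -> upper halfword
        return (word >> 16) & 0xFFFF
    return word & 0xFFFF               # bit 1 clear -> lower halfword
```

A long (RRI) instruction would then take its immediate from the other halfword of the same 32-bit word, which is one reason why it must be 4-byte aligned.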
Then come the 4 operation units and the read and write ports of the GET and PUT instructions (used to access other devices : I/O ports, the timers, the IRQ controller, the DMA, etc.). The 4 units operate in parallel and the result is selected according to the 2-bit EU field of the instruction. The GET and PUT instructions are a bit less constraining in timing and the result of GET overrides the EU multiplexer. The results are written back to the register that is selected by the instruction, after some decoding. Concerning the EUs :
 - The ASU is a classical 32-bit adder with the usual XOR and carry-in controlled by the instruction word, in order to perform SUB as well. There is another operation type that only generates the carry and borrow. There is no "status register" or "carry/borrow bit", or multiple write ports, so a specific instruction is the only solution.
 - The ROP2 unit works just as on F-CPU, except that there is no Combine or MUX mode. It's implemented with a simple 4->1 multiplexer per bit, and a small lookup table in front.
 - The SHL is probably the most complex unit but it's a classical barrel shifter that performs shifts and rotations. The only potential worry is its space and time. A couple of instructions can be added to support bit field insertion and extraction.
 - The IE unit shifts 8-bit and 16-bit words and possibly multiplexes them for emulating byte- and word-wide load and store instructions. The shift depends on the 2 least significant bits of the associated A register in the selected PFQ. If the source register is not a D register, the shift defaults to 0. The control logic must also update the pointers according to their (read/write inc/dec) settings and the size of the moved word. Note that there is also a "bypass" path used for MOV and the implicit pushes in the JLxx instructions.
Concerning the control logic, it's where most of the complexity hides.
Most fields of the instruction can be reused "as is" but some signals must be extracted from different points. First problem : data from the PFQ might not be ready (the transaction has been sent to the SDRAM but the program wants the result immediately). There must be a way to "stall" the core just as in a normal pipelined core. A stall occurs either when the instruction is not ready, or when one of the input operands is not. The latter can only be known once the instruction is fetched, which makes timing tighter. One obvious way to implement the "PFQ not ready" flag is to multiplex this flag (coming from the PFQ) along with the data, the result being fed to the stall logic. The two parts of the core have to communicate with handshake signals to indicate that the decode logic has updated a register, or that the memory interface has not finished transferring data. The stall signal is also used "in software" for the HALT instruction. Another problem : getting the next instruction. Here we are relieved from the burden of computing the "next IP" because it's done automatically by the memory interface unit, but the instruction must be chosen. The key is the instruction multiplexer and the second LSB of the current instruction pointer. The immediate field is simply extracted from the instruction word because it is allowed only when the instruction is correctly aligned. One optimisation to consider : the "next instruction" increment is either 0 (stop the core when an error occurred), 2 (when the current instruction is "short") or 4 (for a long instruction). But there is a big chance that a short instruction preceding a long one is simply a "nop" inserted to align the long instruction. So one optimisation is to detect a NOP in the odd position in parallel with the rest, in order to increment the IP by 4. Yet another key problem : the decoder must send the signal that a PFQ has been "touched" and must be updated.
This PFQ update signalling is probably the most complex problem because the pointers (and the data) must be refreshed if a PFQ data register is read or written, or if a pointer register is written to. Finally, the current queue must be advanced if the instruction word is exhausted, and this depends on the current instruction and whether it is stalled... The challenge is to decode these conditions as fast as possible so that the command can be sent to the memory stage soon enough and the data can be used in the next cycle, without creating a stall.

The "registers" have several implementations that depend on their purpose :
- The "normal" register set is a classical 2R1W array (though i believe it can be simply implemented with multiplexers).
- The "Ax" registers are also similar but :
  - the LSB is implemented as a counter or something more complex (see later)
  - any write triggers a read transaction
- The "Dx" registers can be read and written both from the memory controller and from the core. That one is also quite tough to design.

[... to be continued ...]

List of traps :
- fatal error
- invalid opcode (when the opcode is unknown)
- 1R1W in long instruction : happens when a register-to-register move is found in a long instruction (for example)
- unaligned instruction : when bit 15 of the instruction and bit 1 of IP are set, which should not happen (long instructions can't appear in the second half of the word). Maybe this could "open" another instruction set later.
- jump to middle of instruction : when bit 31 of the target instruction and bit 1 of IP are set, meaning that the jump lands in the middle of a long instruction
- unaligned pointer access : read or write to a Dx when the corresponding Ax has its LSB set.
- CQ not accessed through jump (??)
- invalid pointer : accessing a Dx register for which the corresponding Ax does not point to a valid address

maybe later :
- protection error (if protection is ever implemented)

and maybe (but separated from the rest) :
- reset
- (re)init task

[and more in the future]

The trap base address is determined by a SPR. Each entry is separated by a gap of 16 words (32 short instructions). The base address is aligned to a power of two, corresponding to the number of traps supported by the core. To manage this, a certain number of LSBs are not implemented (they read as zero) in the SPR, which is cleared after reset.

External Interrupts :

VSP can manage up to 32 interrupt sources, because the registers are 32 bits wide and that's already enough. First versions will implement 8 or 16 channels to keep the circuit small, but there is still some room left. A fully working embedded system can get by with 16 channels, though this may be "compressed" further down to 8 lines. Just like the traps, the IRQ routines are managed with a "base address SPR". All interrupts are individually prioritized. The priority encoder circuits can become quite large and maybe slow, so don't send 5MHz signals there :-)

Whether an interrupt is level-triggered or edge-triggered is an annoying detail, but its importance is reduced by the fact that "handshaking" is preferred, so the trigger type matters less. Maybe an edge could even be converted to a level with the help of a Set/Reset latch, if needed. Interrupts can be nested and masked, so a channel is inhibited when it is either in use or masked (these are 2 separate registers that are ORed together). The interrupt enable flag is stored on the stack with the return address, in the LSB (which is usually zero because instructions use 2 or 4 bytes).

When an interrupt occurs, the core saves the current instruction pointer on the stack (???) and starts executing a small block of code that is stored in on-chip SRAM, in order to decrease the response time.
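Two of the mechanisms above lend themselves to a short behavioural sketch : the trap entry address (16-word gaps, hardwired-zero LSBs in the base SPR) and the channel inhibition (the "in use" and "mask" registers simply ORed together). The function names and the 16-trap default are illustrative, not part of the spec.

```python
def trap_entry(base_spr, trap_number, num_traps=16):
    # Entries are 16 words (64 bytes) apart. The SPR's low bits are
    # hardwired to zero, so base and offset simply concatenate.
    gap = 16 * 4
    base = base_spr & ~(num_traps * gap - 1)   # unimplemented LSBs read as 0
    return base | (trap_number * gap)

def irq_pending(requests, in_use, masked):
    # A channel is inhibited when it is either in use or masked :
    # the two registers are ORed together, then veto the requests.
    return requests & ~(in_use | masked) & 0xFFFFFFFF
```

The result of irq_pending would feed the priority encoder, which picks the highest-priority remaining channel.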
If the IRQ entry point was in SDRAM or FLASH, there would be a potential conflict with ongoing transfers and the VSP would have to wait for its turn before using the shared memory bus.

Errata :
problem : define the conventions for word names and their sizes ; Christophe proposes "h" for "half-word", the file must be updated.
problem 2 : add a new version of the "branch" instructions that does not use the PFQ number but the register previously used for the "link", to indicate the branch target.
problem 3 : clock frequency difference with the memory controller.
problem 4 : put the PFQ flags in the MSBs of the pointers -> 256 MB addressable max.

PFQ priorities :

Imagine that CQ=0 and an instruction such as add D0, D0 is decoded : the effect is hard to predict and difficult to justify. So there must be some conventions on the core's behaviour.
* The CQ (current queue) gives priority to instruction fetch (over data reads and writes). The user can't modify A(cq) or D(cq). Reads work normally but the pointer update can't happen.
* Write has priority for pointer updates : if a read and a write occur on the same PFQ, the pointer is updated with the parameters of the write.
* If two source PFQs are identical, the update is performed only once. But this rule is not very meaningful because the case can't happen with the RR or RRI forms :
  - RR writes to this same queue and the write has priority.
  - RRI only reads one register so no priority is needed.
These rules could change in the future, adding more possibilities and better exploiting this domain. For example, accessing D(cq) could bring some useful data.

Multithreading :

So the old CDC design is haunting us. It can be resurrected in the VSP however :-) The architecture does not allow for a classic pipeline because the burden of checking hazards and bypasses is too high. The execution core is however split into 4 "stages", so there is a good potential for advanced techniques.
SMT is the simple way : 4 contexts made of 4 rolling sets of 4+12 registers can fit almost easily. The MOPS/MHz ratio is potentially quadrupled, compared to the roughly doubled performance of a simple pipelined core. Furthermore, the lack of "normal" registers would make coding more difficult.

A further extension of this idea considers the fact that some kind of prioritisation might need to appear, and must take the latencies (memory and GET/PUT) into account. A faulty thread also requires cleanup code. IRQs require the selection of a new thread. One answer is to define more than 4 physical thread contexts. The core then needs to select a new instruction cycle by cycle, based on factors like :
- is the instruction ready ?
- is it the most prioritised thread ?
- is there a free thread available to handle an incoming IRQ ?
- has the last instruction from this thread completed ?
This is getting rather complex and might extend the pipeline by one cycle, but instead of one processor running at 10MHz, we would get the equivalent of 5 or 6 CPUs running 16 threads.

Now, there are not many reasons to make something that complex now. There is the "fun factor" but few applications where so many threads are needed. Except maybe for a handheld gaming console, but i don't focus on this market.

--------------------------------------------------------------

VSP vs F-CPU

Both projects share some characteristics while they differ on others.
- F-CPU was started by other people than me, while VSP is my idea (though instilled by someone else's needs).
- Same license, same tools, but different targets.
- F-CPU is designed for infinite scalability, VSP is meant to stay in its small ballpark.
- Corollary : F-CPU is difficult and long to design, VSP should be easier and require only one guy (me).
- VSP's addressing range is limited to 28 bits (256MBytes) but F-CPU's is virtually unlimited.
- The instruction set designs are quite similar : most important opcodes are common and the "no status register" idea remains (even though it causes fewer problems here, but who knows).
- The instruction format is different : VSP has 2 forms while F-CPU enforces strict 32-bit instructions.
- VSP has 2/3-address instructions (RR and RRI) but F-CPU has more powerful ones : 2R2W and 3R1W.
- The F-CPU pipeline is designed for extreme performance but VSP is designed for frugality and simplicity.
- Hence the exception handling and scheduling : a headache for F-CPU, hardly a trouble for VSP.
- Same idea with the GET/PUT instructions and the SRs : complex and multicycle stuff is taken out of the execution core.
- Same "decoupled", bipolar architecture where the memory interface plays a critical role and is more challenging than the obvious execution core.
- VSP implements a radically different method to access memory, while F-CPU mimics a "classical CPU".
--------------------------------------------------------------