/* CS 341 Spring 2002 */
/* Note segment 2 */
/* 22-Jan-2002 */
/* Taken by Apostolos Paul Pantazis */

Last Time: Basic CPU Architecture.

Memoryless Logic:
An input is applied to a device, and shortly afterwards the device generates an output. The catch is that the device does not remember anything about the input once the output has been generated. A good example is an adder.

Adder:

(1) -->|-------------|
       |             | --> (16 bit out)
(2) -->|-------------|

(1) & (2) are 16-bit inputs.

Remember that in building CPUs the presence of a clock and a latch is vital. Two things matter when designing faster CPUs:
1. We care about the amount of time (in picoseconds) it takes for the output to be produced.
2. We need a latch.

Last time we talked about the one-address accumulator machine. It works in a very simple way: it holds one value in the accumulator and one in the PC (recall PC = Program Counter). It is also important to know that a 1-address machine accesses memory at least twice per instruction.

There are 2 ways of building a computer:
1. Synchronous --> every part of the computer produces its values based on a clock.
2. Asynchronous --> no clock is present; a signal is generated to denote that a value is ready. Easier to understand. The division circuits on a Pentium are implemented this way. The rest of the chip is synchronous.

Back to the 1-address accumulator machine:

ADD 17
MUL 21

* How does that work? *
* See class notes Memory Diagram for values *

--> PC will initially have a value of 100. ACC (for accumulator) will start at 0.
--> At the first clock cycle, 100 is placed on the address bus. We get back the value 17.
--> So we then place 17 on the address bus. A value of 40 comes back and is added to the ACC, which is now 40, and PC is 101.
--> The next instruction is fetched and the process is repeated.

* On each cycle of the clock we make one memory access, so each instruction takes 2 cycles of the clock.
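The fetch/execute steps traced above can be sketched in Python. This is only a rough illustration, not the machine from class: storing each instruction as an (opcode, address) pair, the value 3 at address 21, and the names MEMORY and run are assumptions made for the example. Only PC = 100, ACC = 0, and the value 40 at address 17 come from the notes.

```python
# Minimal sketch of a one-address accumulator machine.
# Assumed for illustration: instructions are stored as (opcode, address)
# pairs, and address 21 holds 3. Only PC = 100, ACC = 0 and the value 40
# at address 17 come from the notes.
MEMORY = {
    100: ("ADD", 17),
    101: ("MUL", 21),
    17: 40,
    21: 3,  # hypothetical value, not from the class memory diagram
}


def run(memory, pc=100, count=2):
    """Execute `count` instructions; return (acc, pc, memory_accesses)."""
    acc = 0
    accesses = 0
    for _ in range(count):
        opcode, address = memory[pc]  # access 1: fetch the instruction
        accesses += 1
        operand = memory[address]     # access 2: fetch the operand
        accesses += 1
        if opcode == "ADD":
            acc += operand
        elif opcode == "MUL":
            acc *= operand
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
        pc += 1
    return acc, pc, accesses


if __name__ == "__main__":
    print(run(MEMORY, count=1))  # state after ADD 17
    print(run(MEMORY, count=2))  # state after ADD 17 and MUL 21
```

With the assumed memory contents, the first call leaves ACC = 40 and PC = 101 after 2 memory accesses, matching the trace in the notes. The second call reaches ACC = 120 after 4 accesses.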
* Better stated: (2 memory accesses per instruction) * (1 cycle per memory access) == [2 cycles per instruction]
  |
  |--> If the ALU is really slow the above would not hold.

Modern Electronics:
The cycle time is pretty fast, so 1 cycle is 1/2 a nanosecond (500 picoseconds). One of the MAJOR concerns in CPU design is accessing memory less often, that is, hiding the memory latency that exists.

--> How do we hide this latency?

(1). Do more things inside the CPU.
  |
  |--> Use multiple registers instead of 1 accumulator.

So... multiple registers. What is that?
Inside the CPU you have a Register Bank, r0....r7 let's say. Instructions do not operate only on the ACC; instead they name registers, like:

ADD r0, r1   /* result in r1 */
MUL r2, r3
LOAD r3, r5
  |
  |--> 2-address machine.

This is a LOAD/STORE 2-address architecture with 8 registers (r0...r7).

An instruction would look like:

|------------------------------|
|     |            |           |
|-----|------------|-----------|
  (1)      (2)          (3)

(1) --> Opcode
(2) --> 1st register, 3 bits.
(3) --> 2nd register, 3 bits.

* Recall there are 8 registers, so each register field is 3 bits long (2^3 = 8).

Opcode Classes:
1. Arithmetic, like ADD, MUL
2. Memory access, like STORE, LOAD
3. Opcodes for control, for branching (see *1 below).
4. Boolean opcodes.

(*1) : like JUMP, which gives an absolute address. The JUMP instruction carries an address, and execution jumps to that address.
JUMP conditionally: will only branch if the most recent value calculated by the CPU is non-zero.

Often the ALU will produce an output plus some extra information about what occurred during the operation, like overflow info. These are called the CONDITION CODES (CC). Is it negative? Was there an overflow? Is it equal to zero? The branch instruction will look at the CC.

Back to hiding latency --> Solution #2... first let us note 2 related notions:
A) Latency: the time from initiating a request until the response is received.
B) Bandwidth: the amount of info you get per second.

Memory has a big latency. You can hide this by using bandwidth.
On a memory read, instead of returning just the next byte, return 512 bytes (a sector) or even 16 sectors. Cache is the answer to #2 (L1, L2, L3).

Registers are sometimes wrongly thought of as another level in the memory hierarchy. Registers are managed by the programmer; cache is not visible to the programmer, and is only managed by us under special circumstances.

3-address machine.
--> Specify both the sources and the destination register.
--> ADD r0, r1, r2
--> On the 2-address machine the same operation takes two instructions:
    MOV r1, r2
    ADD r0, r2

0-address machine, AKA Stack machine.
--> Both sources and the destination are implicit.
--> A stack machine solves the problem of not knowing how many registers to have by just having a stack. ALU inputs are always the 2 top values on the stack: Pop(val_1, val_2) and Push(new_val). Pretty fast.

/* Stack machine sample code */
/* implement A = B*C+D*E */
LOAD B    --> push
LOAD C    --> push
MUL       --> pop both, push the product
LOAD D
LOAD E
MUL
ADD
STORE A

Virtual Memory
--> Treating memory as a cache for disk (MSRM) --> MASS STORAGE ROTATING MEDIA.
--> Virtual to physical address translation.

Branch Delay Slot

ADD r0, r1
MUL
BRZ L31
SUB r3, r4
  |
  |--> BDS: the branch does not take effect until the instruction following the branch has executed; then control JUMPs to the label and does what is specified there.

L31: (a label to branch to)
     (do some things ..)
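Returning to the stack machine sample code above, the push/pop behaviour can be traced with a short Python sketch. It is only an illustration under assumptions: the (opcode, operand) encoding, the names run_stack and PROGRAM, and the values chosen for B, C, D and E are all made up for the example and do not come from the class notes.

```python
# Minimal sketch of a 0-address (stack) machine evaluating A = B*C + D*E.
# Assumed for illustration: the (opcode, operand) program encoding, the
# function name run_stack, and the values of B, C, D and E.
def run_stack(program, memory):
    """Run (opcode, operand) pairs against a dict used as memory."""
    stack = []
    for opcode, operand in program:
        if opcode == "LOAD":
            stack.append(memory[operand])   # push a value from memory
        elif opcode == "STORE":
            memory[operand] = stack.pop()   # pop the result into memory
        elif opcode in ("ADD", "MUL"):
            right = stack.pop()             # ALU inputs are the two
            left = stack.pop()              # top values on the stack
            result = left + right if opcode == "ADD" else left * right
            stack.append(result)            # push the new value
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return memory


PROGRAM = [
    ("LOAD", "B"), ("LOAD", "C"), ("MUL", None),
    ("LOAD", "D"), ("LOAD", "E"), ("MUL", None),
    ("ADD", None), ("STORE", "A"),
]

if __name__ == "__main__":
    memory = {"B": 2, "C": 3, "D": 4, "E": 5}  # hypothetical values
    print(run_stack(PROGRAM, memory)["A"])
```

Note that no instruction except LOAD and STORE names an operand: the sources and the destination of ADD and MUL are implied by the stack, which is what makes this a 0-address machine.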