ECE 558 / 658 VLSI Design
Lab 3: Design of a 4-Bit Accumulator

Due Monday, November 9, 11:59 PM

Objective: Design a CMOS circuit and layout for a 4-bit accumulator with four instances of the bitslice accumulator from Lab 2. Note that a fully functional Lab 2 is a prerequisite for this lab!

A 4-bit accumulator consists of a 4-bit full adder and a resettable 4-bit register. Its 7 inputs are phi, {A3, A2, A1, A0}, c_in, and reset. Its 5 outputs are {Q3, Q2, Q1, Q0} and c_out. The adder computes the sum of {A3, A2, A1, A0}, {Q3, Q2, Q1, Q0}, and c_in, and generates a sum {S3, S2, S1, S0} and a carry c_out. The register samples {S3, S2, S1, S0} on the rising edge of phi and stores the result on {Q3, Q2, Q1, Q0}. Note: The braces { } in the above description are for readability of the 4-bit vectors only and are NOT part of the signal names.
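
The adder-plus-register behavior described above can be sketched as a small behavioral model. This is a minimal sketch, not the lab's schematic: the function names (`full_adder`, `ripple_add4`, `clock_edge`) are hypothetical, the slices are assumed to be chained carry to carry, and reset is modeled simply as forcing Q to 0.

```python
def full_adder(a, q, c_in):
    """One bit slice: return (sum, carry_out) for single-bit inputs."""
    s = a ^ q ^ c_in
    c_out = (a & q) | (a & c_in) | (q & c_in)
    return s, c_out


def ripple_add4(a, q, c_in):
    """Chain four slices, LSB first; return (4-bit sum, c_out)."""
    s = 0
    carry = c_in
    for i in range(4):
        bit, carry = full_adder((a >> i) & 1, (q >> i) & 1, carry)
        s |= bit << i
    return s, carry


def clock_edge(q, a, c_in, reset=False):
    """Value held on Q after a rising edge of phi (reset forces 0)."""
    if reset:
        return 0
    s, _ = ripple_add4(a, q, c_in)
    return s


if __name__ == "__main__":
    q = clock_edge(0, 0, 0, reset=True)
    for a in (0b0011, 0b0110):
        q = clock_edge(q, a, 0)
        print(f"A={a:04b} -> Q={q:04b}")
```

The exhaustive comparison against integer addition is cheap here, since there are only 16 x 16 x 2 input combinations.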

Be sure to include each required item (indicated by POST:) in your report. You must also explain what you did and why; images alone are not sufficient. Analyze your results, draw conclusions, and describe what you learned.




0. Block Diagram

                * The design contains four 1-bit accumulators. Cin and Cout are chained from slice to slice. {A3,A2,A1,A0} is the primary input data, together with Cin. The other adder input comes from the flip-flop output.
                * The 4-bit adder is a ripple-carry adder and can represent results from 0000 to 1111. If the result of the calculation is larger than 1111, the final Cout is set and the sum wraps around past 0000.
                * The carry logic is designed in domino style. The precharge phase is when phi(clk) is low, so Cout is valid when phi(clk) is high (evaluation phase).
                * {Q3,Q2,Q1,Q0} is the primary result vector of this design. Buffers are included to drive large downstream gates.
                * Flip-Flop sample at rising edge, and


1.     Draw a schematic for the entire 4-bit accumulator using the schematic editor tool (Dsch2 for undergrads, Cadence for grads). Capture a screen image of the editor window.



             I designed the circuit hierarchically. I created a symbol for a single carry logic block and a symbol for a single sum logic block, and built the adder symbol from these two. I then created a symbol for the flip-flop, and combined the adder and the flip-flop into a 1-bit accumulator symbol. Finally, I connected four 1-bit accumulators to form the 4-bit accumulator.
             There are two kinds of load capacitors: one on Cout (100fF) and one on Q (100fF). Because Q is a 4-bit vector, there are four capacitors, one per bit.
             I also brought out the sum from each adder for testing. It is not otherwise needed by this circuit. The sum is valid only while clk is low, because that is the evaluation phase.

             {A3,A2,A1,A0} and Cin are the primary data inputs. Clk and Rst drive the carry logic and the flip-flops and are connected to all slices in parallel.


2.     Use the logic simulator tool (schm2sim.pl + IRSIM) to validate your schematic.

          The main idea of the reduced testbench is to test the identical modules in parallel.
          Each bit-slice accumulator has almost the same runtime behavior, with only a few exceptions.

          The table below is the truth table for the 1-bit accumulator.

A  Q  Cin  S  Cout
0  0  0    0  0
0  0  1    1  0
0  1  0    1  0
0  1  1    0  1
1  0  0    1  0
1  0  1    0  1
1  1  0    0  1
1  1  1    1  1

          The inputs of each adder are A, Q, and Cin. A and Q are independent of the other bit-slice accumulators.
          Cin depends on the neighboring bit-slice accumulator, because it is connected to that slice's Cout.
          Thus, if Cin and Cout have the same value, the slices can be tested in parallel.
         
          The color-marked rows ({A,Q,Cin} = {0,0,1} and {1,1,0}) must be tested sequentially.
          Fortunately, these two rows can be overlapped with the next bit-slice accumulator. After the rising edge with the input vector {A,Q,Cin}={0,0,1},
          Q stores the value 1. In the next cycle, applying A={1} for the second colored row makes the carry ripple to the next accumulator (this is the first colored row condition for the second accumulator).

          The parallel-testing idea and the overlap idea shorten the test sequence dramatically.
          This sequence has the same coverage as the exhaustive test sequence, which means every possible case is tested by this testbench.

          Here is the full-chip test sequence. The all-zero input is a redundant case, so I removed it.
          I changed the timing relative to the other IRSIM tests: I apply the inputs just before the clock edge rather than just after it. This saves one cycle, and the values of {A3,A2,A1,A0} and {Q3,Q2,Q1,Q0} appear in the same phase.
          I did not spend a full cycle on the initial reset alone. When the next phase only sets up state, I did that setup during the reset phase.

Phase(Clk)  Rst  A3 A2 A1 A0  Q3_n Q2_n Q1_n Q0_n  Cout2(Cin3) Cout1(Cin2) Cout0(Cin1)  Cin  Cout  Comment
1           1    0  0  0  0   0    0    0    0     0           0           0            1    0     sequential test
2           0    0  0  0  1   0    0    0    1     0           0           1            0    0
3           0    0  0  1  0   0    0    1    0     0           1           0            0    0
4           0    0  1  0  0   0    1    0    0     1           0           0            0    0
5           0    1  0  0  0   1    0    0    0     0           0           0            0    1
6           1    0  0  0  0   0    0    0    0     0           0           0            0    0     reset
7           0    1  1  1  1   1    1    1    1     0           0           0            0    0     A,Q,Cin = 100
8           0    1  1  1  1   1    1    1    1     1           1           1            1    1     A,Q,Cin = 111
9           0    0  0  0  0   1    1    1    1     0           0           0            0    0     A,Q,Cin = 010
10          1    0  0  0  0   0    0    0    0     0           0           0            0    0     reset
11          0    1  1  1  1   0    0    0    0     1           1           1            1    1     A,Q,Cin = 101
12          1    1  1  1  1   1    1    1    1     0           0           0            0    0     reset
13          0    0  0  0  0   0    0    0    0     0           0           0            1    1     A,Q,Cin = 001

    With this test, every state is visited and the combinational logic is validated completely.

    The total is 13 cycles. Only 10 cycles are for validation; the other 3 are for setting up states.

    Here is the IRSIM result.



3.     Design a layout for your accumulator (Cadence).



             * This is the layout for the 4-bit accumulator. The total size is 500(lambda)*500(lambda) = 600um*600um. The aspect ratio is 1:1.
             * The inputs {A3,A2,A1,A0} and the outputs {S3,S2,S1,S0},{Q3,Q2,Q1,Q0} are accessible at the top of the layout. Clk and Rst are distributed across the design on horizontal metal1 lines. Cin and Cout of adjacent slices are connected to each other.

             * I instantiated the 1-bit accumulator as a symbol and added no additional wires. To do that I had to re-create the labels and pins, because the pins and labels are part of the symbol, which means there would be four pins named A rather than {A3,A2,A1,A0}.
             * Almost every horizontal line is metal 2 and almost every vertical line is metal 1. I did not use poly for wiring, and I tried to minimize its length to reduce delay.

4.     Test your layout for functionality by executing the following algorithm. Sequentially add the last 4 digits of your student ID number, where each digit is represented as a 4-bit word.

             (1) Hand Calculation

            <Flow Table>
             * The input vector is {5,3,5,1} = {0101,0011,0101,0001}.
             * Cin is held at 0.
             * Rst is asserted during phase 1 only.
Phase(Clk)  Rst  Cin  A3 A2 A1 A0  S3 S2 S1 S0  Q3 Q2 Q1 Q0  Cout
1           1    0    0  0  0  0   0  0  0  0   0  0  0  0   0
2           0    0    0  1  0  1   0  1  0  1   0  0  0  0   0
3           0    0    0  0  1  1   1  0  0  0   0  1  0  1   0
4           0    0    0  1  0  1   1  1  0  1   1  0  0  0   0
5           0    0    0  0  0  1   1  1  1  0   1  1  0  1   0
6           0    0    0  1  1  0   0  1  0  0   1  1  1  0   1

                      i) Phase 1: I asserted reset only. All the other signals are held at 0.
                                            There is no unknown state here, because Rst is already 1 at the first rising clock edge.
                      ii) Phase 2: I applied 5=0101. The sum is 0101 because Cin and Q are zero (the FF was reset in phase 1).
                      iii) Phase 3: I applied 3=0011. The sum is 1000 (5+3) because the phase 2 sum (0101) is stored in the FF at the beginning of phase 3 (rising clock).
                                           The sum is valid only while the clock is low.
                      iv) Phase 4: I applied 5=0101. The previous sum (phase 3, 1000) was stored at the rising clock. The sum is 1101 (8+5).
                      v) Phase 5: I applied 1=0001. The previous sum (phase 4, 1101) was stored at the rising clock. The sum is 1110 (13+1).
                      vi) Phase 6: I additionally applied 6=0110, because Cout could not be checked with the ID vector alone.
                                            The previous sum (phase 5, 1110) was stored at the rising clock. The sum is 0100 and Cout is 1 (14+6=20).

                      Final value of S = {0100} because the result is 20 (over 1111), so the sum wraps to 20 - 16 = 4.
                      Final value of Cout = {1} because the result is 20 (over 1111).
                      Final value of Q = {1110} because the final sum is not stored yet.
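
The hand calculation above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes, as in the flow table, that Cin stays at 0, that Q is 0 after the phase 1 reset, and that S is stored into Q on the following rising edge; the variable names are hypothetical.

```python
digits = [5, 3, 5, 1, 6]    # ID digits 5,3,5,1 plus the extra 6 applied in phase 6
q = 0                       # register contents after the phase 1 reset
rows = []
for phase, a in enumerate(digits, start=2):
    total = q + a           # Cin is held at 0
    s, c_out = total & 0xF, total >> 4
    rows.append((phase, a, s, q, c_out))
    print(f"phase {phase}: A={a:04b} S={s:04b} Q={q:04b} Cout={c_out}")
    q = s                   # S is stored on the next rising clock edge
```

The printed S, Q and Cout columns should agree with phases 2 to 6 of the flow table.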

                (2) IRSIM result for layout


                      * The result matches the hand calculation.
                      * S is stored on the rising edge of the clock. The valid sum is present while clk is low, because the Cout logic is domino logic and I fed it clk' instead of clk.
                      * Because the adder output is valid while the clock is low, the value stored in the FF at the rising edge is valid.
                        Otherwise the FF would store the precharged value.

                   (3) IRSIM input vector  - [Click Here]

                   (4) HSPICE Result.
                         i) Overall result
                   
                               I applied exactly the same input sequence as in IRSIM and obtained the same result, except for the delays.
                               There are 2 glitches on S2, because the sum logic is combinational and its inputs (A from the primary input, Cin' from the carry logic, Q from the flip-flop) arrive at different times.
                               I used a long clock period, because this run only checks functionality.

                         ii) Power
                               * Overall power plot.


                               - Because this is a synchronous circuit, all activity occurs at the clock edges. The flip-flops sample at the rising edge. The carry logic starts precharging at the beginning of the clock's high level and starts evaluating at the beginning of the clock's low level.
                                  Thus, most of the power is consumed at the clock edges.
                                  Near the rising edge of the clock, sampling and precharging happen together, so more power is consumed at the rising edge than at the falling edge.
                                  The peak power is more than 10mA.

                                     * Average power result.
$DATA1 SOURCE='HSPICE' VERSION='W-2005.03-SP1   '
.TITLE '************************** 4bit accumulator **************************'
 pstath                  pstatl           pdynavg         poverall     temper              alter#
  2.653e-06        1.452e-06        6.537e-03        7.169e-04    25.0000           1.0000
                                    * pstath is the static power while the clock is high.
                                    * pstatl is the static power while the clock is low.
                                    * pdynavg is the worst-case dynamic power consumption.
                                       At the beginning of phase 4, the overall waveform shows that every output changes except Cout, Q1, S1, and S3.
                                       The overall power plot shows the same thing.
                                    * poverall is the average power from 0 to 48, that is, from the start of the scenario to the end of the scenario.
                              
                               - Power summary.
Power                                  Value
Peak power                             11mA
Dynamic power (worst case)             6.537mA
Static power while clk is high         2.653uA
Static power while clk is low          1.452uA
Total average power for the testing    0.717mA
                       
                         iii) Hspice test file - [Click Here]


5.     Identify the critical (slowest) timing path in your circuit which will limit the clock frequency of your accumulator.

             The critical path is the maximum-delay path between synchronized units, which are the primary inputs, the primary outputs, the flip-flops, and the carry logic.
             We do not need to consider the primary inputs and primary outputs in this case, because there is no delay element in those paths.
             The flip-flops act on the rising edge only, but the carry logic is active during both the high level and the low level.
             To find the maximum clock rate, we therefore adjust the high-level period and the low-level period of the clock separately.

             Thus, we can analyze the critical path in two parts, rising and falling.
             1) Rising edge or high level of the clock.
                   At the rising edge, the flip-flops sample their input values and the carry logic starts to precharge.
                   So the sampling time (including setup time) and the precharge time are the candidates.
                   Therefore, the critical path is given by max(FF sampling time, worst-case precharge time of the carry logic).
                   There are only two cases for the FF sampling time: sampling 1 while the old value is 0, or the opposite.
                   The worst case for the precharge time is:
                         Every input is 1, giving a 0 output (Cout'), which means every nMOS node was discharged.
                         Therefore, the capacitance of all nMOS nodes must be charged during the precharge phase.

                   The input sequence is :
Phase(Clk)  Rst  Cin  A3 A2 A1 A0  S3 S2 S1 S0  Q3 Q2 Q1 Q0  Cout(Cout')
1           1    0    0  0  0  0   0  0  0  0   0  0  0  0   0(1)
2           0    0    1  1  1  1   1  1  1  1   0  0  0  0   0(1)
3           0    0    0  0  0  0   0  0  0  0   1  1  1  1   0(1)
                   Just after phase 2 (beginning of phase 3), every nMOS node in the carry logic must be recharged, and the flip-flops sample new values.
                   Phase 3 is for the clock-to-Q delay. If the clock-to-Q delay is longer than the precharge time, it dominates.
                             

             2) Falling edge or low level of the clock.
                     At the falling edge, the flip-flops do nothing.
                   A well-known problem of the ripple-carry adder is the delay of the rippling carry.
                    During the evaluation phase of the domino logic, the carry must ripple from Cin of the first adder to Cout of the last adder. The sum logic cannot produce a valid output until a valid Cout' has been produced.
                            Sum = ABCin + Cout'(A+B+Cin)
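
The sum expression above can be checked exhaustively against the usual three-input XOR. This is a minimal sketch, assuming that B denotes the second adder input (Q in this design) and that Cout is the majority function of the three inputs; the function name is hypothetical.

```python
from itertools import product


def sum_via_carry(a, b, c_in):
    """Sum = A*B*Cin + Cout' * (A + B + Cin), with Cout = majority(A, B, Cin)."""
    c_out = (a & b) | (a & c_in) | (b & c_in)
    not_c_out = 1 - c_out
    return (a & b & c_in) | (not_c_out & (a | b | c_in))


# Collect every input combination where the expression disagrees with A xor B xor Cin.
mismatches = [
    (a, b, c)
    for a, b, c in product((0, 1), repeat=3)
    if sum_via_carry(a, b, c) != a ^ b ^ c
]
print("mismatches:", mismatches)
```

Because the sum term reuses Cout', the sum output can only settle after the carry of the same slice has settled, which is why the last sum logic ends the critical path.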

                    The sum logic is followed by the FF, which is the end of the synchronized path. Thus, the critical path runs from the input of the carry logic of the first adder to the sum logic of the last adder.

                    We can estimate the delay:

                               high-level (clock high) delay for the critical path = max(precharge time, clock-to-Q delay)
                               low-level (clock low) delay for the critical path = 4 carry logic propagation delays + 1 sum logic propagation delay + setup time

                   All of this computation must complete while the clock is low.

                   The input sequence must meet these conditions:
                        i) The carry must ripple to Cout' of the last adder.
                        ii) The output of the last sum logic must change.
                The input sequence is :

Phase(Clk)  Rst  Cin  A3 A2 A1 A0  S3 S2 S1 S0  Q3 Q2 Q1 Q0  Cout(Cout')
1           1    0    0  0  0  0   0  0  0  0   0  0  0  0   0(1)
2           0    0    1  1  1  1   1  1  1  1   0  0  0  0   0(1)
3           0    0    0  0  0  0   1  1  1  1   1  1  1  1   0(1)
4           0    0    0  0  0  1   0  0  0  0   1  1  1  1   1(0)

                   I set every FF to 1. After phase 3, the circuit is ready to ripple.
                   Because A0 is set to 1 in phase 4, the carry ripples and S3 changes from 1 to 0.

                   * Test of the low-level period (10 ps precision):
                   1.60 ns : the high voltage no longer reaches 2.5V.
                   1.56 ns : the high voltage is 2.27V <= minimum period.
                   1.55 ns : the high voltage is 2.24V <= more than 10% below 2.5V.

                   * Test of the high-level period (10 ps precision):
                   0.3 ns : works well. Cout goes all the way down to 0.
                   0.28 ns : Cout goes down almost to 0.
                   0.26 ns : Cout goes down almost to 0.
                   0.24 ns : the low voltage is slightly above 0.
                   0.23 ns : 10% point. The low voltage is 0.24V <= minimum period
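
The pass/fail rule used in both sweeps can be written out explicitly. This is a minimal sketch that assumes the criterion is the 10% rule implied above (a high level must reach at least 90% of 2.5 V, a low level must fall to at most 10% of 2.5 V) and uses only the measurements that have a voltage recorded; the names are hypothetical.

```python
VDD = 2.5
HIGH_MIN = 0.9 * VDD    # a high level must reach at least 2.25 V
LOW_MAX = 0.1 * VDD     # a low level must fall to at most 0.25 V

# period (ns) -> voltage reached (V), taken from the sweeps above
low_level_sweep = {1.56: 2.27, 1.55: 2.24}
high_level_sweep = {0.23: 0.24}


def shortest_passing(sweep, passes):
    """Return the shortest period whose measured voltage passes the check."""
    return min(t for t, v in sweep.items() if passes(v))


min_low = shortest_passing(low_level_sweep, lambda v: v >= HIGH_MIN)
min_high = shortest_passing(high_level_sweep, lambda v: v <= LOW_MAX)
print(min_low, min_high)
```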

                   The setup time is obtained in the next problem (Problem 6).
                   Worst-case setup time (falling) = 150ps

                   * The clock-to-Q delay.
                   While determining the low-level period, I kept the high-level period long. This guarantees that every input of the carry logic arrives before the evaluation period, which means the 1.56 ns is the pure worst-case evaluation period.
                   If the clock-to-Q delay is more than 0.23ns, the Q input cannot arrive before carry evaluation starts.

                   <The result of Clock to Q delay>
                  
                   The clock-to-Q delay is 1.05ns.

                   < Calculation of Maximum Clock Rate >
                   The clock period can be broken down as follows.

                   Clock period = low-level period + high-level period
                   High-level period = max(clock-to-Q delay, precharge time)
                   Low-level period = (carry propagation delay from the primary Cin to the output of the last carry logic) + (delay of the final sum logic) + (setup time)
                   The setup time is obtained in the next problem.

                   Thus, the period of 1 cycle = 1.05ns + 1.56ns + 0.15ns = 2.76ns
                   Maximum clock rate = 1/2.76ns = 362.319MHz
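
The arithmetic above can be reproduced directly. This is a minimal sketch that works in integer picoseconds to avoid rounding noise; the variable names are hypothetical, and the three terms are the measured values quoted above (clock-to-Q delay, worst-case evaluation period, setup time).

```python
clk_to_q_ps = 1050     # high-level period, set by the clock-to-Q delay
evaluation_ps = 1560   # measured worst-case evaluation (low-level) period
setup_ps = 150         # worst-case setup time (falling)

period_ps = clk_to_q_ps + evaluation_ps + setup_ps
f_max_mhz = 1e6 / period_ps    # 1 / (period in ps) expressed in MHz

print(f"period = {period_ps / 1000:.2f} ns, f_max = {f_max_mhz:.3f} MHz")
```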

                   I reran Problem 4 with this clock period and it works well. When I increased the clock rate slightly, the circuit started to malfunction, which means the value is quite accurate.

                   <The result of Problem 4 with these values>

                  

6.     Again using trial-and-error simulation (lo2spice.pl + HSPICE), determine the setup times for the A3, A2, A1, A0, and c_in inputs to the nearest 10 ps.


          Setup time is the amount of time by which an input signal must arrive before the rising clock edge. Setup time is not a dynamic factor; it is an intrinsic property of the flip-flop. Thus, in this design the setup time is the same for every input (A3,A2,A1,A0,c_in).
          Moreover, because the flip-flop input is the Sum from the adder logic, the input vector must be chosen so that the sum signal changes at the appropriate time.

          The test vectors are as follows.
          1) Setup time for rising.
              First, the FF must hold 0. Then, while every input except A0 is held at 0, Sum must become 1 just before the rising edge of the clock. The arrival time is then adjusted by trial and error.
             - HSPICE test vector.
Vclk Clk gnd dc 0 pulse (0 2.5 0n 100p 100p 2n 4n)
Vrst Rst gnd dc 0 pulse (2.5 0 2.5n 100p 100p 3000n 6000n)
Va1 A1 gnd dc 0 pulse (0 0 2.5n 100p 100p 2.5n 4n )
Va2 A2 gnd dc 0 pulse (0 0 2.5n 100p 100p 2.5n 4n )
Va3 A3 gnd dc 0 pulse (0 0 2.5n 100p 100p 2.5n 4n )
Va0 A0 gnd dc 0 pulse (0 2.5 7.52n 100p 100p 4n 8n )
Vin Cin gnd dc 0 pulse (0 0 2.5n 100p 100p 1n 1n )
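
The timing of these sources can be worked out from the standard SPICE PULSE argument order (V1 V2 TD TR TF PW PER). The sketch below is illustrative only: the helper name is hypothetical, times are in integer picoseconds, and the computed value is the lead of the A0 transition over the next clock edge at the sources, which includes the adder delay and is therefore larger than the sum-to-clock terms quoted in this section.

```python
def next_rising_edge_ps(t_ps, td_ps, per_ps):
    """Start of the first PULSE rising edge at or after t_ps."""
    if t_ps <= td_ps:
        return td_ps
    cycles = -(-(t_ps - td_ps) // per_ps)    # ceiling division
    return td_ps + cycles * per_ps


# Vclk: pulse (0 2.5 0n 100p 100p 2n 4n)    -> TD = 0 ps, PER = 4000 ps
# Va0 : pulse (0 2.5 7.52n 100p 100p 4n 8n) -> A0 starts rising at 7520 ps
a0_rise_ps = 7520
clk_edge_ps = next_rising_edge_ps(a0_rise_ps, td_ps=0, per_ps=4000)
lead_ps = clk_edge_ps - a0_rise_ps

print(f"clock edge at {clk_edge_ps} ps, A0 leads by {lead_ps} ps")
```

Sweeping the TD value of Va0 in 10 ps steps moves the A0 transition closer to the clock edge at 8 ns.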

              * Trial-and-error flow (10 ps precision).
                    A : 7.5ns
                    Interval between sum and rising edge 0.079ns (S 0->1, clk 0->1) : works fine.
                    [result]

                    A : 7.51ns
                    Interval between sum and rising edge 0.070ns (S 0->1, clk 0->1) : works fine.
                    [result]

                    A : 7.52ns
                    Interval between sum and rising edge 0.058ns : doesn't work.
                    [result]

                  Thus, the setup time for a rising transition driven by A is 0.070ns.

                   C : 7.52 ns
                    Interval between sum and rising edge 0.180ns : works
                      [result]

                   C : 7.62 ns
                    Interval between sum and rising edge 0.082ns : works
                      [result]

                   C : 7.63 ns
                    Interval between sum and rising edge 0.072ns : works
                      [result]

                   C : 7.64 ns
                    Interval between sum and rising edge 0.063ns : doesn't work
                      [result]

                The setup times for Cin and A should be the same; the small difference comes from the measurement precision. The rise and fall times of S differ depending on which source (A or Cin) drives the change, so the slope of the voltage plot is different.
                Because 0.070ns is the smallest passing value, it is the more accurate one.
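
The trial-and-error flow above amounts to taking the smallest sum-to-clock interval that still works. This is a minimal sketch using the intervals recorded above, in integer picoseconds; the function and variable names are hypothetical.

```python
# (interval between S and the rising clock edge in ps, did the FF capture correctly?)
trials_a = [(79, True), (70, True), (58, False)]
trials_cin = [(180, True), (82, True), (72, True), (63, False)]


def setup_time_ps(trials):
    """Smallest passing interval; every failing interval must be shorter."""
    passing = [t for t, ok in trials if ok]
    failing = [t for t, ok in trials if not ok]
    best = min(passing)
    if any(t >= best for t in failing):
        raise ValueError("pass/fail results are not monotonic")
    return best


print("A  :", setup_time_ps(trials_a), "ps")
print("Cin:", setup_time_ps(trials_cin), "ps")
```

The true setup time lies somewhere between the smallest passing and the largest failing interval, so the gap between them bounds the measurement error of each sweep.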

          2) Setup time for falling.
             The input sequence must first set the FF to 1, and then Sum must change 1->0 just before the rising edge of the clock. The arrival time is then adjusted by trial and error.
             - HSPICE test vector.
Vclk Clk gnd dc 0 pulse (0 2.5 0n 100p 100p 2n 4n)
Vrst Rst gnd dc 0 pulse (2.5 0 2.5n 100p 100p 3000n 6000n)
Va1 A1 gnd dc 0 pulse (0 0 2.5n 100p 100p 2.5n 4n )
Va2 A2 gnd dc 0 pulse (0 0 2.5n 100p 100p 2.5n 4n )
Va3 A3 gnd dc 0 pulse (0 0 2.5n 100p 100p 2.5n 4n )
Va0 A0 gnd dc 0 pulse (0 2.5 8n 100p 100p 4n 7.25n )
Vin Cin gnd dc 0 pulse (0 0 2.5n 100p 100p 1n 1n )

                A : 15.22 ns
                   Term between sum and rising edge 0.18ns : works well
                   [result]

                A : 15.24 ns
                   Term between sum and rising edge 0.16ns : works well
                   [result]

                A : 15.25 ns
                   Term between sum and rising edge 0.15ns : works
                   [result]

                A : 15.26 ns
                   Term between sum and rising edge 0.14ns : doesn't work
                   [result]

                  Thus, the setup time for a falling transition is 0.15ns.

           * Summary
                The setup time is not a property of the primary inputs but of the flip-flop input. In this design, that input is S0, S1, S2, S3.
                The setup time depends mainly on the flip-flop itself.

          Setup time for rising activity   Setup time for falling activity
A0(S0)    70ps                             150ps
A1(S1)    70ps                             150ps
A2(S2)    70ps                             150ps
A3(S3)    70ps                             150ps
Cin(Sn)   72ps                             152ps
             The setup time measured for Cin is slightly larger, but the slope of its voltage plot also differs from that of An, which means the 10ps input step gives a less accurate result for Cin. The failing point (0.063) confirms this: there is a large gap between 0.072 and 0.063.

             Thus, the setup time for this FF is 150ps.

7.     Again using trial-and-error simulation (lo2spice.pl + HSPICE), determine the propagation delay for the outputs Q3, Q2, Q1, Q0, and c_out to the nearest 10 ps.

          (1) Delay for Q.
          To measure this delay, we first need an input sequence.
          I used a sufficiently long clock period.
          The flip-flop samples its input value at the rising edge.
         
             i) tp from reset.
                First set Q to 1, then assert reset away from any clock edge. Because reset is an asynchronous signal, Q goes to zero independently of the clock.
            ii) tpLH from clock rising.
                First clear Q to 0, hold A at 0 for a while, and then set A to 1. At the next rising clock edge, the new value propagates to Q.

          The 4 accumulator slices are identical, which means that if the input and state are the same, the delay is the same as well.
          Reset always drives Q to 0, so tp from reset means tpHL.
         

      tpHL from Clock rising   tpLH from Clock rising   tp from reset
Q0    670ps                    885ps                    690ps
Q1    670ps                    885ps                    690ps
Q2    670ps                    885ps                    690ps
Q3    670ps                    885ps                    690ps
         
             [Result . delay for Q]


       (2) Delay for Cout.
             There are 6 sources of this delay.

             i) From Cin : This is a ripple-carry adder. If the other signals hold their current values, a change on Cin can cause a change on Cout.
                                  First set {Q3,Q2,Q1,Q0}={1,1,1,1}. With no other signal changing (including phi) during the evaluation phase (low clock), I changed the value of Cin only, so Cout responds to Cin alone.

                   [Result]

             ii) From phi : This is domino logic. To present the correct value at the rising edge of the FF clock, the phi fed to this logic is inverted.
                                  The low level of phi is the evaluation period and the high level of phi is the precharge period. We can measure the delay from the falling edge of phi to Cout.
                                  First set {Q3,Q2,Q1,Q0}={1,1,1,1}, and apply Cin from the precharge phase through the evaluation phase. The delay is measured from the beginning of the evaluation phase, even though Cin arrives earlier.

                   [Result]

             iii) From A3,A2,A1,A0 : A3,A2,A1,A0 are inputs of the respective accumulator slices. Each of them can cause a change on Cout directly.
                                  First set {Q3,Q2,Q1,Q0}={1,1,1,1} and hold Cin at 0, because A must be what causes the output change.
                                  Then I set one of {A3,A2,A1,A0} to 1, which generates Cout.
                                  The path from A3 to Cout is the shortest and the path from A0 is the longest. Thus the delay from A3 is short and the delay from A0 is long.

                                [Result A1]           
                                [Result A2]
                                [Result A3]  
                                [Result A0]


             * Summary of the Result

Source                Q                           Cout
reset                 690ps                       -
rising clock          tpLH: 885ps, tpHL: 670ps    -
Cin                   -                           1.32ns
falling edge of phi   -                           1.54ns
A3                    -                           390ps
A2                    -                           700ps
A1                    -                           1.02ns
A0                    -                           1.33ns
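
The A-to-Cout delays in the summary can be used to estimate the carry delay added by each slice. This is a minimal sketch using the measured values in integer picoseconds; the variable names are hypothetical, and it assumes each step from one input to the next adds exactly one carry stage.

```python
# Measured delay from each A input to Cout, in picoseconds
delay_ps = {"A3": 390, "A2": 700, "A1": 1020, "A0": 1330}

order = ["A3", "A2", "A1", "A0"]    # shortest path first
steps = [delay_ps[b] - delay_ps[a] for a, b in zip(order, order[1:])]
avg_step_ps = sum(steps) / len(steps)

print("extra delay per additional carry stage (ps):", steps)
print(f"average: {avg_step_ps:.1f} ps")
```

The nearly constant increment is consistent with a carry that ripples through one more stage for each lower-order input.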