VLSI SoC Design

June 14, 2019

IR Drop Analysis - II

A couple of years back, I wrote about IR Drop Analysis in one of my earlier posts. Fortunately, I got to work on IR Drop Analysis more extensively over the past couple of months, and I thought I'd share the perspective I gained from that work in a new post!

During static timing analysis, the supply voltage (Vdd) at all the devices is assumed to be constant. Similarly, the ground pin (Vss) is assumed to be held at a constant 0 V. In reality, these voltages are not constant and vary with time. The variation on the power line is referred to as voltage droop, and the variation on the ground line as ground bounce. Collectively, this noise is referred to as power noise. IR drop on the data path cells will impact setup timing, while on the clock cells, it may cause both setup and hold timing problems.

Voltage Droop and Ground Bounce

The robustness of the power grid needs to be tested thoroughly under two modes of analysis. These two modes are referred to as Static IR Drop and Dynamic IR Drop.

I. Static IR Drop

Static IR drop takes into account the average current drawn from the power grid assuming average switching conditions. This analysis is performed early in the design cycle, when simulation vectors are not yet available to the design teams. Instead, static IR drop relies on average data switching to compute the average current drawn from the power grid over one clock cycle.

Static IR drop can highlight power grid weaknesses in the design. Static IR drop violations spread all across the design indicate that the power grid needs to be redesigned to reduce the overall grid resistance. In other cases, the violations may be concentrated around regions with inherent power grid weaknesses, such as regions with one-sided power delivery: around the floorplan boundary, around the macros, and within the macro channels.

The power distribution network (PDN) is usually a mesh in the topmost metal layers with strategic drop-downs to lower metal layers, which eventually feed the standard cells. Power is routed in the top metal layers to keep the resistance to a minimum, which also helps ensure uniform power delivery to all parts of the chip.

If the PDN is not designed carefully, it will create regions of one-sided power delivery, which are areas of high resistance.

Power grid strengthening can be achieved by:

Making the power grid denser by adding more, or wider, PG straps to improve current conduction.
Incrementally inserting vias or via ladders along the power grid to drop from a higher metal layer to lower metal layers.

Increasing the clock frequency (with or without optimizing for higher frequency target) has a direct impact on static IR drop, because it increases the average current drawn from the power grid.

Lowering the clock frequency decreases the average current, and hence also decreases static IR drop.
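To put rough numbers on the frequency dependence, here is a minimal sketch. It assumes the usual approximation I_avg ≈ activity × C × Vdd × f for switching current and lumps the whole grid into a single resistance. The function names and every number are made up for illustration; an actual IR drop tool solves the full resistive mesh instead.

```python
def average_current(activity, switched_cap_f, vdd, freq_hz):
    """Average switching current (A), using I_avg ~= activity * C * Vdd * f."""
    return activity * switched_cap_f * vdd * freq_hz


def static_ir_drop(i_avg_a, r_grid_ohm):
    """Static IR drop (V) across one lumped grid resistance (an assumption)."""
    return i_avg_a * r_grid_ohm


# Hypothetical block: 2 nF switched capacitance, 20% activity, 0.8 V supply,
# 50 milliohm effective grid resistance.
for freq in (1e9, 2e9):
    i_avg = average_current(0.2, 2e-9, 0.8, freq)
    drop_mv = static_ir_drop(i_avg, 0.05) * 1e3
    print(f"{freq / 1e9:.0f} GHz: I_avg = {i_avg:.2f} A, drop = {drop_mv:.0f} mV")
```

In this simplified model, doubling the clock frequency doubles both the average current and the static IR drop.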

II. Dynamic IR Drop

Dynamic IR drop, also known as Instantaneous Voltage Drop (IVD), is the instantaneous drop in the voltage rails because of high transient current drawn from the power grid. Dynamic IR drop takes into account the instantaneous current drawn from the power grid in a switching event. This analysis is usually performed towards the end of the design cycle, when the design team has simulation vectors available from their functional or test pattern simulations. This mode of analysis is the most time-consuming, but nevertheless critical to ensure no surprises on silicon.

Dynamic IR drop is a function of:

Power Distribution Network (PDN): Just like the static IR drop, weak PDN affects dynamic IR as well. A weaker power grid is not equipped to meet the peak current demand by switching standard cells and it exacerbates the dynamic IR drop.
Simultaneous Switching: Higher simultaneous switching of standard cells tends to create local hotspots where peak current demand is higher, which causes voltage to drop in these hotspots.
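The effect of simultaneous switching on peak current can be illustrated with a toy model. The sketch below assumes each switching cell draws a triangular current pulse and simply sums the pulses over time. The pulse shape, the 20 ps width and the function names are all assumptions chosen for illustration, not values from any library.

```python
def triangular_pulse(t, start, width, peak):
    """Current of one switching cell at time t, modelled as a triangle."""
    if t <= start or t >= start + width:
        return 0.0
    half = width / 2.0
    rise = t - start
    return peak * (rise / half if rise <= half else (width - rise) / half)


def peak_demand(start_times, width=20, peak=1.0, t_end=200):
    """Peak of the summed current, sampled at integer time steps (ps)."""
    return max(
        sum(triangular_pulse(t, s, width, peak) for s in start_times)
        for t in range(t_end)
    )


simultaneous = peak_demand([50, 50, 50, 50])  # four cells switch together
staggered = peak_demand([50, 70, 90, 110])    # same cells, spread out in time
print(simultaneous, staggered)
```

The total charge drawn is the same in both cases; only the peak that the local grid has to supply changes.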

Potential ways to mitigate dynamic voltage drop are as follows:

Augmenting the power grid to minimize PG resistance- Adding more power/ground straps facilitates better distribution of current to the standard cells, thereby reducing the susceptibility to dynamic IR drop.
Cell Padding- Another effective way to reduce dynamic IR drop is to space apart cells which switch simultaneously, reducing the peak current demand on any one region of the power grid. This works especially well for clock cells, which tend to switch at the same time and to be placed close to each other.

Cell Spacing to solve instantaneous voltage drop

Downsizing- Downsizing cells reduces the instantaneous current demand, with a possible downside on setup timing.

Downsizing cells to solve instantaneous voltage drop

Splitting the output capacitance- The amount of current drawn from the power grid is directly proportional to output capacitance that’s being driven. Splitting the output capacitance can reduce the peak current demand, and also improve timing in most cases.

Split output capacitance to reduce peak current drawn from the power grid

Inserting decap cells- Decap cells are decoupling capacitors that tend to act as charge reservoirs that can supply current to the standard cells in event of high requirement, especially when there’s simultaneous switching of cells in a local region. However, just like any capacitor, decaps tend to be leaky and add to the leakage power dissipated in the design.

Inserting decaps to minimize dynamic voltage drop
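The charge-reservoir role of decaps can be sketched with simple arithmetic. The model below assumes the grid is too slow to respond during the switching event, so all the charge I × Δt comes from whatever capacitance sits locally. That is a deliberate simplification, and the numbers and the function name are hypothetical.

```python
def local_droop(i_peak_a, duration_s, c_local_f):
    """Voltage droop (V) if the charge I * dt comes only from local capacitance."""
    return i_peak_a * duration_s / c_local_f


# Hypothetical hotspot: 10 mA for 20 ps, with 2 pF of intrinsic local capacitance.
before = local_droop(10e-3, 20e-12, 2e-12)
after = local_droop(10e-3, 20e-12, 2e-12 + 8e-12)  # add 8 pF of decap nearby
print(f"{before * 1e3:.0f} mV -> {after * 1e3:.0f} mV")
```

Under these assumptions the droop scales inversely with the local capacitance, which is why decaps help most when placed close to the switching cells.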

With shrinking geometries, designs are moving from gate-dominated designs to wire-dominated designs. Also, the operating frequencies have been increasing. More signal wires mean fewer routing resources for the power distribution network. Moreover, lower technology nodes allow higher packing density of standard cells. Higher frequencies cause higher switching, resulting in higher voltage droop and higher ground bounce.

Due diligence is necessary not just to design the power grid but also to analyze and fix the dynamic IR drop violations to avoid seeing any timing surprises on silicon.

December 16, 2018

Maze Router (Lee's Algorithm)

In this post, let's talk about the Maze Routing Algorithm, which is an application of the Breadth First Search (BFS) algorithm to find the shortest path between two nodes in a grid.

A basic version of this algorithm is also known as Lee's Algorithm. I will discuss Lee's Algorithm, and a few improvements to it that reduce the run time and memory.

Here's the problem statement. You need to connect the node S (source) and the node T (target or the destination) with the shortest possible path. These nodes are shown in red. The grids in blue represent a routing blockage, meaning you cannot route over these grids. You'll need to find a way around these to reach the destination node.

Problem Statement: Maze Router (Lee's Algorithm)

VLSI routes are laid orthogonally in the X and Y directions. Diagonal routing (also known as X-routing) is usually forbidden. Let's say you start out from node S; you have 4 possible directions in which you can proceed:

4 possible directions from a given node

The number 1 represents the distance traveled from the source node. Once you have traveled a distance of 1, here is what your grid looks like:

Grid after traveling a distance of 1 unit

Similarly, after traveling a distance of 2 units, the grid is shown below.

Grid after 2 iterations of Lee's algorithm

Now you've hit a wall, and it will be apparent from the next figure that you cannot hop over the wall (or the blockage). After multiple iterations of Lee's algorithm, the grid would look something like this:

Grid after 8 iterations of Lee's algorithm

Continue doing this till you hit the target or the destination node.

Grid after you hit the target node

Now you need to back-trace from the target to the source, following successively lower integers, to find the shortest path. Note that you may have many possible paths to choose from, but all of them are guaranteed to be the shortest. Usually, there's a cost associated with turns (vias in a physical context), so in practice you may assign a weight or parameter that minimizes the number of turns when choosing among multiple shortest paths.

Back-tracing to find the shortest path
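The wave expansion and back-trace steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a production router: the grid encoding ('#' for a blockage), the function name lee_route and the example grid are all assumptions made for this sketch, and the back-trace simply takes the first valid neighbour rather than minimizing turns.

```python
from collections import deque

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # orthogonal moves only


def lee_route(grid, source, target):
    """Return a shortest path from source to target, or None if unroutable.

    grid is a list of strings; '#' marks a blocked cell (hypothetical encoding).
    """
    rows, cols = len(grid), len(grid[0])
    dist = {source: 0}
    frontier = deque([source])
    # Wave expansion: label each reachable cell with its distance from source.
    while frontier and target not in dist:
        r, c = frontier.popleft()
        for dr, dc in STEPS:
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in dist):
                dist[(nr, nc)] = dist[(r, c)] + 1
                frontier.append((nr, nc))
    if target not in dist:
        return None
    # Back-trace: from the target, step to any neighbour labelled one less.
    path = [target]
    while path[-1] != source:
        r, c = path[-1]
        for dr, dc in STEPS:
            prev = (r + dr, c + dc)
            if dist.get(prev) == dist[(r, c)] - 1:
                path.append(prev)
                break
    return path[::-1]


example = [
    "...#....",
    "...#....",
    "...#....",
    "........",
]
route = lee_route(example, (0, 0), (2, 6))
print(len(route) - 1, "steps:", route)
```

In this example the wall in column 3 forces the route down to the bottom row and back up, so the path is longer than the straight-line Manhattan distance.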

This embodiment of Lee's algorithm is expensive, especially if the grid is larger than the one shown in the example above. Notice how much wasteful computation we had to perform over to the right. This can be reduced if we initiate the same computation from both the target and the source, and back-trace to the target and the source respectively once the two wavefronts (the one in green and the one in yellow) intersect. This results in much less run time and far fewer wasteful computations.

Modification to the Lee's algorithm to start computation from both target and the source
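Here is one way the two-wavefront idea could be sketched. It is a minimal version that only returns the shortest distance and the number of cells labelled, grows the smaller wavefront one full level at a time, and stops as soon as the wavefronts touch. The grid encoding ('#' for a blockage) and the function names are assumptions for this sketch, and the back-trace in both directions is left out for brevity.

```python
STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def neighbours(grid, cell):
    """Yield the free orthogonal neighbours of cell ('#' marks a blockage)."""
    r, c = cell
    for dr, dc in STEPS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] != '#':
            yield (nr, nc)


def bidirectional_length(grid, source, target):
    """Return (shortest distance or None, number of cells labelled)."""
    if source == target:
        return 0, 1
    dist_s, dist_t = {source: 0}, {target: 0}
    front_s, front_t = [source], [target]
    while front_s and front_t:
        # Always grow the smaller wavefront by one full level.
        if len(front_s) <= len(front_t):
            near, far, front = dist_s, dist_t, front_s
        else:
            near, far, front = dist_t, dist_s, front_t
        next_front, best = [], None
        for cell in front:
            for nxt in neighbours(grid, cell):
                if nxt in near:
                    continue
                near[nxt] = near[cell] + 1
                next_front.append(nxt)
                if nxt in far:  # the two wavefronts touch here
                    total = near[nxt] + far[nxt]
                    best = total if best is None else min(best, total)
        if best is not None:
            return best, len(dist_s) + len(dist_t)
        if front is front_s:
            front_s = next_front
        else:
            front_t = next_front
    return None, len(dist_s) + len(dist_t)
```

The second return value makes it easy to compare how many cells get labelled against a one-directional run on the same grid.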

Another possible improvement to the above algorithm is in the memory required to save the distance number for each node. Imagine a 10x10 grid. The worst-case distance could be 100, and you'd require 7 bits to store numbers up to 100. That means a worst-case space requirement of 700 bits for a 10x10 grid. For a 20x20 grid, the worst-case distance could be 400, requiring 9 bits per box and a total of 3600 bits. In order to reduce this, it's possible to count only up to 3 and then start again from 1 (1, 2, 3, 1, 2, 3, and so on). Back-tracing is slightly more complicated, but it saves you a ton of space!
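The bit counts above, and the saving from a short repeating label sequence, can be checked with a small sketch. It assumes the labels repeat cyclically as 1, 2, 3, 1, 2, 3, with 0 reserved for an unlabelled cell and blockages stored separately, which gives 2 bits per cell. That encoding and all function names are assumptions for illustration.

```python
def bits_needed(max_label):
    """Bits required to store integers 0..max_label."""
    return max(1, max_label.bit_length())


def full_distance_bits(side):
    """Storage (bits) if every cell holds a distance up to side * side."""
    return side * side * bits_needed(side * side)


def cyclic_bits(side):
    """Storage (bits) with labels 1, 2, 3 plus 0 for 'unlabelled'."""
    return side * side * bits_needed(3)


def cyclic_label(distance):
    """Label written during wave expansion: 1, 2, 3, 1, 2, 3, ..."""
    return (distance - 1) % 3 + 1


def predecessor_label(label):
    """Label to look for on a neighbour while back-tracing."""
    return (label - 2) % 3 + 1


for side in (10, 20):
    print(side, full_distance_bits(side), cyclic_bits(side))
```

Back-tracing still works because, on an orthogonal grid, the labelled neighbours of a cell at distance d are at distance d-1 or d+1, and those two always get different labels in a cycle of three.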

April 24, 2018

False Path v/s Case Analysis v/s Disable Timing

Often people have asked me the difference between set_false_path, set_case_analysis and set_disable_timing. While the difference between these three is quite easy, it's the implications that leave many designers stumped.

Let me take a shot at explaining the difference.

1. FALSE PATH: Timing paths which designers know will never be exercised, and which therefore don't need to meet any timing constraint, can be marked as false paths.
Tools would still compute delays on all arcs along a false path and would try to meet slope/max-fanout/max-capacitance targets for all nodes along the path, but these paths would never surface as timing (setup and hold) violations. However, if designers are particularly concerned about meeting slope and max cap targets, they usually prefer to constrain such paths with set_multicycle_path instead.

Some examples of false path:

Consider the circuit above. The select lines of the two multiplexers are complements of each other. The STA tool, however, doesn't understand this logic and would treat all nodes as X (either 0 or 1). In practice, there can never be a timing path between

C -> E -> G
D -> F -> G

And these can be marked as false paths.

2. CASE ANALYSIS: Using set_case_analysis, any node can be constrained to a boolean logic value of 1 or 0. All case values are evaluated and propagated through the design. For example, if one input of an AND gate is 0, 0 being the controlling value, the output of the AND gate would also be 0, and this 0 is propagated downstream. Timing arcs through nodes held constant by set_case_analysis are not evaluated and never show up in the timing reports. However, PnR tools would still fix max transition, max capacitance and max-fanout violations on these nets/pins.
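The constant propagation described above can be mimicked with three-valued logic (0, 1, and X for unknown). The sketch below is only a toy model of the idea; the gate functions, the tiny netlist and the signal names are hypothetical and have nothing to do with how any particular STA tool implements it.

```python
X = None  # unknown: the pin may be 0 or 1


def and_gate(a, b):
    """AND with three-valued inputs: a controlling 0 wins even over X."""
    if a == 0 or b == 0:
        return 0
    if a == 1 and b == 1:
        return 1
    return X


def or_gate(a, b):
    """OR with three-valued inputs: a controlling 1 wins even over X."""
    if a == 1 or b == 1:
        return 1
    if a == 0 and b == 0:
        return 0
    return X


# Hypothetical netlist: n1 = AND(scan_en, d); out = OR(n1, e)
scan_en = 0                # the case-analysis constant
n1 = and_gate(scan_en, X)  # constant 0: the d -> n1 arc can never toggle n1
out = or_gate(n1, X)       # 0 is non-controlling for OR, so out stays unknown
print(n1, out)
```

The constant stops propagating as soon as it reaches a gate where it is not the controlling value, which is why only part of the downstream logic drops out of timing analysis.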

Some recent tool versions also support a case value of static, which means that the node will always be static (never toggle); this is used to reduce pessimism while doing noise analysis.

Case analysis is also particularly useful for DFT modes where you would want to set a few configuration registers and drive the chip into a particular DFT mode: like atspeed, shift or stuck-at mode. This acts as an additional level of verification because you'd expect to see only scan chains in the shift mode with scan enable being 1. You'd expect to see functional paths in the atspeed mode with scan enable being X, and you'd expect to see only paths ending at functional register inputs in the stuck-at mode with scan enable being 0.

3. DISABLE TIMING: This disables a particular timing arc, so that neither that arc nor any timing path through it is computed. This tends to be a bit more disruptive than false paths or case analysis, but in some cases it is indispensable and the easiest way to achieve the intent. For example, if you have a MUX-based divider which receives the clock signal at the select line of the multiplexer and two functional enables at the multiplexer inputs, the STA tool would try to propagate the clock to the output of the MUX via the select line. But for a MUX, the select line only controls which input gets propagated to the output. In practice, the clock should not be propagated through the select -> output arc, and that arc should be disabled.

Both case analysis and disable timing result in fewer timing paths to be analyzed. With a false path, the tool still tries to fix the design rule (max cap, max transition and max fanout) violations.

April 08, 2018

Leakage Power: Input Vector Dependence

Leakage power of a standard cell depends on various transistor parameters like the channel length, threshold voltage, substrate or body bias voltage, etc. Apart from these physical parameters, leakage power also depends upon the input vector applied.

Consider a 2-input NAND gate and a 3-input NAND gate. Can you arrange the input combinations (AB = 00, 01, 10, 11 for a 2-input NAND gate, and ABC = 000, 001, 010, 011, 100, 101, 110, 111 for a 3-input NAND gate) in increasing order of leakage current, with a word or two about the logical reasoning behind it?

Note that the order of transistors in a stack matters here.

2-input NAND and 3-input NAND Gates

April 15, 2017

Tuning CTS Recipe

I've been trying to debug and tune my CTS recipe for quite some weeks now, and this gave me basic insight into the CTS algorithm and the various knobs available to designers to tune their CTS results to achieve the desired skew, transition and latency targets.

In this blog post, I'll discuss those knobs while trying my best not to go into tool-specific commands/constructs, to keep the discussion more conceptual and tool independent. Before we delve any deeper into these knobs, let's ask the basic question first: why do we need CTS to begin with, or what goals do we expect CTS to achieve for us? The answer is to create a balanced clock tree. A balanced clock tree simply means minimum skew between the sequentials in the design (of course, we would only be interested in skew within the same clock group. Let me know in comments if this part is not clear). In addition to minimizing the skew, we would also like to achieve minimum latency by adding the minimum number of clock buffers on the clock path, thereby ensuring less area, less routing congestion and, most importantly, no extra dynamic power dissipation!

Now, we have the required background to discuss the CTS knobs in detail! :)

1. Creating Skew Groups: Skew groups are basically groups of sink-pins (clock end-points) which need to be balanced against each other. Now, some skew groups may be default, some might need to be created explicitly to help CTS engine. We'll take a look at some use-cases.
Default skew groups: Let's say you have 5 clocks in your design.
Group1: CLK1, CLK2 and CLK3 are synchronous to each other.
Group2: CLK4, CLK5 are synchronous to each other.

Group1 and Group2 are logically exclusive and therefore clocks within each group are implicitly asynchronous to the clocks in other group.
In this case, by defining clock groups, we have implicitly defined skew groups. CTS engine would try and balance latencies of CLK1, CLK2 and CLK3. And independently try and balance clock latencies of CLK4 and CLK5.

Sometimes, however, designers might want to create some explicit skew groups on top of the implicit ones. Let's take a look at those use-cases.

The figure highlights the sequential clouds of devices working on CLK1, CLK2 and CLK3 respectively. Assume there's heavy traffic and interaction between CLK1 and CLK2 sequentials, while only a very few sequentials working on CLK3 interact with those working on CLK1 and CLK2. The clocks enter the partition via three different clock ports on the left side, and the distance between the CLK3 port and the CLK3 sequentials is clearly the largest, so the CTS engine would need to insert more clock buffers to maintain the transition (Ask yourself why? What would be the caveat if clock transition goes bad? Puzzle: Clock Transition). Assume the average latency that CTS can manage for CLK3 sequentials is 150 ps, while for CLK1 and CLK2 sequentials it's 100 ps. In order to balance these three clocks, CTS will push the clock latency for CLK1 and CLK2 sequentials to match the longest latency: 150 ps. If, as designers, we know that there is not much interaction between CLK3 sequentials and CLK1, CLK2 sequentials, or that even where there is, we have sufficient positive slack from a timing perspective (both hold and setup), we don't really need to balance these three clocks. We can create a separate skew group for CLK3 sequentials, thereby preventing the extra latency on CLK1 and CLK2. This would help us in minimizing clock tree buffers, the associated area, routing resources, power and perhaps even the detrimental impact of OCVs on the uncommon clock path. (Read the post: Common Path Pessimism for greater insight).

Another case could be a hard IP in your design which is placed far away from the rest of the sequentials working on the same clock. If you know that there's minimal interaction between those sequentials and the hard IP, you might want to create a separate skew group for the hard IP clock pin.

2. Sequential Clustering: (Different from Register Banking) CTS is performed after the placement step, and by that time all the sequentials and standard cells have been placed. This placement of sequentials is invariably driven only by the data path optimization constraints. In other words, the placement engine would place sequentials at locations which it finds convenient to meet timing assuming ideal clock distribution. As depicted in the figure below, for some reason, the placer decided to place a small bunch of sequentials working on CLK1 far away from the port, thereby threatening to shoot up the clock latency of all the CLK1 sequentials. Now, either you can try and create a separate skew group to decouple these sequentials, or you can re-run placement tightly bounding all CLK1 sequentials together to prevent latency (and hence clock skew) shoot-up.

3. Clock Ordering and "dont touch subtree": You might have cases in your design where there's clock multiplexing, let's say between functional and scan clocks, and you need to create a clock tree for both of them. compile_clock_tree usually works on a clock by clock basis. Let's say you were smart enough to enforce the order and command the CTS engine to build the CTS network for the fast functional clock first and then for the slower scan clock. That's a reasonable approach considering that skew, transition and latency targets would be more difficult and constrained to meet for faster clocks, and by building the CTS for faster clocks first, you are giving the engine the leeway to do its best possible job. However, when it tries to balance the network for scan clocks, it can touch the functional clock network as well. One key difference between functional and scan clocks, in addition to the difference in clock frequencies, would be that the scan clock has a greater fan-out than the functional clocks, and therefore there is more scope for the CTS engine to goof up! To prevent this, we need to do two things:

a) Enforce CTS order to construct the clock tree for faster clocks first and slower clocks next

b) In order to prevent slow clock from altering the clock tree network of fast clocks, we need to apply a dont_touch_subtree exception on the MUX input of the slower clock.

4. Divided Clocks and "stop_pins": By default, the CLK pin of every flop-based divider is treated as a "non-stop-pin". This means CTS would consider the clk -> out arc of these divider flops to be a "through-pin" and try to balance the latencies of the master clock and the generated clock. Now, consider the case as shown below. There are a couple of ways to solve the problem, and which of the two methods gives you better results would depend on the design:

a) Creating a different skew group for the sequentials placed far away. This would de-couple the sequentials placed nearby and the ones placed far away. And CTS engine would be able to do a decent job.

b) Another experiment well worth a shot could be defining the CLK pin of the divider flop as a "stop_pin", so that the latency of the master clock stays in check: CTS will treat all its sequentials, including the divider flop, as one group and do a relatively good job of balancing them. This would avoid latency shoot-up of the master clock.

5. Exclude Clock from CTS: If there are two clocks defined at the same pin/port with different clock periods, whether they be synchronous or asynchronous, it might be a good idea to exclude the slower clock from CTS altogether to prevent CTS from touching the same clock network twice and surprising you with the results.

6. Clock used as data and "exclude pins": You might have some cases where clock is being used as data inside your design. CTS engine would be oblivious of this fact and might go crazy while building the clock tree. In these cases, it would be a good idea to explicitly mark the beginning of data path as "exclude_pin" to guide CTS engine to exclude anything further from clock tree balancing!

I couldn't think of any more cases. If you have some interesting use cases that I might have missed, kindly share them in the comments. :)

March 02, 2017

Simultaneous Setup-Hold Critical Node

I've got this question multiple times: how do we fix timing violations on paths that have at least one node which is both setup critical and hold critical simultaneously? To answer that question, one must realize that (generally speaking) for the same PVT and same RC corner, there cannot be paths where all nodes are simultaneously setup and hold critical.

Let's take an example:

Test Case

Now, if we buffer at node C, path from B to C which was already setup critical will start violating.

Buffering at C

If we buffer at Node A, the path from A to D which was already setup critical would start violating.

What shall we do here now? Any suggestions? Thoughts? I'd like to hear from you, and I'll post the right answer (or at least one of the right answers) soon! Just like always, looking forward to engaging in the comments section below.

March 01, 2017

OCV v/s AOCV

When I started my career around 6 years back, we were introduced to the term OCV. While the OCV concept was quite simple and fascinating, it didn't take me long to realize that OCV can be a nightmare for every STA engineer out there. I had introduced OCV a long time back while explaining the difference between OCV and PVT. In this post, I intend to draw a distinction between OCV (On-Chip Variation) and AOCV (Advanced On-Chip Variation).

Before we discuss anything about OCVs, it would be prudent to talk about the sources and types of variations that any semiconductor chip may exhibit.

The semiconductor device manufacturing process exhibits two major types of variations:

Systematic Variations: As the name suggests, systematic variations are deterministic in nature, and these can usually be attributed to a particular manufacturing process parameter like the manufacturing equipment used, or perhaps even the manufacturing technique used. Systematic variations can be experimentally calibrated and modeled. They also exhibit spatial correlation, meaning two transistors close to each other would exhibit similar systematic variation, which makes them easier to gauge. An example would be inter-chip process variations between two different batches of manufactured chips.
When a certain technology is in its nascent stage (let's say 10-nm technology), the process engineers would typically be more concerned about these variations, and as the technology matures, process engineers are able to calibrate and tune their manufacturing process to reduce this variation component.
Random Variations: These are totally random, and therefore non-deterministic in nature. Random variations do not show spatial correlation and are therefore very difficult to gauge and predict. Unlike systematic variations, random variations usually have a cancelling effect owing to their random nature. Examples are subtle variations in transistor threshold voltage.

As the semiconductor node shrinks, the susceptibility to these variations increases. The effect of these variations needs to be taken into account while doing timing analysis, or perhaps during the overall design planning to some extent. Let's shift our focus back to OCV and AOCV. At this point one may ask: in what form would these variations manifest themselves? Well, these variations can manifest themselves as an increase or decrease in the threshold voltage of devices, a shift in the process of the manufactured devices, or perhaps a variation in the oxide thickness or a change in the doping concentration.

There might be infinite such manifestations and we engineers like to make our lives easier, don't we? ;)

Experienced folks must have guessed where I am headed. If you haven't guessed it yet, stay with me and take a step back: what do all these parameters have in common? What's the one quantifiable metric that they all impact? The answer is the delay! OCV and AOCV are essentially models which guide us on how the cell delay varies in light of the systematic and random variations.

On-Chip-Variations (OCV): OCV is a simplistic and (generally) pessimistic way of modelling process variations. Here we assume that the delay of every cell can show, let's say, X% variation. Now you could model this variation either as -X% to +X%, or perhaps as -(X/2)% to +(X/2)%. Let's say we choose the former. We would then subject the delays of all cells to OCV in a manner that makes our timing pessimistic, so that we can claim that, as long as the process guys can ensure that the variation stays within the bracket of -X% to +X%, we'd be safe even in the worst case.

Setup Analysis under OCV: In order to make setup analysis immune to process variations on silicon, we need to model the OCVs such that setup check becomes more pessimistic. That would be the case if we increase the data path delay by X% (you can take a call whether or not to apply a derate on the net delays. One can choose to apply a net derate based on the net length, and the metal layer in which the net is routed, a separate discussion for a separate post! :)); increase the launch clock path delay by X% and decrease the capture clock path delay by X%. Here you might want to check the post on Common Path Pessimism to see what type of clock path cells need to be exempted from OCVs.

Setup Analysis under OCV

Hold Analysis under OCV: Hold check would be the exact opposite of what we did for setup, namely decrease the data path delay by X% (you can take a call whether or not to apply a derate on the net delays. Usually, we don't apply derate on net delays); decrease the launch clock path delay by X% and increase the capture clock path delay by X%.

Hold Analysis under OCV
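The setup and hold derating recipes above can be written out as a small worked example. This is a minimal sketch that applies one flat derate to cell delays only, ignores net derates, and assumes the launch and capture clock paths share no common segment (so no common path pessimism removal). The function names and all numbers (in ps) are hypothetical.

```python
def setup_slack(period, launch_clk, data, capture_clk, t_setup, derate):
    """Setup slack: late derate on launch clock + data, early derate on capture."""
    arrival = (launch_clk + data) * (1 + derate)
    required = period + capture_clk * (1 - derate) - t_setup
    return required - arrival


def hold_slack(launch_clk, data, capture_clk, t_hold, derate):
    """Hold slack: early derate on launch clock + data, late derate on capture."""
    arrival = (launch_clk + data) * (1 - derate)
    required = capture_clk * (1 + derate) + t_hold
    return arrival - required


# Hypothetical paths, delays in ps, X = 5%.
for x in (0.0, 0.05):
    s = setup_slack(1000, 200, 700, 200, 50, x)
    h = hold_slack(200, 40, 200, 30, x)
    print(f"derate {x:.0%}: setup slack = {s:.0f} ps, hold slack = {h:.0f} ps")
```

With these made-up numbers, the short hold path meets timing without derates but violates once the 5% OCV derate is applied, while the setup path loses a chunk of its margin.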

We talked so much about spatial correlation and the inherent cancellation of random variations, but didn't use either of these concepts while explaining OCVs. This is the precise reason OCVs tend to be generally pessimistic. And as we shrink the technology node, a need arises for an intelligent methodology to perform variation-aware timing analysis. And the answer is AOCV.

Let's take a look at AOCV in detail:

Advanced On-Chip Variations (AOCV): AOCV methodology hinges on three major concepts:

Cell Type: Variations should take into account the cell type. Surely an AND gate and an OR gate can't exhibit the same variation pattern. Nor could an AND3X and an AND6X cell. The impact of variation should be calculated for each individual cell.
Distance: As the distance in x-y coordinates increases, the systematic variations would increase, and we might need to use a higher derate value to reflect the uncertainty in timing analysis to mitigate any surprises on silicon.
Path Depth: If within a given distance, path depth is more, the impact of systematic variations would be constant, but the random variations would tend to cancel each other. Therefore as the path depth increases (within the same unit distance), the AOCV derates tend to decrease.

Bounding Box Creation for AOCV

While performing reg2reg timing analysis, the AOCV methodology finds the bounding box containing the sequentials, the clock buffers between the two sequentials and all the data cells. Now within a unit distance, if the path depth increases, the AOCV derate decreases due to cancelling of random variations. However, if the distance increases, the AOCV derate increases due to the increase in systematic variations. These variations are modeled in the form of a LUT.

Sample AOCV Table for Setup Analysis
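A depth-and-distance LUT lookup of this kind could be sketched as follows. The table values, the breakpoints and the function names are all made up for illustration, and the binning policy (round depth down, round distance up, so that the lookup errs on the pessimistic side) is an assumption of this sketch; real tools may interpolate between entries instead.

```python
from bisect import bisect_left, bisect_right

# Hypothetical late-derate table: rows = distance bins, columns = depth bins.
DEPTHS = [1, 2, 4, 8, 16]        # path depth (number of stages)
DISTANCES = [100, 500, 1000]     # bounding-box diagonal in um
LATE_DERATE = [
    [1.12, 1.10, 1.08, 1.06, 1.05],   # distance up to 100 um
    [1.14, 1.12, 1.10, 1.08, 1.07],   # distance up to 500 um
    [1.16, 1.14, 1.12, 1.10, 1.09],   # distance up to 1000 um and beyond
]


def depth_index(depth):
    """Round depth down to a breakpoint (shallower path, larger derate)."""
    return min(max(bisect_right(DEPTHS, depth) - 1, 0), len(DEPTHS) - 1)


def distance_index(distance):
    """Round distance up to a breakpoint (larger box, larger derate)."""
    return min(bisect_left(DISTANCES, distance), len(DISTANCES) - 1)


def aocv_derate(depth, distance):
    """Late derate for a path of the given depth and bounding-box size."""
    return LATE_DERATE[distance_index(distance)][depth_index(depth)]


print(aocv_derate(1, 100), aocv_derate(16, 100), aocv_derate(5, 300))
```

Against a flat OCV derate of, say, 1.08, this made-up table derates shallow paths more heavily and deep paths less heavily, which matches the trend in depth and distance described above.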

Now some final comments for OCV vs AOCV.

For small path depths, OCV tends to be more optimistic than AOCV. (AOCV is more accurate).
For higher path depths, OCV tends to be more pessimistic than AOCV. (AOCV is still more accurate).

I hope you were able to draw the above inference. If not, I'd be willing to engage in discussion down in the comments section. See you all next time! :)
