I think that the buffering constraints do not behave as expected on reconvergent paths when some edges are constrained to contain zero buffers. Let's have a look at the following example.
We have three primitives, a, b, and c, and b is a primitive with 1-cycle latency while the others are combinational. The dotted edge represents the virtual edge inserted to account for the latency of b.
In general, the throughput of a channel c connecting primitives u -> v and assuming c is not a back edge, must respect the following constraints:
$$\begin{align*}
\Theta_c &= r_v - r_u \\\
\Theta_c &\leq \# \text{buffer slots on channel c}
\end{align*}$$
Therefore, if we set the slot constraint to 0 on a channel, we prevent retiming in a channel by setting the source and destination retiming variables to the same value. On the example above, if we want to constrain the edge with associated throughput $\Theta_0$ to not contain any buffer, we are essentially setting $r(a) == r(c)$. We can then write the throughput constraints for the three channels on the path $a \rightarrow b_{in} \rightarrow b_{out} \rightarrow c$:
$$\begin{align*}
\Theta_1 &= r_{b_{in}} - r_a \\\
\Theta_2 &= r_{b_{out}} - r_{b_{in}} \\\
\Theta_3 &= r_c - r_{b_{out}}
\end{align*}$$
And since all throughput variables are positive (and again we do not have a back edge), we get:
$$r_c \geq r_{b_{out}} \geq r_{b_{in}} \geq r_a$$
However, since $r(a) == r(c)$, the only solution to the equations above is to set all retiming variables to the same value, thereby preventing retiming across the pipelined unit. However, the pipelined unit must satisfy:
$$\begin{align*}
\Theta_2 &= r_{b_{out}} - r_{b_{in}} \\\
\Theta_{CFDFC} \times L_b &\leq \Theta_2
\end{align*}$$
And therefore, the maximum throughput of the ILP becomes 0 in that case. As a result, presolve would optimize away all throughput equations, and I suppose the optimizer would pick the initial buffer allocation, placing buffers on back edges. As a result, the ILP could fail to optimize circuit altogether. For example, we can have a look at the following example with two re-convergent paths.
The top part of the graph is identical to the example at the start of this issue. The bottom part contains a primitive e with a latency of 4 and d is combinational. In the optimal case, the ILP would place (transparent) buffers on both edges $a \rightarrow c$ and $c \rightarrow d$, mirroring the latencies of the pipelined units, and achieve an II of 1. If we set a 0 slot constraint on the edge from $a \rightarrow c$, the optimal II is 2 since the output of a needs to stay valid for 2 cycles. However, due to the 0 slot constraint on $a \rightarrow c$, the ILP would not place any buffers and achieve an II of 5: no (transparent) buffers on the second part of the graph would imply that c needs to wait for 4 cycles before d can consume it.
I created a simple toy example to reproduce the issue on the branch custom_constraints_debug. The example is in custom_constraints and contains a simple for loop with a few multipliers inside. You can run it with the regular commands:
./bin/dynamatic
set-src integration-test/custom_constraints/custom_constraints.c
compile --buffer-algorithm fpl22
write-hdl --hdl vhdl
cat integration-test/custom_constraints/out/sim/report.txt | grep "Simulation done! Latency"
cat integration-test/custom_constraints/out/comp/buffer-placement/custom_constraints/placement.log | grep "Throughput of CFDFC"
a) Without constraint, the program latency is 2032 cycles and the ILP achieves a CFDFC throughput of 1/2.
b) With constraints on the path from the fork to the second multiplier (see bufProps in handshake_transformed_annos), the latency becomes 25009 cycles, and the CFDFC throughput is now 0.
c) However, it is easy to extend the above solution with one forced buffer (see bufProps in handshake_transformed_annos_correct) and get a program latency of 13512 cycles with a CFDFC throughput still of 0. I tried to come up with something closer to the performance of (a), but I did not manage to place a TEHB of depth 12, only an OEHB of depth 1 and a TEHB of depth 11, unfortunately...
To try the different buffer constraints, you can comment/uncomment selectively the cp lines in compile.sh:263.
Could we rewrite the buffer constraints so we constrain the buffer placement boolean variables (both for the OEHBs and for the TEHBs so the signal variables) to 0 and not the number of slots? In that case, the retiming variables would no longer be affected by the 0-slot constraints, since in the optimal case, no retiming should be placed on channels with no buffers, since that would only decrease the amount of retiming available for other channels. Although maybe this has some other effect which also makes it incorrect...
Let me know what you think,
Thanks,
I think that the buffering constraints do not behave as expected on reconvergent paths when some edges are constrained to contain zero buffers. Let's have a look at the following example.
We have three primitives, a, b, and c, and b is a primitive with 1-cycle latency while the others are combinational. The dotted edge represents the virtual edge inserted to account for the latency of b.
In general, the throughput of a channel c connecting primitives u -> v and assuming c is not a back edge, must respect the following constraints:
Therefore, if we set the slot constraint to 0 on a channel, we prevent retiming in a channel by setting the source and destination retiming variables to the same value. On the example above, if we want to constrain the edge with associated throughput$\Theta_0$ to not contain any buffer, we are essentially setting $r(a) == r(c)$ . We can then write the throughput constraints for the three channels on the path $a \rightarrow b_{in} \rightarrow b_{out} \rightarrow c$ :
And since all throughput variables are positive (and again we do not have a back edge), we get:
However, since$r(a) == r(c)$ , the only solution to the equations above is to set all retiming variables to the same value, thereby preventing retiming across the pipelined unit. However, the pipelined unit must satisfy:
And therefore, the maximum throughput of the ILP becomes 0 in that case. As a result, presolve would optimize away all throughput equations, and I suppose the optimizer would pick the initial buffer allocation, placing buffers on back edges. As a result, the ILP could fail to optimize circuit altogether. For example, we can have a look at the following example with two re-convergent paths.
The top part of the graph is identical to the example at the start of this issue. The bottom part contains a primitive e with a latency of 4 and d is combinational. In the optimal case, the ILP would place (transparent) buffers on both edges$a \rightarrow c$ and $c \rightarrow d$ , mirroring the latencies of the pipelined units, and achieve an II of 1. If we set a 0 slot constraint on the edge from $a \rightarrow c$ , the optimal II is 2 since the output of a needs to stay valid for 2 cycles. However, due to the 0 slot constraint on $a \rightarrow c$ , the ILP would not place any buffers and achieve an II of 5: no (transparent) buffers on the second part of the graph would imply that c needs to wait for 4 cycles before d can consume it.
I created a simple toy example to reproduce the issue on the branch custom_constraints_debug. The example is in custom_constraints and contains a simple for loop with a few multipliers inside. You can run it with the regular commands:
a) Without constraint, the program latency is 2032 cycles and the ILP achieves a CFDFC throughput of 1/2.
b) With constraints on the path from the fork to the second multiplier (see bufProps in handshake_transformed_annos), the latency becomes 25009 cycles, and the CFDFC throughput is now 0.
c) However, it is easy to extend the above solution with one forced buffer (see bufProps in handshake_transformed_annos_correct) and get a program latency of 13512 cycles with a CFDFC throughput still of 0. I tried to come up with something closer to the performance of (a), but I did not manage to place a TEHB of depth 12, only an OEHB of depth 1 and a TEHB of depth 11, unfortunately...
To try the different buffer constraints, you can comment/uncomment selectively the cp lines in compile.sh:263.
Could we rewrite the buffer constraints so we constrain the buffer placement boolean variables (both for the OEHBs and for the TEHBs so the signal variables) to 0 and not the number of slots? In that case, the retiming variables would no longer be affected by the 0-slot constraints, since in the optimal case, no retiming should be placed on channels with no buffers, since that would only decrease the amount of retiming available for other channels. Although maybe this has some other effect which also makes it incorrect...
Let me know what you think,
Thanks,