By interlacing the two paths that follow a conditional test, both paths are provided to the cpu with a simple double-wide instruction fetch mechanism. Depending on the outcome of the conditional test, one of the two interlaced paths is executed. The condition-true instructions could occupy odd address words and the condition-false instructions even address words.
For an "if-then-continue" construct:

                 D O U B L E - W O R D S
    EVEN WORDS                     ODD WORDS
    code, conditional test
    continue-code-1  <--\          then-code-1
    continue-code-2      \         then-code-2
    continue-code-3       \        then-code-3
    continue-code-4        \       then-code-4
    continue-code-5         \----- branch to interlaced code
    continue-code-6
    continue-code-7    ; begin sequential
    continue-code-8
    continue-code-9
When the condition is true, the "then" instructions in the odd words are executed.
When the condition is false, the "continue" instructions in the even address words are executed.
Interlaced execution continues until either (1) a branch in the executing path is executed, or (2) a branch is detected in the not-executing path, after which sequential instruction fetch resumes (because the not-executing path has ended).
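As a rough sketch of these two termination rules, here is a toy simulator. Everything about it is my invention for illustration (the function name, the string-based instruction encoding, and the "BR n" branch notation); it is not any real ISA:

```python
def run_interlaced(pairs, cond):
    """Execute one interlaced region.

    pairs -- list of (even_word, odd_word) double-words fetched together
    cond  -- outcome of the conditional test: False -> even lane
             (the "continue" path), True -> odd lane (the "then" path)
    Returns the executed instructions and the word index where
    sequential fetch resumes.
    """
    lane = 1 if cond else 0      # which half of each double-word executes
    other = 1 - lane
    executed = []
    for i, pair in enumerate(pairs):
        insn = pair[lane]
        executed.append(insn)
        # Rule 1: a branch in the executing path is executed;
        # control transfers to the branch target.
        if insn.startswith("BR"):
            return executed, int(insn.split()[1])
        # Rule 2: a branch is detected in the NOT-executing path,
        # so that path has ended and sequential fetch resumes
        # at the next word.
        if pair[other].startswith("BR"):
            return executed, i + 1
    return executed, len(pairs)
```

For instance, with three interlaced double-words where the "then" path ends in a branch back to word 0, a true condition executes the then-lane up to its branch, while a false condition executes the continue-lane and falls into sequential fetch once the other lane's branch is detected.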
At the end of the "then" path, a branch to interlaced instructions takes the cpu to the beginning of the "continue" instructions and tells the cpu to execute interlaced instructions. Note that the continue code can be entered either by a failed conditional test or by the branch at the end of the "then" path. No code replication is required for this example. Overall, most of the performance gain can be achieved with modest code replication.
As I recall, apart from loop control, the majority of conditional branches are used in single-level if-then-continue and if-then-else-continue constructs. For these, no code replication is required. The "branch to interlaced instructions" is the key to avoiding code replication.
For paths seldom taken, the interlaced portion could just be a branch to where the rest of the path is, thereby mostly avoiding fetching instructions unlikely to be used. More generally, the number of instructions in the interlaced portion of a path could reflect the probability of that path being taken.
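One way to picture that sizing rule is as a small compiler heuristic. The thresholds and the linear scaling below are purely illustrative assumptions on my part, not anything from the proposal:

```python
def interlace_depth(p_taken, max_depth=8, stub_threshold=0.05):
    """Choose how many instructions of a path to interlace.

    p_taken        -- estimated probability the path is taken (0..1)
    max_depth      -- interlaced instructions given to a near-certain path
    stub_threshold -- below this, emit only a branch stub (depth 0)
                      to the out-of-line copy of the path
    """
    if p_taken < stub_threshold:
        return 0  # seldom taken: interlace only a branch to the rest
    # Linear scaling between 1 and max_depth; a real compiler would
    # presumably tune this from profile data.
    return max(1, round(p_taken * max_depth))
```

So a path taken 1% of the time gets only the branch stub, a 50/50 path gets half the maximum depth, and a near-certain path gets the full interlaced prefix.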
With regard to conventional conditional branches, it takes time to look ahead in the instruction stream, identify a conditional branch instruction, calculate the target address, and access the cache for the target instructions. To mask this time, one must fetch further ahead in the instruction stream, and discard more instructions when the predicted path(s) are not taken. With interlacing, one does not have to look ahead as far, and so would not need to discard as many instructions.
Consider that if one accesses a block on each instruction cache access, say eight instructions per block, then one will sometimes get the desired instructions with a single block fetch (e.g. a test with the beginnings of two paths interlaced), versus fetching two blocks in a conventional conditional-branch cpu. This aspect, plus the reduced need for lookahead, should keep instruction cache bandwidth reasonable.
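A back-of-envelope version of that comparison, where only the eight-instruction block size comes from the text; the exact placement of the test within the block is my assumption:

```python
BLOCK = 8  # instructions per cache block, as in the text

# Interlaced layout: assume one double-word holds the test (plus one
# word of padding or preceding code), leaving three double-words in
# the same block, i.e. the first three instructions of EACH path
# available from a single fetch.
per_path_in_one_block = (BLOCK - 2) // 2

# Conventional layout: the fall-through path follows the test, but the
# taken path lives at the branch target, so having the beginnings of
# both paths on hand costs a second block fetch.
conventional_fetches = 2
interlaced_fetches = 1
```

Under these assumptions the interlaced fetch delivers three instructions of each path in half the block fetches, which is the bandwidth argument being made.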