TensorExtPasses

`-collapse-insertion-chains`

Collapse chains of extract/insert ops into rotate ops when possible

This pass is a cleanup pass for insert-rotate. That pass sometimes leaves behind a chain of insertion operations like this:

%extracted = tensor.extract %14[%c5] : tensor<16xi16>
%inserted = tensor.insert %extracted into %dest[%c0] : tensor<16xi16>
%extracted_0 = tensor.extract %14[%c6] : tensor<16xi16>
%inserted_1 = tensor.insert %extracted_0 into %inserted[%c1] : tensor<16xi16>
%extracted_2 = tensor.extract %14[%c7] : tensor<16xi16>
%inserted_3 = tensor.insert %extracted_2 into %inserted_1[%c2] : tensor<16xi16>
...
%extracted_28 = tensor.extract %14[%c4] : tensor<16xi16>
%inserted_29 = tensor.insert %extracted_28 into %inserted_27[%c15] : tensor<16xi16>
yield %inserted_29 : tensor<16xi16>

In many cases, this chain will insert into every index of the dest tensor, and the extracted values all come from consistently aligned indices of the same source tensor. In this case, the chain can be collapsed into a single rotate.

Each index used for insertion or extraction must be constant; this may require running --canonicalize or --sccp before this pass to apply folding rules (use --sccp if you need to fold constant through control flow).

`-insert-rotate`

Vectorize arithmetic FHE operations using HECO-style heuristics

This pass implements the SIMD-vectorization passes from the HECO paper.

The pass operates by identifying arithmetic operations that can be suitably combined into a combination of cyclic rotations and vectorized operations on tensors. It further identifies a suitable “slot target” for each operation and heuristically aligns the operations to reduce unnecessary rotations.

This pass by itself does not eliminate any operations, but instead inserts well-chosen rotations so that, for well-structured code (like unrolled affine loops), a subsequent --cse and --canonicalize pass will dramatically reduce the IR. As such, the pass is designed to be paired with the canonicalization patterns in tensor_ext, as well as the collapse-insertion-chains pass, which cleans up remaining insertion and extraction ops after the main simplifications are applied.

Unlike HECO, this pass operates on plaintext types and tensors, along with the HEIR-specific tensor_ext dialect for its cyclic rotate op. It is intended to be run before lowering to a scheme dialect like bgv.

`-rotate-and-reduce`

Use a logarithmic number of rotations to reduce a tensor.

This pass identifies when a commutative, associative binary operation is used to reduce all of the entries of a tensor to a single value, and optimizes the operations by using a logarithmic number of reduction operations.

In particular, this pass identifies an unrolled set of operations of the form (the binary ops may come in any order):

%0 = tensor.extract %t[0] : tensor<8xi32>
%1 = tensor.extract %t[1] : tensor<8xi32>
%2 = tensor.extract %t[2] : tensor<8xi32>
%3 = tensor.extract %t[3] : tensor<8xi32>
%4 = tensor.extract %t[4] : tensor<8xi32>
%5 = tensor.extract %t[5] : tensor<8xi32>
%6 = tensor.extract %t[6] : tensor<8xi32>
%7 = tensor.extract %t[7] : tensor<8xi32>
%8 = arith.addi %0, %1 : i32
%9 = arith.addi %8, %2 : i32
%10 = arith.addi %9, %3 : i32
%11 = arith.addi %10, %4 : i32
%12 = arith.addi %11, %5 : i32
%13 = arith.addi %12, %6 : i32
%14 = arith.addi %13, %7 : i32

and replaces it with a logarithmic number of rotate and addi operations:

%0 = tensor_ext.rotate %t, 4 : tensor<8xi32>
%1 = arith.addi %t, %0 : tensor<8xi32>
%2 = tensor_ext.rotate %1, 2 : tensor<8xi32>
%3 = arith.addi %1, %2 : tensor<8xi32>
%4 = tensor_ext.rotate %3, 1 : tensor<8xi32>
%5 = arith.addi %3, %4 : tensor<8xi32>