TensorExtPasses
-align-tensor-sizes
Resize tensors into tensors with a fixed size final dimension
This pass resizes input tensors of arbitrary size into tensors whose final
dimension has a fixed size. All input tensors are required to be
one-dimensional. The --size option specifies the size of the final dimension of
the output tensors and must be a power of two.
To align the tensors in the input IR, the pass first zero-pads each input to the
smallest power of two that is at least its length, and then replicates or
splits it into the output shape determined by size. The resulting
transformation is described by a SIMDPackingAttr encoding attribute on the final
tensor.
For example, with size=16, a tensor with 7 elements will be zero-padded to 8
elements and then replicated twice to fill a tensor of size 16. The
SIMDPackingAttr will encode the input shape, the number of elements that were
zero-padded, and the output shape.
Input:
%0 = tensor.empty() : tensor<7xi32>
Output:
%0 = tensor.empty() : tensor<16xi32, #tensor_ext.simd_packing<in = [7], padding = [1], out = [16]>>
A tensor with 30 elements will be zero-padded with 2 elements (to 32) and split into two tensors of size 16, stacked as a 2x16 tensor.
Input:
%0 = tensor.empty() : tensor<30xi32>
Output:
%0 = tensor.empty() : tensor<2x16xi32, #tensor_ext.simd_packing<in = [30], padding = [2], out = [16]>>
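The shape computation described above can be modeled in a few lines of Python. This is a minimal sketch, not HEIR's implementation: the helper names next_power_of_two, aligned_shape and align_values are hypothetical, and the sketch assumes that replicas of the padded vector are laid out back to back and that split rows are taken in order.

```python
from typing import List, Tuple, Union


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (n must be positive)."""
    if n <= 0:
        raise ValueError("tensor length must be positive")
    return 1 << (n - 1).bit_length()


def aligned_shape(n: int, size: int) -> Tuple[List[int], int]:
    """Return (output_shape, padding) for a 1-D tensor of n elements."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    padded = next_power_of_two(n)
    padding = padded - n
    if padded <= size:
        # Small input: the padded vector is replicated size // padded times.
        return [size], padding
    # Large input: the padded vector is split into rows of length size.
    return [padded // size, size], padding


def align_values(values: List[int], size: int) -> Union[List[int], List[List[int]]]:
    """Zero-pad, then replicate or split the values (assumed layout)."""
    shape, padding = aligned_shape(len(values), size)
    padded = list(values) + [0] * padding
    if len(shape) == 1:
        # Assumption: replicas are concatenated back to back.
        return padded * (size // len(padded))
    # Assumption: rows are consecutive chunks of the padded vector.
    return [padded[r * size:(r + 1) * size] for r in range(shape[0])]
```

With size=16, aligned_shape(7, 16) returns ([16], 1) and aligned_shape(30, 16) returns ([2, 16], 2), matching the padding and shapes in the two examples.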
Note that this pass does not insert any new operations like reshape, but
rather transforms the IR to use tensors with a fixed final dimension. This pass
may be used to align the sizes of tensors that represent plaintexts and
ciphertexts in RLWE schemes that support SIMD slots and operations.
Options
-size : Power of two size of the final dimension of the output tensors.
-collapse-insertion-chains
Collapse chains of extract/insert ops into rotate ops when possible
This pass is a cleanup pass for insert-rotate. That pass sometimes leaves
behind a chain of insertion operations like this:
%extracted = tensor.extract %14[%c5] : tensor<16xi16>
%inserted = tensor.insert %extracted into %dest[%c0] : tensor<16xi16>
%extracted_0 = tensor.extract %14[%c6] : tensor<16xi16>
%inserted_1 = tensor.insert %extracted_0 into %inserted[%c1] : tensor<16xi16>
%extracted_2 = tensor.extract %14[%c7] : tensor<16xi16>
%inserted_3 = tensor.insert %extracted_2 into %inserted_1[%c2] : tensor<16xi16>
...
%extracted_28 = tensor.extract %14[%c4] : tensor<16xi16>
%inserted_29 = tensor.insert %extracted_28 into %inserted_27[%c15] : tensor<16xi16>
yield %inserted_29 : tensor<16xi16>
In many cases, this chain will insert into every index of the dest tensor, and
the extracted values all come from consistently aligned indices of the same
source tensor. In this case, the chain can be collapsed into a single rotate op.
Each index used for insertion or extraction must be constant; this may require
running --canonicalize or --sccp before this pass to apply folding rules (use
--sccp if you need to fold constants through control flow).
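The collapsing condition can be sketched as a check over the constant (extract index, insert index) pairs of a chain. This is an illustrative model rather than the pass's actual code: the names rotate and collapse_to_rotation are hypothetical, it assumes rotate by k means result[i] == source[(i + k) mod n], and, for simplicity, it requires every destination slot to be written exactly once.

```python
from typing import List, Optional, Tuple


def rotate(xs: List[int], k: int) -> List[int]:
    """Cyclic rotation (assumed semantics): result[i] == xs[(i + k) % len(xs)]."""
    k %= len(xs)
    return xs[k:] + xs[:k]


def collapse_to_rotation(chain: List[Tuple[int, int]], n: int) -> Optional[int]:
    """Return the shift k if the chain equals rotate(source, k), else None.

    chain lists (extract_index, insert_index) pairs, one per
    tensor.extract / tensor.insert pair, all reading from the same source
    tensor of n elements. Indices must already be folded to constants.
    """
    dest_indices = [dst for _, dst in chain]
    # Every destination slot must be written exactly once; otherwise part of
    # the original dest tensor survives and a single rotate is not equivalent.
    if sorted(dest_indices) != list(range(n)):
        return None
    # The source/destination offset must be the same (mod n) for every pair,
    # i.e. the extracted values are consistently aligned.
    shifts = {(src - dst) % n for src, dst in chain}
    if len(shifts) != 1:
        return None
    return shifts.pop()
```

For the chain shown above, every visible pair (5 into 0, 6 into 1, 7 into 2, ..., 4 into 15) has offset 5 modulo 16, so under these assumptions the whole chain would become a rotation by 5.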
-insert-rotate
Vectorize arithmetic FHE operations using HECO-style heuristics
This pass implements the SIMD-vectorization passes from the HECO paper.
The pass operates by identifying arithmetic operations that can be suitably combined into a combination of cyclic rotations and vectorized operations on tensors. It further identifies a suitable “slot target” for each operation and heuristically aligns the operations to reduce unnecessary rotations.
This pass by itself does not eliminate any operations, but instead inserts
well-chosen rotations so that, for well-structured code (like unrolled affine
loops), subsequent --cse and --canonicalize passes will dramatically reduce the
IR.
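The core idea of aligning operands with rotations can be shown with a toy Python model. It does not implement HECO's slot-target selection heuristics; it only checks that an unrolled per-slot computation out[i] = x[i + dx] + y[i + dy] (with cyclic indexing, an assumption of this sketch) equals one slot-wise add of two rotated tensors. The function names are hypothetical, and rotate by k is assumed to mean result[i] == xs[(i + k) mod n].

```python
from typing import List


def rotate(xs: List[int], k: int) -> List[int]:
    """Cyclic rotation (assumed semantics): result[i] == xs[(i + k) % len(xs)]."""
    k %= len(xs)
    return xs[k:] + xs[:k]


def unrolled_scalar(x: List[int], y: List[int], dx: int, dy: int) -> List[int]:
    """Unrolled loop: out[i] = x[i + dx] + y[i + dy], one scalar add per slot."""
    n = len(x)
    out = [0] * n
    for i in range(n):
        # In the IR this corresponds to two tensor.extract ops, an
        # arith.addi and a tensor.insert for each i.
        out[i] = x[(i + dx) % n] + y[(i + dy) % n]
    return out


def rotated_vector(x: List[int], y: List[int], dx: int, dy: int) -> List[int]:
    """Align both operands to target slot i with rotations, then add slot-wise."""
    # Every iteration i uses the same two rotated tensors, which is why CSE
    # can share them once the rotations are inserted, leaving one vector add.
    return [a + b for a, b in zip(rotate(x, dx), rotate(y, dy))]
```

The equivalence of the two functions is what allows the scalar extract/add/insert chain to be replaced by rotations plus a single vectorized operation.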
As such, the pass is designed to be paired with the canonicalization patterns
in tensor_ext, as well as the collapse-insertion-chains pass, which cleans up
remaining insertion and extraction ops after the main simplifications are
applied.
Unlike HECO, this pass operates on plaintext types and tensors, along with the
HEIR-specific tensor_ext dialect for its cyclic rotate op. It is intended to be
run before lowering to a scheme dialect like bgv.
-rotate-and-reduce
Use a logarithmic number of rotations to reduce a tensor.
This pass identifies when a commutative, associative binary operation is used to reduce all of the entries of a tensor to a single value, and optimizes the reduction to use a logarithmic number of rotations and binary operations.
In particular, this pass identifies an unrolled set of operations of the form (the binary ops may come in any order):
%0 = tensor.extract %t[0] : tensor<8xi32>
%1 = tensor.extract %t[1] : tensor<8xi32>
%2 = tensor.extract %t[2] : tensor<8xi32>
%3 = tensor.extract %t[3] : tensor<8xi32>
%4 = tensor.extract %t[4] : tensor<8xi32>
%5 = tensor.extract %t[5] : tensor<8xi32>
%6 = tensor.extract %t[6] : tensor<8xi32>
%7 = tensor.extract %t[7] : tensor<8xi32>
%8 = arith.addi %0, %1 : i32
%9 = arith.addi %8, %2 : i32
%10 = arith.addi %9, %3 : i32
%11 = arith.addi %10, %4 : i32
%12 = arith.addi %11, %5 : i32
%13 = arith.addi %12, %6 : i32
%14 = arith.addi %13, %7 : i32
and replaces it with a logarithmic number of rotate and addi operations:
%0 = tensor_ext.rotate %t, 4 : tensor<8xi32>
%1 = arith.addi %t, %0 : tensor<8xi32>
%2 = tensor_ext.rotate %1, 2 : tensor<8xi32>
%3 = arith.addi %1, %2 : tensor<8xi32>
%4 = tensor_ext.rotate %3, 1 : tensor<8xi32>
%5 = arith.addi %3, %4 : tensor<8xi32>
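The rewritten IR can be simulated in Python to see why log2(n) steps suffice: after rotating by n/2, n/4, ..., 1 and combining each time, every slot holds the reduction of all n entries. This is a minimal sketch assuming a power-of-two length and a commutative, associative op; rotate and rotate_and_reduce are hypothetical helpers, with rotate by k assumed to mean result[i] == xs[(i + k) mod n].

```python
import operator
from typing import Callable, List, Tuple


def rotate(xs: List[int], k: int) -> List[int]:
    """Cyclic rotation (assumed semantics): result[i] == xs[(i + k) % len(xs)]."""
    k %= len(xs)
    return xs[k:] + xs[:k]


def rotate_and_reduce(
    t: List[int], op: Callable[[int, int], int]
) -> Tuple[List[int], int]:
    """Reduce t with a commutative, associative op using log2(len(t)) rotations.

    Returns the final vector (every slot holds the full reduction) and the
    number of rotations used.
    """
    n = len(t)
    if n == 0 or n & (n - 1):
        raise ValueError("tensor length must be a power of two")
    acc = list(t)
    rotations = 0
    shift = n // 2
    while shift >= 1:
        # Mirrors: %r = tensor_ext.rotate %acc, shift ; %acc = op(%acc, %r)
        acc = [op(a, b) for a, b in zip(acc, rotate(acc, shift))]
        rotations += 1
        shift //= 2
    return acc, rotations


total, rotations = rotate_and_reduce(list(range(1, 9)), operator.add)
print(total[0], rotations)  # 36 3
```

For an 8-element tensor this uses 3 rotations and 3 additions instead of the 7 scalar additions in the unrolled form, and the same loop works for other commutative, associative ops such as max.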