Design

This section contains documentation on the high-level design of the project. Readers are expected to have basic familiarity with MLIR and FHE.

1 - Secret

The secret dialect contains types and operations to represent generic computations on secret data. It is intended to be a high-level entry point for the HEIR compiler, agnostic of any particular FHE scheme.

Most prior FHE compiler projects design their IR around a specific FHE scheme, and provide dedicated IR types for the secret analogues of existing data types, and/or dedicated operations on secret data types. For example, the Concrete compiler has !FHE.eint<32> for an encrypted 32-bit integer, and add_eint and similar ops. HECO has !fhe.secret<T> that models a generic secret type, but similarly defines fhe.add and fhe.multiply, and other projects are similar.

The problem with this approach is that it is difficult to apply upstream canonicalization and optimization passes to these ops. For example, the arith dialect in MLIR has canonicalization patterns that would have to be replicated to apply to their FHE analogues. One of the goals of HEIR is to reuse as much upstream infrastructure as possible, and this led us to design the secret dialect to have both generic types and generic computations. Thus, the secret dialect has two main parts: a secret<T> type that wraps any other MLIR type T, and a secret.generic op that lifts any computation on cleartext to the “corresponding” computation on secret data types.

Overview with BGV-style lowering pipeline

Here is an example of a program that uses secret to lift a dot product computation:

func.func @dot_product(
    %arg0: !secret.secret<tensor<8xi16>>,
    %arg1: !secret.secret<tensor<8xi16>>) -> !secret.secret<i16> {
  %c0_i16 = arith.constant 0 : i16
  %0 = secret.generic ins(%arg0, %arg1 : !secret.secret<tensor<8xi16>>, !secret.secret<tensor<8xi16>>) {
  ^bb0(%arg2: tensor<8xi16>, %arg3: tensor<8xi16>):
    %1 = affine.for %arg4 = 0 to 8 iter_args(%arg5 = %c0_i16) -> (i16) {
      %extracted = tensor.extract %arg2[%arg4] : tensor<8xi16>
      %extracted_0 = tensor.extract %arg3[%arg4] : tensor<8xi16>
      %2 = arith.muli %extracted, %extracted_0 : i16
      %3 = arith.addi %arg5, %2 : i16
      affine.yield %3 : i16
    }
    secret.yield %1 : i16
  } -> !secret.secret<i16>
  return %0 : !secret.secret<i16>
}

The operands to the generic op are secret values, and the op contains a single region, whose block arguments are the corresponding cleartext data values. The region is then free to perform any computation, and the values passed to secret.yield are lifted back to secret types. Note that secret.generic is not isolated from its enclosing scope, so one may refer to cleartext SSA values without adding them as generic operands and block arguments (as the example above does with %c0_i16).

Clearly secret.generic does not actually do anything. It is not decrypting data. It is merely describing the operation that one wishes to apply to the secret data in more familiar terms. It is a structural operation, primarily used to demarcate which operations involve secret operands and have secret results, and group them for later optimization. The benefit of this is that one can write optimization passes on types and ops that are not aware of secret, and they will naturally match on the bodies of generic ops.

For example, here is what the above dot product computation looks like after applying the -cse -canonicalize -heir-simd-vectorizer passes, the implementations of which do not depend on secret or generic.

func.func @dot_product(
    %arg0: !secret.secret<tensor<8xi16>>,
    %arg1: !secret.secret<tensor<8xi16>>) -> !secret.secret<i16> {
  %c1 = arith.constant 1 : index
  %c2 = arith.constant 2 : index
  %c4 = arith.constant 4 : index
  %c7 = arith.constant 7 : index
  %0 = secret.generic ins(%arg0, %arg1 : !secret.secret<tensor<8xi16>>, !secret.secret<tensor<8xi16>>) {
  ^bb0(%arg2: tensor<8xi16>, %arg3: tensor<8xi16>):
    %1 = arith.muli %arg2, %arg3 : tensor<8xi16>
    %2 = tensor_ext.rotate %1, %c4 : tensor<8xi16>, index
    %3 = arith.addi %1, %2 : tensor<8xi16>
    %4 = tensor_ext.rotate %3, %c2 : tensor<8xi16>, index
    %5 = arith.addi %3, %4 : tensor<8xi16>
    %6 = tensor_ext.rotate %5, %c1 : tensor<8xi16>, index
    %7 = arith.addi %5, %6 : tensor<8xi16>
    %extracted = tensor.extract %7[%c7] : tensor<8xi16>
    secret.yield %extracted : i16
  } -> !secret.secret<i16>
  return %0 : !secret.secret<i16>
}

The canonicalization patterns for secret.generic apply a variety of simplifications, such as:

  • Removing any unused or non-secret arguments and return values.
  • Hoisting operations in the body of a generic that only depend on cleartext values to the enclosing scope (see the sketch below this list).
  • Removing any generic ops that use no secrets at all.
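
For instance, here is a minimal sketch of the hoisting rule (illustrative IR, not the output of an actual pass run): the constant below depends only on cleartext values, so canonicalization moves it out of the generic body.

%0 = secret.generic ins(%arg0 : !secret.secret<i16>) {
^bb0(%x: i16):
  %c7 = arith.constant 7 : i16
  %1 = arith.addi %x, %c7 : i16
  secret.yield %1 : i16
} -> !secret.secret<i16>

becomes

%c7 = arith.constant 7 : i16
%0 = secret.generic ins(%arg0 : !secret.secret<i16>) {
^bb0(%x: i16):
  %1 = arith.addi %x, %c7 : i16
  secret.yield %1 : i16
} -> !secret.secret<i16>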

These can be used together with the secret-distribute-generic pass to split an IR that contains a large generic op into generic ops that contain a single op, which can then be lowered to a particular FHE scheme dialect with dedicated ops. This makes lowering easier because it gives direct access to the secret version of each type that is used as input to an individual op.

As an example, a single-op secret.generic might look like this (taken from the larger example below). Note the use of a cleartext from the enclosing scope, and the proximity of the secret type to the op to be lowered.

  %c2 = arith.constant 2 : index
  %3 = secret.generic ins(%2 : !secret.secret<tensor<8xi16>>) {
  ^bb0(%arg2: tensor<8xi16>):
    %8 = tensor_ext.rotate %arg2, %c2 : tensor<8xi16>, index
    secret.yield %8 : tensor<8xi16>
  } -> !secret.secret<tensor<8xi16>>

For a larger example, applying --secret-distribute-generic --canonicalize to the IR above:

func.func @dot_product(%arg0: !secret.secret<tensor<8xi16>>, %arg1: !secret.secret<tensor<8xi16>>) -> !secret.secret<i16> {
  %c1 = arith.constant 1 : index
  %c2 = arith.constant 2 : index
  %c4 = arith.constant 4 : index
  %c7 = arith.constant 7 : index
  %0 = secret.generic ins(%arg0, %arg1 : !secret.secret<tensor<8xi16>>, !secret.secret<tensor<8xi16>>) {
  ^bb0(%arg2: tensor<8xi16>, %arg3: tensor<8xi16>):
    %8 = arith.muli %arg2, %arg3 : tensor<8xi16>
    secret.yield %8 : tensor<8xi16>
  } -> !secret.secret<tensor<8xi16>>
  %1 = secret.generic ins(%0 : !secret.secret<tensor<8xi16>>) {
  ^bb0(%arg2: tensor<8xi16>):
    %8 = tensor_ext.rotate %arg2, %c4 : tensor<8xi16>, index
    secret.yield %8 : tensor<8xi16>
  } -> !secret.secret<tensor<8xi16>>
  %2 = secret.generic ins(%0, %1 : !secret.secret<tensor<8xi16>>, !secret.secret<tensor<8xi16>>) {
  ^bb0(%arg2: tensor<8xi16>, %arg3: tensor<8xi16>):
    %8 = arith.addi %arg2, %arg3 : tensor<8xi16>
    secret.yield %8 : tensor<8xi16>
  } -> !secret.secret<tensor<8xi16>>
  %3 = secret.generic ins(%2 : !secret.secret<tensor<8xi16>>) {
  ^bb0(%arg2: tensor<8xi16>):
    %8 = tensor_ext.rotate %arg2, %c2 : tensor<8xi16>, index
    secret.yield %8 : tensor<8xi16>
  } -> !secret.secret<tensor<8xi16>>
  %4 = secret.generic ins(%2, %3 : !secret.secret<tensor<8xi16>>, !secret.secret<tensor<8xi16>>) {
  ^bb0(%arg2: tensor<8xi16>, %arg3: tensor<8xi16>):
    %8 = arith.addi %arg2, %arg3 : tensor<8xi16>
    secret.yield %8 : tensor<8xi16>
  } -> !secret.secret<tensor<8xi16>>
  %5 = secret.generic ins(%4 : !secret.secret<tensor<8xi16>>) {
  ^bb0(%arg2: tensor<8xi16>):
    %8 = tensor_ext.rotate %arg2, %c1 : tensor<8xi16>, index
    secret.yield %8 : tensor<8xi16>
  } -> !secret.secret<tensor<8xi16>>
  %6 = secret.generic ins(%4, %5 : !secret.secret<tensor<8xi16>>, !secret.secret<tensor<8xi16>>) {
  ^bb0(%arg2: tensor<8xi16>, %arg3: tensor<8xi16>):
    %8 = arith.addi %arg2, %arg3 : tensor<8xi16>
    secret.yield %8 : tensor<8xi16>
  } -> !secret.secret<tensor<8xi16>>
  %7 = secret.generic ins(%6 : !secret.secret<tensor<8xi16>>) {
  ^bb0(%arg2: tensor<8xi16>):
    %extracted = tensor.extract %arg2[%c7] : tensor<8xi16>
    secret.yield %extracted : i16
  } -> !secret.secret<i16>
  return %7 : !secret.secret<i16>
}

And then lowering it to bgv with --secret-to-bgv="poly-mod-degree=8" (the pass option matches the tensor size, though this is an unrealistically small polynomial degree, used here just for demonstration purposes). Note that type annotations on ops are omitted for brevity.

#encoding = #lwe.polynomial_evaluation_encoding<cleartext_start = 16, cleartext_bitwidth = 16>
#params = #lwe.rlwe_params<ring = <cmod=463187969, ideal=#_polynomial.polynomial<1 + x**8>>>
!ty1 = !lwe.rlwe_ciphertext<encoding=#encoding, rlwe_params=#params, underlying_type=tensor<8xi16>>
!ty2 = !lwe.rlwe_ciphertext<encoding=#encoding, rlwe_params=#params, underlying_type=i16>

func.func @dot_product(%arg0: !ty1, %arg1: !ty1) -> !ty2 {
  %c1 = arith.constant 1 : index
  %c2 = arith.constant 2 : index
  %c4 = arith.constant 4 : index
  %c7 = arith.constant 7 : index
  %0 = bgv.mul %arg0, %arg1
  %1 = bgv.relinearize %0 {from_basis = array<i32: 0, 1, 2>, to_basis = array<i32: 0, 1>}
  %2 = bgv.rotate %1, %c4
  %3 = bgv.add %1, %2
  %4 = bgv.rotate %3, %c2
  %5 = bgv.add %3, %4
  %6 = bgv.rotate %5, %c1
  %7 = bgv.add %5, %6
  %8 = bgv.extract %7, %c7
  return %8
}

Differences for CGGI-style pipeline

The tosa-to-boolean-tfhe and related pipelines add a few additional steps. The main goal here is to apply a hardware circuit optimizer to blocks of standard MLIR code (inside secret.generic ops), converting the computation to an optimized boolean circuit with a desired set of gates. Only then is -secret-distribute-generic applied to split the ops up and lower them to the cggi dialect. In particular, because passing IR through the circuit optimizer requires unrolling all loops, it is often useful to optimize only the body of a for loop nest.

To accomplish this, we have two additional mechanisms. One is the pass option ops-to-distribute for -secret-distribute-generic, which allows the user to specify a list of ops that generic should be split across, leaving all others alone. Specifying affine.for here will distribute generic through the affine.for loop while leaving the loop body intact. This can also be combined with the -unroll-factor option of the -yosys-optimizer pass to partially unroll a loop nest and pass the partially-unrolled body through the circuit optimizer.

The other mechanism is the secret.separator op, which is a purely structural op that demarcates the boundary of a subset of a block that should be jointly optimized in the circuit optimizer.

For example, the following tosa ops lower to multiple linalg operations, and hence multiple for loops, that we want to pass to the circuit optimizer as a unit. The secret.separator ops surrounding the tosa ops are preserved through the lowering.

func.func @main(%arg0: tensor<1x1xi8> {secret.secret}) -> tensor<1x16xi32> {
  secret.separator
  %4 = "tosa.const"() {value = dense<[0, 0, -5438, -5515, -1352, -1500, -4152, -84, 3396, 0, 1981, -5581, 0, -6964, 3407, -7217]> : tensor<16xi32>} : () -> tensor<16xi32>
  %5 = "tosa.const"() {value = dense<[[-9], [-54], [57], [71], [104], [115], [98], [99], [64], [-26], [127], [25], [-82], [68], [95], [86]]> : tensor<16x1xi8>} : () -> tensor<16x1xi8>
  %6 = "tosa.fully_connected"(%arg0, %5, %4) {quantization_info = #tosa.conv_quant<input_zp = -128, weight_zp = 0>} : (tensor<1x1xi8>, tensor<16x1xi8>, tensor<16xi32>) -> tensor<1x16xi32>
  secret.separator
  return %6 : tensor<1x16xi32>
}

After running --tosa-to-boolean-tfhe and dumping the IR after the linalg ops are lowered to loops, we can see the secret.separator ops enclose the lowered ops, with the exception of some pure ops that are speculatively executed.

func.func @main(%arg0: memref<1x1xi8, strided<[?, ?], offset: ?>> {secret.secret}) -> memref<1x16xi32> {
  %c-128_i32 = arith.constant -128 : i32
  %0 = memref.get_global @__constant_16xi32 : memref<16xi32>
  %1 = memref.get_global @__constant_16x1xi8 : memref<16x1xi8>
  secret.separator
  %alloc = memref.alloc() {alignment = 64 : i64} : memref<1x16xi8>
  affine.for %arg1 = 0 to 1 {
    affine.for %arg2 = 0 to 16 {
      %2 = affine.load %1[%arg2, %arg1] : memref<16x1xi8>
      affine.store %2, %alloc[%arg1, %arg2] : memref<1x16xi8>
    }
  }
  %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<1x16xi32>
  affine.for %arg1 = 0 to 1 {
    affine.for %arg2 = 0 to 16 {
      %2 = affine.load %0[%arg2] : memref<16xi32>
      affine.store %2, %alloc_0[%arg1, %arg2] : memref<1x16xi32>
    }
  }
  affine.for %arg1 = 0 to 1 {
    affine.for %arg2 = 0 to 16 {
      affine.for %arg3 = 0 to 1 {
        %2 = affine.load %arg0[%arg1, %arg3] : memref<1x1xi8, strided<[?, ?], offset: ?>>
        %3 = affine.load %alloc[%arg3, %arg2] : memref<1x16xi8>
        %4 = affine.load %alloc_0[%arg1, %arg2] : memref<1x16xi32>
        %5 = arith.extsi %2 : i8 to i32
        %6 = arith.subi %5, %c-128_i32 : i32
        %7 = arith.extsi %3 : i8 to i32
        %8 = arith.muli %6, %7 : i32
        %9 = arith.addi %4, %8 : i32
        affine.store %9, %alloc_0[%arg1, %arg2] : memref<1x16xi32>
      }
    }
  }
  secret.separator
  memref.dealloc %alloc : memref<1x16xi8>
  return %alloc_0 : memref<1x16xi32>
}

We decided to use the separator op over a few alternatives:

  • Grouping by secret.generic: these tosa ops must be bufferized, but secret types cannot participate in bufferization (see the Limitations section).
  • Grouping by basic blocks: secret.generic is a single-block op with a yield terminator, and grouping by blocks would require us to change this.
  • Grouping by regions: SSA values generated by a region are not visible to the enclosing scope, so we would need to have the region-bearing op return values, which is tedious to organize.
  • Attaching attributes to ops that should be grouped together: this would not be preserved by upstream lowerings and optimization passes.

generic operands

secret.generic takes any SSA values as legal operands. They may have secret or non-secret types. Canonicalizing secret.generic removes non-secret operands and leaves them to be referenced via the enclosing scope (secret.generic is not IsolatedFromAbove).
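
For example, here is a minimal sketch of this canonicalization (the function and value names are illustrative, not the output of an actual pass run). A generic with a cleartext operand

func.func @add_cleartext(%arg0: !secret.secret<i16>, %arg1: i16) -> !secret.secret<i16> {
  %0 = secret.generic ins(%arg0, %arg1 : !secret.secret<i16>, i16) {
  ^bb0(%x: i16, %y: i16):
    %1 = arith.addi %x, %y : i16
    secret.yield %1 : i16
  } -> !secret.secret<i16>
  return %0 : !secret.secret<i16>
}

canonicalizes to a generic that takes only the secret operand and refers to %arg1 from the enclosing scope:

func.func @add_cleartext(%arg0: !secret.secret<i16>, %arg1: i16) -> !secret.secret<i16> {
  %0 = secret.generic ins(%arg0 : !secret.secret<i16>) {
  ^bb0(%x: i16):
    %1 = arith.addi %x, %arg1 : i16
    secret.yield %1 : i16
  } -> !secret.secret<i16>
  return %0 : !secret.secret<i16>
}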

This may be unintuitive, as one might expect that only secret types are valid arguments to secret.generic, and that a verifier might assert non-secret args are not present.

However, we allow non-secret operands because it provides a convenient scope encapsulation mechanism, which is useful for the --yosys-optimizer pass that runs a circuit optimizer on individual secret.generic ops and needs to have access to all SSA values used as inputs. The following passes are related to this functionality:

  • secret-capture-generic-ambient-scope
  • secret-generic-absorb-constants
  • secret-extract-generic-body

Due to the canonicalization rules for secret.generic, anyone using these passes as an IR organization mechanism must be sure not to canonicalize before accomplishing the intended task.

Limitations

Bufferization

Secret types cannot participate in bufferization passes. In particular, -one-shot-bufferize hard-codes the notion of tensor and memref types, and so it cannot currently operate on secret<tensor<...>> or secret<memref<...>> types, which prevents us from implementing a bufferization interface for secret.generic. This was part of the motivation to introduce secret.separator: tosa ops like a fully connected neural network layer lower to multiple linalg ops, and those ops need to be bufferized before they can be lowered further, yet we want to keep the lowered ops grouped together for circuit optimization (e.g., fusing transposes and constant weights into the optimized layer). Because of this limitation, we can’t simply wrap the tosa ops in a secret.generic (bufferization would fail).

2 - SIMD Optimizations

HEIR includes a SIMD (Single Instruction, Multiple Data) optimizer which is designed to exploit the restricted SIMD parallelism most (Ring-LWE-based) FHE schemes support (also commonly known as “packing” or “batching”). Specifically, HEIR incorporates the “automated batching” optimizations (among many other things) from the HECO compiler. The following will assume basic familiarity with the FHE SIMD paradigm and the high-level goals of the optimization, and we refer to the associated HECO paper, slides, talk and additional resources on the Usenix'23 website for an introduction to the topic. This documentation will mostly focus on describing how the optimization is realized in HEIR (which differs somewhat from the original implementation) and how the optimization is intended to be used in an overall end-to-end compilation pipeline.

Representing FHE SIMD Operations

Following the design principle of maintaining programs in standard MLIR dialects as long as possible (cf. the design rationale behind the Secret Dialect), HEIR uses the MLIR tensor dialect and ElementwiseMappable operations from the MLIR arith dialect to represent HE SIMD operations.

We do introduce the HEIR-specific tensor_ext.rotate operation, which represents a cyclical left-rotation of a tensor. Note that, as the current SIMD vectorizer only supports one-dimensional tensors, the semantics of this operation on multi-dimensional tensors are not (currently) defined.
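
As a small illustration (with hypothetical value names and element values), elementwise arith ops act slot-wise on whole tensors, and a rotation by 1 moves every element one slot to the left, wrapping the first element around to the end:

// Suppose %a = [1, 2, 3, 4] and %b = [10, 20, 30, 40].
%sum = arith.addi %a, %b : tensor<4xi16>                   // [11, 22, 33, 44]
%c1 = arith.constant 1 : index
%rot = tensor_ext.rotate %sum, %c1 : tensor<4xi16>, index  // [22, 33, 44, 11]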

For example, the common “rotate-and-reduce” pattern, which results in each element containing the sum/product/etc. of all elements of the original vector, can be expressed as:

%tensor = tensor.from_elements %i1, %i2, %i3, %i4, %i5, %i6, %i7, %i8 : tensor<8xi16>
%0 = tensor_ext.rotate %tensor, %c4 : tensor<8xi16>, index
%1 = arith.addi %tensor, %0 : tensor<8xi16>
%2 = tensor_ext.rotate %1, %c2 : tensor<8xi16>, index
%3 = arith.addi %1, %2 : tensor<8xi16>
%4 = tensor_ext.rotate %3, %c1 : tensor<8xi16>, index
%5 = arith.addi %3, %4 : tensor<8xi16>

The definitions of %cN and %iN (%cN = arith.constant N : index and %iN = arith.constant N : i16, respectively) have been omitted for readability.

Intended Usage

The -heir-simd-vectorizer pipeline transforms a program consisting of loops and index-based accesses into tensors (e.g., tensor.extract and tensor.insert) into one consisting of SIMD operations (including rotations) on entire tensors. While its implementation does not depend on any FHE-specific details or even the Secret dialect, this transformation is likely only useful when lowering a high-level program to an arithmetic-circuit-based FHE scheme (e.g., B/FV, BGV, or CKKS). The -mlir-to-openfhe-bgv pipeline demonstrates the intended flow: augmenting a high-level program with secret annotations, then applying the SIMD optimization (and any other high-level optimizations) before lowering to BGV operations and then exiting to OpenFHE.

Warning: The current SIMD vectorizer pipeline supports only one-dimensional tensors. As a workaround, one could reshape all multi-dimensional tensors into one-dimensional tensors, but MLIR/HEIR currently do not provide a pass to automate this process.
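
For example, a two-dimensional tensor could be flattened by hand with the upstream tensor.collapse_shape op (a sketch of the manual workaround; the value names are illustrative):

// Collapse a 4x8 tensor into a single dimension before vectorization.
%flat = tensor.collapse_shape %t [[0, 1]] : tensor<4x8xi16> into tensor<32xi16>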

Since the optimization is based on heuristics, the resulting program might not be optimal or could even be worse than a trivial realization that does not use ciphertext packing. However, well-structured programs generally lower to reasonable batched solutions, even if they do not achieve optimal batching layouts. For common operations such as matrix-vector or matrix-matrix multiplications, state-of-the-art approaches require advanced packing schemes that might map elements into the ciphertext vector in non-trivial ways (e.g., diagonal-major and/or replicated). The current SIMD vectorizer will never change the arrangement of elements inside an input tensor and therefore cannot produce the optimal approaches for these operations.

Note that the SIMD batching optimization is different from, and significantly more complex than, the Straight Line Vectorizer (-straight-line-vectorize pass), which simply groups ElementwiseMappable operations that agree in operation name and operand/result types into vectorized/tensorized versions.
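
As a rough sketch of the difference (illustrative IR, not actual pass output), the straight-line vectorizer would merely group independent, identically-typed scalar ops such as

%0 = arith.addi %a, %b : i16
%1 = arith.addi %c, %d : i16

into something along the lines of

%lhs = tensor.from_elements %a, %c : tensor<2xi16>
%rhs = tensor.from_elements %b, %d : tensor<2xi16>
%2 = arith.addi %lhs, %rhs : tensor<2xi16>

Whether the pass materializes the operands exactly this way is an implementation detail; the point is that it neither introduces rotations nor reasons about data layout.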

Implementation

Below, we give a brief overview of the implementation, with the goal of both improving the maintainability/extensibility of the SIMD vectorizer and allowing advanced users to better understand why a certain program is transformed the way it is.

Components

The -heir-simd-vectorizer pipeline uses a combination of standard MLIR passes (-canonicalize, -cse, -sccp) and custom HEIR passes. Some of these (-apply-folders, -full-loop-unroll) might have applications outside the SIMD optimization, while others (-insert-rotate, -collapse-insertion-chains and -rotate-and-reduce) are very specific to the FHE SIMD optimization. In addition, the passes make use of the RotationAnalysis and TargetSlotAnalysis analyses.

High-Level Flow

  • Loop Unrolling (-full-loop-unroll): The implementation currently begins by unrolling all loops in the program to simplify the later passes. See #589 for a discussion on how this could be avoided.

  • Canonicalization (-apply-folders -canonicalize): As the rotation-specific passes are very strict about the structure of the IR they operate on, we must first simplify away things such as tensors of constant values. For performance reasons (cf. the comments in the heirSIMDVectorizerPipelineBuilder function in heir-opt.cpp), this must be done by first applying folds before applying the full canonicalization.

  • Main SIMD Rewrite (-insert-rotate -cse -canonicalize -cse): This pass rewrites arithmetic operations over tensor.extract-ed operands into SIMD operations over the entire tensor, rotating the (full-tensor) operands so that the correct elements interact. For example, it will rewrite the following snippet (which computes t2[4] = t0[3] + t1[5])

    %0 = tensor.extract %t0[%c3] : tensor<32xi16>
    %1 = tensor.extract %t1[%c5] : tensor<32xi16>
    %2 = arith.addi %0, %1 : i16
    %3 = tensor.insert %2 into %t2[%c4] : tensor<32xi16>
    

    to

    %0 = tensor_ext.rotate %t0, %c31 : tensor<32xi16>, index
    %1 = tensor_ext.rotate %t1, %c1 : tensor<32xi16>, index
    %2 = arith.addi %0, %1 : tensor<32xi16>
    

    i.e., rotating t0 down by one (31 = -1 (mod 32)) and t1 up by one to bring the elements at index 3 and 5, respectively, to the “target” index 4. The pass uses the TargetSlotAnalysis to identify the appropriate target index (or ciphertext “slot” in FHE-speak). See Insert Rotate Pass below for more details. This pass is roughly equivalent to the -batching pass in the original HECO implementation.

    Doing this rewrite by itself does not represent an optimization, but if we consider what happens to the corresponding code for other indices (e.g., t2[5] = t0[4] + t1[6]), we see that the pass transforms expressions with the same relative index offsets into the exact same set of rotations/SIMD operations, so the following Common Subexpression Elimination (CSE) will remove redundant computations. We apply CSE twice, once directly (which creates new opportunities for canonicalization and folding) and then again after that canonicalization. See TensorExt Canonicalization below for a description of the rotation-specific canonicalization patterns.

  • Cleanup of Redundant Insert/Extract (-collapse-insertion-chains -sccp -canonicalize -cse): Because the -insert-rotate pass maintains the consistency of the IR, it emits a tensor.extract operation after the SIMD operation and uses that to replace the original operation (which is valid, as both produce the desired scalar result). As a consequence, the generated code for the snippet above is actually trailed by a (redundant) extract/insert:

    %extracted = tensor.extract %2[%c4] : tensor<32xi16>
    %inserted = tensor.insert %extracted into %t2[%c4] : tensor<32xi16>
    

    In real code, this might generate a long series of such extraction/insertion operations, all extracting from the same tensor (thanks to CSE) and inserting into the same output tensor. Therefore, the -collapse-insertion-chains pass searches for such chains over entire tensors and collapses them. It supports not just chains where the indices match perfectly, but any chain where the relative offset is consistent across the tensor, issuing a rotation to realize the offset (if the offset is zero, the canonicalization will remove the redundant rotation). Note that in HECO, insertion/extraction is handled differently, as HECO features a combine operation modelling not just simple insertions (combine(%t0#j, %t1)) but also more complex operations over slices of tensors (combine(%t0#[i,j], %t1)). As a result, the equivalent pass in HECO (-combine-simplify) instead joins different combine operations, and a later fold removes combines that replace the entire target tensor. See issue #512 for a discussion on why the combine operation is a more powerful framework and what would be necessary to port it to HEIR.

  • Applying Rotate-and-Reduce Patterns (-rotate-and-reduce -sccp -canonicalize -cse): The rotate-and-reduce pattern (see Representing FHE SIMD Operations for an example) is an important aspect of accelerating SIMD-style operations in FHE, but it does not follow automatically from the batching rewrites applied so far. As a result, the -rotate-and-reduce pass searches for sequences of arithmetic operations that correspond to the full folding of a tensor, i.e., patterns such as t[0]+(t[1]+(t[2]+t[3]+(...))). This currently uses a backwards search through the IR, but could be achieved more efficiently through a data flow analysis (cf. issue #532). In HECO, rotate-and-reduce is handled differently, by identifying sequences of compatible operations prior to batching and rewriting them to “n-ary” operations. However, this approach requires non-standard arithmetic operations and is therefore not suitable for use in HEIR. Still, there is likely an opportunity to make the patterns in HEIR more robust and general, e.g., supporting constant scalar operands in the fold or non-full-tensor folds; see issue #522 for ideas.

Insert Rotate Pass

TODO(#721): Write a detailed description of the rotation insertion pass and the associated target slot analysis.

TensorExt Canonicalization

The TensorExt (tensor_ext) Dialect includes a series of canonicalization rules that are essential to making automatically generated rotation code efficient:

  • Rotation by zero: rotate %t, 0 folds away to %t

  • Cyclical wraparound: rotate %t, k for k > t.size can be simplified to rotate %t, (k mod t.size)

  • Sequential rotation: %0 = rotate %t, k followed by %1 = rotate %0, l is simplified to rotate %t, (k+l)

  • Extraction: %0 = rotate %t, k followed by %1 = tensor.extract %0[l] is simplified to tensor.extract %t[k+l]

  • Binary Arithmetic Ops: where both operands to a binary arith operation are rotations by the same amount, the rotation can be performed only once, on the result. For example,

    %0 = rotate %t1, k
    %1 = rotate %t2, k
    %2 = arith.add %0, %1
    

    can be simplified to

    %0 = arith.add %t1, %t2
    %1 = rotate %0, k
    
  • Sandwiched Binary Arithmetic Ops: if a rotation follows a binary arith operation whose operands are themselves rotations, the trailing rotation can be folded into the operand rotations. For example,

    %0 = rotate %t1, x
    %1 = rotate %t2, y
    %2 = arith.add %0, %1
    %3 = rotate %2, z
    

    can be simplified to

    %0 = rotate %t1, x + z
    %1 = rotate %t2, y + z
    %2 = arith.add %0, %1
    
  • Single-Use Arithmetic Ops: Finally, there is a pair of rules that do not eliminate rotations, but move rotations up in the IR, which can help in exposing further canonicalization and/or CSE opportunities. These only apply to arith operations with a single use, as they might otherwise increase the total number of rotations. For example,

    %0 = rotate %t1, k
    %2 = arith.add %0, %t2
    %1 = rotate %2, l
    

    can be equivalently rewritten as

    %0 = rotate %t1, (k+l)
    %1 = rotate %t2, l
    %2 = arith.add %0, %1
    

    and a similar pattern exists for situations where the rotation is the rhs operand of the arithmetic operation.

Note that the index computations in the patterns above (e.g., k+l, k mod t.size) are realized by emitting arith operations. However, for constant/compile-time-known indices, these are subsequently constant-folded away by the canonicalization pass.
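
For instance, here is a small sketch combining the sequential-rotation and extraction rules above (indices chosen arbitrarily, with %cN constants as in the earlier examples):

%0 = tensor_ext.rotate %t, %c2 : tensor<8xi16>, index
%1 = tensor_ext.rotate %0, %c3 : tensor<8xi16>, index
%2 = tensor.extract %1[%c1] : tensor<8xi16>

canonicalizes to the single extraction

%0 = tensor.extract %t[%c6] : tensor<8xi16>

where the index arithmetic (2 + 3 + 1 = 6) has been constant-folded away.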