This section contains documentation on the high level design of the project. Readers are intended to have basic familiarity with MLIR and FHE.

# 1 - Data-oblivious Transformations

A data-oblivious program is one that decouples data input from program execution. Such programs exhibit control-flow and memory access patterns that are independent of their input(s). This programming model, when applied to encrypted data, is necessary for expressing FHE programs. There are 3 major transformations applied to convert a conventional program into a data-oblivious program:

### (1) If-Transformation

If-operations conditioned on inputs create data-dependent control-flow in
programs. `scf.if`

operations should at least define a ’then’ region (true path)
and always terminate with `scf.yield`

even when `scf.if`

doesn’t produce a
result. To convert a data-dependent `scf.if`

operation to an equivalent set of
data-oblivious operations in MLIR, we hoist all safely speculatable operations
in the `scf.if`

operation and convert the `scf.yield`

operation to an
`arith.select`

operation. The following code snippet demonstrates an application
of this transformation.

```
// Before applying If-transformation
func.func @my_function(%input : i1 {secret.secret}) -> () {
...
// Violation: %input is used as a condition causing a data-dependent branch
%result =`%input -> (i16) {
%a = arith.muli %b, %c : i16
scf.yield %a : i16
} else {
scf.yield %b : i16
}
...
}
// After applying If-transformation
func.func @my_function(%input : i16 {secret.secret}) -> (){
...
%a = arith.muli %b, %c : i16
%result = arith.select %input, %a, %b : i16
...
}
```

We implement a `ConvertIfToSelect`

pass that transforms operations with
secret-input conditions and with only Pure operations (i.e., operations that
have no memory side effect and are speculatable) in their body. **This
transformation cannot be applied to operations when side effects are present in
only one of the two regions.** Although possible, we currently do not support
transformations for operations where both regions have operations with matching
side effects. When side effects are present, the pass fails.

### (2) Loop-Transformation

Loop statements with input-dependent conditions (bounds) and number of
iterations introduce data-dependent branches that violate data-obliviousness. To
convert such loops into a data-oblivious version, we replace input-dependent
conditionals (bounds) with static input-independent parameters (e.g. defining a
constant upper bound), and early-exits with update operations where the value
returned from the loop is selectively updated using conditional predication. In
MLIR, loops are expressed using either `affine.for`

, `scf.for`

or `scf.while`

operations.

[!NOTE] Early exiting from loops is not supported in

`scf`

and`affine`

, so early exits are not supported in this pipeline. TODO(#922): support early exits

`affine.for`

: This operation lends itself well to expressing data oblivious programs because it requires constant loop bounds, eliminating input-dependent limits.

```
%sum_0 = arith.constant 0.0 : f32
// The for-loop's bound is a fixed constant
%sum = affine.for %i = 0 to 10 step 2
iter_args(%sum_iter = %sum_0) -> (f32) {
%t = affine.load %buffer[%i] : memref<1024xf32>
%sum_next = arith.addf %sum_iter, %input : f32
affine.yield %sum_next : f32
}
...
```

`scf.for`

: In contrast to affine.for, scf.for does allow input-dependent conditionals which does not adhere to data-obliviousness constraints. A solution to this could be to either have the programmer or the compiler specify an input-independent upper bound so we can transform the loop to use this upper bound and also carefully update values returned from the for-loop using conditional predication. Our current solution to this is for the programmer to add the lower bound and worst case upper bound in the static affine loop’s`attributes`

list.

```
func.func @my_function(%value: i32 {secret.secret}, %inputIndex: index {secret.secret}) -> i32 {
...
// Violation: for-loop uses %inputIndex as upper bound which causes a secret-dependent control-flow
%result = scf.for %iv = %begin to %inputIndex step %step_value iter_args(%arg1 = %value) -> i32 {
%output = arith.muli %arg1, %arg1 : i32
scf.yield %output : i32
}{lower = 0, upper = 32}
...
}
// After applying Loop-Transformation
func.func @my_function(%value: i32 {secret.secret}, %inputIndex: index {secret.secret}) -> i32 {
...
// Build for-loop using lower and upper values from the `attributes` list
%result = affine.for %iv = 0 to step 32 iter_args(%arg1 = %value) -> i32 {
%output = arith.muli %arg1, %agr1 : i32
%cond = arith.cmpi eq, %iv, %inputIndex : index
%newOutput = arith.select %cond, %output, %arg1
scf.yield %newOutput : i32
}
...
}
```

`scf.while`

: This operation represents a generic while/do-while loop that keeps iterating as long as a condition is met. An input-dependent while condition introduces a data-dependent control flow that violates data-oblivious constraints. For this transformation, the programmer needs to add the`max_iter`

attribute that describes the maximum number of iterations the loop runs which we then use the value to build our static`affine.for`

loop.

```
// Before applying Loop-Transformation
func.func @my_function(%input: i16 {secret.secret}){
%zero = arith.constant 0 : i16
%result = scf.while (%arg1 = %input) : (i16) -> i16 {
%cond = arith.cmpi slt, %arg1, %zero : i16
// Violation: scf.while uses %cond whose value depends on %input
scf.condition(%cond) %arg1 : i16
} do {
^bb0(%arg2: i16):
%mul = arith.muli %arg2, %arg2: i16
scf.yield %mul
} attributes {max_iter = 16 : i64}
...
return
}
// After applying Loop-Transformation
func.func @my_function(%input: i16 {secret.secret}){
%zero = arith.constant 0 : i16
%begin = arith.constant 1 : index
...
// Replace while-loop with a for-loop with a constant bound %MAX_ITER
%result = affine.for %iv = %0 to %16 step %step_value iter_args(%iter_arg = %input) -> i16 {
%cond = arith.cmpi slt, %iter_arg, %zero : i16
%mul = arith.muli %iter_arg, %iter_arg : i16
%output = arith.select %cond, %mul, %iter_arg
scf.yield %output
}{max_iter = 16 : i64}
...
return
}
```

### (3) Access-Transformation

Input-dependent memory access cause data-dependent memory footprints. A naive
data-oblivious solution to this maybe doing read-write operations over the
entire data structure while only performing the desired save/update operation
for the index of interest. For simplicity, we only look at load/store operations
for tensors as they are well supported structures in high-level MLIR likely
emitted by most frontends. We drafted the following non-SIMD approach for this
transformation and defer SIMD optimizations to the `heir-simd-vectorizer`

pass:

```
// Before applying Access Transformation
func.func @my_function(%input: tensor<16xi32> {secret.secret}, %inputIndex: index {secret.secret}) {
...
%c_10 = arith.constant 10 : i32
// Violation: tensor.extract loads value at %inputIndex
%extractedValue = tensor.extract %input[%inputIndex] : tensor<16xi32>
%newValue = arith.addi %extractedValue, %c_10 : i32
// Violation: tensor.insert stores value at %inputIndex
%inserted = tensor.insert %newValue into %input[%inputIndex] : tensor<16xi32>
...
}
// After applying Non-SIMD Access Transformation
func.func @my_function(%input: tensor<16xi32> {secret.secret}, %inputIndex: index {secret.secret}) {
...
%c_10 = arith.constant 10 : i32
%i_0 = arith.constant 0 : index
%dummyValue = arith.constant 0 : i32
%extractedValue = affine.for %i=0 to 16 iter_args(%arg= %dummyValue) -> (i32) {
// 1. Check if %i matches %inputIndex
// 2. Extract value at %i
// 3. If %i matches %inputIndex, select %value extracted in (2), else select %dummyValue
// 4. Yield selected value
%cond = arith.cmpi eq, %i, %inputIndex : index
%value = tensor.extract %input[%i] : tensor<16xi32>
%selected = arith.select %cond, %value, %dummyValue : i32
affine.yield %selected : i32
}
%newValue = arith.addi %extractedValue, %c_10 : i32
%inserted = affine.for %i=0 to 16 iter_args(%inputArg = %input) -> tensor<16xi32> {
// 1. Check if %i matches the %inputIndex
// 2. Insert %newValue and produce %newTensor
// 3. If %i matches %inputIndex, select %newTensor, else select input tensor
// 4. Yield final tensor
%cond = arith.cmpi eq, %i, %inputIndex : index
%newTensor = tensor.insert %value into %inputArg[%i] : tensor<16xi32>
%finalTensor= arith.select %cond, %newTensor, %inputArg : tensor<16xi32>
affine.yield %finalTensor : tensor<16xi32>
}
...
}
```

### More notes on these transformations

These 3 transformations have a cascading behavior where transformations can be applied progressively to achieve a data-oblivious program. The order of the transformations goes as follows:

*Access-Transformation*(change data-dependent tensor accesses (reads-writes) to use`affine.for`

and`scf.if`

operations) ->*Loop-Transformation*(change data-dependent loops to use constant bounds and condition the loop’s yield results with`scf.if`

operation) ->*If-Transformation*(substitute data-dependent conditionals with`arith.select`

operation).- Besides that, when we apply non-SIMD Access-Transformation on multiple data-dependent tensor read-write operations over the same tensor, we can benefit from upstream affine transformations over the resulting multiple affine loops produced by the Access-Transformation to fuse these loops.

# 2 - Secret

The `secret`

dialect contains types
and operations to represent generic computations on secret data. It is intended
to be a high-level entry point for the HEIR compiler, agnostic of any particular
FHE scheme.

Most prior FHE compiler projects design their IR around a specific FHE scheme,
and provide dedicated IR types for the secret analogues of existing data types,
and/or dedicated operations on secret data types. For example, the Concrete
compiler has `!FHE.eint<32>`

for an encrypted 32-bit integer, and `add_eint`

and
similar ops. HECO has `!fhe.secret<T>`

that models a generic secret type, but
similarly defines `fhe.add`

and `fhe.multiply`

, and other projects are similar.

The problem with this approach is that it is difficult to incorporate the apply
upstream canonicalization and optimization passes to these ops. For example, the
`arith`

dialect in MLIR has
canonicalization patterns
that must be replicated to apply to FHE analogues. One of the goals of HEIR is
to reuse as much upstream infrastructure as possible, and so this led us to
design the `secret`

dialect to have both generic types and generic computations.
Thus, the `secret`

dialect has two main parts: a `secret<T>`

type that wraps any
other MLIR type `T`

, and a `secret.generic`

op that lifts any computation on
cleartext to the “corresponding” computation on secret data types.

## Overview with BGV-style lowering pipeline

Here is an example of a program that uses `secret`

to lift a dot product
computation:

```
func.func @dot_product(
%arg0: !secret.secret<tensor<8xi16>>,
%arg1: !secret.secret<tensor<8xi16>>) -> !secret.secret<i16> {
%c0_i16 = arith.constant 0 : i16
%0 = secret.generic ins(%arg0, %arg1 : !secret.secret<tensor<8xi16>>, !secret.secret<tensor<8xi16>>) {
^bb0(%arg2: tensor<8xi16>, %arg3: tensor<8xi16>):
%1 = affine.for %arg4 = 0 to 8 iter_args(%arg5 = %c0_i16) -> (i16) {
%extracted = tensor.extract %arg2[%arg4] : tensor<8xi16>
%extracted_0 = tensor.extract %arg3[%arg4] : tensor<8xi16>
%2 = arith.muli %extracted, %extracted_0 : i16
%3 = arith.addi %arg5, %2 : i16
affine.yield %3 : i16
}
secret.yield %1 : i16
} -> !secret.secret<i16>
return %0 : !secret.secret<i16>
}
```

The operands to the `generic`

op are the secret data types, and the op contains
a single region, whose block arguments are the corresponding cleartext data
values. Then the region is free to perform any computation, and the values
passed to `secret.yield`

are lifted back to `secret`

types. Note that
`secret.generic`

is not isolated from its enclosing scope, so one may refer to
cleartext SSA values without adding them as generic operands and block
arguments.

Clearly `secret.generic`

does not actually do anything. It is not decrypting
data. It is merely describing the operation that one wishes to apply to the
secret data in more familiar terms. It is a structural operation, primarily used
to demarcate which operations involve secret operands and have secret results,
and group them for later optimization. The benefit of this is that one can write
optimization passes on types and ops that are not aware of `secret`

, and they
will naturally match on the bodies of `generic`

ops.

For example, here is what the above dot product computation looks like after
applying the `-cse -canonicalize -heir-simd-vectorizer`

passes, the
implementations of which do not depend on `secret`

or `generic`

.

```
func.func @dot_product(
%arg0: !secret.secret<tensor<8xi16>>,
%arg1: !secret.secret<tensor<8xi16>>) -> !secret.secret<i16> {
%c1 = arith.constant 1 : index
%c2 = arith.constant 2 : index
%c4 = arith.constant 4 : index
%c7 = arith.constant 7 : index
%0 = secret.generic ins(%arg0, %arg1 : !secret.secret<tensor<8xi16>>, !secret.secret<tensor<8xi16>>) {
^bb0(%arg2: tensor<8xi16>, %arg3: tensor<8xi16>):
%1 = arith.muli %arg2, %arg3 : tensor<8xi16>
%2 = tensor_ext.rotate %1, %c4 : tensor<8xi16>, index
%3 = arith.addi %1, %2 : tensor<8xi16>
%4 = tensor_ext.rotate %3, %c2 : tensor<8xi16>, index
%5 = arith.addi %3, %4 : tensor<8xi16>
%6 = tensor_ext.rotate %5, %c1 : tensor<8xi16>, index
%7 = arith.addi %5, %6 : tensor<8xi16>
%extracted = tensor.extract %7[%c7] : tensor<8xi16>
secret.yield %extracted : i16
} -> !secret.secret<i16>
return %0 : !secret.secret<i16>
}
```

The canonicalization patterns for `secret.generic`

apply a variety of
simplifications, such as:

- Removing any unused or non-secret arguments and return values.
- Hoisting operations in the body of a
`generic`

that only depend on cleartext values to the enclosing scope. - Removing any
`generic`

ops that use no secrets at all.

These can be used together with the
`secret-distribute-generic`

pass
to split an IR that contains a large `generic`

op into `generic`

ops that
contain a single op, which can then be lowered to a particular FHE scheme
dialect with dedicated ops. This makes lowering easier because it gives direct
access to the secret version of each type that is used as input to an individual
op.

As an example, a single-op secret might look like this (taken from the larger example below. Note the use of a cleartext from the enclosing scope, and the proximity of the secret type to the op to be lowered.

```
%c2 = arith.constant 2 : index
%3 = secret.generic ins(%2 : !secret.secret<tensor<8xi16>>) {
^bb0(%arg2: tensor<8xi16>):
%8 = tensor_ext.rotate %arg2, %c2 : tensor<8xi16>, index
secret.yield %8 : tensor<8xi16>
} -> !secret.secret<tensor<8xi16>>
```

For a larger example, applying `--secret-distribute-generic --canonicalize`

to
the IR above:

```
func.func @dot_product(%arg0: !secret.secret<tensor<8xi16>>, %arg1: !secret.secret<tensor<8xi16>>) -> !secret.secret<i16> {
%c1 = arith.constant 1 : index
%c2 = arith.constant 2 : index
%c4 = arith.constant 4 : index
%c7 = arith.constant 7 : index
%0 = secret.generic ins(%arg0, %arg1 : !secret.secret<tensor<8xi16>>, !secret.secret<tensor<8xi16>>) {
^bb0(%arg2: tensor<8xi16>, %arg3: tensor<8xi16>):
%8 = arith.muli %arg2, %arg3 : tensor<8xi16>
secret.yield %8 : tensor<8xi16>
} -> !secret.secret<tensor<8xi16>>
%1 = secret.generic ins(%0 : !secret.secret<tensor<8xi16>>) {
^bb0(%arg2: tensor<8xi16>):
%8 = tensor_ext.rotate %arg2, %c4 : tensor<8xi16>, index
secret.yield %8 : tensor<8xi16>
} -> !secret.secret<tensor<8xi16>>
%2 = secret.generic ins(%0, %1 : !secret.secret<tensor<8xi16>>, !secret.secret<tensor<8xi16>>) {
^bb0(%arg2: tensor<8xi16>, %arg3: tensor<8xi16>):
%8 = arith.addi %arg2, %arg3 : tensor<8xi16>
secret.yield %8 : tensor<8xi16>
} -> !secret.secret<tensor<8xi16>>
%3 = secret.generic ins(%2 : !secret.secret<tensor<8xi16>>) {
^bb0(%arg2: tensor<8xi16>):
%8 = tensor_ext.rotate %arg2, %c2 : tensor<8xi16>, index
secret.yield %8 : tensor<8xi16>
} -> !secret.secret<tensor<8xi16>>
%4 = secret.generic ins(%2, %3 : !secret.secret<tensor<8xi16>>, !secret.secret<tensor<8xi16>>) {
^bb0(%arg2: tensor<8xi16>, %arg3: tensor<8xi16>):
%8 = arith.addi %arg2, %arg3 : tensor<8xi16>
secret.yield %8 : tensor<8xi16>
} -> !secret.secret<tensor<8xi16>>
%5 = secret.generic ins(%4 : !secret.secret<tensor<8xi16>>) {
^bb0(%arg2: tensor<8xi16>):
%8 = tensor_ext.rotate %arg2, %c1 : tensor<8xi16>, index
secret.yield %8 : tensor<8xi16>
} -> !secret.secret<tensor<8xi16>>
%6 = secret.generic ins(%4, %5 : !secret.secret<tensor<8xi16>>, !secret.secret<tensor<8xi16>>) {
^bb0(%arg2: tensor<8xi16>, %arg3: tensor<8xi16>):
%8 = arith.addi %arg2, %arg3 : tensor<8xi16>
secret.yield %8 : tensor<8xi16>
} -> !secret.secret<tensor<8xi16>>
%7 = secret.generic ins(%6 : !secret.secret<tensor<8xi16>>) {
^bb0(%arg2: tensor<8xi16>):
%extracted = tensor.extract %arg2[%c7] : tensor<8xi16>
secret.yield %extracted : i16
} -> !secret.secret<i16>
return %7 : !secret.secret<i16>
}
```

And then lowering it to `bgv`

with `--secret-to-bgv="poly-mod-degree=8"`

(the
pass option matches the tensor size, but it is an unrealistic FHE polynomial
degree used here just for demonstration purposes). Note type annotations on ops
are omitted for brevity.

```
#encoding = #lwe.polynomial_evaluation_encoding<cleartext_start = 16, cleartext_bitwidth = 16>
#params = #lwe.rlwe_params<ring = <cmod=463187969, ideal=#_polynomial.polynomial<1 + x**8>>>
!ty1 = !lwe.rlwe_ciphertext<encoding=#encoding, rlwe_params=#params, underlying_type=tensor<8xi16>>
!ty2 = !lwe.rlwe_ciphertext<encoding=#encoding, rlwe_params=#params, underlying_type=i16>
func.func @dot_product(%arg0: !ty1, %arg1: !ty1) -> !ty2 {
%c1 = arith.constant 1 : index
%c2 = arith.constant 2 : index
%c4 = arith.constant 4 : index
%c7 = arith.constant 7 : index
%0 = bgv.mul %arg0, %arg1
%1 = bgv.relinearize %0 {from_basis = array<i32: 0, 1, 2>, to_basis = array<i32: 0, 1>}
%2 = bgv.rotate %1, %c4
%3 = bgv.add %1, %2
%4 = bgv.rotate %3, %c2
%5 = bgv.add %3, %4
%6 = bgv.rotate %5, %c1
%7 = bgv.add %5, %6
%8 = bgv.extract %7, %c7
return %8
}
```

## Differences for CGGI-style pipeline

The `tosa-to-boolean-tfhe`

and related pipelines add a few additional steps. The
main goal here is to apply a hardware circuit optimizer to blocks of standard
MLIR code (inside `secret.generic`

ops) which converts the computation to an
optimized boolean circuit with a desired set of gates. Only then is
`-secret-distribute-generic`

applied to split the ops up and lower them to the
`cggi`

dialect. In particular, because passing an IR through the circuit
optimizer requires unrolling all loops, one useful thing you might want to do is
to optimize only the *body* of a for loop nest.

To accomplish this, we have two additional mechanisms. One is the pass option
`ops-to-distribute`

for `-secret-distribute-generic`

, which allows the user to
specify a list of ops that `generic`

should be split across, and all others left
alone. Specifying `affine.for`

here will pass `generic`

through the `affine.for`

loop, but leave its body intact. This can also be used with the `-unroll-factor`

option to the `-yosys-optimizer`

pass to partially unroll a loop nest and pass
the partially-unrolled body through the circuit optimizer.

The other mechanism is the `secret.separator`

op, which is a purely structural
op that demarcates the boundary of a subset of a block that should be jointly
optimized in the circuit optimizer.

For example, the following `tosa`

ops lower to multiple linalg instructions, and
hence multiple for loops, that we want to pass to a circuit optimizer as a unit.
The `secret.separator`

ops surrounding the op are preserved through the
lowering.

```
func.func @main(%arg0: tensor<1x1xi8> {secret.secret}) -> tensor<1x16xi32> {
secret.separator
%4 = "tosa.const"() {value = dense<[0, 0, -5438, -5515, -1352, -1500, -4152, -84, 3396, 0, 1981, -5581, 0, -6964, 3407, -7217]> : tensor<16xi32>} : () -> tensor<16xi32>
%5 = "tosa.const"() {value = dense<[[-9], [-54], [57], [71], [104], [115], [98], [99], [64], [-26], [127], [25], [-82], [68], [95], [86]]> : tensor<16x1xi8>} : () -> tensor<16x1xi8>
%6 = "tosa.fully_connected"(%arg0, %5, %4) {quantization_info = #tosa.conv_quant<input_zp = -128, weight_zp = 0>} : (tensor<1x1xi8>, tensor<16x1xi8>, tensor<16xi32>) -> tensor<1x16xi32>
secret.separator
return %6 : tensor<1x16xi32>
}
```

After running `--tosa-to-boolean-tfhe`

and dumping the IR after the linalg ops
are lowered to loops, we can see the `secret.separator`

ops enclose the lowered
ops, with the exception of some pure ops that are speculatively executed.

```
func.func @main(%arg0: memref<1x1xi8, strided<[?, ?], offset: ?>> {secret.secret}) -> memref<1x16xi32> {
%c-128_i32 = arith.constant -128 : i32
%0 = memref.get_global @__constant_16xi32 : memref<16xi32>
%1 = memref.get_global @__constant_16x1xi8 : memref<16x1xi8>
secret.separator
%alloc = memref.alloc() {alignment = 64 : i64} : memref<1x16xi8>
affine.for %arg1 = 0 to 1 {
affine.for %arg2 = 0 to 16 {
%2 = affine.load %1[%arg2, %arg1] : memref<16x1xi8>
affine.store %2, %alloc[%arg1, %arg2] : memref<1x16xi8>
}
}
%alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<1x16xi32>
affine.for %arg1 = 0 to 1 {
affine.for %arg2 = 0 to 16 {
%2 = affine.load %0[%arg2] : memref<16xi32>
affine.store %2, %alloc_0[%arg1, %arg2] : memref<1x16xi32>
}
}
affine.for %arg1 = 0 to 1 {
affine.for %arg2 = 0 to 16 {
affine.for %arg3 = 0 to 1 {
%2 = affine.load %arg0[%arg1, %arg3] : memref<1x1xi8, strided<[?, ?], offset: ?>>
%3 = affine.load %alloc[%arg3, %arg2] : memref<1x16xi8>
%4 = affine.load %alloc_0[%arg1, %arg2] : memref<1x16xi32>
%5 = arith.extsi %2 : i8 to i32
%6 = arith.subi %5, %c-128_i32 : i32
%7 = arith.extsi %3 : i8 to i32
%8 = arith.muli %6, %7 : i32
%9 = arith.addi %4, %8 : i32
affine.store %9, %alloc_0[%arg1, %arg2] : memref<1x16xi32>
}
}
}
secret.separator
memref.dealloc %alloc : memref<1x16xi8>
return %alloc_0 : memref<1x16xi32>
}
```

We decided to use the `separator`

op over a few alternatives:

- Grouping by
`secret.generic`

: these`tosa`

ops must be bufferized, but`secret`

types cannot participate in bufferization (see the Limitations section). - Grouping by basic blocks:
`secret.generic`

is a single-block op with a yield terminator, and grouping by blocks would require us to change this. - Grouping by regions: SSA values generated by a region are not visible to the enclosing scope, so we would need to have the region-bearing op return values, which is tedious to organize.
- Attaching attributes to ops that should be grouped together: this would not be preserved by upstream lowerings and optimization passes.

`generic`

operands

`secret.generic`

takes any SSA values as legal operands. They may be `secret`

types or non-secret. Canonicalizing `secret.generic`

removes non-secret operands
and leaves them to be referenced via the enclosing scope (`secret.generic`

is
not `IsolatedFromAbove`

).

This may be unintuitive, as one might expect that only secret types are valid
arguments to `secret.generic`

, and that a verifier might assert non-secret args
are not present.

However, we allow non-secret operands because it provides a convenient scope
encapsulation mechanism, which is useful for the `--yosys-optimizer`

pass that
runs a circuit optimizer on individual `secret.generic`

ops and needs to have
access to all SSA values used as inputs. The following passes are related to
this functionality:

`secret-capture-generic-ambient-scope`

`secret-generic-absorb-constants`

`secret-extract-generic-body`

Due to the canonicalization rules for `secret.generic`

, anyone using these
passes as an IR organization mechanism must be sure not to canonicalize before
accomplishing the intended task.

## Limitations

### Bufferization

Secret types cannot participate in bufferization passes. In particular,
`-one-shot-bufferize`

hard-codes the notion of tensor and memref types, and so
it cannot currently operate on `secret<tensor<...>>`

or `secret<memref<...>>`

types, which prevents us from implementing a bufferization interface for
`secret.generic`

. This was part of the motivation to introduce
`secret.separator`

, because `tosa`

ops like a fully connected neural network
layer lower to multiple linalg ops, and these ops need to be bufferized before
they can be lowered further. However, we want to keep the lowered ops grouped
together for circuit optimization (e.g., fusing transposes and constant weights
into the optimized layer), but because of this limitation, we can’t simply wrap
the `tosa`

ops in a `secret.generic`

(bufferization would fail).

# 3 - SIMD Optimizations

HEIR includes a SIMD (Single Instruction, Multiple Data) optimizer which is designed to exploit the restricted SIMD parallelism most (Ring-LWE-based) FHE schemes support (also commonly known as “packing” or “batching”). Specifically, HEIR incorporates the “automated batching” optimizations (among many other things) from the HECO compiler. The following will assume basic familiarity with the FHE SIMD paradigm and the high-level goals of the optimization, and we refer to the associated HECO paper, slides, talk and additional resources on the Usenix'23 website for an introduction to the topic. This documentation will mostly focus on describing how the optimization is realized in HEIR (which differs somewhat from the original implementation) and how the optimization is intended to be used in an overall end-to-end compilation pipeline.

## Representing FHE SIMD Operations

Following the design principle of maintaining programs in standard MLIR dialects
as long as possible (cf. the design rationale behind the
Secret Dialect), HEIR uses the MLIR
`tensor`

dialect and
ElementwiseMappable
operations from the MLIR
`arith`

dialect to represent HE
SIMD operations.

We do introduce the HEIR-specific
`tensor_ext.rotate`

operation, which represents a cyclical left-rotation of a tensor. Note that, as
the current SIMD vectorizer only supports one-dimensional tensors, the semantics
of this operation on multi-dimensional tensors are not (currently) defined.

For example, the common “rotate-and-reduce” pattern which results in each element containing the sum/product/etc of the original vector can be expressed as:

```
%tensor = tensor.from_elements %i1, %i2, %i3, %i4, %i5, %i6, %i7, %i8 : tensor<8xi16>
%0 = tensor_ext.rotate %tensor, %c4 : tensor<8xi16>, index
%1 = arith.addi %tensor, %0 : tensor<8xi16>
%2 = tensor_ext.rotate %1, %c2 : tensor<8xi16>, index
%3 = arith.addi %1, %2 : tensor<8xi16>
%4 = tensor_ext.rotate %3, %c1 : tensor<8xi16>, index
%5 = arith.addi %3, %4 : tensor<8xi16>
```

The `%cN`

and `%iN`

, which are defined as `%cN = arith.constant N : index`

and
`%iN = arith.constant N : i16`

, respectively, have been omitted for readability.

## Intended Usage

The `-heir-simd-vectorizer`

pipeline transforms a program consisting of loops
and index-based accesses into tensors (e.g., `tensor.extract`

and
`tensor.insert`

) into one consisting of SIMD operations (including rotations) on
entire tensors. While its implementation does not depend on any FHE-specific
details or even the Secret dialect, this transformation is likely only useful
when lowering a high-level program to an arithmetic-circuit-based FHE scheme
(e.g., B/FV, BGV, or CKKS). The `-mlir-to-openfhe-bgv`

pipeline demonstrates the
intended flow: augmenting a high-level program with `secret`

annotations, then
applying the SIMD optimization (and any other high-level optimizations) before
lowering to BGV operations and then exiting to OpenFHE.

WarningThe current SIMD vectorizer pipeline supports only one-dimensional tensors. As a workaround, one could reshape all multi-dimensional tensors into one-dimensional tensors, but MLIR/HEIR currently do not provide a pass to automate this process.

Since the optimization is based on heuristics, the resulting program might not be optimal or could even be worse than a trivial realization that does not use ciphertext packing. However, well-structured programs generally lower to reasonable batched solutions, even if they do not achieve optimal batching layouts. For common operations such as matrix-vector or matrix-matrix multiplications, state-of-the-art approaches require advanced packing schemes that might map elements into the ciphertext vector in non-trivial ways (e.g., diagonal-major and/or replicated). The current SIMD vectorizer will never change the arrangement of elements inside an input tensor and therefore cannot produce the optimal approaches for these operations.

Note, that the SIMD batching optimization is different from, and significantly
more complex than, the Straight Line Vectorizer (`-straight-line-vectorize`

pass), which simply groups
ElementwiseMappable
operations that agree in operation name and operand/result types into
vectorized/tensorized versions.

## Implementation

Below, we give a brief overview over the implementation, with the goal of both improving maintainability/extensibility of the SIMD vectorizer and allowing advanced users to better understand why a certain program is transformed in the way it is.

### Components

The `-heir-simd-vectorizer`

pipeline uses a combination of standard MLIR passes
(`-canonicalize`

,
`-cse`

,
`-sccp`

) and custom HEIR passes.
Some of these
(`-apply-folders`

,
`-full-loop-unroll`

)
might have applications outside the SIMD optimization, while others
(`-insert-rotate`

,
`-collapse-insertion-chains`

and
`-rotate-and-reduce`

)
are very specific to the FHE SIMD optimization. In addition, the passes make use
of the `RotationAnalysis`

and `TargetSlotAnalysis`

analyses.

### High-Level Flow

**Loop Unrolling**(`-full-loop-unroll`

): The implementation currently begins by unrolling all loops in the program to simplify the later passes. See #589 for a discussion on how this could be avoided.**Canonicalization**(`-apply-folders -canonicalize`

): As the rotation-specific passes are very strict about the structure of the IR they operate on, we must first simplify away things such as tensors of constant values. For performance reasons (c.f. comments in the`heirSIMDVectorizerPipelineBuilder`

function in`heir-opt.cpp`

), this must be done by first applying folds before applying the full canonicalization.**Main SIMD Rewrite**(`-insert-rotate -cse -canonicalize -cse`

): This pass rewrites arithmetic operations over`tensor.extract`

-ed operands into SIMD operations over the entire tensor, rotating the (full-tensor) operands so that the correct elements interact. For example, it will rewrite the following snippet (which computes`t2[4] = t0[3] + t1[5]`

)`%0 = tensor.extract %t0[%c3] : tensor<32xi16> %1 = tensor.extract %t1[%c5] : tensor<32xi16> %2 = arith.addi %0, %1 : i16 %3 = tensor.insert %2 into %t2[%c4] : tensor<32xi16>`

to

`%0 = tensor_ext.rotate %t0, %c31 : tensor<32xi16>, index %1 = tensor_ext.rotate %t1, %c1 : tensor<32xi16>, index %2 = arith.addi %0, %1 : tensor<32xi16>`

i.e., rotating

`t0`

down by one (31 = -1 (mod 32)) and`t1`

up by one to bring the elements at index 3 and 5, respectively, to the “target” index 4. The pass uses the`TargetSlotAnalysis`

to identify the appropriate target index (or ciphertext “slot” in FHE-speak). See Insert Rotate Pass below for more details. This pass is roughly equivalent to the`-batching`

pass in the original HECO implementation.Doing this rewrite by itself does not represent an optimization, but if we consider what happens to the corresponding code for other indices (e.g.,

`t2[5] = t0[4] + t1[6]`

), we see that the pass transforms expressions with the same relative index offsets into the exact same set of rotations/SIMD operations, so the following Common Subexpression Elimination (CSE) will remove redundant computations. We apply CSE twice, once directly (which creates new opportunities for canonicalization and folding) and then again after that canonicalization. See TensorExt Canonicalization for a description of the rotation-specific canonocalization patterns).**Cleanup of Redundant Insert/Extract**(`-collapse-insertion-chains -sccp -canonicalize -cse`

): Because the`-insert-rotate`

pass maintains the consistency of the IR, it emits a`tensor.extract`

operation after the SIMD operation and uses that to replace the original operation (which is valid, as both produce the desired scalar result). As a consequence, the generated code for the snippet above is actually trailed by a (redundant) extract/insert:`%extracted = tensor.extract %2[%c4] : tensor<32xi16> %inserted = tensor.insert %extracted into %t2[%c4] : tensor<32xi16>`

In real code, this might generate a long series of such extraction/insertion operations, all extracting from the same (due to CSE) tensor and inserting into the same output tensor. Therefore, the

`-collapse-insertion-chains`

pass searches for such chains over entire tensors and collapses them. It supports not just chains where the indices match perfectly, but any chain where the relative offset is consistent across the tensor, issuing a rotation to realize the offset (if the offset is zero, the canonicalization will remove the redundant rotation). Note, that in HECO, insertion/extraction is handled differently, as HECO features a`combine`

operation modelling not just simple insertions (`combine(%t0#j, %t1)`

) but also more complex operations over slices of tensors (`combine(%t0#[i,j], %t1)`

). As a result, the equivalent pass in HECO (`-combine-simplify`

) instead joins different`combine`

operations, and a later fold removes`combines`

that replace the entire target tensor. See issue #512 for a discussion on why the`combine`

operation is a more powerful framework and what would be necessary to port it to HEIR.**Applying Rotate-and-Reduce Patterns**(`-rotate-and-reduce -sccp -canonicalize -cse`

): The rotate and reduce pattern (see Representing FHE SIMD Operations for an example) is an important aspect of accelerating SIMD-style operations in FHE, but it does not follow automatically from the batching rewrites applied so far. As a result, the`-rotate-and-reduce`

pass needs to search for sequences of arithmetic operations that correspond to the full folding of a tensor, i.e., patterns such as`t[0]+(t[1]+(t[2]+t[3]+(...)))`

, which currently uses a backwards search through the IR, but could be achieved more efficiently through a data flow analysis (c.f. issue #532). In HECO, rotate-and-reduce is handled differently, by identifying sequences of compatible operations prior to batching and rewriting them to “n-ary” operations. However, this approach requires non-standard arithmetic operations and is therefore not suitable for use in HEIR. However, there is likely still an opportunity to make the patterns in HEIR more robust/general (e.g., support constant scalar operands in the fold, or support non-full-tensor folds). See issue #522 for ideas on how to make the HEIR pattern more robust/more general.

### Insert Rotate Pass

TODO(#721): Write a detailed description of the rotation insertion pass and the associated target slot analysis.

### TensorExt Canonicalization

The
TensorExt (`tensor_ext`

) Dialect
includes a series of canonicalization rules that are essential to making
automatically generated rotation code efficient:

Rotation by zero:

`rotate %t, 0`

folds away to`%t`

Cyclical wraparound:

`rotate %t, k`

for $k > t.size$ can be simplified to`rotate %t, (k mod t.size)`

Sequential rotation:

`%0 = rotate %t, k`

followed by`%1 = rotate %0, l`

is simplified to`rotate %t (k+l)`

Extraction:

`%0 = rotate %t, k`

followed by`%1 = tensor.extract %0[l]`

is simplified to`tensor.extract %t[k+l]`

Binary Arithmetic Ops: where both operands to a binary

`arith`

operation are rotations by the same amount, the rotation can be performed only once, on the result. For Example,`%0 = rotate %t1, k %1 = rotate %t2, k %2 = arith.add %0, %1`

can be simplified to

`%0 = arith.add %t1, %t2 %1 = rotate %0, k`

*Sandwiched*Binary Arithmetic Ops: If a rotation follows a binary`arith`

operation which has rotation as its operands, the post-arith operation can be moved forward. For example,`%0 = rotate %t1, x %1 = rotate %t2, y %2 = arith.add %0, %1 %3 = rotate %2, z`

can be simplified to

`%0 = rotate %t1, x + z %1 = rotate %t2, y + z %2 = arith.add %0, %1`

Single-Use Arithmetic Ops: Finally, there is a pair of rules that do not eliminate rotations, but move rotations up in the IR, which can help in exposing further canonicalization and/or CSE opportunities. These only apply to

`arith`

operations with a single use, as they might otherwise increase the total number of rotations. For example,`%0 = rotate %t1, k %2 = arith.add %0, %t2 %1 = rotate %2, l`

can be equivalently rewritten as

`%0 = rotate %t1, (k+l) %1 = rotate %t2, l %2 = arith.add %0, %1`

and a similar pattern exists for situations where the rotation is the rhs operand of the arithmetic operation.

Note that the index computations in the patterns above (e.g., `k+l`

,
`k mod t.size`

are realized via emitting `arith`

operations. However, for
constant/compile-time-known indices, these will be subsequently constant-folded
away by the canonicalization pass.

# 4 -

title: Optimizing relinearization

## weight: 10

This document outlines the integer linear program model used in the
`optimize-relinearization`

pass.

## Background

In vector/arithmetic FHE, RLWE ciphertexts often have the form $\mathbf{c} = (c_0, c_1)$, where the details of how $c_0$ and $c_1$ are computed depend on the specific scheme. However, in most of these schemes, the process of decryption can be thought of as taking a dot product between the vector $\mathbf{c}$ and a vector $(1, s)$ containing the secret key $s$ (followed by rounding).

In such schemes, the homomorphic multiplication of two ciphertexts $\mathbf{c} = (c_0, c_1)$ and $\mathbf{d} = (d_0, d_1)$ produces a ciphertext $\mathbf{f} = (f_0, f_1, f_2)$. This triple can be decrypted by taking a dot product with $(1, s, s^2)$.

With this in mind, each RLWE ciphertext $\mathbf{c}$ has an associated *key
basis*, which is the vector $\mathbf{s_c}$ whose dot product with $\mathbf{c}$
decrypts it.

Usually a larger key basis is undesirable. For one, operations in a higher key
basis are more expensive and have higher rates of noise growth. Repeated
multiplications exponentially increase the length of the key basis. So to avoid
this, an operation called *relinearization* was designed that converts a
ciphertext from a given key basis back to $(1, s)$. Doing this requires a set of
*relinearization keys* to be provided by the client and stored by the server.

In general, key bases can be arbitrary. Rotation of an RLWE ciphertext by a shift of $k$, for example, first applies the automorphism $x \mapsto x^k$. This converts the key basis from $(1, s)$ to $(1, s^k)$, and more generally maps $(1, s, s^2, \dots, s^d) \mapsto (1, s^k, s^{2k}, \dots, s^{kd})$. Most FHE implementations post-compose this automorphism with a key switching operation to return to the linear basis $(1, s)$. Similarly, multiplication can be defined for two key bases $(1, s^n)$ and $(1, s^m)$ (with $n < m$) to produce a key basis $(1, s^n, s^m, s^{n+m})$. By a combination of multiplications and rotations (without ever relinearizing or key switching), ciphertexts with a variety of strange key bases can be produced.

Most FHE implementations do not permit wild key bases because each key switch and relinearization operation (for each choice of key basis) requires additional secret key material to be stored by the server. Instead, they often enforce that rotation has key-switching built in, and multiplication relinearizes by default.

That said, many FHE implementations do allow for the relinearization operation
to be deferred. A useful such situation is when a series of independent
multiplications are performed, and the results are added together. Addition can
operate in any key basis (though all inputs must have the same key basis), and
so the relinearization op that follows each multiplication can be deferred until
after the additions are complete, at which point there is only one
relinearization to perform. This technique is usually called *lazy
relinearization*. It has the benefit of avoiding expensive relinearization
operations, as well as reducing noise growth, as relinearization adds noise to
the ciphertext, which can further reduce the need for bootstrapping.

In much of the literature, lazy relinearization is applied manually. See for example Blatt-Gusev-Polyakov-Rohloff-Vaikuntanathan 2019 and Lee-Lee-Kim-Kim-No-Kang 2020. In some compiler projects, such as the EVA compiler relinearization is applied automatically via a heuristic, either “eagerly” (immediately after each multiplication op) or “lazily,” deferred as late as possible.

## The `optimize-relinearization`

pass

In HEIR, relinearization placement is implemented via a mixed-integer linear program (ILP). It is intended to be more general than a lazy relinearization heuristic, and certain parameter settings of the ILP reproduce lazy relinearization.

The `optimize-relinearization`

pass starts by deleting all relinearization
operations from the IR, solves the ILP, and then inserts relinearization ops
according to the solution. This implies that the input IR to the ILP has no
relinearization ops in it already.

## Model specification

The ILP model fits into a family of models that is sometimes called “state-dynamics” models, in that it has “state” variables that track a quantity that flows through a system, as well as “decision” variables that control decisions to change the state at particular points. A brief overview of state dynamics models can be found here

In this ILP, the “state” value is the degree of the key basis. I.e., rather than track the entire key basis, we assume the key basis always has the form $(1, s, s^2, \dots, s^k)$ and track the value $k$. The index tracking state is SSA value, and the decision variables are whether to relinearize.

### Variables

Define the following variables:

- For each operation $o$, $R_o \in { 0, 1 }$ defines the decision to relinearize the result of operation $o$. Relinearization is applied if and only if $R_o = 1$.
- For each SSA value $v$, $\textup{KB}_v$ is a continuous variable
representing the degree of the key basis of $v$. For example, if the key basis
of a ciphertext is $(1, s)$, then $\textup{KB}_v = 1$. If $v$ is the result
of an operation $o$, $\textup{KB}_v$ is the key basis of the result of $o$
*after*relinearization has been optionally applied to it, depending on the value of the decision variable $R_o$. - For each SSA value $v$ that is an operation result, $\textup{KB}^{br}_v$ is
a continuous variable whose value represents the key basis degree of $v$
*before*relinearization is applied (`br`

= “before relin”). These SSA values are mainly for*after*the model is solved and relinearization operations need to be inserted into the IR. Here, type conflicts require us to reconstruct the key basis degree, and saving the values allows us to avoid recomputing the values.

Each of the key-basis variables is bounded from above by a parameter
`MAX_KEY_BASIS_DEGREE`

that can be used to impose hard limits on the key basis
size, which may be required if generating code for a backend that does not
support operations over generalized key bases.

### Objective

The objective is to minimize the number of relinearization operations, i.e., $\min \sum_o R_o$.

TODO(#1018): update docs when objective is generalized.

### Constraints

The simple constraints are as follows:

- Initial key basis degree: For each block argument, $\textup{KB}_v$ is fixed
to equal the
`dimension`

parameter on the RLWE ciphertext type. - Operand agreement: For each operation with operand SSA values $v_1, \dots,
v_k$, $\textup{KB}
*{v_1} = \dots = \textup{KB}*{v_k}$, i.e., all key basis inputs must match. - Special linearized ops:
`bgv.rotate`

and`func.return`

require linearized inputs, i.e., $\textup{KB}_{v_i} = 1$ for all inputs $v_i$ to these operations. - Before relinearization key basis: for each operation $o$ with operands $v_1,
\dots, v_k$, constrain $\textup{KB}^{br}
*{\textup{result}(o)} = f(\textup{KB}*{v_1}, \dots, \textup{KB}_{v_k})$, where $f$ is a statically known linear function. For multiplication $f$ is addition, and for all other ops it is the projection onto any input, since multiplication is the only op that increases the degree, and all operands are constrained to have equal degree.

The remaining constraints control the dynamics of how the key basis degree changes as relinearizations are inserted.

They can be thought of as implementing this (non-linear) constraint for each operation $o$:

[ \textup{KB}*{\textup{result}(o)} = \begin{cases}
\textup{KB}^{br}*{\textup{result(o)}} & \text{ if } R_o = 0 \ 1 & \text{ if
} R_o = 1 \end{cases} ]

Note that $\textup{KB}^{br}_{\textup{result}(o)}$ is constrained by one of the simple constraints to be a linear expression containing key basis variables for the operands of $o$. The conditional above cannot be implemented directly in an ILP. Instead, one can implement it via four constraints that effectively linearize (in the sense of making non-linear constraints linear) the multiplexer formula

[ \textup{KB}*{\textup{result}(o)} = (1 - R_o) \cdot
\textup{KB}^{br}*{\textup{result}(o)} + R_o \cdot 1 ]

(Note the above is not linear because in includes the product of two variables.) The four constraints are:

[ \begin{aligned} \textup{KB}*\textup{result}(o) &\geq \textup{ R} o \*{\textup{result}(o)} – C \textup{ R}

\textup{KB}\textup{result}(o) &\leq 1 + C(1 – \textup{R}o) \

\textup{KB}\textup{result}(o) &\geq \textup{KB}^{br}

*o \*

\textup{KB}\textup{result}(o) &\leq \textup{KB}^{br}_{\textup{result}(o)}

\textup{KB}

- C \textup{ R}_o \

\end{aligned} ]

Here $C$ is a constant that can be set to any value larger than
`MAX_KEY_BASIS_DEGREE`

. We set it to 100.

Setting $R_o = 0$ makes constraints 1 and 2 trivially satisfied, while
constraints 3 and 4 enforce the equality $\textup{KB}*{\textup{result}(o)} =
\textup{KB}^{br}*{\textup{result}(o)}$. Likewise, setting $R_o = 1$ makes
constraints 3 and 4 trivially satisfied, while constraints 1 and 2 enforce the
equality $\textup{KB}_{\textup{result}(o)} = 1$.

## Notes

- ILP performance scales roughly with the number of integer variables. The formulation above only requires the decision variable to be integer, and the initialization and constraints effectively force the key basis variables to be integer. As a result, the solve time of the above ILP should scale with the number of ciphertext-handling ops in the program.