# API Description - Compute

## ElementwiseUnary

### Function Definition

```
template <typename InT, typename OutT, int NX, int NY, int BlockSize, class OpFunc>
__device__ void ElementwiseUnary(OutT* out, const InT* in, OpFunc compute)
```

### Detailed Description

Applies the calculation rule defined by OpFunc to each input element and stores the result in the register out, converted to type OutT.
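The behavior described above can be illustrated with a host-side sketch. This is not the library implementation: `ElementwiseUnaryRef`, `SquareFunctor`, and `DemoUnary` are hypothetical names introduced here, the BlockSize parameter is omitted, and the register tile is modeled as a plain array of NX * NY elements.

```cpp
#include <array>
#include <cassert>

// Hypothetical host-side model of the per-thread loop (assumption, not library code).
template <typename InT, typename OutT, int NX, int NY, class OpFunc>
void ElementwiseUnaryRef(OutT* out, const InT* in, OpFunc compute) {
  assert(out != nullptr && in != nullptr);
  for (int idx = 0; idx < NX * NY; ++idx) {
    // Apply the functor, then convert the result to OutT.
    out[idx] = static_cast<OutT>(compute(in[idx]));
  }
}

// Hypothetical functor: squares its argument.
template <typename T>
struct SquareFunctor {
  T operator()(const T& x) const { return x * x; }
};

// Squares a 2 x 3 tile of float inputs and stores the results as double.
std::array<double, 6> DemoUnary() {
  constexpr int NX = 3, NY = 2;
  const float in[NX * NY] = {1.f, 2.f, 3.f, 4.f, 5.f, 6.f};
  std::array<double, NX * NY> out{};
  ElementwiseUnaryRef<float, double, NX, NY>(out.data(), in, SquareFunctor<float>());
  return out;  // {1, 4, 9, 16, 25, 36}
}
```

In this sketch InT is float and OutT is double, which shows that the conversion to OutT happens after the functor has been applied.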

### Template Parameters

InT: The type of the input data.

OutT: The type of the data stored in the out register.

NX: The number of columns each thread computes.

NY: The number of rows each thread computes.

BlockSize: A device attribute that identifies the thread indexing method of the current device. On GPU, threadIdx.x is used as the thread index, and this parameter is not currently supported.

OpFunc: The calculation function. See the OpFunc section for its definition.

### Parameters

out: Pointer to the output register; its size is NX * NY.

in: Pointer to the input register; its size is NX * NY.

compute: The calculation object, declared as OpFunc<InT>().

## ElementwiseBinary

### Function Definition

```
template <typename InT, typename OutT, int NX, int NY, int BlockSize, class OpFunc>
__device__ void ElementwiseBinary(OutT* out, const InT* in1, const InT* in2, OpFunc compute)
```

### Detailed Description

Applies the calculation rule defined by OpFunc to corresponding elements of in1 and in2 and stores the result in the register out, converted to type OutT.
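A minimal host-side sketch of these semantics follows. It is an illustration under stated assumptions rather than the library source: `ElementwiseBinaryRef`, `AddFunctor`, and `DemoBinary` are hypothetical names, BlockSize is omitted, and registers are modeled as plain arrays.

```cpp
#include <array>
#include <cassert>

// Hypothetical host-side model of the per-thread loop (assumption, not library code).
template <typename InT, typename OutT, int NX, int NY, class OpFunc>
void ElementwiseBinaryRef(OutT* out, const InT* in1, const InT* in2, OpFunc compute) {
  assert(out != nullptr && in1 != nullptr && in2 != nullptr);
  for (int idx = 0; idx < NX * NY; ++idx) {
    // in1 is the left operand and in2 the right operand.
    out[idx] = static_cast<OutT>(compute(in1[idx], in2[idx]));
  }
}

// Hypothetical functor: adds its two arguments.
template <typename T>
struct AddFunctor {
  T operator()(const T& a, const T& b) const { return a + b; }
};

// Adds two 2 x 2 tiles element by element.
std::array<int, 4> DemoBinary() {
  constexpr int NX = 2, NY = 2;
  const int in1[NX * NY] = {1, 2, 3, 4};
  const int in2[NX * NY] = {10, 20, 30, 40};
  std::array<int, NX * NY> out{};
  ElementwiseBinaryRef<int, int, NX, NY>(out.data(), in1, in2, AddFunctor<int>());
  return out;  // {11, 22, 33, 44}
}
```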

### Template Parameters

InT: The type of the input data.

OutT: The type of the data stored in the out register.

NX: The number of columns each thread computes.

NY: The number of rows each thread computes.

BlockSize: A device attribute that identifies the thread indexing method of the current device. On GPU, threadIdx.x is used as the thread index, and this parameter is not currently supported.

OpFunc: The calculation function. See the OpFunc section for its definition.

### Parameters

out: Pointer to the output register; its size is NX * NY.

in1: Pointer to the left operand register; its size is NX * NY.

in2: Pointer to the right operand register; its size is NX * NY.

compute: The calculation object, declared as OpFunc<InT>().

## CycleBinary

### Function Definition

```
template <typename InT, typename OutT, int NX, int NY, int BlockSize, class OpFunc>
__device__ void CycleBinary(OutT* out, const InT* in1, const InT* in2, OpFunc compute)
```

### Detailed Description

Applies the calculation rule defined by OpFunc to in1 and in2 and stores the result in the register out, converted to type OutT. The shape of in1 is [1, NX] and the shape of in2 is [NY, NX]. The single row of in1 is reused cyclically for every row of in2, so the shape of out is [NY, NX].
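The cyclic reuse of in1 can be sketched on the host as follows. This is a sketch under assumptions, not the library source: `CycleBinaryRef`, `AddFunctor`, and `DemoCycle` are hypothetical names, BlockSize is omitted, and a row-major layout (element at row idy, column idx stored at idx + idy * NX) is assumed.

```cpp
#include <array>
#include <cassert>

// Hypothetical host-side model (assumption, not library code).
// in1 has shape [1, NX]; in2 and out have shape [NY, NX], assumed row-major.
template <typename InT, typename OutT, int NX, int NY, class OpFunc>
void CycleBinaryRef(OutT* out, const InT* in1, const InT* in2, OpFunc compute) {
  assert(out != nullptr && in1 != nullptr && in2 != nullptr);
  for (int idx = 0; idx < NX; ++idx) {
    for (int idy = 0; idy < NY; ++idy) {
      // The same in1[idx] is paired with column idx of every row of in2.
      out[idx + idy * NX] = static_cast<OutT>(compute(in1[idx], in2[idx + idy * NX]));
    }
  }
}

// Hypothetical functor: adds its two arguments.
template <typename T>
struct AddFunctor {
  T operator()(const T& a, const T& b) const { return a + b; }
};

// Adds a [1, 3] row to each row of a [2, 3] tile.
std::array<int, 6> DemoCycle() {
  constexpr int NX = 3, NY = 2;
  const int in1[NX] = {1, 2, 3};
  const int in2[NX * NY] = {10, 20, 30, 40, 50, 60};
  std::array<int, NX * NY> out{};
  CycleBinaryRef<int, int, NX, NY>(out.data(), in1, in2, AddFunctor<int>());
  return out;  // {11, 22, 33, 41, 52, 63}
}
```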

### Template Parameters

InT: The type of the input data.

OutT: The type of the data stored in the out register.

NX: The number of columns each thread computes.

NY: The number of rows each thread computes.

BlockSize: A device attribute that identifies the thread indexing method of the current device. On GPU, threadIdx.x is used as the thread index, and this parameter is not currently supported.

OpFunc: The calculation function. See the OpFunc section for its definition.

### Parameters

out: Pointer to the output register; its size is NX * NY.

in1: Pointer to the left operand register; its size is NX.

in2: Pointer to the right operand register; its size is NX * NY.

compute: The calculation object, declared as OpFunc<InT>().

## ElementwiseTernary

### Function Definition

```
template <typename InT, typename OutT, int NX, int NY, int BlockSize, class OpFunc>
__device__ void ElementwiseTernary(OutT* out, const InT* in1, const InT* in2, const InT* in3, OpFunc compute)
```

### Detailed Description

Applies the calculation rule defined by OpFunc to corresponding elements of in1, in2, and in3 and stores the result in the register out, converted to type OutT.
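A host-side sketch of the three-operand case is given below. It is illustrative only: `ElementwiseTernaryRef`, `MulAddFunctor`, and `DemoTernary` are hypothetical names, BlockSize is omitted, and registers are modeled as plain arrays.

```cpp
#include <array>
#include <cassert>

// Hypothetical host-side model of the per-thread loop (assumption, not library code).
template <typename InT, typename OutT, int NX, int NY, class OpFunc>
void ElementwiseTernaryRef(OutT* out, const InT* in1, const InT* in2,
                           const InT* in3, OpFunc compute) {
  assert(out != nullptr && in1 != nullptr && in2 != nullptr && in3 != nullptr);
  for (int idx = 0; idx < NX * NY; ++idx) {
    out[idx] = static_cast<OutT>(compute(in1[idx], in2[idx], in3[idx]));
  }
}

// Hypothetical functor: computes a * b + c.
template <typename T>
struct MulAddFunctor {
  T operator()(const T& a, const T& b, const T& c) const { return a * b + c; }
};

// Computes in1 * in2 + in3 over a 1 x 2 tile.
std::array<int, 2> DemoTernary() {
  constexpr int NX = 2, NY = 1;
  const int in1[NX * NY] = {1, 2};
  const int in2[NX * NY] = {3, 4};
  const int in3[NX * NY] = {5, 6};
  std::array<int, NX * NY> out{};
  ElementwiseTernaryRef<int, int, NX, NY>(out.data(), in1, in2, in3, MulAddFunctor<int>());
  return out;  // {8, 14}
}
```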

### Template Parameters

InT: The type of the input data.

OutT: The type of the data stored in the out register.

NX: The number of columns each thread computes.

NY: The number of rows each thread computes.

BlockSize: A device attribute that identifies the thread indexing method of the current device. On GPU, threadIdx.x is used as the thread index, and this parameter is not currently supported.

OpFunc: The calculation function. See the OpFunc section for its definition.

### Parameters

out: Pointer to the output register; its size is NX * NY.

in1: Pointer to the register of operand 1; its size is NX * NY.

in2: Pointer to the register of operand 2; its size is NX * NY.

in3: Pointer to the register of operand 3; its size is NX * NY.

compute: The calculation object, declared as OpFunc<InT>().

## ElementwiseAny

### Function Definition

```
template <typename InT, typename OutT, int NX, int NY, int BlockSize, int Arity, class OpFunc>
__device__ void ElementwiseAny(OutT* out, InT (*ins)[NX * NY], OpFunc compute)
```

### Detailed Description

Applies the calculation rule defined by OpFunc to the inputs in ins and stores the result in the register out, converted to type OutT. All inputs and the output have the same shape.
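The sketch below models this on the host for Arity inputs. It rests on an assumption the text above does not state: that compute receives the Arity values for one position gathered into a temporary array. `ElementwiseAnyRef`, `SumFunctor`, and `DemoAny` are hypothetical names, and BlockSize is omitted.

```cpp
#include <array>
#include <cassert>

// Hypothetical host-side model (assumption, not library code).
// ins points to Arity arrays, each holding NX * NY elements.
template <typename InT, typename OutT, int NX, int NY, int Arity, class OpFunc>
void ElementwiseAnyRef(OutT* out, InT (*ins)[NX * NY], OpFunc compute) {
  assert(out != nullptr && ins != nullptr);
  InT args[Arity];
  for (int idx = 0; idx < NX * NY; ++idx) {
    // Gather the idx-th element of every input (assumed calling convention).
    for (int j = 0; j < Arity; ++j) {
      args[j] = ins[j][idx];
    }
    out[idx] = static_cast<OutT>(compute(args));
  }
}

// Hypothetical functor: sums Arity gathered values.
template <typename T, int Arity>
struct SumFunctor {
  T operator()(const T* args) const {
    T acc = args[0];
    for (int j = 1; j < Arity; ++j) acc += args[j];
    return acc;
  }
};

// Sums three 1 x 2 inputs position by position.
std::array<int, 2> DemoAny() {
  constexpr int NX = 2, NY = 1, Arity = 3;
  int ins[Arity][NX * NY] = {{1, 2}, {10, 20}, {100, 200}};
  std::array<int, NX * NY> out{};
  ElementwiseAnyRef<int, int, NX, NY, Arity>(out.data(), ins, SumFunctor<int, Arity>());
  return out;  // {111, 222}
}
```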

### Template Parameters

InT: The type of the input data.

OutT: The type of the data stored in the out register.

NX: The number of columns each thread computes.

NY: The number of rows each thread computes.

BlockSize: A device attribute that identifies the thread indexing method of the current device. On GPU, threadIdx.x is used as the thread index, and this parameter is not currently supported.

Arity: The number of pointers in the pointer array ins.

OpFunc: The calculation function. See the OpFunc section for its definition.

### Parameters

out: Pointer to the output register; its size is NX * NY.

ins: An array of Arity input pointers, each pointing to NX * NY elements.

compute: The calculation object, declared as OpFunc<InT>().

## Reduce

### Function Definition

```
template <typename T, int NX, int NY, int BlockSize, class ReduceFunctor, details::ReduceMode Mode>
__device__ void Reduce(T* out, const T* in, ReduceFunctor reducer, bool reduce_last_dim)
```

### Detailed Description

Reduces the input with reducer. The input shape is [NY, NX]. When Mode = kLocalMode, in is reduced along the NX direction to complete the reduction within the thread, and out has shape [NY, 1]. When Mode = kGlobalMode, shared memory is used to complete the reduction across the threads in the block, and in and out have the same size, [NY, NX].

[Figure: data processing flow of ReduceMax]
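For kLocalMode with a max reducer, the in-thread reduction can be sketched on the host as shown here. The sketch assumes that out is initialized with the reducer's identity value before the call, which the description does not state. `ReduceLocalRef`, `MaxFunctor`, and `DemoReduceMax` are hypothetical names, and kGlobalMode is not modeled because it depends on shared memory and block-level synchronization.

```cpp
#include <array>
#include <cassert>
#include <limits>

// Hypothetical host-side model of kLocalMode (assumption, not library code).
// in has shape [NY, NX], assumed row-major; out holds one value per row.
template <typename T, int NX, int NY, class ReduceFunctor>
void ReduceLocalRef(T* out, const T* in, ReduceFunctor reducer) {
  assert(out != nullptr && in != nullptr);
  for (int i = 0; i < NY; ++i) {
    for (int j = 0; j < NX; ++j) {
      // Fold each element of row i into the running result out[i].
      out[i] = reducer(out[i], in[i * NX + j]);
    }
  }
}

// Hypothetical functor: returns the larger of its two arguments.
template <typename T>
struct MaxFunctor {
  T operator()(const T& a, const T& b) const { return a > b ? a : b; }
};

// Reduces a [2, 3] tile along NX, producing one maximum per row.
std::array<int, 2> DemoReduceMax() {
  constexpr int NX = 3, NY = 2;
  const int in[NX * NY] = {3, 9, 1, 7, 2, 8};
  std::array<int, NY> out;
  out.fill(std::numeric_limits<int>::lowest());  // assumed identity for max
  ReduceLocalRef<int, NX, NY>(out.data(), in, MaxFunctor<int>());
  return out;  // {9, 8}
}
```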

### Template Parameters

T: The type of the input data.

NX: The number of columns each thread computes.

NY: The number of rows each thread computes.

BlockSize: A device attribute that identifies the thread indexing method of the current device. On GPU, threadIdx.x is used as the thread index, and this parameter is not currently supported.

ReduceFunctor: The reduction function. See the OpFunc section for its definition.

Mode: The reduce mode, either kGlobalMode or kLocalMode.

### Parameters

out: Pointer to the output register; its size is NX * NY.

in: Pointer to the input register; its size is NX * NY.

reducer: The reduction object, which can be declared as ReduceFunctor<T>().

reduce_last_dim: Indicates whether the last dimension of the original input is reduced.