3. mcThrust Parallel Library

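mcThrust is a parallel algorithms library modeled on the C++ Standard Template Library (STL). Its high-level interface is intended to let you write high-performance parallel applications while keeping performance portable between GPUs and multicore CPUs. Algorithm backends are available for several environments, including C++, MXMACA, OpenMP, and TBB. mcThrust provides a rich set of data-parallel primitives such as copy, reduce, and transform. These primitives can be composed in concise, readable source code to build complex algorithms, which makes mcThrust well suited to rapid prototyping of MXMACA applications.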

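This chapter uses examples to show how to develop or port applications with mcThrust. For details of the mcThrust API, see the MetaX General-Purpose GPU mcThrust API Reference (《沐曦通用GPU mcThrust API参考》).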

3.1. Parallel Algorithms

3.1.1. thrust::transform

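thrust::transform applies an operation to each element of an input range and stores the result in the corresponding element of an output range. The following code snippet negates every element of an array:

int data[10] = {-5, 0, 2, -3, 2, 4, 0, -1, 2, 8};
for (int i = 0; i < 10; ++i)
    data[i] = -data[i];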

The same operation can be parallelized with thrust::transform:

int data[10] = {-5, 0, 2, -3, 2, 4, 0, -1, 2, 8};
thrust::negate<int> op;
thrust::transform(data, data + 10, data, op); // in-place transformation
// data is now {5, 0, -2, 3, -2, -4, 0, 1, -2, -8}
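
Because the interface mirrors the STL, the serial std::transform call has the same shape as the parallel one. The sketch below uses only the C++ standard library and may serve as a host-side reference when checking the result of a ported transform; negate_in_place is a hypothetical helper name, not an mcThrust function.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Host-side reference: negate every element in place using the STL.
// negate_in_place is a hypothetical helper used only for illustration.
void negate_in_place(std::vector<int>& data) {
    // Input and output ranges are the same, so the transform is in place.
    std::transform(data.begin(), data.end(), data.begin(), std::negate<int>());
}
```

Comparing the output of this function with the output of the parallel call on the same input is a simple way to validate a port.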

3.1.2. thrust::reduce

thrust::reduce performs a reduction. The following code snippet finds the maximum value in an array:

int data[6] = {1, 0, 2, 2, 1, 3};
int result = -1;
for (int i = 0; i < 6; ++i)
    if (result < data[i]) result = data[i];

The reduction can be parallelized with thrust::reduce:

int data[6] = {1, 0, 2, 2, 1, 3};
int result = thrust::reduce(data, data + 6, -1, thrust::maximum<int>());
// result == 3
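
One detail of the example above is easy to miss: the initial value (-1) takes part in the reduction like any other operand, so it must not exceed the true maximum. The sketch below models this on the host with std::accumulate as a serial stand-in for the reduction. Both max_with_init and max_safe are hypothetical helpers introduced for illustration and are not part of mcThrust.

```cpp
#include <algorithm>
#include <limits>
#include <numeric>
#include <vector>

// Serial stand-in for a max-reduction with an explicit initial value.
// Hypothetical helper for illustration; not part of mcThrust.
int max_with_init(const std::vector<int>& data, int init) {
    return std::accumulate(data.begin(), data.end(), init,
                           [](int a, int b) { return std::max(a, b); });
}

// Same reduction, seeded with the lowest representable int so that the
// initial value can never mask the true maximum.
int max_safe(const std::vector<int>& data) {
    return max_with_init(data, std::numeric_limits<int>::lowest());
}
```

With an input such as {-7, -4, -9}, seeding the reduction with -1 returns -1 rather than -4, whereas the second helper returns -4.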

3.1.3. thrust::copy

thrust::copy copies data. The following code snippet copies the contents of one array into another:

for (int i = 0; i < vec0.size(); ++i)
    vec1[i] = vec0[i];
// vec1 is now a copy of vec0

The copy can be parallelized with thrust::copy:

thrust::copy(vec0.begin(), vec0.end(), vec1.begin());
// vec1 is now a copy of vec0

3.2. Iterators

Iterators are typically combined with the parallel algorithm APIs to express more complex computations.

3.2.1. thrust::constant_iterator

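thrust::constant_iterator is an iterator over a constant sequence: every element it yields is the same value, specified in advance. The following code snippet adds 10 to every element of an array:

int data[4] = {3, 7, 2, 5};
for (int i = 0; i < 4; ++i)
    data[i] = data[i] + 10;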

The same result can be obtained by combining thrust::constant_iterator with thrust::transform:

int data[4] = {3, 7, 2, 5};
// add 10 to all values in data
thrust::transform(data, data + 4,
                  thrust::make_constant_iterator(10),
                  data,
                  thrust::plus<int>());
// data is now [13, 17, 12, 15]
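
The point of a constant iterator is that the second operand never has to be stored as an array. The following plain C++ sketch is a conceptual model of that idea only; ConstantView and add_constant are hypothetical names, and nothing here reflects how mcThrust implements the iterator.

```cpp
#include <cstddef>
#include <vector>

// Conceptual model only: a "sequence" that yields the same value at every
// position without allocating storage. Not mcThrust's implementation.
template <typename T>
struct ConstantView {
    T value;
    T operator[](std::size_t) const { return value; }  // the index is ignored
};

// Adds c to each element; the constant operand occupies no array.
void add_constant(std::vector<int>& data, int c) {
    ConstantView<int> k{c};
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] = data[i] + k[i];
}
```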

3.2.2. thrust::counting_iterator

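thrust::counting_iterator is an iterator over a sequence of consecutive integers. The following code snippet fills an array with the integer sequence starting at 5: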

for (int i = 0; i < vec.size(); ++i)
    vec[i] = i + 5;

The same result can be obtained by combining thrust::counting_iterator with thrust::copy:

thrust::counting_iterator<int> iter(5);
thrust::copy(iter, iter + vec.size(), vec.begin());
// vec is now [5, 6, 7, 8, 9, …]
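
For comparison, the closest serial counterpart in the C++ standard library is std::iota. The sketch below can be used as a host-side reference for the expected contents of vec; sequence_from is a hypothetical helper name.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side reference: returns {start, start + 1, ..., start + n - 1}.
// sequence_from is a hypothetical helper used only for illustration.
std::vector<int> sequence_from(int start, std::size_t n) {
    std::vector<int> vec(n);
    std::iota(vec.begin(), vec.end(), start);  // fill with consecutive values
    return vec;
}
```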

3.2.3. thrust::permutation_iterator

thrust::permutation_iterator accesses an array in a permuted order. The following code snippet uses the array ind to reorder the array data_in and stores the result in data_out:

for (int i = 0; i < ind.size(); ++i)
    data_out[i] = data_in[ind[i]];

The same result can be obtained by combining thrust::permutation_iterator with thrust::copy:

thrust::copy(
    thrust::make_permutation_iterator(data_in.begin(), ind.begin()),
    thrust::make_permutation_iterator(data_in.begin(), ind.end()),
    data_out.begin());
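
The operation above is a gather: the output has as many elements as ind, indices may repeat, and every index must be a valid position in data_in. The serial sketch below makes those rules explicit and checks the indices, which can help when debugging an index array before running the parallel version. The helper name gather and the use of std::size_t indices are assumptions made for illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Serial gather: out[i] = in[ind[i]].
// Throws std::out_of_range if any index is not a valid position in `in`.
// Hypothetical helper for illustration; not part of mcThrust.
std::vector<int> gather(const std::vector<int>& in,
                        const std::vector<std::size_t>& ind) {
    std::vector<int> out;
    out.reserve(ind.size());
    for (std::size_t k : ind)
        out.push_back(in.at(k));  // at() performs the bounds check
    return out;
}
```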

3.2.4. thrust::transform_iterator

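thrust::transform_iterator applies an operation to each element as the array is traversed. The following code snippet computes the sum of the square roots of the elements of an array:

double result = 0;
for (int i = 0; i < vec.size(); ++i)
    result += sqrt(vec[i]);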

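The same result can be obtained by combining thrust::transform_iterator with thrust::reduce. SquareRoot is a user-defined function object:

struct SquareRoot { // user-defined functor; assumes vec holds double values
    __host__ __device__ double operator()(double x) const { return sqrt(x); }
};
auto iter = thrust::make_transform_iterator(vec.begin(), SquareRoot());
double result = thrust::reduce(
    iter, iter + vec.size(), 0., thrust::plus<double>());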
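
The benefit of transforming during iteration is that the square roots are never written to a temporary array; each one is computed and consumed immediately. The serial sketch below shows the same fusion with std::accumulate. Note that in std::accumulate the accumulator takes the type of the initial value, which is why a floating-point literal is passed; reading the 0. in the example above the same way is an assumption based on the STL-like signature. sum_of_square_roots is a hypothetical helper name.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Fused transform + reduce on the host: no intermediate array of roots.
// Hypothetical helper for illustration; not part of mcThrust.
double sum_of_square_roots(const std::vector<double>& vec) {
    // 0.0 (not 0) keeps the accumulator in double precision.
    return std::accumulate(vec.begin(), vec.end(), 0.0,
                           [](double acc, double x) { return acc + std::sqrt(x); });
}
```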

3.2.5. thrust::zip_iterator and thrust::tuple

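Most Thrust parallel algorithms accept at most two input arrays. When more arrays take part in a computation, thrust::zip_iterator can be used together with thrust::tuple to bundle several arrays into a single virtual array. The computation in the following code snippet involves three input arrays and two constants: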

for (int i = 0; i < data_out.size(); ++i)
    data_out[i] = data_in1[i]
                + beta * (data_in2[i] - omega * data_in3[i]);

The same result can be obtained by combining thrust::zip_iterator, thrust::tuple, and thrust::transform:

struct Functor
{
    const double beta;
    const double omega;

    Functor(double _beta, double _omega) : beta(_beta), omega(_omega) {}

    // binary operator: t packs (data_in1[i], data_in2[i]); data3 is data_in3[i]
    __host__ __device__ double operator()(const thrust::tuple<double, double> &t, const double &data3) const
    {
        return thrust::get<0>(t) + beta * (thrust::get<1>(t) - omega * data3);
    }
};

...

thrust::transform(
    thrust::make_zip_iterator(
        thrust::make_tuple(data_in1.begin(), data_in2.begin())
    ),
    thrust::make_zip_iterator(
        thrust::make_tuple(data_in1.end(), data_in2.end())
    ),
    data_in3.begin(),
    data_out.begin(),
    Functor(beta, omega)
);

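Here the function object Functor implements the computation and uses thrust::get to read the elements of the tuple. thrust::zip_iterator has a further advantage: the data stays organized as a structure of arrays while the algorithm sees a single sequence of tuples, which can improve the efficiency of data access.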
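
To make the layout terminology concrete, the plain C++ sketch below contrasts an array of structures (AoS) with a structure of arrays (SoA) and provides a serial reference for the formula used in this section. The type and function names (Sample, Samples, to_soa, combine) are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Array of structures (AoS): the three inputs of one element are adjacent.
struct Sample { double in1, in2, in3; };

// Structure of arrays (SoA): each input is its own contiguous array,
// which is the layout the zip-based example operates on.
struct Samples {
    std::vector<double> in1, in2, in3;
};

// Converts AoS data to the SoA layout.
Samples to_soa(const std::vector<Sample>& aos) {
    Samples s;
    for (const Sample& e : aos) {
        s.in1.push_back(e.in1);
        s.in2.push_back(e.in2);
        s.in3.push_back(e.in3);
    }
    return s;
}

// Serial reference for out[i] = in1[i] + beta * (in2[i] - omega * in3[i]).
std::vector<double> combine(const Samples& s, double beta, double omega) {
    std::vector<double> out(s.in1.size());
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = s.in1[i] + beta * (s.in2[i] - omega * s.in3[i]);
    return out;
}
```

In the SoA layout, neighbouring elements of the same input sit next to each other in memory, which is generally the access pattern that parallel hardware handles well.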