#	User	Rating
1	tourist	3880
2	jiangly	3669
3	ecnerwala	3654
4	Benq	3627
5	orzdevinwang	3612
6	Geothermal	3569
6	cnnfls_csy	3569
8	jqdai0815	3532
9	Radewoosh	3522
10	gyh20	3447

#	User	Contrib.
1	awoo	161
1	maomao90	161
3	adamant	156
4	maroonrk	153
5	-is-this-fft-	148
5	atcoder_official	148
5	SecondThread	148
8	Petr	147
9	nor	144
10	TheScrasse	142

Hello, codeforces!

Do you have modern GPU and CPU and want to calculate something with them, but you don't know how to do it? Dont be upset, because you can read this blog and run your first GPU program on your AMD/Intel/NVIDIA GPU. I will describe an installation process and then we will dive into the C++ code. As an example, we will calculate a lot of mazes for Bug Game in parallel. For doing this in simple way, we will use library Boost.Compute, which is a C++ STL-style wrapper over OpenCL library. ...

Installation for MacOS (Apple M1 Pro)

As win said, OpenCL and g++ are pre-installed on MacOS. Everything that you have to do is install Boost with command brew install boost. After this you will be able to compile your program with command g++ -std=c++17 -O3 -I/opt/homebrew/Cellar/boost/1.81.0_1/include -framework OpenCL -o main main.cpp

Installation for Windows

We need to install GCC, Boost and OpenCL. Let's do it with cygwin. Alternative: do it with msys2. Just download the cygwin installer by above link and run installation process. Select www.cygwin.mirror.constant.com as a mirror!

Cygwin installation

Select GCC

On the step of choosing packages, select View: Full, type Search: gcc and select next packages:

colorgcc
gcc-core
gcc-g++
libgcc1

Screenshot for choosing of GCC

Select Boost

Then type Search: boost and select all of the packages.

List of selected packages

Screenshots

Select OpenCL

Then type Search: OpenCL and select all of the packages: libOpenCL-devel and libOpenCL1.

Screenshot

Finish installation

After selecting all of the packages, click Next twice and wait while the installation process is not finished.

Hello, world!

We are ready to write our first example. Lets print list of available devices. Here is the C++ code:

#include <iostream>
#include <iomanip>
#include <boost/compute/core.hpp>

namespace compute = boost::compute;

std::ostream &operator<<(std::ostream &os, const compute::device & device)
{
    return os << device.name() << " (platform: " << device.platform().name() << ")";
}

int main()
{
    // cycle over all of the devices:
    std::cout << "List of available devices:\n";
    for (int i = 0; auto device : compute::system::devices())
        std::cout << std::setw(4) << i++ << ":\t" << device << std::endl;
    
    // get the default device
    compute::device device = compute::system::default_device();
    std::cout << "\nDefault device:\n\t" << device << std::endl;    

    return 0;
}

This code will print the list of available devices for Boost.Compute, and will print default device. I saved this code in directory C:\Users\dkozyrev\Desktop\gpu-examples as file printDevices.cpp.

Open cygwin launcher (not installer!) and complete these steps for compilation and running:

cd "C:\Users\dkozyrev\Desktop\gpu-examples" — going into our directory;
g++ -Ofast -std=c++17 printDevices.cpp -o printDevices -lOpenCL — compilation and linking;
./printDevices — running executable.

Please, pay attention, that on MacOS the compilation command is different: g++ -std=c++17 -O3 -I/opt/homebrew/Cellar/boost/1.81.0_1/include -framework OpenCL -o main main.cpp

You will see the result and will not see your GPU in the list of available devices. Now we are going to install GPU drivers.

Screenshot

Install GPU Drivers

Intel GPU

Google intel opencl sdk and you will see this website. Download and install Intel® SDK for OpenCL™ Applications.

Screenshot

NVIDIA GPU

Google nvidia cuda toolkit. You will see this website. Just download and install!

AMD GPU

Probably, you need to download and install AMD Accelerated Parallel Processing (APP) SDK 3.0. I don't know exactly. Please, see the answer on stackoverflow here.

Copying GPU Drivers to cygwin

If you are using msys2, this step can be avoided (I did some tests on msys2), but when you are using cygwin, it is necessary.

When installation will be finished, you will see the file C:\Windows\System32\OpenCL.dll. Seems that the path is same for any GPU (Intel, AMD, NVIDIA). Open cygwin launcher (not installer!) and complete next:

cp "/cygdrive/c/Windows/System32/OpenCl.dll" "/cygdrive/c/cygwin64/bin/"
cp "/cygdrive/c/Windows/System32/OpenCl.dll" "/cygdrive/c/cygwin64/bin/cygOpenCL-1.dll"
echo "cygOpenCL-1.dll" > "/etc/OpenCL/vendors/gpu.icd"
./printDevices — rerun first example (printing all of the available devices)

You will see, that GPU is available now! I see my Intel(R) Iris(R) Xe Graphics in the list of all devices.

Screenshot

OK, we are ready to calculate something!

Boost.Compute calculation scheme

GPU can process a lot of data simultaneously (in parallel). This is something like processing $$$100.000$$$ independent testcases. There is no profit to run GPU for only one testcase. So, we will use next calculation scheme:

Create a vector of independent testcases on the host (CPU)
Copy this vector from the host (from CPU) to a device (to GPU)
Process all of the testcases in parallel
Copy a vector of results from device (from GPU) to the host (to CPU)
Process a vector of results on CPU

Lets start to solve Bug Game on GPU and we will see how it is implemented!

Solving Bug Game on GPU

Please, read problem statement firstly:

Now you know how the Bug moves and what is the maze. Suppose that you are building maze in your code. Some genetic or randomized algorithms, or brute force need to calculate the answer for a whole population (whole set of mazes) independently. This step is costly when population is heavy.

We will create a vector of valid mazes (where the exit is reachable), send this vector to GPU, wait for finish of calculations, copy results and process them.

In this blog we are interested in looking on the performance, so, we will read a maze with 1.488.650 steps from the input file, copy this maze 1.000.000 times and calculate this million of mazes on GPU.

CPU part of program

Reading of the maze from the input file:

    std::ifstream fin(inputFileName);
    if (!fin)
    {
        std::cout << "Can't open the file '" << inputFileName << "'" << std::endl;
        return 0;
    }
    std::vector<std::vector<bool>> f(21, std::vector<bool>(31, 1));
    for (int i = 0; i < 21; i++)
        for (int j = 0; j < 31; j++)
        {
            char ch; fin >> ch;
            f[i][j] = (ch == '#');
        }

Then we need to create a struct for keeping number of visits for each cell. When the Bug's moving finished, the answer is the sum of number of visits for each cell, so we need just to keep the array with number of visits.

Let's create the struct Field:

struct Field
{
    unsigned nVis[21 * 31];
};

We are using 1D array instead of 2D array and data type unsigned (unsigned integer 32-bit data type). Copying from vector of vectors to our new structure:

#define pos2int(r, c) ((r) * 31 + (c))
#define WALL (UINT_MAX)
#define EMPTY (0)
Field setField(const std::vector<std::vector<bool>> & f)
{
    Field a;
    for (int r = 0; r < 21; r++)
        for (int c = 0; c < 31; c++)
            a.nVis[pos2int(r,c)] = (f[r][c] ? WALL : EMPTY);
    return a;
}

Also we need a function for accumulating the answer:

long long accumulateTheAnswer(const Field &f)
{
    long long res = 0;
    for (int i = 0; i < 21 * 31; i++)
        if (f.nVis[i] != WALL)
            res += f.nVis[i];
    return res;
}

Now we need to write a function which will implement the Bug's moving on CPU:

__attribute__((noinline))
void bugMovingCPU(Field &a)
{
    int pos = 1 * 31 + 1, dir = +31;
    while (pos != 618)
    {
        ++a.nVis[pos];
        unsigned val = a.nVis[pos + dir];
        { unsigned next = a.nVis[pos + 31]; if (next < val) dir = +31, val = next; }
        { unsigned next = a.nVis[pos +  1]; if (next < val) dir =  +1, val = next; }
        { unsigned next = a.nVis[pos - 31]; if (next < val) dir = -31, val = next; }
        { unsigned next = a.nVis[pos -  1]; if (next < val) dir =  -1, val = next; }
        pos += dir;
    }
}

This algorithm starts moving from position $$$\left(1,1\right)=1\times 31 + 1$$$ and while the exit is not reached (pos is not equal to $$$\left(19,29\right)=19 \times 31 + 29 = 618$$$), we are calculating minimal value among adjacent cells in order DOWN-RIGHT-UP-LEFT and doing step in minimal direction.

I added attribute noinline: this keyword will prohibit in-lining of this function. The performance test will not be affected by GCC optimizations (otherwise GCC may decide to calculate the answer once and copy the answer n times).

Yeah, we are starting with GPU part now.

GPU part of program

Firstly, GPU (OpenCL library) should be familiar with our struct Field. It can be done with next macro:

BOOST_COMPUTE_ADAPT_STRUCT(Field, Field, (nVis))

You need to write the name of structure twice and then you need to list out all members of structure. In this example we have only one member nVis.

We need to include a new header file #include <boost/compute/types/struct.hpp> for working with structures.

Secondly, you need to write the function which will calculate the maze on GPU:

BOOST_COMPUTE_FUNCTION(Field, bugMovingGPU, (Field a),
{
    int pos = 1 * 31 + 1, dir = +31;
    while (pos != 618)
    {
        ++a.nVis[pos];
        unsigned val = a.nVis[pos + dir];
        { unsigned next = a.nVis[pos + 31]; if (next < val) dir = +31, val = next; }
        { unsigned next = a.nVis[pos +  1]; if (next < val) dir =  +1, val = next; }
        { unsigned next = a.nVis[pos - 31]; if (next < val) dir = -31, val = next; }
        { unsigned next = a.nVis[pos -  1]; if (next < val) dir =  -1, val = next; }
        pos += dir;
    }
    return a;
});

We will use the same code, as we used above. Parameters of macro BOOST_COMPUTE_FUNCTION:

Field — return value;
bugMovingGPU — name of a new function;
(Field a) — the list of arguments of this function;
The body of this function.

After reading of the field, in the function main we need to detect GPU device, create a context for this device and create a command queue for sending jobs. We will use default device.

    compute::device device = compute::system::default_device();
    compute::context context(device);
    compute::command_queue queue(context, device);

An alternative: you can iterate over all of the devices and pick one which you want to use or you can use all of the GPUs (if you have a lot of them), but, I think, that you will need to create different queues and contexts.

Lets create and fill the vector of mazes on CPU (the host's vector):

    std::vector<Field> host_vector(nFields, setField(f));

Create vector of mazes on GPU (the device's vector) and copy from CPU to GPU:

    // create a vector on the device
    compute::vector<Field> device_vector(host_vector.size(), context);

    // transfer data from the host to the device
    compute::copy(host_vector.begin(), host_vector.end(), device_vector.begin(), queue);

We need to use data structures and algorithms from Boost.Compute library. This is why we are creating compute::vector (second argument is device's context) and using compute::copy algorithm (4-th argument is created device's command queue). Don't forget #include <boost/compute/container/vector.hpp>

Boost.Compute will perform copying of the data from CPU to GPU. If you need an asynchronous copying (you may want to do something useful on CPU while copying of data), you can use compute::copy_async like in this example from Boost.Compute manual.

Now we are ready to calculate the answer on GPU for each field. We can use function transform:

    // calculate the answer for each field on GPU:
    compute::transform(
        device_vector.begin(), device_vector.end(), // input range
        device_vector.begin(), // begin of output range
        bugMovingGPU, queue // the lambda function which will be applied to each maze, and device's queue
    );

This function is an equivalent to C++ STL std::transform. See description on cppreference. You can use your lambda function as $$$4$$$-th parameter. This function will be applied to each element from the input range. The result will be written to the output range. Don't forget #include <boost/compute/algorithm/transform.hpp>.

Probably, you can use compute::for_each too, but I was not able to do this, because GPU can't work with references to data (Field &f) or I don't know how to do it...

And then, you can copy the results from device to host:

    compute::copy(device_vector.begin(), device_vector.end(), host_vector.begin(), queue);

Then we will need to measure runtime. I'm using the library std::chrono for it. You can see some details in the source code.

Fixing the bug in Boost.Compute 1.66 header by yourself

When you will try to compile the program in cygwin, you will get a compilation error:

In file included from /usr/include/boost/compute/detail/meta_kernel.hpp:37,
                 from /usr/include/boost/compute/types/struct.hpp:26,
                 from gpu-example.cpp:6:
/usr/include/boost/compute/memory/svm_ptr.hpp: In member function ‘boost::compute::context& boost::compute::svm_ptr<T>::get_context() const’:
/usr/include/boost/compute/memory/svm_ptr.hpp:131:16: error: binding reference of type ‘boost::compute::context&’ to ‘const boost::compute::context’ discards qualifiers
  131 |         return m_context;

This is because an author of Boost.Compute forgot about const in one place.

You need to fix this error by yourself or download a fixed file svm_ptr.hpp and replace it. Please, download mine svm_ptr.hpp file.

Suppose, that you saved this file as C:\Users\dkozyrev\Desktop\gpu-examples\svm_ptr.hpp. Open cygwin launched and complete the command cp "C:\Users\dkozyrev\Desktop\gpu-examples\svm_ptr.hpp" /usr/include/boost/compute/memory/svm_ptr.hpp.

Compilation and running

Take this C++ file and save it as gpu-example.cpp. Also save the 1.488.650 maze as 1.488.650.txt.

Windows compilation

Open the cygwin launcher and complete next:

cd "C:\Users\dkozyrev\Desktop\gpu-examples"
g++ -Ofast -std=c++17 -DBOOST_COMPUTE_DEBUG_KERNEL_COMPILATION gpu-example.cpp -o gpu-example -lOpenCL

The macro BOOST_COMPUTE_DEBUG_KERNEL_COMPILATION is needed to print the compilation log from the GPU device.

MacOS Apple M1 Pro compilation

cd "your directory"
g++ -std=c++17 -O3 -DBOOST_COMPUTE_DEBUG_KERNEL_COMPILATION -I/opt/homebrew/Cellar/boost/1.81.0_1/include -framework OpenCL -o gpu-example gpu-example.cpp

Running

Firstly, you can run $$$1$$$ field calculation with command ./gpu-example 1 "1.488.650.txt". If everything is OK, you can try to run from $$$1024$$$ to $$$10000-1000000$$$ fields.

I will show GPU performance in the bottom of this blog.

Adding CPU multi-threading

While GPU will be very busy, CPU will do nothing. We can fix it by completing some jobs on CPU too. From the main thread we will create few threads. Each thread will calculate own lambda function. This lambda function will do some calculations while GPU is still working.

    // start CPU jobs by creating vector of threads, each of them will do some work independently
    const int nCPUThreads = std::thread::hardware_concurrency();
    std::atomic<bool> gpuFinished{false};
    std::vector<std::thread> threads;
    std::vector<std::deque<Field>> cpuResults(nCPUThreads);
    for (int id = 0; id < nCPUThreads; id++)
        threads.emplace_back([threadId = id, &gpuFinished, &f, &cpuResults]()
        {
            while (!gpuFinished)
            {
                // requesting new field:
                Field currField = setField(f);
                // calculating new field:
                bugMovingCPU(currField);
                // adding this field in a vector of results:
                cpuResults[threadId].push_back(currField);
            }
        });

In above code we use std::thread::hardware_concurrency() for requesting how many threads can work at the same time on CPU. Then we are creating the atomic boolean variable gpuFinished for checking if GPU calculations have been finished or not. Then we are creating the vector of threads and the vector of cpu results. Each thread will have own storage for storing results of its work. This is needed for avoiding the situation when two or more threads use same vector and override same vector's element (avoid Data Races).

We are using emplace_back instead of push_back for vector of threads, because std::thread can't be copied.

In the context scope of the lambda function we are capturing references to shared (between threads) data (gpuFinished, f and cpuResults) and are creating constant int threadId — we will use this as an index in vector.

These threads will work while we will not change value of gpuFinished in the main thread. We will change it when GPU finished (after copying of results from device to host):

    // GPU has been finished here:
    gpuFinished = true;
    
    // waiting while threads are not finished:
    for (auto &t : threads)
        if (t.joinable())
            t.join();

In last $$$3$$$ lines we are joining created threads to main thread. This means, that the main thread will wait when other threads will be destroyed. Otherwise it is possible that the main thread finished earlier than other threads.

CPU multi-threading bonus

Different approaches to parallelize

If you want to process a vector of mazes on CPU inplace, you can use this popular lock-free pattern:

    std::vector<Field> fields(nFields, setField(f));
    std::atomic<int> position{0};
    for (int id = 0; id < nCPUThreads; id++)
        threads.emplace_back([threadId = id, &position, &fields]()
        {
            for (int p; (p = position++) < (int)fields.size(); )
                bugMovingCPU(fields[p]);
        });

We are using atomic variable position again because position++ is the atomic operation. This is the way how we are avoiding data races here. Don't forget about joining!

Some alternatives: split your vector in nCPUThreads segments or use std::mutex with std::lock_guard for locking an access to common resources:

    std::vector<Field> fields(nFields, setField(f));
    int position = 0;
    std::mutex mutex;
    for (int id = 0; id < nCPUThreads; id++)
        threads.emplace_back([threadId = id, &position, &fields, &mutex]()
        {
            while (true)
            {
                int p; // local variable for each thread
                {   // lock the access to all variables in this scope
                    // other threads will wait for unlocking of the mutex
                    std::lock_guard<std::mutex> lock(mutex);
                    p = position++;
                }
                if (p >= (int)fields.size())
                    break;
                bugMovingCPU(fields[p]);
            }
        });

Locking and unlocking quite expensive, because there is a communication between thread and manager of resources in OS.

Also you can see examples on cppreference for mutex, thread, lock_guard and condition_variable.

Running GPU + CPU calculations

Let's run this example. Save the code as gpu-cpu-example.cpp. Also download the maze and save as 1.488.650.txt. Place everything in the same directory.

MacOS Apple M1 Pro

The command of compilation: g++ -std=c++17 -O3 -I/opt/homebrew/Cellar/boost/1.81.0_1/include -framework OpenCL -o gpu-cpu-example gpu-cpu-example.cpp

Command $$$1$$$: ./gpu-cpu-example 128 "1.488.650.txt"

results for 128 copies of field with clang

nFields: 128
input file: '1.488.650.txt'

List of available devices:
   0:   Apple M1 Pro (platform: Apple)

Number of CPU threads: 8

Default device will be used:
        Apple M1 Pro (platform: Apple)

Completed during 0.315786 seconds:
  GPU ops:        190.547.200
  CPU ops:        430.219.850
  Sum ops:        620.767.050

Productivity per 1 second:
GPU Speed:        603.406.023/s
CPU Speed:      1.362.377.663/s
Sum speed:      1.965.783.686/s

Command $$$2$$$: ./gpu-cpu-example 10000 "1.488.650.txt"

Results for 10000 copies of field with clang

nFields: 10000
input file: '1.488.650.txt'

List of available devices:
   0:   Apple M1 Pro (platform: Apple)

Number of CPU threads: 8

Default device will be used:
        Apple M1 Pro (platform: Apple)

Completed during 1.45009 seconds:
  GPU ops:     14.886.500.000
  CPU ops:      1.993.302.350
  Sum ops:     16.879.802.350

Productivity per 1 second:
GPU Speed:     10.265.881.786/s
CPU Speed:      1.374.601.571/s
Sum speed:     11.640.483.357/s

Command $$$3$$$: ./gpu-cpu-example 100000 "1.488.650.txt"

Results for 100000 copies of field with clang

nFields: 100000
input file: '1.488.650.txt'

List of available devices:
   0:   Apple M1 Pro (platform: Apple)

Number of CPU threads: 8

Default device will be used:
        Apple M1 Pro (platform: Apple)

Completed during 8.26215 seconds:
  GPU ops:    148.865.000.000
  CPU ops:     11.411.990.900
  Sum ops:    160.276.990.900

Productivity per 1 second:
GPU Speed:     18.017.715.160/s
CPU Speed:      1.381.238.044/s
Sum speed:     19.398.953.204/s

Command $$$4$$$: ./gpu-cpu-example 1000000 "1.488.650.txt"

Results for 100000 copies of field with clang

nFields: 100000
input file: '1.488.650.txt'

List of available devices:
   0:   Apple M1 Pro (platform: Apple)

Number of CPU threads: 8

Default device will be used:
        Apple M1 Pro (platform: Apple)

Completed during 8.26215 seconds:
  GPU ops:    148.865.000.000
  CPU ops:     11.411.990.900
  Sum ops:    160.276.990.900

Productivity per 1 second:
GPU Speed:     18.017.715.160/s
CPU Speed:      1.381.238.044/s
Sum speed:     19.398.953.204/s

Results for 100000 copies of field with gcc

nFields: 100000
input file: '1.488.650.txt'

List of available devices:
   0:   Apple M1 Pro (platform: Apple)

Number of CPU threads: 8

Default device will be used:
        Apple M1 Pro (platform: Apple)

Completed during 8.12622 seconds:
  GPU ops:    148.865.000.000
  CPU ops:     30.310.402.650
  Sum ops:    179.175.402.650

Productiviry per 1 second:
GPU Speed:     18.319.090.962/s
CPU Speed:      3.729.950.110/s
Sum speed:     22.049.041.072/s

So, we achieved $$$22$$$ billions moves per second on Apple M1 Pro and gcc is faster than clang. Statistics has been provided by win.

Windows + NVIDIA GeForce GTX 1660 Ti

Commands:

cd <your directory>
g++ -Ofast -DBOOST_COMPUTE_DEBUG_KERNEL_COMPILATION gpu-cpu-example.cpp -o gpu-cpu-example -lOpenCL |& tee out.txt
./gpu-cpu-example 1000000 "1.488.650.txt"

The speed is 28.501.609.181/s, provided by sziiza.

Results for 1000000 copies of field

nFields: 1000000
input file: '1.488.650.txt'

List of available devices:
   0:   NVIDIA GeForce GTX 1660 Ti (platform: NVIDIA CUDA)

Number of CPU threads: 12

Default device will be used:
        NVIDIA GeForce GTX 1660 Ti (platform: NVIDIA CUDA)

Completed during 59.0952 seconds:
  GPU ops:  1.488.650.000.000
  CPU ops:    195.659.224.100
  Sum ops:  1.684.309.224.100

Productiviry per 1 second:
GPU Speed:     25.190.695.331/s
CPU Speed:      3.310.913.850/s
Sum speed:     28.501.609.181/s

Comments: probably, you need to disable integrated GPU in Windows before running. Otherwise, NVIDIA will not be chosen as a default device. Or you need to rewrite the code.

Also, sziiza calculated his $$$137$$$ millions maze 1.000.000 times! Got same performance. Such calculations takes $$$5451$$$ seconds. He did 157.345.095.570.802 operations with speed 28.865.081.470/s.

Windows + NVIDIA GeForce RTX 3070 Laptop GPU

Instructions are the same to previous one. mz1 used his own 11.818.688.txt maze. Also he disabled integrated GPU.

Speed is 48.483.623.775/s — the winner.

Results for 1000000 copies of field

nFields: 1000000
input file: '11.818.688.txt'

List of available devices:
   0:   NVIDIA GeForce RTX 3070 Laptop GPU

Number of CPU threads: 16

Default device will be used:
        NVIDIA GeForce RTX 3070 Laptop GPU

Completed during 266.148 seconds:
  GPU ops: 11.818.688.000.000
  CPU ops:  1.085.132.838.720
  Sum ops: 12.903.820.838.720

Productiviry per 1 second:
GPU Speed:     44.406.445.941/s
CPU Speed:      4.077.177.834/s
Sum speed:     48.483.623.775/s

Windows + Intel(R) UHD Graphics 620

Command $$$1$$$: ./gpu-cpu-example 1000 "1.488.650.txt".

Results for 1000 copies of field

nFields: 1000
input file: '1.488.650.txt'

List of available devices:
   0:   Intel(R) UHD Graphics 620 (platform: Intel(R) OpenCL HD Graphics)

Number of CPU threads: 8

Default device will be used:
        Intel(R) UHD Graphics 620 (platform: Intel(R) OpenCL HD Graphics)

Completed during 3.74463 seconds:
  GPU ops:      1.488.650.000
  CPU ops:      3.860.069.450
  Sum ops:      5.348.719.450

Productivity per 1 second:
GPU Speed:        397.542.465/s
CPU Speed:      1.030.827.613/s
Sum Speed:      1.428.370.078/s

Command $$$2$$$: ./gpu-cpu-example 10000 "1.488.650.txt"

Results for 10000 copies of field

   nFields: 10000
input file: '1.488.650.txt'

List of available devices:
   0:   Intel(R) UHD Graphics 620 (platform: Intel(R) OpenCL HD Graphics)

Number of CPU threads: 8

Default device will be used:
        Intel(R) UHD Graphics 620 (platform: Intel(R) OpenCL HD Graphics)

Completed during 81.025 seconds:
  GPU ops:     14.886.500.000
  CPU ops:     72.504.698.250
  Sum ops:     87.391.198.250

Productivity per 1 second:
GPU Speed:        183.727.212/s
CPU Speed:        894.843.388/s
Sum Speed:      1.078.570.600/s

Failed on Windows + Intel(R) Iris(R) Xe Graphics

Intel(R) Iris(R) Xe Graphics skipped some fields and decided do not calculate them. Probably, something wrong with driver/setup, but I can't resolve it. GPU ops have to be
1.488.650.000, but they are 1.143.283.200. Debugging shows that elements [0,255] and [512,767] has been calculated, while [256,511] and [768,999] are skipped. Any ideas how to resolve?

results for 1000 copies of field

$ ./gpu-cpu-example 1000 "1.488.650.txt"
nFields: 1000
input file: '1.488.650.txt'
List of available devices:
   0:   Intel(R) Iris(R) Xe Graphics (platform: Intel(R) OpenCL HD Graphics)

Number of CPU threads: 8

Default device will be used:
        Intel(R) Iris(R) Xe Graphics (platform: Intel(R) OpenCL HD Graphics)

Completed during 6.31165 seconds:
  GPU ops:      1.143.283.200
  CPU ops:     16.477.866.850
  Sum ops:     17.621.150.050

Productivity per 1 second:
GPU Speed:        181.138.523/s
CPU Speed:      2.610.706.147/s
Sum Speed:      2.791.844.670/s

Other examples

You can read full Boost.Compute Tutorial and try all of the examples here. Then you will still have some questions like "How to use long long as return type?" you can see examples in a directory boost/compute/test on the github. You can find the testing of for each Boost.Compute method here, reproduce them and see how long long or your own structures can be registered and used.

P.S. If you are able to test written code on your GPU/CPU, please, write your results in comments.