143 lines
5.1 KiB
Markdown
143 lines
5.1 KiB
Markdown
# About
|
|
|
|
This is an active library, using C++ template metaprogramming, to automate loop parallelisation.
|
|
This work has been done for my Ph.D. thesis.
|
|
|
|
## Brief
|
|
|
|
It uses [expression templates](https://en.wikipedia.org/wiki/Expression_templates) and an [EDSL](https://en.wikipedia.org/wiki/Domain-specific_language)
|
|
to build an [AST](https://en.wikipedia.org/wiki/AST) of loops at compile-time.
|
|
This AST is analysed to determine first which instructions of a given loop are interdependent, then for each instruction group
|
|
if it can be run in parallel with itself at different iterations of the loop.
|
|
The interdependence of instructions relies on how variables are accessed (read/write).
|
|
To identify instruction groups that can be run in parallel, the library uses index functions (in array accesses) and a set of rules
|
|
depending on these functions.
|
|
According to this, the library generates sequential and/or parallel loops.
|
|
|
|
## Example
|
|
|
|
Given this source code:
|
|
```cpp
|
|
// a, b, c, d, e and f are arrays
|
|
for(int i = 0; i < n; ++i) {
|
|
a[i] = a[i] * b[i];
|
|
c[i] = c[i+1] - d[i];
|
|
b[i] = b[i] + i;
|
|
d[i] = std::pow(c[i], e[i]);
|
|
f[i*i] = 2 * f[i*i];
|
|
}
|
|
```
|
|
|
|
By writing it like this:
|
|
```cpp
|
|
// a, b, c, d, e and f are expression template operands
|
|
// pow is an automatically generated expression template function
|
|
Index i;
|
|
parallelFor(Range{0, n},
|
|
a[i] = a[i] * b[i],
|
|
c[i] = c[i+ctv<1>] + d[i],
|
|
b[i] = b[i] + i,
|
|
d[i] = pow(c[i], e[i]),
|
|
f[injective(i*i)] = 2 * f[injective(i*i)]
|
|
);
|
|
```
|
|
|
|
The library will automatically produce these two loops:
|
|
```bash
|
|
for(int i = 0; i < n; ++i) {
|
|
c[i] = c[i+1] + d[i];
|
|
d[i] = std::pow(c[i], e[i]);
|
|
}
|
|
|
|
# pragma omp parallel for
|
|
for(int i = 0; i < n; ++i) {
|
|
a[i] = a[i] * b[i];
|
|
b[i] = b[i] + i;
|
|
f[i*i] = 2 * f[i*i];
|
|
}
|
|
```
|
|
|
|
## Performances
|
|
|
|
### Compilation time
|
|
|
|
This library runs mostly during the compilation and therefore extends its duration.
|
|
We ran tests for four sets of source codes:
|
|
|
|
<ul>
|
|
<li>"dep/fixed" corresponding to dependent instructions with common index functions;
|
|
<img src="https://phd.pereda.fr/assets/pfor/ct_legend.png" width="150" align="right">
|
|
</li>
|
|
<li>"indep/fixed" corresponding to independent instructions with common index functions;</li>
|
|
<li>"dep/rand" corresponding to dependent instructions with random index functions;</li>
|
|
<li>"indep/rand" corresponding to independent instructions with random index functions.</li>
|
|
</ul>
|
|
|
|
For more specific information, see the `ctbenchmarks` directory.
|
|
|
|
The compilation time is not significantly affected by the number of loops as shown below:
|
|
|
|
<div align="center"><img src="https://phd.pereda.fr/assets/pfor/ct_var_loops.png" width="500"></div>
|
|
|
|
As expected, however, the compilation time increases (exponentially) with the number of instructions inside an unique loop:
|
|
|
|
<div align="center"><img src="https://phd.pereda.fr/assets/pfor/ct_var_instructions.png" width="500"></div>
|
|
|
|
This could be improved but was not a priority as usually a single loop does not run so many instructions.
|
|
|
|
### Execution time
|
|
|
|
The point of the library is to provide a solution to automatically run in parallel whenever possible,
|
|
not to do better than what one can expect from classic parallelisation tools like [OpenMP](https://www.openmp.org/).
|
|
In contrast, when a loop cannot be run in parallel, the library should not cause runtime overhead.
|
|
|
|
We ran tests with and without using the library:
|
|
|
|
<ul>
|
|
<li>"seq" corresponding to a sequential execution without the library;
|
|
<img src="https://phd.pereda.fr/assets/pfor/rt_legend.png" width="150" align="right">
|
|
</li>
|
|
<li>"omp" corresponding to a parallel execution without the library, using OpenMP, when applicable;</li>
|
|
<li>"gen_omp" corresponding to a parallel execution with the library generating OpenMP code;</li>
|
|
<li>"gen_thread" corresponding to a parallel execution with the library generating multi-threaded loop.</li>
|
|
</ul>
|
|
|
|
For both "gen_omp" and "gen_thread", the library will actually generate a sequential code if the loop cannot be run in parallel.
|
|
|
|
The two figures below shows that using the library does not induce significant runtime overhead, both for sequential and
|
|
parallel (with 18 cores) executions:
|
|
|
|
<div align="center"><img src="https://phd.pereda.fr/assets/pfor/rt_seq.png" width="500"></div>
|
|
<div align="center"><img src="https://phd.pereda.fr/assets/pfor/rt_par_18.png" width="500"></div>
|
|
|
|
Obtained speed up is equivalent using the library with respect to a handwritten OpenMP program:
|
|
|
|
<div align="center"><img src="https://phd.pereda.fr/assets/pfor/rt_speedup.png" width="500"></div>
|
|
|
|
## Related publications
|
|
|
|
- "Static Loop Parallelization Decision Using Template Metaprogramming", HPCS 2018 (IEEE) [10.1109/HPCS.2018.00159](https://doi.org/10.1109/HPCS.2018.00159)
|
|
|
|
## Organisation
|
|
|
|
Main directories:
|
|
- `src/pfor`: the library sources;
|
|
- `examples`: some examples using the library;
|
|
- `[cr]tbenchmarks`: scripts for compile-time/run-time benchmarking;
|
|
- `results`: results presented in the thesis, obtained using mentioned scripts and codes.
|
|
|
|
## Usage
|
|
|
|
To produce the `Makefile` and build the project:
|
|
```bash
|
|
mkdir build
|
|
cd build
|
|
cmake -DCMAKE_BUILD_TYPE=Release ..
|
|
make
|
|
```
|
|
|
|
To run examples:
|
|
```bash
|
|
./build/examples/${example_name}
|
|
```
|