2 changes: 1 addition & 1 deletion Checks/PWD005/README.md
@@ -11,7 +11,7 @@ Update the copied array range to match the actual array usage in the code.

### Relevance

Minimising data transfers is one of the main optimization points when offloading
Minimizing data transfers is one of the main optimization points when offloading
computations to the GPU. An opportunity for such optimization occurs whenever
only part of an array is required in a computation. In such cases, only a part
of the array may be transferred to or from the GPU. However, the developer must
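
A minimal sketch of the partial transfer PWD005 describes, assuming OpenMP offloading and a kernel that touches only the first half of a hypothetical array `a`:

```c
// Sketch: only the used range a[0:n/2] is mapped to the device, halving
// the transfer relative to mapping the whole array.
void half_update(double *a, int n) {
  #pragma omp target teams distribute parallel for map(tofrom: a[0:n/2])
  for (int i = 0; i < n / 2; i++) {
    a[i] += 1.0;
  }
}
```
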
14 changes: 7 additions & 7 deletions Checks/PWD006/README.md
@@ -3,7 +3,7 @@
### Issue

The copy of a non-scalar variable to an accelerator device has been requested
but none or only a part of its data will be transferred because it is laid out
but none or only a part of its data will be transferred because it is laid out
non-contiguously in memory.

### Actions
@@ -13,9 +13,9 @@ memory segments are copied to the memory of the accelerator device.

### Relevance

The data of non-scalar variables might be spread across memory, laid out in non-
contiguous regions. One classical example is a dynamically-allocated two-
dimensional array in C/C++, which consists of a contiguous array of pointers
The data of non-scalar variables might be spread across memory, laid out in
non-contiguous regions. One classical example is a dynamically-allocated
two-dimensional array in C/C++, which consists of a contiguous array of pointers
pointing to separate contiguous arrays that contain the actual data. Note that
the elements of each individual array are contiguous in memory but the different
arrays are scattered in the memory. This also holds for dynamically-allocated
@@ -25,7 +25,7 @@ In order to offload such non-scalar variables to an accelerator device using
OpenMP or OpenACC, it is not enough to add it to a data movement clause. This is
known as deep copy and currently is not automatically supported by either OpenMP
or OpenACC. To overcome this limitation, all the non-contiguous memory segments
must be explicitly transferred by the programmer. In OpenMP 4.5, this can be
must be explicitly transferred by the programmer. In OpenMP 4.5, this can be
achieved through the *enter/exit data* execution statements. Alternatively, the
code could be refactored so that it uses variables with contiguous data layouts
(eg. flatten an array of arrays).
@@ -85,12 +85,12 @@ void foo(int **A) {
}
```

The *enter/exit data* statements ressemble how the dynamic bi-dimensional memory
The *enter/exit data* statements resemble how the dynamic bi-dimensional memory
is allocated in the CPU. An array of pointers is allocated first, followed by
the allocation of all the separate arrays that contain the actual data. Each
allocation constitutes a contiguous memory segment and must be transferred
individually using *enter data*. The deallocation takes place in the inverted
order and the same happens with the *exit *data statements.
order and the same happens with the *exit* data statements.
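
A sketch of that inverted order, assuming a square matrix mapped row by row with *enter data* (OpenMP 4.5 syntax; names and shape are hypothetical):

```c
#include <stdlib.h>

// Sketch: the rows leave the device first, then the pointer array, mirroring
// the host-side deallocation that follows.
void free_matrix(int **A, int n) {
  for (int i = 0; i < n; i++) {
    #pragma omp target exit data map(delete: A[i][0:n])
  }
  #pragma omp target exit data map(delete: A[0:n])
  for (int i = 0; i < n; i++) {
    free(A[i]);
  }
  free(A);
}
```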

### Related resources

2 changes: 1 addition & 1 deletion Checks/PWD007/README.md
@@ -12,7 +12,7 @@ Protect the recurrence or execute the code sequentially if that is not possible.
### Relevance

The recurrence computation pattern occurs when the same memory position is read
and written to, at least once, in different iterations of a loop. It englobes
and written to, at least once, in different iterations of a loop. It englobes
both true dependencies (read-after-write) and anti-dependencies (write-after-
read) across loop iterations. Sometimes the term "loop-carried dependencies" is
also used. If a loop with a recurrence computation pattern is parallelized
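
A minimal sketch of the recurrence pattern PWD007 targets (hypothetical arrays):

```c
// Sketch: iteration i reads a[i - 1], which iteration i - 1 wrote, so a
// read-after-write dependency is carried across iterations and the loop
// cannot be parallelized as written.
void prefix_sum(double *a, const double *b, int n) {
  for (int i = 1; i < n; i++) {
    a[i] = a[i - 1] + b[i];
  }
}
```
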
2 changes: 1 addition & 1 deletion Checks/PWD009/README.md
@@ -12,7 +12,7 @@ Change the data scope of the variable from private to shared.

Specifying an invalid scope for a variable may introduce race conditions and
produce incorrect results. For instance, when a variable must be shared among
threads but it is privatized instead.
threads, but it is privatized instead.

### Code example

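
A sketch of the PWD009 scenario, assuming a counter that every thread updates and that is read after the loop:

```c
// Sketch: "count" must be shared, or each thread's updates would be lost in
// a discarded private copy; the atomic protects the concurrent increments.
int count_positives(const double *a, int n) {
  int count = 0;
  #pragma omp parallel for shared(count)
  for (int i = 0; i < n; i++) {
    if (a[i] > 0.0) {
      #pragma omp atomic
      count++;
    }
  }
  return count;
}
```
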
6 changes: 3 additions & 3 deletions Checks/PWR002/README.md
@@ -6,7 +6,7 @@ A scalar variable should be declared in the smallest
[scope](../../Glossary/Variable-scope.md) possible. In computer programming, the term
scope of a variable usually refers to the part of the code where the variable
can be used (e.g. a function, a loop). During the execution of a program, a
variable cannot be accessed from outside of its scope. This effectively limits
variable cannot be accessed from outside its scope. This effectively limits
the visibility of the variable, which prevents its value from being read or
written in other parts of the code.

@@ -40,7 +40,7 @@ incompatible purposes, making code testing significantly easier.

In the following code, the function `example` declares a variable `t` used in
each iteration of the loop to hold a value that is then assigned to the array
`result`. The variable `t` is not used outside of the loop.
`result`. The variable `t` is not used outside the loop.

```c
void example() {
@@ -96,7 +96,7 @@ code within larger programs by grouping sections together. Conveniently,

In the following code, the subroutine `example` declares a variable `t` used in
each iteration of the loop to hold a value that is then assigned to the array
`result`. The variable `t` is not used outside of the loop.
`result`. The variable `t` is not used outside the loop.

```fortran
subroutine example()
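
A sketch of the narrowed scope PWR002 recommends (hypothetical computation):

```c
// Sketch: declaring t inside the loop body confines it to one iteration, so
// no stale value can leak between iterations or escape the loop.
void example(double *result, int n) {
  for (int i = 0; i < n; i++) {
    double t = 0.5 * i;
    result[i] = t * t;
  }
}
```
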
10 changes: 5 additions & 5 deletions Checks/PWR003/README.md
@@ -77,19 +77,19 @@ int example_impure(int a) {
* `const` function:
* Depends only on `a` and `b`. If successive calls are made with the same `a`
and `b` values, the output will not change.
* Returns a value without modifying any data outside of the function.
* Returns a value without modifying any data outside the function.

* `pure` function:
* Depends on `c`, a global variable whose value can be modified between
successive calls to the function by other parts of the program. Even if
successive calls are made with the same `a` value, the output can differ
depending on the state of `c`.
* Returns a value without modifying any data outside of the function.
* Returns a value without modifying any data outside the function.

* "Normal" function:
* Depends on `c`, a global variable. This restricts the function to be
`pure`, at most.
* However, the function also modifies `c`, memory outside of its scope, thus
* However, the function also modifies `c`, memory outside its scope, thus
leading to a "normal" function.

In the case of the `pure` and "normal" functions, it is equivalent that they
@@ -129,12 +129,12 @@ end module example_module
successive calls to the function by other parts of the program. Even if
successive calls are made with the same `a` value, the output can be
different depending on the state of `c`.
* Returns a value without modifying any data outside of the function.
* Returns a value without modifying any data outside the function.

* "Normal" function:
* Depends on `c`, a public variable. This restricts the function to be
`pure`, at most.
* However, the function also modifies `c`, memory outside of its scope, thus
* However, the function also modifies `c`, memory outside its scope, thus
leading to a "normal" function.

>[!WARNING]
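
A compact sketch of the three kinds of function PWR003 contrasts, using GCC/Clang attribute syntax and hypothetical bodies:

```c
int c = 0; // global state referenced below

// const: depends only on its arguments and touches no outside data.
__attribute__((const)) int add(int a, int b) { return a + b; }

// pure: reads the global c but modifies nothing outside itself.
__attribute__((pure)) int add_c(int a) { return a + c; }

// "normal": modifies memory outside its own scope.
int add_and_bump(int a) { return a + c++; }
```
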
2 changes: 1 addition & 1 deletion Checks/PWR005/README.md
@@ -15,7 +15,7 @@ Add `default(none)` to disable default OpenMP scoping.
When the scope for a variable is not specified in an
[OpenMP](../../Glossary/OpenMP.md) `parallel` directive, a default scope is assigned
to it. Even when set explicitly, using a default scope is considered a bad
practice since it can lead to wrong data scopes inadvertently being applied to
practice since it can lead to wrong data scopes inadvertently being applied to
variables. Thus, it is recommended to explicitly set the scope for each
variable.

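
A sketch of the pattern PWR005 recommends, with every variable scoped explicitly:

```c
// Sketch: with default(none), forgetting a, b or n in the clauses becomes a
// compile-time error instead of a silently wrong default scope.
void scale(double *a, const double *b, int n) {
  #pragma omp parallel for default(none) shared(a, b, n)
  for (int i = 0; i < n; i++) {
    a[i] = 2.0 * b[i];
  }
}
```
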
2 changes: 1 addition & 1 deletion Checks/PWR006/README.md
@@ -13,7 +13,7 @@ Set the scope of the read-only variable to shared.

Since a read-only variable is never written to, it can be safely shared without
any risk of race conditions. **Sharing variables is more efficient than
privatizing** them from a memory perspective so it should be favored whenever
privatizing** them from a memory perspective, so it should be favored whenever
possible.

### Code example
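
A sketch of PWR006's advice, assuming a read-only `factor` used by every iteration:

```c
// Sketch: factor is only read inside the loop, so sharing it is race-free
// and avoids one private copy per thread.
void apply(double *a, int n, double factor) {
  #pragma omp parallel for default(none) shared(a, n, factor)
  for (int i = 0; i < n; i++) {
    a[i] *= factor;
  }
}
```
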
2 changes: 1 addition & 1 deletion Checks/PWR009/README.md
@@ -20,7 +20,7 @@ specific setup in order to better exploit its capabilities.
The OpenMP `parallel` construct specifies a parallel region of the code that
will be executed by a team of threads. It is normally accompanied by a
worksharing construct so that each thread of the team takes care of part of the
work (e.g the `for` construct assigns a subset of the loop iterations to each
work (e.g., the `for` construct assigns a subset of the loop iterations to each
thread). This attains a single level of parallelism since all work is
distributed across a team of threads. This works well for multi-core CPUs but
GPUs are composed of a high number of processing units organized into groups
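
A sketch of the multi-level parallelism PWR009 alludes to, assuming OpenMP offloading and a hypothetical kernel:

```c
// Sketch: "teams distribute" spreads iterations over groups of processing
// units, and "parallel for" over the threads within each group.
void double_all(double *a, int n) {
  #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
  for (int i = 0; i < n; i++) {
    a[i] *= 2.0;
  }
}
```
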
2 changes: 1 addition & 1 deletion Checks/PWR012/README.md
@@ -24,7 +24,7 @@ variable modifications, and also contributes to improve compiler and static
analyzer code coverage.

In parallel programming, derived data types are often discouraged when
offloading to the GPU because they may inhibit compiler analyses and
offloading to the GPU because they may inhibit compiler analyses and
optimizations due to [pointer aliasing](../../Glossary/Pointer-aliasing.md). Also, it
can cause unnecessary data movements impacting performance or incorrect data
movements impacting correctness and even crashes impacting code quality.
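
One refactoring consistent with PWR012's advice is to pass the members a kernel needs as plain arguments; a sketch, assuming a hypothetical `vec_t` derived type:

```c
typedef struct { double *data; int n; } vec_t; // hypothetical derived type

// Sketch: only data[0:n] is mapped, and no struct-internal pointer has to
// be resolved on the device.
void vec_scale(double *data, int n, double f) {
  #pragma omp target teams distribute parallel for map(tofrom: data[0:n])
  for (int i = 0; i < n; i++) {
    data[i] *= f;
  }
}
// call site: vec_scale(v.data, v.n, 2.0);
```
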
2 changes: 1 addition & 1 deletion Checks/PWR019/README.md
@@ -12,7 +12,7 @@ innermost loop.

### Relevance

Vectorization takes advantage of having as high a trip count (ie. number of
Vectorization takes advantage of having as high a trip count (i.e., number of
iterations) as possible. When loops are
[perfectly nested](../../Glossary/Perfect-loop-nesting.md) and they can be safely
interchanged, making the loop with the highest trip count the innermost should
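
A sketch of the interchange PWR019 suggests, assuming `n` is much larger than `m` and that the swap is legal for this access pattern:

```c
// before: for (j = 0; j < n; j++) for (i = 0; i < m; i++) a[i][j] = 0.0;
// after: the high-trip-count j loop is innermost, giving the vectorizer
// long, consecutive (row-major) runs.
void init(int m, int n, double a[m][n]) {
  for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++) {
      a[i][j] = 0.0;
    }
  }
}
```
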
4 changes: 2 additions & 2 deletions Checks/PWR020/README.md
@@ -14,7 +14,7 @@ statements in a first loop and the non-vectorizable statements in a second loop.

[vectorization](../../Glossary/Vectorization.md) is one of the most important ways to
speed up the computation of a loop. In practice, loops may contain a mix of
computations where only a part of the loop body introduces loop-carrie
computations where only a part of the loop body introduces loop-carried
dependencies that prevent vectorization. Different types of compute patterns
make explicit the loop-carried dependencies present in the loop. On the one
hand, the
@@ -25,7 +25,7 @@ vectorized:

* The
[sparse reduction compute pattern](../../Glossary/Patterns-for-performance-optimization/Sparse-reduction.md) - e.g.
the reduction variable has an read-write indirect memory access pattern which
the reduction variable has a read-write indirect memory access pattern which
does not allow to determine the dependencies between the loop iterations at
compile-time.

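
A sketch of the fission PWR020 describes, separating a vectorizable statement from a sparse reduction (hypothetical arrays):

```c
// Sketch: after fission the first loop vectorizes cleanly, while the second
// keeps the indirect, idx-driven update whose dependencies are unknown at
// compile time.
void split(double *b, double *hist, const double *a, const int *idx, int n) {
  for (int i = 0; i < n; i++) {
    b[i] = 2.0 * a[i]; // vectorizable
  }
  for (int i = 0; i < n; i++) {
    hist[idx[i]] += a[i]; // sparse reduction
  }
}
```
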
2 changes: 1 addition & 1 deletion Checks/PWR021/README.md
@@ -28,7 +28,7 @@ vectorized:

* The
[sparse reduction compute pattern](../../Glossary/Patterns-for-performance-optimization/Sparse-reduction.md) - e.g.
the reduction variable has an read-write indirect memory access pattern which
the reduction variable has a read-write indirect memory access pattern which
does not allow to determine the dependencies between the loop iterations at
compile-time.

8 changes: 4 additions & 4 deletions Checks/PWR022/README.md
@@ -3,18 +3,18 @@
### Issue

Conditional evaluates to the same value for all loop iterations and can be
[moved outside of the loop](../../Glossary/Loop-unswitching.md) to favor
[moved outside the loop](../../Glossary/Loop-unswitching.md) to favor
[vectorization](../../Glossary/Vectorization.md).

### Actions

Move the invariant conditional outside of the loop by duplicating the loop body.
Move the invariant conditional outside the loop by duplicating the loop body.

### Relevance

Classical vectorization requirements do not allow branching inside the loop
body, which would mean no `if` and `switch` statements inside the loop body are
allowed. However, loop invariant conditionals can be extracted outside of the
allowed. However, loop invariant conditionals can be extracted outside the
loop to facilitate vectorization. Therefore, it is often good to extract
invariant conditional statements out of vectorizable loops to increase
performance. A conditional whose expression evaluates to the same value for all
@@ -25,7 +25,7 @@ it will always be either true or false.
> This optimization is called
> [loop unswitching](../../Glossary/Loop-unswitching.md) and the compilers can do
> it automatically in simple cases. However, in more complex cases, the compiler
> will omit this optimization and therefore it is beneficial to do it manually.
> will omit this optimization and, therefore, it is beneficial to do it manually.

### Code example

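
A sketch of the unswitching PWR022 describes (hypothetical flag and arrays):

```c
// Sketch: the loop-invariant test is evaluated once and the body duplicated,
// leaving two branch-free loops the compiler can vectorize.
void unswitched(double *a, const double *b, double off, int use_off, int n) {
  if (use_off) {
    for (int i = 0; i < n; i++) {
      a[i] = b[i] + off;
    }
  } else {
    for (int i = 0; i < n; i++) {
      a[i] = b[i];
    }
  }
}
```
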
2 changes: 1 addition & 1 deletion Checks/PWR023/README.md
@@ -18,7 +18,7 @@ guarantee that the pointers do not alias one another, i.e. no memory address is
accessible through two different pointers. The developer can use the `restrict`
C keyword to inform the compiler that the specified block of memory is not
aliased by any other block. Providing this information can help the compiler
generate more efficient code or vectorize the loop. Therefore it is always
generate more efficient code or vectorize the loop. Therefore, it is always
recommended to use `restrict` whenever possible so that the compiler has as much
information as possible to perform optimizations such as vectorization.

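
A sketch of the `restrict` usage PWR023 recommends:

```c
// Sketch: restrict asserts that dst and src never overlap, removing the
// aliasing assumption that would otherwise block vectorization.
void copy(double *restrict dst, const double *restrict src, int n) {
  for (int i = 0; i < n; i++) {
    dst[i] = src[i];
  }
}
```
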
4 changes: 2 additions & 2 deletions Checks/PWR024/README.md
@@ -3,8 +3,8 @@
### Issue

The loop is currently not in
[OpenMP canonical](../../Glossary/OpenMP-canonical-form.md) form but it can be made
OpenMP compliant through refactoring.
[OpenMP canonical](../../Glossary/OpenMP-canonical-form.md) form, but it can be
made OpenMP compliant through refactoring.

### Actions

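
As a sketch of the refactoring PWR024 suggests, a pointer-chasing loop rewritten in canonical form (hypothetical code):

```c
// before (not canonical): while (p != end) { *p++ = 0.0; }
// after: init, test and increment are all visible in the for statement, so
// the loop can carry an OpenMP worksharing construct.
void zero(double *p, int n) {
  #pragma omp parallel for
  for (int i = 0; i < n; i++) {
    p[i] = 0.0;
  }
}
```
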
2 changes: 1 addition & 1 deletion Checks/PWR031/README.md
@@ -22,7 +22,7 @@ or square roots.
> [!NOTE]
> Some compilers under some circumstances (e.g. relaxed IEEE 754 semantics) can
> do this optimization automatically. However, doing it manually will guarantee
> best performance across all the compilers.
> the best performance across all the compilers.

### Code example

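
A sketch of the manual rewrite PWR031 describes, assuming a small integer exponent:

```c
// Sketch: the explicit product avoids a libm pow() call that compilers only
// simplify automatically under relaxed IEEE 754 semantics.
double cube(double x) {
  return x * x * x; // instead of pow(x, 3.0)
}
```
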
2 changes: 1 addition & 1 deletion Checks/PWR032/README.md
@@ -17,7 +17,7 @@ In C, there are several versions of the same mathematical function for different
types. For example, the square root function is available for floats, doubles
and long doubles through `sqrtf`, `sqrt` and `sqrtl`, respectively. Oftentimes,
the developer who is not careful will not use the function matching the data
type. For instance, most developers will just use "sqrt" for any data type,
type. For instance, most developers will just use `sqrt` for any data type,
instead of using `sqrtf` when the argument is float.

The type mismatch does not cause a compiler error because of the implicit type
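
A sketch of the matched-type call PWR032 recommends:

```c
#include <math.h>

// Sketch: sqrtf keeps the computation in single precision; calling sqrt here
// would insert an implicit float -> double -> float round trip.
float hypotenuse(float x, float y) {
  return sqrtf(x * x + y * y);
}
```
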
2 changes: 1 addition & 1 deletion Checks/PWR034/README.md
@@ -1,4 +1,4 @@
# PWR034: avoid strided array access to improve performance
# PWR034: Avoid strided array access to improve performance

### Issue

2 changes: 1 addition & 1 deletion Checks/PWR035/README.md
@@ -15,7 +15,7 @@ or changing the data layout to avoid non-consecutive access in hot loops.
### Relevance

Accessing an array in a non-consecutive order is less efficient than accessing
consecutive positions because the latter maximises
consecutive positions because the latter maximizes
[locality of reference](../../Glossary/Locality-of-reference.md).

### Code example
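
A sketch of the consecutive traversal PWR035 favors, on a row-major C array:

```c
// Sketch: with j innermost, successive iterations touch adjacent memory,
// maximizing locality of reference; interchanging the loops would stride
// by n elements per access.
double sum_all(int n, double a[n][n]) {
  double sum = 0.0;
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      sum += a[i][j];
    }
  }
  return sum;
}
```
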
4 changes: 2 additions & 2 deletions Checks/PWR040/README.md
@@ -19,8 +19,8 @@ for low performance on modern computer systems. Matrices are
Iterating over them column-wise (in C) and row-wise (in Fortran) is inefficient,
because it uses the memory subsystem suboptimally.

Nested loops that iterate over matrices in an inefficient manner can be
optimized by applying [loop tiling](../../Glossary/Loop-tiling.md). In contrast to
Nested loops that iterate over matrices inefficiently can be optimized by
applying [loop tiling](../../Glossary/Loop-tiling.md). In contrast to
[loop interchange](../../Glossary/Loop-interchange.md), loop tiling doesn't remove
the inefficient memory access, but instead breaks the problem into smaller
subproblems. Smaller subproblems have a much better
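
A sketch of the tiling PWR040 describes, on a transpose-like nest (the tile size `TS` is an assumed, tunable constant):

```c
#define TS 64 // assumed tile size; tune to the targeted cache level

// Sketch: the unavoidable strided access is confined to TS x TS blocks
// small enough to stay resident in cache.
void transpose(int n, double b[n][n], const double a[n][n]) {
  for (int ii = 0; ii < n; ii += TS) {
    for (int jj = 0; jj < n; jj += TS) {
      for (int i = ii; i < ii + TS && i < n; i++) {
        for (int j = jj; j < jj + TS && j < n; j++) {
          b[j][i] = a[i][j];
        }
      }
    }
  }
}
```
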
6 changes: 3 additions & 3 deletions Checks/PWR042/README.md
@@ -30,7 +30,7 @@ efficient one.
In order to perform the loop interchange, the loops need to be
[perfectly nested](../../Glossary/Perfect-loop-nesting.md), i.e. all the statements
need to be inside the innermost loop. However, due to the initialization of a
reduction variablе, loop interchange is not directly applicable.
reduction variable, loop interchange is not directly applicable.

> [!NOTE]
> Often, loop interchange enables vectorization of the innermost loop which
@@ -104,7 +104,7 @@ first and the third loops are single non-nested loops, so let's focus on the
second loop nest as it will have a higher impact on performance.

Note that this loop nest is perfectly nested, making loop interchange
applicable. This optimization will turn the `ij` order into `ji`, improving
applicable. This optimization will turn the `ij` order into `ji`, improving
the locality of reference:

```c
@@ -187,7 +187,7 @@ first and the third loops are single non-nested loops, so let's focus on the
second loop nest as it will have a higher impact on performance.

Note that this loop nest is perfectly nested, making loop interchange
applicable. This optimization will turn the `ij` order into `ji`, improving
applicable. This optimization will turn the `ij` order into `ji`, improving
the locality of reference:

```fortran
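
For orientation, a generic sketch of such an interchange on a row-major C array (not PWR042's own example):

```c
// before: for (j...) for (i...) a[i][j] += 1.0;  (stride-n accesses)
// after the interchange, the innermost index is also the fastest-varying one:
void bump(int n, double a[n][n]) {
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      a[i][j] += 1.0;
    }
  }
}
```
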
4 changes: 2 additions & 2 deletions Checks/PWR045/README.md
@@ -8,7 +8,7 @@ boost.

### Actions

Calculate the reciprocal outside of the loop and replace the division with
Calculate the reciprocal outside the loop and replace the division with
multiplication with a reciprocal

### Relevance
@@ -18,7 +18,7 @@ performing the division in each iteration of the loop, one could do the
following:

* For the expression `A / B`, calculate the reciprocal of the denominator
(`RECIP_B = 1.0 / B`) and put it outside of the loop.
(`RECIP_B = 1.0 / B`) and put it outside the loop.

* Replace the expression `A / B`, use `A * RECIP_B`.

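
A sketch of the rewrite PWR045 describes (the rounding may differ slightly from true division):

```c
// Sketch: one division outside the loop replaces n divisions inside it.
void divide_all(double *a, double b, int n) {
  const double recip_b = 1.0 / b;
  for (int i = 0; i < n; i++) {
    a[i] = a[i] * recip_b; // instead of a[i] / b
  }
}
```
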
4 changes: 2 additions & 2 deletions Checks/PWR048/README.md
@@ -34,8 +34,8 @@ __attribute__((const)) double example(double a, double b, double c) {
}
```

In the above example, the expression `a + b * c` is effectively a FMA operation
and it can be replaced with a call to `fma`:
In the above example, the expression `a + b * c` is effectively an FMA
operation and it can be replaced with a call to `fma`:

```c
#include <math.h>
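
For reference, a sketch of the `fma` form; the C `fma(x, y, z)` prototype computes `x * y + z` with a single rounding:

```c
#include <math.h>

// Sketch: b * c + a folded into one fused multiply-add.
__attribute__((const)) double example(double a, double b, double c) {
  return fma(b, c, a);
}
```
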
10 changes: 5 additions & 5 deletions Checks/PWR049/README.md
@@ -2,26 +2,26 @@

### Issue

A condition that depends only on the iterator variable can be moved outside of
the loop.
A condition that depends only on the iterator variable can be moved outside the
loop.

### Actions

Move iterator-dependent condition outside of the loop.
Move iterator-dependent condition outside the loop.

### Relevance

A condition that depends only on the iterator is predictable: we know exactly at
which iteration of the loop it is going to be true. Nevertheless, it is
evaluated in each iteration of the loop.

Moving the iterator-dependent condition outside of the loop will result in fewer
Moving the iterator-dependent condition outside the loop will result in fewer
instructions executed in the loop. This transformation can occasionally enable
vectorization, and for the loops that are already vectorized, it can increase
vectorization efficiency.

> [!NOTE]
> Moving an iterator-dependent condition outside of the loop is a creative
> Moving an iterator-dependent condition outside the loop is a creative
> process. Depending on the type of condition, it can involve loop peeling,
> [loop fission](../../Glossary/Loop-fission.md) or loop unrolling.

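
As one concrete instance of the peeling PWR049 mentions, a sketch for a condition that is true only at `i == 0` (hypothetical code):

```c
// before: for (i = 0; i < n; i++) { if (i == 0) a[i] = 0.0; else a[i] = b[i - 1]; }
// after peeling the first iteration, the remaining loop is branch-free:
void shift(double *a, const double *b, int n) {
  if (n > 0) {
    a[0] = 0.0;
    for (int i = 1; i < n; i++) {
      a[i] = b[i - 1];
    }
  }
}
```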