diff --git a/Checks/PWD005/README.md b/Checks/PWD005/README.md index ea87c4e5..27773cd7 100644 --- a/Checks/PWD005/README.md +++ b/Checks/PWD005/README.md @@ -11,7 +11,7 @@ Update the copied array range to match the actual array usage in the code. ### Relevance -Minimising data transfers is one of the main optimization points when offloading +Minimizing data transfers is one of the main optimization points when offloading computations to the GPU. An opportunity for such optimization occurs whenever only part of an array is required in a computation. In such cases, only a part of the array may be transferred to or from the GPU. However, the developer must diff --git a/Checks/PWD006/README.md b/Checks/PWD006/README.md index 74e74f26..0cd2618b 100644 --- a/Checks/PWD006/README.md +++ b/Checks/PWD006/README.md @@ -3,7 +3,7 @@ ### Issue The copy of a non-scalar variable to an accelerator device has been requested -but none or only a part of its data will be transferred because it is laid out +but none or only a part of its data will be transferred because it is laid out non-contiguously in memory. ### Actions @@ -13,9 +13,9 @@ memory segments are copied to the memory of the accelerator device. ### Relevance -The data of non-scalar variables might be spread across memory, laid out in non- -contiguous regions. One classical example is a dynamically-allocated two- -dimensional array in C/C++, which consists of a contiguous array of pointers +The data of non-scalar variables might be spread across memory, laid out in +non-contiguous regions. One classical example is a dynamically-allocated +two-dimensional array in C/C++, which consists of a contiguous array of pointers pointing to separate contiguous arrays that contain the actual data. Note that the elements of each individual array are contiguous in memory but the different arrays are scattered in the memory. This also holds for dynamically-allocated @@ -25,7 +25,7 @@ In order to offload such non-scalar variables to an accelerator device using OpenMP or OpenACC, it is not enough to add it to a data movement clause. This is known as deep copy and currently is not automatically supported by either OpenMP or OpenACC. To overcome this limitation, all the non-contiguous memory segments -must be explicitly transferred by the programmer. In OpenMP 4.5, this can be +must be explicitly transferred by the programmer. In OpenMP 4.5, this can be achieved through the *enter/exit data* execution statements. Alternatively, the code could be refactored so that it uses variables with contiguous data layouts (eg. flatten an array of arrays). @@ -85,12 +85,12 @@ void foo(int **A) { } ``` -The *enter/exit data* statements ressemble how the dynamic bi-dimensional memory +The *enter/exit data* statements resemble how the dynamic bi-dimensional memory is allocated in the CPU. An array of pointers is allocated first, followed by the allocation of all the separate arrays that contain the actual data. Each allocation constitutes a contiguous memory segment and must be transferred individually using *enter data*. The deallocation takes place in the inverted -order and the same happens with the *exit *data statements. +order and the same happens with the *exit data* statements. ### Related resources diff --git a/Checks/PWD007/README.md b/Checks/PWD007/README.md index 3be428a0..9ee00e88 100644 --- a/Checks/PWD007/README.md +++ b/Checks/PWD007/README.md @@ -12,7 +12,7 @@ Protect the recurrence or execute the code sequentially if that is not possible.
### Relevance The recurrence computation pattern occurs when the same memory position is read -and written to, at least once, in different iterations of a loop. It englobes +and written to, at least once, in different iterations of a loop. It encompasses both true dependencies (read-after-write) and anti-dependencies (write-after- read) across loop iterations. Sometimes the term "loop-carried dependencies" is also used. If a loop with a recurrence computation pattern is parallelized diff --git a/Checks/PWD009/README.md b/Checks/PWD009/README.md index 82591a8f..0cbd6542 100644 --- a/Checks/PWD009/README.md +++ b/Checks/PWD009/README.md @@ -12,7 +12,7 @@ Change the data scope of the variable from private to shared. Specifying an invalid scope for a variable may introduce race conditions and produce incorrect results. For instance, when a variable must be shared among -threads but it is privatized instead. +threads, but it is privatized instead. ### Code example diff --git a/Checks/PWR002/README.md b/Checks/PWR002/README.md index b2d7f95f..d5d91008 100644 --- a/Checks/PWR002/README.md +++ b/Checks/PWR002/README.md @@ -6,7 +6,7 @@ A scalar variable should be declared in the smallest [scope](../../Glossary/Variable-scope.md) possible. In computer programming, the term scope of a variable usually refers to the part of the code where the variable can be used (e.g. a function, a loop). During the execution of a program, a -variable cannot be accessed from outside of its scope. This effectively limits +variable cannot be accessed from outside its scope. This effectively limits the visibility of the variable, which prevents its value from being read or written in other parts of the code. @@ -40,7 +40,7 @@ incompatible purposes, making code testing significantly easier. In the following code, the function `example` declares a variable `t` used in each iteration of the loop to hold a value that is then assigned to the array -`result`. The variable `t` is not used outside of the loop. +`result`. The variable `t` is not used outside the loop. ```c void example() { @@ -96,7 +96,7 @@ code within larger programs by grouping sections together. Conveniently, In the following code, the subroutine `example` declares a variable `t` used in each iteration of the loop to hold a value that is then assigned to the array -`result`. The variable `t` is not used outside of the loop. +`result`. The variable `t` is not used outside the loop. ```fortran subroutine example() diff --git a/Checks/PWR003/README.md b/Checks/PWR003/README.md index b31a4e2b..ea1ed815 100644 --- a/Checks/PWR003/README.md +++ b/Checks/PWR003/README.md @@ -77,19 +77,19 @@ int example_impure(int a) { * `const` function: * Depends only on `a` and `b`. If successive calls are made with the same `a` and `b` values, the output will not change. - * Returns a value without modifying any data outside of the function. + * Returns a value without modifying any data outside the function. * `pure` function: * Depends on `c`, a global variable whose value can be modified between successive calls to the function by other parts of the program. Even if successive calls are made with the same `a` value, the output can differ depending on the state of `c`. - * Returns a value without modifying any data outside of the function. + * Returns a value without modifying any data outside the function. * "Normal" function: * Depends on `c`, a global variable. This restricts the function to be `pure`, at most.
- * However, the function also modifies `c`, memory outside of its scope, thus + * However, the function also modifies `c`, memory outside its scope, thus leading to a "normal" function. In the case of the `pure` and "normal" functions, it is equivalent that they @@ -129,12 +129,12 @@ end module example_module successive calls to the function by other parts of the program. Even if successive calls are made with the same `a` value, the output can be different depending on the state of `c`. - * Returns a value without modifying any data outside of the function. + * Returns a value without modifying any data outside the function. * "Normal" function: * Depends on `c`, a public variable. This restricts the function to be `pure`, at most. - * However, the function also modifies `c`, memory outside of its scope, thus + * However, the function also modifies `c`, memory outside its scope, thus leading to a "normal" function. >[!WARNING] diff --git a/Checks/PWR005/README.md b/Checks/PWR005/README.md index 737bed80..0b32d482 100644 --- a/Checks/PWR005/README.md +++ b/Checks/PWR005/README.md @@ -15,7 +15,7 @@ Add `default(none)` to disable default OpenMP scoping. When the scope for a variable is not specified in an [OpenMP](../../Glossary/OpenMP.md) `parallel` directive, a default scope is assigned to it. Even when set explicitly, using a default scope is considered a bad -practice since it can lead to wrong data scopes inadvertently being applied to +practice since it can lead to wrong data scopes inadvertently being applied to variables. Thus, it is recommended to explicitly set the scope for each variable. diff --git a/Checks/PWR006/README.md b/Checks/PWR006/README.md index 8ea4c34a..a6040c6d 100644 --- a/Checks/PWR006/README.md +++ b/Checks/PWR006/README.md @@ -13,7 +13,7 @@ Set the scope of the read-only variable to shared. Since a read-only variable is never written to, it can be safely shared without any risk of race conditions. **Sharing variables is more efficient than -privatizing** them from a memory perspective so it should be favored whenever +privatizing** them from a memory perspective, so it should be favored whenever possible. ### Code example diff --git a/Checks/PWR009/README.md b/Checks/PWR009/README.md index bda3521f..4f881ce2 100644 --- a/Checks/PWR009/README.md +++ b/Checks/PWR009/README.md @@ -20,7 +20,7 @@ specific setup in order to better exploit its capabilities. The OpenMP `parallel` construct specifies a parallel region of the code that will be executed by a team of threads. It is normally accompanied by a worksharing construct so that each thread of the team takes care of part of the -work (e.g the `for` construct assigns a subset of the loop iterations to each +work (e.g., the `for` construct assigns a subset of the loop iterations to each thread). This attains a single level of parallelism since all work is distributed across a team of threads. This works well for multi-core CPUs but GPUs are composed of a high number of processing units organized into groups diff --git a/Checks/PWR012/README.md b/Checks/PWR012/README.md index 211e0c15..01257e1f 100644 --- a/Checks/PWR012/README.md +++ b/Checks/PWR012/README.md @@ -24,7 +24,7 @@ variable modifications, and also contributes to improve compiler and static analyzer code coverage. 
In parallel programming, derived data types are often discouraged when -offloading to the GPU because they may inhibit compiler analyses and +offloading to the GPU because they may inhibit compiler analyses and optimizations due to [pointer aliasing](../../Glossary/Pointer-aliasing.md). Also, it can cause unnecessary data movements impacting performance or incorrect data movements impacting correctness and even crashes impacting code quality. diff --git a/Checks/PWR019/README.md b/Checks/PWR019/README.md index 1ea6cd12..79bed3be 100644 --- a/Checks/PWR019/README.md +++ b/Checks/PWR019/README.md @@ -12,7 +12,7 @@ innermost loop. ### Relevance -Vectorization takes advantage of having as high a trip count (ie. number of +Vectorization takes advantage of having as high a trip count (i.e., number of iterations) as possible. When loops are [perfectly nested](../../Glossary/Perfect-loop-nesting.md) and they can be safely interchanged, making the loop with the highest trip count the innermost should diff --git a/Checks/PWR020/README.md b/Checks/PWR020/README.md index e3fb0942..a34e7349 100644 --- a/Checks/PWR020/README.md +++ b/Checks/PWR020/README.md @@ -14,7 +14,7 @@ statements in a first loop and the non-vectorizable statements in a second loop. [vectorization](../../Glossary/Vectorization.md) is one of the most important ways to speed up the computation of a loop. In practice, loops may contain a mix of -computations where only a part of the loop body introduces loop-carrie +computations where only a part of the loop body introduces loop-carried dependencies that prevent vectorization. Different types of compute patterns make explicit the loop-carried dependencies present in the loop. On the one hand, the @@ -25,7 +25,7 @@ vectorized: * The [sparse reduction compute pattern](../../Glossary/Patterns-for-performance-optimization/Sparse-reduction.md) - e.g. -the reduction variable has an read-write indirect memory access pattern which +the reduction variable has a read-write indirect memory access pattern which does not allow to determine the dependencies between the loop iterations at compile-time. diff --git a/Checks/PWR021/README.md b/Checks/PWR021/README.md index 2ecc1b5c..922ef23e 100644 --- a/Checks/PWR021/README.md +++ b/Checks/PWR021/README.md @@ -28,7 +28,7 @@ vectorized: * The [sparse reduction compute pattern](../../Glossary/Patterns-for-performance-optimization/Sparse-reduction.md) - e.g. -the reduction variable has an read-write indirect memory access pattern which +the reduction variable has a read-write indirect memory access pattern which does not allow to determine the dependencies between the loop iterations at compile-time. diff --git a/Checks/PWR022/README.md b/Checks/PWR022/README.md index f0f2a561..c762194f 100644 --- a/Checks/PWR022/README.md +++ b/Checks/PWR022/README.md @@ -3,18 +3,18 @@ ### Issue Conditional evaluates to the same value for all loop iterations and can be -[moved outside of the loop](../../Glossary/Loop-unswitching.md) to favor +[moved outside the loop](../../Glossary/Loop-unswitching.md) to favor [vectorization](../../Glossary/Vectorization.md). ### Actions -Move the invariant conditional outside of the loop by duplicating the loop body. +Move the invariant conditional outside the loop by duplicating the loop body. ### Relevance Classical vectorization requirements do not allow branching inside the loop body, which would mean no `if` and `switch` statements inside the loop body are -allowed. 
However, loop invariant conditionals can be extracted outside of the +allowed. However, loop invariant conditionals can be extracted outside the loop to facilitate vectorization. Therefore, it is often good to extract invariant conditional statements out of vectorizable loops to increase performance. A conditional whose expression evaluates to the same value for all @@ -25,7 +25,7 @@ it will always be either true or false. > This optimization is called > [loop unswitching](../../Glossary/Loop-unswitching.md) and the compilers can do > it automatically in simple cases. However, in more complex cases, the compiler -> will omit this optimization and therefore it is beneficial to do it manually. +> will omit this optimization and, therefore, it is beneficial to do it manually. ### Code example diff --git a/Checks/PWR023/README.md b/Checks/PWR023/README.md index 18eb1c28..8f680680 100644 --- a/Checks/PWR023/README.md +++ b/Checks/PWR023/README.md @@ -18,7 +18,7 @@ guarantee that the pointers do not alias one another, i.e. no memory address is accessible through two different pointers. The developer can use the `restrict` C keyword to inform the compiler that the specified block of memory is not aliased by any other block. Providing this information can help the compiler -generate more efficient code or vectorize the loop. Therefore it is always +generate more efficient code or vectorize the loop. Therefore, it is always recommended to use `restrict` whenever possible so that the compiler has as much information as possible to perform optimizations such as vectorization. diff --git a/Checks/PWR024/README.md b/Checks/PWR024/README.md index a849023e..427f328f 100644 --- a/Checks/PWR024/README.md +++ b/Checks/PWR024/README.md @@ -3,8 +3,8 @@ ### Issue The loop is currently not in -[OpenMP canonical](../../Glossary/OpenMP-canonical-form.md) form but it can be made -OpenMP compliant through refactoring. +[OpenMP canonical](../../Glossary/OpenMP-canonical-form.md) form, but it can be +made OpenMP compliant through refactoring. ### Actions diff --git a/Checks/PWR031/README.md b/Checks/PWR031/README.md index 6eaabafa..d8ab668a 100644 --- a/Checks/PWR031/README.md +++ b/Checks/PWR031/README.md @@ -22,7 +22,7 @@ or square roots. > [!NOTE] > Some compilers under some circumstances (e.g. relaxed IEEE 754 semantics) can > do this optimization automatically. However, doing it manually will guarantee -> best performance across all the compilers. +> the best performance across all the compilers. ### Code example diff --git a/Checks/PWR032/README.md b/Checks/PWR032/README.md index 684630b1..ef545f39 100644 --- a/Checks/PWR032/README.md +++ b/Checks/PWR032/README.md @@ -17,7 +17,7 @@ In C, there are several versions of the same mathematical function for different types. For example, the square root function is available for floats, doubles and long doubles through `sqrtf`, `sqrt` and `sqrtl`, respectively. Oftentimes, the developer who is not careful will not use the function matching the data -type. For instance, most developers will just use "sqrt" for any data type, +type. For instance, most developers will just use `sqrt` for any data type, instead of using `sqrtf` when the argument is float. 
The type mismatch does not cause a compiler error because of the implicit type diff --git a/Checks/PWR034/README.md b/Checks/PWR034/README.md index 481c797c..a92fb646 100644 --- a/Checks/PWR034/README.md +++ b/Checks/PWR034/README.md @@ -1,4 +1,4 @@ -# PWR034: avoid strided array access to improve performance +# PWR034: Avoid strided array access to improve performance ### Issue diff --git a/Checks/PWR035/README.md b/Checks/PWR035/README.md index a32baaf6..39443f4b 100644 --- a/Checks/PWR035/README.md +++ b/Checks/PWR035/README.md @@ -15,7 +15,7 @@ or changing the data layout to avoid non-consecutive access in hot loops. ### Relevance Accessing an array in a non-consecutive order is less efficient than accessing -consecutive positions because the latter maximises +consecutive positions because the latter maximizes [locality of reference](../../Glossary/Locality-of-reference.md). ### Code example diff --git a/Checks/PWR040/README.md b/Checks/PWR040/README.md index bb46704c..06ea2bd6 100644 --- a/Checks/PWR040/README.md +++ b/Checks/PWR040/README.md @@ -19,8 +19,8 @@ for low performance on modern computer systems. Matrices are Iterating over them column-wise (in C) and row-wise (in Fortran) is inefficient, because it uses the memory subsystem suboptimally. -Nested loops that iterate over matrices in an inefficient manner can be -optimized by applying [loop tiling](../../Glossary/Loop-tiling.md). In contrast to +Nested loops that iterate over matrices inefficiently can be optimized by +applying [loop tiling](../../Glossary/Loop-tiling.md). In contrast to [loop interchange](../../Glossary/Loop-interchange.md), loop tiling doesn't remove the inefficient memory access, but instead breaks the problem into smaller subproblems. Smaller subproblems have a much better diff --git a/Checks/PWR042/README.md b/Checks/PWR042/README.md index b090369c..71a79233 100644 --- a/Checks/PWR042/README.md +++ b/Checks/PWR042/README.md @@ -30,7 +30,7 @@ efficient one. In order to perform the loop interchange, the loops need to be [perfectly nested](../../Glossary/Perfect-loop-nesting.md), i.e. all the statements need to be inside the innermost loop. However, due to the initialization of a -reduction variablе, loop interchange is not directly applicable. +reduction variable, loop interchange is not directly applicable. > [!NOTE] > Often, loop interchange enables vectorization of the innermost loop which @@ -104,7 +104,7 @@ first and the third loops are single non-nested loops, so let's focus on the second loop nest as it will have a higher impact on performance. Note that this loop nest is perfectly nested, making loop interchange -applicable. This optimization will turn the `ij` order into `ji`, improving +applicable. This optimization will turn the `ij` order into `ji`, improving the locality of reference: ```c @@ -187,7 +187,7 @@ first and the third loops are single non-nested loops, so let's focus on the second loop nest as it will have a higher impact on performance. Note that this loop nest is perfectly nested, making loop interchange -applicable. This optimization will turn the `ij` order into `ji`, improving +applicable. This optimization will turn the `ij` order into `ji`, improving the locality of reference: ```fortran diff --git a/Checks/PWR045/README.md b/Checks/PWR045/README.md index 94559c2a..8283e899 100644 --- a/Checks/PWR045/README.md +++ b/Checks/PWR045/README.md @@ -8,7 +8,7 @@ boost. 
### Actions -Calculate the reciprocal outside of the loop and replace the division with +Calculate the reciprocal outside the loop and replace the division with multiplication with a reciprocal ### Relevance @@ -18,7 +18,7 @@ performing the division in each iteration of the loop, one could do the following: * For the expression `A / B`, calculate the reciprocal of the denominator -(`RECIP_B = 1.0 / B`) and put it outside of the loop. +(`RECIP_B = 1.0 / B`) and put it outside the loop. * Replace the expression `A / B`, use `A * RECIP_B`. diff --git a/Checks/PWR048/README.md b/Checks/PWR048/README.md index e83cd0af..f1f90294 100644 --- a/Checks/PWR048/README.md +++ b/Checks/PWR048/README.md @@ -34,8 +34,8 @@ __attribute__((const)) double example(double a, double b, double c) { } ``` -In the above example, the expression `a + b * c` is effectively a FMA operation -and it can be replaced with a call to `fma`: +In the above example, the expression `a + b * c` is effectively an FMA +operation and it can be replaced with a call to `fma`: ```c #include diff --git a/Checks/PWR049/README.md b/Checks/PWR049/README.md index 6cb5ce42..e361dd0d 100644 --- a/Checks/PWR049/README.md +++ b/Checks/PWR049/README.md @@ -2,12 +2,12 @@ ### Issue -A condition that depends only on the iterator variable can be moved outside of -the loop. +A condition that depends only on the iterator variable can be moved outside the +loop. ### Actions -Move iterator-dependent condition outside of the loop. +Move iterator-dependent condition outside the loop. ### Relevance @@ -15,13 +15,13 @@ A condition that depends only on the iterator is predictable: we know exactly at which iteration of the loop it is going to be true. Nevertheless, it is evaluated in each iteration of the loop. -Moving the iterator-dependent condition outside of the loop will result in fewer +Moving the iterator-dependent condition outside the loop will result in fewer instructions executed in the loop. This transformation can occasionally enable vectorization, and for the loops that are already vectorized, it can increase vectorization efficiency. > [!NOTE] -> Moving an iterator-dependent condition outside of the loop is a creative +> Moving an iterator-dependent condition outside the loop is a creative > process. Depending on the type of condition, it can involve loop peeling, > [loop fission](../../Glossary/Loop-fission.md) or loop unrolling. diff --git a/Checks/PWR050/README.md b/Checks/PWR050/README.md index 13d0740b..2a7c5f7a 100644 --- a/Checks/PWR050/README.md +++ b/Checks/PWR050/README.md @@ -19,7 +19,7 @@ multithreaded code is not straightforward. Essentially, the programmer must explicitly specify how to execute the loop in vector mode on the hardware, as well as add the appropriate synchronization to avoid race conditions at runtime. Typically, minimizing the computational overhead of multithreading is the -biggest challenge to speedup the code. +biggest challenge to speed up the code. > [!NOTE] > Executing forall loops using multithreading incurs less overhead than in the diff --git a/Checks/PWR051/README.md b/Checks/PWR051/README.md index 199bdc20..e05e476b 100644 --- a/Checks/PWR051/README.md +++ b/Checks/PWR051/README.md @@ -19,7 +19,7 @@ multithreaded code is not straightforward. Essentially, the programmer must explicitly specify how to execute the loop in vector mode on the hardware, as well as add the appropriate synchronization to avoid race conditions at runtime. 
Typically, minimizing the computational overhead of multithreading is the -biggest challenge to speedup the code. +biggest challenge to speed up the code. > [!NOTE] > Executing scalar reduction loops using multithreading incurs an overhead due to diff --git a/Checks/PWR052/README.md b/Checks/PWR052/README.md index 2cccb71a..c4e64e4f 100644 --- a/Checks/PWR052/README.md +++ b/Checks/PWR052/README.md @@ -19,7 +19,7 @@ computers, but writing multithreaded code is not straightforward. Essentially, the programmer must explicitly specify how to execute the loop in vector mode on the hardware, as well as add the appropriate synchronization to avoid race conditions at runtime. Typically, minimizing the computational overhead of -multithreading is the biggest challenge to speedup the code. +multithreading is the biggest challenge to speed up the code. > [!NOTE] > Executing sparse reduction loops using multithreading incurs an overhead due to diff --git a/Checks/PWR055/README.md b/Checks/PWR055/README.md index 31db357f..b7e3a05b 100644 --- a/Checks/PWR055/README.md +++ b/Checks/PWR055/README.md @@ -19,7 +19,7 @@ is not straightforward. Essentially, the programmer must explicitly manage the data transfers between the host and the accelerator, specify how to execute the loop in parallel on the accelerator, as well as add the appropriate synchronization to avoid race conditions at runtime. Typically, minimizing the -computational overhead of offloading is the biggest challenge to speedup the +computational overhead of offloading is the biggest challenge to speed up the code using accelerators. > [!NOTE] diff --git a/Checks/PWR056/README.md b/Checks/PWR056/README.md index 24e7af51..5811c030 100644 --- a/Checks/PWR056/README.md +++ b/Checks/PWR056/README.md @@ -23,7 +23,7 @@ loop in parallel on the accelerator, as well as add the appropriate synchronization to avoid race conditions at runtime. Typically, **minimizing the computational overhead of offloading is the biggest -challenge to speedup the code using accelerators**. +challenge to speed up the code using accelerators**. > [!NOTE] > Offloading scalar reduction loops incurs an overhead due to the synchronization diff --git a/Checks/PWR057/README.md b/Checks/PWR057/README.md index 0878fe9a..732db6cb 100644 --- a/Checks/PWR057/README.md +++ b/Checks/PWR057/README.md @@ -21,7 +21,7 @@ is not straightforward. Essentially, the programmer must explicitly manage the data transfers between the host and the accelerator, specify how to execute the loop in parallel on the accelerator, as well as add the appropriate synchronization to avoid race conditions at runtime. Typically, minimizing the -computational overhead of offloading is the biggest challenge to speedup the +computational overhead of offloading is the biggest challenge to speed up the code using accelerators. > [!NOTE] diff --git a/Checks/PWR060/README.md b/Checks/PWR060/README.md index 79b6fec7..05d398cc 100644 --- a/Checks/PWR060/README.md +++ b/Checks/PWR060/README.md @@ -17,8 +17,8 @@ written in the first loop and read in the second loop. Vectorization is one of the most important ways to speed up computation in the loop. In practice, loops may contain vectorizable statements, but vectorization -may be either inhibited or inefficient due to the usage of data stored in non- -consecutive memory locations. Programs exhibit different types of +may be either inhibited or inefficient due to the usage of data stored in +non-consecutive memory locations. 
Programs exhibit different types of [memory access patterns](../../Glossary/Memory-access-pattern.md) that lead to non-consecutive memory access, e.g. strided, indirect, random accesses. diff --git a/Checks/PWR063/README.md b/Checks/PWR063/README.md index e69e8cea..fd92ef8a 100644 --- a/Checks/PWR063/README.md +++ b/Checks/PWR063/README.md @@ -118,7 +118,7 @@ arithmetic `if` statement: ``` Although it is a simple program, using an arithmetic `if` to drive the flow of -the loop makes the behaviour of the program less explicit than modern loop +the loop makes the behavior of the program less explicit than modern loop construct. We may improve the readability, intent, and maintainability of the code if we diff --git a/Checks/PWR068/README.md b/Checks/PWR068/README.md index 13c6ea23..93461126 100644 --- a/Checks/PWR068/README.md +++ b/Checks/PWR068/README.md @@ -206,7 +206,7 @@ Factorial of 5 is 120 > [!TIP] > When interoperating between Fortran and C/C++, it's necessary to manually > define explicit interfaces for the C/C++ procedures to call. Although this is -> not a perfect solution, since the are no guarantees that these interfaces +> not a perfect solution, since there are no guarantees that these interfaces > will match the actual C/C++ procedures, it's still best to make the > interfaces as explicit as possible. This includes specifying details such as > argument intents, to help the Fortran compiler catch early as many issues as diff --git a/Checks/PWR069/README.md b/Checks/PWR069/README.md index a7b06f71..86dbffe9 100644 --- a/Checks/PWR069/README.md +++ b/Checks/PWR069/README.md @@ -22,7 +22,7 @@ code. In procedures with implicit typing enabled, an `use` without the `only` specification can also easily lead to errors. If the imported module is later expanded with new members, these are automatically imported into the procedure -and might inadvertedly shadow existing and implicitly typed variables, +and might inadvertently shadow existing and implicitly typed variables, potentially leading to difficult-to-diagnose bugs. By leveraging the `only` keyword, the programmer restricts the visibility to diff --git a/Checks/PWR070/README.md b/Checks/PWR070/README.md index 21387e29..b8a11216 100644 --- a/Checks/PWR070/README.md +++ b/Checks/PWR070/README.md @@ -36,7 +36,7 @@ and more efficient: - In general, they lack compile-time checks for consistency between the provided and the expected array. -Aditionally, explicit-shape and assumed-size dummy arguments require contiguous +Additionally, explicit-shape and assumed-size dummy arguments require contiguous memory. This forces the creation of intermediate data copies when working with array slices or strided accesses. In contrast, assumed-shape arrays can handle these scenarios directly, leading to enhanced performance. diff --git a/Checks/PWR075/README.md b/Checks/PWR075/README.md index 0ef60ad4..ce56e7c2 100644 --- a/Checks/PWR075/README.md +++ b/Checks/PWR075/README.md @@ -238,7 +238,7 @@ included in this PWR075 documentation. 
| Non-standard double precision hyperbolic trigonometric functions: `DACOSH`, `DASINH`, `DATANH` | Use the generic intrinsic procedures: `ACOSH`, `ASINH`, `ATANH` | | Mathematical function to compute the Gamma function for double precision arguments: `DGAMMA` | Use the generic `GAMMA` that also accepts double precision arguments | | Mathematical function for double precision complementary error function: `DERFC` | Use the generic intrinsic function for the complementary error function: `ERFC` | -| Functions for processor time measurements: `DTIME`, `SECOND` | Use the generic intrinsic subroutine `CPU_TIME(TIME)` | +| Functions for processor time measurements: `DTIME`, `SECOND` | Use the generic intrinsic subroutine `CPU_TIME(TIME)` | | Functions to retrieve date and time information: `FDATE`, `IDATE`, `ITIME`, `CTIME`, `LTIME`, `GMTIME` | Use the generic intrinsic subroutine `DATE_AND_TIME([DATE, TIME, ZONE, VALUES])` | | Functions for low-level file input: `FGET`, `FGETC` | Use `READ` or C interoperability | | Functions to indicate integers of different precisions: `FLOATI`, `FLOATJ`, `FLOATK` | Use the generic `REAL(A)` function or `DBLE(A)` function if double precision is required. | @@ -254,7 +254,7 @@ included in this PWR075 documentation. | Function to extract the imaginary part of a complex number: `IMAGPART` | Use the generic intrinsic function `AIMAG(Z)` that returns the imaginary part of a complex number | | Types to convert values to integers of different precisions: `INT2`, `INT8` | Use the generic intrinsic function `INT(A, KIND)` along with standard kind type parameters (e.g. `C_INT16_T` or `C_INT64_T`) | | Mathematical functions to compute the natural logarithm of the Gamma function: `LGAMMA`, `ALGAMA`, `DLGAMA` | Use the generic intrinsic function `LOG_GAMMA` | -| Function to find the last non-blanck character in a string: `LNBLNK` | Use the generic intrinsic function `LEN_TRIM(STRING [, KIND])` | +| Function to find the last non-blank character in a string: `LNBLNK` | Use the generic intrinsic function `LEN_TRIM(STRING [, KIND])` | | Functions for generating random numbers: `RAND`, `RAN`,`IRAND`, `SRAND` | Use the generic intrinsic to generate pseudorandom numbers: `RANDOM_NUMBER` | | Function to extract the real part of a complex number: `REALPART` | Use the generic intrinsic function `REAL(A [, KIND])` or `DBLE(A)` if double precision is required to obtain the real part | | Function to execute a system command from Fortran: `SYSTEM` | Use the generic intrinsic subroutine `EXECUTE_COMMAND_LINE` | diff --git a/Checks/PWR080/README.md b/Checks/PWR080/README.md index 2efe98f5..68a08b23 100644 --- a/Checks/PWR080/README.md +++ b/Checks/PWR080/README.md @@ -11,7 +11,7 @@ undefined behavior due to its indeterminate value. To prevent bugs in the code, ensure the problematic variable is initialized in all possible code paths. It may help to add explicit `else` or `default` branches in control-flow blocks, or even set a default initial value -inmediately after declaring the variable. +immediately after declaring the variable. ### Relevance diff --git a/Checks/RMK012/README.md b/Checks/RMK012/README.md index 52236be7..9b4ed92b 100644 --- a/Checks/RMK012/README.md +++ b/Checks/RMK012/README.md @@ -33,7 +33,7 @@ vectorization. * If the condition in the loop is always evaluated to a loop-invariant value (i.e. 
its value is either true or false across the execution of the loop), this -condition can be moved outside of the loop (see +condition can be moved outside the loop (see [loop unswitching](../../Glossary/Loop-unswitching.md)). * If the condition in the loop depends on iterator variables only, the conditions diff --git a/Checks/RMK015/README.md b/Checks/RMK015/README.md index 132cc9e5..cee66629 100644 --- a/Checks/RMK015/README.md +++ b/Checks/RMK015/README.md @@ -16,7 +16,7 @@ debugging tools. Compilers are designed to **convert source code into efficient executable code for the target hardware**, **reducing the cost of the compilation process** and -**facilitating the debugging  process** by the programmer. Compilers provide +**facilitating the debugging process** by the programmer. Compilers provide optimization flags to improve performance, as well as optimization flags for reducing the size of the executable code. Typical compiler optimization flags for performance are `-O0`, `-O1`, `-O2`, `-O3` and `-Ofast`. On the other hand, diff --git a/Deprecated/PWR010/README.md b/Deprecated/PWR010/README.md index 52feef35..7bbd9ccd 100644 --- a/Deprecated/PWR010/README.md +++ b/Deprecated/PWR010/README.md @@ -11,7 +11,7 @@ ### Issue -In the C and C++ programming languages, matrices are stored in a +In the C and C++ programming languages, matrices are stored in a [row-major layout](../../Glossary/Row-major-and-column-major-order.md); thus, iterating the matrix column-wise is non-optimal and should be avoided if possible. diff --git a/Deprecated/PWR033/README.md b/Deprecated/PWR033/README.md index 50d374e2..018bce91 100644 --- a/Deprecated/PWR033/README.md +++ b/Deprecated/PWR033/README.md @@ -31,7 +31,7 @@ may be larger but it should also become faster. > This optimization is called [loop unswitching](../../Glossary/Loop-unswitching.md) > and the compilers can do it automatically in simple cases. However, in more > complex cases, the compiler will omit this optimization and therefore it is -> beneficial to do it manually.. +> beneficial to do it manually. ### Code example @@ -51,7 +51,7 @@ void example(int addTwo) { In each iteration, the increment statement evaluates the argument to decide how much to increment. However, this value is fixed for the whole execution of the -function and thus, the conditional can be moved outside of the loop. The +function and thus, the conditional can be moved outside the loop. The resulting code is as follows: ```c @@ -87,7 +87,7 @@ end subroutine In each iteration, the increment statement evaluates the argument to decide how much to increment. However, this value is fixed for the whole execution of the -function and thus, the conditional can be moved outside of the loop. The +function and thus, the conditional can be moved outside the loop. 
The resulting code is as follows: ```fortran diff --git a/Deprecated/RMK001/README.md b/Deprecated/RMK001/README.md index d34929fc..89affdac 100644 --- a/Deprecated/RMK001/README.md +++ b/Deprecated/RMK001/README.md @@ -1,4 +1,4 @@ -# RMK001: loop nesting that might benefit from hybrid parallelization using multithreading and SIMD +# RMK001: Loop nesting that might benefit from hybrid parallelization using multithreading and SIMD > [!WARNING] > This check was deprecated in favor of [PWR050](../../Checks/PWR050/README.md), diff --git a/Deprecated/RMK003/README.md b/Deprecated/RMK003/README.md index 0bebbc2d..304bc4fe 100644 --- a/Deprecated/RMK003/README.md +++ b/Deprecated/RMK003/README.md @@ -1,4 +1,4 @@ -# RMK003: potential temporary variable for the loop which might be privatizable, thus enabling the loop parallelization +# RMK003: Potential temporary variable for the loop which might be privatizable, thus enabling the loop parallelization > [!WARNING] > This check was deprecated due to the lack of actionable guidance and examples. diff --git a/Glossary/Locality-of-reference.md b/Glossary/Locality-of-reference.md index 662f0d9e..5659ff4e 100644 --- a/Glossary/Locality-of-reference.md +++ b/Glossary/Locality-of-reference.md @@ -14,7 +14,7 @@ ways: * **Temporal locality**: If the CPU has accessed a certain memory location, there is a high probability that it will access it again in the near future. Using the -same values in different loop iterations is an example of temporal locality. +same values in different loop iterations is an example of temporal locality. * **Spatial locality**: If the CPU has accessed a certain memory location, there is a high probability that it will access its neighboring locations in the near @@ -47,7 +47,7 @@ brings performance gain. Writing code that makes efficient use of vectorization is essential to write performant code for modern hardware. For example, loop fission enables splitting -an non-vectorizable loop into two or more loops. The goal of the fission is to +a non-vectorizable loop into two or more loops. The goal of the fission is to isolate the statements preventing the vectorization into a dedicated loop. By doing this, we enable vectorization in the rest of the loop, which can lead to speed improvements. Note loop fission introduces overheads (e.g. loop control diff --git a/Glossary/Loop-sectioning.md b/Glossary/Loop-sectioning.md index 502f8bd4..110a8593 100644 --- a/Glossary/Loop-sectioning.md +++ b/Glossary/Loop-sectioning.md @@ -5,7 +5,7 @@ efficiency of vectorization by splitting the loop execution into several sections. Instead of iterating from `0` to `N`, the loop iterates in sections which are -smaller in size, e.g. `0` to `S`, from `S` to `2S - 1`, etc. +smaller, e.g. `0` to `S`, from `S` to `2S - 1`, etc. There are two distinct use cases for loop sectioning: diff --git a/Glossary/Loop-tiling.md b/Glossary/Loop-tiling.md index 0e6e8a3b..c7dc15dd 100644 --- a/Glossary/Loop-tiling.md +++ b/Glossary/Loop-tiling.md @@ -62,7 +62,7 @@ for (int jj = 0; jj < m; jj += TILE_SIZE) { The careful reader might notice that after this intervention, the values for the array `a` will be read `m / TILE_SIZE` times from the memory. 
If the size of array `a` is large, then it can be useful to perform loop tiling on the loop -over `i` a as well, like this: +over `i` as well, like this: ```c for (int ii = 0; ii < n; ii += TILE_SIZE_I) { diff --git a/Glossary/Loop-unswitching.md b/Glossary/Loop-unswitching.md index 389c9a47..75ae9f84 100644 --- a/Glossary/Loop-unswitching.md +++ b/Glossary/Loop-unswitching.md @@ -2,7 +2,7 @@ **Loop unswitching** is a program optimization technique, where invariant conditions inside loops (i.e. conditions whose value is always the same inside -the loop) can be taken outside of the loop by creating copies of the loop. +the loop) can be taken outside the loop by creating copies of the loop. To illustrate loop unswitching, consider the following example: @@ -22,7 +22,7 @@ in case `a[i]` is negative and we are debugging, we want to log an error. The condition `if (debug)` is loop invariant, since the variable `debug` never changes its value. By doing loop unswitching and moving this condition outside -of the loop, the loop becomes faster. Here is the same loop after loop +the loop, the loop becomes faster. Here is the same loop after loop unswitching: ```c diff --git a/Glossary/Memory-access-pattern.md b/Glossary/Memory-access-pattern.md index 85f1398d..a41e987c 100644 --- a/Glossary/Memory-access-pattern.md +++ b/Glossary/Memory-access-pattern.md @@ -48,12 +48,13 @@ follows: * Access to `d[i]` is constant. It doesn't depend on the value of `j` and it has the same value inside the innermost loop. -* Access to `a[j]` is sequential. Everytime the iterator variable `j` increases by -1, the loop is accessing the next neighboring element. The same applies to the -access to `index[j]`. +* Access to `a[j]` is sequential. Every time the iterator variable `j` +increases by 1, the loop is accessing the next neighboring element. The same +applies to the access to `index[j]`. -* Access to `b[j * n]` is strided. Everytime the iterator variable `j` increases -by 1, the loop is accessing the element of the array `b` increased by `n`. +* Access to `b[j * n]` is strided. Every time the iterator variable `j` +increases by 1, the loop is accessing the element of the array `b` increased by +`n`. * Access to `c[index[j]]` is random. The value accessed when the iterator variable `j` increases its value is not known and it is considered random. diff --git a/Glossary/Multithreading.md b/Glossary/Multithreading.md index 7fa00368..841bd16a 100644 --- a/Glossary/Multithreading.md +++ b/Glossary/Multithreading.md @@ -8,9 +8,9 @@ to several CPU cores in order to speed up its execution. The crucial underlying concept of multithreading is **thread**. The simplest way to imagine a thread is as an independent worker, which has its own code that it -is executing. Some of the data used by the thread is local to the thread, and -some of it is shared among all threads. An important aspect of multithreading is -that all the threads in principle have access to the same address space. +is executing. Some data used by the thread is local to the thread, and some of +it is shared among all threads. An important aspect of multithreading is that +all the threads in principle have access to the same address space. Although the user can create as many logical threads as they want, for optimum performance the number of threads should correspond to the number of CPU cores. @@ -38,4 +38,6 @@ The two biggest challenges with multithreading are: 1. 
[Deciding which data should be thread-private and which should be shared](Variable-scoping-in-the-context-of-OpenMP.md), -2. and thread synchronization and possible data races. Without it the parallelization either doesn't pay off in term of performance or gives the wrong results. +2. and thread synchronization and possible data races. Without it the + parallelization either doesn't pay off in terms of performance or gives the + wrong results. diff --git a/Glossary/Patterns-for-performance-optimization/Recurrence.md b/Glossary/Patterns-for-performance-optimization/Recurrence.md index f89ced5c..4ade0f6b 100644 --- a/Glossary/Patterns-for-performance-optimization/Recurrence.md +++ b/Glossary/Patterns-for-performance-optimization/Recurrence.md @@ -13,8 +13,8 @@ A more formal definition is that a recurrence is a computation `a(s) = e`, where `e` contains a set of occurrences `a(s1), ..., a(sm)` so that, in the general case, the subscripts `s, s1, ..., sm` are different. Note that in the classical sense, a recurrence satisfies the additional constraint that at least one -subscript is symbolically different than `s`, and thus dependencies between -different loop iterations are introduced. +subscript is symbolically different from `s`, and thus dependencies between +different loop iterations are introduced. ### Code examples @@ -39,7 +39,7 @@ end do ### Parallelizing recurrences with OpenMP and OpenACC In general, codes containing a recurrence pattern are difficult to parallelize -in an efficient manner, and may even not be parallelizable at all. An example of +efficiently, and may even not be parallelizable at all. An example of parallelizable recurrence is the computation of a cumulative sum, which can be computed efficiently in parallel through parallel prefix sum operations. This is usually known as scan operation and it is supported in OpenMP since version 5.0. diff --git a/Glossary/Patterns-for-performance-optimization/Scalar-reduction.md b/Glossary/Patterns-for-performance-optimization/Scalar-reduction.md index 6a56abc4..c2ad5c81 100644 --- a/Glossary/Patterns-for-performance-optimization/Scalar-reduction.md +++ b/Glossary/Patterns-for-performance-optimization/Scalar-reduction.md @@ -35,7 +35,7 @@ end do ### Parallelizing scalar reductions with OpenMP and OpenACC The computation of the scalar reduction has concurrent read-write accesses to -the scalar reduction variable. Therefore a scalar reduction can be computed in +the scalar reduction variable. Therefore, a scalar reduction can be computed in parallel safely only if additional synchronization is inserted in order to avoid race conditions associated to the reduction variable. diff --git a/Glossary/Scalar-to-vector-promotion.md b/Glossary/Scalar-to-vector-promotion.md index fbf12578..17998986 100644 --- a/Glossary/Scalar-to-vector-promotion.md +++ b/Glossary/Scalar-to-vector-promotion.md @@ -7,7 +7,7 @@ optimization techniques, notably [loop fission](Loop-fission.md). In this technique, a temporary scalar is converted to a vector whose value is preserved between loop iterations, with the goal to enable loop fission needed to extract the statements preventing -optimizations outside of the critical loop. +optimizations outside the critical loop. 
### Loop interchange diff --git a/README.md b/README.md index 3f907713..2ae6797d 100644 --- a/README.md +++ b/README.md @@ -23,7 +23,7 @@ The Open Catalog includes designed to demonstrate: - No performance degradation when implementing the correctness, modernization, - and security recommendations. + security, and portability recommendations. - Potential performance enhancements achievable through the optimization recommendations.