rocBLAS documentation is available at https://rocm.docs.amd.com/projects/rocBLAS/en/latest/index.html.
- gfx950 support
- `ROCBLAS_LAYER = 8` internal API logging for `gemm` debugging
- Support for AOCL 5.0 gcc build as a client reference library
- Allow `PkgConfig` for client reference library fallback detection
- `CMAKE_CXX_COMPILER` is now passed on during compilation for a Tensile build
- Change default atomics mode from `allowed` to `not allowed`
- Support code for non-production gfx targets
- `rocblas_hgemm_kernel_name`, `rocblas_sgemm_kernel_name`, and `rocblas_dgemm_kernel_name` API functions
- Use of `warpSize` as a constexpr
- Use of deprecated behavior of `hipPeekLastError`
- `rocblas_is_user_managing_device_memory` and `rocblas_set_device_memory_size` API functions
- `rocblas_float8.h` and `rocblas_hip_f8_impl.h` files
- `rocblas_gemm_ex3`, `rocblas_gemm_batched_ex3`, `rocblas_gemm_strided_batched_ex3` API functions
- Optimized `gemm` by using `gemv` kernels when applicable
- Optimized `gemv` for small `m` and `n` with a large batch count on gfx942
- Improved the performance of Level 1 `dot` for all precisions and variants when `N > 100000000` on gfx942
- Improved the performance of Level 1 `asum` and `nrm2` for all precisions and variants on gfx942
- Improved the performance of Level 2 `sger` (single precision) on gfx942
- Improved the performance of Level 3 `dgmm` for all precisions and variants on gfx942
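
As a rough illustration of the `gemm`-to-`gemv` note above: when the right-hand matrix has a single column, a matrix-matrix product is the same computation as a matrix-vector product. The Python sketch below is not rocBLAS code; the function names are hypothetical, and these notes do not state which problem shapes rocBLAS actually treats as applicable.

```python
def gemv(A, x):
    """Reference matrix-vector product y = A * x on nested lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def gemm(A, B):
    """Reference matrix-matrix product C = A * B on nested lists."""
    k, n = len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(len(A))]


def gemm_via_gemv(A, B):
    """Dispatch to gemv when B has one column (assumed criterion), else use gemm."""
    if len(B[0]) == 1:
        x = [row[0] for row in B]            # flatten the single column into a vector
        return [[v] for v in gemv(A, x)]     # reshape the result back to m x 1
    return gemm(A, B)
```

Both paths give the same result for a single-column `B`; the point of such a dispatch is that a kernel specialized for the vector case may be faster for that shape.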
- Fixed environment variable path-based logging to append multiple handle output to the same file
- Support numerics when `trsm` is running with `rocblas_status_perf_degraded`
- Fixed the build dependency installation of `joblib` on some operating systems
- Return `rocblas_status_internal_error` when `rocblas_[set,get]_[matrix,vector]` is called with a host pointer in place of a device pointer
- Reduced the default verbosity level for internal GEMM backend information
- Updated from the deprecated rocm-cmake to ROCmCMakeBuildTools
- Corrected AlmaLinux gfortran package dependencies
- Deprecated the use of negative indices to indicate the default solution is being used for `gemm_ex` with `rocblas_gemm_algo_solution_index`
- rocTX support in rocBLAS (not available on Windows or in the static library version on Linux)
- On gfx12, all functions now support full `rocblas_int` dynamic range for `batch_count`
- `--ninja` build option
- Support for GPU_TARGETS cmake variable
- rocblas-test client removes the stress tests unless YAML-based testing or `gtest_filter` adds them
- rocblas clients OpenMP default threading is reduced to be less than the logical core count
- `gemm_ex` testing and timing reuses device memory
- `gemm_ex` timing initializes matrices on device
- Significantly reduced workspace memory requirements for Level 1 ILP64: `iamax` and `iamin`
- Reduced workspace memory requirements for Level 1 ILP64: `dot`, `asum`, `nrm2`
- Improved the performance of Level 2 gemv for the problem sizes (`TransA == N && m > 2*n`) and (`TransA == T`)
- Improved the performance of Level 3 syrk and herk for the problem size (`k > 500 && n < 4000`)
- gfx12: `ger`, `geam`, `geam_ex`, `dgmm`, `trmm`, `symm`, `hemm`, ILP64 `gemm`, and larger data support
- Added a `gfortran` package dependency for Azure Linux OS
- Outdated SLES OS package dependencies (`cxxtools` and `joblib`) in `install.sh -d`
- Code object stripping for RPM packages
- Deprecated the cmake variable `AMDGPU_TARGETS`. Use `GPU_TARGETS` instead.
- Level 3 and EX functions have an additional ILP64 API for both C and FORTRAN (_64 name suffix) with int64_t function arguments
- amdclang is used as the default compiler instead of hipcc
- Internal performance scripts use amd-smi instead of the deprecated rocm-smi
- Improved performance of Level 2 gbmv
- Improved performance of Level 2 gemv for float and double precisions for problem sizes (TransA == N && m==n && m % 128 == 0) measured on a gfx942 GPU
- Fixed stbsv_strided_batched_64 Fortran binding
- rocblas_Xgemm_kernel_name APIs are deprecated
- Removed Device_Memory_Allocation.pdf link in documentation.
- Fixed error/warning message during `rocblas_set_stream()` call.
- Level 2 functions and level 3 trsm have additional ILP64 API for both C and FORTRAN (_64 name suffix) with int64_t function arguments
- Cache flush timing for gemm_batched_ex, gemm_strided_batched_ex, axpy
- Benchmark class for common timing code
- An environment variable "ROCBLAS_DEFAULT_ATOMICS_MODE" to set default atomics mode during creation of 'rocblas_handle'
- Extended dot_ex to support single-precision (fp32_r) input and double-precision (fp64_r) output and compute types
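
The benefit of a double-precision output and compute type for single-precision input can be seen in a small host-side sketch. This is plain Python, not the rocBLAS implementation; the helper names are hypothetical, and `struct` is used only to emulate rounding to fp32.

```python
import struct


def to_f32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_f32_compute(x, y):
    """Dot product where every product and partial sum is rounded to fp32."""
    acc = 0.0
    for a, b in zip(x, y):
        acc = to_f32(acc + to_f32(to_f32(a) * to_f32(b)))
    return acc


def dot_f64_compute(x, y):
    """Dot product of fp32 inputs accumulated in fp64."""
    acc = 0.0
    for a, b in zip(x, y):
        acc += to_f32(a) * to_f32(b)
    return acc
```

With `x = [16777216.0, 1.0, 1.0]` and `y = [1.0, 1.0, 1.0]`, the fp32 accumulator cannot represent 2^24 + 1 and stays at 16777216.0, while the fp64 accumulator returns 16777218.0.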
- Improved performance of Level 1 dot_batched and dot_strided_batched for all precisions. Performance enhanced by 6 times for bigger problem sizes measured on MI210 GPU
- Linux AOCL dependency updated to release 4.2 gcc build
- Windows vcpkg dependencies updated to release 2024.02.14
- Increased default device workspace from 32 to 128 MiB for architecture gfx9xx with xx >= 40
- rocblas_gemm_ex3, gemm_batched_ex3 and gemm_strided_batched_ex3 are deprecated and will be removed in the next major release of rocBLAS. Please refer to hipBLASLt for future 8 bit float usage https://github.com/ROCm/hipBLASLt
- Level 1 and Level 1 Extension functions have additional ILP64 API for both C and Fortran (`_64` name suffix) with int64_t function arguments
- Cache flush timing for `gemm_ex`
- Some Level 2 function argument names have changed from `m` to `n` to match legacy BLAS; there is no change in implementation
- Standardized the use of non-blocking streams for copying results from device to host
- Fixed host-pointer mode reductions for non-blocking streams
- Beta API `rocblas_gemm_batched_ex3` and `rocblas_gemm_strided_batched_ex3`
- Input/output type f16_r/bf16_r and execution type f32_r support for Level 2 `gemv_batched` and `gemv_strided_batched`:
  - `rocblas_hshgemv_batched/strided_batched`, `rocblas_hssgemv_batched/strided_batched`, `rocblas_tstgemv_batched/strided_batched`, and `rocblas_tssgemv_batched/strided_batched`
- Use of `rocblas_status_excluded_from_build` when calling functions that require Tensile (when using rocBLAS built without Tensile)
- System for asynchronous kernel launches that set a `rocblas_status` failure based on a `hipPeekAtLastError` discrepancy
- TRSM performance for small sizes (m < 32 && n < 32)
- Atomic operations will be disabled by default in a future release of rocBLAS
- `rocblas_gemm_ext2` API function
- In-place trmm API from Legacy BLAS is replaced by an API that supports both in-place and out-of-place trmm
- int8x4 support is removed (int8 support is unchanged)
- `#define __STDC_WANT_IEC_60559_TYPES_EXT__` is removed from `rocblas-types.h` (if you want ISO/IEC TS 18661-3:2015 functionality, you must define `__STDC_WANT_IEC_60559_TYPES_EXT__` before including `float.h`, `math.h`, and `rocblas.h`)
- The default build removes device code for gfx803 architecture from the fat binary
- Made offset calculations for 64-bit rocBLAS functions safe
  - Fixes for very large leading dimension or increment potentially causing overflow:
    - Level 2: `gbmv`, `gemv`, `hbmv`, `sbmv`, `spmv`, `tbmv`, `tpmv`, `tbsv`, and `tpsv`
- Lazy loading supports heterogeneous architecture setup and load-appropriate tensile library files, based on device architecture
- Guards against no-op kernel launches that result in a potential `hipGetLastError`
- Reduced the default verbosity of `rocblas-test` (you can see all tests by setting the `GTEST_LISTENER=PASS_LINE_IN_LOG` environment variable)
- YAML lock step argument scans for `rocblas-bench` and `rocblas-test` clients
- `rocblas-gemm-tune` is used to find the best performing GEMM kernel for each set of GEMM problems
- Made offset calculations for 64-bit rocBLAS functions safe
  - Fixes for very large leading dimensions or increments potentially causing overflow:
    - Level 1: `axpy`, `copy`, `rot`, `rotm`, `scal`, `swap`, `asum`, `dot`, `iamax`, `iamin`, and `nrm2`
    - Level 2: `gemv`, `symv`, `hemv`, `trmv`, `ger`, `syr`, `her`, `syr2`, `her2`, and `trsv`
    - Level 3: `gemm`, `symm`, `hemm`, `trmm`, `syrk`, `herk`, `syr2k`, `her2k`, `syrkx`, `herkx`, `trsm`, `trtri`, `dgmm`, and `geam`
    - General: `set_vector`, `get_vector`, `set_matrix`, and `get_matrix`
    - Related fixes: internal scalar loads with > 32-bit offsets
    - In-place functionality for all `trtri` sizes
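
To make the overflow class behind these fixes concrete, the sketch below emulates a column-major element offset computed in 32-bit versus 64-bit arithmetic. It is a Python illustration under stated assumptions, not rocBLAS source: the function names are hypothetical, and the wraparound models typical two's-complement behavior (signed overflow is formally undefined in C++).

```python
def wrap_int32(v):
    """Emulate two's-complement wraparound of a signed 32-bit integer."""
    v &= 0xFFFFFFFF
    return v - (1 << 32) if v >= (1 << 31) else v


def offset_32bit(row, col, lda):
    """Column-major offset row + col * lda with every step held in 32 bits."""
    return wrap_int32(row + wrap_int32(col * lda))


def offset_64bit(row, col, lda):
    """Same offset with wide arithmetic (Python ints do not overflow)."""
    return row + col * lda
```

For `lda = 70000` and `col = 40000`, the true offset is 2800000000, which exceeds the 32-bit signed maximum of 2147483647; the 32-bit version wraps to -1494967296 and would index outside the matrix.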
- Dot when using `rocblas_pointer_mode_host` is now synchronous in order to match legacy BLAS as it stores results in host memory
- Enhanced reporting of installation issues caused by runtime libraries (Tensile)
- Standardized internal rocBLAS C++ interface across most functions
- `__STDC_WANT_IEC_60559_TYPES_EXT__` define will be removed in a future release
- Optional use of AOCL BLIS 4.0 on Linux for clients
- Optional build tool-only dependency on Python `psutil`
- Level 2 rocBLAS GEMV performance on gfx90a GPU for non-transposed problems that have small matrices (`m` and `n` <= 32) and large batch counts (`batch_count` >= 256)
- rocBLAS syr2k performance for single, double, and double-complex precision
- rocBLAS her2k performance for double-complex precision
- Improved performance for general sizes on gfx90a
- bf16 inputs and f32 compute support to Level 1 rocBLAS extension functions: `axpy_ex`, `scal_ex`, and `nrm2_ex`
- In-place trmm has been replaced with trmm that has in-place and out-of-place functionality
- `rocblas_query_int8_layout_flag()`
- `rocblas_gemm_flags_pack_int8x4`
- `rocblas_set_device_memory_size()` will be replaced with `rocblas_increase_device_memory_size()`
- `rocblas_is_user_managing_device_memory()`
- `is_complex` helper: use `rocblas_is_complex` instead
- The enum `truncate_t`: use `rocblas_truncate_t` instead
- The value `truncate`: use `rocblas_truncate` instead
- `rocblas_set_int8_type_for_hipblas`
- `rocblas_get_int8_type_for_hipblas`
- Python `joblib` build-only dependency (used in Tensile builds)
- Made 64-bit trsm offset calculations safe
- CMake install fixed on some operating systems when using `install.sh -d --cmake_install`
- Refactored ROTG test code
- `rocblas_geam_ex` functionality for matrix-matrix minimum operations
- HIP Graph support (beta feature) for rocBLAS Level 1, Level 2, and Level 3 (pointer mode host) functions
- Beta features API, exposed using compiler define `ROCBLAS_BETA_FEATURES_API`
- Support for vector initialization in the rocBLAS test framework with negative increments
- Windows build documentation for HIP SDK support
- Scripts for plotting the performance of multiple functions
- Performance improvements for Level 2 rocBLAS GEMV for float and double precision (150-200% improvement for certain problem sizes when (m==n) measured on a gfx90a GPU)
- Performance improvements for Level 2 rocBLAS GER for float, double, and complex float precisions (5-7% improvement for certain problem sizes when measured on a gfx90a GPU)
- Performance improvements for Level 2 rocBLAS SYMV for float and double precisions (120-150% improvement for certain problem sizes measured on both gfx908 and gfx90a GPUs)
- Executable mode setting on `rocblas_gentest.py` client script to avoid potential permission errors with clients `rocblas-test` and `rocblas-bench`
- Deprecated API compatibility with Visual Studio compiler
- Test framework memory exception handling for Level 2 functions when the host memory allocation exceeds the available memory
- `install.sh` internally runs `rmake.py` (also used on Windows) and `rmake.py` can be used directly on Linux (use `--help`)
- rocBLAS client executables all now begin with the `rocblas-` prefix
- `install.sh` no longer has the options `-o --cov` because Tensile will now use the default COV format, which is set by `cmake define Tensile_CODE_OBJECT_VERSION=default`
- client smoke test dataset added for quick validation using command `rocblas-test --yaml rocblas_smoke.yaml`
- Added stream order device memory allocation as a non-default beta option.
- Improved trsm performance for small sizes by using a substitution method technique
- Improved syr2k and her2k performance significantly by using a block-recursive algorithm
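
For readers unfamiliar with the substitution approach mentioned for small `trsm` sizes, here is a minimal host-side sketch of forward substitution for a lower-triangular, left-side solve `L * X = alpha * B`. It is plain Python with a hypothetical function name and says nothing about how the rocBLAS kernel is organized on the GPU.

```python
def trsm_lower_left(L, B, alpha=1.0):
    """Solve L * X = alpha * B by forward substitution.

    L is an m x m lower-triangular matrix with a non-zero diagonal,
    B is m x n; both are nested lists. Returns X as a new nested list.
    """
    m, n = len(L), len(B[0])
    X = [[alpha * B[i][j] for j in range(n)] for i in range(m)]
    for j in range(n):                 # each right-hand-side column is independent
        for i in range(m):             # rows are resolved top to bottom
            s = X[i][j]
            for p in range(i):         # subtract contributions of already-solved rows
                s -= L[i][p] * X[p][j]
            X[i][j] = s / L[i][i]
    return X
```

Because each column is solved independently and each row needs only the rows above it, the method maps naturally onto small problems where the overhead of a blocked, gemm-based solve would dominate.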
- Level 2, Level 1, and Extension functions: argument checking when the handle is set to `rocblas_pointer_mode_host` now returns the status of `rocblas_status_invalid_pointer` only for pointers that must be dereferenced based on the alpha and beta argument values. With handle mode `rocblas_pointer_mode_device` only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of `rocblas_status_invalid_pointer`. This improves consistency with legacy BLAS behavior.
- Add variable to turn on/off ieee16/ieee32 tests for mixed precision gemm
- Allow hipBLAS to select int8 datatype
- Disallow B == C && ldb != ldc in `rocblas_xtrmm_outofplace`
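
The pointer-mode rule described above can be paraphrased as a small decision function. The sketch below models a gemm-like call in Python, with `None` standing in for a null pointer; the function name and the simplified argument list are hypothetical, and the real checks in rocBLAS cover more arguments than shown here.

```python
SUCCESS = "rocblas_status_success"
INVALID_POINTER = "rocblas_status_invalid_pointer"


def check_gemm_pointers(pointer_mode, alpha, A, B, C):
    """Return a status string for a simplified C = alpha * A * B + beta * C call.

    In "host" mode alpha is a readable value, so A and B are checked only when
    alpha != 0. In "device" mode alpha cannot be read on the host, so only C,
    which is always written, is checked.
    """
    if C is None:
        return INVALID_POINTER
    if pointer_mode == "host" and alpha != 0 and (A is None or B is None):
        return INVALID_POINTER
    return SUCCESS
```

Under this model a host-mode call with `alpha == 0` and null `A` and `B` succeeds, which matches the legacy BLAS convention that unreferenced operands need not be valid.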
- Fortran interfaces generalized for Fortran compilers other than GFortran
- fix for `trsm_strided_batched` `rocblas-bench` performance gathering
- Fix for rocm-smi path in `commandrunner.py` script to match ROCm 5.2 and above
- `install.sh` option `--upgrade_tensile_venv_pip` to upgrade Pip in Tensile Virtual Environment. The corresponding CMake option is `TENSILE_VENV_UPGRADE_PIP`
- `install.sh` option `--relocatable` or `-r` adds rpath and removes ldconf entry on rocBLAS build
- `install.sh` option `--lazy-library-loading` to enable on-demand loading of tensile library files at runtime to speed up rocBLAS initialization
- Support for RHEL9 and CS9
- Added Numerical checking routine for symmetric, Hermitian, and triangular matrices, so that they could be checked for any numerical abnormalities such as NaN, Zero, infinity and denormal value
- `trmm_outofplace` performance improvements for all sizes and data types using block-recursive algorithm
- herkx performance improvements for all sizes and data types using block-recursive algorithm
- syrk/herk performance improvements by utilising optimised syrkx/herkx code
- symm/hemm performance improvements for all sizes and datatypes using block-recursive algorithm
- Unifying library logic file names: affects HBH (->HHS_BH), BBH (->BBS_BH), 4xi8BH (->4xi8II_BH). All HPA types are using the new naming convention now.
- Level 3 function argument checking when the handle is set to rocblas_pointer_mode_host now returns the status of rocblas_status_invalid_pointer only for pointers that must be dereferenced based on the alpha and beta argument values. With handle mode rocblas_pointer_mode_device only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of rocblas_status_invalid_pointer. This improves consistency with legacy BLAS behaviour
- Level 1, 2, and 3 function argument checking for enums is now more rigorously matching legacy BLAS so returns rocblas_status_invalid_value if arguments do not match the accepted subset
- Add quick-return for internal trmm and gemm template functions
- Moved function block sizes to a shared header file
- Level 1, 2, and 3 functions use rocblas_stride datatype for offset
- Modified the matrix and vector memory allocation in our test infrastructure for all Level 1, 2, 3 and BLAS_EX functions
- Added specific initialization for symmetric, Hermitian, and triangular matrix types in our test infrastructure
- Added NaN tests to the test infrastructure for the rest of Level 3, BLAS_EX functions
- Improved logic to #include `<filesystem>` vs `<experimental/filesystem>`
- `install.sh -s` option to build rocblas as a static library.
- dot function now sets the device results asynchronously for N <= 0
- is_complex helper is now deprecated. Use `rocblas_is_complex` instead
- The enum `truncate_t` and the value `truncate` are now deprecated and will be removed from the ROCm release 6.0. They are replaced by `rocblas_truncate_t` and `rocblas_truncate`, respectively. The new enum `rocblas_truncate_t` and the value `rocblas_truncate` can be used from this ROCm release for an easy transition
- `install.sh` options `--hip-clang`, `--no-hip-clang`, `--merge-files`, `--no-merge-files` are removed
- Packages for test and benchmark executables on all supported operating systems using CPack
- Added denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output vectors of rocBLAS level 1 and 2 functions
- Added denormal number detection to the numerical checking helper function to detect denormal/subnormal numbers in the input and the output general matrices of rocBLAS level 2 and 3 functions
- Added NaN initialization tests to the YAML files of Level 2 rocBLAS batched and strided-batched functions for testing purposes
- Added memory allocation check to avoid disk swapping during rocblas-test runs by skipping tests
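
The kind of classification a numerical checking helper performs (NaN, infinity, zero, denormal) can be sketched on the host as follows. This Python version works on double-precision values and uses hypothetical function names; the rocBLAS helper operates on device data in the data's own precision.

```python
import math
import sys


def classify(x):
    """Classify one floating-point value."""
    if math.isnan(x):
        return "nan"
    if math.isinf(x):
        return "inf"
    if x == 0.0:
        return "zero"
    if abs(x) < sys.float_info.min:    # below the smallest normal double
        return "denormal"
    return "normal"


def check_numerics(values):
    """Count how many values fall into each class."""
    counts = {"nan": 0, "inf": 0, "zero": 0, "denormal": 0, "normal": 0}
    for v in values:
        counts[classify(v)] += 1
    return counts
```

A caller could run such a check on inputs before a call and on outputs after it, and report any non-zero abnormal counts.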
- Improved performance of non-batched and batched her2 for all sizes and data types
- Improved performance of non-batched and batched amin for all data types using shuffle reductions
- Improved performance of non-batched and batched amax for all data types using shuffle reductions
- Improved performance of trsv for all sizes and data types
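
The shuffle-based reductions mentioned for amin and amax follow a tree pattern in which each lane combines its value with the lane `offset` positions higher, halving `offset` each step. The Python sketch below emulates that pattern on a list; it is an illustration with hypothetical names, not GPU code, and returned indices are 0-based here.

```python
def warp_reduce(lanes, combine):
    """Emulate a shuffle-down tree reduction; the result ends up in lane 0."""
    n = len(lanes)
    if n == 0 or n & (n - 1):
        raise ValueError("lane count must be a power of two")
    vals = list(lanes)
    offset = n // 2
    while offset:
        # A lane whose partner is out of range reads its own value back;
        # such lanes hold unused results, only lane 0 matters at the end.
        vals = [combine(vals[i], vals[i + offset] if i + offset < n else vals[i])
                for i in range(n)]
        offset //= 2
    return vals[0]


def add(a, b):
    return a + b


def amax_pair(a, b):
    """Combine (abs_value, index) pairs: larger value wins, ties go to the smaller index."""
    if b[0] > a[0] or (b[0] == a[0] and b[1] < a[1]):
        return b
    return a


def iamax(data):
    """Index of the first element with the largest absolute value."""
    return warp_reduce([(abs(v), i) for i, v in enumerate(data)], amax_pair)[1]
```

A 64-lane sum finishes in 6 combine steps instead of 63 sequential additions, and no shared memory is needed to exchange partial results.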
- Modifying `gemm_ex` for HBH (high-precision F16). The alpha/beta data type remains as F32 without narrowing to F16 and expanding back to F32 in the kernel. This change prevents rounding errors due to alpha/beta conversion in situations where alpha/beta are not exactly represented as an F16
- Modified non-batched and batched asum, nrm2 functions to use shuffle instruction based reductions
- For `gemm`, `gemm_ex`, `gemm_ex2` internal API use `rocblas_stride` datatype for offset
- For symm, hemm, syrk, herk, dgmm, geam internal API use `rocblas_stride` datatype for offset
- AMD copyright year for all rocBLAS files
- For `gemv` (transpose-case), typecasted the 'lda'(offset) datatype to `size_t` during offset calculation to avoid overflow and remove duplicate template functions
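
The rounding issue that the HBH `gemm_ex` change avoids is easy to reproduce on the host. The sketch below uses Python's `struct` module to round a scalar to F16 and to F32; the helper names are hypothetical and the example is only meant to show the size of the conversion error.

```python
import struct


def round_to_f16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_f32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


alpha = 0.1
err_f16 = abs(round_to_f16(alpha) - alpha)   # error if alpha is narrowed to F16
err_f32 = abs(round_to_f32(alpha) - alpha)   # error if alpha stays in F32
```

For `alpha = 0.1`, the F16 value is 0.0999755859375, an error of about 2.4e-5, while the F32 error is about 1.5e-9. Values such as 0.5 or 2.0 are exact in both formats and are unaffected.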
- For function her2 avoid overflow in offset calculation
- For trsm when alpha == 0 and on host, allow A to be nullptr
- Fixed memory access issue in trsv
- Fixed git pre-commit script to update only AMD copyright year
- Fixed dgmm, geam test functions to set correct stride values
- For functions ssyr2k and dsyr2k allow trans == `rocblas_operation_conjugate_transpose`
- Fixed compilation error for clients-only build
- Remove Navi12 (gfx1011) from fat binary
- Option to install script for number of jobs to use for rocBLAS and Tensile compilation (`-j`, `--jobs`)
- Option to install script to build clients without using any Fortran (`--clients_no_fortran`)
- `rocblas_client_initialize` function, to perform rocBLAS initialize for clients (benchmark/test) and report the execution time.
- Added tests for output of reduction functions when given bad input
- Added user specified initialization (`rand_int`/`trig_float`/`hpl`) for initializing matrices and vectors in `rocblas-bench`
- Improved performance of trsm with side == left and n == 1
- Improved performance of trsm with side == left and m <= 32 along with side == right and n <= 32
- For syrkx and trmm internal API use `rocblas_stride` datatype for offset
- For non-batched and batched gemm_ex functions if the C matrix pointer equals the D matrix pointer (aliased) their respective type and leading dimension arguments must now match
- Test client dependencies updated to GTest 1.11
- non-global false positives reported by cppcheck from file based suppression to inline suppression. File based suppression will only be used for global false positives
- Help menu messages in `install.sh`
- For ger function, typecast the 'lda'(offset) datatype to `size_t` during offset calculation to avoid overflow and remove duplicate template functions
- Modified default initialization from `rand_int` to hpl for initializing matrices and vectors in `rocblas-bench`
- For function trmv (non-transposed cases) avoid overflow in offset calculation
- Fixed cppcheck errors/warnings
- Fixed Doxygen warnings
- Added `rocblas_get_version_string_size` convenience function
- Added `rocblas_xtrmm_outofplace`, an out-of-place version of `rocblas_xtrmm`
- Added hpl and trig initialization for `gemm_ex` to `rocblas-bench`
- Added source code gemm. It can be used as an alternative to Tensile for debugging and development
- Added option `ROCM_MATHLIBS_API_USE_HIP_COMPLEX` to opt-in to use `hipFloatComplex` and `hipDoubleComplex`
- Improved performance of non-batched and batched single-precision GER for size m > 1024. Performance enhanced by 5-10% measured on a MI100 (gfx908) GPU.
- Improved performance of non-batched and batched HER for all sizes and data types. Performance enhanced by 2-17% measured on a MI100 (gfx908) GPU.
- Instantiate templated rocBLAS functions to reduce size of `librocblas.so`
- Removed static library dependency on msgpack
- Removed boost dependencies for clients
- Option to install script to build only rocBLAS clients with a pre-built rocBLAS library
- Correctly set output of `nrm2_batched_ex` and `nrm2_strided_batched_ex` when given bad input
- Fix for dgmm with side == `rocblas_side_left` and a negative incx
- Fixed out-of-bounds read for small trsm
- Fixed numerical checking for `tbmv_strided_batched`
- Improved performance of non-batched and batched syr for all sizes and data types
- Improved performance of non-batched and batched hemv for all sizes and data types
- Improved performance of non-batched and batched symv for all sizes and data types
- Improved memory utilization in `rocblas-bench`, `rocblas-test` gemm functions, increasing possible runtime sizes.
- Improved performance of non-batched and batched dot, dotc, and dot_ex for small n, e.g. sdot n <= 31000.
- Improved performance of non-batched and batched trmv for all sizes and matrix types.
- Improved performance of non-batched and batched gemv transpose case for all sizes and datatypes.
- Improved performance of sger and dger for all sizes, in particular the larger dger sizes.
- Improved performance of syrkx for large sizes, including those in rocBLAS Issue #1184.
- Update from C++14 to C++17.
- Packaging split into a runtime package (called rocblas) and a development package (called rocblas-dev for `.deb` packages, and rocblas-devel for `.rpm` packages). The development package depends on runtime. The runtime package suggests the development package for all supported OSes except CentOS 7 to aid in the transition. The suggests feature in packaging is introduced as a deprecated feature and will be removed in a future rocm release.
- For function geam avoid overflow in offset calculation.
- For function syr avoid overflow in offset calculation.
- For function gemv (Transpose-case) avoid overflow in offset calculation.
- For functions ssyrk and dsyrk, allow conjugate-transpose case to match legacy BLAS. Behavior is the same as the transpose case.
- Improved performance of non-batched and batched `rocblas_Xgemv` for gfx908 when m <= 15000 and n <= 15000
- Improved performance of non-batched and batched `rocblas_sgemv` and `rocblas_dgemv` for gfx906 when m <= 6000 and n <= 6000
- Improved the overall performance of non-batched and batched `rocblas_cgemv` for gfx906
- Improved the overall performance of `rocblas_Xtrsv`
- Internal use only APIs prefixed with `rocblas_internal_` and deprecated to discourage use
- Added option to install script to build only rocBLAS clients with a pre-built rocBLAS library
- Supported gemm ext for unpacked int8 input layout on gfx908 GPUs
  - Added new flags `rocblas_gemm_flags::rocblas_gemm_flags_pack_int8x4` to specify if using the packed layout
    - Set the `rocblas_gemm_flags_pack_int8x4` when using packed int8x4; this should always be set on GPUs before gfx908.
    - For gfx908 GPUs, unpacked int8 is supported so no need to set this flag.
    - Note that the default flags value 0 uses unpacked int8, which changes the behaviour of int8 gemm from ROCm 4.1.0
  - Added a query function `rocblas_query_int8_layout_flag` to get the preferable layout of int8 for gemm by device
- Improved performance of single precision copy, swap, and scal when `incx` == 1 and `incy` == 1
- Improved performance of single precision axpy when `incx` == 1, `incy` == 1 and `batch_count` <= 8192
- Improved performance of trmm
- Change `cmake_minimum_required` to VERSION 3.16.8
- Added Numerical checking helper function to detect zero/NaN/Inf in the input and the output vectors of rocBLAS level 1 and 2 functions
- Added Numerical checking helper function to detect zero/NaN/Inf in the input and the output general matrices of rocBLAS level 2 and 3 functions
- Fixed complex unit test bug caused by incorrect caxpy and zaxpy function signatures.
- Make functions compliant with Legacy BLAS for special values `alpha` == 0, `k` == 0, `beta` == 1, `beta` == 0
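
What legacy BLAS compliance means for these special values can be shown with a reference gemm: when `alpha == 0` (or `k == 0`) the A and B operands are not referenced, and when `beta == 0` the input C is not read, so NaN or Inf stored there must not reach the output. The following is a host-side Python sketch with a hypothetical function name, not the rocBLAS implementation.

```python
def gemm_ref(alpha, A, B, beta, C):
    """Reference C_out = alpha * A * B + beta * C with legacy BLAS special cases."""
    m, n, k = len(C), len(C[0]), len(B)
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            acc = 0.0
            if alpha != 0 and k > 0:       # A and B are skipped when alpha == 0 or k == 0
                acc = alpha * sum(A[i][p] * B[p][j] for p in range(k))
            if beta != 0:                  # C is not read when beta == 0
                acc += beta * C[i][j]
            row.append(acc)
        out.append(row)
    return out
```

A naive `alpha * (A * B) + beta * C` would instead compute `0 * NaN = NaN` and propagate garbage from operands that the caller was entitled to leave uninitialized.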
- Improved performance of single precision `axpy_batched` and `axpy_strided_batched`: `batch_count` >= 8192
- Improved performance of trmm.
- Add changelog.
- Improved performance of gemm_batched for small m, n, k and NT, NC, TN, TT, TC, CN, CT, CC
- Improved performance of gemv, gemv_batched, gemv_strided_batched: small n large m
- Removed support for legacy hcc compiler
- Add `rot_ex`, `rot_batched_ex`, and `rot_strided_batched_ex`
- Removed `-DUSE_TENSILE_HOST` from `roc::rocblas` CMake usage requirements. This is a rocblas internal variable, and does not need to be defined in user code
- Improved performance of `gemm_batched` for NN, general m, n, k, small m, n, k
- Slight improvements to FP16 Megatron BERT performance on MI50
- Improvements to FP16 Transformer performance on MI50
- Slight improvements to FP32 Transformer performance on MI50
- Improvements to FP32 DLRM Terabyte performance on gfx908
- added two functions:
  - `rocblas_status rocblas_set_atomics_mode(rocblas_atomics_mode mode)`
  - `rocblas_status rocblas_get_atomics_mode(rocblas_atomics_mode mode)`
- added enum `rocblas_atomics_mode`. It can have two values:
  - `rocblas_atomics_allowed`
  - `rocblas_atomics_not_allowed`

  The default is `rocblas_atomics_not_allowed`
- function `rocblas_Xdgmm` algorithm corrected and `incx=0` support added
- dependencies:
  - `rocblas-tensile` internal component requires msgpack instead of LLVM
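
One reason an atomics mode setting matters: atomic accumulation can add partial results in a different order from run to run, and floating-point addition is not associative. The short Python sketch below shows the order dependence on the host; it uses a hypothetical helper and does not call rocBLAS.

```python
def accumulate(values):
    """Sum values strictly left to right, as one particular arrival order would."""
    total = 0.0
    for v in values:
        total += v
    return total


order_a = accumulate([1e16, 1.0, -1e16])   # the 1.0 is absorbed by 1e16, result 0.0
order_b = accumulate([1e16, -1e16, 1.0])   # the large terms cancel first, result 1.0
```

The same three numbers give 0.0 in one order and 1.0 in the other, which is why disallowing atomics is the usual choice when bitwise reproducibility is required.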
- Moved the following files from /opt/rocm/include to /opt/rocm/include/internal:
  - `rocblas-auxillary.h`
  - `rocblas-complex-types.h`
  - `rocblas-functions.h`
  - `rocblas-types.h`
  - `rocblas-version.h`
  - `rocblas_bfloat16.h`

  These files should NOT be included directly as this may lead to errors. Instead, `/opt/rocm/include/rocblas.h` should be included directly. `/opt/rocm/include/rocblas_module.f90` can also be directly used
- Improvements to `rocblas_Xgemm_batched` performance for small m, n, k
- Improvements to `rocblas_Xgemv_batched` and `rocblas_Xgemv_strided_batched` performance for small m (QMCPACK use)
- Improvements to `rocblas_Xdot` (batched and non-batched) performance when both incx and incy are 1
- Improvements to FP32 ONNX BERT performance for MI50
- Significant improvements to FP32 Resnext, Inception Convolution performance for gfx908
- Slight improvements to FP32 DLRM Terabyte performance for gfx908
- Significant improvements to FP32 BDAS performance for gfx908
- Significant improvements to FP32 BDAS performance for MI50 and MI60
- Added substitution method for small trsm sizes with m <= 64 && n <= 64. Increases performance drastically for small batched trsm
- Improvements to User Guide and Design Document
- L1 dot function optimized to utilize shuffle instructions (improvements on bf16, f16, f32 data types)
- L1 dot function added x dot x optimized kernel
- Standardization of L1 rocblas-bench to use device pointer mode to focus on GPU memory bandwidth
- Adjustments for hipcc (hip-clang) compiler as standard build compiler and Centos8 support
- Added Fortran interface for all rocBLAS functions
- add `geam complex`, `geam_batched`, and `geam_strided_batched`
- add `dgmm`, `dgmm_batched`, and `dgmm_strided_batched`
- Optimized performance
  - ger
    - `rocblas_sger`, `rocblas_dger`
    - `rocblas_sger_batched`, `rocblas_dger_batched`
    - `rocblas_sger_strided_batched`, `rocblas_dger_strided_batched`
  - geru
    - `rocblas_cgeru`, `rocblas_zgeru`
    - `rocblas_cgeru_batched`, `rocblas_zgeru_batched`
    - `rocblas_cgeru_strided_batched`, `rocblas_zgeru_strided_batched`
  - gerc
    - `rocblas_cgerc`, `rocblas_zgerc`
    - `rocblas_cgerc_batched`, `rocblas_zgerc_batched`
    - `rocblas_cgerc_strided_batched`, `rocblas_zgerc_strided_batched`
  - symv
    - `rocblas_ssymv`, `rocblas_dsymv`, `rocblas_csymv`, `rocblas_zsymv`
    - `rocblas_ssymv_batched`, `rocblas_dsymv_batched`, `rocblas_csymv_batched`, `rocblas_zsymv_batched`
    - `rocblas_ssymv_strided_batched`, `rocblas_dsymv_strided_batched`, `rocblas_csymv_strided_batched`, `rocblas_zsymv_strided_batched`
  - sbmv
    - `rocblas_ssbmv`, `rocblas_dsbmv`
    - `rocblas_ssbmv_batched`, `rocblas_dsbmv_batched`
    - `rocblas_ssbmv_strided_batched`, `rocblas_dsbmv_strided_batched`
  - spmv
    - `rocblas_sspmv`, `rocblas_dspmv`
    - `rocblas_sspmv_batched`, `rocblas_dspmv_batched`
    - `rocblas_sspmv_strided_batched`, `rocblas_dspmv_strided_batched`
- improved documentation.
- Fix argument checking in functions to match legacy BLAS.
- Fixed conjugate-transpose version of geam.
- Compilation for GPU Targets:
  When using the install.sh script for "all" GPU Targets, which is the default, you must first set an environment variable `HCC_AMDGPU_TARGET` listing the GPU targets, e.g. `HCC_AMDGPU_TARGET=gfx803,gfx900,gfx906,gfx908`. If building for a specific architecture(s) using the `-a | --architecture` flag, you should also set the environment variable `HCC_AMDGPU_TARGET` to match. Mismatching the environment variable to the `-a` flag architectures creates builds that may result in `SEGFAULTS` when running on GPUs which weren't specified.
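
A pre-build sanity check for the mismatch described above could look like the sketch below. It is a hypothetical Python helper, not part of `install.sh`, and it assumes both the environment variable and the `-a` value are comma-separated lists of gfx names.

```python
def parse_targets(text):
    """Split a comma-separated target list into a set, ignoring blanks."""
    return {t.strip() for t in text.split(",") if t.strip()}


def targets_match(env_value, arch_flag):
    """True when HCC_AMDGPU_TARGET and the -a flag name the same set of targets."""
    return parse_targets(env_value) == parse_targets(arch_flag)
```

For example, `targets_match("gfx906,gfx908", "gfx908, gfx906")` is true because order and spacing are ignored, while comparing the full default list against a single architecture is false and would indicate a build at risk of the segfaults noted above.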