pyvkfft-2022.1.1/0000755000076500000240000000000014202465377014233 5ustar vincentstaff00000000000000pyvkfft-2022.1.1/LICENSE0000644000076500000240000000214514202465263015234 0ustar vincentstaff00000000000000MIT License Copyright (c) 2021 ESRF-European Synchrotron Radiation Facility / Vincent Favre-Nicolin Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
pyvkfft-2022.1.1/MANIFEST.in0000644000076500000240000000010214202465263015754 0ustar vincentstaff00000000000000include src/vkFFT.h include LICENSE LICENSE_VkFFT README_VkFFT.md pyvkfft-2022.1.1/PKG-INFO0000644000076500000240000002677614202465377015352 0ustar vincentstaff00000000000000Metadata-Version: 1.2 Name: pyvkfft Version: 2022.1.1 Summary: Python wrapper for the CUDA and OpenCL backends of VkFFT,providing GPU FFT for PyCUDA, PyOpenCL and CuPy Home-page: https://github.com/vincefn/pyvkfft Author: Vincent Favre-Nicolin Author-email: favre@esrf.fr License: UNKNOWN Project-URL: Bug Tracker, https://github.com/vincefn/pyvkfft/issues Project-URL: VkFFT project, https://github.com/DTolm/VkFFT Description: pyvkfft - python interface to the CUDA and OpenCL backends of VkFFT (Vulkan Fast Fourier Transform library) =========================================================================================================== `VkFFT `_ is a GPU-accelerated Fast Fourier Transform library for Vulkan/CUDA/HIP/OpenCL. pyvkfft offers a simple python interface to the **CUDA** and **OpenCL** backends of VkFFT, compatible with **pyCUDA**, **CuPy** and **pyOpenCL**. Installation ------------ Install using ``pip install pyvkfft`` (works on macOS, Linux and Windows). Notes: - the PyPI package includes ``vkfft.h`` and will automatically install ``pyopencl`` if opencl is available. However you should manually install either ``cupy`` or ``pycuda`` to use the cuda backend. - if you want to specify the backend to be installed (which can be necessary e.g. if you have ``nvcc`` installed but cuda is not actually available), you can do that using e.g. ``VKFFT_BACKEND=opencl pip install pyvkfft``. By default the opencl backend is always installed, and the cuda one if nvcc is found. 
Requirements: - ``pyopencl`` and the opencl libraries/development tools for the opencl backend - ``pycuda`` or ``cupy`` and CUDA development tools (``nvcc``) for the cuda backend - ``numpy`` - on Windows, this requires Visual Studio (C++ tools) and a CUDA toolkit installation, with either the CUDA_PATH or CUDA_HOME environment variable set. - *Only when installing from source*: ``vkfft.h`` installed in the usual include directories, or in the 'src' directory This package can be installed from source using ``pip install .``. *Note:* ``python setup.py install`` is now disabled, to avoid messed up environments where both methods have been used. Examples -------- The simplest way to use pyvkfft is to use the ``pyvkfft.fft`` interface, which will automatically create the VkFFTApp (the FFT plans) according to the type of GPU arrays (pycuda, pyopencl or cupy), and also cache these apps: .. code-block:: python import pycuda.autoinit import pycuda.gpuarray as cua from pyvkfft.fft import fftn import numpy as np d0 = cua.to_gpu(np.random.uniform(0,1,(200,200)).astype(np.complex64)) # This will compute the fft to a new GPU array d1 = fftn(d0) # An in-place transform can also be done by specifying the destination d0 = fftn(d0, d0) # Or an out-of-place transform to an existing array (the destination array is always returned) d1 = fftn(d0, d1) See the scripts and notebooks in the examples directory. An example notebook is also `available on google colab `_. Make sure to select a GPU for the runtime. Features -------- - CUDA (using PyCUDA or CuPy) and OpenCL (using PyOpenCL) backends - C2C, R2C/C2R for inplace and out-of-place transforms - Discrete Cosine Transform (DCT) of type 1, 2, 3 and 4 (EXPERIMENTAL, comparisons with scipy DCT transforms are OK, but there are limitations on the array dimensions) - single and double precision for all transforms (double precision requires device support) - 1D, 2D and 3D transforms. - arrays can have more dimensions than the FFT (batch transforms). 
- arbitrary array size, using Bluestein algorithm for prime numbers>13 (note that in this case the performance can be significantly lower, up to ~4x, depending on the transform size, see example performance plot below) - transform along a given list of axes - this requires that after collapsing non-transformed axes, the last transformed axis is at most along the 3rd dimension, e.g. the following axes are allowed: (-2,-3), (-1,-3), (-1,-4), (-4,-5),... but not (-2, -4), (-1, -3, -4) or (-2, -3, -4). This is not allowed for R2C transforms. - normalisation=0 (array L2 norm * array size on each transform) and 1 (the backward transform divides the L2 norm by the array size, so FFT*iFFT restores the original array) - unit tests for all transforms: see test sub-directory. Note that these take a **long** time to finish due to the exhaustive number of sub-tests. - Note that out-of-place C2R transform currently destroys the complex array for FFT dimensions >=2 - tested on macOS (10.13.6), Linux (Debian/Ubuntu, x86-64 and power9), and Windows 10 (Anaconda python 3.8 with Visual Studio 2019 and the CUDA toolkit 11.2) - GPUs tested: mostly nVidia cards, but also some AMD cards and macOS with M1 GPUs. - inplace transforms do not require an extra buffer or work area (as in cuFFT), unless the x size is larger than 8192, or if the y and z FFT size are larger than 2048. In that case a buffer of a size equal to the array is necessary. This makes larger FFT transforms possible based on memory requirements (even for R2C !) compared to cuFFT. For example you can compute the 3D FFT for a 1600**3 complex64 array with 32GB of memory. - transforms can either be done by creating a VkFFTApp (a.k.a. 
the fft 'plan'), with the selected backend (``pyvkfft.cuda`` for pycuda/cupy or ``pyvkfft.opencl`` for pyopencl) or by using the ``pyvkfft.fft`` interface with the ``fftn``, ``ifftn``, ``rfftn`` and ``irfftn`` functions which automatically detect the type of GPU array and cache the corresponding VkFFTApp (see the example notebook pyvkfft-fft.ipynb). - the ``pyvkfft-test`` command-line script allows testing specific transforms against expected accuracy values, for all types of transforms. - pyvkfft results are now evaluated before any release with a comprehensive test suite, comparing transform results for all types of transforms: single and double precision, 1D, 2D and 3D, inplace and out-of-place, different norms, radix and Bluestein, etc... See ``pyvkfft/pyvkfft_test_suite.py`` to run the full suite, which takes 28 hours on a V100 GPU using up to 20 parallel processes. Performance ----------- See the benchmark notebook, which plots OpenCL and CUDA backend throughput, and compares them with cuFFT (using scikit-cuda) and clFFT (using gpyfft). Example result for batched 2D FFT with array dimensions of batch x N x N using a Titan V: .. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-TITAN_V-Linux.png Notes regarding this plot: * the computed throughput is *theoretical*, as if each transform axis for the couple (FFT, iFFT) required exactly one read and one write. This is obviously not true, and explains the drop after N=1024 for cuFFT and (to a lesser extent) vkFFT. * the batch size is adapted for each N so the transform takes long enough; in practice the transformed array is around 600MB. Transforms on small arrays with small batch sizes could yield lower performance, or better performance when fully cached. 
* a number of blue + (CuFFT) are actually performed as radix-N transforms with 71024 vkFFT is much more efficient than cuFFT due to the smaller number of read and write per FFT axis (apart from isolated radix-2 3 sizes) * the OpenCL and CUDA backends of vkFFT perform similarly, though there are ranges where CUDA performs better, due to different cache . [Note that if the card is also used for display, then difference can increase, e.g. for nvidia cards opencl performance is more affected when being used for display than the cuda backend] * clFFT (via gpyfft) generally performs much worse than the other transforms, though this was tested using nVidia cards. (Note that the clFFT/gpyfft benchmark tries all FFT axis permutations to find the fastest combination) Accuracy -------- See the accuracy notebook, which allows to compare the accuracy for different FFT libraries (pyvkfft with different options and backend, scikit-cuda (cuFFT), pyfftw), using pyfftw long-double precision as a reference. Example results for 1D transforms (radix 2,3,5 and 7) using a Titan V: .. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/accuracy-1DFFT-TITAN_V.png Analysis: * in single precision on the nVidia Titan V card, the VkFFT computed accuracy is about 3 times larger (worse) than pyfftw (also computed in single precision), e.g. 6e-7 vs 2e-7, which can be pretty negligible for most applications. However when using a lookup-table for trigonometric values instead of hardware functions (useLUT=1 in VkFFTApp), the accuracy is identical to pyfftw, and better than cuFFT. * accuracy is the same for cuda and opencl, though this can depend on the card and drivers used (e.g. it's different on a GTX 1080) You can easily test a transform using the ``pyvkfft-test`` command line script, e.g.: ``pyvkfft-test --systematic --backend pycuda --nproc 8 --range 2 4500 --radix --ndim 2`` Use ``pyvkfft-test --help`` to list available options. 
You can use the ``pyvkfft/pyvkfft_test_suite.py`` script to run the comprehensive test suite which is used to evaluate pyvkfft before a new release. TODO ---- - access to the other backends: - for vulkan and rocm this only makes sense combined with a pycuda/cupy/pyopencl equivalent. - out-of-place C2R transform without modifying the C array ? This would require using a R array padded with two columns, as for the inplace transform - half precision ? - convolution ? - zero-padding ? - access to tweaking parameters in VkFFTConfiguration ? - access to the code of the generated kernels ? Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: License :: OSI Approved :: Mozilla Public License 2.0 (MPL 2.0) Classifier: Operating System :: OS Independent Classifier: Environment :: GPU pyvkfft-2022.1.1/README.rst0000644000076500000240000002261414202465263015721 0ustar vincentstaff00000000000000pyvkfft - python interface to the CUDA and OpenCL backends of VkFFT (Vulkan Fast Fourier Transform library) =========================================================================================================== `VkFFT `_ is a GPU-accelerated Fast Fourier Transform library for Vulkan/CUDA/HIP/OpenCL. pyvkfft offers a simple python interface to the **CUDA** and **OpenCL** backends of VkFFT, compatible with **pyCUDA**, **CuPy** and **pyOpenCL**. Installation ------------ Install using ``pip install pyvkfft`` (works on macOS, Linux and Windows). Notes: - the PyPI package includes ``vkfft.h`` and will automatically install ``pyopencl`` if opencl is available. However you should manually install either ``cupy`` or ``pycuda`` to use the cuda backend. - if you want to specify the backend to be installed (which can be necessary e.g. if you have ``nvcc`` installed but cuda is not actually available), you can do that using e.g. ``VKFFT_BACKEND=opencl pip install pyvkfft``. By default the opencl backend is always installed, and the cuda one if nvcc is found. 
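Since the cuda backend needs ``pycuda`` or ``cupy`` to be installed separately, it can help to check which GPU array packages Python can find before using pyvkfft. The following is only a minimal sketch: the helper name ``available_gpu_packages`` is hypothetical and not part of pyvkfft, and finding a package does not guarantee that a working GPU or driver is present.

```python
import importlib.util


def available_gpu_packages(names=("pyopencl", "pycuda", "cupy")):
    """Hypothetical helper (not part of pyvkfft): report which packages can be located.

    This only checks that the top-level package is discoverable by the import
    system, without importing it or touching any GPU.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Prints a dictionary such as {'pyopencl': ..., 'pycuda': ..., 'cupy': ...}
    print(available_gpu_packages())
```

If none of the three packages is found, installing ``pyopencl``, ``pycuda`` or ``cupy`` is the first step before any transform can be run.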
Requirements: - ``pyopencl`` and the opencl libraries/development tools for the opencl backend - ``pycuda`` or ``cupy`` and CUDA development tools (``nvcc``) for the cuda backend - ``numpy`` - on Windows, this requires Visual Studio (C++ tools) and a CUDA toolkit installation, with either the CUDA_PATH or CUDA_HOME environment variable set. - *Only when installing from source*: ``vkfft.h`` installed in the usual include directories, or in the 'src' directory This package can be installed from source using ``pip install .``. *Note:* ``python setup.py install`` is now disabled, to avoid messed up environments where both methods have been used. Examples -------- The simplest way to use pyvkfft is to use the ``pyvkfft.fft`` interface, which will automatically create the VkFFTApp (the FFT plans) according to the type of GPU arrays (pycuda, pyopencl or cupy), and also cache these apps: .. code-block:: python import pycuda.autoinit import pycuda.gpuarray as cua from pyvkfft.fft import fftn import numpy as np d0 = cua.to_gpu(np.random.uniform(0,1,(200,200)).astype(np.complex64)) # This will compute the fft to a new GPU array d1 = fftn(d0) # An in-place transform can also be done by specifying the destination d0 = fftn(d0, d0) # Or an out-of-place transform to an existing array (the destination array is always returned) d1 = fftn(d0, d1) See the scripts and notebooks in the examples directory. An example notebook is also `available on google colab `_. Make sure to select a GPU for the runtime. Features -------- - CUDA (using PyCUDA or CuPy) and OpenCL (using PyOpenCL) backends - C2C, R2C/C2R for inplace and out-of-place transforms - Discrete Cosine Transform (DCT) of type 1, 2, 3 and 4 (EXPERIMENTAL, comparisons with scipy DCT transforms are OK, but there are limitations on the array dimensions) - single and double precision for all transforms (double precision requires device support) - 1D, 2D and 3D transforms. - arrays can have more dimensions than the FFT (batch transforms). 
- arbitrary array size, using Bluestein algorithm for prime numbers>13 (note that in this case the performance can be significantly lower, up to ~4x, depending on the transform size, see example performance plot below) - transform along a given list of axes - this requires that after collapsing non-transformed axes, the last transformed axis is at most along the 3rd dimension, e.g. the following axes are allowed: (-2,-3), (-1,-3), (-1,-4), (-4,-5),... but not (-2, -4), (-1, -3, -4) or (-2, -3, -4). This is not allowed for R2C transforms. - normalisation=0 (array L2 norm * array size on each transform) and 1 (the backward transform divides the L2 norm by the array size, so FFT*iFFT restores the original array) - unit tests for all transforms: see test sub-directory. Note that these take a **long** time to finish due to the exhaustive number of sub-tests. - Note that out-of-place C2R transform currently destroys the complex array for FFT dimensions >=2 - tested on macOS (10.13.6), Linux (Debian/Ubuntu, x86-64 and power9), and Windows 10 (Anaconda python 3.8 with Visual Studio 2019 and the CUDA toolkit 11.2) - GPUs tested: mostly nVidia cards, but also some AMD cards and macOS with M1 GPUs. - inplace transforms do not require an extra buffer or work area (as in cuFFT), unless the x size is larger than 8192, or if the y and z FFT size are larger than 2048. In that case a buffer of a size equal to the array is necessary. This makes larger FFT transforms possible based on memory requirements (even for R2C !) compared to cuFFT. For example you can compute the 3D FFT for a 1600**3 complex64 array with 32GB of memory. - transforms can either be done by creating a VkFFTApp (a.k.a. 
the fft 'plan'), with the selected backend (``pyvkfft.cuda`` for pycuda/cupy or ``pyvkfft.opencl`` for pyopencl) or by using the ``pyvkfft.fft`` interface with the ``fftn``, ``ifftn``, ``rfftn`` and ``irfftn`` functions which automatically detect the type of GPU array and cache the corresponding VkFFTApp (see the example notebook pyvkfft-fft.ipynb). - the ``pyvkfft-test`` command-line script allows testing specific transforms against expected accuracy values, for all types of transforms. - pyvkfft results are now evaluated before any release with a comprehensive test suite, comparing transform results for all types of transforms: single and double precision, 1D, 2D and 3D, inplace and out-of-place, different norms, radix and Bluestein, etc... See ``pyvkfft/pyvkfft_test_suite.py`` to run the full suite, which takes 28 hours on a V100 GPU using up to 20 parallel processes. Performance ----------- See the benchmark notebook, which plots OpenCL and CUDA backend throughput, and compares them with cuFFT (using scikit-cuda) and clFFT (using gpyfft). Example result for batched 2D FFT with array dimensions of batch x N x N using a Titan V: .. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-TITAN_V-Linux.png Notes regarding this plot: * the computed throughput is *theoretical*, as if each transform axis for the couple (FFT, iFFT) required exactly one read and one write. This is obviously not true, and explains the drop after N=1024 for cuFFT and (to a lesser extent) vkFFT. * the batch size is adapted for each N so the transform takes long enough; in practice the transformed array is around 600MB. Transforms on small arrays with small batch sizes could yield lower performance, or better performance when fully cached. 
* a number of blue + (CuFFT) are actually performed as radix-N transforms with 71024 vkFFT is much more efficient than cuFFT due to the smaller number of read and write per FFT axis (apart from isolated radix-2 3 sizes) * the OpenCL and CUDA backends of vkFFT perform similarly, though there are ranges where CUDA performs better, due to different cache . [Note that if the card is also used for display, then difference can increase, e.g. for nvidia cards opencl performance is more affected when being used for display than the cuda backend] * clFFT (via gpyfft) generally performs much worse than the other transforms, though this was tested using nVidia cards. (Note that the clFFT/gpyfft benchmark tries all FFT axis permutations to find the fastest combination) Accuracy -------- See the accuracy notebook, which allows to compare the accuracy for different FFT libraries (pyvkfft with different options and backend, scikit-cuda (cuFFT), pyfftw), using pyfftw long-double precision as a reference. Example results for 1D transforms (radix 2,3,5 and 7) using a Titan V: .. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/accuracy-1DFFT-TITAN_V.png Analysis: * in single precision on the nVidia Titan V card, the VkFFT computed accuracy is about 3 times larger (worse) than pyfftw (also computed in single precision), e.g. 6e-7 vs 2e-7, which can be pretty negligible for most applications. However when using a lookup-table for trigonometric values instead of hardware functions (useLUT=1 in VkFFTApp), the accuracy is identical to pyfftw, and better than cuFFT. * accuracy is the same for cuda and opencl, though this can depend on the card and drivers used (e.g. it's different on a GTX 1080) You can easily test a transform using the ``pyvkfft-test`` command line script, e.g.: ``pyvkfft-test --systematic --backend pycuda --nproc 8 --range 2 4500 --radix --ndim 2`` Use ``pyvkfft-test --help`` to list available options. 
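The accuracy figures quoted above are relative errors with respect to a reference transform. As an illustration only, the sketch below computes a relative L2 and a relative L-infinity error with NumPy, following the ``l2`` and ``li`` helpers found in ``pyvkfft/accuracy.py``. No GPU is involved: the "approximate" result is simply the reference rounded to single precision, which is an assumption made to keep the example self-contained.

```python
import numpy as np


def l2(a, b):
    """Relative L2 error of b with respect to the reference a."""
    return np.sqrt((abs(a - b) ** 2).sum() / (abs(a) ** 2).sum())


def li(a, b):
    """Relative L-infinity (maximum) error of b with respect to the reference a."""
    return abs(a - b).max() / abs(a).max()


# Random complex array with real and imaginary parts in [-0.5, 0.5)
rng = np.random.default_rng(0)
d0 = rng.uniform(-0.5, 0.5, (64, 64)) + 1j * rng.uniform(-0.5, 0.5, (64, 64))

# Reference transform computed in double precision
ref = np.fft.fftn(d0)

# Stand-in for a GPU result (assumption): the reference rounded to complex64
approx = ref.astype(np.complex64)

print("L2 error: %.2e, Linf error: %.2e" % (l2(ref, approx), li(ref, approx)))
```

With a real GPU transform, ``approx`` would be replaced by the array copied back from the device, and the reference would ideally be computed in a higher precision than the tested transform.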
You can use the ``pyvkfft/pyvkfft_test_suite.py`` script to run the comprehensive test suite which is used to evaluate pyvkfft before a new release. TODO ---- - access to the other backends: - for vulkan and rocm this only makes sense combined with a pycuda/cupy/pyopencl equivalent. - out-of-place C2R transform without modifying the C array ? This would require using a R array padded with two columns, as for the inplace transform - half precision ? - convolution ? - zero-padding ? - access to tweaking parameters in VkFFTConfiguration ? - access to the code of the generated kernels ? pyvkfft-2022.1.1/pyvkfft/0000755000076500000240000000000014202465377015724 5ustar vincentstaff00000000000000pyvkfft-2022.1.1/pyvkfft/__init__.py0000644000076500000240000000024014202465263020023 0ustar vincentstaff00000000000000# -*- coding: utf-8 -*- # PyVkFFT # (c) 2021- : ESRF-European Synchrotron Radiation Facility # authors: # Vincent Favre-Nicolin, favre@esrf.fr pyvkfft-2022.1.1/pyvkfft/accuracy.py0000644000076500000240000005073014202465263020067 0ustar vincentstaff00000000000000# -*- coding: utf-8 -*- # PyVkFFT # (c) 2021- : ESRF-European Synchrotron Radiation Facility # authors: # Vincent Favre-Nicolin, favre@esrf.fr # # # Functions for accuracy tests. 
import os
import multiprocessing
import timeit
import atexit
import psutil
import numpy as np
from numpy.fft import fftn, ifftn, rfftn, irfftn

try:
    # We prefer scipy over numpy for fft, and we can also test dct
    from scipy.fft import dctn, idctn, fftn, ifftn, rfftn, irfftn

    has_dct_ref = True
    has_scipy = True
except ImportError:
    has_dct_ref = False
    has_scipy = False

# pyfftw speed is not good compared to scipy, when using every transform once
# try:
#     from pyfftw.interfaces import scipy_fft as pyfftw_fft
#     if not has_scipy:
#         from pyfftw.interfaces.scipy_fft import dctn, idctn, fftn, ifftn, rfftn, irfftn
#
#     has_pyfftw = True
#     has_dct_ref = True
# except ImportError:
#     has_pyfftw = False

try:
    import pyopencl as cl
    import pyopencl.array as cla
    from pyvkfft.opencl import VkFFTApp as clVkFFTApp, primes

    has_opencl = True
except ImportError:
    has_opencl = False

try:
    from pyvkfft.cuda import VkFFTApp as cuVkFFTApp, primes, has_pycuda, has_cupy

    if has_pycuda:
        import pycuda.driver as cu_drv
        import pycuda.gpuarray as cua
    if has_cupy:
        import cupy as cp
except ImportError:
    has_cupy = False
    has_pycuda = False

# Dictionary of cuda/opencl (device, context). Will be initialised on-demand.
# This is needed for multiprocessing.
# The pyopencl entry is a tuple with (device, context, queue, has_cl_fp64)
gpu_ctx_dic = {}


def init_ctx(backend, gpu_name=None, verbose=False):
    if backend in gpu_ctx_dic:
        return
    if backend == "pycuda":
        if not has_pycuda:
            raise RuntimeError("init_ctx: backend=%s is not available" % backend)
        cu_drv.init()
        d = None
        if gpu_name is not None:
            for i in range(cu_drv.Device.count()):
                if gpu_name.lower() in cu_drv.Device(i).name().lower():
                    d = cu_drv.Device(i)
                    break
        else:
            d = cu_drv.Device(0)
        if d is None:
            if gpu_name is not None:
                raise RuntimeError("Selected backend is pycuda, but no device found (name=%s)" % gpu_name)
            else:
                raise RuntimeError("Selected backend is pycuda, but no device found")
        gpu_ctx_dic["pycuda"] = (d, d.make_context())
        if verbose:
            print("Selected device for pycuda: %s" % d.name())
    elif backend == "pyopencl":
        if not has_opencl:
            raise RuntimeError("init_ctx: backend=%s is not available" % backend)
        d = None
        for p in cl.get_platforms():
            if d is not None:
                break
            for d0 in p.get_devices():
                if d0.type & cl.device_type.GPU:
                    if gpu_name is not None:
                        if gpu_name.lower() in d0.name.lower():
                            d = d0
                    else:
                        d = d0
                if d is not None:
                    break
        if d is None:
            if gpu_name is not None:
                raise RuntimeError("Selected backend is pyopencl, but no device found (name=%s)" % gpu_name)
            else:
                raise RuntimeError("Selected backend is pyopencl, but no device found")
        cl_ctx = cl.Context([d])
        cq = cl.CommandQueue(cl_ctx)
        gpu_ctx_dic["pyopencl"] = d, cl_ctx, cq, 'cl_khr_fp64' in cq.device.extensions
        if verbose:
            print("Selected device for pyopencl: %s [%s]" % (d.name, p.name))
    elif backend == "cupy":
        if not has_cupy:
            raise RuntimeError("init_ctx: backend=%s is not available" % backend)
        # Is it possible to select a device by name with cupy ?
        # The name does not appear in the device attributes
        gpu_ctx_dic["cupy"] = cp.cuda.Device(0).use()
        # TODO: The following somehow helps initialising cupy, not sure why it's useful.
        # (some context auto-init...). Otherwise a cuLaunchKernel error occurs with
        # the first transform.
        cupy_a = cp.array(np.zeros((128, 128), dtype=np.float32))
        cupy_a.sum()
    else:
        raise RuntimeError("init_ctx: unknown backend ", backend)


def cleanup_cu_ctx():
    # Is that really clean ?
    if has_pycuda:
        if cu_drv.Context is not None:
            while cu_drv.Context.get_current() is not None:
                cu_drv.Context.pop()


atexit.register(cleanup_cu_ctx)


def l2(a, b):
    """L2 norm"""
    return np.sqrt((abs(a - b) ** 2).sum() / (abs(a) ** 2).sum())


def li(a, b):
    """Linf norm"""
    return abs(a - b).max() / abs(a).max()


def test_accuracy(backend, shape, ndim, axes, dtype, inplace, norm, use_lut, r2c=False, dct=False,
                  gpu_name=None, stream=None, queue=None, return_array=False, init_array=None,
                  verbose=False, colour_output=False, ref_long_double=True):
    """
    Measure the accuracy of the forward and backward transforms.

    :param backend: either 'pyopencl', 'pycuda' or 'cupy'
    :param shape: the shape of the array to test. If this is an inplace r2c, the
        x-axis length must be even, and two extra values will be appended along x,
        so the actual transform shape is the one supplied
    :param ndim: the number of FFT dimensions. Can be None if axes is given
    :param axes: the transform axes. Supersedes ndim
    :param dtype: either np.complex64 or np.complex128, or np.float32/np.float64
        for r2c & dct
    :param inplace: if True, make an inplace transform. Note that for inplace r2c
        transforms, the size for the last (x, fastest) axis must be even.
    :param norm: either 0, 1 or "ortho"
    :param use_lut: if True,1, False or 0, will trigger useLUT=1 or 0 for VkFFT.
        If None, the default VkFFT behaviour is used.
    :param r2c: if True, test an r2c transform. If inplace, the last dimension
        (x, fastest axis) must be even
    :param dct: either 1, 2, 3 or 4 to test different dct. Only norm=1 can be
        tested (native scipy normalisation).
    :param gpu_name: the name of the gpu to use. If None, the first available
        for the backend will be used.
    :param stream: the cuda stream to use, or None
    :param queue: the opencl queue to use (mandatory for the 'pyopencl' backend)
    :param return_array: if True, will return the generated random array so it
        can be re-used for different parameters
    :param init_array: the initial (numpy) random array to use (should be filled
        with uniform random numbers between +/-0.5 for both real and imaginary
        fields), to save time. The correct type will be applied. If None, a
        random array is generated.
    :param verbose: if True, print a 1-line info for both fft and ifft results
    :param colour_output: if True, use some colour to tag the quality of the accuracy
    :param ref_long_double: if True and scipy is available, long double precision
        will be used for the reference transform. Otherwise, this is ignored.
    :return: a dictionary with (n2, ni, n2i, nii, tol, dt_array, dt_app,
        dt_fft, dt_ifft, src_unchanged_fft, src_unchanged_ifft, tol_test, str),
        with the L2 and Linf normalised norms comparing pyvkfft's result with
        either numpy or scipy, the reference tolerance, and the times spent in
        preparing the initial random array, creating the VkFFT app, and
        performing the forward and backward transforms (including the GPU and
        reference transforms, plus the L2 and Linf computations - don't use this
        for benchmarking), 'src_unchanged_fft' and 'src_unchanged_ifft' are True
        if for an out-of-place transform, the source array is actually unmodified
        (which is not true for r2c ifft with ndim>=2). The last fields are
        'tol_test' which is True if both ni and nii are smaller than tol, and
        str the string summarising the results (printed if verbose is True).
        If return_array is True, the initial random array used is returned as 'd0'.
        All input parameters are also returned as key/values, except stream,
        queue, return_array, init_array and verbose.
    """
    if backend == "cupy" and has_cupy:
        mempool = cp.get_default_memory_pool()
        if mempool is not None:  # Is that test necessary ?
            # Clean memory pool, we are changing array sizes constantly, and using
            # N parallel process so memory management must be done manually
            mempool.free_all_blocks()
    t0 = timeit.default_timer()
    init_ctx(backend, gpu_name=gpu_name, verbose=False)
    if backend == "pyopencl" and queue is None:
        queue = gpu_ctx_dic["pyopencl"][2]
    shape0 = shape
    dtype0 = dtype
    if dtype in (np.complex64, np.float32):
        dtype = np.complex64
        dtypef = np.float32
    else:
        dtype = np.complex128
        dtypef = np.float64

    if dct:
        if norm != 1:
            raise RuntimeError("test_accuracy: only norm=1 can be used with dct")

    if r2c:
        if inplace:
            # Add two extra columns in the source array
            # so the transform has the desired shape
            shape = list(shape)
            shape[-1] += 2
        else:
            shapec = list(shape)
            shapec[-1] = shapec[-1] // 2 + 1
            shapec = tuple(shapec)
    else:
        shapec = tuple(shape)
    shape = tuple(shape)

    if init_array is not None:
        if r2c:
            if inplace:
                d0 = np.empty(shape, dtype=dtypef)
                d0[..., :-2] = init_array
            else:
                d0 = init_array.astype(dtypef)
        elif dct:
            d0 = init_array.astype(dtypef)
        else:
            d0 = init_array.astype(dtype)
    else:
        if r2c or dct:
            d0 = np.random.uniform(-0.5, 0.5, shape).astype(dtypef)
        else:
            d0 = (np.random.uniform(-0.5, 0.5, shape)
                  + 1j * np.random.uniform(-0.5, 0.5, shape)).astype(dtype)
    t1 = timeit.default_timer()

    if 'opencl' in backend:
        app = clVkFFTApp(d0.shape, d0.dtype, queue, ndim=ndim, norm=norm,
                         axes=axes, useLUT=use_lut, inplace=inplace,
                         r2c=r2c, dct=dct)
        t2 = timeit.default_timer()
        d_gpu = cla.to_device(queue, d0)
    else:
        if backend == "pycuda":
            to_gpu = cua.to_gpu
        else:
            to_gpu = cp.array
        app = cuVkFFTApp(d0.shape, d0.dtype, ndim=ndim, norm=norm, axes=axes,
                         useLUT=use_lut, inplace=inplace, r2c=r2c,
                         dct=dct, stream=stream)
        t2 = timeit.default_timer()
        d_gpu = to_gpu(d0)

    if axes is None:
        axes_numpy = list(range(len(shape)))[-ndim:]
    else:
        axes_numpy = axes

    # base FFT scale for numpy (not used for DCT)
    s = np.sqrt(np.prod([d0.shape[i] for i in axes_numpy]))
    if r2c and inplace:
        s = np.sqrt(s ** 2 / d0.shape[-1] * (d0.shape[-1] - 2))

    # Tolerance estimated from accuracy notebook
    if dtype in (np.complex64, np.float32):
        tol = 2e-6 + 5e-7 * np.log10(s ** 2)
    else:
        tol = 5e-15 + 5e-16 * np.log10(s ** 2)

    n = max(shape)
    bluestein = max(primes(n)) > 13
    if bluestein:
        tol *= 2

    # FFT
    if inplace:
        d1_gpu = d_gpu
    else:
        if r2c:
            if backend == "pyopencl":
                d1_gpu = cla.empty(queue, shapec, dtype=dtype)
            elif backend == "pycuda":
                d1_gpu = cua.empty(shapec, dtype=dtype)
            elif backend == "cupy":
                d1_gpu = cp.empty(shapec, dtype=dtype)
        else:
            d1_gpu = d_gpu.copy()

    if has_scipy and ref_long_double:
        # Use long double precision
        if r2c or dct:
            d0n = d0.astype(np.longdouble)
        else:
            d0n = d0.astype(np.clongdouble)
    else:
        d0n = d0

    d1_gpu = app.fft(d_gpu, d1_gpu)
    if not dct:
        d1_gpu *= app.get_fft_scale()

    if r2c:
        if inplace:
            d = rfftn(d0n[..., :-2], axes=axes_numpy) / s
        else:
            d = rfftn(d0n, axes=axes_numpy) / s
    elif dct:
        d = dctn(d0n, axes=axes_numpy, type=dct)
    else:
        d = fftn(d0n, axes=axes_numpy) / s

    if inplace and r2c:
        assert d1_gpu.dtype == dtype, "The array type is incorrect after an inplace FFT"

    n2, ni = l2(d, d1_gpu.get()), li(d, d1_gpu.get())
    src_unchanged_fft = np.all(np.equal(d_gpu.get(), d0))

    # Output string
    if r2c:
        t = "R2C"
    elif dct:
        t = "DCT%d" % dct
    else:
        t = "C2C"
    if r2c and inplace:
        tmp = list(d0.shape)
        tmp[-1] -= 2
        shstr = str(tuple(tmp)).replace(" ", "")
        if ",)" in shstr:
            shstr = shstr.replace(",)", "+2)")
        else:
            shstr = shstr.replace(")", "+2)")
    else:
        shstr = str(d0.shape).replace(" ", "")
    shax = str(axes).replace(" ", "")
    if colour_output:
        red = max(0, min(int((ni / tol - 0.2) * 255), 255))
        stol = "\x1b[48;2;%d;0;0m%6.2e < %6.2e (%5.3f)\x1b[0m" % (red, ni, tol, ni / tol)
    else:
        stol = "%6.2e < %6.2e (%5.3f)" % (ni, tol, ni / tol)
    verb_out = "%8s %4s %14s axes=%10s ndim=%4s %10s lut=%4s inplace=%d " \
               " norm=%4s %5s: n2=%6.2e ninf=%s %d" % \
               (backend, t, shstr, shax, str(ndim), str(d0.dtype), str(use_lut),
                int(inplace), str(norm), "FFT", n2, stol, src_unchanged_fft)

    t3 = timeit.default_timer()

    # Clean memory
    del d_gpu, d1_gpu
    if backend == "cupy" and has_cupy:
        mempool = cp.get_default_memory_pool()
        if mempool is not None:  # Is that test necessary ?
            # Clean memory pool, we are changing array sizes constantly, and using
            # N parallel process so memory management must be done manually
            mempool.free_all_blocks()

    # IFFT - from original array to avoid error propagation
    if r2c:
        # Exception: we need a proper half-Hermitian array
        d0 = d.astype(dtype)

    if has_scipy and ref_long_double:
        d0n = d0.astype(np.clongdouble)
    else:
        d0n = d0

    if 'opencl' in backend:
        d_gpu = cla.to_device(queue, d0)
    else:
        d_gpu = to_gpu(d0)

    if inplace:
        d1_gpu = d_gpu
    else:
        if r2c:
            if backend == "pyopencl":
                d1_gpu = cla.empty(queue, shape, dtype=dtypef)
            elif backend == "pycuda":
                d1_gpu = cua.empty(shape, dtype=dtypef)
            elif backend == "cupy":
                d1_gpu = cp.empty(shape, dtype=dtypef)
        else:
            d1_gpu = d_gpu.copy()

    d1_gpu = app.ifft(d_gpu, d1_gpu)
    if not dct:
        d1_gpu *= app.get_ifft_scale()

    if r2c:
        d = irfftn(d0n, axes=axes_numpy) * s
    elif dct:
        d = idctn(d0n, axes=axes_numpy, type=dct)
    else:
        d = ifftn(d0n, axes=axes_numpy) * s

    if inplace:
        if dct or r2c:
            assert d1_gpu.dtype == dtypef, "The array type is incorrect after an inplace iFFT"
        else:
            assert d1_gpu.dtype == dtype, "The array type is incorrect after an inplace iFFT"

    if r2c and inplace:
        n2i, nii = l2(d, d1_gpu.get()[..., :-2]), li(d, d1_gpu.get()[..., :-2])
    else:
        n2i, nii = l2(d, d1_gpu.get()), li(d, d1_gpu.get())
    src_unchanged_ifft = np.all(np.equal(d_gpu.get(), d0))

    # Max N for radix 1D C2R transforms to not overwrite source
    nmaxr2c1d = 3072 * (1 + int(dtype in (np.float32, np.complex64)))

    if max(ni, nii) <= tol and (inplace or src_unchanged_fft) and \
            (inplace or src_unchanged_ifft or (r2c and ndim > 1 or n >= nmaxr2c1d or bluestein)):
        success = 'OK'
    else:
        success = 'FAIL'

    if colour_output:
        red = max(0, min(int((nii / tol - 0.2) * 255), 255))
        stol = "\x1b[48;2;%d;0;0m%6.2e < %6.2e (%5.3f)\x1b[0m" % (red, nii, tol, nii / tol)
    else:
        stol = "%6.2e < %6.2e (%5.3f)" % (nii, tol, nii / tol)
    verb_out += "%5s: n2=%6.2e ninf=%s %d %4s" % ("iFFT", n2i, stol, src_unchanged_ifft, success)

    if verbose:
        print(verb_out)
    t4 = timeit.default_timer()

    if backend == "pyopencl":
        gpu_name = gpu_ctx_dic["pyopencl"][0].name
    elif backend == "pycuda":
        gpu_name = gpu_ctx_dic["pycuda"][0].name()
    else:
        gpu_name = ""

    res = {"n2": n2, "ni": ni, "n2i": n2i, "nii": nii, "tol": tol,
           "dt_array": t1 - t0, "dt_app": t2 - t1, "dt_fft": t3 - t2, "dt_ifft": t4 - t3,
           "src_unchanged_fft": src_unchanged_fft, "src_unchanged_ifft": src_unchanged_ifft,
           "tol_test": max(ni, nii) < tol, "str": verb_out, "backend": backend,
           "shape": shape0, "ndim": ndim, "axes": axes, "dtype": dtype0, "inplace": inplace,
           "norm": norm, "use_lut": use_lut, "r2c": r2c, "dct": dct, "gpu_name": gpu_name}
    if return_array:
        res["d0"] = d0
    return res


def test_accuracy_kwargs(kwargs):
    # This function must be defined here, so it can be used with a multiprocessing pool
    # in test_fft, otherwise this will fail, see:
    # https://stackoverflow.com/questions/41385708/multiprocessing-example-giving-attributeerror
    if kwargs['backend'] == 'pyopencl' and has_opencl:
        try:
            t = test_accuracy(**kwargs)
        except cl.RuntimeError as ex:
            # The cl.RuntimeError can't be pickled, so is not correctly reported
            # when using multiprocessing. So we raise another and the traceback
            # from the previous one is still printed.
            raise RuntimeError("An OpenCL RuntimeError was encountered")
        return t
    return test_accuracy(**kwargs)


def exhaustive_test(backend, vn, ndim, dtype, inplace, norm, use_lut, r2c=False, dct=False,
                    nproc=None, verbose=True, return_res=False):
    """
    Run tests on a large range of sizes using multiprocessing. Manual function.

    :param backend: either 'pyopencl', 'pycuda' or 'cupy'
    :param vn: the list/iterable of sizes n.
    :param ndim: the number of dimensions.
The array shape will be [n]*ndim :param dtype: either np.complex64 or np.complex128, or np.float32/np.float64 for r2c & dct :param inplace: True or False :param norm: either 0, 1 or "ortho" :param use_lut: if True, 1, False or 0, will trigger useLUT=1 or 0 for VkFFT. If None, the default VkFFT behaviour is used. Always True by default for double precision, so no need to force it. :param r2c: if True, test an r2c transform. If inplace, the last dimension (x, fastest axis) must be even :param dct: either 1, 2, 3 or 4 to test different dct. Only norm=1 can be tested (native scipy normalisation). :param nproc: the maximum number of parallel processes to use. If None, the number of detected cores will be used (this may use too much memory !) :param verbose: if True, prints 1 line per test :param return_res: if True, return the list of result dictionaries. :return: True if all tests passed, False otherwise. If return_res is True, return the list of result dictionaries instead. """ try: # Get the real number of processor cores available # os.sched_getaffinity is only available on some *nix platforms nproc1 = len(os.sched_getaffinity(0)) * psutil.cpu_count(logical=False) // psutil.cpu_count(logical=True) except AttributeError: nproc1 = os.cpu_count() if nproc is None: nproc = nproc1 else: nproc = min(nproc, nproc1) # Generate the list of configurations as kwargs for test_accuracy() vkwargs = [] for n in vn: kwargs = {"backend": backend, "shape": [n] * ndim, "ndim": ndim, "axes": None, "dtype": dtype, "inplace": inplace, "norm": norm, "use_lut": use_lut, "r2c": r2c, "dct": dct, "stream": None, "verbose": False} vkwargs.append(kwargs) vok = [] vres = [] # Need to use spawn to handle the GPU context with multiprocessing.get_context('spawn').Pool(nproc) as pool: for res in pool.imap(test_accuracy_kwargs, vkwargs): # TODO: this would be better logged if verbose: print(res['str']) ni, n2 = res["ni"], res["n2"] nii, n2i = res["nii"], res["n2i"] tol = res["tol"] ok = max(ni, nii) 
< tol if not inplace: ok = ok and res["src_unchanged_fft"] if not r2c: ok = ok and res["src_unchanged_ifft"] vok.append(ok) if return_res: vres.append(res) if return_res: return vres return np.all(vok) pyvkfft-2022.1.1/pyvkfft/base.py0000644000076500000240000006270214202465263017211 0ustar vincentstaff00000000000000# -*- coding: utf-8 -*- # PyVkFFT # (c) 2021- : ESRF-European Synchrotron Radiation Facility # authors: # Vincent Favre-Nicolin, favre@esrf.fr import os import platform import sysconfig import ctypes import warnings from enum import Enum import numpy as np from .config import USE_LUT # np.complex32 does not exist yet https://github.com/numpy/numpy/issues/14753 complex32 = np.dtype([('re', np.float16), ('im', np.float16)]) class VkFFTResult(Enum): """ VkFFT error codes from vkFFT.h """ VKFFT_SUCCESS = 0 VKFFT_ERROR_MALLOC_FAILED = 1 VKFFT_ERROR_INSUFFICIENT_CODE_BUFFER = 2 VKFFT_ERROR_INSUFFICIENT_TEMP_BUFFER = 3 VKFFT_ERROR_PLAN_NOT_INITIALIZED = 4 VKFFT_ERROR_NULL_TEMP_PASSED = 5 VKFFT_ERROR_INVALID_PHYSICAL_DEVICE = 1001 VKFFT_ERROR_INVALID_DEVICE = 1002 VKFFT_ERROR_INVALID_QUEUE = 1003 VKFFT_ERROR_INVALID_COMMAND_POOL = 1004 VKFFT_ERROR_INVALID_FENCE = 1005 VKFFT_ERROR_ONLY_FORWARD_FFT_INITIALIZED = 1006 VKFFT_ERROR_ONLY_INVERSE_FFT_INITIALIZED = 1007 VKFFT_ERROR_INVALID_CONTEXT = 1008 VKFFT_ERROR_INVALID_PLATFORM = 1009 VKFFT_ERROR_ENABLED_saveApplicationToString = 1010 VKFFT_ERROR_EMPTY_FFTdim = 2001 VKFFT_ERROR_EMPTY_size = 2002 VKFFT_ERROR_EMPTY_bufferSize = 2003 VKFFT_ERROR_EMPTY_buffer = 2004 VKFFT_ERROR_EMPTY_tempBufferSize = 2005 VKFFT_ERROR_EMPTY_tempBuffer = 2006 VKFFT_ERROR_EMPTY_inputBufferSize = 2007 VKFFT_ERROR_EMPTY_inputBuffer = 2008 VKFFT_ERROR_EMPTY_outputBufferSize = 2009 VKFFT_ERROR_EMPTY_outputBuffer = 2010 VKFFT_ERROR_EMPTY_kernelSize = 2011 VKFFT_ERROR_EMPTY_kernel = 2012 VKFFT_ERROR_EMPTY_applicationString = 2013 VKFFT_ERROR_UNSUPPORTED_RADIX = 3001 VKFFT_ERROR_UNSUPPORTED_FFT_LENGTH = 3002 
VKFFT_ERROR_UNSUPPORTED_FFT_LENGTH_R2C = 3003 VKFFT_ERROR_UNSUPPORTED_FFT_LENGTH_DCT = 3004 VKFFT_ERROR_UNSUPPORTED_FFT_OMIT = 3005 VKFFT_ERROR_FAILED_TO_ALLOCATE = 4001 VKFFT_ERROR_FAILED_TO_MAP_MEMORY = 4002 VKFFT_ERROR_FAILED_TO_ALLOCATE_COMMAND_BUFFERS = 4003 VKFFT_ERROR_FAILED_TO_BEGIN_COMMAND_BUFFER = 4004 VKFFT_ERROR_FAILED_TO_END_COMMAND_BUFFER = 4005 VKFFT_ERROR_FAILED_TO_SUBMIT_QUEUE = 4006 VKFFT_ERROR_FAILED_TO_WAIT_FOR_FENCES = 4007 VKFFT_ERROR_FAILED_TO_RESET_FENCES = 4008 VKFFT_ERROR_FAILED_TO_CREATE_DESCRIPTOR_POOL = 4009 VKFFT_ERROR_FAILED_TO_CREATE_DESCRIPTOR_SET_LAYOUT = 4010 VKFFT_ERROR_FAILED_TO_ALLOCATE_DESCRIPTOR_SETS = 4011 VKFFT_ERROR_FAILED_TO_CREATE_PIPELINE_LAYOUT = 4012 VKFFT_ERROR_FAILED_SHADER_PREPROCESS = 4013 VKFFT_ERROR_FAILED_SHADER_PARSE = 4014 VKFFT_ERROR_FAILED_SHADER_LINK = 4015 VKFFT_ERROR_FAILED_SPIRV_GENERATE = 4016 VKFFT_ERROR_FAILED_TO_CREATE_SHADER_MODULE = 4017 VKFFT_ERROR_FAILED_TO_CREATE_INSTANCE = 4018 VKFFT_ERROR_FAILED_TO_SETUP_DEBUG_MESSENGER = 4019 VKFFT_ERROR_FAILED_TO_FIND_PHYSICAL_DEVICE = 4020 VKFFT_ERROR_FAILED_TO_CREATE_DEVICE = 4021 VKFFT_ERROR_FAILED_TO_CREATE_FENCE = 4022 VKFFT_ERROR_FAILED_TO_CREATE_COMMAND_POOL = 4023 VKFFT_ERROR_FAILED_TO_CREATE_BUFFER = 4024 VKFFT_ERROR_FAILED_TO_ALLOCATE_MEMORY = 4025 VKFFT_ERROR_FAILED_TO_BIND_BUFFER_MEMORY = 4026 VKFFT_ERROR_FAILED_TO_FIND_MEMORY = 4027 VKFFT_ERROR_FAILED_TO_SYNCHRONIZE = 4028 VKFFT_ERROR_FAILED_TO_COPY = 4029 VKFFT_ERROR_FAILED_TO_CREATE_PROGRAM = 4030 VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM = 4031 VKFFT_ERROR_FAILED_TO_GET_CODE_SIZE = 4032 VKFFT_ERROR_FAILED_TO_GET_CODE = 4033 VKFFT_ERROR_FAILED_TO_DESTROY_PROGRAM = 4034 VKFFT_ERROR_FAILED_TO_LOAD_MODULE = 4035 VKFFT_ERROR_FAILED_TO_GET_FUNCTION = 4036 VKFFT_ERROR_FAILED_TO_SET_DYNAMIC_SHARED_MEMORY = 4037 VKFFT_ERROR_FAILED_TO_MODULE_GET_GLOBAL = 4038 VKFFT_ERROR_FAILED_TO_LAUNCH_KERNEL = 4039 VKFFT_ERROR_FAILED_TO_EVENT_RECORD = 4040 VKFFT_ERROR_FAILED_TO_ADD_NAME_EXPRESSION = 4041 
VKFFT_ERROR_FAILED_TO_INITIALIZE = 4042 VKFFT_ERROR_FAILED_TO_SET_DEVICE_ID = 4043 VKFFT_ERROR_FAILED_TO_GET_DEVICE = 4044 VKFFT_ERROR_FAILED_TO_CREATE_CONTEXT = 4045 VKFFT_ERROR_FAILED_TO_CREATE_PIPELINE = 4046 VKFFT_ERROR_FAILED_TO_SET_KERNEL_ARG = 4047 VKFFT_ERROR_FAILED_TO_CREATE_COMMAND_QUEUE = 4048 VKFFT_ERROR_FAILED_TO_RELEASE_COMMAND_QUEUE = 4049 VKFFT_ERROR_FAILED_TO_ENUMERATE_DEVICES = 4050 VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE = 4051 VKFFT_ERROR_FAILED_TO_CREATE_EVENT = 4052 def load_library(basename): if platform.system() == 'Windows': # We patched build_ext so the module is a .so and not a dll ext = '.so' else: # 'SO' is deprecated since Python 3.4 and removed in 3.11; EXT_SUFFIX replaces it ext = sysconfig.get_config_var('EXT_SUFFIX') return ctypes.cdll.LoadLibrary(os.path.join(os.path.dirname(__file__) or os.path.curdir, basename + ext)) def primes(n): """ Returns the prime decomposition of n as a list (the list always starts with 1). This only remains as a useful function, but VkFFT allows any prime decomposition, even if performance can be better when all prime factors are <=13. """ v = [1] assert (n > 0) i = 2 while i * i <= n: while n % i == 0: v.append(i) n //= i i += 1 if n > 1: v.append(n) return v def radix_gen(nmax, radix, even=False, exclude_one=True, inverted=False, nmin=None, max_pow=None): """ Generate an array of integers which are only products of powers of the base integers, e.g. 2**N1 * 3**N2 * 5**N3 etc... :param nmax: the maximum integer to return (included) :param radix: the list/tuple of base integers - which don't need to be primes :param even: if True, only return even numbers :param exclude_one: if True (the default), exclude 1 :param inverted: if True, the returned array will only include integers which are NOT in the form 2**N1 * 3**N2 * 5**N3... :param nmin: if not None, the integer values returned will be >=nmin :param max_pow: if not None, the N1, N2, N3... 
powers (for sizes in the form 2**N1 * 3**N2 * 5**N3) will be at max equal to this value, which allows to reduce the number of generated sizes while testing all base radixes :return: the numpy array of integers, sorted """ a = np.ones(1, dtype=np.int64) for i in range(len(radix)): if max_pow is None: tmp = np.arange(int(np.floor(np.log(nmax) / np.log(radix[i]))) + 1) else: tmp = np.arange(min(max_pow, int(np.floor(np.log(nmax) / np.log(radix[i])))) + 1) a = a * radix[i] ** tmp[:, np.newaxis] a = a.flatten() a = a[a <= nmax] if inverted: b = np.arange(nmax + 1) b[a] = 0 a = b.take(np.nonzero(b)[0]) if even: a = a[(a % 2) == 0] if nmin is not None: a = a[a >= nmin] a.sort() if len(a): if exclude_one and a[0] == 1: return a[1:] return a def radix_gen_n(nmax, max_size, radix, ndim=None, even=False, exclude_one=True, inverted=False, nmin=None, max_pow=None, range_nd_narrow=None, min_size=0): """ Generate a list of array shape with integers which are only multiple of powers of base integers, e.g. 2**N1 * 3**N2 * 5**N3 etc..., for each of the dimensions, and with a maximum size. Note that this can generate a large number of sizes. :param nmax: the maximum value for the length of each dimension (included) :param max_size: the maximum size (number of elements) for the array :param radix: the list/tuple of base integers - which don't need to be primes. If None, all sizes are allowed :param ndim: the number of dimensions allowed. If None, 1D, 2D and 3D shapes are mixed. :param even: if True, only return even numbers :param exclude_one: if True (the default), exclude 1 :param inverted: if True, the returned array will only include integers which are NOT in the form 2**N1 * 3**N2 * 5**N3... :param nmin: if not None, the integer values returned will be >=nmin :param max_pow: if not None, the N1, N2, N3... 
powers (for sizes in the form 2**N1 * 3**N2 * 5**N3) will be at max equal to this value, which allows to reduce the number of generated sizes while testing all base radixes :param range_nd_narrow: if a tuple of values (drel, dabs) is given, with drel within [0;1], for dimensions>1, in an array of shape (s0, s1, s2), the difference of lengths with respect to the first dimension cannot be larger than min(drel * s0, dabs). This allows to reduce the number of shapes tested. With drel=dabs=0, all dimensions must have identical lengths. :param min_size: the minimum size (number of elements). This can be used to separate large array tests and use a larger number of parallel process for smaller ones. :return: the list of array shapes. """ v = [] if radix is None: if even: if nmin is None: n0 = np.arange(2, min(nmax, max_size), 2) else: n0 = np.arange(nmin + nmin % 2, min(nmax, max_size), 2) else: if nmin is None: n0 = np.arange(2, min(nmax, max_size)) else: n0 = np.arange(nmin, min(nmax, max_size)) else: n0 = radix_gen(nmax, radix, even=even, exclude_one=exclude_one, inverted=inverted, nmin=nmin, max_pow=max_pow) if ndim is None or ndim in [1, 12, 123]: idx = np.nonzero((n0 <= max_size) * (n0 >= min_size))[0] if len(idx): v += list(zip(n0.take(idx))) if ndim is None or ndim in [2, 12, 123]: vidx = list(range(0, len(n0), 1000)) if vidx[-1] != len(n0): vidx.append(len(n0)) for i1 in range(len(vidx) - 1): l01 = n0[vidx[i1]:vidx[i1 + 1]] for i2 in range(len(vidx) - 1): l2 = n0[vidx[i2]:vidx[i2 + 1]][:, np.newaxis] s = (l01 * l2).flatten() l1, l2 = (l01 + np.zeros_like(l2)).flatten(), (l2 + np.zeros_like(l01)).flatten() tmp = (s <= max_size) * (s >= min_size) if range_nd_narrow is not None: drel, dabs = range_nd_narrow m = np.maximum(dabs, l1 * drel) tmp = np.logical_and(tmp, abs(l1 - l2) <= m) idx = np.nonzero(tmp)[0] if len(idx): v += list(zip(l1.take(idx), l2.take(idx))) if ndim is None or ndim in [3, 123]: vidx = list(range(0, len(n0), 100)) if vidx[-1] != len(n0): 
vidx.append(len(n0)) for i1 in range(len(vidx) - 1): l01 = n0[vidx[i1]:vidx[i1 + 1]] for i2 in range(len(vidx) - 1): l02 = n0[vidx[i2]:vidx[i2 + 1], np.newaxis] for i3 in range(len(vidx) - 1): l3 = n0[vidx[i3]:vidx[i3 + 1], np.newaxis, np.newaxis] # print(i1, i2, i3, l1.shape, l2.shape, l3.shape) s = (l01 * l02 * l3).flatten() l1, l2, l3 = (l01 + np.zeros_like(l02) + np.zeros_like(l3)).flatten(), \ (l02 + np.zeros_like(l01) + np.zeros_like(l3)).flatten(), \ (l3 + np.zeros_like(l01) + np.zeros_like(l02)).flatten() tmp = (s <= max_size) * (s >= min_size) if range_nd_narrow is not None: drel, dabs = range_nd_narrow m = np.maximum(dabs, l1 * drel) tmp = np.logical_and(tmp, (abs(l1 - l2) <= m) * (abs(l1 - l3) <= m)) idx = np.nonzero(tmp)[0] if len(idx): v += list(zip(l1.take(idx), l2.take(idx), l3.take(idx))) return v def calc_transform_axes(shape, axes=None, ndim=None): """ Compute the final shape of the array to be passed to VkFFT, and the axes for which the transform should be skipped. By collapsing non-transformed consecutive axes and using batch transforms, it is possible to support dimensions>3. However after collapsing axes, transforms are only possible along the first 3 remaining dimensions. Example of possible transforms: - 3D array with ndim=1, 2 or 3 or any set of axes - n-D array with ndim=1, 2 or 3 with n arbitrarily large (the dimensions above ndim will be collapsed in a batch transform) - shape=(4,5,6,7) and axes=(2,3): the first two axes will be collapsed to give a (20,6,7) array - shape=(4,5,6,7,8,9) and axes=(2,3): the first two axes will be collapsed to give a (20,6,7) array and a batch size of 8*9=72 will be used Examples of impossible transforms: - shape=(4,5,6,7) with ndim=4: only 3D transforms are allowed - shape=(4,5,6,7) with axes=(1,2,3): the index of the last transformed axis cannot be >2 :param shape: the initial shape of the data array. Note that this shape should be in the usual numpy order, i.e. the fastest axis is listed last. e.g. 
(nz, ny, nx) :param axes: the axes to be transformed. if None, all axes are transformed, or up to ndim. :param ndim: the number of dimensions for the transform. If None, the number of axes is used :return: (shape, skip_axis, ndim) with the 4D shape after collapsing consecutive non-transformed axes (padded with ones if necessary, with the order adequate for VkFFT (nx, ny, nz, n_batch), the list of 3 booleans indicating if the (x, y, z) axes should be skipped, and the number of transform axes. """ # reverse order to have list as (nx, ny, nz,...) shape1 = list(reversed(list(shape))) if np.isscalar(axes): axes = [axes] if ndim is None: if axes is None: ndim1 = len(shape) axes = list(range(ndim1)) else: ndim1 = ndim if axes is None: axes = list(range(-1, -ndim - 1, -1)) else: if ndim1 != len(axes): raise RuntimeError("The number of transform axes does not match ndim:", axes, ndim) # Collapse non-transform axes when possible skip_axis = [True for i in range(len(shape1))] for i in axes: skip_axis[i] = False skip_axis = list(reversed(skip_axis)) # numpy (z,y,x) to vkfft (x,y,z) order # print(shape1, skip_axis, axes) i = 0 while i <= len(shape1) - 2: if skip_axis[i] and skip_axis[i + 1]: shape1[i] *= shape1[i + 1] shape1.pop(i + 1) skip_axis.pop(i + 1) else: i += 1 # We can pass 3 dimensions to VkFFT (plus a batch dimension) if len(skip_axis) - list(reversed(skip_axis)).index(False) - 1 >= 3: raise RuntimeError("Unsupported VkFFT transform:", shape, axes, ndim, shape1, skip_axis) if len(shape1) < 4: shape1 += [1] * (4 - len(shape1)) if len(skip_axis) < 3: skip_axis += [True] * (3 - len(skip_axis)) skip_axis = skip_axis[:3] # Fix ndim so skipped axes are counted ndim1 = 3 - list(reversed(skip_axis)).index(False) # Axes beyond ndim are marked skipped for i in range(ndim1, 3): skip_axis[i] = False return shape1, skip_axis, ndim1 def check_vkfft_result(res, shape=None, dtype=None, ndim=None, inplace=None, norm=None, r2c=None, dct=None, axes=None, backend=None): """ Check 
VkFFTResult code. :param res: the result code from launching a transform. :param shape: shape of the array :param dtype: data type of the array :param ndim: number of transform dimensions :param inplace: True or False :param norm: 0, 1 or "ortho" :param r2c: True or False :param dct: False, 1, 2, 3 or 4 :param axes: transform axes :param backend: the backend :raises RuntimeError: if res != 0 """ if isinstance(res, ctypes.c_int): res = res.value if res != 0: s = "" if r2c: s += "R2C " elif dct: s += "DCT%d " % dct else: s += "C2C " if r2c and inplace and shape is not None: tmp = list(shape) tmp[-1] -= 2 shstr = str(tuple(tmp)).replace(" ", "") if ",)" in shstr: s += shstr.replace(",)", "+2)") + " " else: s += shstr.replace(")", "+2)") + " " else: s += str(shape).replace(" ", "") + " " if dtype is not None: s += str(dtype) + " " if axes is not None: s += str(axes).replace(" ", "") + " " if ndim is not None: s += "%dD " % ndim if inplace: s += "inplace " if norm: s += "norm=%s " % str(norm) if backend is not None: s += "[%s]" % backend try: r = VkFFTResult(res) raise RuntimeError("VkFFT error %d: %s %s" % (res, r.name, s)) except ValueError: raise RuntimeError("VkFFT error %d (unknown) %s" % (res, s)) class VkFFTApp: """ VkFFT application interface implementing an FFT plan, base implementation handling functions and parameters common to the CUDA and OpenCL backends. """ def __init__(self, shape, dtype: type, ndim=None, inplace=True, norm=1, r2c=False, dct=False, axes=None, **kwargs): """ Init function for the VkFFT application. :param shape: the shape of the array to be transformed. The number of dimensions of the array can be larger than the FFT dimensions. :param dtype: the numpy dtype of the source array (can be complex64 or complex128) :param ndim: the number of dimensions to use for the FFT. By default, uses the array dimensions. Can be smaller, e.g. ndim=2 for a 3D array to perform a batched 2D FFT on all the layers. 
The FFT is always performed along the last axes if the array's number of dimensions is larger than ndim, i.e. on the x-axis for ndim=1, on the x and y axes for ndim=2, etc., unless axes are given. :param inplace: if True (the default), performs an inplace transform and the destination array should not be given in fft() and ifft(). :param norm: if 0 (unnormalised), every transform multiplies the L2 norm of the array by its size (or the size of the transformed array if ndim is smaller than the array's number of dimensions). If 1 (the default) or "backward", the inverse transform is normalised so that an FFT followed by an iFFT keeps the array norm. If "ortho", each transform keeps the L2 norm. :param r2c: if True, will perform a real->complex transform, where the complex destination is a half-hermitian array. For an inplace transform, if the input data shape is (...,nx), the input float array should have a shape of (..., nx+2), the last two columns being ignored in the input data, and the resulting complex array (using pycuda's GPUArray.view(dtype=np.complex64) to reinterpret the type) will have a shape (..., nx//2 + 1). For an out-of-place transform, if the input (real) shape is (..., nx), the output (complex) shape should be (..., nx//2+1). Note that for C2R transforms with ndim>=2, the source (complex) array is modified. :param dct: used to perform a Discrete Cosine Transform (DCT) aka a R2R transform. An integer can be given to specify the type of DCT (1, 2, 3 or 4). if dct=True, the DCT type 2 will be performed, following scipy's convention. :param axes: a list or tuple of axes along which the transform should be made. if None, the transform is done along the ndim fastest axes, or all axes if ndim is None. Not allowed for R2C transforms :raises RuntimeError: if the transform dimensions are not allowed by VkFFT. """ self.app = None self.config = None if dct and r2c: raise RuntimeError("R2C and DCT cannot both be selected !") if (r2c or dct) and dtype not in [np.float16, np.float32, np.float64]: raise RuntimeError("R2C or DCT selected but input type is not real !") if r2c and axes is not None: raise RuntimeError("axes=... 
is not allowed for R2C transforms") # Get the final shape passed to VkFFT, collapsing non-transform axes # as necessary. The calculated shape has 4 dimensions (nx, ny, nz, n_batch) self.shape, self.skip_axis, self.ndim = calc_transform_axes(shape, axes, ndim) self.inplace = inplace self.r2c = r2c if dct is False: self.dct = 0 elif dct is True: self.dct = 2 else: self.dct = dct if dct and self.dct < 1 or dct > 4: raise RuntimeError("Only DCT of types 1, 2, 3 and 4 are allowed") # print("VkFFTApp:", shape, axes, ndim, "->", self.shape, self.skip_axis, self.ndim) # Experimental parameters. Not much difference is seen, so don't document this, # VkFFT default parameters seem fine. if "disableReorderFourStep" in kwargs: self.disableReorderFourStep = kwargs["disableReorderFourStep"] else: self.disableReorderFourStep = -1 if "registerBoost" in kwargs: self.registerBoost = kwargs["registerBoost"] else: self.registerBoost = -1 if "useLUT" in kwargs: # useLUT=1 may be beneficial on platforms which have a low accuracy for # the native sincos functions. 
if kwargs["useLUT"] is None: self.use_lut = -1 else: self.use_lut = kwargs["useLUT"] elif USE_LUT is not None: self.use_lut = USE_LUT else: self.use_lut = -1 if "keepShaderCode" in kwargs: # This will print the compiled code if equal to 1 self.keepShaderCode = kwargs["keepShaderCode"] else: self.keepShaderCode = -1 if norm == "backward": norm = 1 self.norm = norm # Precision: number of bytes per float if dtype in [np.float16, complex32]: self.precision = 2 elif dtype in [np.float32, np.complex64]: self.precision = 4 elif dtype in [np.float64, np.complex128]: self.precision = 8 def _get_fft_scale(self, norm): """Return the scale factor by which an array must be multiplied to keep its L2 norm after a forward FT :param norm: the norm option for which the scale is computed, either 0 or 1 :return: the scale factor, as a numpy float with the precision used for the fft """ dtype = np.float32 if self.precision == 8: dtype = np.float64 elif self.precision == 2: dtype = np.float16 s = 1 ndim_real = 0 for i in range(self.ndim): if not self.skip_axis[i]: s *= self.shape[i] ndim_real += 1 s = np.sqrt(s) if self.r2c and self.inplace: s *= np.sqrt((self.shape[0] - 2) / self.shape[0]) if self.dct: s *= 2 ** (0.5 * ndim_real) if self.dct != 4: warnings.warn("A DCT type 2 or 3 cannot be strictly normalised, using approximation," " see https://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II") if norm == 0 or norm == 1: return dtype(1 / s) elif norm == "ortho": return dtype(1) raise RuntimeError("Unknown norm choice !") def get_fft_scale(self): """Return the scale factor by which an array must be multiplied to keep its L2 norm after a forward FT """ return self._get_fft_scale(self.norm) def _get_ifft_scale(self, norm): """Return the scale factor by which an array must be multiplied to keep its L2 norm after a backward FT :param norm: the norm option for which the scale is computed, either 0 or 1 :return: the scale factor, as a numpy float with the precision used for the fft """ 
dtype = np.float32 if self.precision == 8: dtype = np.float64 elif self.precision == 2: dtype = np.float16 s = 1 s_dct = 1 for i in range(self.ndim): if not self.skip_axis[i]: s *= self.shape[i] if self.dct: s_dct *= np.sqrt(2) s = np.sqrt(s) if self.r2c and self.inplace: s *= np.sqrt((self.shape[0] - 2) / self.shape[0]) if self.dct and self.dct != 4: warnings.warn("A DCT type 2 or 3 cannot be strictly normalised, using approximation," " see https://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II") if norm == 0: return dtype(1 / (s * s_dct)) elif norm == 1: # Not sure why the difference in scale factors if self.dct == 2: s_dct = s_dct ** 1 elif self.dct == 3: s_dct = s_dct ** 2 elif self.dct == 4: s_dct = s_dct ** 3 return dtype(s * s_dct) elif norm == "ortho": return dtype(1) raise RuntimeError("Unknown norm choice !") def get_ifft_scale(self): """Return the scale factor by which an array must be multiplied to keep its L2 norm after a backward FT """ return self._get_ifft_scale(self.norm) pyvkfft-2022.1.1/pyvkfft/config.py0000644000076500000240000000327214202465263017541 0ustar vincentstaff00000000000000# -*- coding: utf-8 -*- # Global configuration variables. The approach is adapted # from Numba's config.py import os # Number of VkFFTApp to cache through the pyvkfft.fft interface # This must be modified *before* importing pyvkfft.fft FFT_CACHE_NB = 32 # Force using a LUT for single-precision transforms ? 
# If None, this will be activated automatically for some GPU (Intel) # Use only to improve the accuracy by a factor 3 or 4 # If useLUT is passed directly to a VkFFTApp, this is ignored # Valid values: either None or 1 USE_LUT = None def process_environ(environ): if "PYVKFFT_FFT_CACHE_NB" in environ: FFT_CACHE_NB = eval(environ["PYVKFFT_FFT_CACHE_NB"]) else: FFT_CACHE_NB = 32 if "PYVKFFT_USE_LUT" in environ: USE_LUT = eval(environ["PYVKFFT_USE_LUT"]) else: USE_LUT = None # Inject values into the module globals for name, value in locals().copy().items(): if name.isupper(): globals()[name] = value class _EnvReloader(object): def __init__(self): self.reset() self.old_environ = {} def reset(self): self.old_environ = {} self.update(force=True) def update(self, force=False): new_environ = {} for name, value in os.environ.items(): if name.startswith('PYVKFFT_'): print(name, value) new_environ[name] = value if force or self.old_environ != new_environ: process_environ(new_environ) self.old_environ = new_environ _env_reloader = _EnvReloader() def _reload_config(): """ Reload the configuration from environment variables, if necessary. 
""" _env_reloader.update() pyvkfft-2022.1.1/pyvkfft/cuda.py0000644000076500000240000003241414202465263017210 0ustar vincentstaff00000000000000# -*- coding: utf-8 -*- # PyVkFFT # (c) 2021- : ESRF-European Synchrotron Radiation Facility # authors: # Vincent Favre-Nicolin, favre@esrf.fr import ctypes import numpy as np try: import pycuda.driver as cu_drv has_pycuda = True except ImportError: has_pycuda = False try: import cupy as cp has_cupy = True except ImportError: has_cupy = False if has_pycuda is False: raise ImportError("You need either PyCUDA or CuPy to use pyvkfft.cuda.") from .base import load_library, primes, VkFFTApp as VkFFTAppBase, VkFFTResult, check_vkfft_result _vkfft_cuda = load_library("_vkfft_cuda") class _types: """Aliases""" vkfft_config = ctypes.c_void_p stream = ctypes.c_void_p vkfft_app = ctypes.c_void_p _vkfft_cuda.make_config.restype = ctypes.c_void_p _vkfft_cuda.make_config.argtypes = [ctypes.c_size_t, ctypes.c_size_t, ctypes.c_size_t, ctypes.c_size_t, ctypes.c_void_p, ctypes.c_void_p, _types.stream, ctypes.c_int, ctypes.c_size_t, ctypes.c_int, ctypes.c_int, ctypes.c_int, ctypes.c_int, ctypes.c_int, ctypes.c_int, ctypes.c_size_t, ctypes.c_int, ctypes.c_int, ctypes.c_int] _vkfft_cuda.init_app.restype = ctypes.c_void_p _vkfft_cuda.init_app.argtypes = [_types.vkfft_config, ctypes.POINTER(ctypes.c_int)] _vkfft_cuda.fft.restype = ctypes.c_int _vkfft_cuda.fft.argtypes = [_types.vkfft_app, ctypes.c_void_p, ctypes.c_void_p] _vkfft_cuda.ifft.restype = ctypes.c_int _vkfft_cuda.ifft.argtypes = [_types.vkfft_app, ctypes.c_void_p, ctypes.c_void_p] _vkfft_cuda.free_app.restype = None _vkfft_cuda.free_app.argtypes = [_types.vkfft_app] _vkfft_cuda.free_config.restype = None _vkfft_cuda.free_config.argtypes = [_types.vkfft_config] class VkFFTApp(VkFFTAppBase): """ VkFFT application interface, similar to a cuFFT plan. 
""" def __init__(self, shape, dtype: type, ndim=None, inplace=True, stream=None, norm=1, r2c=False, dct=False, axes=None, **kwargs): """ :param shape: the shape of the array to be transformed. The number of dimensions of the array can be larger than the FFT dimensions, but only for 1D and 2D transforms. 3D FFT transforms can only be done on 3D arrays. :param dtype: the numpy dtype of the source array (can be complex64 or complex128) :param ndim: the number of dimensions to use for the FFT. By default, uses the array dimensions. Can be smaller, e.g. ndim=2 for a 3D array to perform a batched 3D FFT on all the layers. The FFT is always performed along the last axes if the array's number of dimension is larger than ndim, i.e. on the x-axis for ndim=1, on the x and y axes for ndim=2. :param inplace: if True (the default), performs an inplace transform and the destination array should not be given in fft() and ifft(). :param stream: the pycuda.driver.Stream or cupy.cuda.Stream to use for the transform. If None, the default one will be used :param norm: if 0 (unnormalised), every transform multiplies the L2 norm of the array by its size (or the size of the transformed array if ndimcomplex transform, where the complex destination is a half-hermitian array. For an inplace transform, if the input data shape is (...,nx), the input float array should have a shape of (..., nx+2), the last two columns being ignored in the input data, and the resulting complex array (using pycuda's GPUArray.view(dtype=np.complex64) to reinterpret the type) will have a shape (..., nx//2 + 1). For an out-of-place transform, if the input (real) shape is (..., nx), the output (complex) shape should be (..., nx//2+1). Note that for C2R transforms with ndim>=2, the source (complex) array is modified. :param dct: used to perform a Direct Cosine Transform (DCT) aka a R2R transform. An integer can be given to specify the type of DCT (1, 2, 3 or 4). 
if dct=True, the DCT type 2 will be performed, following scipy's convention. :param axes: a list or tuple of axes along which the transform should be made. if None, the transform is done along the ndim fastest axes, or all axes if ndim is None. Not allowed for R2C transforms :raises RuntimeError: if the initialisation fails, e.g. if the CUDA driver has not been properly initialised. """ super().__init__(shape, dtype, ndim=ndim, inplace=inplace, norm=norm, r2c=r2c, dct=dct, axes=axes, **kwargs) self.stream = stream self.config = self._make_config() if self.config is None: raise RuntimeError("Error creating VkFFTConfiguration. Was the CUDA context properly initialised ?") res = ctypes.c_int(0) self.app = _vkfft_cuda.init_app(self.config, ctypes.byref(res)) check_vkfft_result(res, shape, dtype, ndim, inplace, norm, r2c, dct, axes, "cuda") if self.app is None: raise RuntimeError("Error creating VkFFTApplication. Was the CUDA driver initialised ?") if has_pycuda: # TODO: This is a kludge to keep a reference to the context, so that it is deleted # after the app in __delete__, which throws an error if the context does not exist # anymore. Except that we cannot be sure this is the right context, if a stream # has been given because we don't have access to cuStreamGetCtx from python... self._ctx = cu_drv.Context.get_current() def __del__(self): """ Takes care of deleting allocated memory in the underlying VkFFTApplication and VkFFTConfiguration. 
""" if self.app is not None: _vkfft_cuda.free_app(self.app) if self.config is not None: _vkfft_cuda.free_config(self.config) def _make_config(self): """ Create a vkfft configuration for a FFT transform""" nx, ny, nz, n_batch = self.shape skipx, skipy, skipz = self.skip_axis if self.r2c and self.inplace: # the last two columns are ignored in the R array, and will be used # in the C array with a size nx//2+1 nx -= 2 s = 0 if self.stream is not None: if has_pycuda: if isinstance(self.stream, cu_drv.Stream): s = self.stream.handle if has_cupy: if isinstance(self.stream, cp.cuda.Stream): s = self.stream.ptr if self.norm == "ortho": norm = 0 else: norm = self.norm # We pass fake buffer pointer addresses to VkFFT. The real ones will be # given when performing the actual FFT. dest_gpudata = 2 if self.inplace: dest_gpudata = 0 return _vkfft_cuda.make_config(nx, ny, nz, self.ndim, 1, dest_gpudata, s, norm, self.precision, int(self.r2c), int(self.dct), int(self.disableReorderFourStep), int(self.registerBoost), int(self.use_lut), int(self.keepShaderCode), n_batch, skipx, skipy, skipz) def fft(self, src, dest=None): """ Compute the forward FFT :param src: the source pycuda.gpuarray.GPUArray or cupy.ndarray :param dest: the destination GPU array. Should be None for an inplace transform :raises RuntimeError: in case of a GPU kernel launch error :return: the transformed array. For a R2C inplace transform, the complex view of the array is returned. """ use_cupy = False if has_cupy: if isinstance(src, cp.ndarray): use_cupy = True if use_cupy: src_ptr = src.__cuda_array_interface__['data'][0] else: # Must cast the gpudata to int as it can either be a DeviceAllocation object # or an int (e.g. 
when using a view of another array) src_ptr = int(src.gpudata) if dest is not None: if use_cupy: dest_ptr = dest.__cuda_array_interface__['data'][0] else: dest_ptr = int(dest.gpudata) else: dest_ptr = src_ptr if self.inplace: if src_ptr != dest_ptr: raise RuntimeError("VkFFTApp.fft: dest is not None but this is an inplace transform") res = _vkfft_cuda.fft(self.app, int(src_ptr), int(src_ptr)) check_vkfft_result(res, src.shape, src.dtype, self.ndim, self.inplace, self.norm, self.r2c, self.dct, backend="cuda") if self.norm == "ortho": src *= self._get_fft_scale(norm=0) if self.r2c: if src.dtype == np.float32: return src.view(dtype=np.complex64) elif src.dtype == np.float64: return src.view(dtype=np.complex128) return src else: if dest is None: raise RuntimeError("VkFFTApp.fft: dest is None but this is an out-of-place transform") if src_ptr == dest_ptr: raise RuntimeError("VkFFTApp.fft: dest and src are identical but this is an out-of-place transform") if self.r2c: assert (dest.size == src.size // src.shape[-1] * (src.shape[-1] // 2 + 1)) res = _vkfft_cuda.fft(self.app, int(src_ptr), int(dest_ptr)) check_vkfft_result(res, src.shape, src.dtype, self.ndim, self.inplace, self.norm, self.r2c, self.dct, backend="cuda") if self.norm == "ortho": dest *= self._get_fft_scale(norm=0) return dest def ifft(self, src, dest=None): """ Compute the backward FFT :param src: the source pycuda.gpuarray.GPUArray or cupy.ndarray :param dest: the destination GPU array. Should be None for an inplace transform :raises RuntimeError: in case of a GPU kernel launch error :return: the transformed array. For a C2R inplace transform, the float view of the array is returned. """ use_cupy = False if has_cupy: if isinstance(src, cp.ndarray): use_cupy = True if use_cupy: src_ptr = src.__cuda_array_interface__['data'][0] else: # Must cast the gpudata to int as it can either be a DeviceAllocation object # or an int (e.g. 
when using a view of another array) src_ptr = int(src.gpudata) if dest is not None: if use_cupy: dest_ptr = dest.__cuda_array_interface__['data'][0] else: dest_ptr = int(dest.gpudata) else: dest_ptr = src_ptr if self.inplace: if dest is not None: if src_ptr != dest_ptr: raise RuntimeError("VkFFTApp.fft: dest!=src but this is an inplace transform") res = _vkfft_cuda.ifft(self.app, int(src_ptr), int(src_ptr)) check_vkfft_result(res, src.shape, src.dtype, self.ndim, self.inplace, self.norm, self.r2c, self.dct, backend="cuda") if self.norm == "ortho": src *= self._get_ifft_scale(norm=0) if self.r2c: if src.dtype == np.complex64: return src.view(dtype=np.float32) elif src.dtype == np.complex128: return src.view(dtype=np.float64) return src if not self.inplace: if dest is None: raise RuntimeError("VkFFTApp.ifft: dest is None but this is an out-of-place transform") if src_ptr == dest_ptr: raise RuntimeError("VkFFTApp.ifft: dest and src are identical but this is an out-of-place transform") if self.r2c: assert (src.size == dest.size // dest.shape[-1] * (dest.shape[-1] // 2 + 1)) # Special case, src and dest buffer sizes are different, # VkFFT is configured to go back to the source buffer res = _vkfft_cuda.ifft(self.app, int(dest_ptr), int(src_ptr)) else: res = _vkfft_cuda.ifft(self.app, int(src_ptr), int(dest_ptr)) check_vkfft_result(res, src.shape, src.dtype, self.ndim, self.inplace, self.norm, self.r2c, self.dct, backend="cuda") if self.norm == "ortho": dest *= self._get_ifft_scale(norm=0) return dest def vkfft_version(): """ Get VkFFT version :return: version as X.Y.Z """ int_ver = _vkfft_cuda.vkfft_version() return "%d.%d.%d" % (int_ver // 10000, (int_ver % 10000) // 100, int_ver % 100) pyvkfft-2022.1.1/pyvkfft/fft.py0000644000076500000240000004572514202465263017064 0ustar vincentstaff00000000000000# -*- coding: utf-8 -*- # PyVkFFT # (c) 2021- : ESRF-European Synchrotron Radiation Facility # authors: # Vincent Favre-Nicolin, favre@esrf.fr __all__ = ['fftn', 'ifftn', 
'rfftn', 'irfftn', 'vkfft_version', 'clear_vkfftapp_cache', 'has_pycuda', 'has_opencl', 'has_cupy'] from enum import Enum from functools import lru_cache import numpy as np from .base import complex32 from .config import FFT_CACHE_NB try: from .cuda import VkFFTApp as VkFFTApp_cuda, has_pycuda, has_cupy, vkfft_version if has_pycuda: import pycuda.gpuarray as cua if has_cupy: import cupy as cp except ImportError: has_cupy, has_pycuda = False, False try: from .opencl import VkFFTApp as VkFFTApp_cl, cla, vkfft_version has_opencl = True except ImportError: has_opencl = False class Backend(Enum): """ Backend language & library""" UNKNOWN = 0 PYCUDA = 1 PYOPENCL = 2 CUPY = 3 def _prepare_transform(src, dest, cl_queue, r2c=False): """ Determine the backend from the input data. Create the destination array if necessary. :param src: the source GPU array :param dest: the destination array. If None, a new GPU array is created. :param cl_queue: the opencl queue to use, or None :param r2c: if True, this is for an R2C transform, so adapt the destination array accordingly. :return: a tuple (backend, inplace, dest, cl_queue), also appending the destination dtype for an r2c transform. """ backend = Backend.UNKNOWN if r2c: if src.dtype in [np.float16, np.float32, np.float64]: sh = list(src.shape) sh[-1] = sh[-1] // 2 + 1 dtype = np.complex64 if src.dtype == np.float16: dtype = complex32 elif src.dtype == np.float64: dtype = np.complex128 else: sh = list(src.shape) sh[-1] = (sh[-1] - 1) * 2 dtype = np.float32 if src.dtype == complex32: dtype = np.float16 elif src.dtype == np.complex128: dtype = np.float64 else: sh, dtype = None, None if has_pycuda: if isinstance(src, cua.GPUArray): backend = Backend.PYCUDA # Must cast the gpudata to int as it can either be a DeviceAllocation object # or an int (e.g. 
when using a view of another array) src_ptr = int(src.gpudata) if dest is None: if r2c: dest = cua.empty(tuple(sh), dtype=dtype, allocator=src.allocator) else: dest = cua.empty_like(src) dest_ptr = int(dest.gpudata) if backend == Backend.UNKNOWN and has_opencl: if isinstance(src, cla.Array): backend = Backend.PYOPENCL src_ptr = src.data.int_ptr if dest is None: if r2c: dest = cla.empty(src.queue, tuple(sh), dtype=dtype, allocator=src.allocator) else: dest = cla.empty_like(src) dest_ptr = dest.data.int_ptr if cl_queue is None: cl_queue = src.queue if backend == Backend.UNKNOWN and has_cupy: if isinstance(src, cp.ndarray): backend = Backend.CUPY src_ptr = src.__cuda_array_interface__['data'][0] if dest is None: if r2c: dest = cp.empty(tuple(sh), dtype=dtype) else: dest = cp.empty_like(src) dest_ptr = dest.__cuda_array_interface__['data'][0] if backend == Backend.UNKNOWN: raise RuntimeError("Could not determine the type of GPU array supplied, or the " "corresponding backend is not installed " "(has_pycuda=%d, has_pyopencl=%d, has_cupy=%d)" % (has_pycuda, has_opencl, has_cupy)) inplace = dest_ptr == src_ptr if r2c: if inplace: dest = src.view(dtype=dtype) return backend, inplace, dest, cl_queue, dtype else: return backend, inplace, dest, cl_queue @lru_cache(maxsize=FFT_CACHE_NB) def _get_fft_app(backend, shape, dtype, inplace, ndim, axes, norm, cuda_stream, cl_queue): if backend in [Backend.PYCUDA, Backend.CUPY]: return VkFFTApp_cuda(shape, dtype, ndim=ndim, inplace=inplace, stream=cuda_stream, norm=norm, axes=axes) elif backend == Backend.PYOPENCL: return VkFFTApp_cl(shape, dtype, cl_queue, ndim=ndim, inplace=inplace, norm=norm, axes=axes) @lru_cache(maxsize=FFT_CACHE_NB) def _get_rfft_app(backend, shape, dtype, inplace, ndim, norm, cuda_stream, cl_queue): if backend in [Backend.PYCUDA, Backend.CUPY]: return VkFFTApp_cuda(shape, dtype, ndim=ndim, inplace=inplace, stream=cuda_stream, norm=norm, r2c=True) elif backend == Backend.PYOPENCL: return VkFFTApp_cl(shape,
dtype, cl_queue, ndim=ndim, inplace=inplace, norm=norm, r2c=True) @lru_cache(maxsize=FFT_CACHE_NB) def _get_dct_app(backend, shape, dtype, inplace, ndim, norm, dct_type, cuda_stream, cl_queue): if backend in [Backend.PYCUDA, Backend.CUPY]: return VkFFTApp_cuda(shape, dtype, ndim=ndim, inplace=inplace, stream=cuda_stream, norm=norm, dct=dct_type) elif backend == Backend.PYOPENCL: return VkFFTApp_cl(shape, dtype, cl_queue, ndim=ndim, inplace=inplace, norm=norm, dct=dct_type) def fftn(src, dest=None, ndim=None, norm=1, axes=None, cuda_stream=None, cl_queue=None, return_scale=False): """ Perform a FFT on a GPU array, automatically creating the VkFFTApp and caching it for future re-use. :param src: the source pycuda.gpuarray.GPUArray or cupy.ndarray :param dest: the destination GPU array. If None, a new GPU array will be created and returned (using the source array allocator (pycuda, pyopencl) if available). If dest is the same array as src, an inplace transform is done. :param ndim: the number of dimensions (<=3) to use for the FFT. By default, uses the array dimensions. Can be smaller, e.g. ndim=2 for a 3D array to perform a batched 3D FFT on all the layers. The FFT is always performed along the last axes if the array's number of dimension is larger than ndim, i.e. on the x-axis for ndim=1, on the x and y axes for ndim=2. :param norm: if 0 (un-normalised), every transform multiplies the L2 norm of the array by the transform size. if 1 (the default) or "backward", the inverse transform divides the L2 norm by the array size, so FFT+iFFT will keep the array norm. if "ortho", each transform will keep the L2 norm, but that will involve an extra read & write operation. :param axes: a list or tuple of axes along which the transform is made. if None, the transform is done along the ndim fastest axes, or all axes if ndim is None. Not allowed for R2C transforms :param cuda_stream: the pycuda.driver.Stream or cupy.cuda.Stream to use for the transform. 
If None, the default one will be used :param cl_queue: the pyopencl.CommandQueue to be used. If None, the source array default queue will be used :param return_scale: if True, return the scale factor by which the result must be multiplied to keep its L2 norm after the transform :return: the destination array if return_scale is False, or (dest, scale) """ backend, inplace, dest, cl_queue = _prepare_transform(src, dest, cl_queue, False) app = _get_fft_app(backend, src.shape, src.dtype, inplace, ndim, axes, norm, cuda_stream, cl_queue) app.fft(src, dest) if return_scale: s = app.get_fft_scale() return dest, s return dest def ifftn(src, dest=None, ndim=None, norm=1, axes=None, cuda_stream=None, cl_queue=None, return_scale=False): """ Perform an inverse FFT on a GPU array, automatically creating the VkFFTApp and caching it for future re-use. :param src: the source pycuda.gpuarray.GPUArray or cupy.ndarray :param dest: the destination GPU array. If None, a new GPU array will be created and returned (using the source array allocator (pycuda, pyopencl) if available). If dest is the same array as src, an inplace transform is done. :param ndim: the number of dimensions (<=3) to use for the FFT. By default, uses the array dimensions. Can be smaller, e.g. ndim=2 for a 3D array to perform a batched 3D FFT on all the layers. The FFT is always performed along the last axes if the array's number of dimension is larger than ndim, i.e. on the x-axis for ndim=1, on the x and y axes for ndim=2. :param norm: if 0 (un-normalised), every transform multiplies the L2 norm of the array by the transform size. if 1 (the default) or "backward", the inverse transform divides the L2 norm by the array size, so FFT+iFFT will keep the array norm. if "ortho", each transform will keep the L2 norm, but that will involve an extra read & write operation. :param axes: a list or tuple of axes along which the transform is made. 
if None, the transform is done along the ndim fastest axes, or all axes if ndim is None. Not allowed for R2C transforms :param cuda_stream: the pycuda.driver.Stream or cupy.cuda.Stream to use for the transform. If None, the default one will be used :param cl_queue: the pyopencl.CommandQueue to be used. If None, the source array default queue will be used :param return_scale: if True, return the scale factor by which the result must be multiplied to keep its L2 norm after the transform :return: the destination array if return_scale is False, or (dest, scale) """ backend, inplace, dest, cl_queue = _prepare_transform(src, dest, cl_queue, False) app = _get_fft_app(backend, src.shape, src.dtype, inplace, ndim, axes, norm, cuda_stream, cl_queue) app.ifft(src, dest) if return_scale: s = app.get_fft_scale() return dest, s return dest def rfftn(src, dest=None, ndim=None, norm=1, cuda_stream=None, cl_queue=None, return_scale=False): """ Perform a real->complex transform on a GPU array, automatically creating the VkFFTApp and caching it for future re-use. For an out-of-place transform, the length of the destination last axis will be src.shape[-1]//2+1. For an in-place transform, if the src array has a shape (..., nx+2), the last two values along the last (X) axis are ignored, and the destination array will have a shape of (..., nx//2+1). :param src: the source pycuda.gpuarray.GPUArray or cupy.ndarray :param dest: the destination GPU array. If None, a new GPU array will be created and returned (using the source array allocator (pycuda, pyopencl) if available). If dest is the same array as src, an inplace transform is done. :param ndim: the number of dimensions (<=3) to use for the FFT. By default, uses the array dimensions. Can be smaller, e.g. ndim=2 for a 3D array to perform a batched 3D FFT on all the layers. The FFT is always performed along the last axes if the array's number of dimension is larger than ndim, i.e. on the x-axis for ndim=1, on the x and y axes for ndim=2. 
:param norm: if 0 (un-normalised), every transform multiplies the L2 norm of the array by the transform size. if 1 (the default) or "backward", the inverse transform divides the L2 norm by the array size, so FFT+iFFT will keep the array norm. if "ortho", each transform will keep the L2 norm, but that will involve an extra read & write operation. :param cuda_stream: the pycuda.driver.Stream or cupy.cuda.Stream to use for the transform. If None, the default one will be used :param cl_queue: the pyopencl.CommandQueue to be used. If None, the source array default queue will be used :param return_scale: if True, return the scale factor by which the result must be multiplied to keep its L2 norm after the transform :return: the destination array if return_scale is False, or (dest, scale). For an in-place transform, the returned value is a view of the array with the appropriate type. """ backend, inplace, dest, cl_queue, dtype = _prepare_transform(src, dest, cl_queue, True) app = _get_rfft_app(backend, src.shape, src.dtype, inplace, ndim, norm, cuda_stream, cl_queue) app.fft(src, dest) if return_scale: s = app.get_fft_scale() return dest.view(dtype=dtype), s return dest.view(dtype=dtype) def irfftn(src, dest=None, ndim=None, norm=1, cuda_stream=None, cl_queue=None, return_scale=False): """ Perform a complex->real transform on a GPU array, automatically creating the VkFFTApp and caching it for future re-use. For an out-of-place transform, the length of the destination last axis will be (src.shape[-1]-1)*2. For an in-place transform, if the src array has a shape (..., nx), the destination array will have a shape of (..., nx*2) but the last two values along the last axis are used as a buffer. :param src: the source pycuda.gpuarray.GPUArray or cupy.ndarray :param dest: the destination GPU array. If None, a new GPU array will be created and returned (using the source array allocator (pycuda, pyopencl) if available). If dest is the same array as src, an inplace transform is done.
:param ndim: the number of dimensions (<=3) to use for the FFT. By default, uses the array dimensions. Can be smaller, e.g. ndim=2 for a 3D array to perform a batched 3D FFT on all the layers. The FFT is always performed along the last axes if the array's number of dimension is larger than ndim, i.e. on the x-axis for ndim=1, on the x and y axes for ndim=2. :param norm: if 0 (un-normalised), every transform multiplies the L2 norm of the array by the transform size. if 1 (the default) or "backward", the inverse transform divides the L2 norm by the array size, so FFT+iFFT will keep the array norm. if "ortho", each transform will keep the L2 norm, but that will involve an extra read & write operation. :param cuda_stream: the pycuda.driver.Stream or cupy.cuda.Stream to use for the transform. If None, the default one will be used :param cl_queue: the pyopencl.CommandQueue to be used. If None, the source array default queue will be used :param return_scale: if True, return the scale factor by which the result must be multiplied to keep its L2 norm after the transform :return: the destination array if return_scale is False, or (dest, scale) For an in-place transform, the returned value is a view of the array with the appropriate type. """ backend, inplace, dest, cl_queue, dtype = _prepare_transform(src, dest, cl_queue, True) app = _get_rfft_app(backend, dest.shape, dest.dtype, inplace, ndim, norm, cuda_stream, cl_queue) app.ifft(src, dest) if return_scale: s = app.get_fft_scale() return dest.view(dtype=dtype), s return dest.view(dtype=dtype) def dctn(src, dest=None, ndim=None, norm=1, dct_type=2, cuda_stream=None, cl_queue=None): """ Perform a real->real Direct Cosine Transform on a GPU array, automatically creating the VkFFTApp and caching it for future re-use. :param src: the source pycuda.gpuarray.GPUArray or cupy.ndarray :param dest: the destination GPU array. 
If None, a new GPU array will be created and returned (using the source array allocator (pycuda, pyopencl) if available). If dest is the same array as src, an inplace transform is done. :param ndim: the number of dimensions (<=3) to use for the FFT. By default, uses the array dimensions. Can be smaller, e.g. ndim=2 for a 3D array to perform a batched 3D FFT on all the layers. The FFT is always performed along the last axes if the array's number of dimension is larger than ndim, i.e. on the x-axis for ndim=1, on the x and y axes for ndim=2. :param norm: normalisation mode, either 0 (un-normalised) or 1 (the default, also available as "backward) which will normalise the inverse transform, so DCT+iDCT will keep the array norm. :param dct_type: the type of dct desired: 1, 2 (default), 3 or 4 :param cuda_stream: the pycuda.driver.Stream or cupy.cuda.Stream to use for the transform. If None, the default one will be used :param cl_queue: the pyopencl.CommandQueue to be used. If None, the source array default queue will be used :return: the destination array. """ backend, inplace, dest, cl_queue = _prepare_transform(src, dest, cl_queue, False) app = _get_dct_app(backend, src.shape, src.dtype, inplace, ndim, norm, dct_type, cuda_stream, cl_queue) app.fft(src, dest) return dest def idctn(src, dest=None, ndim=None, norm=1, dct_type=2, cuda_stream=None, cl_queue=None): """ Perform a real->real inverse Direct Cosine Transform on a GPU array, automatically creating the VkFFTApp and caching it for future re-use. :param src: the source pycuda.gpuarray.GPUArray or cupy.ndarray :param dest: the destination GPU array. If None, a new GPU array will be created and returned (using the source array allocator (pycuda, pyopencl) if available). If dest is the same array as src, an inplace transform is done. :param ndim: the number of dimensions (<=3) to use for the FFT. By default, uses the array dimensions. Can be smaller, e.g. 
ndim=2 for a 3D array to perform a batched 2D FFT on all the layers. The FFT is always performed along the last axes if the array's number of dimensions is larger than ndim, i.e. on the x-axis for ndim=1, on the x and y axes for ndim=2. :param norm: normalisation mode, either 0 (un-normalised) or 1 (the default, also available as "backward") which will normalise the inverse transform, so DCT+iDCT will keep the array norm. :param dct_type: the type of dct desired: 2 (default), 3 or 4 :param cuda_stream: the pycuda.driver.Stream or cupy.cuda.Stream to use for the transform. If None, the default one will be used :param cl_queue: the pyopencl.CommandQueue to be used. If None, the source array default queue will be used :return: the destination array. """ backend, inplace, dest, cl_queue = _prepare_transform(src, dest, cl_queue, False) app = _get_dct_app(backend, src.shape, src.dtype, inplace, ndim, norm, dct_type, cuda_stream, cl_queue) app.ifft(src, dest) return dest def clear_vkfftapp_cache(): """ Remove all cached VkFFTApp""" _get_fft_app.cache_clear() _get_rfft_app.cache_clear() _get_dct_app.cache_clear() pyvkfft-2022.1.1/pyvkfft/opencl.py0000644000076500000240000003034214202465263017552 0ustar vincentstaff00000000000000# -*- coding: utf-8 -*- # PyVkFFT # (c) 2021- : ESRF-European Synchrotron Radiation Facility # authors: # Vincent Favre-Nicolin, favre@esrf.fr import warnings import ctypes import numpy as np import pyopencl as cl import pyopencl.array as cla from .base import load_library, primes, VkFFTApp as VkFFTAppBase, VkFFTResult, check_vkfft_result _vkfft_opencl = load_library("_vkfft_opencl") class _types: """Aliases""" vkfft_config = ctypes.c_void_p vkfft_app = ctypes.c_void_p _vkfft_opencl.make_config.restype = ctypes.c_void_p _vkfft_opencl.make_config.argtypes = [ctypes.c_size_t, ctypes.c_size_t, ctypes.c_size_t, ctypes.c_size_t, ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t, ctypes.c_int, ctypes.c_int,
ctypes.c_int, ctypes.c_int, ctypes.c_int, ctypes.c_int, ctypes.c_size_t, ctypes.c_int, ctypes.c_int, ctypes.c_int] _vkfft_opencl.init_app.restype = ctypes.c_void_p _vkfft_opencl.init_app.argtypes = [_types.vkfft_config, ctypes.c_void_p, ctypes.POINTER(ctypes.c_int)] _vkfft_opencl.fft.restype = ctypes.c_int _vkfft_opencl.fft.argtypes = [_types.vkfft_app, ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p] _vkfft_opencl.ifft.restype = ctypes.c_int _vkfft_opencl.ifft.argtypes = [_types.vkfft_app, ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p] _vkfft_opencl.free_app.restype = None _vkfft_opencl.free_app.argtypes = [_types.vkfft_app] _vkfft_opencl.free_config.restype = None _vkfft_opencl.free_config.argtypes = [_types.vkfft_config] _vkfft_opencl.vkfft_version.restype = ctypes.c_uint32 _vkfft_opencl.vkfft_version.argtypes = None class VkFFTApp(VkFFTAppBase): """ VkFFT application interface implementing a FFT plan. """ def __init__(self, shape, dtype: type, queue: cl.CommandQueue, ndim=None, inplace=True, norm=1, r2c=False, dct=False, axes=None, **kwargs): """ Init function for the VkFFT application. :param shape: the shape of the array to be transformed. The number of dimensions of the array can be larger than the FFT dimensions. :param dtype: the numpy dtype of the source array (can be complex64 or complex128) :param queue: the pyopencl CommandQueue to use for the transform. :param ndim: the number of dimensions to use for the FFT. By default, uses the array dimensions. Can be smaller, e.g. ndim=2 for a 3D array to perform a batched 3D FFT on all the layers. The FFT is always performed along the last axes if the array's number of dimension is larger than ndim, i.e. on the x-axis for ndim=1, on the x and y axes for ndim=2, etc.. Unless axes are given. :param inplace: if True (the default), performs an inplace transform and the destination array should not be given in fft() and ifft(). 
:param norm: if 0 (unnormalised), every transform multiplies the L2 norm of the array by its size (or the size of the transformed array if ndim is smaller than the number of array dimensions). if 1 (the default) or "backward", the inverse transform divides the L2 norm by the array size, so FFT+iFFT will keep the array norm. if "ortho", each transform will keep the L2 norm, but that will involve an extra read & write operation. :param r2c: if True, will perform a real->complex transform, where the complex destination is a half-hermitian array. For an inplace transform, if the input data shape is (...,nx), the input float array should have a shape of (..., nx+2), the last two columns being ignored in the input data, and the resulting complex array (using pyopencl's Array.view(dtype=np.complex64) to reinterpret the type) will have a shape (..., nx//2 + 1). For an out-of-place transform, if the input (real) shape is (..., nx), the output (complex) shape should be (..., nx//2+1). Note that for C2R transforms with ndim>=2, the source (complex) array is modified. :param dct: used to perform a Discrete Cosine Transform (DCT) aka a R2R transform. An integer can be given to specify the type of DCT (1, 2, 3 or 4). if dct=True, the DCT type 2 will be performed, following scipy's convention. :param axes: a list or tuple of axes along which the transform should be made. if None, the transform is done along the ndim fastest axes, or all axes if ndim is None. Not allowed for R2C transforms :raises RuntimeError: if the initialisation fails, e.g. if the GPU driver has not been properly initialised, or if the transform dimensions are not allowed by VkFFT. """ super().__init__(shape, dtype, ndim=ndim, inplace=inplace, norm=norm, r2c=r2c, dct=dct, axes=axes, **kwargs) self.queue = queue if self.precision == 2 and 'cl_khr_fp16' not in self.queue.device.extensions: raise RuntimeError("Half precision required but cl_khr_fp16 extension is not available") if self.precision == 8 and 'cl_khr_fp64' not in self.queue.device.extensions: raise RuntimeError("Double precision required but cl_khr_fp64 extension is not available") self.config = self._make_config() if self.config is None: raise RuntimeError("Error creating VkFFTConfiguration.
Was the OpenCL context properly initialised ?") res = ctypes.c_int(0) self.app = _vkfft_opencl.init_app(self.config, queue.int_ptr, ctypes.byref(res)) check_vkfft_result(res, shape, dtype, ndim, inplace, norm, r2c, dct, axes, "opencl") if self.app is None: raise RuntimeError("Error creating VkFFTApplication. Was the OpenCL context properly initialised ?") def __del__(self): """ Takes care of deleting allocated memory in the underlying VkFFTApplication and VkFFTConfiguration. """ if self.app is not None: _vkfft_opencl.free_app(self.app) if self.config is not None: _vkfft_opencl.free_config(self.config) def _make_config(self): """ Create a vkfft configuration for a FFT transform""" nx, ny, nz, n_batch = self.shape skipx, skipy, skipz = self.skip_axis if self.r2c and self.inplace: # the last two columns are ignored in the R array, and will be used # in the C array with a size nx//2+1 nx -= 2 if self.norm == "ortho": norm = 0 else: norm = self.norm # We pass fake buffer pointer addresses to VkFFT. The real ones will be # given when performing the actual FFT. ctx = self.queue.context device = ctx.devices[0] platform = device.platform dest_gpudata = 2 if self.inplace: dest_gpudata = 0 return _vkfft_opencl.make_config(nx, ny, nz, self.ndim, 1, dest_gpudata, platform.int_ptr, device.int_ptr, ctx.int_ptr, norm, self.precision, int(self.r2c), int(self.dct), int(self.disableReorderFourStep), int(self.registerBoost), int(self.use_lut), int(self.keepShaderCode), n_batch, skipx, skipy, skipz) def fft(self, src: cla.Array, dest: cla.Array = None): """ Compute the forward FFT :param src: the source pyopencl Array :param dest: the destination pyopencl Array. Should be None for an inplace transform :raises RuntimeError: in case of a GPU kernel launch error :return: the transformed array. For a R2C inplace transform, the complex view of the array is returned. 
""" if self.inplace: if dest is not None: if src.data.int_ptr != dest.data.int_ptr: raise RuntimeError("VkFFTApp.fft: dest is not None but this is an inplace transform") res = _vkfft_opencl.fft(self.app, int(src.data.int_ptr), int(src.data.int_ptr), int(self.queue.int_ptr)) check_vkfft_result(res, src.shape, src.dtype, self.ndim, self.inplace, self.norm, self.r2c, self.dct, backend="opencl") if self.norm == "ortho": src *= self._get_fft_scale(norm=0) if self.r2c: if src.dtype == np.float32: return src.view(dtype=np.complex64) elif src.dtype == np.float64: return src.view(dtype=np.complex128) return src else: if dest is None: raise RuntimeError("VkFFTApp.fft: dest is None but this is an out-of-place transform") elif src.data.int_ptr == dest.data.int_ptr: raise RuntimeError("VkFFTApp.fft: dest and src are identical but this is an out-of-place transform") if self.r2c: assert (dest.size == src.size // src.shape[-1] * (src.shape[-1] // 2 + 1)) res = _vkfft_opencl.fft(self.app, int(src.data.int_ptr), int(dest.data.int_ptr), int(self.queue.int_ptr)) check_vkfft_result(res, src.shape, src.dtype, self.ndim, self.inplace, self.norm, self.r2c, self.dct, backend="opencl") if self.norm == "ortho": dest *= self._get_fft_scale(norm=0) return dest def ifft(self, src: cla.Array, dest: cla.Array = None): """ Compute the backward FFT :param src: the source pyopencl.Array :param dest: the destination pyopencl.Array. Can be None for an inplace transform :raises RuntimeError: in case of a GPU kernel launch error :return: the transformed array. For a C2R inplace transform, the float view of the array is returned. 
""" if self.inplace: if dest is not None: if src.data.int_ptr != dest.data.int_ptr: raise RuntimeError("VkFFTApp.fft: dest!=src but this is an inplace transform") res = _vkfft_opencl.ifft(self.app, int(src.data.int_ptr), int(src.data.int_ptr), int(self.queue.int_ptr)) check_vkfft_result(res, src.shape, src.dtype, self.ndim, self.inplace, self.norm, self.r2c, self.dct, backend="opencl") if self.norm == "ortho": src *= self._get_ifft_scale(norm=0) if self.r2c: if src.dtype == np.complex64: return src.view(dtype=np.float32) elif src.dtype == np.complex128: return src.view(dtype=np.float64) return src if not self.inplace: if dest is None: raise RuntimeError("VkFFTApp.ifft: dest is None but this is an out-of-place transform") elif src.data.int_ptr == dest.data.int_ptr: raise RuntimeError("VkFFTApp.ifft: dest and src are identical but this is an out-of-place transform") if self.r2c: assert (src.size == dest.size // dest.shape[-1] * (dest.shape[-1] // 2 + 1)) # Special case, src and dest buffer sizes are different, # VkFFT is configured to go back to the source buffer res = _vkfft_opencl.ifft(self.app, int(dest.data.int_ptr), int(src.data.int_ptr), int(self.queue.int_ptr)) else: res = _vkfft_opencl.ifft(self.app, int(src.data.int_ptr), int(dest.data.int_ptr), int(self.queue.int_ptr)) check_vkfft_result(res, src.shape, src.dtype, self.ndim, self.inplace, self.norm, self.r2c, self.dct, backend="opencl") if self.norm == "ortho": dest *= self._get_ifft_scale(norm=0) return dest def vkfft_version(): """ Get VkFFT version :return: version as X.Y.Z """ int_ver = _vkfft_opencl.vkfft_version() return "%d.%d.%d" % (int_ver // 10000, (int_ver % 10000) // 100, int_ver % 100) pyvkfft-2022.1.1/pyvkfft/scripts/0000755000076500000240000000000014202465377017413 5ustar vincentstaff00000000000000pyvkfft-2022.1.1/pyvkfft/scripts/__init__.py0000644000076500000240000000000014202465263021504 0ustar 
vincentstaff00000000000000pyvkfft-2022.1.1/pyvkfft/scripts/pyvkfft_test.py0000755000076500000240000007266414202465263022531 0ustar vincentstaff00000000000000# -*- coding: utf-8 -*- # PyVkFFT # (c) 2022- : ESRF-European Synchrotron Radiation Facility # authors: # Vincent Favre-Nicolin, favre@esrf.fr # # # pyvkfft script to run short or long unit tests import argparse import os.path import sys import unittest import time import timeit import socket import numpy as np from pyvkfft.test import TestFFT, TestFFTSystematic from pyvkfft.version import __version__, vkfft_version def suite_default(): suite = unittest.TestSuite() load_tests = unittest.defaultTestLoader.loadTestsFromTestCase suite.addTest(load_tests(TestFFT)) return suite def suite_systematic(): suite = unittest.TestSuite() load_tests = unittest.defaultTestLoader.loadTestsFromTestCase suite.addTest(load_tests(TestFFTSystematic)) return suite def make_html_pre_post(overwrite=False): if ('pyvkfft-test1000.html' not in os.listdir()) or overwrite: # Need the html header, styles and the results' table beginning tmp = '\n \n \n' \ '\n' \ '\n' \ '
' \ '

pyVkFFT test results

\n' \ '

pyvkfft version: %s VkFFT version:%s host : %s

\n' \ '
' \ '

Methodology: the included graphs measure the accuracy of the forward ' \ 'and backward transforms: an array is generated with random uniform values ' \ 'between -0.5 and 0.5, and the results of its transform are compared ' \ 'with either pyfftw (in long double precision) if available, or scipy if ' \ 'available, or numpy fft. The L2 curve measures the average square norm ' \ 'difference, and the L&infin; curve the maximum difference.' \ '

Note: for the R2C inverse transform, the result of the forward ' \ 'transform is used instead of re-using the random array (in order to have ' \ 'a proper half-Hermitian array), contrary to what is done for other ' \ 'transforms. This explains why the IFFT R2C maximum (L&infin;) ' \ 'errors are larger.' \ '

Note 2: some "errors" for DCT may be due to unsupported sizes in VkFFT, ' \ 'which vary depending on the card and language used (amount of ' \ 'shared/local memory). So they just indicate a current limit for the ' \ 'transform sizes rather than a real error.' \ '

[Click on the highlighted cells for details and accuracy graphs ' \ 'vs the transform size]
\n' \ '

\n' \ ' \n' \ ' \n' \ ' ' \ ' ' \ ' ' \ ' ' \ ' ' \ ' ' \ ' ' \ ' ' \ ' ' \ ' ' \ ' ' \ ' ' \ ' \n' \ ' \n' \ '\n' % (__version__, vkfft_version(), socket.gethostname()) open("pyvkfft-test1000.html", "w").write(tmp) if ('pyvkfft-test1999.html' not in os.listdir()) or overwrite: tmp = '\n' \ '\n' \ '\n' \ ' ' \ ' ' \ '' \ '\n' \ '' open("pyvkfft-test1999.html", "w").write(tmp) def name_next_file(pattern="pyvkfft-test%04d.html"): """ Find the first unused name for a file, starting at i=1001 :param pattern: the pattern for the file name. :return: the filename """ lsdir = os.listdir() for i in range(1001, 1999): if pattern % i not in lsdir: return pattern % i raise RuntimeError("name_next_file: '%s' files all used from 1001 to 1998. Maybe cleanup ?" % pattern) def main(): t0 = timeit.default_timer() localt0 = time.localtime() epilog = "Examples:\n" \ " pyvkfft-test\n" \ " the regular test which tries the fft interface, using parallel\n" \ " streams (for pycuda), and C2C/R2C transforms for sizes N=30, 34\n" \ " with 1, 2 and 3D transforms, and N=808 for 1D and 2D transforms.\n" \ " All tests are done with single and double precision, in and\n" \ " out-of-place, norm=0 and 1, and all available backends (pyopencl,\n" \ " pycuda and cupy). For C2C arrays up to dimension 4 are tested,\n" \ " with all possible combinations of transform axes.\n" \ " DCT transforms of type 1,2,3 and 4 are also tested for N=30,34.\n" \ " That's for a total of a few thousand transforms, which are tested\n" \ " against the result of numpy, scipy or pyfftw (when available) for\n" \ " accuracy.\n" \ " The text output gives the N2 and Ninf (aka max) relative norm of\n" \ " the transform, with the ratio in () to the expected tolerance for\n" \ " both direct and inverse transforms.\n" \ "\n" \ " pyvkfft-test --nproc 8 --gpu v100 --mailto_fail toto@pyvkfft.org\n" \ " same test, but using 8 parallel processes to speed up, and use a GPU\n" \ " with 'v100' in its name. 
Also, send the results in case of a failure\n" \ " to the given email address\n" \ "\n" \ " pyvkfft-test --systematic --backend pycuda --nproc 8 --radix --range 2 10000\n" \ " Perform a systematic test of C2C transforms in (by default) 1D and\n" \ " single precision, for N=2 to 10000, only for radix transforms\n" \ "\n" \ " pyvkfft-test --systematic --backend pycuda --nproc 8 --radix 2 7 11 --range 2 10000 --double\n" \ " Same test, but only for radix sizes with factors 2, 7 and 11, and double accuracy\n" \ "\n" \ " pyvkfft-test --systematic --backend cupy --nproc 8 --bluestein --range 2 10000 --ndim 2 " \ "--lut --inplace\n" \ " test with cupy backend, only non-radix 2D inplace R2C transforms\n," \ " using a lookup table( lut) for higher single precision accuracy.\n" \ " (this automatically limits the x-sizes to even values for the inplace R2C\n" parser = argparse.ArgumentParser(prog='pyvkfft-test', epilog=epilog, description='Run pyvkfft unit tests, regular or systematic', formatter_class=argparse.RawDescriptionHelpFormatter) parser.add_argument('--colour', action='store_true', help="Use colour depending on how good the measured accuracy is") parser.add_argument('--html', action='store', nargs='*', help="Summarises the results in html row(s). This is saved to " "'pyvkfft-test%%04d.html', starting at i=1001 and incrementing. " "Files with i=1000 and i=1999 are the beginning and the end of the" "html file, which can be concatenated to form a valid html page." 
"If --graph is also used, this includes a graph of the accuracy " "which can be displayed by clicking on the type of transform.") parser.add_argument('--gpu', action='store', help="Name (or sub-string) of the GPU to use") parser.add_argument('--mailto', action='store', help="Email address the results will be sent to") parser.add_argument('--mailto_fail', action='store', help="Email address the results will be sent to, only if the test fails") parser.add_argument('--mailto_smtp', action='store', default="localhost", help="SMTP server address to mail the results") parser.add_argument('--nproc', action='store', nargs=1, help="Number of parallel process to use to speed up tests. " "Make sure the sum of parallel process will not use too much " "GPU memory", default=[1], type=int) parser.add_argument('--silent', action='store_true', help="Use this to minimise the written output " "(note that tests can take a long time be patient") parser.add_argument('--systematic', action='store_true', help="Perform a systematic accuracy test over a range of array sizes.\n" "Without this argument a faster test (a few minutes) will be " "performed with selected array sizes for all possible transforms.") sysgrp = parser.add_argument_group("systematic", "Options for --systematic:") sysgrp.add_argument('--axes', action='store', nargs='*', type=int, help="transform axes: x (fastest) is 1," "y is 2, z is 3, e.g. '--axes 1', '--axes 2 3'." "The default is to perform the transform along the ndim fastest " "axes. Using this overrides --ndim") sysgrp.add_argument('--backend', action='store', nargs='+', help="Choose single or multiple GPU backends," "by default all available backends are selected.", choices=['pycuda', 'cupy', 'pyopencl']) sysgrp.add_argument('--bluestein', action='store_true', help="Only perform transform with non-radix dimensions, i.e. 
the " "largest number in the prime decomposition of each array dimension " "must be larger than 13") sysgrp.add_argument('--db', nargs='*', action='store', help="Save the results to an sql database. If no filename is" "given, pyvkfft-test.sql will be used. If the file already" "exists, the results are added to the file. Fields stored" "include HOSTNAME, EPOCH, BACKEND, LANGUAGE, TRANSFORM (c2c, r2c or " "dct1/2/3/4, AXES, ARRAY_SHAPE, NDIMS, NDIM, PRECISION, INPLACE," "NORM, LUT, N, N2_FFT, N2_IFFT, NI_FFT, NI_IFFT, TOLERANCE," "DT_APP, DT_FFT, DT_IFFT, SRC_UNCHANGED_FFT, SRC_UNCHANGED_IFFT, " "GPU_NAME, SUCCESS, ERROR, VKFFT_ERROR_CODE") sysgrp.add_argument('--dct', nargs='*', action='store', type=int, help="Test direct cosine transforms (default is c2c):" " '--dct' (defaults to dct 2), '--dct 1'", choices=[1, 2, 3, 4]) sysgrp.add_argument('--double', action='store_true', help="Use double precision (float64/complex128) instead of single") sysgrp.add_argument('--dry-run', action='store_true', help="Perform a dry-run, printing the number of array shapes to test") sysgrp.add_argument('--inplace', action='store_true', help="Use inplace transforms (NB: for R2C with ndim>=2, the x-axis " "must be even-sized)") sysgrp.add_argument('--graph', action='store', nargs="?", const="", help="Save the graph of the accuracy as a function of the size" "to the given filename (if no name is given, it will be " "automatically generated)." "Requires matplotlib, and scipy for linear regression.") sysgrp.add_argument('--lut', action='store_true', help="Force the use of a LUT for the transform, to improve accuracy. " "By default VkFFT will activate the LUT on some GPU with less " "accurate accelerated trigonometric functions. " "This is automatically true for double precision") sysgrp.add_argument('--max-nb-tests', action='store', nargs=1, help="Maximum number of tests. 
If the number of generated test " "cases is larger, the program will abort.", default=[1000], type=int) sysgrp.add_argument('--ndim', action='store', nargs=1, help="Number of dimensions for the transform. Using 12 or 123 " "will result in testing both 1 and 2 or 1,2 and 3. It is " "recommended to use --range-mb and ", default=[1], type=int, choices=[1, 2, 3, 12, 123]) # sysgrp.add_argument('--ndims', action='store', nargs=1, # help="Number of dimensions for the array (must be >=ndim). " # "By default, the array will have the same dimensionality " # "as the transform (ndim)", # type=int, choices=[1, 2, 3, 4]) sysgrp.add_argument('--norm', action='store', nargs=1, type=int, help="Normalisation to test (must be 1 for dct)", default=[1], choices=[0, 1]) sysgrp.add_argument('--ref-long-double', action='store_true', help="Use long double precision for the reference calculation, " "(requires scipy). This gives more objective accuracy plots but " "can be slower (or much slower on some architectures).") sysgrp.add_argument('--r2c', action='store_true', help="Test real-to-complex transform " "(default is c2c)") sysgrp.add_argument('--radix', action='store', nargs='*', type=int, help="Perform only radix transforms. If no value is given, all available " "radix transforms are allowed. Alternatively a list can be given: " "'--radix 2' (only 2**n array sizes), '--radix 2 3 5' " "(only 2**N1 * 3**N2 * 5**N3)", choices=[2, 3, 5, 7, 11, 13]) sysgrp.add_argument('--radix-max-pow', action='store', nargs=1, type=int, help="For radix runs, specify the maximum exponent of each base " "integer, i.e. 
for '--radix 2 3 --radix-max-pow 2' will limit " "lengths to 2**N1 * 3**N2 with N1,N2<=2") sysgrp.add_argument('--range', action='store', nargs=2, type=int, help="Range of array lengths [min, max] along each transform dimension, " "'--range 2 128'", default=[2, 128]) sysgrp.add_argument('--range-mb', action='store', nargs=2, type=int, help="Allowed range of array sizes [min, max] in Mbytes, e.g. " "'--range 2 128'. This can be used to limit the arrays size " "while allowing large lengths along individual dimensions. " "It can also be used to separate runs with a given size range " "and different nproc values. This takes into account the " "type (single or double), and also whether the transform " "is made inplace, so this represents the total GPU memory" "used.", default=[0, 128]) sysgrp.add_argument('--range-nd-narrow', action='store', nargs=2, default=['0', '0'], help="Two values (drel dabs), e.g. '--range_nd_narrow 0.10 11' " "with 0<=drel<=1 and dabs (integer>=0) must be given " "to allow 2D and 3D tests to be done on arrays with different " "lengths along every dimension, but while limiting the " "difference between lengths. For example in 2D for an " "(N1,N2) array shape, generated lengths will verify " "abs(n2-n1)1.") sysgrp.add_argument('--timeout', action='store', nargs=1, type=int, help="Change the timeout (in seconds) to raise a TimeOut error for " "individual tests. After 4 have failed, give up.", default=[120]) # parser.print_help() args = parser.parse_args() if args.serial and args.nproc[0] > 1: raise RuntimeError("Cannot use --serial with --nproc") if args.graph is not None: if not len(args.graph): args.graph = name_next_file("pyvkfft-test%03d.svg") # We modify class attributes to pass arguments - not a great approach but works.. 
if args.systematic: t = TestFFTSystematic t.axes = args.axes t.bluestein = args.bluestein t.colour = args.colour t.dct = False if args.dct is None else args.dct[0] if len(args.dct) else 2 t.db = args.db[0] if args.db is not None else None t.dry_run = args.dry_run t.dtype = np.float64 if args.double else np.float32 t.gpu = args.gpu t.graph = args.graph t.inplace = args.inplace t.lut = args.lut t.max_nb_tests = args.max_nb_tests[0] t.ndim = args.ndim[0] # t.ndims = args.ndims t.norm = args.norm[0] t.nproc = args.nproc[0] t.r2c = args.r2c t.radix = args.radix t.ref_long_double = args.ref_long_double t.max_pow = None if args.radix_max_pow is None else args.radix_max_pow[0] t.range = args.range size_min_max = np.array(args.range_mb) * 1024 ** 2 // 8 if args.r2c or args.dct: size_min_max = size_min_max * 2 if args.double: size_min_max = size_min_max / 2 if args.inplace: size_min_max = size_min_max / 2 t.range_size = size_min_max.tolist() t.range_nd_narrow = float(args.range_nd_narrow[0]), int(args.range_nd_narrow[1]) t.serial = args.serial t.timeout = args.timeout[0] t.vbackend = args.backend t.verbose = not args.silent t.vn = args.range suite = unittest.defaultTestLoader.loadTestsFromTestCase(t) if t.verbose: res = unittest.TextTestRunner(verbosity=2).run(suite) else: res = unittest.TextTestRunner(verbosity=1).run(suite) if t.dry_run: print(t.nb_shapes_gen) sys.exit() else: t = TestFFT t.verbose = not args.silent t.colour = args.colour t.gpu = args.gpu t.nproc = args.nproc[0] suite = unittest.defaultTestLoader.loadTestsFromTestCase(t) if t.verbose: res = unittest.TextTestRunner(verbosity=2).run(suite) else: res = unittest.TextTestRunner(verbosity=1).run(suite) sub = os.path.split(sys.argv[0])[-1] for i in range(1, len(sys.argv)): arg = sys.argv[i] if 'mail' not in arg and 'mail' not in sys.argv[i - 1] and 'html' not in arg and 'graph' not in arg: sub += " " + arg info = "Ran:\n %s\n\n Result:%s\n\n" % (sub, "OK" if res.wasSuccessful() else "FAILURE") info += "Elapsed 
time for tests: %s\n\n" % time.strftime("%Hh %Mm %Ss", time.gmtime(timeit.default_timer() - t0)) nb_err_fail = len(res.errors) + len(res.failures) if args.html is not None: make_html_pre_post(overwrite=False) html = '' # One row for the summary html += '' html += '' if args.systematic: has_graph = False if args.graph is not None: if os.path.exists(args.graph): has_graph = True if has_graph: tmp = '" if args.bluestein: html += "" elif args.radix is None: html += "" else: html += "' html += "" % ('float64' if args.double else 'float32') html += "" % ('inplace' if args.inplace else 'out-of-place') html += "" % ('True' if args.lut else 'Auto') html += "" % args.norm[0] nbts = '[%5d tests]' % t.nb_test else: html += '' html += '' nbts = '' html += '' % (time.strftime("%Y-%m-%d %Hh%M:%S", localt0), time.strftime("%Hh %Mm %Ss", time.gmtime(timeit.default_timer() - t0)), nbts) if len(res.failures): html += '' % len(res.failures) else: html += '' if len(res.errors): html += '' % len(res.errors) else: html += '' html += '\n' if args.systematic and args.graph is not None: if os.path.exists(args.graph): # Do not put the img tag of the file, else it gets pre-loaded and # this can crash the browser when aggregating many tests. # It will only be added when clicking on the transform cell. html += '' if len(res.errors): tmp = "%s\n\nERRORS:\n\n" % sub for t, s in res.errors: tid = t.id() tid1 = tid.split('(')[0].split('.')[-1] tid0, tid2 = tid.split('.' + tid1) tmp += "=" * 70 + "\n" + '%s %s [%s]:\n' % (tid1, tid2, tid0) tmp += "-" * 70 + "\n" + s + "\n" info += tmp if args.html is not None: html += '' \ '' % tmp if len(res.failures): tmp = "%s\n\nFAILURES:\n\n" % sub for t, s in res.failures: tid = t.id() tid1 = tid.split('(')[0].split('.')[-1] tid0, tid2 = tid.split('.' 
+ tid1) tmp += "=" * 70 + "\n" + '%s %s [%s]:\n' % (tid1, tid2, tid0) tmp += "-" * 70 + "\n" + s + "\n" info += tmp if args.html is not None: html += '' \ '' % tmp if args.mailto_fail is not None and (nb_err_fail > 0) or args.mailto is not None: import smtplib try: from email.message import EmailMessage msg = EmailMessage() msg['From'] = '"pyvkfft" <%s>' % args.mailto if args.mailto is not None else args.mailto_fail msg['To'] = msg['From'] msg['Subject'] = '[fail=%d error=%d] %s' % \ (len(res.failures), len(res.errors), sub) print("Mailing results:\nFrom: %s\nTo: \n %sSubject: %s" % (msg['to'], msg['to'], msg['Subject'])) msg.set_content(info) s = smtplib.SMTP(args.mailto_smtp) s.send_message(msg) s.quit() except (ConnectionRefusedError, smtplib.SMTPConnectError): print("Could not connect to SMTP server (%s) to send email." % args.mailto_smtp) if args.html is not None: if len(args.html): html_out = args.html[0] else: html_out = name_next_file("pyvkfft-test%03d.html") with open(html_out, 'w') as f: f.write(html) sys.exit(int(nb_err_fail > 0)) if __name__ == '__main__': main() pyvkfft-2022.1.1/pyvkfft/scripts/pyvkfft_test_suite.py0000644000076500000240000001373114202465263023725 0ustar vincentstaff00000000000000# -*- coding: utf-8 -*- # PyVkFFT # (c) 2022- : ESRF-European Synchrotron Radiation Facility # authors: # Vincent Favre-Nicolin, favre@esrf.fr # # # script to run a long multi-test accuracy suite import os def main(): gpu = 'v100' nproc0 = 20 gpu_gb = 32 dry_run = False backend = 'cupy' # Basic test com = "pyvkfft-test --nproc %d --html --range-mb 0 4100" % nproc0 if dry_run: print(com) else: os.system(com) # systematic tests vtransform = [' ', ' --r2c ', ' --dct 1', ' --dct 2', ' --dct 3', ' --dct 4'] # vtransform = [' ', ' --r2c '] vdim = [1, 2, 3] vnorm = [' --norm 1', ' --norm 0'] vlut = ['', ' --lut'] vprec = ['', ' --double'] vradix = [' --radix', ' --bluestein'] vinplace = ['', ' --inplace'] for radix in vradix: for norm in vnorm: for lut in vlut: for 
inplace in vinplace: for prec in vprec: if ' --lut' in lut and 'double' in prec: continue for transform in vtransform: if 'dct' in transform and '0' in norm: continue for ndim in vdim: n1 = 3 if 'dct 4' in transform else 2 if ndim == 1: if 'dct 1' in transform: n2 = 767 if 'double' in prec else 1535 elif 'dct' in transform: if 'double' in prec: n2 = 1536 if 'bluestein' in radix else 3071 else: n2 = 3071 if 'bluestein' in radix else 4096 else: n2 = 100000 if 'radix' in radix else 10000 elif ndim == 2: if 'dct 1' in transform: n2 = 512 if 'double' in prec else 1024 elif 'dct' in transform: n2 = 1024 if 'bluestein' in radix and 'double' in prec else 2047 else: n2 = 4500 else: # ndim==3 if 'dct' in transform: n2 = 500 else: n2 = 550 mem = n2 ** ndim * 8 if 'double' in prec: mem *= 2 if 'inplace' not in inplace: mem *= 2 if 'dct' in transform or 'r2c' in transform: mem /= 2 nproc1 = gpu_gb // (mem / 1024 ** 3 * 1.5) nproc = max(1, min(nproc1, nproc0)) com = 'pyvkfft-test --systematic --backend %s --gpu %s --graph --html' % (backend, gpu) com += ' --max-nb-tests 0' com += ' --nproc %2d --ndim %d --range %d %6d' % (nproc, ndim, n1, n2) com += transform + radix + inplace + prec + lut + norm + ' --range-mb 0 4100' if dry_run: print(com) else: os.system(com) # Last, run a few 2D and 3D tests where the lengths can differ, # and radix and Bluestein transforms are mixed. 
for transform in ['', ' --r2c']: norm = ' --norm 1' for radix, lut, inplace, prec, ndim, n1, n2, rn in \ [(' --radix', '', '', '', 2, 2, 4500, ' --range-nd-narrow 0.02 4 --radix-max-pow 3'), (' --radix', ' --lut', '', '', 2, 2, 4500, ' --range-nd-narrow 0.02 4 --radix-max-pow 3'), (' --radix', '', ' --inplace', '', 2, 2, 4500, ' --range-nd-narrow 0.02 4 --radix-max-pow 3'), (' --radix', '', '', ' --double', 2, 2, 4500, ' --range-nd-narrow 0.02 4 --radix-max-pow 3'), (' --radix', '', '', '', 3, 2, 150, ' --range-nd-narrow 0.02 4 --radix-max-pow 3'), (' --radix', ' --lut', '', '', 3, 2, 150, ' --range-nd-narrow 0.02 4 --radix-max-pow 3'), (' --radix', '', ' --inplace', '', 3, 2, 150, ' --range-nd-narrow 0.02 4 --radix-max-pow 3'), (' --radix', '', '', ' --double', 3, 2, 150, ' --range-nd-narrow 0.02 4 --radix-max-pow 3'), ('', '', '', '', 2, 1008, 1040, ' --range-nd-narrow 0.02 4'), ('', '', '', '', 2, 2032, 2064, ' --range-nd-narrow 0.02 4'), ('', '', '', '', 2, 4080, 4112, ' --range-nd-narrow 0.02 4'), ('', '', '', '', 3, 120, 140, ' --range-nd-narrow 0.02 4'), ]: mem = n2 ** ndim * 8 if 'double' in prec: mem *= 2 if 'inplace' not in inplace: mem *= 2 if 'dct' in transform or 'r2c' in transform: mem /= 2 nproc1 = gpu_gb // (mem / 1024 ** 3 * 1.5) nproc = max(1, min(nproc1, nproc0)) com = 'pyvkfft-test --systematic --backend %s --gpu %s --graph --html' % (backend, gpu) com += ' --max-nb-tests 0' com += ' --nproc %2d --ndim %d --range %d %6d' % (nproc, ndim, n1, n2) com += transform + radix + inplace + prec + lut + norm + rn + ' --range-mb 0 4100' if dry_run: print(com) os.system(com + ' --dry-run') else: os.system(com) if __name__ == '__main__': main() pyvkfft-2022.1.1/pyvkfft/test/0000755000076500000240000000000014202465377016703 5ustar vincentstaff00000000000000pyvkfft-2022.1.1/pyvkfft/test/__init__.py0000644000076500000240000000036314202465263021010 0ustar vincentstaff00000000000000import unittest from .test_fft import suite as test_fft_suite, TestFFT, 
TestFFTSystematic, has_pycuda, has_cupy, has_pyopencl def suite(): test_suite = unittest.TestSuite() test_suite.addTest(test_fft_suite()) return test_suite pyvkfft-2022.1.1/pyvkfft/test/test_fft.py0000644000076500000240000010376214202465263021076 0ustar vincentstaff00000000000000# -*- coding: utf-8 -*- # PyVkFFT # (c) 2021- : ESRF-European Synchrotron Radiation Facility # authors: # Vincent Favre-Nicolin, favre@esrf.fr # # # pyvkfft unit tests. import sys import unittest import multiprocessing import sqlite3 import socket import time import timeit import numpy as np try: from scipy.misc import ascent except ImportError: def ascent(): return np.random.randint(0, 255, (512, 512)) from pyvkfft.version import __version__, vkfft_version from pyvkfft.base import primes, radix_gen, radix_gen_n from pyvkfft.fft import fftn as vkfftn, ifftn as vkifftn, rfftn as vkrfftn, \ irfftn as vkirfftn, dctn as vkdctn, idctn as vkidctn from pyvkfft.accuracy import test_accuracy, test_accuracy_kwargs, fftn, init_ctx, gpu_ctx_dic, has_dct_ref, has_scipy try: import pycuda.gpuarray as cua import pycuda.driver as cu_drv from pyvkfft.cuda import VkFFTApp as cuVkFFTApp has_pycuda = True except ImportError: has_pycuda = False try: import cupy as cp has_cupy = True except ImportError: has_cupy = False try: import pyopencl as cl import pyopencl.array as cla has_pyopencl = True except ImportError: has_pyopencl = False def latex_float(f): float_str = "{0:.2g}".format(f) if "e" in float_str: base, exponent = float_str.split("e") return r"{0} \times 10^{{{1}}}".format(base, int(exponent)) else: return float_str class TestFFT(unittest.TestCase): gpu = None nproc = 1 verbose = False colour = False def test_backend(self): self.assertTrue(has_pycuda or has_pyopencl or has_cupy, "Either pycuda, pyopencl or cupy must be available") @unittest.skipIf(not (has_pycuda or has_cupy or has_pyopencl), "No OpenCL/CUDA backend is available") def test_simple_fft(self): """Test the simple fft API""" vbackend = [] if 
has_pycuda: vbackend.append("pycuda") if has_cupy: vbackend.append("cupy") if has_pyopencl: vbackend.append("pyopencl") for backend in vbackend: with self.subTest(backend=backend): init_ctx(backend, gpu_name=self.gpu, verbose=False) if backend == "pycuda": dc = cua.to_gpu(ascent().astype(np.complex64)) dr = cua.to_gpu(ascent().astype(np.float32)) elif backend == "cupy": dc = cp.array(ascent().astype(np.complex64)) dr = cp.array(ascent().astype(np.float32)) else: cq = gpu_ctx_dic["pyopencl"][2] dc = cla.to_device(cq, ascent().astype(np.complex64)) dr = cla.to_device(cq, ascent().astype(np.float32)) # C2C, new destination array d = vkfftn(dc) d = vkifftn(d) # C2C in-place d = vkfftn(d, d) d = vkifftn(d, d) # C2C out-of-place d2 = d.copy() d2 = vkfftn(d, d2) d = vkifftn(d2, d) # R2C, new destination array d = vkrfftn(dr) self.assertTrue(d.dtype == np.complex64) d = vkirfftn(d) self.assertTrue(d.dtype == np.float32) # R2C, inplace d = vkrfftn(dr, dr) self.assertTrue(d.dtype == np.complex64) d = vkirfftn(d, d) self.assertTrue(d.dtype == np.float32) # DCT, new destination array d = vkdctn(dr) d = vkidctn(d) # DCT, out-of-place d2 = dr.copy() d2 = vkdctn(dr, d2) dr = vkidctn(d2, dr) # DCT, inplace d = vkdctn(dr, dr) d = vkidctn(d, d) def run_fft(self, vbackend, vn, dims_max=4, ndim_max=3, shuffle_axes=True, vtype=(np.complex64, np.complex128), vlut="auto", vinplace=(True, False), vnorm=(0, 1), vr2c=(False,), vdct=(False,), verbose=False, dry_run=False): """ Run a series of tests :param vbackend: list of backends to test among "pycuda", "cupy and "pyopencl" :param vn: list of transform sizes to test :param dims_max: max number of dimensions for the array (up to 4) :param ndim_max: max transform dimension :param shuffle_axes: if True, all possible axes combinations will be tried for the given shape of the array and the number of transform dimensions, e.g. for a 3D array and ndim=2 this would try (-1, -2), (-1, -3) and (-2,-3). This applies only to C2C transforms. 
:param vtype: list of array types among float32, float64, complex64, complex128 :param vlut: if "auto" (the default), will test useLUT=None and True, except for double precision where LUT is always enabled. Can be a list of values among None (uses VkFFT default), 0/False and 1/True. :param vinplace: a list among True and False :param vnorm: a list among 0, 1, and (for C2C only) "ortho" :param vr2c: a list among True, False to perform r2c tests :param vdct: a list among False/0, 1, 2, 3, 4 to test various DCT :param verbose: True or False - prints two lines per test (FFT and iFFT result) :param dry_run: if True, only count the number of test to run :return: the number of tests performed, and the list of kwargs (dry run) """ ct = 0 vkwargs = [] for backend in vbackend: init_ctx(backend, gpu_name=self.gpu, verbose=False) cq = gpu_ctx_dic["pyopencl"][2] if backend == "pyopencl" else None for n in vn: for dims in range(1, dims_max + 1): for ndim0 in range(1, min(dims, ndim_max) + 1): for r2c in vr2c: for dct in vdct: # Setup use of either ndim or axes, also test skipping dimensions ndim_axes = [(ndim0, None)] if shuffle_axes and not (r2c or dct): for i in range(1, 2 ** (ndim0 - 1)): axes = [] for ii in range(ndim0): if not (i & 2 ** ii): axes.append(-ii - 1) ndim_axes.append((None, axes)) for ndim, axes in ndim_axes: for dtype in vtype: if axes is None: axes_numpy = list(range(dims))[-ndim:] else: axes_numpy = axes # Array shape sh = [n] * dims # Use only a size of 2 for non-transform axes for ii in range(len(sh)): if ii not in axes_numpy and (-len(sh) + ii) not in axes_numpy: sh[ii] = 2 if not dry_run: if dtype in (np.float32, np.float64): d0 = np.random.uniform(-0.5, 0.5, sh).astype(dtype) else: d0 = (np.random.uniform(-0.5, 0.5, sh) + 1j * np.random.uniform(-0.5, 0.5, sh)).astype(dtype) if vlut == "auto": if dtype in (np.float64, np.complex128): # By default LUT is enabled for complex128, no need to test twice tmp = [None] else: tmp = [None, True] else: tmp = vlut 
for use_lut in tmp: for inplace in vinplace: for norm in vnorm: with self.subTest(backend=backend, n=n, dims=dims, ndim=ndim, axes=axes, dtype=np.dtype(dtype), norm=norm, use_lut=use_lut, inplace=inplace, r2c=r2c, dct=dct): ct += 1 if not dry_run: res = test_accuracy(backend, sh, ndim, axes, dtype, inplace, norm, use_lut, r2c=r2c, dct=dct, gpu_name=self.gpu, stream=None, queue=cq, return_array=False, init_array=d0, verbose=verbose) npr = primes(n) ni, n2 = res["ni"], res["n2"] nii, n2i = res["nii"], res["n2i"] tol = res["tol"] src1 = res["src_unchanged_fft"] src2 = res["src_unchanged_ifft"] self.assertTrue(ni < tol, "Accuracy mismatch after FFT, " "n2=%8e ni=%8e>%8e" % (n2, ni, tol)) self.assertTrue(nii < tol, "Accuracy mismatch after iFFT, " "n2=%8e ni=%8e>%8e" % (n2, nii, tol)) if not inplace: self.assertTrue(src1, "The source array was modified " "during the FFT") nmaxr2c1d = 3072 * (1 + int( dtype in (np.float32, np.complex64))) if not r2c or (ndim == 1 and max(npr) <= 13) \ and n < nmaxr2c1d: self.assertTrue(src2, "The source array was modified " "during the iFFT") else: kwargs = {"backend": backend, "shape": sh, "ndim": ndim, "axes": axes, "dtype": dtype, "inplace": inplace, "norm": norm, "use_lut": use_lut, "r2c": r2c, "dct": dct, "gpu_name": self.gpu, "stream": None, "verbose": False, "colour_output": self.colour} vkwargs.append(kwargs) return ct, vkwargs def run_fft_parallel(self, vkwargs): # Need to use spawn to handle the GPU context with multiprocessing.get_context('spawn').Pool(self.nproc) as pool: for res in pool.imap(test_accuracy_kwargs, vkwargs): with self.subTest(backend=res['backend'], n=max(res['shape']), ndim=res['ndim'], dtype=np.dtype(res['dtype']), norm=res['norm'], use_lut=res['use_lut'], inplace=res['inplace'], r2c=res['r2c'], dct=res['dct']): n = max(res['shape']) npr = primes(n) ni, n2 = res["ni"], res["n2"] nii, n2i = res["nii"], res["n2i"] tol = res["tol"] src1 = res["src_unchanged_fft"] src2 = res["src_unchanged_ifft"] if 
self.verbose: print(res['str']) self.assertTrue(ni < tol, "Accuracy mismatch after FFT, n2=%8e ni=%8e>%8e" % (n2, ni, tol)) self.assertTrue(nii < tol, "Accuracy mismatch after iFFT, n2=%8e ni=%8e>%8e" % (n2, nii, tol)) if not res['inplace']: self.assertTrue(src1, "The source array was modified during the FFT") nmaxr2c1d = 3072 * (1 + int(res['dtype'] in (np.float32, np.complex64))) if not res['r2c'] or (res['ndim'] == 1 and max(npr) <= 13) and n < nmaxr2c1d: # Only 1D radix C2R do not alter the source array, # if n<= 3072 or 6144 (assuming 48kb shared memory) self.assertTrue(src2, "The source array was modified during the iFFT") @unittest.skipIf(not (has_pycuda or has_cupy or has_pyopencl), "No OpenCL/CUDA backend is available") def test_c2c(self): """Run C2C tests""" vbackend = [] if has_pycuda: vbackend.append("pycuda") if has_cupy: vbackend.append("cupy") if has_pyopencl: vbackend.append("pyopencl") for backend in vbackend: init_ctx(backend, gpu_name=self.gpu, verbose=False) has_cl_fp64 = gpu_ctx_dic["pyopencl"][3] if backend == "pyopencl" else True ct = 0 vkwargs = [] for dry_run in [True, False]: vtype = (np.complex64, np.complex128) if backend == "pyopencl" and not has_cl_fp64: vtype = (np.complex64,) v = self.verbose and not dry_run if dry_run or self.nproc == 1: tmp = self.run_fft([backend], [30, 34], vtype=vtype, verbose=v, dry_run=dry_run, shuffle_axes=False) ct += tmp[0] vkwargs += tmp[1] tmp = self.run_fft([backend], [808], vtype=vtype, dims_max=2, verbose=v, dry_run=dry_run, shuffle_axes=False) ct += tmp[0] vkwargs += tmp[1] else: self.run_fft_parallel(vkwargs) if dry_run and self.verbose: print("Running %d C2C tests (backend: %s)" % (ct, backend)) @unittest.skipIf(not (has_pycuda or has_cupy or has_pyopencl), "No OpenCL/CUDA backend is available") def test_r2c(self): """Run R2C tests""" vbackend = [] if has_pycuda: vbackend.append("pycuda") if has_cupy: vbackend.append("cupy") if has_pyopencl: vbackend.append("pyopencl") for backend in vbackend: 
            init_ctx(backend, gpu_name=self.gpu, verbose=False)
            has_cl_fp64 = gpu_ctx_dic["pyopencl"][3] if backend == "pyopencl" else True
            ct = 0
            vkwargs = []
            for dry_run in [True, False]:
                vtype = (np.float32, np.float64)
                if backend == "pyopencl" and not has_cl_fp64:
                    vtype = (np.float32,)
                v = self.verbose and not dry_run
                if dry_run or self.nproc == 1:
                    tmp = self.run_fft([backend], [30, 34], vtype=vtype, vr2c=(True,), verbose=v, dry_run=dry_run)
                    ct += tmp[0]
                    vkwargs += tmp[1]
                    tmp = self.run_fft([backend], [808], vtype=vtype, dims_max=2, vr2c=(True,), verbose=v,
                                       dry_run=dry_run)
                    ct += tmp[0]
                    vkwargs += tmp[1]
                else:
                    self.run_fft_parallel(vkwargs)
                if dry_run and self.verbose:
                    print("Running %d R2C tests (backend: %s)" % (ct, backend))

    @unittest.skipIf(not (has_pycuda or has_cupy or has_pyopencl), "No OpenCL/CUDA backend is available")
    @unittest.skipIf(not has_dct_ref, "scipy and pyfftw are not available - cannot test DCT")
    def test_dct(self):
        """Run DCT tests"""
        vbackend = []
        if has_pycuda:
            vbackend.append("pycuda")
        if has_cupy:
            vbackend.append("cupy")
        if has_pyopencl:
            vbackend.append("pyopencl")
        for backend in vbackend:
            init_ctx(backend, gpu_name=self.gpu, verbose=False)
            has_cl_fp64 = gpu_ctx_dic["pyopencl"][3] if backend == "pyopencl" else True
            ct = 0
            vkwargs = []
            for dry_run in [True, False]:
                vtype = (np.float32, np.float64)
                if backend == "pyopencl" and not has_cl_fp64:
                    vtype = (np.float32,)
                v = self.verbose and not dry_run
                if dry_run or self.nproc == 1:
                    tmp = self.run_fft([backend], [30, 34], vtype=vtype, vnorm=[1], vdct=range(1, 5), verbose=v,
                                       dry_run=dry_run)
                    ct += tmp[0]
                    vkwargs += tmp[1]
                else:
                    self.run_fft_parallel(vkwargs)
                if dry_run and self.verbose:
                    print("Running %d DCT tests (backend: %s)" % (ct, backend))

    @unittest.skipIf(not has_pycuda, "pycuda is not available")
    def test_pycuda_streams(self):
        """
        Test multiple FFT in // with different cuda streams.
        """
        for dtype in (np.complex64, np.complex128):
            with self.subTest(dtype=np.dtype(dtype)):
                init_ctx("pycuda", gpu_name=self.gpu, verbose=False)
                if dtype == np.complex64:
                    rtol = 1e-6
                else:
                    rtol = 1e-12
                sh = (256, 256)
                d = (np.random.uniform(-0.5, 0.5, sh) + 1j * np.random.uniform(-0.5, 0.5, sh)).astype(dtype)
                n_streams = 5
                vd = []
                vapp = []
                for i in range(n_streams):
                    vd.append(cua.to_gpu(np.roll(d, i * 7, axis=1)))
                    vapp.append(cuVkFFTApp(d.shape, d.dtype, ndim=2, norm=1, stream=cu_drv.Stream()))

                for i in range(n_streams):
                    vapp[i].fft(vd[i])
                for i in range(n_streams):
                    dn = fftn(np.roll(d, i * 7, axis=1))
                    self.assertTrue(np.allclose(dn, vd[i].get(), rtol=rtol, atol=abs(dn).max() * rtol))

                for i in range(n_streams):
                    vapp[i].ifft(vd[i])
                for i in range(n_streams):
                    dn = np.roll(d, i * 7, axis=1)
                    self.assertTrue(np.allclose(dn, vd[i].get(), rtol=rtol, atol=abs(dn).max() * rtol))


# The class parameters are written in pyvkfft_test.main()
class TestFFTSystematic(unittest.TestCase):
    axes = None
    bluestein = False
    colour = False
    dct = False
    db = None
    dry_run = False
    dtype = np.float32
    graph = None
    gpu = None
    inplace = False
    lut = False
    max_pow = None
    max_nb_tests = 1000
    nb_test = 0  # Number of tests actually run
    nb_shapes_gen = None
    ndim = 1
    # t.ndims = args.ndims
    norm = 1
    nproc = 1
    r2c = False
    radix = None
    range = 2, 128
    range_nd_narrow = 0, 0
    range_size = 0, 128 * 1024 ** 2 // 8
    ref_long_double = False
    serial = False
    timeout = 30
    vbackend = None
    verbose = True
    vshape = []

    def setUp(self) -> None:
        if self.vbackend is None:
            self.vbackend = []
            if has_pycuda:
                self.vbackend.append("pycuda")
            if has_cupy:
                self.vbackend.append("cupy")
            if has_pyopencl:
                self.vbackend.append("pyopencl")
                init_ctx("pyopencl", gpu_name=self.gpu, verbose=False)
                self.cq, self.has_cl_fp64 = gpu_ctx_dic["pyopencl"][2:]
        self.assertTrue(not self.bluestein or self.radix is None, "Cannot select both Bluestein and radix")
        if not self.bluestein and self.radix is None:
            self.vshape = radix_gen_n(nmax=self.range[1],
                                      max_size=self.range_size[1], radix=None, ndim=self.ndim, even=self.r2c,
                                      nmin=self.range[0], max_pow=self.max_pow,
                                      range_nd_narrow=self.range_nd_narrow, min_size=self.range_size[0])
        elif self.bluestein:
            self.vshape = radix_gen_n(nmax=self.range[1], max_size=self.range_size[1],
                                      radix=(2, 3, 5, 7, 11, 13), ndim=self.ndim, even=self.r2c,
                                      inverted=True, nmin=self.range[0], max_pow=self.max_pow,
                                      range_nd_narrow=self.range_nd_narrow, min_size=self.range_size[0])
        else:
            if len(self.radix) == 0:
                self.radix = [2, 3, 5, 7, 11, 13]
            if self.r2c and 2 not in self.radix:  # and inplace ?
                raise RuntimeError("For r2c, the x/fastest axis must be even (requires radix-2)")
            self.vshape = radix_gen_n(nmax=self.range[1], max_size=self.range_size[1],
                                      radix=self.radix, ndim=self.ndim, even=self.r2c,
                                      nmin=self.range[0], max_pow=self.max_pow,
                                      range_nd_narrow=self.range_nd_narrow, min_size=self.range_size[0])
        if not self.dry_run:
            self.assertTrue(len(self.vshape), "The list of sizes to test is empty !")
            if self.max_nb_tests:
                self.assertTrue(len(self.vshape) <= self.max_nb_tests,
                                "Too many array shapes have been generated: "
                                "%d > %d [parameter hint: max-nb-tests]" %
                                (len(self.vshape), self.max_nb_tests))

    def test_systematic(self):
        if self.dry_run:
            # The array shapes to test have been generated
            if self.verbose:
                print("Dry run: %d array shapes generated" % len(self.vshape))
            # OK, this lacks elegance, but works to get back the value in the scripts
            self.__class__.nb_shapes_gen = len(self.vshape)
            return
        # Generate the list of configurations as kwargs for test_accuracy()
        vkwargs = []
        for backend in self.vbackend:
            for s in self.vshape:
                kwargs = {"backend": backend, "shape": s, "ndim": len(s), "axes": self.axes,
                          "dtype": self.dtype, "inplace": self.inplace, "norm": self.norm,
                          "use_lut": self.lut, "r2c": self.r2c, "dct": self.dct, "gpu_name": self.gpu,
                          "stream": None, "verbose": False, "colour_output": self.colour,
                          "ref_long_double": self.ref_long_double}
                vkwargs.append(kwargs)
        if self.db is not None:  #
            # TODO secure the db with a context 'with'
            db = sqlite3.connect(self.db)
            dbc = db.cursor()
            # hostname is stored as text (socket.gethostname() returns a string)
            dbc.execute('CREATE TABLE IF NOT EXISTS pyvkfft_test (epoch int, hostname text,'
                        'backend text, language text, transform text, axes text, array_shape text,'
                        'ndims int, ndim int, precision int, inplace int, norm int, lut int,'
                        'n int, n2_fft float, n2_ifft float, ni_fft float, ni_ifft float, tolerance float,'
                        'dt_app float, dt_fft float, dt_ifft float, src_unchanged_fft int, src_unchanged_ifft int,'
                        'gpu_name text, success int, error int, vkfft_error_code int)')
            db.commit()
            hostname = socket.gethostname()
            lang = 'opencl' if 'opencl' in backend else 'cuda'
            if self.r2c:
                transform = "R2C"
            elif self.dct:
                transform = "DCT%d" % self.dct
            else:
                transform = "C2C"

        # For graph output
        vn, vni, vn2, vnii, vn2i, vblue, vshape = [], [], [], [], [], [], []
        gpu_name = "GPU"

        if self.verbose:
            print("Starting %d tests..." % (len(vkwargs)))
        t0 = timeit.default_timer()
        # Handle timeouts if for some weird reason a process hangs indefinitely
        nb_timeout = 0
        i_start = 0
        while True:
            timeout = False
            # Need to use spawn to handle the GPU context
            with multiprocessing.get_context('spawn').Pool(self.nproc) as pool:
                if not self.serial:
                    results = pool.imap(test_accuracy_kwargs, vkwargs[i_start:], chunksize=1)
                for i in range(i_start, len(vkwargs)):
                    v = vkwargs[i]
                    # Use the backend of the current test, not the last value
                    # left over from the configuration loop above
                    backend = v['backend']
                    sh = v['shape']
                    ndim = len(sh)
                    # We use np.dtype(dtype) instead of dtype because it is written out simply
                    # as e.g.
                    # "float32" instead of "<class 'numpy.float32'>"
                    with self.subTest(backend=backend, shape=sh, ndim=ndim,
                                      dtype=np.dtype(self.dtype), norm=self.norm, use_lut=self.lut,
                                      inplace=self.inplace, r2c=self.r2c, dct=self.dct):
                        if self.serial:
                            res = test_accuracy_kwargs(v)
                        else:
                            try:
                                res = results.next(timeout=self.timeout)
                            except multiprocessing.TimeoutError as ex:
                                # NB: the timeout won't change the next() result, so will need
                                # to terminate & restart the pool
                                timeout = True
                                raise ex
                        n = max(res['shape'])
                        npr = primes(n)
                        ni, n2 = res["ni"], res["n2"]
                        nii, n2i = res["nii"], res["n2i"]
                        tol = res["tol"]
                        src1 = res["src_unchanged_fft"]
                        src2 = res["src_unchanged_ifft"]
                        succ = max(ni, nii) < tol
                        vn.append(n)
                        vblue.append(max(npr) > 13)
                        vni.append(ni)
                        vn2.append(n2)
                        vn2i.append(n2i)
                        vnii.append(nii)
                        vshape.append(sh)
                        if len(vn) == 1:
                            gpu_name = res["gpu_name"]
                        if not self.inplace:
                            if not src1:
                                succ = False
                            elif not self.r2c and not src2:
                                succ = False
                        if self.db is not None:
                            # Language of the backend used for this test
                            lang = 'opencl' if 'opencl' in backend else 'cuda'
                            dbc.execute('INSERT INTO pyvkfft_test VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,'
                                        '?,?,?,?,?,?,?,?,?,?,?,?,?)',
                                        (time.time(), hostname, backend, lang, transform,
                                         str(res['axes']).encode('ascii'), str(res['shape']).encode('ascii'),
                                         len(res['shape']), ndim, np.dtype(self.dtype).itemsize,
                                         self.inplace, self.norm, self.lut, int(max(res['shape'])),
                                         float(n2), float(n2i), float(ni), float(nii), float(tol),
                                         res["dt_app"], res["dt_fft"], res["dt_ifft"],
                                         int(src1), int(src2), res["gpu_name"].encode('ascii'),
                                         int(succ), 0, 0))
                            db.commit()
                        if self.verbose:
                            print(res['str'])
                        self.assertTrue(ni < tol, "Accuracy mismatch after FFT, n2=%8e ni=%8e>%8e" % (n2, ni, tol))
                        # Report the iFFT L2 value (n2i) in the iFFT message
                        self.assertTrue(nii < tol, "Accuracy mismatch after iFFT, n2=%8e ni=%8e>%8e" % (n2i, nii, tol))
                        if not self.inplace:
                            self.assertTrue(src1, "The source array was modified during the FFT")
                            nmaxr2c1d = 3072 * (1 + int(self.dtype in (np.float32, np.complex64)))
                            if not self.r2c or ((ndim == 1 and max(npr) <= 13) and n < nmaxr2c1d):
                                # Only 1D radix C2R do not alter the source array, if n<=?
                                self.assertTrue(src2, "The source array was modified during the iFFT %d %d" %
                                                (n, nmaxr2c1d))
                    if timeout:
                        # One process is stuck, must kill the pool and start again
                        if self.verbose:
                            print("Timeout for N=%d. Re-starting the pool..." % max(v['shape']))
                        i_start = i + 1
                        pool.terminate()
                        nb_timeout += 1
                        break
            if not timeout or i_start >= len(vkwargs) or nb_timeout >= 4:
                break
        # vkwargs already holds one entry per (backend, shape) combination
        self.__class__.nb_test = len(vkwargs)
        if self.verbose:
            print("Finished %d tests in %s" %
                  (len(vkwargs), time.strftime("%Hh %Mm %Ss", time.gmtime(timeit.default_timer() - t0))))

        if self.graph is not None and len(vn):
            if self.r2c:
                t = "R2C"
            elif self.dct:
                t = "DCT%d" % self.dct
            else:
                t = "C2C"
            tmp = ""
            if self.lut:
                tmp += "_lut"
            if self.inplace:
                tmp += "_inplace"
            r = ""
            if self.radix is not None:
                r = "_radix"
                for k in self.radix:
                    r += "-%d" % k
            elif self.bluestein:
                r = "_bluestein"
            tit = "%s %s pyvkfft %s VkFFT %s" % (gpu_name, self.vbackend[0], __version__, vkfft_version())
            if self.ndim == 12:
                sndim = "1D2D"
            elif self.ndim == 123:
                sndim = "1D2D3D"
            else:
                sndim = "%dD" % self.ndim
            # Use the dtype actually tested in the title (was hard-coded to float32)
            suptit = " %s %s%s N=%d-%d norm=%d %s%s" % \
                     (t, sndim, r, self.range[0], self.range[1], self.norm,
                      str(np.dtype(self.dtype)), tmp)
            if self.ref_long_double and has_scipy:
                suptit += " [long double ref]"
            suptit += " [%d tests]" % self.nb_test

            import matplotlib.pyplot as plt
            from scipy import stats
            plt.figure(figsize=(8, 5))

            x = np.array([np.prod(s) for s in vshape], dtype=np.float32)
            xl = np.log10(x)
            ms = 4
            plt.semilogx(x, vni, 'ob', label=r"$[FFT]L_{\infty}$", alpha=0.2, ms=ms)
            plt.semilogx(x, vnii, 'og', label=r"$[IFFT]L_{\infty}$", alpha=0.2, ms=ms)
            r2 = stats.linregress(xl, np.array(vn2, dtype=np.float32))
            plt.semilogx(x, vn2, "^b", ms=ms,
                         label=r"$[FFT]L2\approx %s+%s\log(size)$" % (latex_float(r2[1]), latex_float(r2[0])))
            r2i = stats.linregress(xl, np.array(vn2i, dtype=np.float32))
            # Plot the iFFT L2 values (vn2i), matching the label and the r2i fit
            plt.semilogx(x, vn2i, "vg", ms=ms,
                         label=r"$[IFFT]L2\approx %s+%s\log(size)$" % (latex_float(r2i[1]),
                                                                       latex_float(r2i[0])))
            plt.semilogx(x, r2[1] + r2[0] * xl, "b-")
            plt.semilogx(x, r2i[1] + r2i[0] * xl, "g-")
            plt.title(tit.replace('_', ' '), fontsize=10)
            plt.suptitle(suptit, fontsize=12)
            plt.grid(True)
            plt.legend(loc='upper left')
            plt.xlabel("size", loc='right')
            plt.tight_layout()
            graph = self.graph
            if not len(graph):
                # Use the dtype actually tested in the file name (was hard-coded to float32)
                graph = "%s_%s_%s_%s%s_%d-%d_norm%d_%s%s.svg" % \
                        (gpu_name.replace(' ', ''), self.vbackend[0], t, sndim, r,
                         self.range[0], self.range[1], self.norm, str(np.dtype(self.dtype)), tmp)
            plt.savefig(graph)
            if self.verbose:
                print("Saved accuracy graph to: %s" % graph)
        if nb_timeout >= 4:
            raise RuntimeError("4 multiprocessing timeouts while testing... giving up")


def suite():
    test_suite = unittest.TestSuite()
    load_tests = unittest.defaultTestLoader.loadTestsFromTestCase
    test_suite.addTest(load_tests(TestFFT))
    return test_suite


if __name__ == '__main__':
    unittest.main(defaultTest='suite', verbosity=2)
pyvkfft-2022.1.1/pyvkfft/version.py0000644000076500000240000000162614202465263017762 0ustar vincentstaff00000000000000
# -*- coding: utf-8 -*-

__authors__ = ["Vincent Favre-Nicolin (pyvkfft), Dmitrii Tolmachev (VkFFT)"]
__license__ = "MIT"
__date__ = "2022/02/14"
# Valid numbering includes 3.1, 3.1.0, 3.1.2, 3.1dev0, 3.1a0, 3.1b0
__version__ = "2022.1.1"


def vkfft_version():
    """
    Get VkFFT version
    :return: version as X.Y.Z
    """
    # We import here as otherwise it would mess with setup.py which reads __version__
    # while the opencl library has not yet been compiled.
    try:
        from .opencl import vkfft_version as cl_vkfft_version
        return cl_vkfft_version()
    except ImportError:
        # On some platforms (e.g.
        # ppc64le) opencl may not be available while cuda is
        try:
            from .cuda import vkfft_version as cu_vkfft_version
            return cu_vkfft_version()
        except ImportError:
            raise ImportError("Neither cuda nor opencl vkfft_version could be imported")
pyvkfft-2022.1.1/pyvkfft.egg-info/0000755000076500000240000000000014202465377017416 5ustar vincentstaff00000000000000
pyvkfft-2022.1.1/pyvkfft.egg-info/PKG-INFO0000644000076500000240000002677614202465377020535 0ustar vincentstaff00000000000000
Metadata-Version: 1.2
Name: pyvkfft
Version: 2022.1.1
Summary: Python wrapper for the CUDA and OpenCL backends of VkFFT, providing GPU FFT for PyCUDA, PyOpenCL and CuPy
Home-page: https://github.com/vincefn/pyvkfft
Author: Vincent Favre-Nicolin
Author-email: favre@esrf.fr
License: UNKNOWN
Project-URL: Bug Tracker, https://github.com/vincefn/pyvkfft/issues
Project-URL: VkFFT project, https://github.com/DTolm/VkFFT
Description: pyvkfft - python interface to the CUDA and OpenCL backends of VkFFT (Vulkan Fast Fourier Transform library)
        ===========================================================================================================
        
        `VkFFT <https://github.com/DTolm/VkFFT>`_ is a GPU-accelerated Fast Fourier Transform library
        for Vulkan/CUDA/HIP/OpenCL.
        
        pyvkfft offers a simple python interface to the **CUDA** and **OpenCL** backends of VkFFT,
        compatible with **pyCUDA**, **CuPy** and **pyOpenCL**.
        
        Installation
        ------------
        
        Install using ``pip install pyvkfft`` (works on macOS, Linux and Windows).
        
        Notes:
        
        - the PyPI package includes ``vkfft.h`` and will automatically install ``pyopencl``
          if opencl is available. However you should manually install either ``cupy`` or ``pycuda``
          to use the cuda backend.
        - if you want to specify the backend to be installed (which can be necessary e.g.
          if you have ``nvcc`` installed but cuda is not actually available), you can do
          that using e.g. ``VKFFT_BACKEND=opencl pip install pyvkfft``. By default the opencl
          backend is always installed, and the cuda one if nvcc is found.
        
        Requirements:
        
        - ``pyopencl`` and the opencl libraries/development tools for the opencl backend
        - ``pycuda`` or ``cupy`` and CUDA development tools (``nvcc``) for the cuda backend
        - ``numpy``
        - on Windows, this requires visual studio (c++ tools) and a cuda toolkit installation,
          with either CUDA_PATH or CUDA_HOME environment variable.
        - *Only when installing from source*: ``vkfft.h`` installed in the usual include
          directories, or in the 'src' directory
        
        This package can be installed from source using ``pip install .``.
        
        *Note:* ``python setup.py install`` is now disabled, to avoid messed up environments
        where both methods have been used.
        
        Examples
        --------
        
        The simplest way to use pyvkfft is to use the ``pyvkfft.fft`` interface, which will
        automatically create the VkFFTApp (the FFT plans) according to the type of GPU
        arrays (pycuda, pyopencl or cupy), and also cache these apps:
        
        .. code-block:: python
        
          import pycuda.autoinit
          import pycuda.gpuarray as cua
          from pyvkfft.fft import fftn
          import numpy as np
        
          d0 = cua.to_gpu(np.random.uniform(0,1,(200,200)).astype(np.complex64))
          # This will compute the fft to a new GPU array
          d1 = fftn(d0)
        
          # An in-place transform can also be done by specifying the destination
          d0 = fftn(d0, d0)
        
          # Or an out-of-place transform to an existing array (the destination array is always returned)
          d1 = fftn(d0, d1)
        
        See the scripts and notebooks in the examples directory.
        An example notebook is also available on google colab.
        Make sure to select a GPU for the runtime.
        
        Features
        --------
        
        - CUDA (using PyCUDA or CuPy) and OpenCL (using PyOpenCL) backends
        - C2C, R2C/C2R for inplace and out-of-place transforms
        - Discrete Cosine Transform (DCT) of type 1, 2, 3 and 4 (EXPERIMENTAL, comparisons with
          scipy DCT transforms are OK, but there are limitations on the array dimensions)
        - single and double precision for all transforms (double precision requires device support)
        - 1D, 2D and 3D transforms.
        - arrays can have more dimensions than the FFT (batch transforms).
        - arbitrary array size, using the Bluestein algorithm for sizes with prime factors >13
          (note that in this case the performance can be significantly lower, up to ~4x,
          depending on the transform size, see example performance plot below)
        - transform along a given list of axes - this requires that after collapsing
          non-transformed axes, the last transformed axis is at most along the 3rd dimension,
          e.g. the following axes are allowed: (-2,-3), (-1,-3), (-1,-4), (-4,-5),...
          but not (-2, -4), (-1, -3, -4) or (-2, -3, -4).
          This is not allowed for R2C transforms.
        - normalisation=0 (no scaling: each transform multiplies the squared L2 norm of the array
          by the array size) and 1 (the backward transform is divided by the array size,
          so FFT*iFFT restores the original array)
        - unit tests for all transforms: see test sub-directory. Note that these take a **long**
          time to finish due to the exhaustive number of sub-tests.
        - Note that out-of-place C2R transform currently destroys the complex array for FFT dimensions >=2
        - tested on macOS (10.13.6), Linux (Debian/Ubuntu, x86-64 and power9),
          and Windows 10 (Anaconda python 3.8 with Visual Studio 2019 and the CUDA toolkit 11.2)
        - GPUs tested: mostly nVidia cards, but also some AMD cards and macOS with M1 GPUs.
        - inplace transforms do not require an extra buffer or work area (as in cuFFT), unless the x
          size is larger than 8192, or if the y and z FFT size are larger than 2048. In that case
          a buffer of a size equal to the array is necessary. This makes larger FFT transforms
          possible based on memory requirements (even for R2C !) compared to cuFFT. For example
          you can compute the 3D FFT for a 1600**3 complex64 array with 32GB of memory.
        - transforms can either be done by creating a VkFFTApp (a.k.a.
          the fft 'plan'), with the selected backend (``pyvkfft.cuda`` for pycuda/cupy or
          ``pyvkfft.opencl`` for pyopencl) or by using the ``pyvkfft.fft`` interface with the
          ``fftn``, ``ifftn``, ``rfftn`` and ``irfftn`` functions which automatically detect the
          type of GPU array and cache the corresponding VkFFTApp (see the example notebook
          pyvkfft-fft.ipynb).
        - the ``pyvkfft-test`` command-line script can test specific transforms against
          expected accuracy values, for all types of transforms.
        - pyvkfft results are now evaluated before any release with a comprehensive test
          suite, comparing transform results for all types of transforms: single and double
          precision, 1D, 2D and 3D, inplace and out-of-place, different norms, radix and
          Bluestein, etc... See ``pyvkfft/scripts/pyvkfft_test_suite.py`` to run the full suite,
          which takes 28 hours on a V100 GPU using up to 20 parallel processes.
        
        Performance
        -----------
        See the benchmark notebook, which can plot OpenCL and CUDA backend throughput, as well as
        compare with cuFFT (using scikit-cuda) and clFFT (using gpyfft).
        
        Example result for batched 2D FFT with array dimensions of batch x N x N using a Titan V:
        
        .. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-TITAN_V-Linux.png
        
        Notes regarding this plot:
        
        * the computed throughput is *theoretical*, as if each transform axis for the
          couple (FFT, iFFT) required exactly one read and one write. This is obviously not true,
          and explains the drop after N=1024 for cuFFT and (to a lesser extent) vkFFT.
        * the batch size is adapted for each N so the transform takes long enough, in practice the
          transformed array is at around 600MB. Transforms on small arrays with small batch sizes
          could produce lower performance, or better performance when fully cached.
        * a number of blue + (CuFFT) are actually performed as radix-N transforms with
          7 [...] 1024 vkFFT is much more efficient than cuFFT due to the smaller number
          of reads and writes per FFT axis (apart from isolated radix-2 3 sizes)
        * the OpenCL and CUDA backends of vkFFT perform similarly, though there are ranges
          where CUDA performs better, due to different cache. [Note that if the card is also used
          for display, then the difference can increase, e.g. for nvidia cards opencl performance
          is more affected when being used for display than the cuda backend]
        * clFFT (via gpyfft) generally performs much worse than the other transforms, though this was
          tested using nVidia cards. (Note that the clFFT/gpyfft benchmark tries all FFT axis
          permutations to find the fastest combination)
        
        Accuracy
        --------
        See the accuracy notebook, which compares the accuracy of different
        FFT libraries (pyvkfft with different options and backend, scikit-cuda (cuFFT), pyfftw),
        using pyfftw long-double precision as a reference.
        
        Example results for 1D transforms (radix 2,3,5 and 7) using a Titan V:
        
        .. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/accuracy-1DFFT-TITAN_V.png
        
        Analysis:
        
        * in single precision on the nVidia Titan V card, the VkFFT error is
          about 3 times larger (worse) than with pyfftw (also computed in single precision),
          e.g. 6e-7 vs 2e-7, which should be negligible for most applications.
          However when using a lookup-table for trigonometric values instead of hardware
          functions (useLUT=1 in VkFFTApp), the accuracy is identical to pyfftw, and
          better than cuFFT.
        * accuracy is the same for cuda and opencl, though this can depend on the card
          and drivers used (e.g. it's different on a GTX 1080)
        
        You can easily test a transform using the ``pyvkfft-test`` command line script, e.g.:
        ``pyvkfft-test --systematic --backend pycuda --nproc 8 --range 2 4500 --radix --ndim 2``
        
        Use ``pyvkfft-test --help`` to list available options.
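The accuracy values discussed above are relative errors measured against a higher-precision reference. A minimal sketch of such metrics follows, assuming a relative L2 norm and a relative L-infinity norm. The helper names `l2_error` and `linf_error` are hypothetical and not part of pyvkfft, and the single-precision result is only emulated here by rounding a NumPy double-precision transform to complex64, so the sketch runs without a GPU:

```python
import numpy as np


def l2_error(ref, test):
    """Relative L2 error of `test` against the reference `ref` (hypothetical helper)."""
    return np.sqrt((np.abs(ref - test) ** 2).sum() / (np.abs(ref) ** 2).sum())


def linf_error(ref, test):
    """Relative L-infinity (maximum) error of `test` against `ref` (hypothetical helper)."""
    return np.abs(ref - test).max() / np.abs(ref).max()


rng = np.random.default_rng(0)
d = rng.uniform(-0.5, 0.5, (64, 64)) + 1j * rng.uniform(-0.5, 0.5, (64, 64))

# Double-precision reference transform
ref = np.fft.fftn(d.astype(np.complex128))
# Stand-in for a single-precision result: the reference rounded to complex64
test = ref.astype(np.complex64)

n2 = l2_error(ref, test)
ni = linf_error(ref, test)
print("L2=%8e Linf=%8e" % (n2, ni))
```

In the unit tests of this package, the L-infinity values of the forward and backward transforms (``ni`` and ``nii``) are each compared against a tolerance, and the L2 values are reported alongside them.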
        
        You can use the ``pyvkfft/scripts/pyvkfft_test_suite.py`` script to run the comprehensive
        test suite which is used to evaluate pyvkfft before a new release.
        
        TODO
        ----
        
        - access to the other backends:
        
          - for vulkan and rocm this only makes sense combined with a pycuda/cupy/pyopencl equivalent.
        - out-of-place C2R transform without modifying the C array ? This would require using a R
          array padded with two columns, as for the inplace transform
        - half precision ?
        - convolution ?
        - zero-padding ?
        - access to tweaking parameters in VkFFTConfiguration ?
        - access to the code of the generated kernels ?
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Environment :: GPU
pyvkfft-2022.1.1/pyvkfft.egg-info/SOURCES.txt0000644000076500000240000000107014202465377021300 0ustar vincentstaff00000000000000
LICENSE
MANIFEST.in
README.rst
setup.py
pyvkfft/__init__.py
pyvkfft/accuracy.py
pyvkfft/base.py
pyvkfft/config.py
pyvkfft/cuda.py
pyvkfft/fft.py
pyvkfft/opencl.py
pyvkfft/version.py
pyvkfft.egg-info/PKG-INFO
pyvkfft.egg-info/SOURCES.txt
pyvkfft.egg-info/dependency_links.txt
pyvkfft.egg-info/entry_points.txt
pyvkfft.egg-info/requires.txt
pyvkfft.egg-info/top_level.txt
pyvkfft/scripts/__init__.py
pyvkfft/scripts/pyvkfft_test.py
pyvkfft/scripts/pyvkfft_test_suite.py
pyvkfft/test/__init__.py
pyvkfft/test/test_fft.py
src/vkFFT.h
src/vkfft_cuda.cu
src/vkfft_opencl.cpp
pyvkfft-2022.1.1/pyvkfft.egg-info/dependency_links.txt0000644000076500000240000000000114202465377023464 0ustar vincentstaff00000000000000

pyvkfft-2022.1.1/pyvkfft.egg-info/entry_points.txt0000644000076500000240000000010414202465377022707 0ustar vincentstaff00000000000000
[console_scripts]
pyvkfft-test = pyvkfft.scripts.pyvkfft_test:main

pyvkfft-2022.1.1/pyvkfft.egg-info/requires.txt0000644000076500000240000000002614202465377022014 0ustar vincentstaff00000000000000
numpy
psutil
pyopencl
pyvkfft-2022.1.1/pyvkfft.egg-info/top_level.txt0000644000076500000240000000001014202465377022137 0ustar vincentstaff00000000000000
pyvkfft
pyvkfft-2022.1.1/setup.cfg0000644000076500000240000000004614202465377016054 0ustar vincentstaff00000000000000
[egg_info]
tag_build =
tag_date = 0

pyvkfft-2022.1.1/setup.py0000644000076500000240000002606514202465263015750 0ustar vincentstaff00000000000000
# The setup used here is derived with bits from:
# - https://github.com/rmcgibbo/npcuda-example

import os
import sys
import platform
from os.path import join as pjoin
import warnings
from setuptools import setup, find_packages
from setuptools.command.sdist import sdist
from distutils.extension import Extension
from distutils import unixccompiler
from setuptools.command.build_ext import build_ext as build_ext_orig
from pyvkfft.version import __version__
from setuptools.command.bdist_egg import bdist_egg


def find_in_path(name, path):
    """Find a file in a search path"""
    # Adapted from http://code.activestate.com/recipes/52224
    for dir in path.split(os.pathsep):
        binpath = pjoin(dir, name)
        if os.path.exists(binpath):
            return os.path.abspath(binpath)
    return None


def locate_cuda():
    """Locate the CUDA environment on the system

    Returns a dict with keys 'home', 'nvcc', 'include_dirs',
    'extra_compile_args' and 'extra_link_args'.

    Starts by looking for the CUDA_PATH (Windows) or CUDAHOME (other platforms)
    env variable. If not found, find 'nvcc' in the PATH.
    """
    if platform.system() == "Windows":
        if 'CUDA_PATH' in os.environ:
            home = os.environ['CUDA_PATH']
            nvcc = pjoin(home, 'bin', 'nvcc.exe')
        else:
            # Otherwise, search the PATH for NVCC
            nvcc = find_in_path('nvcc.exe', os.environ['PATH'])
            if nvcc is None:
                raise EnvironmentError('The nvcc binary could not be '
                                       'located in your $PATH. Either add it to your path, '
                                       'or set $CUDA_PATH')
            home = os.path.dirname(os.path.dirname(nvcc))
        libdir = pjoin(home, 'lib', 'x64')
        extra_compile_args = ['-O3', '--ptxas-options=-v', '-Xcompiler', '/MD']
        extra_link_args = ['-L%s' % libdir]
    else:
        # First check if the CUDAHOME env variable is in use
        if 'CUDAHOME' in os.environ:
            home = os.environ['CUDAHOME']
            nvcc = pjoin(home, 'bin', 'nvcc')
        else:
            # Otherwise, search the PATH for NVCC
            nvcc = find_in_path('nvcc', os.environ['PATH'])
            if nvcc is None:
                raise EnvironmentError('The nvcc binary could not be '
                                       'located in your $PATH. Either add it to your path, '
                                       'or set $CUDAHOME or $CUDA_PATH')
            home = os.path.dirname(os.path.dirname(nvcc))
        if os.path.exists(pjoin(home, 'lib64')):
            libdir = pjoin(home, 'lib64')
        else:
            libdir = pjoin(home, 'lib')
        extra_compile_args = ['-O3', '--ptxas-options=-v', '-std=c++11', '--compiler-options=-fPIC']
        extra_link_args = ['--shared', '-L%s' % libdir]
    cudaconfig = {'home': home, 'nvcc': nvcc,
                  'include_dirs': [pjoin(home, 'include')],
                  'extra_compile_args': extra_compile_args,
                  'extra_link_args': extra_link_args}
    for k in ['home', 'nvcc']:
        if not os.path.exists(cudaconfig[k]):
            # Report the path which was tested ('v' was not defined in this function)
            raise EnvironmentError('The CUDA %s path could not be '
                                   'located in %s' % (k, cudaconfig[k]))
    return cudaconfig


def locate_opencl():
    """
    Get the opencl configuration
    :return: a dict with the libraries, include and library directories,
        and the extra compile and link arguments
    """
    include_dirs = []
    library_dirs = []
    extra_compile_args = ['-std=c++11', '-Wno-format-security']
    extra_link_args = None
    if platform.system() == 'Darwin':
        libraries = None
        extra_link_args = ['-Wl,-framework,OpenCL']
    elif platform.system() == "Windows":
        # Add include & lib dirs if possible from usual nvidia and AMD paths
        for path in ["CUDA_HOME", "CUDAHOME", "CUDA_PATH"]:
            if path in os.environ:
                include_dirs.append(pjoin(os.environ[path], 'include'))
                library_dirs.append(pjoin(os.environ[path], 'lib', 'x64'))
        libraries = ['OpenCL']
        extra_compile_args = None
    else:
        # Linux
        libraries = ['OpenCL']
        extra_link_args = ['--shared']

    opencl_config = {'libraries': libraries,
                     'extra_link_args': extra_link_args,
                     'include_dirs': include_dirs,
                     'library_dirs': library_dirs,
                     'extra_compile_args': extra_compile_args}
    return opencl_config


class build_ext_custom(build_ext_orig):
    """Custom `build_ext` command which will correctly compile and link
    the OpenCL and CUDA modules. The hooks are based on the name of the extension"""

    def build_extension(self, ext):
        if "cuda" in ext.name:
            # Use nvcc for compilation. This assumes all sources are .cu files
            # for this extension.
            default_compiler = self.compiler
            # Create unix compiler patched for cu
            self.compiler = unixccompiler.UnixCCompiler()
            self.compiler.src_extensions.append('.cu')
            tmp = CUDA['nvcc']  # .replace('\\\\','toto')
            self.compiler.set_executable('compiler_so', [tmp])
            self.compiler.set_executable('linker_so', [tmp])
            if platform.system() == "Windows":
                CUDA['extra_link_args'] += ['--shared', '-Xcompiler', '/MD']
                # pythonXX.lib must be in the linker paths
                # Is using sys.prefix\libs always correct ?
                CUDA['extra_link_args'].append('-L%s' % pjoin(sys.prefix, 'libs'))
            super().build_extension(ext)
            # Restore default linker and compiler
            self.compiler = default_compiler
        else:
            super().build_extension(ext)

    def get_export_symbols(self, ext):
        """ Hook to make sure we get the correct symbols for windows"""
        if ("opencl" in ext.name or "cuda" in ext.name) and platform.system() == "Windows":
            return ext.export_symbols
        return super().get_export_symbols(ext)

    def get_ext_filename(self, ext_name):
        """ Hook to make sure we keep the correct name (*.so) for windows"""
        if ("opencl" in ext_name or "cuda" in ext_name) and platform.system() == "Windows":
            return ext_name + '.so'
        return super().get_ext_filename(ext_name)


class sdist_vkfft(sdist):
    """
    Sdist overloaded to get vkfft header, readme and license from VkFFT's git
    """

    def run(self):
        # Get the latest vkFFT.h from github
        os.system('curl -L https://raw.githubusercontent.com/DTolm/VkFFT/master/vkFFT/vkFFT.h -o src/vkFFT.h')
        os.system('curl -L https://raw.githubusercontent.com/DTolm/VkFFT/master/LICENSE -o LICENSE_VkFFT')
        os.system('curl -L https://raw.githubusercontent.com/DTolm/VkFFT/master/README.md -o README_VkFFT.md')
        super(sdist_vkfft, self).run()


ext_modules = []
install_requires = ['numpy', 'psutil']
exclude_packages = ['examples']
CUDA = None
OPENCL = None

for k, v in os.environ.items():
    if "VKFFT_BACKEND" in k:
        # Kludge to manually select vkfft backends. useful e.g. if nvidia tools
        # are installed but not functional
        # e.g. use:
        #   VKFFT_BACKEND=cuda,opencl pip install .
        #   VKFFT_BACKEND=opencl pip install pyvkfft
        if 'opencl' not in v.lower():
            exclude_packages.append('opencl')
        if 'cuda' not in v.lower():
            exclude_packages.append('cuda')

if 'cuda' not in exclude_packages:
    try:
        CUDA = locate_cuda()
        vkfft_cuda_ext = Extension('pyvkfft._vkfft_cuda',
                                   sources=['src/vkfft_cuda.cu'],
                                   libraries=['nvrtc', 'cuda'],
                                   extra_compile_args=CUDA['extra_compile_args'],
                                   include_dirs=CUDA['include_dirs'] + ['src'],
                                   extra_link_args=CUDA['extra_link_args'],
                                   depends=['vkFFT.h']
                                   )
        ext_modules.append(vkfft_cuda_ext)
        # install_requires.append("pycuda")
        try:
            import pycuda

            has_pycuda = True
        except ImportError:
            has_pycuda = False
        try:
            import cupy
        except ImportError:
            if has_pycuda is False:
                print("Reminder: you need to install either PyCUDA or CuPy to use pyvkfft.cuda")
    except Exception:
        # Catch Exception rather than using a bare except, so that
        # KeyboardInterrupt and SystemExit are not swallowed
        exclude_packages.append('cuda')
        warnings.warn("CUDA not available ($CUDAHOME/$CUDA_PATH variables missing "
                      "and nvcc not in path). "
                      "Skipping pyvkfft.cuda module installation.", UserWarning)

if 'opencl' not in exclude_packages:
    OPENCL = locate_opencl()
    install_requires.append('pyopencl')
    # OpenCL extension
    vkfft_opencl_ext = Extension('pyvkfft._vkfft_opencl',
                                 sources=['src/vkfft_opencl.cpp'],
                                 extra_compile_args=OPENCL['extra_compile_args'],
                                 include_dirs=OPENCL['include_dirs'] + ['src'],
                                 libraries=OPENCL['libraries'],
                                 library_dirs=OPENCL['library_dirs'],
                                 extra_link_args=OPENCL['extra_link_args'],
                                 depends=['vkFFT.h']
                                 )

    ext_modules.append(vkfft_opencl_ext)

with open("README.rst", "r", encoding="utf-8") as fh:
    long_description = fh.read()


class bdist_egg_disabled(bdist_egg):
    """ Disabled bdist_egg, to prevent use of 'python setup.py install' """

    def run(self):
        sys.exit("Aborting building of eggs. Please use `pip install .` to install from source.")


# Console scripts, available e.g. as 'pyvkfft-test'
scripts = ['pyvkfft/scripts/pyvkfft_test.py']

console_scripts = []
for s in scripts:
    s1 = os.path.splitext(os.path.split(s)[1])[0]
    s0 = os.path.splitext(s)[0]
    console_scripts.append("%s = %s:main" % (s1.replace('_', '-'), s0.replace('/', '.')))

setup(name="pyvkfft",
      version=__version__,
      description="Python wrapper for the CUDA and OpenCL backends of VkFFT, "
                  "providing GPU FFT for PyCUDA, PyOpenCL and CuPy",
      long_description=long_description,
      ext_modules=ext_modules,
      packages=find_packages(exclude=exclude_packages),
      include_package_data=True,
      author="Vincent Favre-Nicolin",
      author_email="favre@esrf.fr",
      url="https://github.com/vincefn/pyvkfft",
      project_urls={
          "Bug Tracker": "https://github.com/vincefn/pyvkfft/issues",
          "VkFFT project": "https://github.com/DTolm/VkFFT",
      },
      classifiers=[
          "Programming Language :: Python :: 3",
          "License :: OSI Approved :: Mozilla Public License 2.0 (MPL 2.0)",
          "Operating System :: OS Independent",
          "Environment :: GPU",
      ],
      cmdclass={'build_ext': build_ext_custom, 'sdist_vkfft': sdist_vkfft,
                'bdist_egg': bdist_egg if 'bdist_egg' in sys.argv else bdist_egg_disabled
                },
      install_requires=install_requires,
      test_suite="test",
      entry_points={'console_scripts': console_scripts},
      )
pyvkfft-2022.1.1/src/
pyvkfft-2022.1.1/src/vkFFT.h
// This file is part of VkFFT, a Vulkan Fast Fourier Transform library
//
// Copyright (C) 2020 - present Dmitrii Tolmachev
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// THE SOFTWARE.
#ifndef VKFFT_H
#define VKFFT_H
#include <memory.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <locale.h>
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
#if(VKFFT_BACKEND==0)
#include "vulkan/vulkan.h"
#include "glslang_c_interface.h"
#elif(VKFFT_BACKEND==1)
#include <nvrtc.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <cuComplex.h>
#elif(VKFFT_BACKEND==2)
#include <hip/hiprtc.h>
#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>
#include <hip/hip_complex.h>
#elif(VKFFT_BACKEND==3)
#ifndef CL_USE_DEPRECATED_OPENCL_1_2_APIS
#define CL_USE_DEPRECATED_OPENCL_1_2_APIS
#endif
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
#endif
typedef struct {
	//WHDCN layout
	//required parameters:
	uint64_t FFTdim; //FFT dimensionality (1, 2 or 3)
	uint64_t size[3]; // WHD -system dimensions
#if(VKFFT_BACKEND==0)
	VkPhysicalDevice* physicalDevice;//pointer to Vulkan physical device, obtained from vkEnumeratePhysicalDevices
	VkDevice* device;//pointer to Vulkan device, created with vkCreateDevice
	VkQueue* queue;//pointer to Vulkan queue, created with vkGetDeviceQueue
	VkCommandPool* commandPool;//pointer to Vulkan command pool, created with vkCreateCommandPool
	VkFence* fence;//pointer to Vulkan fence, created with vkCreateFence
	uint64_t isCompilerInitialized;//specify if glslang compiler has been initialized before (0 - off, 1 - on). Default 0
#elif(VKFFT_BACKEND==1)
	CUdevice* device;//pointer to CUDA device, obtained from cuDeviceGet
	//CUcontext* context;//pointer to CUDA context, obtained from cuDeviceGet
	cudaStream_t* stream;//pointer to streams (can be more than 1), where to execute the kernels
	uint64_t num_streams;//try to submit CUDA kernels in multiple streams for asynchronous execution. Default 1
#elif(VKFFT_BACKEND==2)
	hipDevice_t* device;//pointer to HIP device, obtained from hipDeviceGet
	//hipCtx_t* context;//pointer to HIP context, obtained from hipDeviceGet
	hipStream_t* stream;//pointer to streams (can be more than 1), where to execute the kernels
	uint64_t num_streams;//try to submit HIP kernels in multiple streams for asynchronous execution.
Default 1 #elif(VKFFT_BACKEND==3) cl_platform_id* platform; cl_device_id* device; cl_context* context; #endif //data parameters: uint64_t userTempBuffer; //buffer allocated by app automatically if needed to reorder Four step algorithm. Setting to non zero value enables manual user allocation (0 - off, 1 - on) uint64_t bufferNum;//multiple buffer sequence storage is Vulkan only. Default 1 uint64_t tempBufferNum;//multiple buffer sequence storage is Vulkan only. Default 1, buffer allocated by app automatically if needed to reorder Four step algorithm. Setting to non zero value enables manual user allocation uint64_t inputBufferNum;//multiple buffer sequence storage is Vulkan only. Default 1, if isInputFormatted is enabled uint64_t outputBufferNum;//multiple buffer sequence storage is Vulkan only. Default 1, if isOutputFormatted is enabled uint64_t kernelNum;//multiple buffer sequence storage is Vulkan only. Default 1, if performConvolution is enabled //sizes are obligatory in Vulkan backend, optional in others uint64_t* bufferSize;//array of buffers sizes in bytes uint64_t* tempBufferSize;//array of temp buffers sizes in bytes. Default set to bufferSize sum, buffer allocated by app automatically if needed to reorder Four step algorithm. Setting to non zero value enables manual user allocation uint64_t* inputBufferSize;//array of input buffers sizes in bytes, if isInputFormatted is enabled uint64_t* outputBufferSize;//array of output buffers sizes in bytes, if isOutputFormatted is enabled uint64_t* kernelSize;//array of kernel buffers sizes in bytes, if performConvolution is enabled #if(VKFFT_BACKEND==0) VkBuffer* buffer;//pointer to array of buffers (or one buffer) used for computations VkBuffer* tempBuffer;//needed if reorderFourStep is enabled to transpose the array. Same sum size or bigger as buffer (can be split in multiple). Default 0. 
Setting to non zero value enables manual user allocation
	VkBuffer* inputBuffer;//pointer to array of input buffers (or one buffer) used to read data from if isInputFormatted is enabled
	VkBuffer* outputBuffer;//pointer to array of output buffers (or one buffer) used to write data to if isOutputFormatted is enabled
	VkBuffer* kernel;//pointer to array of kernel buffers (or one buffer) used to read kernel data from if performConvolution is enabled
#elif(VKFFT_BACKEND==1)
	void** buffer;//pointer to device buffer used for computations
	void** tempBuffer;//needed if reorderFourStep is enabled to transpose the array. Same size as buffer. Default 0. Setting to non zero value enables manual user allocation
	void** inputBuffer;//pointer to device buffer used to read data from if isInputFormatted is enabled
	void** outputBuffer;//pointer to device buffer used to write data to if isOutputFormatted is enabled
	void** kernel;//pointer to device buffer used to read kernel data from if performConvolution is enabled
#elif(VKFFT_BACKEND==2)
	void** buffer;//pointer to device buffer used for computations
	void** tempBuffer;//needed if reorderFourStep is enabled to transpose the array. Same size as buffer. Default 0. Setting to non zero value enables manual user allocation
	void** inputBuffer;//pointer to device buffer used to read data from if isInputFormatted is enabled
	void** outputBuffer;//pointer to device buffer used to write data to if isOutputFormatted is enabled
	void** kernel;//pointer to device buffer used to read kernel data from if performConvolution is enabled
#elif(VKFFT_BACKEND==3)
	cl_mem* buffer;//pointer to device buffer used for computations
	cl_mem* tempBuffer;//needed if reorderFourStep is enabled to transpose the array. Same size as buffer. Default 0.
Setting to non zero value enables manual user allocation
	cl_mem* inputBuffer;//pointer to device buffer used to read data from if isInputFormatted is enabled
	cl_mem* outputBuffer;//pointer to device buffer used to write data to if isOutputFormatted is enabled
	cl_mem* kernel;//pointer to device buffer used to read kernel data from if performConvolution is enabled
#endif
	uint64_t bufferOffset;//specify if VkFFT has to offset the first element position inside the buffer. In bytes. Default 0
	uint64_t tempBufferOffset;//specify if VkFFT has to offset the first element position inside the temp buffer. In bytes. Default 0
	uint64_t inputBufferOffset;//specify if VkFFT has to offset the first element position inside the input buffer. In bytes. Default 0
	uint64_t outputBufferOffset;//specify if VkFFT has to offset the first element position inside the output buffer. In bytes. Default 0
	uint64_t kernelOffset;//specify if VkFFT has to offset the first element position inside the kernel. In bytes. Default 0
	uint64_t specifyOffsetsAtLaunch;//specify if offsets will be selected with launch parameters VkFFTLaunchParams (0 - off, 1 - on). Default 0
	//optional: (default 0 if not stated otherwise)
	uint64_t coalescedMemory;//in bytes, for Nvidia and AMD is equal to 32, Intel is equal 64, scaled for half precision. Works regardless, but if specified correctly by the user, the performance will be higher.
	uint64_t aimThreads;//aim at this many threads per block. Default 128
	uint64_t numSharedBanks;//how many banks shared memory has. Default 32
	uint64_t inverseReturnToInputBuffer;//return data to the input buffer in inverse transform (0 - off, 1 - on). isInputFormatted must be enabled
	uint64_t numberBatches;// N - used to perform multiple batches of initial data. Default 1
	uint64_t useUint64;//use 64-bit addressing mode in generated kernels
	uint64_t omitDimension[3];//disable FFT for this dimension (0 - FFT enabled, 1 - FFT disabled). Default 0. Doesn't work for R2C dimension 0 for now.
Doesn't work with convolutions.
	uint64_t fixMaxRadixBluestein;//controls the padding of sequences in Bluestein convolution. If specified, padded sequence will be made of up to fixMaxRadixBluestein primes. Default: 2 for CUDA and Vulkan/OpenCL/HIP up to 1048576 combined dimension FFT system, 7 for Vulkan/OpenCL/HIP beyond that. Min = 2, Max = 13.
	uint64_t performBandwidthBoost;//try to reduce coalesced number by a factor of X to get bigger sequence in one upload for strided axes. Default: -1 for DCT, 2 for Bluestein's algorithm (or -1 if DCT), 0 otherwise
	uint64_t doublePrecision; //perform calculations in double precision (0 - off, 1 - on).
	uint64_t halfPrecision; //perform calculations in half precision (0 - off, 1 - on)
	uint64_t halfPrecisionMemoryOnly; //use half precision only as input/output buffer. Input/Output have to be allocated as half, buffer/tempBuffer have to be allocated as float (out of place mode only). Specify isInputFormatted and isOutputFormatted to use (0 - off, 1 - on)
	uint64_t doublePrecisionFloatMemory; //use FP64 precision for all calculations, while all memory storage is done in FP32.
	uint64_t performR2C; //perform R2C/C2R decomposition (0 - off, 1 - on)
	uint64_t performDCT; //perform DCT transformation (X - DCT type, 1-4)
	uint64_t disableMergeSequencesR2C; //disable merging of two real sequences to reduce calculations (0 - off, 1 - on)
	uint64_t normalize; //normalize inverse transform (0 - off, 1 - on)
	uint64_t disableReorderFourStep; // disables unshuffling of Four step algorithm. Requires tempbuffer allocation (0 - off, 1 - on)
	uint64_t useLUT; //switches from calculating sincos to using precomputed LUT tables (0 - off, 1 - on).
Configured by initialization routine
	uint64_t makeForwardPlanOnly; //generate code only for forward FFT (0 - off, 1 - on)
	uint64_t makeInversePlanOnly; //generate code only for inverse FFT (0 - off, 1 - on)
	uint64_t bufferStride[3];//buffer strides - default set to x - x*y - x*y*z values
	uint64_t isInputFormatted; //specify if input buffer is padded - 0 - padded, 1 - not padded. For example if it is not padded for R2C if out-of-place mode is selected (only if numberBatches==1 and numberKernels==1)
	uint64_t isOutputFormatted; //specify if output buffer is padded - 0 - padded, 1 - not padded. For example if it is not padded for R2C if out-of-place mode is selected (only if numberBatches==1 and numberKernels==1)
	uint64_t inputBufferStride[3];//input buffer strides. Used if isInputFormatted is enabled. Default set to bufferStride values
	uint64_t outputBufferStride[3];//output buffer strides. Used if isOutputFormatted is enabled. Default set to bufferStride values
	uint64_t considerAllAxesStrided;//will create plan for nonstrided axis similar as a strided axis - used with disableReorderFourStep to get the same layout for Bluestein kernel (0 - off, 1 - on)
	uint64_t keepShaderCode;//will keep shader code and print all executed shaders during the plan execution in order (0 - off, 1 - on)
	uint64_t printMemoryLayout;//will print order of buffers used in shaders (0 - off, 1 - on)
	uint64_t saveApplicationToString;//will save all compiled binaries to VkFFTApplication.saveApplicationString (will be allocated by VkFFT, deallocated with deleteVkFFT call). (0 - off, 1 - on)
	uint64_t loadApplicationFromString;//will load all binaries from loadApplicationString instead of recompiling them (must be allocated by user, must contain what saveApplicationToString call generated previously in VkFFTApplication.saveApplicationString). (0 - off, 1 - on).
Mutually exclusive with saveApplicationToString void* loadApplicationString;//memory array (uint32_t* for Vulkan, char* for CUDA/HIP/OpenCL) through which user can load VkFFT binaries, must be provided by user if loadApplicationFromString = 1. //optional zero padding control parameters: (default 0 if not stated otherwise) uint64_t performZeropadding[3]; // don't read some data/perform computations if some input sequences are zeropadded for each axis (0 - off, 1 - on) uint64_t fft_zeropad_left[3];//specify start boundary of zero block in the system for each axis uint64_t fft_zeropad_right[3];//specify end boundary of zero block in the system for each axis uint64_t frequencyZeroPadding; //set to 1 if zeropadding of frequency domain, default 0 - spatial zeropadding //optional convolution control parameters: (default 0 if not stated otherwise) uint64_t performConvolution; //perform convolution in this application (0 - off, 1 - on). Disables reorderFourStep parameter uint64_t conjugateConvolution;//0 off, 1 - conjugation of the sequence FFT is currently done on, 2 - conjugation of the convolution kernel uint64_t crossPowerSpectrumNormalization;//normalize the FFT x kernel multiplication in frequency domain uint64_t coordinateFeatures; // C - coordinate, or dimension of features vector. In matrix convolution - size of vector uint64_t matrixConvolution; //if equal to 2 perform 2x2, if equal to 3 perform 3x3 matrix-vector convolution. Overrides coordinateFeatures uint64_t symmetricKernel; //specify if kernel in 2x2 or 3x3 matrix convolution is symmetric uint64_t numberKernels;// N - only used in convolution step - specify how many kernels were initialized before. Expands one input to multiple (batched) output uint64_t kernelConvolution;// specify if this application is used to create kernel for convolution, so it has the same properties. 
performConvolution has to be set to 0 for kernel creation
	//register overutilization (experimental): (default 0 if not stated otherwise)
	uint64_t registerBoost; //specify if register file size is bigger than shared memory and can be used to extend it X times (on Nvidia 256KB register file can be used instead of 32KB of shared memory, set this constant to 4 to emulate 128KB of shared memory). Default 1
	uint64_t registerBoostNonPow2; //specify if register overutilization should be used on non power of 2 sequences (0 - off, 1 - on)
	uint64_t registerBoost4Step; //specify if register file overutilization should be used in big sequences (>2^14), same definition as registerBoost. Default 1
	//not used techniques:
	uint64_t swapTo3Stage4Step; //specify at which power of 2 to switch from 2 upload to 3 upload 4-step FFT, in case if making max sequence size lower than coalesced sequence helps to combat TLB misses. Default 0 - disabled. Must be at least 17
	uint64_t devicePageSize;//in KB, the size of a page on the GPU. Setting to 0 disables local buffer split in pages
	uint64_t localPageSize;//in KB, the size to split page into if sequence spans multiple devicePageSize pages
	//automatically filled based on device info (still can be reconfigured by user):
	uint64_t maxComputeWorkGroupCount[3]; // maxComputeWorkGroupCount from VkPhysicalDeviceLimits
	uint64_t maxComputeWorkGroupSize[3]; // maxComputeWorkGroupSize from VkPhysicalDeviceLimits
	uint64_t maxThreadsNum; //max number of threads from VkPhysicalDeviceLimits
	uint64_t sharedMemorySizeStatic; //available for static allocation shared memory size, in bytes
	uint64_t sharedMemorySize; //available for allocation shared memory size, in bytes
	uint64_t sharedMemorySizePow2; //power of 2 which is less or equal to sharedMemorySize, in bytes
	uint64_t warpSize; //number of threads per warp/wavefront.
uint64_t halfThreads;//Intel fix uint64_t allocateTempBuffer; //buffer allocated by app automatically if needed to reorder Four step algorithm. Parameter to check if it has been allocated uint64_t reorderFourStep; // unshuffle Four step algorithm. Requires tempbuffer allocation (0 - off, 1 - on). Default 1. int64_t maxCodeLength; //specify how big can be buffer used for code generation (in char). Default 4000000 chars. int64_t maxTempLength; //specify how big can be buffer used for intermediate string sprintfs be (in char). Default 5000 chars. If code segfaults for some reason - try increasing this number. #if(VKFFT_BACKEND==0) VkDeviceMemory tempBufferDeviceMemory;//Filled at app creation VkCommandBuffer* commandBuffer;//Filled at app execution VkMemoryBarrier* memory_barrier;//Filled at app creation #elif(VKFFT_BACKEND==1) cudaEvent_t* stream_event;//Filled at app creation uint64_t streamCounter;//Filled at app creation uint64_t streamID;//Filled at app creation #elif(VKFFT_BACKEND==2) hipEvent_t* stream_event;//Filled at app creation uint64_t streamCounter;//Filled at app creation uint64_t streamID;//Filled at app creation #elif(VKFFT_BACKEND==3) cl_command_queue* commandQueue; #endif } VkFFTConfiguration;//parameters specified at plan creation typedef struct { #if(VKFFT_BACKEND==0) VkCommandBuffer* commandBuffer;//commandBuffer to which FFT is appended VkBuffer* buffer;//pointer to array of buffers (or one buffer) used for computations VkBuffer* tempBuffer;//needed if reorderFourStep is enabled to transpose the array. Same sum size or bigger as buffer (can be split in multiple). Default 0. 
Setting to non zero value enables manual user allocation
	VkBuffer* inputBuffer;//pointer to array of input buffers (or one buffer) used to read data from if isInputFormatted is enabled
	VkBuffer* outputBuffer;//pointer to array of output buffers (or one buffer) used to write data to if isOutputFormatted is enabled
	VkBuffer* kernel;//pointer to array of kernel buffers (or one buffer) used to read kernel data from if performConvolution is enabled
#elif(VKFFT_BACKEND==1)
	void** buffer;//pointer to device buffer used for computations
	void** tempBuffer;//needed if reorderFourStep is enabled to transpose the array. Same size as buffer. Default 0. Setting to non zero value enables manual user allocation
	void** inputBuffer;//pointer to device buffer used to read data from if isInputFormatted is enabled
	void** outputBuffer;//pointer to device buffer used to write data to if isOutputFormatted is enabled
	void** kernel;//pointer to device buffer used to read kernel data from if performConvolution is enabled
#elif(VKFFT_BACKEND==2)
	void** buffer;//pointer to device buffer used for computations
	void** tempBuffer;//needed if reorderFourStep is enabled to transpose the array. Same size as buffer. Default 0. Setting to non zero value enables manual user allocation
	void** inputBuffer;//pointer to device buffer used to read data from if isInputFormatted is enabled
	void** outputBuffer;//pointer to device buffer used to write data to if isOutputFormatted is enabled
	void** kernel;//pointer to device buffer used to read kernel data from if performConvolution is enabled
#elif(VKFFT_BACKEND==3)
	cl_command_queue* commandQueue;//command queue to which FFT is appended
	cl_mem* buffer;//pointer to device buffer used for computations
	cl_mem* tempBuffer;//needed if reorderFourStep is enabled to transpose the array. Same size as buffer. Default 0.
Setting to non zero value enables manual user allocation
	cl_mem* inputBuffer;//pointer to device buffer used to read data from if isInputFormatted is enabled
	cl_mem* outputBuffer;//pointer to device buffer used to write data to if isOutputFormatted is enabled
	cl_mem* kernel;//pointer to device buffer used to read kernel data from if performConvolution is enabled
#endif
	//following parameters can be specified during kernels launch, if specifyOffsetsAtLaunch parameter was enabled during the initializeVkFFT call
	uint64_t bufferOffset;//specify if VkFFT has to offset the first element position inside the buffer. In bytes. Default 0
	uint64_t tempBufferOffset;//specify if VkFFT has to offset the first element position inside the temp buffer. In bytes. Default 0
	uint64_t inputBufferOffset;//specify if VkFFT has to offset the first element position inside the input buffer. In bytes. Default 0
	uint64_t outputBufferOffset;//specify if VkFFT has to offset the first element position inside the output buffer. In bytes. Default 0
	uint64_t kernelOffset;//specify if VkFFT has to offset the first element position inside the kernel. In bytes.
Default 0 } VkFFTLaunchParams;//parameters specified at plan execution typedef enum VkFFTResult { VKFFT_SUCCESS = 0, VKFFT_ERROR_MALLOC_FAILED = 1, VKFFT_ERROR_INSUFFICIENT_CODE_BUFFER = 2, VKFFT_ERROR_INSUFFICIENT_TEMP_BUFFER = 3, VKFFT_ERROR_PLAN_NOT_INITIALIZED = 4, VKFFT_ERROR_NULL_TEMP_PASSED = 5, VKFFT_ERROR_INVALID_PHYSICAL_DEVICE = 1001, VKFFT_ERROR_INVALID_DEVICE = 1002, VKFFT_ERROR_INVALID_QUEUE = 1003, VKFFT_ERROR_INVALID_COMMAND_POOL = 1004, VKFFT_ERROR_INVALID_FENCE = 1005, VKFFT_ERROR_ONLY_FORWARD_FFT_INITIALIZED = 1006, VKFFT_ERROR_ONLY_INVERSE_FFT_INITIALIZED = 1007, VKFFT_ERROR_INVALID_CONTEXT = 1008, VKFFT_ERROR_INVALID_PLATFORM = 1009, VKFFT_ERROR_ENABLED_saveApplicationToString = 1010, VKFFT_ERROR_EMPTY_FFTdim = 2001, VKFFT_ERROR_EMPTY_size = 2002, VKFFT_ERROR_EMPTY_bufferSize = 2003, VKFFT_ERROR_EMPTY_buffer = 2004, VKFFT_ERROR_EMPTY_tempBufferSize = 2005, VKFFT_ERROR_EMPTY_tempBuffer = 2006, VKFFT_ERROR_EMPTY_inputBufferSize = 2007, VKFFT_ERROR_EMPTY_inputBuffer = 2008, VKFFT_ERROR_EMPTY_outputBufferSize = 2009, VKFFT_ERROR_EMPTY_outputBuffer = 2010, VKFFT_ERROR_EMPTY_kernelSize = 2011, VKFFT_ERROR_EMPTY_kernel = 2012, VKFFT_ERROR_EMPTY_applicationString = 2013, VKFFT_ERROR_UNSUPPORTED_RADIX = 3001, VKFFT_ERROR_UNSUPPORTED_FFT_LENGTH = 3002, VKFFT_ERROR_UNSUPPORTED_FFT_LENGTH_R2C = 3003, VKFFT_ERROR_UNSUPPORTED_FFT_LENGTH_DCT = 3004, VKFFT_ERROR_UNSUPPORTED_FFT_OMIT = 3005, VKFFT_ERROR_FAILED_TO_ALLOCATE = 4001, VKFFT_ERROR_FAILED_TO_MAP_MEMORY = 4002, VKFFT_ERROR_FAILED_TO_ALLOCATE_COMMAND_BUFFERS = 4003, VKFFT_ERROR_FAILED_TO_BEGIN_COMMAND_BUFFER = 4004, VKFFT_ERROR_FAILED_TO_END_COMMAND_BUFFER = 4005, VKFFT_ERROR_FAILED_TO_SUBMIT_QUEUE = 4006, VKFFT_ERROR_FAILED_TO_WAIT_FOR_FENCES = 4007, VKFFT_ERROR_FAILED_TO_RESET_FENCES = 4008, VKFFT_ERROR_FAILED_TO_CREATE_DESCRIPTOR_POOL = 4009, VKFFT_ERROR_FAILED_TO_CREATE_DESCRIPTOR_SET_LAYOUT = 4010, VKFFT_ERROR_FAILED_TO_ALLOCATE_DESCRIPTOR_SETS = 4011, VKFFT_ERROR_FAILED_TO_CREATE_PIPELINE_LAYOUT = 
4012, VKFFT_ERROR_FAILED_SHADER_PREPROCESS = 4013, VKFFT_ERROR_FAILED_SHADER_PARSE = 4014, VKFFT_ERROR_FAILED_SHADER_LINK = 4015, VKFFT_ERROR_FAILED_SPIRV_GENERATE = 4016, VKFFT_ERROR_FAILED_TO_CREATE_SHADER_MODULE = 4017, VKFFT_ERROR_FAILED_TO_CREATE_INSTANCE = 4018, VKFFT_ERROR_FAILED_TO_SETUP_DEBUG_MESSENGER = 4019, VKFFT_ERROR_FAILED_TO_FIND_PHYSICAL_DEVICE = 4020, VKFFT_ERROR_FAILED_TO_CREATE_DEVICE = 4021, VKFFT_ERROR_FAILED_TO_CREATE_FENCE = 4022, VKFFT_ERROR_FAILED_TO_CREATE_COMMAND_POOL = 4023, VKFFT_ERROR_FAILED_TO_CREATE_BUFFER = 4024, VKFFT_ERROR_FAILED_TO_ALLOCATE_MEMORY = 4025, VKFFT_ERROR_FAILED_TO_BIND_BUFFER_MEMORY = 4026, VKFFT_ERROR_FAILED_TO_FIND_MEMORY = 4027, VKFFT_ERROR_FAILED_TO_SYNCHRONIZE = 4028, VKFFT_ERROR_FAILED_TO_COPY = 4029, VKFFT_ERROR_FAILED_TO_CREATE_PROGRAM = 4030, VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM = 4031, VKFFT_ERROR_FAILED_TO_GET_CODE_SIZE = 4032, VKFFT_ERROR_FAILED_TO_GET_CODE = 4033, VKFFT_ERROR_FAILED_TO_DESTROY_PROGRAM = 4034, VKFFT_ERROR_FAILED_TO_LOAD_MODULE = 4035, VKFFT_ERROR_FAILED_TO_GET_FUNCTION = 4036, VKFFT_ERROR_FAILED_TO_SET_DYNAMIC_SHARED_MEMORY = 4037, VKFFT_ERROR_FAILED_TO_MODULE_GET_GLOBAL = 4038, VKFFT_ERROR_FAILED_TO_LAUNCH_KERNEL = 4039, VKFFT_ERROR_FAILED_TO_EVENT_RECORD = 4040, VKFFT_ERROR_FAILED_TO_ADD_NAME_EXPRESSION = 4041, VKFFT_ERROR_FAILED_TO_INITIALIZE = 4042, VKFFT_ERROR_FAILED_TO_SET_DEVICE_ID = 4043, VKFFT_ERROR_FAILED_TO_GET_DEVICE = 4044, VKFFT_ERROR_FAILED_TO_CREATE_CONTEXT = 4045, VKFFT_ERROR_FAILED_TO_CREATE_PIPELINE = 4046, VKFFT_ERROR_FAILED_TO_SET_KERNEL_ARG = 4047, VKFFT_ERROR_FAILED_TO_CREATE_COMMAND_QUEUE = 4048, VKFFT_ERROR_FAILED_TO_RELEASE_COMMAND_QUEUE = 4049, VKFFT_ERROR_FAILED_TO_ENUMERATE_DEVICES = 4050, VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE = 4051, VKFFT_ERROR_FAILED_TO_CREATE_EVENT = 4052 } VkFFTResult; typedef struct { uint64_t size[3]; uint64_t localSize[3]; uint64_t sourceFFTSize; uint64_t fftDim; uint64_t inverse; uint64_t actualInverse; uint64_t inverseBluestein; 
uint64_t zeropad[2]; uint64_t zeropadBluestein[2]; uint64_t axis_id; uint64_t axis_upload_id; uint64_t numAxisUploads; uint64_t registers_per_thread; uint64_t registers_per_thread_per_radix[14]; uint64_t min_registers_per_thread; uint64_t readToRegisters; uint64_t writeFromRegisters; uint64_t LUT; uint64_t useBluesteinFFT; uint64_t reverseBluesteinMultiUpload; uint64_t BluesteinConvolutionStep; uint64_t BluesteinPreMultiplication; uint64_t BluesteinPostMultiplication; uint64_t startDCT3LUT; uint64_t startDCT4LUT; uint64_t performR2C; uint64_t performR2CmultiUpload; uint64_t performDCT; uint64_t performBandwidthBoost; uint64_t frequencyZeropadding; uint64_t performZeropaddingFull[3]; // don't do read/write if full sequence is omitted uint64_t performZeropaddingInput[3]; // don't read if input is zeropadded (0 - off, 1 - on) uint64_t performZeropaddingOutput[3]; // don't write if output is zeropadded (0 - off, 1 - on) uint64_t fft_zeropad_left_full[3]; uint64_t fft_zeropad_left_read[3]; uint64_t fft_zeropad_left_write[3]; uint64_t fft_zeropad_right_full[3]; uint64_t fft_zeropad_right_read[3]; uint64_t fft_zeropad_right_write[3]; uint64_t fft_zeropad_Bluestein_left_read[3]; uint64_t fft_zeropad_Bluestein_left_write[3]; uint64_t fft_zeropad_Bluestein_right_read[3]; uint64_t fft_zeropad_Bluestein_right_write[3]; uint64_t inputStride[5]; uint64_t outputStride[5]; uint64_t fft_dim_full; uint64_t stageStartSize; uint64_t firstStageStartSize; uint64_t fft_dim_x; uint64_t dispatchZactualFFTSize; uint64_t numStages; uint64_t stageRadix[20]; uint64_t inputOffset; uint64_t kernelOffset; uint64_t outputOffset; uint64_t reorderFourStep; uint64_t pushConstantsStructSize; uint64_t performWorkGroupShift[3]; uint64_t performPostCompilationInputOffset; uint64_t performPostCompilationOutputOffset; uint64_t performPostCompilationKernelOffset; uint64_t inputBufferBlockNum; uint64_t inputBufferBlockSize; uint64_t outputBufferBlockNum; uint64_t outputBufferBlockSize; uint64_t 
kernelBlockNum; uint64_t kernelBlockSize; uint64_t numCoordinates; uint64_t matrixConvolution; //if equal to 2 perform 2x2, if equal to 3 perform 3x3 matrix-vector convolution. Overrides coordinateFeatures uint64_t numBatches; uint64_t numKernels; uint64_t conjugateConvolution; uint64_t crossPowerSpectrumNormalization; uint64_t usedSharedMemory; uint64_t sharedMemSize; uint64_t sharedMemSizePow2; uint64_t normalize; uint64_t complexSize; uint64_t inputNumberByteSize; uint64_t outputNumberByteSize; uint64_t kernelNumberByteSize; uint64_t maxStageSumLUT; uint64_t unroll; uint64_t convolutionStep; uint64_t symmetricKernel; uint64_t supportAxis; uint64_t cacheShuffle; uint64_t registerBoost; uint64_t warpSize; uint64_t numSharedBanks; uint64_t resolveBankConflictFirstStages; uint64_t sharedStrideBankConflictFirstStages; uint64_t sharedStrideReadWriteConflict; uint64_t maxSharedStride; uint64_t axisSwapped; uint64_t mergeSequencesR2C; uint64_t numBuffersBound[6]; uint64_t convolutionBindingID; uint64_t LUTBindingID; uint64_t BluesteinConvolutionBindingID; uint64_t BluesteinMultiplicationBindingID; uint64_t performOffsetUpdate; uint64_t performBufferSetUpdate; uint64_t useUint64; char** regIDs; char* disableThreadsStart; char* disableThreadsEnd; char sdataID[50]; char inoutID[50]; char combinedID[50]; char gl_LocalInvocationID_x[50]; char gl_LocalInvocationID_y[50]; char gl_LocalInvocationID_z[50]; char gl_GlobalInvocationID_x[200]; char gl_GlobalInvocationID_y[200]; char gl_GlobalInvocationID_z[200]; char tshuffle[50]; char sharedStride[50]; char gl_WorkGroupSize_x[50]; char gl_WorkGroupSize_y[50]; char gl_WorkGroupSize_z[50]; char gl_WorkGroupID_x[50]; char gl_WorkGroupID_y[50]; char gl_WorkGroupID_z[50]; char tempReg[50]; char stageInvocationID[50]; char blockInvocationID[50]; char temp[50]; char w[50]; char iw[50]; char locID[13][40]; char* code0; char* output; char* tempStr; int64_t tempLen; int64_t currentLen; int64_t maxCodeLength; int64_t maxTempLength; const 
char* oldLocale; } VkFFTSpecializationConstantsLayout; typedef struct { uint32_t dataUint32[10]; uint64_t dataUint64[10]; //specify what can be in layout uint64_t performWorkGroupShift[3]; uint64_t workGroupShift[3]; uint64_t performPostCompilationInputOffset; uint64_t inputOffset; uint64_t performPostCompilationOutputOffset; uint64_t outputOffset; uint64_t performPostCompilationKernelOffset; uint64_t kernelOffset; uint64_t structSize; } VkFFTPushConstantsLayout; typedef struct { uint64_t numBindings; uint64_t axisBlock[4]; uint64_t groupedBatch; VkFFTSpecializationConstantsLayout specializationConstants; VkFFTPushConstantsLayout pushConstants; uint64_t updatePushConstants; #if(VKFFT_BACKEND==0) VkBuffer* inputBuffer; VkBuffer* outputBuffer; VkDescriptorPool descriptorPool; VkDescriptorSetLayout descriptorSetLayout; VkDescriptorSet descriptorSet; VkPipelineLayout pipelineLayout; VkPipeline pipeline; VkDeviceMemory bufferLUTDeviceMemory; VkBuffer bufferLUT; VkDeviceMemory* bufferBluesteinDeviceMemory; VkDeviceMemory* bufferBluesteinFFTDeviceMemory; VkBuffer* bufferBluestein; VkBuffer* bufferBluesteinFFT; #elif(VKFFT_BACKEND==1) void** inputBuffer; void** outputBuffer; CUmodule VkFFTModule; CUfunction VkFFTKernel; void* bufferLUT; CUdeviceptr consts_addr; void** bufferBluestein; void** bufferBluesteinFFT; #elif(VKFFT_BACKEND==2) void** inputBuffer; void** outputBuffer; hipModule_t VkFFTModule; hipFunction_t VkFFTKernel; void* bufferLUT; hipDeviceptr_t consts_addr; void** bufferBluestein; void** bufferBluesteinFFT; #elif(VKFFT_BACKEND==3) cl_mem* inputBuffer; cl_mem* outputBuffer; cl_program program; cl_kernel kernel; cl_mem bufferLUT; cl_mem* bufferBluestein; cl_mem* bufferBluesteinFFT; #endif void* binary; uint64_t binarySize; uint64_t bufferLUTSize; uint64_t referenceLUT; } VkFFTAxis; typedef struct { uint64_t actualFFTSizePerAxis[3][3]; uint64_t numAxisUploads[3]; uint64_t axisSplit[3][4]; VkFFTAxis axes[3][4]; uint64_t multiUploadR2C; uint64_t 
actualPerformR2CPerAxis[3]; // automatically specified, shows if R2C is actually performed or inside FFT or as a separate step VkFFTAxis R2Cdecomposition; VkFFTAxis inverseBluesteinAxes[3][4]; } VkFFTPlan; typedef struct { VkFFTConfiguration configuration; VkFFTPlan* localFFTPlan; VkFFTPlan* localFFTPlan_inverse; //additional inverse plan uint64_t actualNumBatches; uint64_t firstAxis; uint64_t lastAxis; //Bluestein buffers reused among plans uint64_t useBluesteinFFT[3]; #if(VKFFT_BACKEND==0) VkDeviceMemory bufferBluesteinDeviceMemory[3]; VkDeviceMemory bufferBluesteinFFTDeviceMemory[3]; VkDeviceMemory bufferBluesteinIFFTDeviceMemory[3]; VkBuffer bufferBluestein[3]; VkBuffer bufferBluesteinFFT[3]; VkBuffer bufferBluesteinIFFT[3]; #elif(VKFFT_BACKEND==1) void* bufferBluestein[3]; void* bufferBluesteinFFT[3]; void* bufferBluesteinIFFT[3]; #elif(VKFFT_BACKEND==2) void* bufferBluestein[3]; void* bufferBluesteinFFT[3]; void* bufferBluesteinIFFT[3]; #elif(VKFFT_BACKEND==3) cl_mem bufferBluestein[3]; cl_mem bufferBluesteinFFT[3]; cl_mem bufferBluesteinIFFT[3]; #endif uint64_t bufferBluesteinSize[3]; void* applicationBluesteinString[3]; uint64_t applicationBluesteinStringSize[3]; uint64_t currentApplicationStringPos; uint64_t applicationStringSize;//size of saveApplicationString in bytes void* saveApplicationString;//memory array(uint32_t* for Vulkan, char* for CUDA/HIP/OpenCL) through which user can access VkFFT generated binaries. 
(will be allocated by VkFFT, deallocated with deleteVkFFT call) } VkFFTApplication; static inline VkFFTResult VkAppendLine(VkFFTSpecializationConstantsLayout* sc) { //appends code line stored in tempStr to generated code if (sc->tempLen < 0) return VKFFT_ERROR_INSUFFICIENT_TEMP_BUFFER; if (sc->currentLen + sc->tempLen > sc->maxCodeLength) return VKFFT_ERROR_INSUFFICIENT_CODE_BUFFER; sc->currentLen += sprintf(sc->output + sc->currentLen, "%s", sc->tempStr); return VKFFT_SUCCESS; }; static inline VkFFTResult VkAppendLineFromInput(VkFFTSpecializationConstantsLayout* sc, const char* in) { //appends code line passed in the in argument (not tempStr) to generated code if (sc->currentLen + (int64_t)strlen(in) > sc->maxCodeLength) return VKFFT_ERROR_INSUFFICIENT_CODE_BUFFER; sc->currentLen += sprintf(sc->output + sc->currentLen, "%s", in); return VKFFT_SUCCESS; }; static inline VkFFTResult appendLicense(VkFFTSpecializationConstantsLayout* sc) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, "\ // This file is part of VkFFT, a Vulkan Fast Fourier Transform library\n\ //\n\ // Copyright (C) 2020 - present Dmitrii Tolmachev \n\ //\n\ // Permission is hereby granted, free of charge, to any person obtaining a copy\n\ // of this software and associated documentation files (the \"Software\"), to deal\n\ // in the Software without restriction, including without limitation the rights\n\ // to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n\ // copies of the Software, and to permit persons to whom the Software is\n\ // furnished to do so, subject to the following conditions:\n\ //\n\ // The above copyright notice and this permission notice shall be included in\n\ // all copies or substantial portions of the Software.\n\ //\n\ // THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n\ // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n\ // FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE\n\ // AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n\ // LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n\ // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\n\ // THE SOFTWARE.\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; } static inline VkFFTResult VkMovComplex(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, "\ %s = %s;\n", out, in); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkMovReal(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, "\ %s = %s;\n", out, in); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkSharedStore(VkFFTSpecializationConstantsLayout* sc, const char* id, const char* in) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s] = %s;\n", id, in); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkSharedLoad(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* id) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, "\ %s = sdata[%s];\n", out, id); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkAddReal(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in_1, const char* in_2) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, "\ %s = %s + %s;\n", out, in_1, in_2); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkAddComplex(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in_1, const char* in_2) { VkFFTResult res = VKFFT_SUCCESS; 
sc->tempLen = sprintf(sc->tempStr, "\ %s.x = %s.x + %s.x;\n\ %s.y = %s.y + %s.y;\n", out, in_1, in_2, out, in_1, in_2); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkAddComplexInv(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in_1, const char* in_2) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, "\ %s.x = - %s.x - %s.x;\n\ %s.y = - %s.y - %s.y;\n", out, in_1, in_2, out, in_1, in_2); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkSubComplex(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in_1, const char* in_2) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, "\ %s.x = %s.x - %s.x;\n\ %s.y = %s.y - %s.y;\n", out, in_1, in_2, out, in_1, in_2); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkSubReal(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in_1, const char* in_2) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, "\ %s = %s - %s;\n", out, in_1, in_2); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkFMAComplex(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in_1, const char* in_num, const char* in_2) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, "\ %s.x = fma(%s.x, %s, %s.x);\n\ %s.y = fma(%s.y, %s, %s.y);\n", out, in_1, in_num, in_2, out, in_1, in_num, in_2); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkFMAReal(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in_1, const char* in_num, const char* in_2) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, "\ %s = fma(%s, %s, %s);\n", out, in_1, in_num, in_2); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; 
return res; }; static inline VkFFTResult VkMulComplex(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in_1, const char* in_2, const char* temp) { VkFFTResult res = VKFFT_SUCCESS; if (strcmp(out, in_1) && strcmp(out, in_2)) { sc->tempLen = sprintf(sc->tempStr, "\ %s.x = %s.x * %s.x - %s.y * %s.y;\n\ %s.y = %s.y * %s.x + %s.x * %s.y;\n", out, in_1, in_2, in_1, in_2, out, in_1, in_2, in_1, in_2); } else { if (temp) { sc->tempLen = sprintf(sc->tempStr, "\ %s.x = %s.x * %s.x - %s.y * %s.y;\n\ %s.y = %s.y * %s.x + %s.x * %s.y;\n\ %s = %s;\n", temp, in_1, in_2, in_1, in_2, temp, in_1, in_2, in_1, in_2, out, temp); } else return VKFFT_ERROR_NULL_TEMP_PASSED; } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkMulComplexConj(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in_1, const char* in_2, const char* temp) { VkFFTResult res = VKFFT_SUCCESS; if (strcmp(out, in_1) && strcmp(out, in_2)) { sc->tempLen = sprintf(sc->tempStr, "\ %s.x = %s.x * %s.x + %s.y * %s.y;\n\ %s.y = %s.y * %s.x - %s.x * %s.y;\n", out, in_1, in_2, in_1, in_2, out, in_1, in_2, in_1, in_2); } else { if (temp) { sc->tempLen = sprintf(sc->tempStr, "\ %s.x = %s.x * %s.x + %s.y * %s.y;\n\ %s.y = %s.y * %s.x - %s.x * %s.y;\n\ %s = %s;\n", temp, in_1, in_2, in_1, in_2, temp, in_1, in_2, in_1, in_2, out, temp); } else return VKFFT_ERROR_NULL_TEMP_PASSED; } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkMulComplexNumber(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in_1, const char* in_num) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, "\ %s.x = %s.x * %s;\n\ %s.y = %s.y * %s;\n", out, in_1, in_num, out, in_1, in_num); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkMulComplexNumberImag(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* 
in_1, const char* in_num, const char* temp) { VkFFTResult res = VKFFT_SUCCESS; if (strcmp(out, in_1)) { sc->tempLen = sprintf(sc->tempStr, "\ %s.x = - %s.y * %s;\n\ %s.y = %s.x * %s;\n", out, in_1, in_num, out, in_1, in_num); } else { if (temp) { sc->tempLen = sprintf(sc->tempStr, "\ %s.x = - %s.y * %s;\n\ %s.y = %s.x * %s;\n\ %s = %s;\n", temp, in_1, in_num, temp, in_1, in_num, out, temp); } else return VKFFT_ERROR_NULL_TEMP_PASSED; } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkDivComplexNumber(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in_1, const char* in_num) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, "\ %s.x = %s.x / %s;\n\ %s.y = %s.y / %s;\n", out, in_1, in_num, out, in_1, in_num); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkMulReal(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in_1, const char* in_2) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, "\ %s = %s * %s;\n", out, in_1, in_2); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkShuffleComplex(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in_1, const char* in_2, const char* temp) { VkFFTResult res = VKFFT_SUCCESS; if (strcmp(out, in_2)) { sc->tempLen = sprintf(sc->tempStr, "\ %s.x = %s.x - %s.y;\n\ %s.y = %s.y + %s.x;\n", out, in_1, in_2, out, in_1, in_2); } else { if (temp) { sc->tempLen = sprintf(sc->tempStr, "\ %s.x = %s.x - %s.y;\n\ %s.y = %s.y + %s.x;\n\ %s = %s;\n", temp, in_1, in_2, temp, in_1, in_2, out, temp); } else return VKFFT_ERROR_NULL_TEMP_PASSED; } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkShuffleComplexInv(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in_1, const char* in_2, const char* temp) { VkFFTResult 
res = VKFFT_SUCCESS; if (strcmp(out, in_2)) { sc->tempLen = sprintf(sc->tempStr, "\ %s.x = %s.x + %s.y;\n\ %s.y = %s.y - %s.x;\n", out, in_1, in_2, out, in_1, in_2); } else { if (temp) { sc->tempLen = sprintf(sc->tempStr, "\ %s.x = %s.x + %s.y;\n\ %s.y = %s.y - %s.x;\n\ %s = %s;\n", temp, in_1, in_2, temp, in_1, in_2, out, temp); } else return VKFFT_ERROR_NULL_TEMP_PASSED; } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkModReal(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in_1, const char* in_num) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, "\ %s = %s %% %s;\n", out, in_1, in_num); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkDivReal(VkFFTSpecializationConstantsLayout* sc, const char* out, const char* in_1, const char* in_num) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, "\ %s = %s / %s;\n", out, in_1, in_num); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; }; static inline VkFFTResult VkPermute(VkFFTSpecializationConstantsLayout* sc, const uint64_t* permute, const uint64_t num_elem, const uint64_t type, char** regIDs) { VkFFTResult res = VKFFT_SUCCESS; char temp_ID[13][20]; if (type == 0) { for (uint64_t i = 0; i < num_elem; i++) sprintf(temp_ID[i], "%s", sc->locID[i]); for (uint64_t i = 0; i < num_elem; i++) sprintf(sc->locID[i], "%s", temp_ID[permute[i]]); } if (type == 1) { for (uint64_t i = 0; i < num_elem; i++) sprintf(temp_ID[i], "%s", regIDs[i]); for (uint64_t i = 0; i < num_elem; i++) sprintf(regIDs[i], "%s", temp_ID[permute[i]]); } return res; }; static inline VkFFTResult initializeVkFFT(VkFFTApplication* app, VkFFTConfiguration inputLaunchConfiguration); static inline VkFFTResult VkFFTAppend(VkFFTApplication* app, int inverse, VkFFTLaunchParams* launchParams); static inline VkFFTResult appendVersion(VkFFTSpecializationConstantsLayout* 
sc) { VkFFTResult res = VKFFT_SUCCESS; #if(VKFFT_BACKEND==0) sc->tempLen = sprintf(sc->tempStr, "#version 450\n\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #endif return res; } static inline VkFFTResult appendExtensions(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* floatTypeInputMemory, const char* floatTypeOutputMemory, const char* floatTypeKernelMemory) { VkFFTResult res = VKFFT_SUCCESS; #if(VKFFT_BACKEND==0) //sc->tempLen = sprintf(sc->tempStr, "#extension GL_EXT_debug_printf : require\n\n"); //res = VkAppendLine(sc); //if (res != VKFFT_SUCCESS) return res; if ((!strcmp(floatType, "double")) || (sc->useUint64)) { sc->tempLen = sprintf(sc->tempStr, "\ #extension GL_ARB_gpu_shader_fp64 : enable\n\ #extension GL_ARB_gpu_shader_int64 : enable\n\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((!strcmp(floatTypeInputMemory, "half")) || (!strcmp(floatTypeOutputMemory, "half")) || (!strcmp(floatTypeKernelMemory, "half"))) { sc->tempLen = sprintf(sc->tempStr, "#extension GL_EXT_shader_16bit_storage : require\n\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } #elif(VKFFT_BACKEND==1) #elif(VKFFT_BACKEND==2) #ifdef VKFFT_OLD_ROCM sc->tempLen = sprintf(sc->tempStr, "\ #include \n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #endif #elif(VKFFT_BACKEND==3) if ((!strcmp(floatType, "double")) || (sc->useUint64)) { sc->tempLen = sprintf(sc->tempStr, "\ #pragma OPENCL EXTENSION cl_khr_fp64 : enable\n\ #pragma OPENCL EXTENSION cl_khr_int64 : enable\n\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } #endif return res; } static inline VkFFTResult appendLayoutVkFFT(VkFFTSpecializationConstantsLayout* sc) { VkFFTResult res = VKFFT_SUCCESS; #if(VKFFT_BACKEND==0) sc->tempLen = sprintf(sc->tempStr, "layout (local_size_x = %" PRIu64 ", local_size_y = %" PRIu64 ", local_size_z = %" PRIu64 ") in;\n", sc->localSize[0], sc->localSize[1], sc->localSize[2]); 
res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #elif(VKFFT_BACKEND==1) #elif(VKFFT_BACKEND==2) #elif(VKFFT_BACKEND==3) #endif return res; } static inline VkFFTResult appendConstant(VkFFTSpecializationConstantsLayout* sc, const char* type, const char* name, const char* defaultVal, const char* LFending) { VkFFTResult res = VKFFT_SUCCESS; #if(VKFFT_BACKEND==3) sc->tempLen = sprintf(sc->tempStr, "__constant %s %s = %s%s;\n", type, name, defaultVal, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #else sc->tempLen = sprintf(sc->tempStr, "const %s %s = %s%s;\n", type, name, defaultVal, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #endif return res; } static inline VkFFTResult appendPushConstant(VkFFTSpecializationConstantsLayout* sc, const char* type, const char* name) { VkFFTResult res = VKFFT_SUCCESS; sc->tempLen = sprintf(sc->tempStr, " %s %s;\n", type, name); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; } static inline VkFFTResult appendBarrierVkFFT(VkFFTSpecializationConstantsLayout* sc, uint64_t numTab) { VkFFTResult res = VKFFT_SUCCESS; char tabs[100] = ""; for (uint64_t i = 0; (i < numTab) && (i < 99); i++) sprintf(tabs + i, " "); #if(VKFFT_BACKEND==0) sc->tempLen = sprintf(sc->tempStr, "%sbarrier();\n\n", tabs); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #elif(VKFFT_BACKEND==1) sc->tempLen = sprintf(sc->tempStr, "%s__syncthreads();\n\n", tabs); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #elif(VKFFT_BACKEND==2) sc->tempLen = sprintf(sc->tempStr, "%s__syncthreads();\n\n", tabs); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #elif(VKFFT_BACKEND==3) sc->tempLen = sprintf(sc->tempStr, "%sbarrier(CLK_LOCAL_MEM_FENCE);\n\n", tabs); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #endif return res; } static inline VkFFTResult appendPushConstantsVkFFT(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* 
uintType) { VkFFTResult res = VKFFT_SUCCESS; if (sc->pushConstantsStructSize == 0) return res; #if(VKFFT_BACKEND==0) sc->tempLen = sprintf(sc->tempStr, "layout(push_constant) uniform PushConsts\n{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #elif(VKFFT_BACKEND==1) sc->tempLen = sprintf(sc->tempStr, " typedef struct {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #elif(VKFFT_BACKEND==2) sc->tempLen = sprintf(sc->tempStr, " typedef struct {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #elif(VKFFT_BACKEND==3) sc->tempLen = sprintf(sc->tempStr, " typedef struct {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #endif if (sc->performWorkGroupShift[0]) { res = appendPushConstant(sc, uintType, "workGroupShiftX"); if (res != VKFFT_SUCCESS) return res; } if (sc->performWorkGroupShift[1]) { res = appendPushConstant(sc, uintType, "workGroupShiftY"); if (res != VKFFT_SUCCESS) return res; } if (sc->performWorkGroupShift[2]) { res = appendPushConstant(sc, uintType, "workGroupShiftZ"); if (res != VKFFT_SUCCESS) return res; } if (sc->performPostCompilationInputOffset) { res = appendPushConstant(sc, uintType, "inputOffset"); if (res != VKFFT_SUCCESS) return res; } if (sc->performPostCompilationOutputOffset) { res = appendPushConstant(sc, uintType, "outputOffset"); if (res != VKFFT_SUCCESS) return res; } if (sc->performPostCompilationKernelOffset) { res = appendPushConstant(sc, uintType, "kernelOffset"); if (res != VKFFT_SUCCESS) return res; } #if(VKFFT_BACKEND==0) sc->tempLen = sprintf(sc->tempStr, "} consts;\n\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #elif(VKFFT_BACKEND==1) sc->tempLen = sprintf(sc->tempStr, " }PushConsts;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " __constant__ PushConsts consts;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #elif(VKFFT_BACKEND==2) sc->tempLen = 
sprintf(sc->tempStr, " }PushConsts;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " __constant__ PushConsts consts;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #elif(VKFFT_BACKEND==3) sc->tempLen = sprintf(sc->tempStr, " }PushConsts;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #endif return res; } static inline VkFFTResult appendConstantsVkFFT(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* uintType) { VkFFTResult res = VKFFT_SUCCESS; char LFending[4] = ""; if (!strcmp(floatType, "float")) sprintf(LFending, "f"); #if(VKFFT_BACKEND==0) if (!strcmp(floatType, "double")) sprintf(LFending, "LF"); #elif(VKFFT_BACKEND==1) if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #elif(VKFFT_BACKEND==2) if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #elif(VKFFT_BACKEND==3) //if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #endif res = appendConstant(sc, floatType, "loc_PI", "3.1415926535897932384626433832795", LFending); if (res != VKFFT_SUCCESS) return res; res = appendConstant(sc, floatType, "loc_SQRT1_2", "0.70710678118654752440084436210485", LFending); if (res != VKFFT_SUCCESS) return res; return res; } static inline VkFFTResult appendSinCos20(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* uintType) { VkFFTResult res = VKFFT_SUCCESS; char functionDefinitions[100] = ""; char vecType[30]; char LFending[4] = ""; if (!strcmp(floatType, "float")) sprintf(LFending, "f"); #if(VKFFT_BACKEND==0) if (!strcmp(floatType, "half")) sprintf(vecType, "f16vec2"); if (!strcmp(floatType, "float")) sprintf(vecType, "vec2"); if (!strcmp(floatType, "double")) sprintf(vecType, "dvec2"); if (!strcmp(floatType, "double")) sprintf(LFending, "LF"); #elif(VKFFT_BACKEND==1) if (!strcmp(floatType, "half")) sprintf(vecType, "f16vec2"); if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, 
"double")) sprintf(vecType, "double2"); if (!strcmp(floatType, "double")) sprintf(LFending, "l"); sprintf(functionDefinitions, "__device__ static __inline__ "); #elif(VKFFT_BACKEND==2) if (!strcmp(floatType, "half")) sprintf(vecType, "f16vec2"); if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); if (!strcmp(floatType, "double")) sprintf(LFending, "l"); sprintf(functionDefinitions, "__device__ static __inline__ "); #elif(VKFFT_BACKEND==3) if (!strcmp(floatType, "half")) sprintf(vecType, "f16vec2"); if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); //if (!strcmp(floatType, "double")) sprintf(LFending, "l"); sprintf(functionDefinitions, "static __inline__ "); #endif res = appendConstant(sc, floatType, "loc_2_PI", "0.63661977236758134307553505349006", LFending); if (res != VKFFT_SUCCESS) return res; res = appendConstant(sc, floatType, "loc_PI_2", "1.5707963267948966192313216916398", LFending); if (res != VKFFT_SUCCESS) return res; res = appendConstant(sc, floatType, "a1", "0.99999999999999999999962122687403772", LFending); if (res != VKFFT_SUCCESS) return res; res = appendConstant(sc, floatType, "a3", "-0.166666666666666666637194166219637268", LFending); if (res != VKFFT_SUCCESS) return res; res = appendConstant(sc, floatType, "a5", "0.00833333333333333295212653322266277182", LFending); if (res != VKFFT_SUCCESS) return res; res = appendConstant(sc, floatType, "a7", "-0.000198412698412696489459896530659927773", LFending); if (res != VKFFT_SUCCESS) return res; res = appendConstant(sc, floatType, "a9", "2.75573192239364018847578909205399262e-6", LFending); if (res != VKFFT_SUCCESS) return res; res = appendConstant(sc, floatType, "a11", "-2.50521083781017605729370231280411712e-8", LFending); if (res != VKFFT_SUCCESS) return res; res = appendConstant(sc, floatType, "a13", "1.60590431721336942356660057796782021e-10", LFending); if 
(res != VKFFT_SUCCESS) return res; res = appendConstant(sc, floatType, "a15", "-7.64712637907716970380859898835680587e-13", LFending); if (res != VKFFT_SUCCESS) return res; res = appendConstant(sc, floatType, "a17", "2.81018528153898622636194976499656274e-15", LFending); if (res != VKFFT_SUCCESS) return res; res = appendConstant(sc, floatType, "ab", "-7.97989713648499642889739108679114937e-18", LFending); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ %s%s sincos_20(double x)\n\ {\n\ //minimax coefs for sin for 0..pi/2 range\n\ double y = abs(x * loc_2_PI);\n\ double q = floor(y);\n\ int quadrant = int(q);\n\ double t = (quadrant & 1) != 0 ? 1 - y + q : y - q;\n\ t *= loc_PI_2;\n\ double t2 = t * t;\n\ double r = fma(fma(fma(fma(fma(fma(fma(fma(fma(ab, t2, a17), t2, a15), t2, a13), t2, a11), t2, a9), t2, a7), t2, a5), t2, a3), t2 * t, t);\n\ %s cos_sin;\n\ cos_sin.x = ((quadrant == 0) || (quadrant == 3)) ? sqrt(1 - r * r) : -sqrt(1 - r * r);\n\ r = x < 0 ? -r : r;\n\ cos_sin.y = (quadrant & 2) != 0 ? 
-r : r;\n\ return cos_sin;\n\ }\n\n", functionDefinitions, vecType, vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; return res; } static inline VkFFTResult appendConversion(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* floatTypeDifferent) { VkFFTResult res = VKFFT_SUCCESS; #if(VKFFT_BACKEND!=0) char functionDefinitions[100] = ""; char vecType[30]; char vecTypeDifferent[30]; #endif #if(VKFFT_BACKEND==0) #elif(VKFFT_BACKEND==1) sprintf(functionDefinitions, "__device__ static __inline__ "); #elif(VKFFT_BACKEND==2) sprintf(functionDefinitions, "__device__ static __inline__ "); #elif(VKFFT_BACKEND==3) sprintf(functionDefinitions, "static __inline__ "); #endif #if(VKFFT_BACKEND!=0) if (!strcmp(floatType, "half")) sprintf(vecType, "f16vec2"); if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); if (!strcmp(floatTypeDifferent, "half")) sprintf(vecTypeDifferent, "f16vec2"); if (!strcmp(floatTypeDifferent, "float")) sprintf(vecTypeDifferent, "float2"); if (!strcmp(floatTypeDifferent, "double")) sprintf(vecTypeDifferent, "double2"); sc->tempLen = sprintf(sc->tempStr, "\ %s%s conv_%s(%s input)\n\ {\n\ %s ret_val;\n\ ret_val.x = (%s) input.x;\n\ ret_val.y = (%s) input.y;\n\ return ret_val;\n\ }\n\n", functionDefinitions, vecType, vecType, vecTypeDifferent, vecType, floatType, floatType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ %s%s conv_%s(%s input)\n\ {\n\ %s ret_val;\n\ ret_val.x = (%s) input.x;\n\ ret_val.y = (%s) input.y;\n\ return ret_val;\n\ }\n\n", functionDefinitions, vecTypeDifferent, vecTypeDifferent, vecType, vecTypeDifferent, floatTypeDifferent, floatTypeDifferent); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #endif return res; } static inline VkFFTResult appendInputLayoutVkFFT(VkFFTSpecializationConstantsLayout* sc, uint64_t id, const char* floatTypeMemory, uint64_t 
inputType) { VkFFTResult res = VKFFT_SUCCESS; char vecType[30]; switch (inputType) { case 0: case 1: case 2: case 3: case 4: case 6: { #if(VKFFT_BACKEND==0) if (!strcmp(floatTypeMemory, "half")) { sc->inputNumberByteSize = 2 * 2; sprintf(vecType, "f16vec2"); } if (!strcmp(floatTypeMemory, "float")) { sc->inputNumberByteSize = 2 * sizeof(float); sprintf(vecType, "vec2"); } if (!strcmp(floatTypeMemory, "double")) { sc->inputNumberByteSize = 2 * sizeof(double); sprintf(vecType, "dvec2"); } if (sc->inputBufferBlockNum == 1) { sc->tempLen = sprintf(sc->tempStr, "\ layout(std430, binding = %" PRIu64 ") buffer DataIn{\n\ %s inputs[%" PRIu64 "];\n\ };\n\n", id, vecType, sc->inputBufferBlockSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "\ layout(std430, binding = %" PRIu64 ") buffer DataIn{\n\ %s inputs[%" PRIu64 "];\n\ } inputBlocks[%" PRIu64 "];\n\n", id, vecType, sc->inputBufferBlockSize, sc->inputBufferBlockNum); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } #elif(VKFFT_BACKEND==1) if (!strcmp(floatTypeMemory, "half")) { sc->inputNumberByteSize = 2 * 2; sprintf(vecType, "f16vec2"); } if (!strcmp(floatTypeMemory, "float")) { sc->inputNumberByteSize = 2 * sizeof(float); sprintf(vecType, "float2"); } if (!strcmp(floatTypeMemory, "double")) { sc->inputNumberByteSize = 2 * sizeof(double); sprintf(vecType, "double2"); } #elif(VKFFT_BACKEND==2) if (!strcmp(floatTypeMemory, "half")) { sc->inputNumberByteSize = 2 * 2; sprintf(vecType, "f16vec2"); } if (!strcmp(floatTypeMemory, "float")) { sc->inputNumberByteSize = 2 * sizeof(float); sprintf(vecType, "float2"); } if (!strcmp(floatTypeMemory, "double")) { sc->inputNumberByteSize = 2 * sizeof(double); sprintf(vecType, "double2"); } #elif(VKFFT_BACKEND==3) if (!strcmp(floatTypeMemory, "half")) { sc->inputNumberByteSize = 2 * 2; sprintf(vecType, "f16vec2"); } if (!strcmp(floatTypeMemory, "float")) { sc->inputNumberByteSize = 2 * sizeof(float); 
sprintf(vecType, "float2"); } if (!strcmp(floatTypeMemory, "double")) { sc->inputNumberByteSize = 2 * sizeof(double); sprintf(vecType, "double2"); } #endif break; } case 5: case 110: case 111: case 120: case 121: case 130: case 131: case 140: case 141: case 142: case 143: case 144: case 145: { if (!strcmp(floatTypeMemory, "half")) { sc->inputNumberByteSize = 2; sprintf(vecType, "float16_t"); } if (!strcmp(floatTypeMemory, "float")) { sc->inputNumberByteSize = sizeof(float); sprintf(vecType, "float"); } if (!strcmp(floatTypeMemory, "double")) { sc->inputNumberByteSize = sizeof(double); sprintf(vecType, "double"); } #if(VKFFT_BACKEND==0) if (sc->inputBufferBlockNum == 1) { sc->tempLen = sprintf(sc->tempStr, "\ layout(std430, binding = %" PRIu64 ") buffer DataIn{\n\ %s inputs[%" PRIu64 "];\n\ };\n\n", id, vecType, 2 * sc->inputBufferBlockSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "\ layout(std430, binding = %" PRIu64 ") buffer DataIn{\n\ %s inputs[%" PRIu64 "];\n\ } inputBlocks[%" PRIu64 "];\n\n", id, vecType, 2 * sc->inputBufferBlockSize, sc->inputBufferBlockNum); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } #endif break; } } return res; } static inline VkFFTResult appendOutputLayoutVkFFT(VkFFTSpecializationConstantsLayout* sc, uint64_t id, const char* floatTypeMemory, uint64_t outputType) { VkFFTResult res = VKFFT_SUCCESS; char vecType[30]; switch (outputType) { case 0: case 1: case 2: case 3: case 4: case 5: { #if(VKFFT_BACKEND==0) if (!strcmp(floatTypeMemory, "half")) { sc->outputNumberByteSize = 2 * 2; sprintf(vecType, "f16vec2"); } if (!strcmp(floatTypeMemory, "float")) { sc->outputNumberByteSize = 2 * sizeof(float); sprintf(vecType, "vec2"); } if (!strcmp(floatTypeMemory, "double")) { sc->outputNumberByteSize = 2 * sizeof(double); sprintf(vecType, "dvec2"); } if (sc->outputBufferBlockNum == 1) { sc->tempLen = sprintf(sc->tempStr, "\ layout(std430, binding = %" PRIu64 ") 
buffer DataOut{\n\ %s outputs[%" PRIu64 "];\n\ };\n\n", id, vecType, sc->outputBufferBlockSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "\ layout(std430, binding = %" PRIu64 ") buffer DataOut{\n\ %s outputs[%" PRIu64 "];\n\ } outputBlocks[%" PRIu64 "];\n\n", id, vecType, sc->outputBufferBlockSize, sc->outputBufferBlockNum); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } #elif(VKFFT_BACKEND==1) if (!strcmp(floatTypeMemory, "half")) { sc->outputNumberByteSize = 2 * 2; sprintf(vecType, "f16vec2"); } if (!strcmp(floatTypeMemory, "float")) { sc->outputNumberByteSize = 2 * sizeof(float); sprintf(vecType, "float2"); } if (!strcmp(floatTypeMemory, "double")) { sc->outputNumberByteSize = 2 * sizeof(double); sprintf(vecType, "double2"); } #elif(VKFFT_BACKEND==2) if (!strcmp(floatTypeMemory, "half")) { sc->outputNumberByteSize = 2 * 2; sprintf(vecType, "f16vec2"); } if (!strcmp(floatTypeMemory, "float")) { sc->outputNumberByteSize = 2 * sizeof(float); sprintf(vecType, "float2"); } if (!strcmp(floatTypeMemory, "double")) { sc->outputNumberByteSize = 2 * sizeof(double); sprintf(vecType, "double2"); } #elif(VKFFT_BACKEND==3) if (!strcmp(floatTypeMemory, "half")) { sc->outputNumberByteSize = 2 * 2; sprintf(vecType, "f16vec2"); } if (!strcmp(floatTypeMemory, "float")) { sc->outputNumberByteSize = 2 * sizeof(float); sprintf(vecType, "float2"); } if (!strcmp(floatTypeMemory, "double")) { sc->outputNumberByteSize = 2 * sizeof(double); sprintf(vecType, "double2"); } #endif break; } case 6: case 110: case 111: case 120: case 121: case 130: case 131: case 140: case 141: case 142: case 143: case 144: case 145: { if (!strcmp(floatTypeMemory, "half")) { sc->outputNumberByteSize = 2; sprintf(vecType, "float16_t"); } if (!strcmp(floatTypeMemory, "float")) { sc->outputNumberByteSize = sizeof(float); sprintf(vecType, "float"); } if (!strcmp(floatTypeMemory, "double")) { sc->outputNumberByteSize = 
sizeof(double); sprintf(vecType, "double"); } #if(VKFFT_BACKEND==0) if (sc->outputBufferBlockNum == 1) { sc->tempLen = sprintf(sc->tempStr, "\ layout(std430, binding = %" PRIu64 ") buffer DataOut{\n\ %s outputs[%" PRIu64 "];\n\ };\n\n", id, vecType, 2 * sc->outputBufferBlockSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "\ layout(std430, binding = %" PRIu64 ") buffer DataOut{\n\ %s outputs[%" PRIu64 "];\n\ } outputBlocks[%" PRIu64 "];\n\n", id, vecType, 2 * sc->outputBufferBlockSize, sc->outputBufferBlockNum); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } #endif break; } } return res; } static inline VkFFTResult appendKernelLayoutVkFFT(VkFFTSpecializationConstantsLayout* sc, uint64_t id, const char* floatTypeMemory) { VkFFTResult res = VKFFT_SUCCESS; char vecType[30]; #if(VKFFT_BACKEND==0) if (!strcmp(floatTypeMemory, "half")) { sc->kernelNumberByteSize = 2 * 2; sprintf(vecType, "f16vec2"); } if (!strcmp(floatTypeMemory, "float")) { sc->kernelNumberByteSize = 2 * sizeof(float); sprintf(vecType, "vec2"); } if (!strcmp(floatTypeMemory, "double")) { sc->kernelNumberByteSize = 2 * sizeof(double); sprintf(vecType, "dvec2"); } if (sc->kernelBlockNum == 1) { sc->tempLen = sprintf(sc->tempStr, "\ layout(std430, binding = %" PRIu64 ") buffer Kernel_FFT{\n\ %s kernel_obj[%" PRIu64 "];\n\ };\n\n", id, vecType, sc->kernelBlockSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "\ layout(std430, binding = %" PRIu64 ") buffer Kernel_FFT{\n\ %s kernel_obj[%" PRIu64 "];\n\ } kernelBlocks[%" PRIu64 "];\n\n", id, vecType, sc->kernelBlockSize, sc->kernelBlockNum); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } #elif(VKFFT_BACKEND==1) if (!strcmp(floatTypeMemory, "half")) { sc->kernelNumberByteSize = 2 * 2; sprintf(vecType, "f16vec2"); } if (!strcmp(floatTypeMemory, "float")) { sc->kernelNumberByteSize = 2 * sizeof(float); 
sprintf(vecType, "float2"); } if (!strcmp(floatTypeMemory, "double")) { sc->kernelNumberByteSize = 2 * sizeof(double); sprintf(vecType, "double2"); } #elif(VKFFT_BACKEND==2) if (!strcmp(floatTypeMemory, "half")) { sc->kernelNumberByteSize = 2 * 2; sprintf(vecType, "f16vec2"); } if (!strcmp(floatTypeMemory, "float")) { sc->kernelNumberByteSize = 2 * sizeof(float); sprintf(vecType, "float2"); } if (!strcmp(floatTypeMemory, "double")) { sc->kernelNumberByteSize = 2 * sizeof(double); sprintf(vecType, "double2"); } #elif(VKFFT_BACKEND==3) if (!strcmp(floatTypeMemory, "half")) { sc->kernelNumberByteSize = 2 * 2; sprintf(vecType, "f16vec2"); } if (!strcmp(floatTypeMemory, "float")) { sc->kernelNumberByteSize = 2 * sizeof(float); sprintf(vecType, "float2"); } if (!strcmp(floatTypeMemory, "double")) { sc->kernelNumberByteSize = 2 * sizeof(double); sprintf(vecType, "double2"); } #endif return res; } static inline VkFFTResult appendLUTLayoutVkFFT(VkFFTSpecializationConstantsLayout* sc, uint64_t id, const char* floatType) { VkFFTResult res = VKFFT_SUCCESS; char vecType[30]; #if(VKFFT_BACKEND==0) if (!strcmp(floatType, "float")) sprintf(vecType, "vec2"); if (!strcmp(floatType, "double")) sprintf(vecType, "dvec2"); sc->tempLen = sprintf(sc->tempStr, "\ layout(std430, binding = %" PRIu64 ") readonly buffer DataLUT {\n\ %s twiddleLUT[];\n\ };\n", id, vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #elif(VKFFT_BACKEND==1) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); #elif(VKFFT_BACKEND==2) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); #elif(VKFFT_BACKEND==3) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); #endif return res; } static inline VkFFTResult appendBluesteinLayoutVkFFT(VkFFTSpecializationConstantsLayout* sc, uint64_t id, 
const char* floatType) { VkFFTResult res = VKFFT_SUCCESS; char vecType[30]; uint64_t loc_id = id; #if(VKFFT_BACKEND==0) if (!strcmp(floatType, "float")) sprintf(vecType, "vec2"); if (!strcmp(floatType, "double")) sprintf(vecType, "dvec2"); if (sc->BluesteinConvolutionStep) { sc->tempLen = sprintf(sc->tempStr, "\ layout(std430, binding = %" PRIu64 ") readonly buffer DataBluesteinConvolutionKernel {\n\ %s BluesteinConvolutionKernel[];\n\ };\n", loc_id, vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; loc_id++; } if (sc->BluesteinPreMultiplication || sc->BluesteinPostMultiplication) { sc->tempLen = sprintf(sc->tempStr, "\ layout(std430, binding = %" PRIu64 ") readonly buffer DataBluesteinMultiplication {\n\ %s BluesteinMultiplication[];\n\ };\n", loc_id, vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; loc_id++; } #elif(VKFFT_BACKEND==1) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); #elif(VKFFT_BACKEND==2) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); #elif(VKFFT_BACKEND==3) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); #endif return res; } static inline VkFFTResult indexInputVkFFT(VkFFTSpecializationConstantsLayout* sc, const char* uintType, uint64_t inputType, const char* index_x, const char* index_y, const char* coordinate, const char* batchID) { VkFFTResult res = VKFFT_SUCCESS; switch (inputType % 1000) { case 0: case 2: case 3: case 4:case 5: case 6: case 110: case 120: case 130: case 140: case 142: case 144: {//single_c2c + single_c2c_strided char inputOffset[30] = ""; if (sc->inputOffset > 0) { sprintf(inputOffset, "%" PRIu64 " + ", sc->inputOffset / sc->inputNumberByteSize); } else { if (sc->performPostCompilationInputOffset) { if (inputType < 1000) sprintf(inputOffset, 
"consts.inputOffset + "); else sprintf(inputOffset, "consts.kernelOffset + "); } } char shiftX[500] = ""; if (sc->inputStride[0] == 1) sprintf(shiftX, "(%s)", index_x); else sprintf(shiftX, "(%s) * %" PRIu64 "", index_x, sc->inputStride[0]); char shiftY[500] = ""; uint64_t mult = (sc->mergeSequencesR2C) ? 2 : 1; if (sc->size[1] > 1) { if (sc->numAxisUploads == 1) { if (sc->axisSwapped) { if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + (%s + consts.workGroupShiftY) * %" PRIu64 "", sc->gl_WorkGroupID_y, mult * sc->localSize[0] * sc->inputStride[1]); else sprintf(shiftY, " + %s * %" PRIu64 "", sc->gl_WorkGroupID_y, mult * sc->localSize[0] * sc->inputStride[1]); } else { if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + (%s + consts.workGroupShiftY) * %" PRIu64 "", sc->gl_WorkGroupID_y, mult * sc->localSize[1] * sc->inputStride[1]); else sprintf(shiftY, " + %s * %" PRIu64 "", sc->gl_WorkGroupID_y, mult * sc->localSize[1] * sc->inputStride[1]); } } else { if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + (%s + consts.workGroupShiftY) * %" PRIu64 "", sc->gl_WorkGroupID_y, sc->inputStride[1]); else sprintf(shiftY, " + %s * %" PRIu64 "", sc->gl_WorkGroupID_y, sc->inputStride[1]); } } char shiftZ[500] = ""; if (sc->size[2] > 1) { if (sc->numCoordinates * sc->matrixConvolution * sc->numBatches > 1) { if (sc->performWorkGroupShift[2]) sprintf(shiftZ, " + ((%s + consts.workGroupShiftZ * %s) %% %" PRIu64 ") * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->gl_WorkGroupSize_z, sc->dispatchZactualFFTSize, sc->inputStride[2]); else sprintf(shiftZ, " + (%s %% %" PRIu64 ") * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->dispatchZactualFFTSize, sc->inputStride[2]); } else { if (sc->performWorkGroupShift[2]) sprintf(shiftZ, " + (%s + consts.workGroupShiftZ * %s) * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->gl_WorkGroupSize_z, sc->inputStride[2]); else sprintf(shiftZ, " + %s * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->inputStride[2]); } } char 
shiftCoordinate[500] = ""; uint64_t maxCoordinate = sc->numCoordinates * sc->matrixConvolution; if (sc->numCoordinates * sc->matrixConvolution > 1) { sprintf(shiftCoordinate, " + ((%s / %" PRIu64 ") %% %" PRIu64 ") * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->dispatchZactualFFTSize, maxCoordinate, sc->inputStride[3]); } if ((sc->matrixConvolution > 1) && (sc->convolutionStep)) { maxCoordinate = 1; sprintf(shiftCoordinate, " + %s * %" PRIu64 "", coordinate, sc->inputStride[3]); } char shiftBatch[500] = ""; if ((sc->numBatches > 1) || (sc->numKernels > 1)) { if (sc->convolutionStep && (sc->numKernels > 1)) { sprintf(shiftBatch, " + %s * %" PRIu64 "", batchID, sc->inputStride[4]); } else sprintf(shiftBatch, " + (%s / %" PRIu64 ") * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->dispatchZactualFFTSize * maxCoordinate, sc->inputStride[4]); } sc->tempLen = sprintf(sc->tempStr, "%s%s%s%s%s%s", inputOffset, shiftX, shiftY, shiftZ, shiftCoordinate, shiftBatch); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; break; } case 1: case 111: case 121: case 131: case 141: case 143: case 145: {//grouped_c2c char inputOffset[30] = ""; if (sc->inputOffset > 0) { sprintf(inputOffset, "%" PRIu64 " + ", sc->inputOffset / sc->inputNumberByteSize); } else { if (sc->performPostCompilationInputOffset) { if (inputType < 1000) sprintf(inputOffset, "consts.inputOffset + "); else sprintf(inputOffset, "consts.kernelOffset + "); } } char shiftX[500] = ""; if (sc->inputStride[0] == 1) sprintf(shiftX, "(%s)", index_x); else sprintf(shiftX, "(%s) * %" PRIu64 "", index_x, sc->inputStride[0]); char shiftY[500] = ""; if (index_y) sprintf(shiftY, " + (%s) * %" PRIu64 "", index_y, sc->inputStride[1]); char shiftZ[500] = ""; if (sc->size[2] > 1) { if (sc->numCoordinates * sc->matrixConvolution * sc->numBatches > 1) { if (sc->performWorkGroupShift[2]) sprintf(shiftZ, " + ((%s + consts.workGroupShiftZ * %s) %% %" PRIu64 ") * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->gl_WorkGroupSize_z, 
sc->dispatchZactualFFTSize, sc->inputStride[2]); else sprintf(shiftZ, " + (%s %% %" PRIu64 ") * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->dispatchZactualFFTSize, sc->inputStride[2]); } else { if (sc->performWorkGroupShift[2]) sprintf(shiftZ, " + (%s + consts.workGroupShiftZ * %s) * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->gl_WorkGroupSize_z, sc->inputStride[2]); else sprintf(shiftZ, " + %s * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->inputStride[2]); } } char shiftCoordinate[500] = ""; uint64_t maxCoordinate = sc->numCoordinates * sc->matrixConvolution; if (sc->numCoordinates * sc->matrixConvolution > 1) { sprintf(shiftCoordinate, " + ((%s / %" PRIu64 ") %% %" PRIu64 ") * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->dispatchZactualFFTSize, maxCoordinate, sc->inputStride[3]); } if ((sc->matrixConvolution > 1) && (sc->convolutionStep)) { maxCoordinate = 1; sprintf(shiftCoordinate, " + %s * %" PRIu64 "", coordinate, sc->inputStride[3]); } char shiftBatch[500] = ""; if ((sc->numBatches > 1) || (sc->numKernels > 1)) { if (sc->convolutionStep && (sc->numKernels > 1)) { sprintf(shiftBatch, " + %s * %" PRIu64 "", batchID, sc->inputStride[4]); } else sprintf(shiftBatch, " + (%s / %" PRIu64 ") * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->dispatchZactualFFTSize * maxCoordinate, sc->inputStride[4]); } sc->tempLen = sprintf(sc->tempStr, "%s%s%s%s%s%s", inputOffset, shiftX, shiftY, shiftZ, shiftCoordinate, shiftBatch); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; break; } } return res; } static inline VkFFTResult indexOutputVkFFT(VkFFTSpecializationConstantsLayout* sc, const char* uintType, uint64_t outputType, const char* index_x, const char* index_y, const char* coordinate, const char* batchID) { VkFFTResult res = VKFFT_SUCCESS; switch (outputType % 1000) {//single_c2c + single_c2c_strided case 0: case 2: case 3: case 4: case 5: case 6: case 110: case 120: case 130: case 140: case 142: case 144: { char outputOffset[30] = ""; if 
(sc->outputOffset > 0) { sprintf(outputOffset, "%" PRIu64 " + ", sc->outputOffset / sc->outputNumberByteSize); } else { if (sc->performPostCompilationOutputOffset) { if (outputType < 1000) sprintf(outputOffset, "consts.outputOffset + "); else sprintf(outputOffset, "consts.kernelOffset + "); } } char shiftX[500] = ""; if (sc->numAxisUploads == 1) sprintf(shiftX, "(%s)", index_x); else sprintf(shiftX, "(%s) * %" PRIu64 "", index_x, sc->outputStride[0]); char shiftY[500] = ""; uint64_t mult = (sc->mergeSequencesR2C) ? 2 : 1; if (sc->size[1] > 1) { if (sc->numAxisUploads == 1) { if (sc->axisSwapped) { if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + (%s + consts.workGroupShiftY) * %" PRIu64 "", sc->gl_WorkGroupID_y, mult * sc->localSize[0] * sc->outputStride[1]); else sprintf(shiftY, " + %s * %" PRIu64 "", sc->gl_WorkGroupID_y, mult * sc->localSize[0] * sc->outputStride[1]); } else { if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + (%s + consts.workGroupShiftY) * %" PRIu64 "", sc->gl_WorkGroupID_y, mult * sc->localSize[1] * sc->outputStride[1]); else sprintf(shiftY, " + %s * %" PRIu64 "", sc->gl_WorkGroupID_y, mult * sc->localSize[1] * sc->outputStride[1]); } } else { if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + (%s + consts.workGroupShiftY) * %" PRIu64 "", sc->gl_WorkGroupID_y, sc->outputStride[1]); else sprintf(shiftY, " + %s * %" PRIu64 "", sc->gl_WorkGroupID_y, sc->outputStride[1]); } } char shiftZ[500] = ""; if (sc->size[2] > 1) { if (sc->numCoordinates * sc->matrixConvolution * sc->numBatches > 1) { if (sc->performWorkGroupShift[2]) sprintf(shiftZ, " + ((%s + consts.workGroupShiftZ * %s) %% %" PRIu64 ") * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->gl_WorkGroupSize_z, sc->dispatchZactualFFTSize, sc->outputStride[2]); else sprintf(shiftZ, " + (%s %% %" PRIu64 ") * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->dispatchZactualFFTSize, sc->outputStride[2]); } else { if (sc->performWorkGroupShift[2]) sprintf(shiftZ, " + (%s + 
consts.workGroupShiftZ * %s) * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->gl_WorkGroupSize_z, sc->outputStride[2]); else sprintf(shiftZ, " + %s * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->outputStride[2]); } } char shiftCoordinate[500] = ""; uint64_t maxCoordinate = sc->numCoordinates * sc->matrixConvolution; if (sc->numCoordinates * sc->matrixConvolution > 1) { sprintf(shiftCoordinate, " + ((%s / %" PRIu64 ") %% %" PRIu64 ") * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->dispatchZactualFFTSize, maxCoordinate, sc->outputStride[3]); } if ((sc->matrixConvolution > 1) && (sc->convolutionStep)) { maxCoordinate = 1; sprintf(shiftCoordinate, " + %s * %" PRIu64 "", coordinate, sc->outputStride[3]); } char shiftBatch[500] = ""; if ((sc->numBatches > 1) || (sc->numKernels > 1)) { if (sc->convolutionStep && (sc->numKernels > 1)) { sprintf(shiftBatch, " + %s * %" PRIu64 "", batchID, sc->outputStride[4]); } else sprintf(shiftBatch, " + (%s / %" PRIu64 ") * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->dispatchZactualFFTSize * maxCoordinate, sc->outputStride[4]); } sc->tempLen = sprintf(sc->tempStr, "%s%s%s%s%s%s", outputOffset, shiftX, shiftY, shiftZ, shiftCoordinate, shiftBatch); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; break; } case 1: case 111: case 121: case 131: case 141: case 143: case 145: {//grouped_c2c char outputOffset[30] = ""; if (sc->outputOffset > 0) { sprintf(outputOffset, "%" PRIu64 " + ", sc->outputOffset / sc->outputNumberByteSize); } else { if (sc->performPostCompilationOutputOffset) { if (outputType < 1000) sprintf(outputOffset, "consts.outputOffset + "); else sprintf(outputOffset, "consts.kernelOffset + "); } } char shiftX[500] = ""; if (sc->numAxisUploads == 1) sprintf(shiftX, "(%s)", index_x); else sprintf(shiftX, "(%s) * %" PRIu64 "", index_x, sc->outputStride[0]); char shiftY[500] = ""; if (index_y) sprintf(shiftY, " + (%s) * %" PRIu64 "", index_y, sc->outputStride[1]); char shiftZ[500] = ""; if (sc->size[2] > 1) { if 
(sc->numCoordinates * sc->matrixConvolution * sc->numBatches > 1) { if (sc->performWorkGroupShift[2]) sprintf(shiftZ, " + ((%s + consts.workGroupShiftZ * %s) %% %" PRIu64 ") * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->gl_WorkGroupSize_z, sc->dispatchZactualFFTSize, sc->outputStride[2]); else sprintf(shiftZ, " + (%s %% %" PRIu64 ") * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->dispatchZactualFFTSize, sc->outputStride[2]); } else { if (sc->performWorkGroupShift[2]) sprintf(shiftZ, " + (%s + consts.workGroupShiftZ * %s) * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->gl_WorkGroupSize_z, sc->outputStride[2]); else sprintf(shiftZ, " + %s * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->outputStride[2]); } } char shiftCoordinate[500] = ""; uint64_t maxCoordinate = sc->numCoordinates * sc->matrixConvolution; if (sc->numCoordinates * sc->matrixConvolution > 1) { sprintf(shiftCoordinate, " + ((%s / %" PRIu64 ") %% %" PRIu64 ") * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->dispatchZactualFFTSize, maxCoordinate, sc->outputStride[3]); } if ((sc->matrixConvolution > 1) && (sc->convolutionStep)) { maxCoordinate = 1; sprintf(shiftCoordinate, " + %s * %" PRIu64 "", coordinate, sc->outputStride[3]); } char shiftBatch[500] = ""; if ((sc->numBatches > 1) || (sc->numKernels > 1)) { if (sc->convolutionStep && (sc->numKernels > 1)) { sprintf(shiftBatch, " + %s * %" PRIu64 "", batchID, sc->outputStride[4]); } else sprintf(shiftBatch, " + (%s / %" PRIu64 ") * %" PRIu64 "", sc->gl_GlobalInvocationID_z, sc->dispatchZactualFFTSize * maxCoordinate, sc->outputStride[4]); } sc->tempLen = sprintf(sc->tempStr, "%s%s%s%s%s%s", outputOffset, shiftX, shiftY, shiftZ, shiftCoordinate, shiftBatch); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; break; } } return res; } static inline VkFFTResult inlineRadixKernelVkFFT(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* uintType, uint64_t radix, uint64_t stageSize, double stageAngle, char** regID) { 
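	/* Reading aid (a sketch, not part of the upstream VkFFT text): this function
	   appends the body of one radix-N butterfly to the generated kernel source.
	   For radix 2, writing x0 = regID[0], x1 = regID[1] (names used here for
	   illustration only) and w for the twiddle factor, the emitted statements
	   amount to:
	       temp = x1 * w;      // complex multiply, emitted by VkMulComplex
	       x1   = x0 - temp;   // emitted by VkSubComplex
	       x0   = x0 + temp;   // emitted by VkAddComplex
	   w is either read from twiddleLUT (and conjugated when sc->inverse is 0)
	   or computed from `angle` with the backend's cos/sin functions. */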
	VkFFTResult res = VKFFT_SUCCESS;
	char vecType[30];
	char LFending[4] = "";
	if (!strcmp(floatType, "float")) sprintf(LFending, "f");
#if(VKFFT_BACKEND==0)
	if (!strcmp(floatType, "float")) sprintf(vecType, "vec2");
	if (!strcmp(floatType, "double")) sprintf(vecType, "dvec2");
	char cosDef[20] = "cos";
	char sinDef[20] = "sin";
	if (!strcmp(floatType, "double")) sprintf(LFending, "LF");
#elif(VKFFT_BACKEND==1)
	if (!strcmp(floatType, "float")) sprintf(vecType, "float2");
	if (!strcmp(floatType, "double")) sprintf(vecType, "double2");
	char cosDef[20] = "__cosf";
	char sinDef[20] = "__sinf";
	if (!strcmp(floatType, "double")) sprintf(LFending, "l");
#elif(VKFFT_BACKEND==2)
	if (!strcmp(floatType, "float")) sprintf(vecType, "float2");
	if (!strcmp(floatType, "double")) sprintf(vecType, "double2");
	char cosDef[20] = "__cosf";
	char sinDef[20] = "__sinf";
	if (!strcmp(floatType, "double")) sprintf(LFending, "l");
#elif(VKFFT_BACKEND==3)
	if (!strcmp(floatType, "float")) sprintf(vecType, "float2");
	if (!strcmp(floatType, "double")) sprintf(vecType, "double2");
	char cosDef[20] = "native_cos";
	char sinDef[20] = "native_sin";
	//if (!strcmp(floatType, "double")) sprintf(LFending, "l");
#endif
	char* temp = sc->temp;
	//sprintf(temp, "loc_0");
	char* w = sc->w;
	//sprintf(w, "w");
	char* iw = sc->iw;
	//sprintf(iw, "iw");
	char convolutionInverse[30] = "";
	if (sc->convolutionStep) sprintf(convolutionInverse, ", %s inverse", uintType);
	switch (radix) {
	case 2: {
		/*if (sc->LUT) {
			sc->tempLen = sprintf(sc->tempStr, "void radix2(inout %s temp_0, inout %s temp_1, %s LUTId) {\n", vecType, vecType, uintType);
		}
		else {
			sc->tempLen = sprintf(sc->tempStr, "void radix2(inout %s temp_0, inout %s temp_1, %s angle) {\n", vecType, vecType, floatType);
		}*/
		/*VkAppendLine(sc, " {\n");
		sc->tempLen = sprintf(sc->tempStr, " %s %s;\n", vecType, temp);
		res = VkAppendLine(sc);
		if (res != VKFFT_SUCCESS) return res;
		sc->tempLen = sprintf(sc->tempStr, " {\n\
 %s temp;\n", vecType);*/
		if (sc->LUT) {
			sc->tempLen =
sprintf(sc->tempStr, " %s = twiddleLUT[LUTId];\n", w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.y;\n", w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s(angle);\n", w, cosDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s(angle);\n", w, sinDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " %s = sincos_20(angle);\n", w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = VkMulComplex(sc, temp, regID[1], w, 0); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[1], regID[0], temp); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[0], regID[0], temp); if (res != VKFFT_SUCCESS) return res; /*VkAppendLine(sc, " }\n"); sc->tempLen = sprintf(sc->tempStr, "\ temp.x = temp%s.x * w.x - temp%s.y * w.y;\n\ temp.y = temp%s.y * w.x + temp%s.x * w.y;\n\ temp%s = temp%s - temp;\n\ temp%s = temp%s + temp;\n\ }\n", regID[1], regID[1], regID[1], regID[1], regID[1], regID[0], regID[0], regID[0]);*/ break; } case 3: { /* if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, "void radix3(inout %s temp_0, inout %s temp_1, inout %s temp_2, %s LUTId) {\n", vecType, vecType, vecType, uintType); } else { sc->tempLen = sprintf(sc->tempStr, "void radix3(inout %s temp_0, inout %s temp_1, inout %s temp_2, %s angle) {\n", vecType, vecType, vecType, floatType); }*/ char* tf[2]; //VkAppendLine(sc, " {\n"); for (uint64_t i = 0; i < 2; i++) { tf[i] = (char*)malloc(sizeof(char) * 50); if (!tf[i]) { for (uint64_t j = 0; j < i; j++) { free(tf[j]); tf[j] = 0; } return VKFFT_ERROR_MALLOC_FAILED; } } sprintf(tf[0], "-0.5%s", LFending); sprintf(tf[1], "-0.8660254037844386467637231707529%s", LFending); /*for 
		(uint64_t i = 0; i < 3; i++) {
			sc->locID[i] = (char*)malloc(sizeof(char) * 50);
			sprintf(sc->locID[i], "loc_%" PRIu64 "", i);
			sc->tempLen = sprintf(sc->tempStr, " %s %s;\n", vecType, sc->locID[i]);
			res = VkAppendLine(sc);
			if (res != VKFFT_SUCCESS) return res;
		}*/
		if (sc->LUT) {
			sc->tempLen = sprintf(sc->tempStr, " %s = twiddleLUT[LUTId];\n", w);
			res = VkAppendLine(sc);
			if (res != VKFFT_SUCCESS) return res;
			if (!sc->inverse) {
				sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.y;\n", w, w);
				res = VkAppendLine(sc);
				if (res != VKFFT_SUCCESS) return res;
			}
		}
		else {
			if (!strcmp(floatType, "float")) {
				sc->tempLen = sprintf(sc->tempStr, " %s.x = %s(angle*%.17e%s);\n", w, cosDef, 4.0 / 3.0, LFending);
				res = VkAppendLine(sc);
				if (res != VKFFT_SUCCESS) return res;
				sc->tempLen = sprintf(sc->tempStr, " %s.y = %s(angle*%.17e%s);\n", w, sinDef, 4.0 / 3.0, LFending);
				res = VkAppendLine(sc);
				if (res != VKFFT_SUCCESS) return res;
				//sc->tempLen = sprintf(sc->tempStr, " w = %s(cos(angle*%.17e), sin(angle*%.17e));\n\n", vecType, 4.0 / 3.0, 4.0 / 3.0);
			}
			if (!strcmp(floatType, "double")) {
				sc->tempLen = sprintf(sc->tempStr, " %s = sincos_20(angle*%.17e%s);\n", w, 4.0 / 3.0, LFending);
				res = VkAppendLine(sc);
				if (res != VKFFT_SUCCESS) return res;
			}
		}
		res = VkMulComplex(sc, sc->locID[2], regID[2], w, 0);
		// Check the result here as well; otherwise a failure is overwritten by the next call.
		if (res != VKFFT_SUCCESS) return res;
		/*sc->tempLen = sprintf(sc->tempStr, "\
loc_2.x = temp%s.x * w.x - temp%s.y * w.y;\n\
loc_2.y = temp%s.y * w.x + temp%s.x * w.y;\n", regID[2], regID[2], regID[2], regID[2]);*/
		if (sc->LUT) {
			sc->tempLen = sprintf(sc->tempStr, " %s = twiddleLUT[LUTId+%" PRIu64 "];\n", w, stageSize);
			res = VkAppendLine(sc);
			if (res != VKFFT_SUCCESS) return res;
			if (!sc->inverse) {
				sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.y;\n", w, w);
				res = VkAppendLine(sc);
				if (res != VKFFT_SUCCESS) return res;
			}
		}
		else {
			if (!strcmp(floatType, "float")) {
				sc->tempLen = sprintf(sc->tempStr, " %s.x = %s(angle*%.17e%s);\n", w, cosDef, 2.0 / 3.0, LFending);
				res = VkAppendLine(sc);
				if (res != VKFFT_SUCCESS) return res;
sc->tempLen = sprintf(sc->tempStr, " %s.y = %s(angle*%.17e%s);\n", w, sinDef, 2.0 / 3.0, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " w = %s(cos(angle*%.17e), sin(angle*%.17e));\n\n", vecType, 2.0 / 3.0, 2.0 / 3.0); } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " %s=sincos_20(angle*%.17e%s);\n", w, 2.0 / 3.0, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = VkMulComplex(sc, sc->locID[1], regID[1], w, 0); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ loc_1.x = temp%s.x * w.x - temp%s.y * w.y;\n\ loc_1.y = temp%s.y * w.x + temp%s.x * w.y;\n", regID[1], regID[1], regID[1], regID[1]);*/ res = VkAddComplex(sc, regID[1], sc->locID[1], sc->locID[2]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[2], sc->locID[1], sc->locID[2]); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp%s = loc_1 + loc_2;\n\ temp%s = loc_1 - loc_2;\n", regID[1], regID[2]);*/ res = VkAddComplex(sc, sc->locID[0], regID[0], regID[1]); if (res != VKFFT_SUCCESS) return res; res = VkFMAComplex(sc, sc->locID[1], regID[1], tf[0], regID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, sc->locID[2], regID[2], tf[1]); if (res != VKFFT_SUCCESS) return res; res = VkMovComplex(sc, regID[0], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ loc_0 = temp%s + temp%s;\n\ loc_1 = temp%s - 0.5 * temp%s;\n\ loc_2 = -0.8660254037844386467637231707529 * temp%s;\n\ temp%s = loc_0;\n", regID[0], regID[1], regID[0], regID[1], regID[2], regID[0]);*/ if (stageAngle < 0) { res = VkShuffleComplex(sc, regID[1], sc->locID[1], sc->locID[2], 0); if (res != VKFFT_SUCCESS) return res; res = VkShuffleComplexInv(sc, regID[2], sc->locID[1], sc->locID[2], 0); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp%s.x = loc_1.x - 
loc_2.y; \n\ temp%s.y = loc_1.y + loc_2.x; \n\ temp%s.x = loc_1.x + loc_2.y; \n\ temp%s.y = loc_1.y - loc_2.x; \n", regID[1], regID[1], regID[2], regID[2]);*/ } else { res = VkShuffleComplexInv(sc, regID[1], sc->locID[1], sc->locID[2], 0); if (res != VKFFT_SUCCESS) return res; res = VkShuffleComplex(sc, regID[2], sc->locID[1], sc->locID[2], 0); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp%s.x = loc_1.x + loc_2.y; \n\ temp%s.y = loc_1.y - loc_2.x; \n\ temp%s.x = loc_1.x - loc_2.y; \n\ temp%s.y = loc_1.y + loc_2.x; \n", regID[1], regID[1], regID[2], regID[2]);*/ } //VkAppendLine(sc, " }\n"); for (uint64_t i = 0; i < 2; i++) { free(tf[i]); tf[i] = 0; //free(sc->locID[i]); } //free(sc->locID[2]); break; } case 4: { /*if (sc->LUT) sc->tempLen = sprintf(sc->tempStr, "void radix4(inout %s temp_0, inout %s temp_1, inout %s temp_2, inout %s temp_3, %s LUTId%s) {\n", vecType, vecType, vecType, vecType, uintType, convolutionInverse); else sc->tempLen = sprintf(sc->tempStr, "void radix4(inout %s temp_0, inout %s temp_1, inout %s temp_2, inout %s temp_3, %s angle%s) {\n", vecType, vecType, vecType, vecType, floatType, convolutionInverse); */ //VkAppendLine(sc, " {\n"); //sc->tempLen = sprintf(sc->tempStr, " %s %s;\n", vecType, temp); //res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " %s = twiddleLUT[LUTId];\n", w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.y;\n", w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s(angle);\n", w, cosDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s(angle);\n", w, sinDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (!strcmp(floatType, "double")) { sc->tempLen = 
sprintf(sc->tempStr, " %s = sincos_20(angle);\n", w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = VkMulComplex(sc, temp, regID[2], w, 0); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[2], regID[0], temp); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[0], regID[0], temp); if (res != VKFFT_SUCCESS) return res; res = VkMulComplex(sc, temp, regID[3], w, 0); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[3], regID[1], temp); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[1], regID[1], temp); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp.x=temp%s.x*w.x-temp%s.y*w.y;\n\ temp.y = temp%s.y * w.x + temp%s.x * w.y;\n\ temp%s = temp%s - temp;\n\ temp%s = temp%s + temp;\n\n\ temp.x=temp%s.x*w.x-temp%s.y*w.y;\n\ temp.y = temp%s.y * w.x + temp%s.x * w.y;\n\ temp%s = temp%s - temp;\n\ temp%s = temp%s + temp;\n\n\ //DIF 2nd stage with angle\n", regID[2], regID[2], regID[2], regID[2], regID[2], regID[0], regID[0], regID[0], regID[3], regID[3], regID[3], regID[3], regID[3], regID[1], regID[1], regID[1]);*/ if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " %s=twiddleLUT[LUTId+%" PRIu64 "];\n", w, stageSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.y;\n", w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s(0.5%s*angle);\n", w, cosDef, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s(0.5%s*angle);\n", w, sinDef, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " %s=normalize(%s + %s(1.0, 0.0));\n", w, w, vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } 
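		/* Reading aid (a sketch, not part of the upstream VkFFT text): the second
		   radix-4 stage needs the half-angle twiddle. With w = (cos a, sin a):
		       w + (1, 0) = (1 + cos a, sin a)
		                  = 2 * cos(a/2) * (cos(a/2), sin(a/2))
		   so normalize(w + (1, 0)) yields (cos(a/2), sin(a/2)) whenever
		   cos(a/2) > 0, i.e. for |a| < pi. It is an assumption of this note, not
		   something stated in the source, that `angle` stays inside that range.
		   The float path evaluates cos/sin of 0.5*angle directly, and the LUT path
		   reads the precomputed value at twiddleLUT[LUTId + stageSize]. */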
res = VkMulComplex(sc, temp, regID[1], w, 0); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[1], regID[0], temp); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[0], regID[0], temp); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp.x = temp%s.x * w.x - temp%s.y * w.y;\n\ temp.y = temp%s.y * w.x + temp%s.x * w.y;\n\ temp%s = temp%s - temp;\n\ temp%s = temp%s + temp;\n\n", regID[1], regID[1], regID[1], regID[1], regID[1], regID[0], regID[0], regID[0]);*/ if (stageAngle < 0) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s.x;", temp, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = %s.y;\n", w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.x;\n", w, temp); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " w = %s(w.y, -w.x);\n\n", vecType); } else { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s.x;", temp, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = -%s.y;\n", w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s.x;\n", w, temp); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " w = %s(-w.y, w.x);\n\n", vecType); } res = VkMulComplex(sc, temp, regID[3], w, 0); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[3], regID[2], temp); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[2], regID[2], temp); if (res != VKFFT_SUCCESS) return res; res = VkMovComplex(sc, temp, regID[1]); if (res != VKFFT_SUCCESS) return res; res = VkMovComplex(sc, regID[1], regID[2]); if (res != VKFFT_SUCCESS) return res; res = VkMovComplex(sc, regID[2], temp); if (res != VKFFT_SUCCESS) return res; /*VkAppendLine(sc, " }\n"); sc->tempLen = 
sprintf(sc->tempStr, "\ temp.x = temp%s.x * w.x - temp%s.y * w.y;\n\ temp.y = temp%s.y * w.x + temp%s.x * w.y;\n\ temp%s = temp%s - temp;\n\ temp%s = temp%s + temp;\n\n\ temp = temp%s;\n\ temp%s = temp%s;\n\ temp%s = temp;\n\ }\n", regID[3], regID[3], regID[3], regID[3], regID[3], regID[2], regID[2], regID[2], regID[1], regID[1], regID[2], regID[2]);*/ break; } case 5: { /*if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, "void radix5(inout %s temp_0, inout %s temp_1, inout %s temp_2, inout %s temp_3, inout %s temp_4, %s LUTId) {\n", vecType, vecType, vecType, vecType, vecType, uintType); } else { sc->tempLen = sprintf(sc->tempStr, "void radix5(inout %s temp_0, inout %s temp_1, inout %s temp_2, inout %s temp_3, inout %s temp_4, %s angle) {\n", vecType, vecType, vecType, vecType, vecType, floatType); }*/ char* tf[5]; //VkAppendLine(sc, " {\n"); for (uint64_t i = 0; i < 5; i++) { tf[i] = (char*)malloc(sizeof(char) * 50); if (!tf[i]) { for (uint64_t j = 0; j < i; j++) { free(tf[j]); tf[j] = 0; } return VKFFT_ERROR_MALLOC_FAILED; } } sprintf(tf[0], "-0.5%s", LFending); sprintf(tf[1], "1.538841768587626701285145288018455%s", LFending); sprintf(tf[2], "-0.363271264002680442947733378740309%s", LFending); sprintf(tf[3], "-0.809016994374947424102293417182819%s", LFending); sprintf(tf[4], "-0.587785252292473129168705954639073%s", LFending); /*for (uint64_t i = 0; i < 5; i++) { sc->locID[i] = (char*)malloc(sizeof(char) * 50); sprintf(sc->locID[i], "loc_%" PRIu64 "", i); sc->tempLen = sprintf(sc->tempStr, " %s %s;\n", vecType, sc->locID[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; }*/ /*sc->tempLen = sprintf(sc->tempStr, " {\n\ %s loc_0;\n %s loc_1;\n %s loc_2;\n %s loc_3;\n %s loc_4;\n", vecType, vecType, vecType, vecType, vecType);*/ for (uint64_t i = radix - 1; i > 0; i--) { if (i == radix - 1) { if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " %s = twiddleLUT[LUTId];\n", w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if 
(!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.y;\n", w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s(angle*%.17e%s);\n", w, cosDef, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s(angle*%.17e%s);\n", w, sinDef, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " w = %s(cos(angle*%.17e), sin(angle*%.17e));\n\n", vecType, 2.0 * i / radix, 2.0 * i / radix); } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " %s = sincos_20(angle*%.17e%s);\n", w, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } else { if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " %s = twiddleLUT[LUTId+%" PRIu64 "];\n", w, (radix - 1 - i) * stageSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.y;\n", w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s(angle*%.17e%s);\n", w, cosDef, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s(angle*%.17e%s);\n", w, sinDef, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " w = %s(cos(angle*%.17e), sin(angle*%.17e));\n\n", vecType, 2.0 * i / radix, 2.0 * i / radix); } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " %s = sincos_20(angle*%.17e%s);\n", w, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } res = VkMulComplex(sc, sc->locID[i], regID[i], w, 0); if (res != 
VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ loc_%" PRIu64 ".x = temp%s.x * w.x - temp%s.y * w.y;\n\ loc_%" PRIu64 ".y = temp%s.y * w.x + temp%s.x * w.y;\n", i, regID[i], regID[i], i, regID[i], regID[i]);*/ } res = VkAddComplex(sc, regID[1], sc->locID[1], sc->locID[4]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[2], sc->locID[2], sc->locID[3]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[3], sc->locID[2], sc->locID[3]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[4], sc->locID[1], sc->locID[4]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, sc->locID[3], regID[1], regID[2]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[4], regID[3], regID[4]); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp%s = loc_1 + loc_4;\n\ temp%s = loc_2 + loc_3;\n\ temp%s = loc_2 - loc_3;\n\ temp%s = loc_1 - loc_4;\n\ loc_3 = temp%s - temp%s;\n\ loc_4 = temp%s + temp%s;\n", regID[1], regID[2], regID[3], regID[4], regID[1], regID[2], regID[3], regID[4]);*/ res = VkAddComplex(sc, sc->locID[0], regID[0], regID[1]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[0], sc->locID[0], regID[2]); if (res != VKFFT_SUCCESS) return res; res = VkFMAComplex(sc, sc->locID[1], regID[1], tf[0], regID[0]); if (res != VKFFT_SUCCESS) return res; res = VkFMAComplex(sc, sc->locID[2], regID[2], tf[0], regID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, regID[3], regID[3], tf[1]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, regID[4], regID[4], tf[2]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, sc->locID[3], sc->locID[3], tf[3]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, sc->locID[4], sc->locID[4], tf[4]); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ loc_0 = temp%s + temp%s + temp%s;\n\ loc_1 = 
temp%s - 0.5 * temp%s;\n\ loc_2 = temp%s - 0.5 * temp%s;\n\ temp%s *= 1.538841768587626701285145288018455;\n\ temp%s *= -0.363271264002680442947733378740309;\n\ loc_3 *= -0.809016994374947424102293417182819;\n\ loc_4 *= -0.587785252292473129168705954639073;\n", regID[0], regID[1], regID[2], regID[0], regID[1], regID[0], regID[2], regID[3], regID[4]);*/ res = VkSubComplex(sc, sc->locID[1], sc->locID[1], sc->locID[3]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[2], sc->locID[2], sc->locID[3]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[3], regID[3], sc->locID[4]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[4], sc->locID[4], regID[4]); if (res != VKFFT_SUCCESS) return res; res = VkMovComplex(sc, regID[0], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ loc_1 -= loc_3;\n\ loc_2 += loc_3;\n\ loc_3 = temp%s+loc_4;\n\ loc_4 += temp%s;\n\ temp%s = loc_0;\n", regID[3], regID[4], regID[0]);*/ if (stageAngle < 0) { res = VkShuffleComplex(sc, regID[1], sc->locID[1], sc->locID[4], 0); if (res != VKFFT_SUCCESS) return res; res = VkShuffleComplex(sc, regID[2], sc->locID[2], sc->locID[3], 0); if (res != VKFFT_SUCCESS) return res; res = VkShuffleComplexInv(sc, regID[3], sc->locID[2], sc->locID[3], 0); if (res != VKFFT_SUCCESS) return res; res = VkShuffleComplexInv(sc, regID[4], sc->locID[1], sc->locID[4], 0); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp%s.x = loc_1.x - loc_4.y; \n\ temp%s.y = loc_1.y + loc_4.x; \n\ temp%s.x = loc_2.x - loc_3.y; \n\ temp%s.y = loc_2.y + loc_3.x; \n\ temp%s.x = loc_2.x + loc_3.y; \n\ temp%s.y = loc_2.y - loc_3.x; \n\ temp%s.x = loc_1.x + loc_4.y; \n\ temp%s.y = loc_1.y - loc_4.x; \n", regID[1], regID[1], regID[2], regID[2], regID[3], regID[3], regID[4], regID[4]);*/ } else { res = VkShuffleComplexInv(sc, regID[1], sc->locID[1], sc->locID[4], 0); if (res != VKFFT_SUCCESS) return res; res 
= VkShuffleComplexInv(sc, regID[2], sc->locID[2], sc->locID[3], 0); if (res != VKFFT_SUCCESS) return res; res = VkShuffleComplex(sc, regID[3], sc->locID[2], sc->locID[3], 0); if (res != VKFFT_SUCCESS) return res; res = VkShuffleComplex(sc, regID[4], sc->locID[1], sc->locID[4], 0); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp%s.x = loc_1.x + loc_4.y; \n\ temp%s.y = loc_1.y - loc_4.x; \n\ temp%s.x = loc_2.x + loc_3.y; \n\ temp%s.y = loc_2.y - loc_3.x; \n\ temp%s.x = loc_2.x - loc_3.y; \n\ temp%s.y = loc_2.y + loc_3.x; \n\ temp%s.x = loc_1.x - loc_4.y; \n\ temp%s.y = loc_1.y + loc_4.x; \n", regID[1], regID[1], regID[2], regID[2], regID[3], regID[3], regID[4], regID[4]);*/ } //VkAppendLine(sc, " }\n"); for (uint64_t i = 0; i < 5; i++) { free(tf[i]); tf[i] = 0; //free(sc->locID[i]); } break; } case 7: { /*if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, "void radix5(inout %s temp_0, inout %s temp_1, inout %s temp_2, inout %s temp_3, inout %s temp_4, %s LUTId) {\n", vecType, vecType, vecType, vecType, vecType, uintType); } else { sc->tempLen = sprintf(sc->tempStr, "void radix5(inout %s temp_0, inout %s temp_1, inout %s temp_2, inout %s temp_3, inout %s temp_4, %s angle) {\n", vecType, vecType, vecType, vecType, vecType, floatType); }*/ char* tf[8]; //VkAppendLine(sc, " {\n"); for (uint64_t i = 0; i < 8; i++) { tf[i] = (char*)malloc(sizeof(char) * 50); if (!tf[i]) { for (uint64_t j = 0; j < i; j++) { free(tf[j]); tf[j] = 0; } return VKFFT_ERROR_MALLOC_FAILED; } } sprintf(tf[0], "-1.16666666666666651863693004997913%s", LFending); sprintf(tf[1], "0.79015646852540022404554065360571%s", LFending); sprintf(tf[2], "0.05585426728964774240049351305970%s", LFending); sprintf(tf[3], "0.73430220123575240531721419756650%s", LFending); if (stageAngle < 0) { sprintf(tf[4], "0.44095855184409837868031445395900%s", LFending); sprintf(tf[5], "0.34087293062393136944265847887436%s", LFending); sprintf(tf[6], "-0.53396936033772524066165487965918%s", 
LFending); sprintf(tf[7], "0.87484229096165666561546458979137%s", LFending); } else { sprintf(tf[4], "-0.44095855184409837868031445395900%s", LFending); sprintf(tf[5], "-0.34087293062393136944265847887436%s", LFending); sprintf(tf[6], "0.53396936033772524066165487965918%s", LFending); sprintf(tf[7], "-0.87484229096165666561546458979137%s", LFending); } /*for (uint64_t i = 0; i < 7; i++) { sc->locID[i] = (char*)malloc(sizeof(char) * 50); sprintf(sc->locID[i], "loc_%" PRIu64 "", i); sc->tempLen = sprintf(sc->tempStr, " %s %s;\n", vecType, sc->locID[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; }*/ for (uint64_t i = radix - 1; i > 0; i--) { if (i == radix - 1) { if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " %s = twiddleLUT[LUTId];\n", w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.y;\n", w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s(angle*%.17e%s);\n", w, cosDef, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s(angle*%.17e%s);\n", w, sinDef, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " w = %s(cos(angle*%.17e), sin(angle*%.17e));\n\n", vecType, 2.0 * i / radix, 2.0 * i / radix); } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " %s = sincos_20(angle*%.17e%s);\n", w, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } else { if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " %s = twiddleLUT[LUTId+%" PRIu64 "];\n\n", w, (radix - 1 - i) * stageSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.y;\n", w, w); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s(angle*%.17e%s);\n", w, cosDef, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s(angle*%.17e%s);\n", w, sinDef, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " w = %s(cos(angle*%.17e), sin(angle*%.17e));\n\n", vecType, 2.0 * i / radix, 2.0 * i / radix); } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " %s = sincos_20(angle*%.17e%s);\n", w, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } res = VkMulComplex(sc, sc->locID[i], regID[i], w, 0); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ loc_%" PRIu64 ".x = temp%s.x * w.x - temp%s.y * w.y;\n\ loc_%" PRIu64 ".y = temp%s.y * w.x + temp%s.x * w.y;\n", i, regID[i], regID[i], i, regID[i], regID[i]);*/ } res = VkMovComplex(sc, sc->locID[0], regID[0]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[0], sc->locID[1], sc->locID[6]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[1], sc->locID[1], sc->locID[6]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[2], sc->locID[2], sc->locID[5]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[3], sc->locID[2], sc->locID[5]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[4], sc->locID[4], sc->locID[3]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[5], sc->locID[4], sc->locID[3]); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ loc_0 = temp%s;\n\ temp%s = loc_1 + loc_6;\n\ temp%s = loc_1 - loc_6;\n\ temp%s = loc_2 + loc_5;\n\ temp%s = loc_2 - loc_5;\n\ temp%s = loc_4 + loc_3;\n\ temp%s = loc_4 - loc_3;\n", regID[0], 
regID[0], regID[1], regID[2], regID[3], regID[4], regID[5]);*/ res = VkAddComplex(sc, sc->locID[5], regID[1], regID[3]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[5], sc->locID[5], regID[5]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[1], regID[0], regID[2]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[1], sc->locID[1], regID[4]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[0], sc->locID[0], sc->locID[1]); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ loc_5 = temp%s + temp%s + temp%s;\n\ loc_1 = temp%s + temp%s + temp%s;\n\ loc_0 += loc_1;\n", regID[1], regID[3], regID[5], regID[0], regID[2], regID[4]);*/ res = VkSubComplex(sc, sc->locID[2], regID[0], regID[4]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, sc->locID[3], regID[4], regID[2]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, sc->locID[4], regID[2], regID[0]); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ loc_2 = temp%s - temp%s;\n\ loc_3 = temp%s - temp%s;\n\ loc_4 = temp%s - temp%s;\n", regID[0], regID[4], regID[4], regID[2], regID[2], regID[0]);*/ res = VkSubComplex(sc, regID[0], regID[1], regID[5]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[2], regID[5], regID[3]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[4], regID[3], regID[1]); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp%s = temp%s - temp%s;\n\ temp%s = temp%s - temp%s;\n\ temp%s = temp%s - temp%s;\n", regID[0], regID[1], regID[5], regID[2], regID[5], regID[3], regID[4], regID[3], regID[1]);*/ res = VkMulComplexNumber(sc, sc->locID[1], sc->locID[1], tf[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, sc->locID[2], sc->locID[2], tf[1]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, sc->locID[3], sc->locID[3], tf[2]); 
if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, sc->locID[4], sc->locID[4], tf[3]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, sc->locID[5], sc->locID[5], tf[4]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, regID[0], regID[0], tf[5]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, regID[2], regID[2], tf[6]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, regID[4], regID[4], tf[7]); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ loc_1 *= -1.16666666666666651863693004997913;\n\ loc_2 *= 0.79015646852540022404554065360571;\n\ loc_3 *= 0.05585426728964774240049351305970;\n\ loc_4 *= 0.73430220123575240531721419756650;\n\ loc_5 *= 0.44095855184409837868031445395900;\n\ temp%s *= 0.34087293062393136944265847887436;\n\ temp%s *= -0.53396936033772524066165487965918;\n\ temp%s *= 0.87484229096165666561546458979137;\n", regID[0], regID[2], regID[4]);*/ res = VkSubComplex(sc, regID[5], regID[4], regID[2]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplexInv(sc, regID[6], regID[4], regID[0]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[4], regID[0], regID[2]); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp%s = temp%s - temp%s;\n\ temp%s = - temp%s - temp%s;\n\ temp%s = temp%s + temp%s;\n", regID[5], regID[4], regID[2], regID[6], regID[4], regID[0], regID[4], regID[0], regID[2]);*/ res = VkAddComplex(sc, regID[0], sc->locID[0], sc->locID[1]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[1], sc->locID[2], sc->locID[3]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[2], sc->locID[4], sc->locID[3]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplexInv(sc, regID[3], sc->locID[2], sc->locID[4]); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp%s = loc_0 + loc_1;\n\ temp%s = loc_2 + loc_3;\n\ temp%s = 
loc_4 - loc_3;\n\ temp%s = - loc_2 - loc_4;\n", regID[0], regID[1], regID[2], regID[3]);*/ res = VkAddComplex(sc, sc->locID[1], regID[0], regID[1]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[2], regID[0], regID[2]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[3], regID[0], regID[3]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[4], regID[4], sc->locID[5]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[6], regID[6], sc->locID[5]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[5], sc->locID[5], regID[5]); if (res != VKFFT_SUCCESS) return res; res = VkMovComplex(sc, regID[0], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ loc_1 = temp%s + temp%s;\n\ loc_2 = temp%s + temp%s;\n\ loc_3 = temp%s + temp%s;\n\ loc_4 = temp%s + loc_5;\n\ loc_6 = temp%s + loc_5;\n\ loc_5 += temp%s;\n\ temp%s = loc_0;\n", regID[0], regID[1], regID[0], regID[2], regID[0], regID[3], regID[4], regID[6], regID[5], regID[0]);*/ res = VkShuffleComplexInv(sc, regID[1], sc->locID[1], sc->locID[4], 0); if (res != VKFFT_SUCCESS) return res; res = VkShuffleComplexInv(sc, regID[2], sc->locID[3], sc->locID[6], 0); if (res != VKFFT_SUCCESS) return res; res = VkShuffleComplex(sc, regID[3], sc->locID[2], sc->locID[5], 0); if (res != VKFFT_SUCCESS) return res; res = VkShuffleComplexInv(sc, regID[4], sc->locID[2], sc->locID[5], 0); if (res != VKFFT_SUCCESS) return res; res = VkShuffleComplex(sc, regID[5], sc->locID[3], sc->locID[6], 0); if (res != VKFFT_SUCCESS) return res; res = VkShuffleComplex(sc, regID[6], sc->locID[1], sc->locID[4], 0); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp%s.x = loc_1.x + loc_4.y; \n\ temp%s.y = loc_1.y - loc_4.x; \n\ temp%s.x = loc_3.x + loc_6.y; \n\ temp%s.y = loc_3.y - loc_6.x; \n\ temp%s.x = loc_2.x - loc_5.y; \n\ temp%s.y = loc_2.y + loc_5.x; \n\ temp%s.x = loc_2.x 
+ loc_5.y; \n\ temp%s.y = loc_2.y - loc_5.x; \n\ temp%s.x = loc_3.x - loc_6.y; \n\ temp%s.y = loc_3.y + loc_6.x; \n\ temp%s.x = loc_1.x - loc_4.y; \n\ temp%s.y = loc_1.y + loc_4.x; \n", regID[1], regID[1], regID[2], regID[2], regID[3], regID[3], regID[4], regID[4], regID[5], regID[5], regID[6], regID[6]); VkAppendLine(sc, " }\n");*/ /*for (uint64_t i = 0; i < 7; i++) { free(sc->locID[i]); }*/ for (uint64_t i = 0; i < 8; i++) { free(tf[i]); tf[i] = 0; } break; } case 8: { /*if (sc->LUT) sc->tempLen = sprintf(sc->tempStr, "void radix8(inout %s temp_0, inout %s temp_1, inout %s temp_2, inout %s temp_3, inout %s temp_4, inout %s temp_5, inout %s temp_6, inout %s temp_7, %s LUTId%s) {\n", vecType, vecType, vecType, vecType, vecType, vecType, vecType, vecType, uintType, convolutionInverse); else sc->tempLen = sprintf(sc->tempStr, "void radix8(inout %s temp_0, inout %s temp_1, inout %s temp_2, inout %s temp_3, inout %s temp_4, inout %s temp_5, inout %s temp_6, inout %s temp_7, %s angle%s) {\n", vecType, vecType, vecType, vecType, vecType, vecType, vecType, vecType, floatType, convolutionInverse); */ //VkAppendLine(sc, " {\n"); /*sc->tempLen = sprintf(sc->tempStr, " %s %s;\n", vecType, temp); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s %s;\n", vecType, iw); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res;*/ if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " %s = twiddleLUT[LUTId];\n", w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.y;\n", w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s(angle);\n", w, cosDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s(angle);\n", w, sinDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if 
(!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " %s = sincos_20(angle);\n", w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } for (uint64_t i = 0; i < 4; i++) { res = VkMulComplex(sc, temp, regID[i + 4], w, 0); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[i + 4], regID[i], temp); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[i], regID[i], temp); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp.x=temp%s.x*w.x-temp%s.y*w.y;\n\ temp.y = temp%s.y * w.x + temp%s.x * w.y;\n\ temp%s = temp%s - temp;\n\ temp%s = temp%s + temp;\n\n", regID[i + 4], regID[i + 4], regID[i + 4], regID[i + 4], regID[i + 4], regID[i + 0], regID[i + 0], regID[i + 0]);*/ } if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " %s=twiddleLUT[LUTId+%" PRIu64 "];\n\n", w, stageSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.y;\n", w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s(0.5%s*angle);\n", w, cosDef, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s(0.5%s*angle);\n", w, sinDef, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " %s=normalize(%s + %s(1.0, 0.0));\n", w, w, vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } for (uint64_t i = 0; i < 2; i++) { res = VkMulComplex(sc, temp, regID[i + 2], w, 0); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[i + 2], regID[i], temp); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[i], regID[i], temp); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp.x=temp%s.x*w.x-temp%s.y*w.y;\n\ 
temp.y = temp%s.y * w.x + temp%s.x * w.y;\n\ temp%s = temp%s - temp;\n\ temp%s = temp%s + temp;\n\n", regID[i + 2], regID[i + 2], regID[i + 2], regID[i + 2], regID[i + 2], regID[i + 0], regID[i + 0], regID[i + 0]);*/ } if (stageAngle < 0) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s.y;\n", iw, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.x;\n", iw, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " w = %s(w.y, -w.x);\n\n", vecType); } else { sc->tempLen = sprintf(sc->tempStr, " %s.x = -%s.y;\n", iw, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s.x;\n", iw, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " iw = %s(-w.y, w.x);\n\n", vecType); } for (uint64_t i = 4; i < 6; i++) { res = VkMulComplex(sc, temp, regID[i + 2], iw, 0); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[i + 2], regID[i], temp); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[i], regID[i], temp); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp.x = temp%s.x * iw.x - temp%s.y * iw.y;\n\ temp.y = temp%s.y * iw.x + temp%s.x * iw.y;\n\ temp%s = temp%s - temp;\n\ temp%s = temp%s + temp;\n\n", regID[i + 2], regID[i + 2], regID[i + 2], regID[i + 2], regID[i + 2], regID[i + 0], regID[i + 0], regID[i + 0]);*/ } if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " %s=twiddleLUT[LUTId+%" PRIu64 "];\n\n", w, 2 * stageSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.y;\n", w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s(0.25%s*angle);\n", w, cosDef, LFending); res = VkAppendLine(sc); if (res != 
VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s(0.25%s*angle);\n", w, sinDef, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " w = %s(cos(0.25*angle), sin(0.25*angle));\n\n", vecType); } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " %s=normalize(%s + %s(1.0, 0.0));\n", w, w, vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = VkMulComplex(sc, temp, regID[1], w, 0); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[1], regID[0], temp); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[0], regID[0], temp); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp.x=temp%s.x*w.x-temp%s.y*w.y;\n\ temp.y = temp%s.y * w.x + temp%s.x * w.y;\n\ temp%s = temp%s - temp;\n\ temp%s = temp%s + temp;\n\n", regID[1], regID[1], regID[1], regID[1], regID[1], regID[0], regID[0], regID[0]);*/ if (stageAngle < 0) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s.y;\n", iw, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.x;\n", iw, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " w = %s(w.y, -w.x);\n\n", vecType); } else { sc->tempLen = sprintf(sc->tempStr, " %s.x = -%s.y;\n", iw, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s.x;\n", iw, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " iw = %s(-w.y, w.x);\n\n", vecType); } res = VkMulComplex(sc, temp, regID[3], iw, 0); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[3], regID[2], temp); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[2], regID[2], temp); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp.x = temp%s.x * iw.x - 
temp%s.y * iw.y;\n\ temp.y = temp%s.y * iw.x + temp%s.x * iw.y;\n\ temp%s = temp%s - temp;\n\ temp%s = temp%s + temp;\n\n", regID[3], regID[3], regID[3], regID[3], regID[3], regID[2], regID[2], regID[2]);*/ if (stageAngle < 0) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s.x * loc_SQRT1_2 + %s.y * loc_SQRT1_2;\n", iw, w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s.y * loc_SQRT1_2 - %s.x * loc_SQRT1_2;\n\n", iw, w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s.x * loc_SQRT1_2 - %s.y * loc_SQRT1_2;\n", iw, w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s.y * loc_SQRT1_2 + %s.x * loc_SQRT1_2;\n\n", iw, w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = VkMulComplex(sc, temp, regID[5], iw, 0); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[5], regID[4], temp); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[4], regID[4], temp); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp.x = temp%s.x * iw.x - temp%s.y * iw.y;\n\ temp.y = temp%s.y * iw.x + temp%s.x * iw.y;\n\ temp%s = temp%s - temp;\n\ temp%s = temp%s + temp;\n\n", regID[5], regID[5], regID[5], regID[5], regID[5], regID[4], regID[4], regID[4]);*/ if (stageAngle < 0) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s.y;\n", w, iw); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.x;\n", w, iw); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " w = %s(iw.y, -iw.x);\n\n", vecType); } else { sc->tempLen = sprintf(sc->tempStr, " %s.x = -%s.y;\n", w, iw); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s.x;\n", w, iw); res = VkAppendLine(sc); if 
(res != VKFFT_SUCCESS) return res;
			//sc->tempLen = sprintf(sc->tempStr, " w = %s(-iw.y, iw.x);\n\n", vecType);
		}
		res = VkMulComplex(sc, temp, regID[7], w, 0);
		if (res != VKFFT_SUCCESS) return res;
		res = VkSubComplex(sc, regID[7], regID[6], temp);
		if (res != VKFFT_SUCCESS) return res;
		res = VkAddComplex(sc, regID[6], regID[6], temp);
		if (res != VKFFT_SUCCESS) return res;
		res = VkMovComplex(sc, temp, regID[1]);
		if (res != VKFFT_SUCCESS) return res;
		res = VkMovComplex(sc, regID[1], regID[4]);
		if (res != VKFFT_SUCCESS) return res;
		res = VkMovComplex(sc, regID[4], temp);
		if (res != VKFFT_SUCCESS) return res;
		res = VkMovComplex(sc, temp, regID[3]);
		if (res != VKFFT_SUCCESS) return res;
		res = VkMovComplex(sc, regID[3], regID[6]);
		if (res != VKFFT_SUCCESS) return res;
		res = VkMovComplex(sc, regID[6], temp);
		if (res != VKFFT_SUCCESS) return res;
		/*sc->tempLen = sprintf(sc->tempStr, "\
temp.x = temp%s.x * w.x - temp%s.y * w.y;\n\
temp.y = temp%s.y * w.x + temp%s.x * w.y;\n\
temp%s = temp%s - temp;\n\
temp%s = temp%s + temp;\n\n\
temp = temp%s;\n\
temp%s = temp%s;\n\
temp%s = temp;\n\n\
temp = temp%s;\n\
temp%s = temp%s;\n\
temp%s = temp;\n\
}\n\n", regID[7], regID[7], regID[7], regID[7], regID[7], regID[6], regID[6], regID[6], regID[1], regID[1], regID[4], regID[4], regID[3], regID[3], regID[6], regID[6]);
		//VkAppendLine(sc, " }\n");*/
		break;
	}
	case 11: {
		char* tf[20];
		//char* tf2[4];
		//char* tf2inv[4];
		//VkAppendLine(sc, " {\n");
		for (uint64_t i = 0; i < 20; i++) {
			tf[i] = (char*)malloc(sizeof(char) * 50);
			if (!tf[i]) {
				for (uint64_t j = 0; j < i; j++) {
					free(tf[j]);
					tf[j] = 0;
				}
				return VKFFT_ERROR_MALLOC_FAILED;
			}
			//tf2[i] = (char*)malloc(sizeof(char) * 50);
			//tf2inv[i] = (char*)malloc(sizeof(char) * 50);
		}
		sprintf(tf[0], "-1.10000000000000000e+00%s", LFending);
		sprintf(tf[2], "2.53097611605958783e-01%s", LFending);
		sprintf(tf[3], "-1.28820061077367898e+00%s", LFending);
		sprintf(tf[4], "3.04632239669212490e-01%s", LFending);
		sprintf(tf[5], "-3.91339615511917427e-01%s", LFending);
		sprintf(tf[6], "-2.87102225339285022e+00%s", LFending);
		sprintf(tf[7], "1.37490798661638380e+00%s", LFending);
		sprintf(tf[8], "8.17178135341212419e-01%s", LFending);
		sprintf(tf[9], "1.80074650644567891e+00%s", LFending);
		sprintf(tf[10], "-8.59492973614497502e-01%s", LFending);
		if (stageAngle < 0) {
			sprintf(tf[1], "3.31662479035539914e-01%s", LFending);
			sprintf(tf[11], "-2.37347045474827967e+00%s", LFending);
			sprintf(tf[12], "-2.48363930874935801e-02%s", LFending);
			sprintf(tf[13], "4.74017017512828764e-01%s", LFending);
			sprintf(tf[14], "7.42183927770612595e-01%s", LFending);
			sprintf(tf[15], "1.40647330909460866e+00%s", LFending);
			sprintf(tf[16], "-1.19136455219594772e+00%s", LFending);
			sprintf(tf[17], "7.08088885039503180e-01%s", LFending);
			sprintf(tf[18], "2.58908260614167995e-01%s", LFending);
			sprintf(tf[19], "-4.99299221941104307e-02%s", LFending);
		}
		else {
			sprintf(tf[1], "-3.31662479035539914e-01%s", LFending);
			sprintf(tf[11], "2.37347045474827967e+00%s", LFending);
			sprintf(tf[12], "2.48363930874935801e-02%s", LFending);
			sprintf(tf[13], "-4.74017017512828764e-01%s", LFending);
			sprintf(tf[14], "-7.42183927770612595e-01%s", LFending);
			sprintf(tf[15], "-1.40647330909460866e+00%s", LFending);
			sprintf(tf[16], "1.19136455219594772e+00%s", LFending);
			sprintf(tf[17], "-7.08088885039503180e-01%s", LFending);
			sprintf(tf[18], "-2.58908260614167995e-01%s", LFending);
			sprintf(tf[19], "4.99299221941104307e-02%s", LFending);
		}
		for (uint64_t i = radix - 1; i > 0; i--) {
			if (i == radix - 1) {
				if (sc->LUT) {
					sc->tempLen = sprintf(sc->tempStr, " %s = twiddleLUT[LUTId];\n", w);
					res = VkAppendLine(sc);
					if (res != VKFFT_SUCCESS) return res;
					if (!sc->inverse) {
						sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.y;\n", w, w);
						res = VkAppendLine(sc);
						if (res != VKFFT_SUCCESS) return res;
					}
				}
				else {
					if (!strcmp(floatType, "float")) {
						sc->tempLen = sprintf(sc->tempStr, " %s.x = %s(angle*%.17e%s);\n", w, cosDef, 2.0 * i / radix, LFending);
						res = VkAppendLine(sc);
						if (res !=
VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s(angle*%.17e%s);\n", w, sinDef, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " w = %s(cos(angle*%.17e), sin(angle*%.17e));\n\n", vecType, 2.0 * i / radix, 2.0 * i / radix); } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " %s = sincos_20(angle*%.17e%s);\n", w, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } else { if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " %s = twiddleLUT[LUTId+%" PRIu64 "];\n\n", w, (radix - 1 - i) * stageSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.y;\n", w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s(angle*%.17e%s);\n", w, cosDef, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s(angle*%.17e%s);\n", w, sinDef, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " w = %s(cos(angle*%.17e), sin(angle*%.17e));\n\n", vecType, 2.0 * i / radix, 2.0 * i / radix); } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " %s = sincos_20(angle*%.17e%s);\n", w, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } res = VkMulComplex(sc, sc->locID[i], regID[i], w, 0); if (res != VKFFT_SUCCESS) return res; } res = VkMovComplex(sc, sc->locID[0], regID[0]); if (res != VKFFT_SUCCESS) return res; uint64_t permute[11] = { 0,1,9,4,3,5,10,2,7,8,6 }; res = VkPermute(sc, permute, 11, 0, 0); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < 5; i++) { res = VkAddComplex(sc, regID[i + 1], sc->locID[i + 1], 
sc->locID[i + 6]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[i + 6], sc->locID[i + 1], sc->locID[i + 6]); if (res != VKFFT_SUCCESS) return res; } res = VkMovComplex(sc, sc->locID[1], regID[1]); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < 4; i++) { res = VkAddComplex(sc, sc->locID[1], sc->locID[1], regID[i + 2]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, sc->locID[i + 3], regID[i + 1], regID[5]); if (res != VKFFT_SUCCESS) return res; } res = VkMovComplex(sc, sc->locID[2], regID[6]); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < 4; i++) { res = VkAddComplex(sc, sc->locID[2], sc->locID[2], regID[i + 7]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, sc->locID[i + 7], regID[i + 6], regID[10]); if (res != VKFFT_SUCCESS) return res; } res = VkAddComplex(sc, regID[0], sc->locID[0], sc->locID[1]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, regID[1], sc->locID[1], tf[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, regID[2], sc->locID[2], tf[1], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; for (uint64_t k = 0; k < 2; k++) { res = VkAddComplex(sc, regID[k * 4 + 3], sc->locID[k * 4 + 3], sc->locID[k * 4 + 5]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[k * 4 + 4], sc->locID[k * 4 + 4], sc->locID[k * 4 + 6]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[k * 4 + 5], sc->locID[k * 4 + 3], sc->locID[k * 4 + 4]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[k * 4 + 6], sc->locID[k * 4 + 5], sc->locID[k * 4 + 6]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[1], regID[k * 4 + 3], regID[k * 4 + 4]); if (res != VKFFT_SUCCESS) return res; if (k == 0) { res = VkMulComplexNumber(sc, sc->locID[k * 4 + 3], sc->locID[k * 4 + 3], tf[k * 9 + 2]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, sc->locID[k * 4 + 4], sc->locID[k * 
4 + 4], tf[k * 9 + 3]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, regID[k * 4 + 5], regID[k * 4 + 5], tf[k * 9 + 4]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, sc->locID[k * 4 + 5], sc->locID[k * 4 + 5], tf[k * 9 + 5]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, sc->locID[k * 4 + 6], sc->locID[k * 4 + 6], tf[k * 9 + 6]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, regID[k * 4 + 6], regID[k * 4 + 6], tf[k * 9 + 7]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, regID[k * 4 + 3], regID[k * 4 + 3], tf[k * 9 + 8]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, regID[k * 4 + 4], regID[k * 4 + 4], tf[k * 9 + 9]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, sc->locID[1], sc->locID[1], tf[k * 9 + 10]); if (res != VKFFT_SUCCESS) return res; } else { res = VkMulComplexNumberImag(sc, sc->locID[k * 4 + 3], sc->locID[k * 4 + 3], tf[k * 9 + 2], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, sc->locID[k * 4 + 4], sc->locID[k * 4 + 4], tf[k * 9 + 3], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, regID[k * 4 + 5], regID[k * 4 + 5], tf[k * 9 + 4], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, sc->locID[k * 4 + 5], sc->locID[k * 4 + 5], tf[k * 9 + 5], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, sc->locID[k * 4 + 6], sc->locID[k * 4 + 6], tf[k * 9 + 6], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, regID[k * 4 + 6], regID[k * 4 + 6], tf[k * 9 + 7], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, regID[k * 4 + 3], regID[k * 4 + 3], tf[k * 9 + 8], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, regID[k * 4 + 4], regID[k * 4 + 4], tf[k * 9 + 9], 
sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, sc->locID[1], sc->locID[1], tf[k * 9 + 10], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; } res = VkAddComplex(sc, sc->locID[k * 4 + 3], sc->locID[k * 4 + 3], regID[k * 4 + 3]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[k * 4 + 5], sc->locID[k * 4 + 5], regID[k * 4 + 3]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[k * 4 + 4], sc->locID[k * 4 + 4], regID[k * 4 + 4]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[k * 4 + 6], sc->locID[k * 4 + 6], regID[k * 4 + 4]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[k * 4 + 5], regID[k * 4 + 5], sc->locID[1]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[k * 4 + 6], regID[k * 4 + 6], sc->locID[1]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[k * 4 + 3], sc->locID[k * 4 + 3], regID[k * 4 + 5]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[k * 4 + 4], sc->locID[k * 4 + 4], regID[k * 4 + 5]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[k * 4 + 5], sc->locID[k * 4 + 5], regID[k * 4 + 6]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[k * 4 + 6], sc->locID[k * 4 + 6], regID[k * 4 + 6]); if (res != VKFFT_SUCCESS) return res; } res = VkAddComplex(sc, regID[1], regID[0], regID[1]); if (res != VKFFT_SUCCESS) return res; res = VkMovComplex(sc, sc->locID[5], regID[1]); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < 4; i++) { res = VkAddComplex(sc, sc->locID[i + 1], regID[1], regID[i + 3]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, sc->locID[5], sc->locID[5], regID[i + 3]); if (res != VKFFT_SUCCESS) return res; } res = VkMovComplex(sc, sc->locID[10], regID[2]); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < 4; i++) { res = VkAddComplex(sc, sc->locID[i + 6], regID[2], regID[i + 7]); 
if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, sc->locID[10], sc->locID[10], regID[i + 7]); if (res != VKFFT_SUCCESS) return res; } for (uint64_t i = 0; i < 5; i++) { res = VkAddComplex(sc, regID[i + 1], sc->locID[i + 1], sc->locID[i + 6]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[i + 6], sc->locID[i + 1], sc->locID[i + 6]); if (res != VKFFT_SUCCESS) return res; } uint64_t permute2[11] = { 0,10,1,8,7,9,4,2,3,6,5 }; res = VkPermute(sc, permute2, 11, 1, regID); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < 20; i++) { free(tf[i]); tf[i] = 0; } break; } case 13: { char* tf[20]; //char* tf2[4]; //char* tf2inv[4]; //VkAppendLine(sc, " {\n"); for (uint64_t i = 0; i < 20; i++) { tf[i] = (char*)malloc(sizeof(char) * 50); if (!tf[i]) { for (uint64_t j = 0; j < i; j++) { free(tf[j]); tf[j] = 0; } return VKFFT_ERROR_MALLOC_FAILED; } //tf2[i] = (char*)malloc(sizeof(char) * 50); //tf2inv[i] = (char*)malloc(sizeof(char) * 50); } sprintf(tf[0], "-1.08333333333333333e+00%s", LFending); sprintf(tf[1], "-3.00462606288665890e-01%s", LFending); sprintf(tf[5], "1.00707406572753300e+00%s", LFending); sprintf(tf[6], "7.31245990975348148e-01%s", LFending); sprintf(tf[7], "-5.79440018900960419e-01%s", LFending); sprintf(tf[8], "5.31932498429674383e-01%s", LFending); sprintf(tf[9], "-5.08814921720397551e-01%s", LFending); sprintf(tf[10], "-7.70585890309231480e-03%s", LFending); if (stageAngle < 0) { sprintf(tf[2], "-7.49279330626139051e-01%s", LFending); sprintf(tf[3], "4.01002128321867324e-01%s", LFending); sprintf(tf[4], "1.74138601152135891e-01%s", LFending); sprintf(tf[11], "-2.51139331838956803e+00%s", LFending); sprintf(tf[12], "-1.82354640868242068e+00%s", LFending); sprintf(tf[13], "1.44497990902399609e+00%s", LFending); sprintf(tf[14], "-1.34405691517736958e+00%s", LFending); sprintf(tf[15], "-9.75932420775945109e-01%s", LFending); sprintf(tf[16], "7.73329778651104860e-01%s", LFending); sprintf(tf[17], 
"1.92772511678346858e+00%s", LFending); sprintf(tf[18], "1.39973941472918284e+00%s", LFending); sprintf(tf[19], "-1.10915484383755047e+00%s", LFending); } else { sprintf(tf[2], "7.49279330626139051e-01%s", LFending); sprintf(tf[3], "-4.01002128321867324e-01%s", LFending); sprintf(tf[4], "-1.74138601152135891e-01%s", LFending); sprintf(tf[11], "2.51139331838956803e+00%s", LFending); sprintf(tf[12], "1.82354640868242068e+00%s", LFending); sprintf(tf[13], "-1.44497990902399609e+00%s", LFending); sprintf(tf[14], "1.34405691517736958e+00%s", LFending); sprintf(tf[15], "9.75932420775945109e-01%s", LFending); sprintf(tf[16], "-7.73329778651104860e-01%s", LFending); sprintf(tf[17], "-1.92772511678346858e+00%s", LFending); sprintf(tf[18], "-1.39973941472918284e+00%s", LFending); sprintf(tf[19], "1.10915484383755047e+00%s", LFending); } for (uint64_t i = radix - 1; i > 0; i--) { if (i == radix - 1) { if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " %s = twiddleLUT[LUTId];\n", w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.y;\n", w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s(angle*%.17e%s);\n", w, cosDef, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s(angle*%.17e%s);\n", w, sinDef, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " w = %s(cos(angle*%.17e), sin(angle*%.17e));\n\n", vecType, 2.0 * i / radix, 2.0 * i / radix); } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " %s = sincos_20(angle*%.17e%s);\n", w, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } else { if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " %s = 
twiddleLUT[LUTId+%" PRIu64 "];\n\n", w, (radix - 1 - i) * stageSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " %s.y = -%s.y;\n", w, w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " %s.x = %s(angle*%.17e%s);\n", w, cosDef, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s(angle*%.17e%s);\n", w, sinDef, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " w = %s(cos(angle*%.17e), sin(angle*%.17e));\n\n", vecType, 2.0 * i / radix, 2.0 * i / radix); } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " %s = sincos_20(angle*%.17e%s);\n", w, 2.0 * i / radix, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } res = VkMulComplex(sc, sc->locID[i], regID[i], w, 0); if (res != VKFFT_SUCCESS) return res; } res = VkMovComplex(sc, sc->locID[0], regID[0]); if (res != VKFFT_SUCCESS) return res; uint64_t permute[13] = { 0,1,3,9,5,2,6,12,10,4,8,11,7 }; res = VkPermute(sc, permute, 13, 0, 0); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < 6; i++) { res = VkSubComplex(sc, regID[i + 7], sc->locID[i + 1], sc->locID[i + 7]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[i + 1], sc->locID[i + 1], sc->locID[i + 7]); if (res != VKFFT_SUCCESS) return res; } for (uint64_t i = 0; i < 3; i++) { res = VkAddComplex(sc, regID[i + 1], sc->locID[i + 1], sc->locID[i + 4]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[i + 4], sc->locID[i + 1], sc->locID[i + 4]); if (res != VKFFT_SUCCESS) return res; } for (uint64_t i = 0; i < 4; i++) { res = VkAddComplex(sc, sc->locID[i + 1], regID[i * 3 + 1], regID[i * 3 + 2]); if (res != VKFFT_SUCCESS) return res; res = 
VkSubComplex(sc, sc->locID[i * 2 + 5], regID[i * 3 + 1], regID[i * 3 + 3]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[i + 1], sc->locID[i + 1], regID[i * 3 + 3]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, sc->locID[i * 2 + 6], regID[i * 3 + 2], regID[i * 3 + 3]); if (res != VKFFT_SUCCESS) return res; } res = VkAddComplex(sc, regID[0], sc->locID[0], sc->locID[1]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, regID[1], sc->locID[1], tf[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, regID[2], sc->locID[2], tf[1]); if (res != VKFFT_SUCCESS) return res; for (uint64_t k = 0; k < 3; k++) { res = VkAddComplex(sc, regID[k * 2 + 4], sc->locID[k * 2 + 3], sc->locID[k * 2 + 4]); if (res != VKFFT_SUCCESS) return res; if (k == 0) { res = VkMulComplexNumberImag(sc, sc->locID[k * 2 + 3], sc->locID[k * 2 + 3], tf[k * 3 + 2], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, sc->locID[k * 2 + 4], sc->locID[k * 2 + 4], tf[k * 3 + 3], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, regID[k * 2 + 4], regID[k * 2 + 4], tf[k * 3 + 4], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; } else { res = VkMulComplexNumber(sc, sc->locID[k * 2 + 3], sc->locID[k * 2 + 3], tf[k * 3 + 2]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, sc->locID[k * 2 + 4], sc->locID[k * 2 + 4], tf[k * 3 + 3]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, regID[k * 2 + 4], regID[k * 2 + 4], tf[k * 3 + 4]); if (res != VKFFT_SUCCESS) return res; } res = VkAddComplex(sc, regID[k * 2 + 3], sc->locID[k * 2 + 3], regID[k * 2 + 4]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[k * 2 + 4], sc->locID[k * 2 + 4], regID[k * 2 + 4]); if (res != VKFFT_SUCCESS) return res; } res = VkAddComplex(sc, regID[9], sc->locID[9], sc->locID[11]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[10], sc->locID[10], 
sc->locID[12]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[11], sc->locID[9], sc->locID[10]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[12], sc->locID[11], sc->locID[12]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[1], regID[9], regID[10]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, sc->locID[9], sc->locID[9], tf[11], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, sc->locID[10], sc->locID[10], tf[12], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, regID[11], regID[11], tf[13], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, sc->locID[11], sc->locID[11], tf[14], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, sc->locID[12], sc->locID[12], tf[15], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, regID[12], regID[12], tf[16], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, regID[9], regID[9], tf[17], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, regID[10], regID[10], tf[18], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumberImag(sc, sc->locID[1], sc->locID[1], tf[19], sc->locID[0]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[9], sc->locID[9], regID[9]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[11], sc->locID[11], regID[9]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[10], sc->locID[10], regID[10]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[12], sc->locID[12], regID[10]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[11], regID[11], sc->locID[1]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[12], regID[12], sc->locID[1]); 
if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[9], sc->locID[9], regID[11]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[10], sc->locID[10], regID[11]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[11], sc->locID[11], regID[12]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[12], sc->locID[12], regID[12]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, regID[1], regID[0], regID[1]); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < 4; i++) { res = VkAddComplex(sc, sc->locID[i * 3 + 1], regID[i + 1], regID[i * 2 + 5]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, sc->locID[i * 3 + 3], regID[i + 1], regID[i * 2 + 5]); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, sc->locID[i * 3 + 2], regID[i + 1], regID[i * 2 + 6]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, sc->locID[i * 3 + 3], sc->locID[i * 3 + 3], regID[i * 2 + 6]); if (res != VKFFT_SUCCESS) return res; } for (uint64_t i = 0; i < 3; i++) { res = VkAddComplex(sc, regID[i + 1], sc->locID[i + 1], sc->locID[i + 4]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, sc->locID[i + 4], sc->locID[i + 1], sc->locID[i + 4]); if (res != VKFFT_SUCCESS) return res; res = VkMovComplex(sc, sc->locID[i + 1], regID[i + 1]); if (res != VKFFT_SUCCESS) return res; } for (uint64_t i = 0; i < 6; i++) { res = VkAddComplex(sc, regID[i + 1], sc->locID[i + 1], sc->locID[i + 7]); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, regID[i + 7], sc->locID[i + 1], sc->locID[i + 7]); if (res != VKFFT_SUCCESS) return res; } uint64_t permute2[13] = { 0,12,1,10,5,3,2,8,9,11,4,7,6 }; res = VkPermute(sc, permute2, 13, 1, regID); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < 20; i++) { free(tf[i]); tf[i] = 0; } break; } } return res; } static inline VkFFTResult appendSharedMemoryVkFFT(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const 
char* uintType, uint64_t sharedType) { VkFFTResult res = VKFFT_SUCCESS; char vecType[30]; char sharedDefinitions[20] = ""; uint64_t vecSize = 1; uint64_t maxSequenceSharedMemory = 0; //uint64_t maxSequenceSharedMemoryPow2 = 0; if (!strcmp(floatType, "float")) { #if(VKFFT_BACKEND==0) sprintf(vecType, "vec2"); sprintf(sharedDefinitions, "shared"); #elif(VKFFT_BACKEND==1) sprintf(vecType, "float2"); sprintf(sharedDefinitions, "__shared__"); #elif(VKFFT_BACKEND==2) sprintf(vecType, "float2"); sprintf(sharedDefinitions, "__shared__"); #elif(VKFFT_BACKEND==3) sprintf(vecType, "float2"); sprintf(sharedDefinitions, "__local"); #endif vecSize = 8; } if (!strcmp(floatType, "double")) { #if(VKFFT_BACKEND==0) sprintf(vecType, "dvec2"); sprintf(sharedDefinitions, "shared"); #elif(VKFFT_BACKEND==1) sprintf(vecType, "double2"); sprintf(sharedDefinitions, "__shared__"); #elif(VKFFT_BACKEND==2) sprintf(vecType, "double2"); sprintf(sharedDefinitions, "__shared__"); #elif(VKFFT_BACKEND==3) sprintf(vecType, "double2"); sprintf(sharedDefinitions, "__local"); #endif vecSize = 16; } maxSequenceSharedMemory = sc->sharedMemSize / vecSize; //maxSequenceSharedMemoryPow2 = sc->sharedMemSizePow2 / vecSize; uint64_t mergeR2C = (sc->mergeSequencesR2C && (sc->axis_id == 0)) ? 2 : 0; switch (sharedType) { case 0: case 5: case 6: case 110: case 120: case 130: case 140: case 142: case 144://single_c2c + single_r2c { sc->resolveBankConflictFirstStages = 0; sc->sharedStrideBankConflictFirstStages = ((sc->fftDim > sc->numSharedBanks / 2) && ((sc->fftDim & (sc->fftDim - 1)) == 0)) ? sc->fftDim / sc->registerBoost * (sc->numSharedBanks / 2 + 1) / (sc->numSharedBanks / 2) : sc->fftDim / sc->registerBoost; sc->sharedStrideReadWriteConflict = ((sc->numSharedBanks / 2 <= sc->localSize[1])) ? 
sc->fftDim / sc->registerBoost + 1 : sc->fftDim / sc->registerBoost + (sc->numSharedBanks / 2) / sc->localSize[1]; if (sc->sharedStrideReadWriteConflict < sc->fftDim / sc->registerBoost + mergeR2C) sc->sharedStrideReadWriteConflict = sc->fftDim / sc->registerBoost + mergeR2C; sc->maxSharedStride = (sc->sharedStrideBankConflictFirstStages < sc->sharedStrideReadWriteConflict) ? sc->sharedStrideReadWriteConflict : sc->sharedStrideBankConflictFirstStages; sc->usedSharedMemory = vecSize * sc->localSize[1] * sc->maxSharedStride; sc->maxSharedStride = ((sc->sharedMemSize < sc->usedSharedMemory)) ? sc->fftDim / sc->registerBoost : sc->maxSharedStride; sc->sharedStrideBankConflictFirstStages = (sc->maxSharedStride == sc->fftDim / sc->registerBoost) ? sc->fftDim / sc->registerBoost : sc->sharedStrideBankConflictFirstStages; sc->sharedStrideReadWriteConflict = (sc->maxSharedStride == sc->fftDim / sc->registerBoost) ? sc->fftDim / sc->registerBoost : sc->sharedStrideReadWriteConflict; //sc->maxSharedStride += mergeR2C; //printf("%" PRIu64 " %" PRIu64 " %" PRIu64 " %" PRIu64 " %" PRIu64 "\n", sc->maxSharedStride, sc->sharedStrideBankConflictFirstStages, sc->sharedStrideReadWriteConflict, sc->localSize[1], sc->fftDim); sc->tempLen = sprintf(sc->tempStr, "%s sharedStride = %" PRIu64 ";\n", uintType, sc->sharedStrideReadWriteConflict); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #if(VKFFT_BACKEND==0) sc->tempLen = sprintf(sc->tempStr, "%s %s sdata[%" PRIu64 "];// sharedStride - fft size, gl_WorkGroupSize.y - grouped consecutive ffts\n\n", sharedDefinitions, vecType, sc->localSize[1] * sc->maxSharedStride); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #elif(VKFFT_BACKEND==1) //sc->tempLen = sprintf(sc->tempStr, "%s %s sdata[%" PRIu64 "];// sharedStride - fft size, gl_WorkGroupSize.y - grouped consecutive ffts\n\n", sharedDefinitions, vecType, sc->localSize[1] * sc->maxSharedStride); sc->tempLen = sprintf(sc->tempStr, "%s* sdata = 
(%s*)shared;\n\n", vecType, vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, "%s %s sdata[];// sharedStride - fft size, gl_WorkGroupSize.y - grouped consecutive ffts\n\n", sharedDefinitions, vecType); #elif(VKFFT_BACKEND==2) //sc->tempLen = sprintf(sc->tempStr, "%s %s sdata[%" PRIu64 "];// sharedStride - fft size, gl_WorkGroupSize.y - grouped consecutive ffts\n\n", sharedDefinitions, vecType, sc->localSize[1] * sc->maxSharedStride); sc->tempLen = sprintf(sc->tempStr, "%s* sdata = (%s*)shared;\n\n", vecType, vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, "%s %s sdata[];// sharedStride - fft size, gl_WorkGroupSize.y - grouped consecutive ffts\n\n", sharedDefinitions, vecType); #elif(VKFFT_BACKEND==3) sc->tempLen = sprintf(sc->tempStr, "%s %s sdata[%" PRIu64 "];// sharedStride - fft size, gl_WorkGroupSize.y - grouped consecutive ffts\n\n", sharedDefinitions, vecType, sc->localSize[1] * sc->maxSharedStride); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #endif sc->usedSharedMemory = vecSize * sc->localSize[1] * sc->maxSharedStride; break; } case 1: case 2: case 111: case 121: case 131: case 141: case 143: case 145://grouped_c2c + single_c2c_strided { uint64_t shift = (sc->fftDim < (sc->numSharedBanks / 2)) ? (sc->numSharedBanks / 2) / sc->fftDim : 1; sc->sharedStrideReadWriteConflict = ((sc->axisSwapped) && ((sc->localSize[0] % 4) == 0)) ? sc->localSize[0] + shift : sc->localSize[0]; sc->maxSharedStride = ((maxSequenceSharedMemory < sc->sharedStrideReadWriteConflict* sc->fftDim / sc->registerBoost)) ? sc->localSize[0] : sc->sharedStrideReadWriteConflict; sc->sharedStrideReadWriteConflict = (sc->maxSharedStride == sc->localSize[0]) ? 
sc->localSize[0] : sc->sharedStrideReadWriteConflict; sc->tempLen = sprintf(sc->tempStr, "%s sharedStride = %" PRIu64 ";\n", uintType, sc->maxSharedStride); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #if(VKFFT_BACKEND==0) sc->tempLen = sprintf(sc->tempStr, "%s %s sdata[%" PRIu64 "];\n\n", sharedDefinitions, vecType, sc->maxSharedStride * (sc->fftDim + mergeR2C) / sc->registerBoost); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #elif(VKFFT_BACKEND==1) //sc->tempLen = sprintf(sc->tempStr, "%s %s sdata[%" PRIu64 "];\n\n", sharedDefinitions, vecType, sc->maxSharedStride * (sc->fftDim + mergeR2C) / sc->registerBoost); sc->tempLen = sprintf(sc->tempStr, "%s* sdata = (%s*)shared;\n\n", vecType, vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, "%s %s sdata[];\n\n", sharedDefinitions, vecType); #elif(VKFFT_BACKEND==2) //sc->tempLen = sprintf(sc->tempStr, "%s %s sdata[%" PRIu64 "];\n\n", sharedDefinitions, vecType, sc->maxSharedStride * (sc->fftDim + mergeR2C) / sc->registerBoost); sc->tempLen = sprintf(sc->tempStr, "%s* sdata = (%s*)shared;\n\n", vecType, vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, "%s %s sdata[];\n\n", sharedDefinitions, vecType); #elif(VKFFT_BACKEND==3) sc->tempLen = sprintf(sc->tempStr, "%s %s sdata[%" PRIu64 "];\n\n", sharedDefinitions, vecType, sc->maxSharedStride * (sc->fftDim + mergeR2C) / sc->registerBoost); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #endif sc->usedSharedMemory = vecSize * sc->maxSharedStride * (sc->fftDim + mergeR2C) / sc->registerBoost; break; } } return res; } static inline VkFFTResult appendInitialization(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* uintType, uint64_t initType) { VkFFTResult res = VKFFT_SUCCESS; char vecType[30]; #if(VKFFT_BACKEND==0) if (!strcmp(floatType, "float")) sprintf(vecType, "vec2"); if 
(!strcmp(floatType, "double")) sprintf(vecType, "dvec2"); #elif(VKFFT_BACKEND==1) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); #elif(VKFFT_BACKEND==2) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); #elif(VKFFT_BACKEND==3) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); #endif //sc->tempLen = sprintf(sc->tempStr, " uint dum=gl_LocalInvocationID.x;\n"); uint64_t logicalStoragePerThread = sc->registers_per_thread * sc->registerBoost; uint64_t logicalRegistersPerThread = sc->registers_per_thread; if (sc->convolutionStep) { for (uint64_t i = 0; i < sc->registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, " %s temp_%" PRIu64 ";\n", vecType, i); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " temp_%" PRIu64 ".x=0;\n", i); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " temp_%" PRIu64 ".y=0;\n", i); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } for (uint64_t j = 1; j < sc->matrixConvolution; j++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, " %s temp_%" PRIu64 "_%" PRIu64 ";\n", vecType, i, j); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " temp_%" PRIu64 "_%" PRIu64 ".x=0;\n", i, j); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " temp_%" PRIu64 "_%" PRIu64 ".y=0;\n", i, j); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } else { for (uint64_t i = 0; i < sc->registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, " %s temp_%" PRIu64 ";\n", vecType, i); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = 
sprintf(sc->tempStr, " temp_%" PRIu64 ".x=0;\n", i); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " temp_%" PRIu64 ".y=0;\n", i); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } //sc->tempLen = sprintf(sc->tempStr, " uint dum=gl_LocalInvocationID.y;//gl_LocalInvocationID.x/gl_WorkGroupSize.x;\n"); //sc->tempLen = sprintf(sc->tempStr, " dum=dum/gl_LocalInvocationID.x-1;\n"); //sc->tempLen = sprintf(sc->tempStr, " dummy=dummy/gl_LocalInvocationID.x-1;\n"); sc->regIDs = (char**)malloc(sizeof(char*) * logicalStoragePerThread); if (!sc->regIDs) return VKFFT_ERROR_MALLOC_FAILED; for (uint64_t i = 0; i < logicalStoragePerThread; i++) { sc->regIDs[i] = (char*)malloc(sizeof(char) * 50); if (!sc->regIDs[i]) { for (uint64_t j = 0; j < i; j++) { free(sc->regIDs[j]); sc->regIDs[j] = 0; } free(sc->regIDs); sc->regIDs = 0; return VKFFT_ERROR_MALLOC_FAILED; } if (i < logicalRegistersPerThread) sprintf(sc->regIDs[i], "temp_%" PRIu64 "", i); else sprintf(sc->regIDs[i], "temp_%" PRIu64 "", i); //sprintf(sc->regIDs[i], "%" PRIu64 "[%" PRIu64 "]", i / logicalRegistersPerThread, i % logicalRegistersPerThread); //sprintf(sc->regIDs[i], "s[%" PRIu64 "]", i - logicalRegistersPerThread); } if (sc->registerBoost > 1) { //sc->tempLen = sprintf(sc->tempStr, " %s sort0;\n", vecType); //sc->tempLen = sprintf(sc->tempStr, " %s temps[%" PRIu64 "];\n", vecType, (sc->registerBoost -1)* logicalRegistersPerThread); for (uint64_t i = 1; i < sc->registerBoost; i++) { //sc->tempLen = sprintf(sc->tempStr, " %s temp%" PRIu64 "[%" PRIu64 "];\n", vecType, i, logicalRegistersPerThread); for (uint64_t j = 0; j < sc->registers_per_thread; j++) { sc->tempLen = sprintf(sc->tempStr, " %s temp_%" PRIu64 ";\n", vecType, j + i * sc->registers_per_thread); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " temp_%" PRIu64 ".x=0;\n", j + i * sc->registers_per_thread); res = VkAppendLine(sc); if (res != 
VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " temp_%" PRIu64 ".y=0;\n", j + i * sc->registers_per_thread); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } /*sc->tempLen = sprintf(sc->tempStr, "\ for(uint i=0; i<%" PRIu64 "; i++)\n\ temp%" PRIu64 "[i]=%s(dum, dum);\n", logicalRegistersPerThread, i, vecType);*/ } } sc->tempLen = sprintf(sc->tempStr, " %s w;\n", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " w.x=0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " w.y=0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(sc->w, "w"); uint64_t maxNonPow2Radix = 1; if (sc->fftDim % 3 == 0) maxNonPow2Radix = 3; if (sc->fftDim % 5 == 0) maxNonPow2Radix = 5; if (sc->fftDim % 7 == 0) maxNonPow2Radix = 7; if (sc->fftDim % 11 == 0) maxNonPow2Radix = 11; if (sc->fftDim % 13 == 0) maxNonPow2Radix = 13; for (uint64_t i = 0; i < maxNonPow2Radix; i++) { sprintf(sc->locID[i], "loc_%" PRIu64 "", i); sc->tempLen = sprintf(sc->tempStr, " %s %s;\n", vecType, sc->locID[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x=0;\n", sc->locID[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y=0;\n", sc->locID[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sprintf(sc->temp, "%s", sc->locID[0]); uint64_t useRadix8 = 0; for (uint64_t i = 0; i < sc->numStages; i++) if (sc->stageRadix[i] == 8) useRadix8 = 1; if (useRadix8 == 1) { if (maxNonPow2Radix > 1) sprintf(sc->iw, "%s", sc->locID[1]); else { sc->tempLen = sprintf(sc->tempStr, " %s iw;\n", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " iw.x=0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " iw.y=0;\n"); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(sc->iw, "iw"); } } //sc->tempLen = sprintf(sc->tempStr, " %s %s;\n", vecType, sc->tempReg); sc->tempLen = sprintf(sc->tempStr, " %s %s=0;\n", uintType, sc->stageInvocationID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s %s=0;\n", uintType, sc->blockInvocationID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s %s=0;\n", uintType, sc->sdataID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s %s=0;\n", uintType, sc->combinedID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s %s=0;\n", uintType, sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " %s LUTId=0;\n", uintType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " %s angle=0;\n", floatType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (((sc->stageStartSize > 1) && (!((sc->stageStartSize > 1) && (!sc->reorderFourStep) && (sc->inverse)))) || (((sc->stageStartSize > 1) && (!sc->reorderFourStep) && (sc->inverse))) || (sc->performDCT)) { sc->tempLen = sprintf(sc->tempStr, " %s mult;\n", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.x = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->cacheShuffle) { sc->tempLen = sprintf(sc->tempStr, "\ %s tshuffle= ((%s>>1))%%(%" PRIu64 ");\n\ %s shuffle[%" PRIu64 "];\n", uintType, sc->gl_LocalInvocationID_x, sc->registers_per_thread, vecType, sc->registers_per_thread); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for 
(uint64_t i = 0; i < sc->registers_per_thread; i++) { /*sc->tempLen = sprintf(sc->tempStr, "\ shuffle[%" PRIu64 "];\n", i, vecType);*/ sc->tempLen = sprintf(sc->tempStr, " shuffle[%" PRIu64 "].x = 0;\n", i); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " shuffle[%" PRIu64 "].y = 0;\n", i); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } return res; } static inline VkFFTResult appendZeropadStart(VkFFTSpecializationConstantsLayout* sc) { //return if sequence is full of zeros from the start VkFFTResult res = VKFFT_SUCCESS; if ((sc->frequencyZeropadding)) { switch (sc->axis_id) { case 0: { break; } case 1: { if (!sc->supportAxis) { char idX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(idX, "(%s + consts.workGroupShiftX * %s)", sc->gl_GlobalInvocationID_x, sc->gl_WorkGroupSize_x); else sprintf(idX, "%s", sc->gl_GlobalInvocationID_x); if (sc->performZeropaddingFull[0]) { if (sc->fft_zeropad_left_full[0] < sc->fft_zeropad_right_full[0]) { sc->tempLen = sprintf(sc->tempStr, " if(!((%s >= %" PRIu64 ")&&(%s < %" PRIu64 "))) {\n", idX, sc->fft_zeropad_left_full[0], idX, sc->fft_zeropad_right_full[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } break; } case 2: { if (!sc->supportAxis) { char idY[500] = ""; if (sc->performWorkGroupShift[1])//y axis is along z workgroup here sprintf(idY, "(%s + consts.workGroupShiftZ * %s)", sc->gl_GlobalInvocationID_z, sc->gl_WorkGroupSize_z); else sprintf(idY, "%s", sc->gl_GlobalInvocationID_z); char idX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(idX, "(%s + consts.workGroupShiftX * %s)", sc->gl_GlobalInvocationID_x, sc->gl_WorkGroupSize_x); else sprintf(idX, "%s", sc->gl_GlobalInvocationID_x); if (sc->performZeropaddingFull[0]) { if (sc->fft_zeropad_left_full[0] < sc->fft_zeropad_right_full[0]) { sc->tempLen = sprintf(sc->tempStr, " if(!((%s >= %" PRIu64 ")&&(%s < %" PRIu64 "))) {\n", idX, sc->fft_zeropad_left_full[0], idX, 
sc->fft_zeropad_right_full[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->performZeropaddingFull[1]) { if (sc->fft_zeropad_left_full[1] < sc->fft_zeropad_right_full[1]) { sc->tempLen = sprintf(sc->tempStr, " if(!((%s >= %" PRIu64 ")&&(%s < %" PRIu64 "))) {\n", idY, sc->fft_zeropad_left_full[1], idY, sc->fft_zeropad_right_full[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } else { char idY[500] = ""; if (sc->performWorkGroupShift[1])//for support axes y is along x workgroup sprintf(idY, "(%s + consts.workGroupShiftX * %s)", sc->gl_GlobalInvocationID_x, sc->gl_WorkGroupSize_x); else sprintf(idY, "%s", sc->gl_GlobalInvocationID_x); if (sc->performZeropaddingFull[1]) { if (sc->fft_zeropad_left_full[1] < sc->fft_zeropad_right_full[1]) { sc->tempLen = sprintf(sc->tempStr, " if(!((%s >= %" PRIu64 ")&&(%s < %" PRIu64 "))) {\n", idY, sc->fft_zeropad_left_full[1], idY, sc->fft_zeropad_right_full[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } break; } } } else { switch (sc->axis_id) { case 0: { char idY[500] = ""; if (sc->axisSwapped) { if (sc->performWorkGroupShift[1]) sprintf(idY, "(%s + (%s + consts.workGroupShiftY) * %" PRIu64 ")", sc->gl_LocalInvocationID_x, sc->gl_WorkGroupID_y, sc->localSize[0]); else sprintf(idY, "%s + %s * %" PRIu64 "", sc->gl_LocalInvocationID_x, sc->gl_WorkGroupID_y, sc->localSize[0]); char idZ[500] = ""; if (sc->performWorkGroupShift[2]) sprintf(idZ, "(%s + consts.workGroupShiftZ * %s)", sc->gl_GlobalInvocationID_z, sc->gl_WorkGroupSize_z); else sprintf(idZ, "%s", sc->gl_GlobalInvocationID_z); if (sc->performZeropaddingFull[1]) { if (sc->fft_zeropad_left_full[1] < sc->fft_zeropad_right_full[1]) { sc->tempLen = sprintf(sc->tempStr, " if(!((%s >= %" PRIu64 ")&&(%s < %" PRIu64 "))) {\n", idY, sc->fft_zeropad_left_full[1], idY, sc->fft_zeropad_right_full[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->performZeropaddingFull[2]) { if 
(sc->fft_zeropad_left_full[2] < sc->fft_zeropad_right_full[2]) { sc->tempLen = sprintf(sc->tempStr, " if(!((%s >= %" PRIu64 ")&&(%s < %" PRIu64 "))) {\n", idZ, sc->fft_zeropad_left_full[2], idZ, sc->fft_zeropad_right_full[2]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } else { if (sc->performWorkGroupShift[1]) sprintf(idY, "(%s + consts.workGroupShiftY * %s)", sc->gl_GlobalInvocationID_y, sc->gl_WorkGroupSize_y); else sprintf(idY, "%s", sc->gl_GlobalInvocationID_y); char idZ[500] = ""; if (sc->performWorkGroupShift[2]) sprintf(idZ, "(%s + consts.workGroupShiftZ * %s)", sc->gl_GlobalInvocationID_z, sc->gl_WorkGroupSize_z); else sprintf(idZ, "%s", sc->gl_GlobalInvocationID_z); if (sc->performZeropaddingFull[1]) { if (sc->fft_zeropad_left_full[1] < sc->fft_zeropad_right_full[1]) { sc->tempLen = sprintf(sc->tempStr, " if(!((%s >= %" PRIu64 ")&&(%s < %" PRIu64 "))) {\n", idY, sc->fft_zeropad_left_full[1], idY, sc->fft_zeropad_right_full[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->performZeropaddingFull[2]) { if (sc->fft_zeropad_left_full[2] < sc->fft_zeropad_right_full[2]) { sc->tempLen = sprintf(sc->tempStr, " if(!((%s >= %" PRIu64 ")&&(%s < %" PRIu64 "))) {\n", idZ, sc->fft_zeropad_left_full[2], idZ, sc->fft_zeropad_right_full[2]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } break; } case 1: { char idZ[500] = ""; if (sc->performWorkGroupShift[2]) sprintf(idZ, "(%s + consts.workGroupShiftZ * %s)", sc->gl_GlobalInvocationID_z, sc->gl_WorkGroupSize_z); else sprintf(idZ, "%s", sc->gl_GlobalInvocationID_z); if (sc->performZeropaddingFull[2]) { if (sc->fft_zeropad_left_full[2] < sc->fft_zeropad_right_full[2]) { sc->tempLen = sprintf(sc->tempStr, " if(!((%s >= %" PRIu64 ")&&(%s < %" PRIu64 "))) {\n", idZ, sc->fft_zeropad_left_full[2], idZ, sc->fft_zeropad_right_full[2]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } break; } case 2: { break; } } } return res; } static 
inline VkFFTResult appendZeropadEnd(VkFFTSpecializationConstantsLayout* sc) { //return if sequence is full of zeros from the start VkFFTResult res = VKFFT_SUCCESS; if ((sc->frequencyZeropadding)) { switch (sc->axis_id) { case 0: { break; } case 1: { if (!sc->supportAxis) { char idX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(idX, "(%s + consts.workGroupShiftX * %s)", sc->gl_GlobalInvocationID_x, sc->gl_WorkGroupSize_x); else sprintf(idX, "%s", sc->gl_GlobalInvocationID_x); if (sc->performZeropaddingFull[0]) { if (sc->fft_zeropad_left_full[0] < sc->fft_zeropad_right_full[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } break; } case 2: { if (!sc->supportAxis) { char idY[500] = ""; if (sc->performWorkGroupShift[1])//y axis is along z workgroup here sprintf(idY, "(%s + consts.workGroupShiftZ * %s)", sc->gl_GlobalInvocationID_z, sc->gl_WorkGroupSize_z); else sprintf(idY, "%s", sc->gl_GlobalInvocationID_z); char idX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(idX, "(%s + consts.workGroupShiftX * %s)", sc->gl_GlobalInvocationID_x, sc->gl_WorkGroupSize_x); else sprintf(idX, "%s", sc->gl_GlobalInvocationID_x); if (sc->performZeropaddingFull[0]) { if (sc->fft_zeropad_left_full[0] < sc->fft_zeropad_right_full[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->performZeropaddingFull[1]) { if (sc->fft_zeropad_left_full[1] < sc->fft_zeropad_right_full[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } else { char idY[500] = ""; if (sc->performWorkGroupShift[1])//for support axes y is along x workgroup sprintf(idY, "(%s + consts.workGroupShiftX * %s)", sc->gl_GlobalInvocationID_x, sc->gl_WorkGroupSize_x); else sprintf(idY, "%s", sc->gl_GlobalInvocationID_x); if (sc->performZeropaddingFull[1]) { if (sc->fft_zeropad_left_full[1] < 
sc->fft_zeropad_right_full[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } break; } } } else { switch (sc->axis_id) { case 0: { char idY[500] = ""; if (sc->performZeropaddingFull[1]) { if (sc->fft_zeropad_left_full[1] < sc->fft_zeropad_right_full[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->performZeropaddingFull[2]) { if (sc->fft_zeropad_left_full[2] < sc->fft_zeropad_right_full[2]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } break; } case 1: { char idZ[500] = ""; if (sc->performWorkGroupShift[2]) sprintf(idZ, "(%s + consts.workGroupShiftZ * %s)", sc->gl_GlobalInvocationID_z, sc->gl_WorkGroupSize_z); else sprintf(idZ, "%s", sc->gl_GlobalInvocationID_z); if (sc->performZeropaddingFull[2]) { if (sc->fft_zeropad_left_full[2] < sc->fft_zeropad_right_full[2]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } break; } case 2: { break; } } } return res; } static inline VkFFTResult appendZeropadStartReadWriteStage(VkFFTSpecializationConstantsLayout* sc, uint64_t readStage) { //return if sequence is full of zeros from the start VkFFTResult res = VKFFT_SUCCESS; if ((sc->frequencyZeropadding)) { switch (sc->axis_id) { case 0: { break; } case 1: { if (!sc->supportAxis) { char idX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(idX, "(%s + consts.workGroupShiftX * %s)", sc->gl_GlobalInvocationID_x, sc->gl_WorkGroupSize_x); else sprintf(idX, "%s", sc->gl_GlobalInvocationID_x); if (sc->performZeropaddingFull[0]) { if (sc->fft_zeropad_left_full[0] < sc->fft_zeropad_right_full[0]) { sc->tempLen = sprintf(sc->tempStr, " if(!((%s >= %" PRIu64 ")&&(%s < %" PRIu64 "))) {\n", idX, sc->fft_zeropad_left_full[0], idX, sc->fft_zeropad_right_full[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; 
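					/* Illustration only, using assumed values that do not come from a real plan:
					   with performWorkGroupShift[0] unset, fft_zeropad_left_full[0] = 4,
					   fft_zeropad_right_full[0] = 8, and assuming sc->gl_GlobalInvocationID_x
					   holds the GLSL name "gl_GlobalInvocationID.x", the line appended above
					   would read
					       if(!((gl_GlobalInvocationID.x >= 4)&&(gl_GlobalInvocationID.x < 8))) {
					   so invocations whose index falls inside the zero-padded range [4, 8)
					   skip the guarded read/write. The closing "}" of this guard is appended
					   later by appendZeropadEndReadWriteStage. */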
} } } break; } case 2: { if (!sc->supportAxis) { char idY[500] = ""; if (sc->performWorkGroupShift[1])//y axis is along z workgroup here sprintf(idY, "(%s + consts.workGroupShiftZ * %s)", sc->gl_GlobalInvocationID_z, sc->gl_WorkGroupSize_z); else sprintf(idY, "%s", sc->gl_GlobalInvocationID_z); char idX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(idX, "(%s + consts.workGroupShiftX * %s)", sc->gl_GlobalInvocationID_x, sc->gl_WorkGroupSize_x); else sprintf(idX, "%s", sc->gl_GlobalInvocationID_x); if (sc->performZeropaddingFull[0]) { if (sc->fft_zeropad_left_full[0] < sc->fft_zeropad_right_full[0]) { sc->tempLen = sprintf(sc->tempStr, " if(!((%s >= %" PRIu64 ")&&(%s < %" PRIu64 "))) {\n", idX, sc->fft_zeropad_left_full[0], idX, sc->fft_zeropad_right_full[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->performZeropaddingFull[1]) { if (sc->fft_zeropad_left_full[1] < sc->fft_zeropad_right_full[1]) { sc->tempLen = sprintf(sc->tempStr, " if(!((%s >= %" PRIu64 ")&&(%s < %" PRIu64 "))) {\n", idY, sc->fft_zeropad_left_full[1], idY, sc->fft_zeropad_right_full[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } else { char idY[500] = ""; if (sc->performWorkGroupShift[1])//for support axes y is along x workgroup sprintf(idY, "(%s + consts.workGroupShiftX * %s)", sc->gl_GlobalInvocationID_x, sc->gl_WorkGroupSize_x); else sprintf(idY, "%s", sc->gl_GlobalInvocationID_x); if (sc->performZeropaddingFull[1]) { if (sc->fft_zeropad_left_full[1] < sc->fft_zeropad_right_full[1]) { sc->tempLen = sprintf(sc->tempStr, " if(!((%s >= %" PRIu64 ")&&(%s < %" PRIu64 "))) {\n", idY, sc->fft_zeropad_left_full[1], idY, sc->fft_zeropad_right_full[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } break; } } } else { switch (sc->axis_id) { case 0: { char idY[500] = ""; char idZ[500] = ""; uint64_t mult = (sc->mergeSequencesR2C) ? 
2 : 1; if (readStage) { sprintf(idY, "(%s/%" PRIu64 ") %% %" PRIu64 "", sc->inoutID, sc->inputStride[1], sc->inputStride[2] / sc->inputStride[1]); sprintf(idZ, "(%s/%" PRIu64 ") %% %" PRIu64 "", sc->inoutID, sc->inputStride[2], sc->inputStride[3] / sc->inputStride[2]); } else { sprintf(idY, "(%s/%" PRIu64 ") %% %" PRIu64 "", sc->inoutID, sc->outputStride[1], sc->outputStride[2] / sc->outputStride[1]); sprintf(idZ, "(%s/%" PRIu64 ") %% %" PRIu64 "", sc->inoutID, sc->outputStride[2], sc->outputStride[3] / sc->outputStride[2]); } if (sc->performZeropaddingFull[1]) { if (sc->fft_zeropad_left_full[1] < sc->fft_zeropad_right_full[1]) { sc->tempLen = sprintf(sc->tempStr, " if(!((%s >= %" PRIu64 ")&&(%s < %" PRIu64 "))) {\n", idY, sc->fft_zeropad_left_full[1], idY, sc->fft_zeropad_right_full[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->performZeropaddingFull[2]) { if (sc->fft_zeropad_left_full[2] < sc->fft_zeropad_right_full[2]) { sc->tempLen = sprintf(sc->tempStr, " if(!((%s >= %" PRIu64 ")&&(%s < %" PRIu64 "))) {\n", idZ, sc->fft_zeropad_left_full[2], idZ, sc->fft_zeropad_right_full[2]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } break; } case 1: { char idZ[500] = ""; if (sc->performWorkGroupShift[2]) sprintf(idZ, "(%s + consts.workGroupShiftZ * %s)", sc->gl_GlobalInvocationID_z, sc->gl_WorkGroupSize_z); else sprintf(idZ, "%s", sc->gl_GlobalInvocationID_z); if (sc->performZeropaddingFull[2]) { if (sc->fft_zeropad_left_full[2] < sc->fft_zeropad_right_full[2]) { sc->tempLen = sprintf(sc->tempStr, " if(!((%s >= %" PRIu64 ")&&(%s < %" PRIu64 "))) {\n", idZ, sc->fft_zeropad_left_full[2], idZ, sc->fft_zeropad_right_full[2]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } break; } case 2: { break; } } } return res; } static inline VkFFTResult appendZeropadEndReadWriteStage(VkFFTSpecializationConstantsLayout* sc) { //return if sequence is full of zeros from the start VkFFTResult res = VKFFT_SUCCESS; if 
((sc->frequencyZeropadding)) {
		switch (sc->axis_id) {
		case 0: {
			break;
		}
		case 1: {
			char idX[500] = "";
			if (sc->performWorkGroupShift[0])
				sprintf(idX, "(%s + consts.workGroupShiftX * %s)", sc->gl_GlobalInvocationID_x, sc->gl_WorkGroupSize_x);
			else
				sprintf(idX, "%s", sc->gl_GlobalInvocationID_x);
			if (sc->performZeropaddingFull[0]) {
				if (sc->fft_zeropad_left_full[0] < sc->fft_zeropad_right_full[0]) {
					sc->tempLen = sprintf(sc->tempStr, " }\n");
					res = VkAppendLine(sc);
					if (res != VKFFT_SUCCESS) return res;
				}
			}
			break;
		}
		case 2: {
			if (sc->performZeropaddingFull[0]) {
				if (sc->fft_zeropad_left_full[0] < sc->fft_zeropad_right_full[0]) {
					sc->tempLen = sprintf(sc->tempStr, " }\n");
					res = VkAppendLine(sc);
					if (res != VKFFT_SUCCESS) return res;
				}
			}
			if (sc->performZeropaddingFull[1]) {
				if (sc->fft_zeropad_left_full[1] < sc->fft_zeropad_right_full[1]) {
					sc->tempLen = sprintf(sc->tempStr, " }\n");
					res = VkAppendLine(sc);
					if (res != VKFFT_SUCCESS) return res;
				}
			}
			break;
		}
		}
	}
	else {
		switch (sc->axis_id) {
		case 0: {
			if (sc->performZeropaddingFull[1]) {
				if (sc->fft_zeropad_left_full[1] < sc->fft_zeropad_right_full[1]) {
					sc->tempLen = sprintf(sc->tempStr, " }\n");
					res = VkAppendLine(sc);
					if (res != VKFFT_SUCCESS) return res;
				}
			}
			if (sc->performZeropaddingFull[2]) {
				if (sc->fft_zeropad_left_full[2] < sc->fft_zeropad_right_full[2]) {
					sc->tempLen = sprintf(sc->tempStr, " }\n");
					res = VkAppendLine(sc);
					if (res != VKFFT_SUCCESS) return res;
				}
			}
			break;
		}
		case 1: {
			if (sc->performZeropaddingFull[2]) {
				if (sc->fft_zeropad_left_full[2] < sc->fft_zeropad_right_full[2]) {
					sc->tempLen = sprintf(sc->tempStr, " }\n");
					res = VkAppendLine(sc);
					if (res != VKFFT_SUCCESS) return res;
				}
			}
			break;
		}
		case 2: {
			break;
		}
		}
	}
	return res;
}
static inline VkFFTResult appendSetSMToZero(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* floatTypeMemory, const char* uintType, uint64_t readType) {
	VkFFTResult res = VKFFT_SUCCESS;
	//appendZeropadStart(sc);
	for (uint64_t k = 0; k < 
sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { switch (readType) { case 0: case 5: case 6: case 110: case 120: case 130: case 140: case 142: case 144: { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].x = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].y = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } break; } case 1: case 2: case 111: case 121: case 131: case 141: case 143: case 145://single_c2c { sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%s+%" PRIu64 ")+%s].x=0;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " 
sdata[%s*(%s+%" PRIu64 ")+%s].y=0;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; break; } } } } //res = appendZeropadEnd(sc); //if (res != VKFFT_SUCCESS) return res; return res; } static inline VkFFTResult setReadToRegisters(VkFFTSpecializationConstantsLayout* sc, uint64_t readType) { VkFFTResult res = VKFFT_SUCCESS; switch (readType) { case 0: //single_c2c { if ((sc->localSize[1] > 1) || ((sc->performR2C) && (sc->actualInverse)) || (sc->localSize[0] * sc->stageRadix[0] * (sc->registers_per_thread_per_radix[sc->stageRadix[0]] / sc->stageRadix[0]) > sc->fftDim)) sc->readToRegisters = 0; else sc->readToRegisters = 1; break; } case 1: //grouped_c2c { if (sc->localSize[1] * sc->stageRadix[0] * (sc->registers_per_thread_per_radix[sc->stageRadix[0]] / sc->stageRadix[0]) > sc->fftDim) sc->readToRegisters = 0; else sc->readToRegisters = 1; break; } case 2: //single_c2c_strided { if (sc->localSize[1] * sc->stageRadix[0] * (sc->registers_per_thread_per_radix[sc->stageRadix[0]] / sc->stageRadix[0]) > sc->fftDim) sc->readToRegisters = 0; else sc->readToRegisters = 1; break; } case 5://single_r2c { if ((sc->axisSwapped) || (sc->localSize[1] > 1) || (sc->localSize[0] * sc->stageRadix[0] * (sc->registers_per_thread_per_radix[sc->stageRadix[0]] / sc->stageRadix[0]) > sc->fftDim)) sc->readToRegisters = 0; else sc->readToRegisters = 1; break; } case 6: //single_c2r { sc->readToRegisters = 1; break; } case 110: case 111: case 120: case 121: case 130: case 131: case 140: case 141: case 142: case 143: { sc->readToRegisters = 0; break; } case 144: case 145: { sc->readToRegisters = 1; break; } } return res; } static inline VkFFTResult appendReadDataVkFFT(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* floatTypeMemory, const char* uintType, uint64_t readType) { VkFFTResult res = VKFFT_SUCCESS; double double_PI 
= 3.1415926535897932384626433832795; char vecType[30]; char inputsStruct[20] = ""; char LFending[4] = ""; if (!strcmp(floatType, "float")) sprintf(LFending, "f"); #if(VKFFT_BACKEND==0) if (sc->inputBufferBlockNum == 1) sprintf(inputsStruct, "inputs"); else sprintf(inputsStruct, ".inputs"); if (!strcmp(floatType, "float")) sprintf(vecType, "vec2"); if (!strcmp(floatType, "double")) sprintf(vecType, "dvec2"); if (!strcmp(floatType, "double")) sprintf(LFending, "LF"); char cosDef[20] = "cos"; char sinDef[20] = "sin"; #elif(VKFFT_BACKEND==1) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); if (!strcmp(floatType, "double")) sprintf(LFending, "l"); sprintf(inputsStruct, "inputs"); char cosDef[20] = "__cosf"; char sinDef[20] = "__sinf"; #elif(VKFFT_BACKEND==2) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); if (!strcmp(floatType, "double")) sprintf(LFending, "l"); sprintf(inputsStruct, "inputs"); char cosDef[20] = "__cosf"; char sinDef[20] = "__sinf"; #elif(VKFFT_BACKEND==3) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); sprintf(inputsStruct, "inputs"); char cosDef[20] = "native_cos"; char sinDef[20] = "native_sin"; #endif char convTypeLeft[20] = ""; char convTypeRight[20] = ""; if ((!strcmp(floatType, "float")) && (strcmp(floatTypeMemory, "float"))) { if ((readType == 5) || (readType == 110) || (readType == 111) || (readType == 120) || (readType == 121) || (readType == 130) || (readType == 131) || (readType == 140) || (readType == 141) || (readType == 142) || (readType == 143) || (readType == 144) || (readType == 145)) { #if(VKFFT_BACKEND==0) sprintf(convTypeLeft, "float("); sprintf(convTypeRight, ")"); #elif(VKFFT_BACKEND==1) sprintf(convTypeLeft, "(float)"); //sprintf(convTypeRight, ""); #elif(VKFFT_BACKEND==2) sprintf(convTypeLeft, "(float)"); 
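			/* Sketch of the effect, with the index name assumed for illustration: for the
			   scalar read types tested above, convTypeLeft is a C-style cast prefix and
			   convTypeRight stays empty. A later "%s%s[%s]%s" format filled with
			   convTypeLeft, inputsStruct ("inputs"), an index variable named inoutID and
			   convTypeRight would therefore expand to
			       (float)inputs[inoutID]
			   on this backend, whereas the VKFFT_BACKEND==0 (GLSL) branch, which uses the
			   constructor form "float(" ... ")", would produce
			       float(inputs[inoutID]) */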
//sprintf(convTypeRight, ""); #elif(VKFFT_BACKEND==3) sprintf(convTypeLeft, "(float)"); //sprintf(convTypeRight, ""); #endif } else { #if(VKFFT_BACKEND==0) sprintf(convTypeLeft, "vec2("); sprintf(convTypeRight, ")"); #elif(VKFFT_BACKEND==1) sprintf(convTypeLeft, "conv_float2("); sprintf(convTypeRight, ")"); #elif(VKFFT_BACKEND==2) sprintf(convTypeLeft, "conv_float2("); sprintf(convTypeRight, ")"); #elif(VKFFT_BACKEND==3) sprintf(convTypeLeft, "conv_float2("); sprintf(convTypeRight, ")"); #endif } } if ((!strcmp(floatType, "double")) && (strcmp(floatTypeMemory, "double"))) { if ((readType == 5) || (readType == 110) || (readType == 111) || (readType == 120) || (readType == 121) || (readType == 130) || (readType == 131) || (readType == 140) || (readType == 141) || (readType == 142) || (readType == 143) || (readType == 144) || (readType == 145)) { #if(VKFFT_BACKEND==0) sprintf(convTypeLeft, "double("); sprintf(convTypeRight, ")"); #elif(VKFFT_BACKEND==1) sprintf(convTypeLeft, "(double)"); //sprintf(convTypeRight, ""); #elif(VKFFT_BACKEND==2) sprintf(convTypeLeft, "(double)"); //sprintf(convTypeRight, ""); #elif(VKFFT_BACKEND==3) sprintf(convTypeLeft, "(double)"); //sprintf(convTypeRight, ""); #endif } else { #if(VKFFT_BACKEND==0) sprintf(convTypeLeft, "dvec2("); sprintf(convTypeRight, ")"); #elif(VKFFT_BACKEND==1) sprintf(convTypeLeft, "conv_double2("); sprintf(convTypeRight, ")"); #elif(VKFFT_BACKEND==2) sprintf(convTypeLeft, "conv_double2("); sprintf(convTypeRight, ")"); #elif(VKFFT_BACKEND==3) sprintf(convTypeLeft, "conv_double2("); sprintf(convTypeRight, ")"); #endif } } char index_x[2000] = ""; char index_y[2000] = ""; char requestCoordinate[100] = ""; if (sc->convolutionStep) { if (sc->matrixConvolution > 1) { sprintf(requestCoordinate, "coordinate"); } } char requestBatch[100] = ""; if (sc->convolutionStep) { if (sc->numKernels > 1) { sprintf(requestBatch, "0");//if one buffer - multiple kernel convolution } } //appendZeropadStart(sc); switch (readType) { case 
0://single_c2c
	{
		//sc->tempLen = sprintf(sc->tempStr, " return;\n");
		char shiftX[500] = "";
		if (sc->performWorkGroupShift[0])
			sprintf(shiftX, " + consts.workGroupShiftX ");
		char shiftY[500] = "";
		if (sc->axisSwapped) {
			if (sc->performWorkGroupShift[1])
				sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_x);
		}
		else {
			if (sc->performWorkGroupShift[1])
				sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_y);
		}
		//shiftY2 is the unscaled workgroup shift added to gl_WorkGroupID_y in the row bounds checks of this case
		char shiftY2[100] = "";
		if (sc->performWorkGroupShift[1])
			sprintf(shiftY2, " + consts.workGroupShiftY ");
		if (sc->fftDim < sc->fft_dim_full) {
			if (sc->axisSwapped) {
				sc->tempLen = sprintf(sc->tempStr, " %s numActiveThreads = ((%s/%" PRIu64 ")==%" PRIu64 ") ? %" PRIu64 " : %" PRIu64 ";\n", uintType, sc->gl_WorkGroupID_x, sc->firstStageStartSize / sc->fftDim, ((uint64_t)floor(sc->fft_dim_full / ((double)sc->localSize[0] * sc->fftDim))) / (sc->firstStageStartSize / sc->fftDim), (sc->fft_dim_full - (sc->firstStageStartSize / sc->fftDim) * ((((uint64_t)floor(sc->fft_dim_full / ((double)sc->localSize[0] * sc->fftDim))) / (sc->firstStageStartSize / sc->fftDim)) * sc->localSize[0] * sc->fftDim)) / sc->min_registers_per_thread / (sc->firstStageStartSize / sc->fftDim), sc->localSize[0] * sc->localSize[1]);// sc->fft_dim_full, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[0] * sc->firstStageStartSize, sc->fft_dim_full / (sc->localSize[0] * sc->fftDim));
				//sc->tempLen = sprintf(sc->tempStr, " if (numActiveThreads>%" PRIu64 ") numActiveThreads = %" PRIu64 ";\n", sc->localSize[0]* sc->localSize[1], sc->localSize[0]* sc->localSize[1]);
				//sprintf(sc->disableThreadsStart, " if((%s+%" PRIu64 "*%s)< numActiveThreads) {\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y);
				res = VkAppendLine(sc);
				if (res != VKFFT_SUCCESS) return res;
				sprintf(sc->disableThreadsStart, " if(%s * %" PRIu64 " + (((%s%s) %% %" 
PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ") < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_x, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[0] * sc->firstStageStartSize, sc->fft_dim_full); sc->tempLen = sprintf(sc->tempStr, " if((%s+%" PRIu64 "*%s)< numActiveThreads) {\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(sc->disableThreadsEnd, "}"); } else { sprintf(sc->disableThreadsStart, " if(%s * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ") < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_y, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1] * sc->firstStageStartSize, sc->fft_dim_full); res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; sprintf(sc->disableThreadsEnd, "}"); } } else { sc->tempLen = sprintf(sc->tempStr, " { \n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->fftDim == sc->fft_dim_full) { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputStride[0] > 1) sc->tempLen = 
sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") * %" PRIu64 " + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->inputStride[0], sc->fftDim, sc->inputStride[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->fftDim, sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if (sc->size[sc->axis_id + 1] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, shiftY2, sc->localSize[0], sc->size[sc->axis_id + 1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->size[sc->axis_id + 1] % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, shiftY2, sc->localSize[1], sc->size[sc->axis_id + 1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if((combinedID %% %" PRIu64 ") < %" PRIu64 "){\n", sc->fft_dim_full, sc->fft_zeropad_Bluestein_left_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->inputStride[1], sc->fft_zeropad_left_read[sc->axis_id], sc->inputStride[1], sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexInputVkFFT(sc, uintType, readType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res 
!= VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s = %s%s[%s]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); } else { if (sc->axisSwapped) { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")] = %s%s[%s]%s;\n", sc->fftDim, sc->fftDim, convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")] = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->fftDim, sc->fftDim, convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride] = %s%s[%s]%s;\n", sc->fftDim, sc->fftDim, convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride] = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->fftDim, sc->fftDim, convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); } } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = 
sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { sc->tempLen = sprintf(sc->tempStr, " %s.x =0;%s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].x = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].y = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { sc->tempLen = sprintf(sc->tempStr, " %s.x =0;%s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].x = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = 
sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].y = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if (sc->size[sc->axis_id + 1] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->size[sc->axis_id + 1] % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } } else { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { /* if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ");\n", sc->fftDim, sc->fftDim, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, 
sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1] * sc->firstStageStartSize); */ if (sc->axisSwapped) { if ((sc->fft_dim_full - (sc->firstStageStartSize / sc->fftDim) * ((((uint64_t)floor(sc->fft_dim_full / ((double)sc->localSize[0] * sc->fftDim))) / (sc->firstStageStartSize / sc->fftDim)) * sc->localSize[0] * sc->fftDim)) / sc->min_registers_per_thread / (sc->firstStageStartSize / sc->fftDim) > sc->localSize[0]) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 "*numActiveThreads;\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread)); } else { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 "*numActiveThreads;\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread)); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 "*numActiveThreads;\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread)); } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ");\n", sc->fftDim, sc->fftDim, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[0] * sc->firstStageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " inoutID = %s+%" PRIu64 "+%s * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" 
PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ");\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0], sc->gl_LocalInvocationID_y, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1] * sc->firstStageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexInputVkFFT(sc, uintType, readType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s = %s%s[%s]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID / %" PRIu64 ") + sharedStride*(combinedID %% %" PRIu64 ")] = %s%s[inoutID]%s;\n", sc->fftDim, 
sc->fftDim, convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID / %" PRIu64 ") + sharedStride*(combinedID %% %" PRIu64 ")] = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", sc->fftDim, sc->fftDim, convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sharedStride*%s + (%s + %" PRIu64 ")] = %s%s[inoutID]%s;\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0], convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[sharedStride*%s + (%s + %" PRIu64 ")] = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0], convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0; %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID / %" PRIu64 ") + sharedStride*(combinedID %% %" PRIu64 ")].x = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID / %" PRIu64 ") + sharedStride*(combinedID %% %" PRIu64 ")].y = 0;\n", 
sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[sharedStride*%s + (%s + %" PRIu64 ")].x = 0;\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sharedStride*%s + (%s + %" PRIu64 ")].y = 0;\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; break; } case 1://grouped_c2c
{ char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); sprintf(sc->disableThreadsStart, " if (((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")+((%s%s) / %" PRIu64 ") * (%" PRIu64 ") < %" PRIu64 ") {\n", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * sc->stageStartSize, sc->fftDim * sc->stageStartSize, sc->size[sc->axis_id]); res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; sprintf(sc->disableThreadsEnd, "}"); for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, " inoutID = (%" PRIu64 " * (%s + %" PRIu64 ") + ((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")+((%s%s) / %" PRIu64 ") * (%" PRIu64 "));\n", sc->stageStartSize, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_GlobalInvocationID_x, shiftX, 
sc->fft_dim_x * sc->stageStartSize, sc->fftDim * sc->stageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 ") < %" PRIu64 "){\n", sc->fft_dim_full, sc->fft_zeropad_Bluestein_left_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x); res = indexInputVkFFT(sc, uintType, readType, index_x, sc->inoutID, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s=%s%s[%s]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s=%sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%s+%" PRIu64 ")+%s]=%s%s[%s]%s;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * 
sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%s+%" PRIu64 ")+%s]=%sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0; %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%s+%" PRIu64 ")+%s].x=0;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%s+%" PRIu64 ")+%s].y=0;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0; %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * 
sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%s+%" PRIu64 ")+%s].x=0;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%s+%" PRIu64 ")+%s].y=0;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; break; } case 2://single_c2c_strided
{ char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x);
//sc->tempLen = sprintf(sc->tempStr, " if(gl_GlobalInvocationID.x%s >= %" PRIu64 ") return; \n", shiftX, sc->size[0] / axis->specializationConstants.fftDim);
sprintf(sc->disableThreadsStart, " if (((%s%s) / %" PRIu64 ") * (%" PRIu64 ") < %" PRIu64 ") {\n", sc->gl_GlobalInvocationID_x, shiftX, sc->stageStartSize, sc->stageStartSize * sc->fftDim, sc->fft_dim_full); res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; sprintf(sc->disableThreadsEnd, "}"); for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, " inoutID = (%s%s) %% (%" PRIu64 ") + %" PRIu64 " * (%s + %" PRIu64 ") + ((%s%s) / %" PRIu64 ") * (%" PRIu64 ");\n", sc->gl_GlobalInvocationID_x, shiftX, sc->stageStartSize, sc->stageStartSize, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_GlobalInvocationID_x, 
shiftX, sc->stageStartSize, sc->stageStartSize * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 ") < %" PRIu64 "){\n", sc->fft_dim_full, sc->fft_zeropad_Bluestein_left_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexInputVkFFT(sc, uintType, readType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s=%s%s[%s]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s=%sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%s+%" PRIu64 ")+%s]=%s%s[%s]%s;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, convTypeLeft, inputsStruct, 
sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%s+%" PRIu64 ")+%s]=%sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0; %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%s+%" PRIu64 ")+%s].x=0;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%s+%" PRIu64 ")+%s].y=0;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0; %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { 
sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%s+%" PRIu64 ")+%s].x=0;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%s+%" PRIu64 ")+%s].y=0;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; break; } case 5://single_r2c
{ char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY "); uint64_t mult = (sc->mergeSequencesR2C) ? 
2 : 1; if (sc->fftDim == sc->fft_dim_full) { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputStride[0] > 1) sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") * %" PRIu64 " + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->inputStride[0], sc->fftDim, mult * sc->inputStride[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->fftDim, mult * sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, shiftY, sc->localSize[0], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, shiftY, sc->localSize[1], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if((combinedID %% %" PRIu64 ") < %" PRIu64 "){\n", sc->fft_dim_full, 
sc->fft_zeropad_Bluestein_left_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->inputStride[1], sc->fft_zeropad_left_read[sc->axis_id], sc->inputStride[1], sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexInputVkFFT(sc, uintType, readType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.x = %s%s[%s]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.y = %s%s[(%s + %" PRIu64 ")]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, inputsStruct, sc->inoutID, sc->inputStride[1], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.y = %sinputBlocks[(%s + %" PRIu64 ")/ %" PRIu64 "]%s[(%s + %" PRIu64 ") %% %" PRIu64 "]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, sc->inoutID, sc->inputStride[1], sc->inputBufferBlockSize, inputsStruct, sc->inoutID, 
sc->inputStride[1], sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread]); else sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->axisSwapped) { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].x = %s%s[%s]%s;\n", sc->fftDim, sc->fftDim, convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->fftDim, sc->fftDim, convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " inoutID += %" PRIu64 ";\n", sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].y = %s%s[inoutID]%s;\n", sc->fftDim, sc->fftDim, convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].y = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", sc->fftDim, sc->fftDim, convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " 
sdata[(combinedID %% %" PRIu64 ") * sharedStride+ (combinedID / %" PRIu64 ")].y = 0;\n", sc->fftDim, sc->fftDim); else sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].y = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x = %s%s[inoutID]%s;\n", sc->fftDim, sc->fftDim, convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", sc->fftDim, sc->fftDim, convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " inoutID += %" PRIu64 ";\n", sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y = %s%s[inoutID]%s;\n", sc->fftDim, sc->fftDim, convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", sc->fftDim, sc->fftDim, convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y = 0;\n", sc->fftDim, sc->fftDim); else sc->tempLen = sprintf(sc->tempStr, " 
sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0; %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].x = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].y = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0; %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if 
(res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].x = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].y = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } } else {
//Not implemented
} break; } case 6: {//single_c2r
//sc->tempLen = sprintf(sc->tempStr, " return;\n");
char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_y); char shiftY2[100] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY2, " + consts.workGroupShiftY "); if (sc->fftDim < sc->fft_dim_full) { if (sc->axisSwapped) sprintf(sc->disableThreadsStart, " if(%s * %" 
PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ") < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_x, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[0] * sc->firstStageStartSize, sc->fft_dim_full); else sprintf(sc->disableThreadsStart, " if(%s * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ") < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_y, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1] * sc->firstStageStartSize, sc->fft_dim_full); res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; sprintf(sc->disableThreadsEnd, "}"); } else { sc->tempLen = sprintf(sc->tempStr, " { \n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } uint64_t mult = (sc->mergeSequencesR2C) ? 2 : 1; if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[0]) sc->fftDim = sc->fft_zeropad_Bluestein_left_read[sc->axis_id]; for (uint64_t k = 0; k < sc->registerBoost; k++) { uint64_t num_in = (sc->axisSwapped) ? 
(uint64_t)ceil(mult * (sc->fftDim / 2 + 1) / (double)sc->localSize[1]) : (uint64_t)ceil(mult * (sc->fftDim / 2 + 1) / (double)sc->localSize[0]); //num_in =(uint64_t)ceil(num_in / (double)sc->min_registers_per_thread); for (uint64_t i = 0; i < num_in; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * num_in) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * num_in) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputStride[0] > 1) sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") * %" PRIu64 " + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim / 2 + 1, sc->inputStride[0], sc->fftDim / 2 + 1, sc->inputStride[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim / 2 + 1, sc->fftDim / 2 + 1, sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if (sc->size[sc->axis_id + 1] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){\n", sc->fftDim / 2 + 1, sc->gl_WorkGroupID_y, shiftY2, mult * sc->localSize[0], sc->size[sc->axis_id + 1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_in) * sc->localSize[0] * sc->localSize[1] >= mult * (sc->fftDim / 2 + 1) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", mult * (sc->fftDim / 2 + 1) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->size[sc->axis_id + 1] % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + 
(%s%s)*%" PRIu64 "< %" PRIu64 "){\n", sc->fftDim / 2 + 1, sc->gl_WorkGroupID_y, shiftY2, mult * sc->localSize[1], sc->size[sc->axis_id + 1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_in) * sc->localSize[0] * sc->localSize[1] >= mult * (sc->fftDim / 2 + 1) * sc->localSize[1]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", mult * (sc->fftDim / 2 + 1) * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->inputStride[1], sc->fft_zeropad_left_read[sc->axis_id], sc->inputStride[1], sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexInputVkFFT(sc, uintType, readType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (0) { /* not enabled */ if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s = %s%s[%s]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (!sc->axisSwapped) { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride] = 
%s%s[%s]%s;\n", mult * (sc->fftDim / 2 + 1), mult * (sc->fftDim / 2 + 1), convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride] = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", mult * (sc->fftDim / 2 + 1), mult * (sc->fftDim / 2 + 1), convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")] = %s%s[%s]%s;\n", mult * (sc->fftDim / 2 + 1), mult * (sc->fftDim / 2 + 1), convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")] = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", mult * (sc->fftDim / 2 + 1), mult * (sc->fftDim / 2 + 1), convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (0) { //not enabled sc->tempLen = sprintf(sc->tempStr, " %s.x =0;%s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (!sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x = 0;\n", mult * (sc->fftDim / 2 + 1), mult * (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) 
return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y = 0;\n", mult * (sc->fftDim / 2 + 1), mult * (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].x = 0;\n", mult * (sc->fftDim / 2 + 1), mult * (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].y = 0;\n", mult * (sc->fftDim / 2 + 1), mult * (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((1 + i + k * num_in) * sc->localSize[0] * sc->localSize[1] >= mult * (sc->fftDim / 2 + 1) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((1 + i + k * num_in) * sc->localSize[0] * sc->localSize[1] >= mult * (sc->fftDim / 2 + 1) * sc->localSize[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->axisSwapped) { if (sc->size[sc->axis_id + 1] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->size[sc->axis_id + 1] % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->mergeSequencesR2C) { if (sc->axisSwapped) { if (i < ((sc->fftDim / 2 + 1) / sc->localSize[1])) { sc->tempLen = sprintf(sc->tempStr, " 
%s.x = sdata[%s + (%s+%" PRIu64 ") * sharedStride].x - sdata[%s + (%s+%" PRIu64 ") * sharedStride].y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, i * sc->localSize[1], sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, i * sc->localSize[1] + (int64_t)ceil(sc->fftDim / 2.0) + (1 - sc->fftDim % 2)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = sdata[%s + (%s+%" PRIu64 ") * sharedStride].y + sdata[%s + (%s+%" PRIu64 ") * sharedStride].x;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, i * sc->localSize[1], sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, i * sc->localSize[1] + (int64_t)ceil(sc->fftDim / 2.0) + (1 - sc->fftDim % 2)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (i >= (uint64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1])) { if (((uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[1] - ((sc->fftDim / 2) % sc->localSize[1] + 1))) > (i - ((int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1]))) * sc->localSize[1]) { if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(%" PRIu64 " > %s){\n", ((uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[1] - ((sc->fftDim / 2) % sc->localSize[1] + 1))) - (i - ((int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1]))) * sc->localSize[1], sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[%s + (%" PRIu64 "-%s) * sharedStride].x + sdata[%s + (%" PRIu64 "-%s) * sharedStride].y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[1] - ((sc->fftDim / 2) % sc->localSize[1] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1])) * sc->localSize[1], 
sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, (int64_t)ceil(sc->fftDim / 2.0) + (1 - sc->fftDim % 2) + (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[1] - ((sc->fftDim / 2) % sc->localSize[1] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1])) * sc->localSize[1], sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = -sdata[%s + (%" PRIu64 "-%s) * sharedStride].y + sdata[%s + (%" PRIu64 "-%s) * sharedStride].x;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[1] - ((sc->fftDim / 2) % sc->localSize[1] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1])) * sc->localSize[1], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, (int64_t)ceil(sc->fftDim / 2.0) + (1 - sc->fftDim % 2) + (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[1] - ((sc->fftDim / 2) % sc->localSize[1] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1])) * sc->localSize[1], sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", 
sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " if(%s < %" PRIu64 "){;\n", sc->gl_LocalInvocationID_y, (sc->fftDim / 2 + 1) % sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[%s + (%s+%" PRIu64 ") * sharedStride].x - sdata[%s + (%s+%" PRIu64 ") * sharedStride].y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, i * sc->localSize[1], sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, i * sc->localSize[1] + (int64_t)ceil(sc->fftDim / 2.0) + (1 - sc->fftDim % 2)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = sdata[%s + (%s+%" PRIu64 ") * sharedStride].y + sdata[%s + (%s+%" PRIu64 ") * sharedStride].x;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, i * sc->localSize[1], sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, i * sc->localSize[1] + (int64_t)ceil(sc->fftDim / 2.0) + (1 - sc->fftDim % 2)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[%s + (%" PRIu64 "-%s) * sharedStride].x + sdata[%s + (%" PRIu64 "-%s) * sharedStride].y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[1] - ((sc->fftDim / 2) % sc->localSize[1] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1])) * sc->localSize[1], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, (int64_t)ceil(sc->fftDim / 2.0) + (1 - sc->fftDim % 2) + (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[1] - ((sc->fftDim / 2) % sc->localSize[1] + 1)) - (i - 
(int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1])) * sc->localSize[1], sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = -sdata[%s + (%" PRIu64 "-%s) * sharedStride].y + sdata[%s + (%" PRIu64 "-%s) * sharedStride].x;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[1] - ((sc->fftDim / 2) % sc->localSize[1] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1])) * sc->localSize[1], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, (int64_t)ceil(sc->fftDim / 2.0) + (1 - sc->fftDim % 2) + (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[1] - ((sc->fftDim / 2) % sc->localSize[1] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1])) * sc->localSize[1], sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } else { if (i < ((sc->fftDim / 2 + 1) / sc->localSize[0])) { sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[%s * sharedStride + (%s+%" PRIu64 ")].x - sdata[%s * sharedStride + (%s+%" PRIu64 ")].y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i * sc->localSize[0], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i * sc->localSize[0] + (int64_t)ceil(sc->fftDim / 2.0) + (1 - sc->fftDim % 2)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = sdata[%s * sharedStride + (%s+%" PRIu64 ")].y + sdata[%s * sharedStride + (%s+%" PRIu64 ")].x;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i * sc->localSize[0], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i * sc->localSize[0] + (int64_t)ceil(sc->fftDim / 2.0) + 
(1 - sc->fftDim % 2)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (i >= (uint64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0])) { if (((uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[0] - ((sc->fftDim / 2) % sc->localSize[0] + 1))) > (i - ((int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0]))) * sc->localSize[0]) { if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(%" PRIu64 " > %s){\n", ((uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[0] - ((sc->fftDim / 2) % sc->localSize[0] + 1))) - (i - ((int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0]))) * sc->localSize[0], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[%s * sharedStride + (%" PRIu64 "-%s)].x + sdata[%s * sharedStride + (%" PRIu64 "-%s)].y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_y, (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[0] - ((sc->fftDim / 2) % sc->localSize[0] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0])) * sc->localSize[0], sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, (int64_t)ceil(sc->fftDim / 2.0) + (1 - sc->fftDim % 2) + (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[0] - ((sc->fftDim / 2) % sc->localSize[0] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0])) * sc->localSize[0], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = -sdata[%s * sharedStride + (%" PRIu64 "-%s)].y + sdata[%s * sharedStride + (%" PRIu64 "-%s)].x;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_y, (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[0] - ((sc->fftDim / 2) % sc->localSize[0] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0])) * sc->localSize[0], 
sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, (int64_t)ceil(sc->fftDim / 2.0) + (1 - sc->fftDim % 2) + (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[0] - ((sc->fftDim / 2) % sc->localSize[0] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0])) * sc->localSize[0], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " if(%s < %" PRIu64 "){;\n", sc->gl_LocalInvocationID_x, (sc->fftDim / 2 + 1) % sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[%s * sharedStride + (%s+%" PRIu64 ")].x - sdata[%s * sharedStride + (%s+%" PRIu64 ")].y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i * sc->localSize[0], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i * sc->localSize[0] + (int64_t)ceil(sc->fftDim / 2.0) + (1 - sc->fftDim % 2)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = 
sprintf(sc->tempStr, " %s.y = sdata[%s * sharedStride + (%s+%" PRIu64 ")].y + sdata[%s * sharedStride + (%s+%" PRIu64 ")].x;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i * sc->localSize[0], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i * sc->localSize[0] + (int64_t)ceil(sc->fftDim / 2.0) + (1 - sc->fftDim % 2)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[%s * sharedStride + (%" PRIu64 "-%s)].x + sdata[%s * sharedStride + (%" PRIu64 "-%s)].y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_y, (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[0] - ((sc->fftDim / 2) % sc->localSize[0] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0])) * sc->localSize[0], sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, (int64_t)ceil(sc->fftDim / 2.0) + (1 - sc->fftDim % 2) + (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[0] - ((sc->fftDim / 2) % sc->localSize[0] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0])) * sc->localSize[0], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = -sdata[%s * sharedStride + (%" PRIu64 "-%s)].y + sdata[%s * sharedStride + (%" PRIu64 "-%s)].x;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_y, (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[0] - ((sc->fftDim / 2) % sc->localSize[0] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0])) * sc->localSize[0], sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, (int64_t)ceil(sc->fftDim / 2.0) + (1 - sc->fftDim % 2) + (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[0] - ((sc->fftDim / 2) % sc->localSize[0] + 1)) - (i 
- (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0])) * sc->localSize[0], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } else { if (sc->axisSwapped) { if (i < ((sc->fftDim / 2 + 1) / sc->localSize[1])) { sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[%s + (%s+%" PRIu64 ") * sharedStride].x;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, i * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = sdata[%s + (%s+%" PRIu64 ") * sharedStride].y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, i * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (i >= (uint64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1])) { if (((uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[1] - ((sc->fftDim / 2) % sc->localSize[1] + 1))) > (i - ((int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1]))) * sc->localSize[1]) { if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(%" PRIu64 " > %s){\n", ((uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[1] - ((sc->fftDim / 2) % sc->localSize[1] + 1))) - (i - ((int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1]))) * sc->localSize[1], sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[%s + (%" PRIu64 "-%s) * sharedStride].x;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[1] - ((sc->fftDim / 2) % sc->localSize[1] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1])) * sc->localSize[1], 
sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = -sdata[%s + (%" PRIu64 "-%s) * sharedStride].y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[1] - ((sc->fftDim / 2) % sc->localSize[1] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1])) * sc->localSize[1], sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " if(%s < %" PRIu64 "){;\n", sc->gl_LocalInvocationID_y, (sc->fftDim / 2 + 1) % sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[%s + (%s+%" PRIu64 ") * sharedStride].x;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, i * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = sdata[%s + (%s+%" 
PRIu64 ") * sharedStride].y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, i * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[%s + (%" PRIu64 "-%s) * sharedStride].x;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, (uint64_t)ceil(sc->fftDim / 2.0) - 1 + (sc->fftDim / 2 + 1) % sc->localSize[1], sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = -sdata[%s + (%" PRIu64 "-%s) * sharedStride].y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, (uint64_t)ceil(sc->fftDim / 2.0) - 1 + (sc->fftDim / 2 + 1) % sc->localSize[1], sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } else { if (i < ((sc->fftDim / 2 + 1) / sc->localSize[0])) { sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[%s * sharedStride + (%s+%" PRIu64 ")].x;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = sdata[%s * sharedStride + (%s+%" PRIu64 ")].y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (i >= (uint64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0])) { if (((uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[0] - ((sc->fftDim / 2) % sc->localSize[0] + 1))) > (i - ((int64_t)ceil((sc->fftDim / 2 + 1) / 
(double)sc->localSize[0]))) * sc->localSize[0]) { if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(%" PRIu64 " > %s){\n", ((uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[0] - ((sc->fftDim / 2) % sc->localSize[0] + 1))) - (i - ((int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0]))) * sc->localSize[0], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[%s * sharedStride + (%" PRIu64 "-%s)].x;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_y, (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[0] - ((sc->fftDim / 2) % sc->localSize[0] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0])) * sc->localSize[0], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = -sdata[%s * sharedStride + (%" PRIu64 "-%s)].y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_y, (uint64_t)ceil(sc->fftDim / 2.0) - 1 - (sc->localSize[0] - ((sc->fftDim / 2) % sc->localSize[0] + 1)) - (i - (int64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0])) * sc->localSize[0], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0;\n", sc->regIDs[i + k * 
sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " if(%s < %" PRIu64 "){;\n", sc->gl_LocalInvocationID_x, (sc->fftDim / 2 + 1) % sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[%s * sharedStride + (%s+%" PRIu64 ")].x;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = sdata[%s * sharedStride + (%s+%" PRIu64 ")].y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[%s * sharedStride + (%" PRIu64 "-%s)].x;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_y, (uint64_t)ceil(sc->fftDim / 2.0) - 1 + (sc->fftDim / 2 + 1) % sc->localSize[0], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = -sdata[%s * sharedStride + (%" PRIu64 "-%s)].y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_y, (uint64_t)ceil(sc->fftDim / 2.0) - 1 + (sc->fftDim / 2 + 1) % sc->localSize[0], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } } } //sc->readToRegisters = 1; if (sc->zeropadBluestein[0]) sc->fftDim = 
sc->fft_dim_full; } else { } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; break; } case 110://DCT-I nonstrided { char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY "); uint64_t mult = (sc->mergeSequencesR2C) ? 2 : 1; if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[0]) { res = appendSetSMToZero(sc, floatType, floatTypeMemory, uintType, readType); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; sc->fftDim = sc->fft_zeropad_Bluestein_left_read[sc->axis_id]; } sc->fftDim = (sc->fftDim + 2) / 2; uint64_t num_in = (sc->axisSwapped) ? (uint64_t)ceil((sc->fftDim) / (double)sc->localSize[1]) : (uint64_t)ceil((sc->fftDim) / (double)sc->localSize[0]); for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < num_in; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * num_in) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * num_in) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputStride[0] > 1) sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") * %" PRIu64 " + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->inputStride[0], sc->fftDim, mult * sc->inputStride[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->fftDim, mult * sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if 
((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, shiftY, sc->localSize[0], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_in) * sc->localSize[1] >= (sc->fftDim)) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, shiftY, sc->localSize[1], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_in) * sc->localSize[0] >= (sc->fftDim)) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim) * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->inputStride[1], sc->fft_zeropad_left_read[sc->axis_id], sc->inputStride[1], sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != 
VKFFT_SUCCESS) return res; res = indexInputVkFFT(sc, uintType, readType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s%s[%s]%s;\n", convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " inoutID += %" PRIu64 ";\n", sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s%s[inoutID]%s;\n", convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if (((combinedID %% %" PRIu64 ")>0)&&((combinedID %% 
%" PRIu64 ") < %" PRIu64 ")){\n", sc->fftDim, sc->fftDim, sc->fftDim - 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " inoutID = (%" PRIu64 " - combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", 2 * sc->fftDim - 2, sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[inoutID] = sdata[sdataID];\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s%s[inoutID]%s;\n", convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " inoutID += %" PRIu64 ";\n", sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s%s[inoutID]%s;\n", convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " 
sdata[sdataID].y = 0;\n"); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if (((combinedID %% %" PRIu64 ")>0)&&((combinedID %% %" PRIu64 ") < %" PRIu64 ")){\n", sc->fftDim, sc->fftDim, sc->fftDim - 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " inoutID = (%" PRIu64 " - combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride;\n", 2 * sc->fftDim - 2, sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[inoutID] = sdata[sdataID];\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " if 
(((combinedID %% %" PRIu64 ")>0)&&((combinedID %% %" PRIu64 ") < %" PRIu64 ")){\n", sc->fftDim, sc->fftDim, sc->fftDim - 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " inoutID = (%" PRIu64 " - combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", 2 * sc->fftDim - 2, sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[inoutID] = sdata[sdataID];\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " if (((combinedID %% %" PRIu64 ")>0)&&((combinedID %% %" PRIu64 ") < %" PRIu64 ")){\n", sc->fftDim, sc->fftDim, sc->fftDim - 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " inoutID = (%" PRIu64 " - combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride;\n", 2 * sc->fftDim - 2, sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[inoutID] = sdata[sdataID];\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((1 + i + k * num_in) * sc->localSize[1] >= (sc->fftDim)) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((1 + i + k * num_in) * sc->localSize[0] >= (sc->fftDim)) { sc->tempLen = sprintf(sc->tempStr, " }\n"); 
res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } sc->fftDim = 2 * sc->fftDim - 2; if (sc->zeropadBluestein[0]) sc->fftDim = sc->fft_dim_full; } else { //Not implemented } break; } case 111://DCT-I strided { char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftX2[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX2, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY "); uint64_t mult = (sc->mergeSequencesR2C) ? 2 : 1; if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[0]) { res = appendSetSMToZero(sc, floatType, floatTypeMemory, uintType, readType); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; sc->fftDim = sc->fft_zeropad_Bluestein_left_read[sc->axis_id]; } sc->fftDim = (sc->fftDim + 2) / 2; uint64_t num_in = (uint64_t)ceil((sc->fftDim) / (double)sc->localSize[1]); for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < num_in; i++) { //sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * mult * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); //res = VkAppendLine(sc); //if (res != VKFFT_SUCCESS) return res; if ((uint64_t)ceil(sc->size[0] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if ((%s%s) < %" PRIu64 ") {\n", sc->gl_GlobalInvocationID_x, shiftX2, (uint64_t)ceil(sc->size[0] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->mergeSequencesR2C) sc->tempLen = sprintf(sc->tempStr, " 
combinedID = (%s + %" PRIu64 ") / %" PRIu64 ";\n", sc->gl_LocalInvocationID_y, (i + k * num_in) * sc->localSize[1], mult); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 ");\n", sc->gl_LocalInvocationID_y, (i + k * num_in) * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if((combinedID %% %" PRIu64 ") < %" PRIu64 "){\n", sc->fft_dim_full, sc->fft_zeropad_Bluestein_left_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_in) * sc->localSize[1] >= (sc->fftDim)) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->mergeSequencesR2C) sc->tempLen = sprintf(sc->tempStr, " //sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) * sharedStride + (%s + ((%s + %" PRIu64 ") %% %" PRIu64 ") * %" PRIu64 ") / %" PRIu64 ";\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], mult, sc->localSize[0], mult); else sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") * sharedStride + %s;\n", sc->fftDim, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sprintf(index_x, "(%s + %" PRIu64 " * ((%s %% %" PRIu64 ") + (%s%s) * %" PRIu64 ")) %% (%" PRIu64 ")", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, mult, sc->gl_WorkGroupID_x, shiftX, mult, sc->fft_dim_x); sprintf(index_y, "(%s/%" PRIu64 " + %" PRIu64 ")", sc->gl_LocalInvocationID_y, mult, (i + k * num_in) * 
sc->localSize[1]); } else { sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX2, sc->fft_dim_x); sprintf(index_y, "(%s + %" PRIu64 ")", sc->gl_LocalInvocationID_y, (i + k * num_in) * sc->localSize[1]); } res = indexInputVkFFT(sc, uintType, readType, index_x, index_y, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((%s %% %" PRIu64 " < %" PRIu64 ")||(%s %% %" PRIu64 " >= %" PRIu64 ")){\n", index_y, sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], index_y, sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.x = %s%s[%s]%s;\n", sc->regIDs[0], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[0], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " if ((%s %% 2) == 0) {\n", sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s.x;\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " } else {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s.x;\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); 
res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s.x;\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if (((combinedID %% %" PRIu64 ")>0)&&((combinedID %% %" PRIu64 ") < %" PRIu64 ")){\n", sc->fftDim, sc->fftDim, sc->fftDim - 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " inoutID = (%" PRIu64 " - combinedID %% %" PRIu64 ") * sharedStride + %s;\n", 2 * sc->fftDim - 2, sc->fftDim, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[inoutID] = sdata[sdataID];\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " if ((%s %% 2) == 0) {\n", sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " } else {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = 0;\n"); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if (((combinedID %% %" PRIu64 ")>0)&&((combinedID %% %" PRIu64 ") < %" PRIu64 ")){\n", sc->fftDim, sc->fftDim, sc->fftDim - 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " inoutID = (%" PRIu64 " - combinedID %% %" PRIu64 ") * sharedStride + %s;\n", 2 * sc->fftDim - 2, sc->fftDim, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[inoutID] = sdata[sdataID];\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_in) * sc->localSize[1] >= (sc->fftDim)) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((uint64_t)ceil(sc->size[0] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } sc->fftDim = 2 * sc->fftDim - 2; if (sc->zeropadBluestein[0]) sc->fftDim = sc->fft_dim_full; } else { //Not implemented } break; } case 120://DCT-II nonstrided { char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY "); uint64_t mult = (sc->mergeSequencesR2C) ? 
2 : 1; if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[0]) { res = appendSetSMToZero(sc, floatType, floatTypeMemory, uintType, readType); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; sc->fftDim = sc->fft_zeropad_Bluestein_left_read[sc->axis_id]; } for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputStride[0] > 1) sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") * %" PRIu64 " + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->inputStride[0], sc->fftDim, mult * sc->inputStride[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->fftDim, mult * sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, shiftY, sc->localSize[0], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if 
((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, shiftY, sc->localSize[1], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->inputStride[1], sc->fft_zeropad_left_read[sc->axis_id], sc->inputStride[1], sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexInputVkFFT(sc, uintType, readType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s%s[%s]%s;\n", convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", convTypeLeft, 
sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " inoutID += %" PRIu64 ";\n", sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s%s[inoutID]%s;\n", convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s%s[inoutID]%s;\n", convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " inoutID += %" PRIu64 ";\n", sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) 
return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s%s[inoutID]%s;\n", convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); 
if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } if (sc->zeropadBluestein[0]) sc->fftDim = sc->fft_dim_full; } else { //Not implemented } break; } case 121://DCT-II strided { char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftX2[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX2, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY "); uint64_t mult = (sc->mergeSequencesR2C) ? 
2 : 1; if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[0]) { res = appendSetSMToZero(sc, floatType, floatTypeMemory, uintType, readType); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; sc->fftDim = sc->fft_zeropad_Bluestein_left_read[sc->axis_id]; } for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < mult * sc->min_registers_per_thread; i++) { //sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * mult * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); //res = VkAppendLine(sc); //if (res != VKFFT_SUCCESS) return res; if ((uint64_t)ceil(sc->size[0] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if ((%s%s) < %" PRIu64 ") {\n", sc->gl_GlobalInvocationID_x, shiftX2, (uint64_t)ceil(sc->size[0] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->mergeSequencesR2C) sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 ") / %" PRIu64 ";\n", sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], mult); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 ");\n", sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if((combinedID %% %" PRIu64 ") < %" PRIu64 "){\n", sc->fft_dim_full, sc->fft_zeropad_Bluestein_left_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->mergeSequencesR2C) sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) * sharedStride + (%s + ((%s + %" PRIu64 ") %% %" 
PRIu64 ") * %" PRIu64 ") / %" PRIu64 ";\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], mult, sc->localSize[0], mult); else sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) * sharedStride + %s;\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sprintf(index_x, "(%s + %" PRIu64 " * ((%s %% %" PRIu64 ") + (%s%s) * %" PRIu64 ")) %% (%" PRIu64 ")", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, mult, sc->gl_WorkGroupID_x, shiftX, mult, sc->fft_dim_x); sprintf(index_y, "(%s/%" PRIu64 " + %" PRIu64 ")", sc->gl_LocalInvocationID_y, mult, (i + k * sc->min_registers_per_thread) * sc->localSize[1]); } else { sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX2, sc->fft_dim_x); sprintf(index_y, "(%s + %" PRIu64 ")", sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1]); } res = indexInputVkFFT(sc, uintType, readType, index_x, index_y, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((%s %% %" PRIu64 " < %" PRIu64 ")||(%s %% %" PRIu64 " >= %" PRIu64 ")){\n", index_y, sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], index_y, sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != 
VKFFT_SUCCESS) return res; } if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.x = %s%s[%s]%s;\n", sc->regIDs[0], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[0], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " if ((%s %% 2) == 0) {\n", sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s.x;\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " } else {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s.x;\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s.x;\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " if ((%s %% 2) == 0) {\n", sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; 
sc->tempLen = sprintf(sc->tempStr, " } else {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((uint64_t)ceil(sc->size[0] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } if (sc->zeropadBluestein[0]) sc->fftDim = sc->fft_dim_full; } else { /* Not implemented */ } break; } case 130: /* DCT-III nonstrided */ { char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY "); uint64_t mult = (sc->mergeSequencesR2C) ? 2 : 1; if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[0]) { res = appendSetSMToZero(sc, floatType, floatTypeMemory, uintType, readType); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; sc->fftDim = sc->fft_zeropad_Bluestein_left_read[sc->axis_id]; } uint64_t num_in = (sc->axisSwapped) ? 
(uint64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1]) : (uint64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0]); for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < num_in; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * num_in) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * num_in) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " %s = combinedID %% %" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1, mult * sc->inputStride[1]); } else { sc->tempLen = sprintf(sc->tempStr, " %s = combinedID %% %" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1, mult * sc->inputStride[1]); } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){\n", (sc->fftDim / 2 + 1), sc->gl_WorkGroupID_y, sc->localSize[0], (uint64_t)ceil(sc->size[sc->axis_id + 1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_in) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim / 2 + 1) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim / 2 + 1) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" 
PRIu64 "< %" PRIu64 "){\n", (sc->fftDim / 2 + 1), sc->gl_WorkGroupID_y, sc->localSize[1], (uint64_t)ceil(sc->size[sc->axis_id + 1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_in) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim / 2 + 1) * sc->localSize[1]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim / 2 + 1) * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexInputVkFFT(sc, uintType, readType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " mult = twiddleLUT[%" PRIu64 " + combinedID %% %" PRIu64 "];\n", sc->startDCT3LUT, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " mult.x = %s(%.17e%s * (combinedID %% %" PRIu64 ") );\n", cosDef, double_PI / 2 / sc->fftDim, LFending, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = %s(%.17e%s * (combinedID %% %" PRIu64 ") );\n", sinDef, double_PI / 2 / sc->fftDim, LFending, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->inputStride[1], sc->fft_zeropad_left_read[sc->axis_id], sc->inputStride[1], sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->inputBufferBlockNum == 1) sc->tempLen = 
sprintf(sc->tempStr, " %s.x = %s%s[%s]%s;\n", sc->regIDs[0], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[0], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " inoutID += %" PRIu64 ";\n", sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.y = %s%s[inoutID]%s;\n", sc->regIDs[0], convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.y = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", sc->regIDs[0], convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0;\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim / 2 + 1, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = 
(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride ;\n", sc->fftDim / 2 + 1, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if (combinedID %% %" PRIu64 " > 0){\n", sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = (%" PRIu64 " - combinedID %% %" PRIu64 ") + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->fftDim, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1, mult * sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->inputStride[1], sc->fft_zeropad_left_read[sc->axis_id], sc->inputStride[1], sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexInputVkFFT(sc, uintType, readType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.x = %s%s[%s]%s;\n", sc->regIDs[1], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[1], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " inoutID += %" PRIu64 ";\n", sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if 
(sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.y = %s%s[inoutID]%s;\n", sc->regIDs[1], convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.y = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", sc->regIDs[1], convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0;\n", sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = ((%s.x+%s.y)*mult.x+(%s.x-%s.y)*mult.y);\n", sc->regIDs[0], sc->regIDs[1], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = ((-%s.x+%s.y)*mult.x+(%s.x+%s.y)*mult.y);\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[0], sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (%" PRIu64 " - combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (%" PRIu64 " - combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim / 2 + 1, 
sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = ((%s.x-%s.y)*mult.x+(%s.x+%s.y)*mult.y);\n", sc->regIDs[0], sc->regIDs[1], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = ((%s.x+%s.y)*mult.x-(%s.x-%s.y)*mult.y);\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[0], sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " } else {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = (%s.x*mult.x-%s.y*mult.y);\n", sc->regIDs[0], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = (%s.y*mult.x+%s.x*mult.y);\n", sc->regIDs[0], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if ((1 + i + k * num_in) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim / 2 + 1) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((1 + i + k * num_in) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim / 2 + 1) * sc->localSize[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } if (sc->zeropadBluestein[0]) sc->fftDim = sc->fft_dim_full; } else { /* Not implemented */ } break; } case 131: /* DCT-III strided */ { char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftX2[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX2, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY "); uint64_t mult = (sc->mergeSequencesR2C) ? 2 : 1; uint64_t num_in = (uint64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1]); if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[0]) { res = appendSetSMToZero(sc, floatType, floatTypeMemory, uintType, readType); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; sc->fftDim = sc->fft_zeropad_Bluestein_left_read[sc->axis_id]; } for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < num_in; i++) { /* sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * mult * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; */ if ((uint64_t)ceil(sc->size[0] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if ((%s%s) < %" PRIu64 ") {\n", sc->gl_GlobalInvocationID_x, shiftX2, (uint64_t)ceil(sc->size[0] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->mergeSequencesR2C) sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 ") / %" PRIu64 ";\n", sc->gl_LocalInvocationID_y, (i + k * num_in) * sc->localSize[1], mult); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 ");\n", sc->gl_LocalInvocationID_y, (i + k * num_in) 
* sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if ((1 + i + k * num_in) * sc->localSize[1] >= (sc->fftDim / 2 + 1)) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sprintf(index_x, "(%s + %" PRIu64 " * ((%s %% %" PRIu64 ") + (%s%s) * %" PRIu64 ")) %% (%" PRIu64 ")", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, mult, sc->gl_WorkGroupID_x, shiftX, mult, sc->fft_dim_x); sprintf(index_y, "(%s/%" PRIu64 " + %" PRIu64 ")", sc->gl_LocalInvocationID_y, mult, (i + k * num_in) * sc->localSize[1]); } else { sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX2, sc->fft_dim_x); sprintf(index_y, "(%s + %" PRIu64 ")", sc->gl_LocalInvocationID_y, (i + k * num_in) * sc->localSize[1]); } res = indexInputVkFFT(sc, uintType, readType, index_x, index_y, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((%s %% %" PRIu64 " < %" PRIu64 ")||(%s %% %" PRIu64 " >= %" PRIu64 ")){\n", index_y, sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], index_y, sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.x = %s%s[%s]%s;\n", sc->regIDs[0], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[0], 
convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0;\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->mergeSequencesR2C) { } else { sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " mult = twiddleLUT[%" PRIu64 " + combinedID];\n", sc->startDCT3LUT); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " mult.x = %s(%.17e%s * (combinedID) );\n", cosDef, double_PI / 2 / sc->fftDim, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = %s(%.17e%s * (combinedID) );\n", sinDef, double_PI / 2 / sc->fftDim, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } /* sc->tempLen = sprintf(sc->tempStr, " printf(\" %%f - %%f \\n\", mult.x, mult.y);\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; */ if (sc->mergeSequencesR2C) sc->tempLen = sprintf(sc->tempStr, " //sdataID = (combinedID) * sharedStride + (%s + ((%s + %" PRIu64 ") %% %" PRIu64 ") * %" PRIu64 ") / %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], mult, sc->localSize[0], mult); else sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID) * sharedStride + %s;\n", sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if 
(combinedID > 0){\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sprintf(index_x, "(%s + %" PRIu64 " * ((%s %% %" PRIu64 ") + (%s%s) * %" PRIu64 ")) %% (%" PRIu64 ")", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, mult, sc->gl_WorkGroupID_x, shiftX, mult, sc->fft_dim_x); sprintf(index_y, "(%" PRIu64 " - (%s/%" PRIu64 " + %" PRIu64 "))", sc->fftDim, sc->gl_LocalInvocationID_y, mult, (i + k * num_in) * sc->localSize[1]); } else { sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX2, sc->fft_dim_x); sprintf(index_y, "(%" PRIu64 " - (%s + %" PRIu64 "))", sc->fftDim, sc->gl_LocalInvocationID_y, (i + k * num_in) * sc->localSize[1]); } res = indexInputVkFFT(sc, uintType, readType, index_x, index_y, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((%s %% %" PRIu64 " < %" PRIu64 ")||(%s %% %" PRIu64 " >= %" PRIu64 ")){\n", index_y, sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], index_y, sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.x = %s%s[%s]%s;\n", sc->regIDs[1], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[1], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0;\n", sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->mergeSequencesR2C) { } else { sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = ((%s.x+%s.y)*mult.x-(%s.y-%s.x)*mult.y);\n", sc->regIDs[0], sc->regIDs[1], sc->regIDs[0], sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = ((%s.y-%s.x)*mult.x+(%s.x+%s.y)*mult.y);\n", sc->regIDs[0], sc->regIDs[1], sc->regIDs[0], sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdataID = (%" PRIu64 " - combinedID) * sharedStride + %s;\n", sc->fftDim, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = ((%s.x+%s.y)*mult.x-(%s.y-%s.x)*mult.y);\n", sc->regIDs[0], sc->regIDs[1], sc->regIDs[0], sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = -((%s.y-%s.x)*mult.x+(%s.x+%s.y)*mult.y);\n", sc->regIDs[0], sc->regIDs[1], sc->regIDs[0], sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " } else {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = ((%s.x)*mult.x-(%s.y)*mult.y);\n", sc->regIDs[0], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = ((%s.y)*mult.x+(%s.x)*mult.y);\n", sc->regIDs[0], sc->regIDs[0]); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if ((uint64_t)ceil(sc->size[0] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_in) * sc->localSize[1] >= (sc->fftDim / 2 + 1)) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } if (sc->zeropadBluestein[0]) sc->fftDim = sc->fft_dim_full; } else { /* Not implemented */ } break; } case 140: /* DCT-IV nonstrided cast to 8x FFT */ { char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftY[500] = ""; if (sc->axisSwapped) { if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_x); } else { if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_y); } char shiftY2[100] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY "); if (sc->fftDim < sc->fft_dim_full) { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " %s numActiveThreads = ((%s/%" PRIu64 ")==%" PRIu64 ") ? 
%" PRIu64 " : %" PRIu64 ";\n", uintType, sc->gl_WorkGroupID_x, sc->firstStageStartSize / sc->fftDim, ((uint64_t)floor(sc->fft_dim_full / ((double)sc->localSize[0] * sc->fftDim))) / (sc->firstStageStartSize / sc->fftDim), (sc->fft_dim_full - (sc->firstStageStartSize / sc->fftDim) * ((((uint64_t)floor(sc->fft_dim_full / ((double)sc->localSize[0] * sc->fftDim))) / (sc->firstStageStartSize / sc->fftDim)) * sc->localSize[0] * sc->fftDim)) / sc->min_registers_per_thread / (sc->firstStageStartSize / sc->fftDim), sc->localSize[0] * sc->localSize[1]); /* sc->fft_dim_full, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[0] * sc->firstStageStartSize, sc->fft_dim_full / (sc->localSize[0] * sc->fftDim)); */ res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(sc->disableThreadsStart, " if(%s * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ") < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_x, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[0] * sc->firstStageStartSize, sc->fft_dim_full); sc->tempLen = sprintf(sc->tempStr, " if((%s+%" PRIu64 "*%s)< numActiveThreads) {\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(sc->disableThreadsEnd, "}"); } else { sprintf(sc->disableThreadsStart, " if(%s * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ") < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_y, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1] * sc->firstStageStartSize, sc->fft_dim_full); res = 
VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; sprintf(sc->disableThreadsEnd, "}"); } } else { sc->tempLen = sprintf(sc->tempStr, " { \n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s.x = 0;\n", sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->fftDim == sc->fft_dim_full) { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < (uint64_t)ceil(sc->min_registers_per_thread / 8.0); i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputStride[0] > 1) sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") * %" PRIu64 " + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim / 8, sc->inputStride[0], sc->fftDim / 8, sc->inputStride[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim / 8, sc->fftDim / 8, sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if (sc->size[sc->axis_id + 1] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim / 8, sc->gl_WorkGroupID_y, shiftY2, sc->localSize[0], sc->size[sc->axis_id + 1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } 
sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim / 8 * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->size[sc->axis_id + 1] % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim / 8, sc->gl_WorkGroupID_y, shiftY2, sc->localSize[1], sc->size[sc->axis_id + 1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim / 8 * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->inputStride[1], sc->fft_zeropad_left_read[sc->axis_id], sc->inputStride[1], sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexInputVkFFT(sc, uintType, readType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.x = %s%s[%s]%s;\n", sc->regIDs[0], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[0], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[0]); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdata[2*(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")] = %s;\n", sc->fftDim / 8, sc->fftDim / 8, sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(2*(combinedID %% %" PRIu64 ")+1) * sharedStride + (combinedID / %" PRIu64 ")] = %s;\n", sc->fftDim / 8, sc->fftDim / 8, sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(%" PRIu64 " - 2*(combinedID %% %" PRIu64 ")) * sharedStride + (combinedID / %" PRIu64 ")] = %s;\n", sc->fftDim - 2, sc->fftDim / 8, sc->fftDim / 8, sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(%" PRIu64 " - 2*(combinedID %% %" PRIu64 ")) * sharedStride + (combinedID / %" PRIu64 ")] = %s;\n", sc->fftDim - 1, sc->fftDim / 8, sc->fftDim / 8, sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = - %s.x;\n", sc->regIDs[0], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(%" PRIu64 " - 2*(combinedID %% %" PRIu64 ")) * sharedStride + (combinedID / %" PRIu64 ")] = %s;\n", sc->fftDim / 2 - 2, sc->fftDim / 8, sc->fftDim / 8, sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(%" PRIu64 " - 2*(combinedID %% %" PRIu64 ")) * sharedStride + (combinedID / %" PRIu64 ")] = %s;\n", sc->fftDim / 2 - 1, sc->fftDim / 8, sc->fftDim / 8, sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(%" PRIu64 " + 2*(combinedID %% %" PRIu64 ")) * sharedStride + (combinedID / %" PRIu64 ")] = %s;\n", sc->fftDim / 2, sc->fftDim / 8, sc->fftDim / 8, 
sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(%" PRIu64 " + 2*(combinedID %% %" PRIu64 ")) * sharedStride + (combinedID / %" PRIu64 ")] = %s;\n", sc->fftDim / 2 + 1, sc->fftDim / 8, sc->fftDim / 8, sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[2*(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride] = %s;\n", sc->fftDim / 8, sc->fftDim / 8, sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(2*(combinedID %% %" PRIu64 ")+1) + (combinedID / %" PRIu64 ") * sharedStride] = %s;\n", sc->fftDim / 8, sc->fftDim / 8, sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(%" PRIu64 " - 2*(combinedID %% %" PRIu64 ")) + (combinedID / %" PRIu64 ") * sharedStride] = %s;\n", sc->fftDim - 2, sc->fftDim / 8, sc->fftDim / 8, sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(%" PRIu64 " - 2*(combinedID %% %" PRIu64 ")) + (combinedID / %" PRIu64 ") * sharedStride] = %s;\n", sc->fftDim - 1, sc->fftDim / 8, sc->fftDim / 8, sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = - %s.x;\n", sc->regIDs[0], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(%" PRIu64 " - 2*(combinedID %% %" PRIu64 ")) + (combinedID / %" PRIu64 ") * sharedStride] = %s;\n", sc->fftDim / 2 - 2, sc->fftDim / 8, sc->fftDim / 8, sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(%" PRIu64 " - 2*(combinedID %% %" PRIu64 ")) + (combinedID / %" PRIu64 ") * sharedStride] = %s;\n", sc->fftDim / 2 - 1, sc->fftDim / 8, sc->fftDim / 8, 
sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(%" PRIu64 " + 2*(combinedID %% %" PRIu64 ")) + (combinedID / %" PRIu64 ") * sharedStride] = %s;\n", sc->fftDim / 2, sc->fftDim / 8, sc->fftDim / 8, sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(%" PRIu64 " + 2*(combinedID %% %" PRIu64 ")) + (combinedID / %" PRIu64 ") * sharedStride] = %s;\n", sc->fftDim / 2 + 1, sc->fftDim / 8, sc->fftDim / 8, sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { sc->tempLen = sprintf(sc->tempStr, " %s.x =0;%s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].x = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].y = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } sc->tempLen = 
sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if (sc->size[sc->axis_id + 1] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->size[sc->axis_id + 1] % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } } /*else { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->axisSwapped) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 "*numActiveThreads;\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ");\n", sc->fftDim, sc->fftDim, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[0] * sc->firstStageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " inoutID = %s+%" PRIu64 "+%s * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ");\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0], sc->gl_LocalInvocationID_y, 
sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1] * sc->firstStageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexInputVkFFT(sc, uintType, readType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s = %s%s[%s]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID / %" PRIu64 ") + sharedStride*(combinedID %% %" PRIu64 ")] = %s%s[inoutID]%s;\n", sc->fftDim, sc->fftDim, convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID / %" PRIu64 ") + sharedStride*(combinedID %% %" 
PRIu64 ")] = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", sc->fftDim, sc->fftDim, convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sharedStride*%s + (%s + %" PRIu64 ")] = %s%s[inoutID]%s;\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0], convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[sharedStride*%s + (%s + %" PRIu64 ")] = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0], convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0; %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID / %" PRIu64 ") + sharedStride*(combinedID %% %" PRIu64 ")].x = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[(combinedID / %" PRIu64 ") + sharedStride*(combinedID %% %" PRIu64 ")].y = 0;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[sharedStride*%s + (%s + %" 
PRIu64 ")].x = 0;\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sharedStride*%s + (%s + %" PRIu64 ")].y = 0;\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } }*/ sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; break; } case 141://DCT-IV strided cast to 8x FFT { char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); if (sc->fftDim != sc->fft_dim_full) { sprintf(sc->disableThreadsStart, " if (((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")+((%s%s) / %" PRIu64 ") * (%" PRIu64 ") < %" PRIu64 ") {\n", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * sc->stageStartSize, sc->fftDim * sc->stageStartSize, sc->size[sc->axis_id]); res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; sprintf(sc->disableThreadsEnd, "}"); } else { sprintf(sc->disableThreadsStart, "{\n"); res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; sprintf(sc->disableThreadsEnd, "}"); } sc->tempLen = sprintf(sc->tempStr, " %s.x = 0;\n", sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < (uint64_t)ceil(sc->min_registers_per_thread / 8.0); i++) { if (sc->fftDim == 
sc->fft_dim_full) sc->tempLen = sprintf(sc->tempStr, " inoutID = (%s + %" PRIu64 ");\n", sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (%" PRIu64 " * (%s + %" PRIu64 ") + ((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")+((%s%s) / %" PRIu64 ") * (%" PRIu64 "));\n", sc->stageStartSize, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * sc->stageStartSize, sc->fftDim * sc->stageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if(inoutID < %" PRIu64 "){\n", sc->fftDim / 8); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x); res = indexInputVkFFT(sc, uintType, readType, index_x, sc->inoutID, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.x = %s%s[%s]%s;\n", sc->regIDs[0], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[0], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, 
sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0;\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(2*(%s+%" PRIu64 "))+%s]=%s;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(2*(%s+%" PRIu64 ")+1)+%s]=%s;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%" PRIu64 " - 2*(%s+%" PRIu64 "))+%s]=%s;\n", sc->sharedStride, sc->fftDim - 2, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%" PRIu64 " - 2*(%s+%" PRIu64 "))+%s]=%s;\n", sc->sharedStride, sc->fftDim - 1, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = - %s.x;\n", sc->regIDs[0], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%" PRIu64 " - 2*(%s+%" PRIu64 "))+%s]=%s;\n", sc->sharedStride, sc->fftDim / 2 - 2, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%" PRIu64 " 
- 2*(%s+%" PRIu64 "))+%s]=%s;\n", sc->sharedStride, sc->fftDim / 2 - 1, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%" PRIu64 " + 2*(%s+%" PRIu64 "))+%s]=%s;\n", sc->sharedStride, sc->fftDim / 2, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%" PRIu64 " + 2*(%s+%" PRIu64 "))+%s]=%s;\n", sc->sharedStride, sc->fftDim / 2 + 1, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->readToRegisters) { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0; %s.y = 0;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%s+%" PRIu64 ")+%s].x=0;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[%s*(%s+%" PRIu64 ")+%s].y=0;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; break; } case 142://DCT-IV nonstrided as 2xN/2 DCT-II { char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY "); if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[0]) { res = appendSetSMToZero(sc, floatType, floatTypeMemory, uintType, readType); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; sc->fftDim = sc->fft_zeropad_Bluestein_left_read[sc->axis_id]; } uint64_t maxBluesteinCutOff = 1; if (sc->zeropadBluestein[0]) { if (sc->axisSwapped) maxBluesteinCutOff = 2 * sc->fftDim * sc->localSize[0]; else maxBluesteinCutOff = 2 * sc->fftDim * sc->localSize[1]; } for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < 2 * sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * 2 * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * 2 * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputStride[0] > 1) sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") * %" PRIu64 " + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", 2 * sc->fftDim, sc->inputStride[0], 2 * sc->fftDim, sc->inputStride[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", 2 * sc->fftDim, 2 * sc->fftDim, sc->inputStride[1]); 
res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1]) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", 2 * sc->fftDim, sc->gl_WorkGroupID_y, shiftY, sc->localSize[0], (uint64_t)ceil(sc->size[1])); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1]) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", 2 * sc->fftDim, sc->gl_WorkGroupID_y, shiftY, sc->localSize[1], (uint64_t)ceil(sc->size[1])); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", maxBluesteinCutOff); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->inputStride[1], sc->fft_zeropad_left_read[sc->axis_id], sc->inputStride[1], sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexInputVkFFT(sc, uintType, readType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res;
#if(VKFFT_BACKEND!=3)//OpenCL is not handling barrier with thread-conditional writes to local memory - so this is a work-around
if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.x = %s%s[%s]%s;\n", sc->regIDs[0], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else 
sc->tempLen = sprintf(sc->tempStr, " %s.x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[0], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res;
#else
if (i < sc->min_registers_per_thread) { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.x = %s%s[%s]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.y = %s%s[%s]%s;\n", sc->regIDs[i - sc->min_registers_per_thread + k * sc->registers_per_thread], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.y = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[i - sc->min_registers_per_thread + k * sc->registers_per_thread], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; }
#endif
#if(VKFFT_BACKEND!=3)//OpenCL is not handling barrier with thread-conditional writes to local memory - so this is a work-around: we do writes in a separate stage
if (sc->axisSwapped) {
//sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim);
sc->tempLen = 
sprintf(sc->tempStr, " sdataID = ((combinedID %% %" PRIu64 ")/2) * sharedStride + (combinedID / %" PRIu64 ");\n", 2 * sc->fftDim, 2 * sc->fftDim); } else { //sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); sc->tempLen = sprintf(sc->tempStr, " sdataID = ((combinedID %% %" PRIu64 ")/2) + (combinedID / %" PRIu64 ") * sharedStride;\n", 2 * sc->fftDim, 2 * sc->fftDim); } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if (((combinedID %% %" PRIu64 ")%%2) == 0) {\n", 2 * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s.x;\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " else {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s.x;\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #endif res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { //sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); sc->tempLen = 
sprintf(sc->tempStr, " sdataID = ((combinedID %% %" PRIu64 ")/2) * sharedStride + (combinedID / %" PRIu64 ");\n", 2 * sc->fftDim, 2 * sc->fftDim); } else { //sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); sc->tempLen = sprintf(sc->tempStr, " sdataID = ((combinedID %% %" PRIu64 ")/2) + (combinedID / %" PRIu64 ") * sharedStride;\n", 2 * sc->fftDim, 2 * sc->fftDim); } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if (((combinedID %% %" PRIu64 ")%%2) == 0) {\n", 2 * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if (((combinedID %% %" PRIu64 ")%%2) == 1) {\n", 2 * sc->fftDim);//another OpenCL bugfix res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1]) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1]) % sc->localSize[1] != 0) { 
sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } #if(VKFFT_BACKEND==3)//OpenCL is not handling barrier with thread-conditional writes to local memory - so this is a work-around for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < 2 * sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * 2 * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * 2 * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputStride[0] > 1) sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") * %" PRIu64 " + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", 2 * sc->fftDim, sc->inputStride[0], 2 * sc->fftDim, sc->inputStride[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", 2 * sc->fftDim, 2 * sc->fftDim, sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1]) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", 2 * sc->fftDim, sc->gl_WorkGroupID_y, shiftY, sc->localSize[0], (uint64_t)ceil(sc->size[1])); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1]) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", 2 * sc->fftDim, sc->gl_WorkGroupID_y, shiftY, sc->localSize[1], (uint64_t)ceil(sc->size[1])); res = VkAppendLine(sc); if (res != 
VKFFT_SUCCESS) return res; } } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", maxBluesteinCutOff); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->inputStride[1], sc->fft_zeropad_left_read[sc->axis_id], sc->inputStride[1], sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexInputVkFFT(sc, uintType, readType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) {
//sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim);
sc->tempLen = sprintf(sc->tempStr, " sdataID = ((combinedID %% %" PRIu64 ")/2) * sharedStride + (combinedID / %" PRIu64 ");\n", 2 * sc->fftDim, 2 * sc->fftDim); } else {
//sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim);
sc->tempLen = sprintf(sc->tempStr, " sdataID = ((combinedID %% %" PRIu64 ")/2) + (combinedID / %" PRIu64 ") * sharedStride;\n", 2 * sc->fftDim, 2 * sc->fftDim); } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (i < 
sc->min_registers_per_thread) { sc->tempLen = sprintf(sc->tempStr, " if (((combinedID %% %" PRIu64 ")%%2) == 0) {\n", 2 * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s.x;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " if (((combinedID %% %" PRIu64 ")%%2) == 0) {\n", 2 * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s.y;\n", sc->regIDs[i - sc->min_registers_per_thread + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { //sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); sc->tempLen = sprintf(sc->tempStr, " sdataID = ((combinedID %% %" PRIu64 ")/2) * sharedStride + (combinedID / %" PRIu64 ");\n", 2 * sc->fftDim, 2 * sc->fftDim); } else { //sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); sc->tempLen = sprintf(sc->tempStr, " sdataID 
= ((combinedID %% %" PRIu64 ")/2) + (combinedID / %" PRIu64 ") * sharedStride;\n", 2 * sc->fftDim, 2 * sc->fftDim); } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if (((combinedID %% %" PRIu64 ")%%2) == 0) {\n", 2 * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if (((combinedID %% %" PRIu64 ")%%2) == 1) {\n", 2 * sc->fftDim);//another OpenCL bugfix res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1]) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1]) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < 2 * sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * 2 * sc->min_registers_per_thread) * sc->localSize[0]); else 
sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * 2 * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputStride[0] > 1) sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") * %" PRIu64 " + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", 2 * sc->fftDim, sc->inputStride[0], 2 * sc->fftDim, sc->inputStride[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", 2 * sc->fftDim, 2 * sc->fftDim, sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1]) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", 2 * sc->fftDim, sc->gl_WorkGroupID_y, shiftY, sc->localSize[0], (uint64_t)ceil(sc->size[1])); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1]) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", 2 * sc->fftDim, sc->gl_WorkGroupID_y, shiftY, sc->localSize[1], (uint64_t)ceil(sc->size[1])); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", maxBluesteinCutOff); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->inputStride[1], sc->fft_zeropad_left_read[sc->axis_id], sc->inputStride[1], sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen 
= sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexInputVkFFT(sc, uintType, readType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { //sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); sc->tempLen = sprintf(sc->tempStr, " sdataID = ((combinedID %% %" PRIu64 ")/2) * sharedStride + (combinedID / %" PRIu64 ");\n", 2 * sc->fftDim, 2 * sc->fftDim); } else { //sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); sc->tempLen = sprintf(sc->tempStr, " sdataID = ((combinedID %% %" PRIu64 ")/2) + (combinedID / %" PRIu64 ") * sharedStride;\n", 2 * sc->fftDim, 2 * sc->fftDim); } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (i < sc->min_registers_per_thread) { sc->tempLen = sprintf(sc->tempStr, " if (((combinedID %% %" PRIu64 ")%%2) == 1) {\n", 2 * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s.x;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " if (((combinedID %% %" PRIu64 ")%%2) == 
1) {\n", 2 * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s.y;\n", sc->regIDs[i - sc->min_registers_per_thread + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { //sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); sc->tempLen = sprintf(sc->tempStr, " sdataID = ((combinedID %% %" PRIu64 ")/2) * sharedStride + (combinedID / %" PRIu64 ");\n", 2 * sc->fftDim, 2 * sc->fftDim); } else { //sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); sc->tempLen = sprintf(sc->tempStr, " sdataID = ((combinedID %% %" PRIu64 ")/2) + (combinedID / %" PRIu64 ") * sharedStride;\n", 2 * sc->fftDim, 2 * sc->fftDim); } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if (((combinedID %% %" PRIu64 ")%%2) == 0) {\n", 2 * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) 
return res; sc->tempLen = sprintf(sc->tempStr, " if (((combinedID %% %" PRIu64 ")%%2) == 1) {\n", 2 * sc->fftDim);//another OpenCL bugfix res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1]) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1]) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } #endif res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { if (sc->axisSwapped) maxBluesteinCutOff = sc->fftDim * sc->localSize[0]; else maxBluesteinCutOff = sc->fftDim * sc->localSize[1]; } for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { 
sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", maxBluesteinCutOff); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim); } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim); } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if((combinedID %% %" PRIu64 ")>0){\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[sdataID-sharedStride].y;\n", sc->w); } else { sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[sdataID-1].y;\n", sc->w); } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = sdata[sdataID].x;\n", sc->w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = %s.x+%s.y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->w, sc->w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s.x-%s.y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->w, sc->w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 2*sdata[sdataID].x;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (%" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim - 1, sc->fftDim); } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (%" PRIu64 ") + (combinedID / %" 
PRIu64 ") * sharedStride;\n", sc->fftDim - 1, sc->fftDim); } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 2*sdata[sdataID].y;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } /*sc->tempLen = sprintf(sc->tempStr, " printf(\" %%f %%f %%d\\n\", %s.x, %s.y, %s);\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res;*/ } } res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", maxBluesteinCutOff); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim); 
} else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim); } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if((combinedID %% %" PRIu64 ")>0){\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s.x;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #if(VKFFT_BACKEND!=3) if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (%" PRIu64 " - combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim, sc->fftDim); } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (%" PRIu64 " - combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim, sc->fftDim); } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s.y;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #endif sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID] = %s;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } /*sc->tempLen = sprintf(sc->tempStr, " printf(\" %%f %%f %%d\\n\", sdata[sdataID].x, sdata[sdataID].y, %s);\n", sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res;*/ } } res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; 
res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; #if(VKFFT_BACKEND==3) res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", maxBluesteinCutOff); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if((combinedID %% %" PRIu64 ")>0){\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (%" PRIu64 " - combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim, sc->fftDim); } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (%" PRIu64 " - combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim, sc->fftDim); } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s.y;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } /*sc->tempLen = 
sprintf(sc->tempStr, " printf(\" %%f %%f %%d\\n\", sdata[sdataID].x, sdata[sdataID].y, %s);\n", sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res;*/ } } res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; #endif res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; uint64_t num_in = (sc->axisSwapped) ? (uint64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1]) : (uint64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0]); for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < num_in; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * num_in) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * num_in) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if ((1 + i + k * num_in) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim / 2 + 1) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim / 2 + 1) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((1 + i + k * num_in) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim / 2 + 1) * sc->localSize[1]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim / 2 + 1) * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " mult = twiddleLUT[%" PRIu64 " + combinedID %% %" PRIu64 "];\n", sc->startDCT3LUT, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " mult.x = 
%s(%.17e%s * (combinedID %% %" PRIu64 ") );\n", cosDef, double_PI / 2 / sc->fftDim, LFending, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = %s(%.17e%s * (combinedID %% %" PRIu64 ") );\n", sinDef, double_PI / 2 / sc->fftDim, LFending, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim / 2 + 1, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride ;\n", sc->fftDim / 2 + 1, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = sdata[sdataID];\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if (combinedID %% %" PRIu64 " > 0){\n", sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " inoutID = (%" PRIu64 " - combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " inoutID = (%" PRIu64 " - combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride ;\n", sc->fftDim, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = sdata[inoutID];\n", sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = ((%s.x+%s.y)*mult.x+(%s.x-%s.y)*mult.y);\n", sc->regIDs[0], 
sc->regIDs[1], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = ((-%s.x+%s.y)*mult.x+(%s.x+%s.y)*mult.y);\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[0], sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[inoutID].x = ((%s.x-%s.y)*mult.x+(%s.x+%s.y)*mult.y);\n", sc->regIDs[0], sc->regIDs[1], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[inoutID].y = ((%s.x+%s.y)*mult.x-(%s.x-%s.y)*mult.y);\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[0], sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " } \n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if (combinedID %% %" PRIu64 " == 0){\n", sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = (%s.x*mult.x-%s.y*mult.y);\n", sc->regIDs[0], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = (%s.y*mult.x+%s.x*mult.y);\n", sc->regIDs[0], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if ((1 + i + k * num_in) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim / 2 + 1) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((1 + i + k * num_in) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim / 2 + 1) * sc->localSize[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } res = 
appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) sc->fftDim = sc->fft_dim_full; } else { //Not implemented } break; } case 143://DCT-IV strided as 2xN/2 DCT-II { char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftX2[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX2, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY "); if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[0]) { res = appendSetSMToZero(sc, floatType, floatTypeMemory, uintType, readType); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; sc->fftDim = sc->fft_zeropad_Bluestein_left_read[sc->axis_id]; } for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < 2 * sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * 2 * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * 2 * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", 2 * sc->fftDim * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((uint64_t)ceil(sc->size[0]) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if ((%s%s) < %" PRIu64 ") {\n", sc->gl_GlobalInvocationID_x, shiftX2, (uint64_t)ceil(sc->size[0])); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", 
sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX2, sc->fft_dim_x); sprintf(index_y, "(%s + %" PRIu64 ")", sc->gl_LocalInvocationID_y, (i + k * 2 * sc->min_registers_per_thread) * sc->localSize[1]); res = indexInputVkFFT(sc, uintType, readType, index_x, index_y, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((%s %% %" PRIu64 " < %" PRIu64 ")||(%s %% %" PRIu64 " >= %" PRIu64 ")){\n", index_y, sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], index_y, sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; #if(VKFFT_BACKEND!=3)//OpenCL is not handling barrier with thread-conditional writes to local memory - so this is a work-around if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.x = %s%s[%s]%s;\n", sc->regIDs[0], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[0], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #else if (i < sc->min_registers_per_thread) { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.x = %s%s[%s]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[i + k * sc->registers_per_thread], convTypeLeft, sc->inoutID, 
sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s.y = %s%s[%s]%s;\n", sc->regIDs[i - sc->min_registers_per_thread + k * sc->registers_per_thread], convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s.y = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", sc->regIDs[i - sc->min_registers_per_thread + k * sc->registers_per_thread], convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } #endif #if(VKFFT_BACKEND!=3)//OpenCL is not handling barrier with thread-conditional writes to local memory - so this is a work-around: we do writes in a separate stage sc->tempLen = sprintf(sc->tempStr, " sdataID = ((combinedID / %" PRIu64 ")/2) * sharedStride + (combinedID %% %" PRIu64 ");\n", sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((combinedID / %" PRIu64 ")%%2 == 0) {\n", sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s.x;\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " } else {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s.x;\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #endif res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdataID = ((combinedID / %" PRIu64 ")/2) * sharedStride + (combinedID %% %" PRIu64 ");\n", sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((combinedID / %" PRIu64 ")%%2 == 0) {\n", sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " } else {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((uint64_t)ceil(sc->size[0]) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } #if(VKFFT_BACKEND==3)//OpenCL is not handling barrier with thread-conditional writes to local memory - so this is a work-around for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < 2 * sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * 2 * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * 2 * sc->min_registers_per_thread) * 
sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", 2 * sc->fftDim * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((uint64_t)ceil(sc->size[0]) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if ((%s%s) < %" PRIu64 ") {\n", sc->gl_GlobalInvocationID_x, shiftX2, (uint64_t)ceil(sc->size[0])); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX2, sc->fft_dim_x); sprintf(index_y, "(%s + %" PRIu64 ")", sc->gl_LocalInvocationID_y, (i + k * 2 * sc->min_registers_per_thread) * sc->localSize[1]); res = indexInputVkFFT(sc, uintType, readType, index_x, index_y, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((%s %% %" PRIu64 " < %" PRIu64 ")||(%s %% %" PRIu64 " >= %" PRIu64 ")){\n", index_y, sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], index_y, sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdataID = ((combinedID / %" PRIu64 ")/2) * sharedStride + (combinedID %% %" PRIu64 ");\n", sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (i < sc->min_registers_per_thread) { sc->tempLen = sprintf(sc->tempStr, " if ((combinedID / %" PRIu64 ")%%2 == 0) {\n", sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) 
return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s.x;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " if ((combinedID / %" PRIu64 ")%%2 == 0) {\n", sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s.y;\n", sc->regIDs[i - sc->min_registers_per_thread + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdataID = ((combinedID / %" PRIu64 ")/2) * sharedStride + (combinedID %% %" PRIu64 ");\n", sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((combinedID / %" PRIu64 ")%%2 == 0) {\n", sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " } else {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((uint64_t)ceil(sc->size[0]) % 
sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < 2 * sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * 2 * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * 2 * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", 2 * sc->fftDim * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((uint64_t)ceil(sc->size[0]) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if ((%s%s) < %" PRIu64 ") {\n", sc->gl_GlobalInvocationID_x, shiftX2, (uint64_t)ceil(sc->size[0])); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX2, sc->fft_dim_x); sprintf(index_y, "(%s + %" PRIu64 ")", sc->gl_LocalInvocationID_y, (i + k * 2 * sc->min_registers_per_thread) * sc->localSize[1]); res = indexInputVkFFT(sc, uintType, readType, index_x, index_y, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != 
VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((%s %% %" PRIu64 " < %" PRIu64 ")||(%s %% %" PRIu64 " >= %" PRIu64 ")){\n", index_y, sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], index_y, sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdataID = ((combinedID / %" PRIu64 ")/2) * sharedStride + (combinedID %% %" PRIu64 ");\n", sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (i < sc->min_registers_per_thread) { sc->tempLen = sprintf(sc->tempStr, " if ((combinedID / %" PRIu64 ")%%2 == 1) {\n", sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s.x;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " if ((combinedID / %" PRIu64 ")%%2 == 1) {\n", sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s.y;\n", sc->regIDs[i - sc->min_registers_per_thread + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdataID = ((combinedID / %" PRIu64 ")/2) * sharedStride + (combinedID %% %" PRIu64 ");\n", 
sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((combinedID / %" PRIu64 ")%%2 == 0) {\n", sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " } else {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((uint64_t)ceil(sc->size[0]) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } #endif res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" 
PRIu64 "){\n", sc->fftDim * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID / %" PRIu64 ") * sharedStride + (combinedID %% %" PRIu64 ");\n", sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if((combinedID / %" PRIu64 ")>0){\n", sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = sdata[sdataID-sharedStride].y;\n", sc->w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = sdata[sdataID].x;\n", sc->w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = %s.x+%s.y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->w, sc->w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s.x-%s.y;\n", sc->regIDs[i + k * sc->registers_per_thread], sc->w, sc->w); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 2*sdata[sdataID].x;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdataID = (%" PRIu64 ") * sharedStride + (combinedID %% %" PRIu64 ");\n", sc->fftDim - 1, sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 2*sdata[sdataID].y;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } //sc->tempLen = sprintf(sc->tempStr, " printf(\" %%f %%f\\n\", %s.x, %s.y);\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); //res = VkAppendLine(sc); //if (res != VKFFT_SUCCESS) return res; } } res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID / %" PRIu64 ") * sharedStride + (combinedID %% %" PRIu64 ");\n", sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if((combinedID / %" PRIu64 ")>0){\n", sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s.x;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #if(VKFFT_BACKEND!=3) sc->tempLen = sprintf(sc->tempStr, " sdataID = (%" PRIu64 " - combinedID / %" PRIu64 ") * sharedStride + 
(combinedID %% %" PRIu64 ");\n", sc->fftDim, sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s.y;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #endif sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID] = %s;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; #if(VKFFT_BACKEND==3) res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if((combinedID / %" PRIu64 ")>0){\n", sc->localSize[0]); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdataID = (%" PRIu64 " - combinedID / %" PRIu64 ") * sharedStride + (combinedID %% %" PRIu64 ");\n", sc->fftDim, sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s.y;\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; #endif res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; uint64_t num_in = (uint64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1]); for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < num_in; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * num_in) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * num_in) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if ((1 + i + k * num_in) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim / 2 + 1) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim / 2 + 1) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " mult = twiddleLUT[%" PRIu64 " + combinedID / %" PRIu64 "];\n", sc->startDCT3LUT, 
sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " mult.x = %s(%.17e%s * (combinedID / %" PRIu64 ") );\n", cosDef, double_PI / 2 / sc->fftDim, LFending, sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = %s(%.17e%s * (combinedID / %" PRIu64 ") );\n", sinDef, double_PI / 2 / sc->fftDim, LFending, sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID / %" PRIu64 ") * sharedStride + (combinedID %% %" PRIu64 ");\n", sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = sdata[sdataID];\n", sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if (combinedID / %" PRIu64 " > 0){\n", sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " inoutID = (%" PRIu64 " - combinedID / %" PRIu64 ") * sharedStride + (combinedID %% %" PRIu64 ");\n", sc->fftDim, sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = sdata[inoutID];\n", sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = ((%s.x+%s.y)*mult.x+(%s.x-%s.y)*mult.y);\n", sc->regIDs[0], sc->regIDs[1], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = ((-%s.x+%s.y)*mult.x+(%s.x+%s.y)*mult.y);\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[0], sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[inoutID].x = ((%s.x-%s.y)*mult.x+(%s.x+%s.y)*mult.y);\n", 
sc->regIDs[0], sc->regIDs[1], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[inoutID].y = ((%s.x+%s.y)*mult.x-(%s.x-%s.y)*mult.y);\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[0], sc->regIDs[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " } else {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = (%s.x*mult.x-%s.y*mult.y);\n", sc->regIDs[0], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = (%s.y*mult.x+%s.x*mult.y);\n", sc->regIDs[0], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if ((1 + i + k * num_in) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim / 2 + 1) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) sc->fftDim = sc->fft_dim_full; } else { //Not implemented } break; } case 144://odd DCT-IV nonstrided as N FFT { char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY "); uint64_t mult = (sc->mergeSequencesR2C) ? 
2 : 1; if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[0]) { res = appendSetSMToZero(sc, floatType, floatTypeMemory, uintType, readType); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; sc->fftDim = sc->fft_zeropad_Bluestein_left_read[sc->axis_id]; } for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputStride[0] > 1) sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") * %" PRIu64 " + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->inputStride[0], sc->fftDim, mult * sc->inputStride[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->fftDim, mult * sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, shiftY, sc->localSize[0], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if 
((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, shiftY, sc->localSize[1], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->inputStride[1], sc->fft_zeropad_left_read[sc->axis_id], sc->inputStride[1], sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexInputVkFFT(sc, uintType, readType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s%s[%s]%s;\n", convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != 
VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " inoutID += %" PRIu64 ";\n", sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s%s[inoutID]%s;\n", convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride ;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s%s[inoutID]%s;\n", convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " inoutID += %" PRIu64 ";\n", sc->inputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %s%s[inoutID]%s;\n", convTypeLeft, inputsStruct, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" 
PRIu64 "]%s;\n", convTypeLeft, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride ;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } res = 
appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (!sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " inoutID = %" PRIu64 " + 4 * (combinedID %% %" PRIu64 ");\n", sc->fftDim / 2, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if (inoutID < %" PRIu64 ") sdataID = inoutID;\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((inoutID < %" PRIu64 ")&&(inoutID >= %" PRIu64 ")) sdataID = %" PRIu64 " - inoutID;\n", 2 * sc->fftDim, sc->fftDim, 2 * sc->fftDim - 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((inoutID < %" PRIu64 ")&&(inoutID >= %" PRIu64 ")) sdataID = inoutID - %" PRIu64 ";\n", 3 * sc->fftDim, 2 * sc->fftDim, 2 * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((inoutID < %" PRIu64 ")&&(inoutID >= %" PRIu64 ")) sdataID = %" PRIu64 " - inoutID;\n", 4 * sc->fftDim, 3 * sc->fftDim, 4 * sc->fftDim - 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if (inoutID >= %" PRIu64 ") sdataID = inoutID - %" PRIu64 ";\n", 4 * sc->fftDim, 4 * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdataID = sdataID + 
%s * sharedStride;\n", sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = sdata[sdataID];\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((inoutID < %" PRIu64 ")&&(inoutID >= %" PRIu64 ")){ \n\ %s.x = -%s.x;\n\ %s.y = -%s.y;}\n", 2 * sc->fftDim, sc->fftDim, sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((inoutID < %" PRIu64 ")&&(inoutID >= %" PRIu64 ")){ \n\ %s.x = -%s.x;\n\ %s.y = -%s.y;}\n", 3 * sc->fftDim, 2 * sc->fftDim, sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " inoutID = %" PRIu64 " + 4 * combinedID;\n", sc->fftDim / 2); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if (inoutID < %" PRIu64 ") sdataID = inoutID;\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = 
sprintf(sc->tempStr, " if ((inoutID < %" PRIu64 ")&&(inoutID >= %" PRIu64 ")) sdataID = %" PRIu64 " - inoutID;\n", 2 * sc->fftDim, sc->fftDim, 2 * sc->fftDim - 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((inoutID < %" PRIu64 ")&&(inoutID >= %" PRIu64 ")) sdataID = inoutID - %" PRIu64 ";\n", 3 * sc->fftDim, 2 * sc->fftDim, 2 * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((inoutID < %" PRIu64 ")&&(inoutID >= %" PRIu64 ")) sdataID = %" PRIu64 " - inoutID;\n", 4 * sc->fftDim, 3 * sc->fftDim, 4 * sc->fftDim - 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if (inoutID >= %" PRIu64 ") sdataID = inoutID - %" PRIu64 ";\n", 4 * sc->fftDim, 4 * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdataID = sdataID * sharedStride + %s;\n", sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = sdata[sdataID];\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((inoutID < %" PRIu64 ")&&(inoutID >= %" PRIu64 ")){ \n\ %s.x = -%s.x;\n\ %s.y = -%s.y;}\n", 2 * sc->fftDim, sc->fftDim, sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((inoutID < %" PRIu64 ")&&(inoutID >= %" PRIu64 ")){ \n\ %s.x = -%s.x;\n\ %s.y = -%s.y;}\n", 3 * sc->fftDim, 2 * sc->fftDim, sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * 
sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) sc->fftDim = sc->fft_dim_full; } else { //Not implemented } break; } case 145://odd DCT-IV strided as N FFT { char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftX2[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX2, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY "); uint64_t mult = (sc->mergeSequencesR2C) ? 2 : 1; if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[0]) { res = appendSetSMToZero(sc, floatType, floatTypeMemory, uintType, readType); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; sc->fftDim = sc->fft_zeropad_Bluestein_left_read[sc->axis_id]; } for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < mult * sc->min_registers_per_thread; i++) { //sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * mult * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); //res = VkAppendLine(sc); //if (res != VKFFT_SUCCESS) return res; if ((uint64_t)ceil(sc->size[0] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if ((%s%s) < %" PRIu64 ") {\n", sc->gl_GlobalInvocationID_x, shiftX2, (uint64_t)ceil(sc->size[0] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->mergeSequencesR2C) sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 ") / %" PRIu64 ";\n", 
sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], mult); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 ");\n", sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if((combinedID %% %" PRIu64 ") < %" PRIu64 "){\n", sc->fft_dim_full, sc->fft_zeropad_Bluestein_left_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") * sharedStride + %s;\n", sc->fftDim, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sprintf(index_x, "(%s + %" PRIu64 " * ((%s %% %" PRIu64 ") + (%s%s) * %" PRIu64 ")) %% (%" PRIu64 ")", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, mult, sc->gl_WorkGroupID_x, shiftX, mult, sc->fft_dim_x); sprintf(index_y, "(%s/%" PRIu64 " + %" PRIu64 ")", sc->gl_LocalInvocationID_y, mult, (i + k * sc->min_registers_per_thread) * sc->localSize[1]); } else { sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX2, sc->fft_dim_x); sprintf(index_y, "(%s + %" PRIu64 ")", sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1]); } res = indexInputVkFFT(sc, uintType, readType, index_x, index_y, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 1); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " if((%s %% %" PRIu64 " < %" PRIu64 ")||(%s %% %" PRIu64 " >= %" PRIu64 
")){\n", index_y, sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], index_y, sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %s%s[%s]%s;\n", convTypeLeft, inputsStruct, sc->inoutID, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = %sinputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "]%s;\n", convTypeLeft, sc->inoutID, sc->inputBufferBlockSize, inputsStruct, sc->inoutID, sc->inputBufferBlockSize, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[0]) { sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].x = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdata[sdataID].y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((uint64_t)ceil(sc->size[0] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, " 
combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } sc->tempLen = sprintf(sc->tempStr, " inoutID = %" PRIu64 " + 4 * combinedID;\n", sc->fftDim / 2); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if (inoutID < %" PRIu64 ") sdataID = inoutID;\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((inoutID < %" PRIu64 ")&&(inoutID >= %" PRIu64 ")) sdataID = %" PRIu64 " - inoutID;\n", 2 * sc->fftDim, sc->fftDim, 2 * sc->fftDim - 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((inoutID < %" PRIu64 ")&&(inoutID >= %" PRIu64 ")) sdataID = inoutID - %" PRIu64 ";\n", 3 * sc->fftDim, 2 * sc->fftDim, 2 * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((inoutID < %" PRIu64 ")&&(inoutID >= %" PRIu64 ")) sdataID = %" PRIu64 " - inoutID;\n", 4 * sc->fftDim, 3 * sc->fftDim, 4 * sc->fftDim - 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if (inoutID >= %" PRIu64 ") sdataID = inoutID - %" PRIu64 ";\n", 4 * sc->fftDim, 4 * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdataID = sdataID * sharedStride + %s;\n", sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; 
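/* Editorial note (a sketch for readers, not written by the VkFFT authors).
   The shader lines generated just above fold inoutID = fftDim/2 + 4*combinedID
   back into the range [0, fftDim). The lines generated next load sdata[sdataID]
   into a register and negate it when inoutID lies in [fftDim, 3*fftDim).
   Worked example, assuming fftDim = 5 (so fftDim/2 = 2 in integer arithmetic):
     combinedID :  0    1    2    3    4
     inoutID    :  2    6   10   14   18
     sdataID    :  2    3    0    4    1    (before "* sharedStride + local x")
     negated    :  no  yes  yes  yes   no
   In this example every shared-memory row 0..4 is read exactly once. */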
sc->tempLen = sprintf(sc->tempStr, " %s = sdata[sdataID];\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((inoutID < %" PRIu64 ")&&(inoutID >= %" PRIu64 ")){ \n\ %s.x = -%s.x;\n\ %s.y = -%s.y;}\n", 2 * sc->fftDim, sc->fftDim, sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((inoutID < %" PRIu64 ")&&(inoutID >= %" PRIu64 ")){ \n\ %s.x = -%s.x;\n\ %s.y = -%s.y;}\n", 3 * sc->fftDim, 2 * sc->fftDim, sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[0]) sc->fftDim = sc->fft_dim_full; } else { //Not implemented } break; } } return res; } static inline VkFFTResult appendReorder4StepRead(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* uintType, uint64_t reorderType) { VkFFTResult res = VKFFT_SUCCESS; char vecType[30]; char LFending[4] = ""; if (!strcmp(floatType, "float")) sprintf(LFending, "f"); #if(VKFFT_BACKEND==0) if (!strcmp(floatType, "float")) sprintf(vecType, "vec2"); if (!strcmp(floatType, "double")) sprintf(vecType, "dvec2"); char cosDef[20] = "cos"; char sinDef[20] = "sin"; if (!strcmp(floatType, "double")) sprintf(LFending, "LF"); #elif(VKFFT_BACKEND==1) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, 
"double2"); char cosDef[20] = "__cosf"; char sinDef[20] = "__sinf"; if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #elif(VKFFT_BACKEND==2) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); char cosDef[20] = "__cosf"; char sinDef[20] = "__sinf"; if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #elif(VKFFT_BACKEND==3) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); char cosDef[20] = "native_cos"; char sinDef[20] = "native_sin"; //if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #endif uint64_t logicalRegistersPerThread = sc->registers_per_thread_per_radix[sc->stageRadix[0]];// (sc->registers_per_thread % sc->stageRadix[sc->numStages - 1] == 0) ? sc->registers_per_thread : sc->min_registers_per_thread; switch (reorderType) { case 1: {//grouped_c2c char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); if ((sc->stageStartSize > 1) && (!sc->reorderFourStep) && (sc->inverse)) { if (!sc->readToRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } /*if (sc->localSize[1] * sc->stageRadix[0] * (sc->registers_per_thread_per_radix[sc->stageRadix[0]] / sc->stageRadix[0]) > sc->fftDim) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; sc->readToRegisters = 0; } else sc->readToRegisters = 1;*/ res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < sc->fftDim / sc->localSize[1]; i++) { uint64_t id = (i / logicalRegistersPerThread) * sc->registers_per_thread + i % logicalRegistersPerThread; if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " mult = twiddleLUT[%" PRIu64 "+(((%s%s)/%" PRIu64 ") %% (%" PRIu64 "))+%" PRIu64 "*(%s+%" PRIu64 ")];\n", 
sc->maxStageSumLUT, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->stageStartSize, sc->gl_LocalInvocationID_y, i * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " mult.y = -mult.y;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " angle = 2 * loc_PI * ((((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")) * (%s + %" PRIu64 ")) / %f%s;\n", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_LocalInvocationID_y, i * sc->localSize[1], (double)(sc->stageStartSize * sc->fftDim), LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " mult.x = %s(angle);\n", cosDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = %s(angle);\n", sinDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " mult = %s(cos(angle), sin(angle));\n", vecType); } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " mult = sincos_20(angle);\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->readToRegisters) { sc->tempLen = sprintf(sc->tempStr, "\ w.x = %s.x * mult.x - %s.y * mult.y;\n", sc->regIDs[id], sc->regIDs[id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ %s.y = %s.y * mult.x + %s.x * mult.y;\n", sc->regIDs[id], sc->regIDs[id], sc->regIDs[id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ %s.x = w.x;\n", sc->regIDs[id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "\ %s = %s*(%" PRIu64 "+%s) + %s;\n", sc->inoutID, sc->sharedStride, i * sc->localSize[1], sc->gl_LocalInvocationID_y, 
sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ w.x = sdata[%s].x * mult.x - sdata[%s].y * mult.y;\n", sc->inoutID, sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s].y = sdata[%s].y * mult.x + sdata[%s].x * mult.y;\n", sc->inoutID, sc->inoutID, sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s].x = w.x;\n", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; } break; } case 2: {//single_c2c_strided char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); if ((!sc->reorderFourStep) && (sc->inverse)) { if (!sc->readToRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } /*if (sc->localSize[1] * sc->stageRadix[0] * (sc->registers_per_thread_per_radix[sc->stageRadix[0]] / sc->stageRadix[0]) > sc->fftDim) { res = appendBarrierVkFFT(sc, 1); sc->readToRegisters = 0; } else sc->readToRegisters = 1;*/ res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < sc->fftDim / sc->localSize[1]; i++) { uint64_t id = (i / logicalRegistersPerThread) * sc->registers_per_thread + i % logicalRegistersPerThread; if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " mult = twiddleLUT[%" PRIu64 " + ((%s%s) %% (%" PRIu64 ")) + (%s + %" PRIu64 ") * %" PRIu64 "];\n", sc->maxStageSumLUT, sc->gl_GlobalInvocationID_x, shiftX, sc->stageStartSize, sc->gl_LocalInvocationID_y, i * sc->localSize[1], sc->stageStartSize); res = VkAppendLine(sc); if (res != 
VKFFT_SUCCESS) return res; if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " mult.y = -mult.y;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " angle = 2 * loc_PI * ((((%s%s) %% (%" PRIu64 ")) * (%s + %" PRIu64 ")) / %f%s);\n", sc->gl_GlobalInvocationID_x, shiftX, sc->stageStartSize, sc->gl_LocalInvocationID_y, i * sc->localSize[1], (double)(sc->stageStartSize * sc->fftDim), LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " mult.x = %s(angle);\n", cosDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = %s(angle);\n", sinDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " mult = %s(cos(angle), sin(angle));\n", vecType); } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " mult = sincos_20(angle);\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->readToRegisters) { sc->tempLen = sprintf(sc->tempStr, "\ w.x = %s.x * mult.x - %s.y * mult.y;\n", sc->regIDs[id], sc->regIDs[id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ %s.y = %s.y * mult.x + %s.x * mult.y;\n", sc->regIDs[id], sc->regIDs[id], sc->regIDs[id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ %s.x = w.x;\n", sc->regIDs[id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "\ %s = %s*(%" PRIu64 "+%s) + %s;\n", sc->inoutID, sc->sharedStride, i * sc->localSize[1], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ w.x = sdata[%s].x * mult.x - sdata[%s].y * mult.y;\n", sc->inoutID, sc->inoutID); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s].y = sdata[%s].y * mult.x + sdata[%s].x * mult.y;\n", sc->inoutID, sc->inoutID, sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s].x = w.x;\n", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; } //appendBarrierVkFFT(sc, 1); break; } } return res; } static inline VkFFTResult appendReorder4StepWrite(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* uintType, uint64_t reorderType) { VkFFTResult res = VKFFT_SUCCESS; char vecType[30]; char LFending[4] = ""; if (!strcmp(floatType, "float")) sprintf(LFending, "f"); #if(VKFFT_BACKEND==0) if (!strcmp(floatType, "float")) sprintf(vecType, "vec2"); if (!strcmp(floatType, "double")) sprintf(vecType, "dvec2"); char cosDef[20] = "cos"; char sinDef[20] = "sin"; if (!strcmp(floatType, "double")) sprintf(LFending, "LF"); #elif(VKFFT_BACKEND==1) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); char cosDef[20] = "__cosf"; char sinDef[20] = "__sinf"; if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #elif(VKFFT_BACKEND==2) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); char cosDef[20] = "__cosf"; char sinDef[20] = "__sinf"; if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #elif(VKFFT_BACKEND==3) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); char cosDef[20] = "native_cos"; char sinDef[20] = "native_sin"; //if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #endif uint64_t logicalRegistersPerThread = 
sc->registers_per_thread_per_radix[sc->stageRadix[sc->numStages - 1]];// (sc->registers_per_thread % sc->stageRadix[sc->numStages - 1] == 0) ? sc->registers_per_thread : sc->min_registers_per_thread; switch (reorderType) { case 1: {//grouped_c2c char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); if ((sc->stageStartSize > 1) && (!((sc->stageStartSize > 1) && (!sc->reorderFourStep) && (sc->inverse)))) { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } /*if (sc->localSize[1] * sc->stageRadix[sc->numStages - 1] * (sc->registers_per_thread_per_radix[sc->stageRadix[sc->numStages - 1]] / sc->stageRadix[sc->numStages - 1]) > sc->fftDim) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; sc->writeFromRegisters = 0; } else sc->writeFromRegisters = 1;*/ res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < sc->fftDim / sc->localSize[1]; i++) { uint64_t id = (i / logicalRegistersPerThread) * sc->registers_per_thread + i % logicalRegistersPerThread; if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " mult = twiddleLUT[%" PRIu64 "+(((%s%s)/%" PRIu64 ") %% (%" PRIu64 "))+%" PRIu64 "*(%s+%" PRIu64 ")];\n", sc->maxStageSumLUT, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->stageStartSize, sc->gl_LocalInvocationID_y, i * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " mult.y = -mult.y;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " angle = 2 * loc_PI * ((((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")) * (%s + %" PRIu64 ")) / %f%s;\n", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, 
sc->gl_LocalInvocationID_y, i * sc->localSize[1], (double)(sc->stageStartSize * sc->fftDim), LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inverse) { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " mult.x = %s(angle);\n", cosDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = %s(angle);\n", sinDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " mult = %s(cos(angle), sin(angle));\n", vecType); } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " mult = sincos_20(angle);\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " mult.x = %s(angle);\n", cosDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = -%s(angle);\n", sinDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " mult = %s(cos(angle), sin(angle));\n", vecType); } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " mult = sincos_20(-angle);\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } if (sc->writeFromRegisters) { sc->tempLen = sprintf(sc->tempStr, "\ w.x = %s.x * mult.x - %s.y * mult.y;\n", sc->regIDs[id], sc->regIDs[id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ %s.y = %s.y * mult.x + %s.x * mult.y;\n", sc->regIDs[id], sc->regIDs[id], sc->regIDs[id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ %s.x = w.x;\n", sc->regIDs[id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "\ %s = %s*(%" PRIu64 "+%s) + %s;\n", sc->inoutID, sc->sharedStride, i * sc->localSize[1], 
sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ w.x = sdata[%s].x * mult.x - sdata[%s].y * mult.y;\n", sc->inoutID, sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s].y = sdata[%s].y * mult.x + sdata[%s].x * mult.y;\n", sc->inoutID, sc->inoutID, sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s].x = w.x;\n", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; } break; } case 2: {//single_c2c_strided char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); if (!((!sc->reorderFourStep) && (sc->inverse))) { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } /*if (sc->localSize[1] * sc->stageRadix[sc->numStages - 1] * (sc->registers_per_thread_per_radix[sc->stageRadix[sc->numStages - 1]] / sc->stageRadix[sc->numStages - 1]) > sc->fftDim) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; sc->writeFromRegisters = 0; } else sc->writeFromRegisters = 1;*/ res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < sc->fftDim / sc->localSize[1]; i++) { uint64_t id = (i / logicalRegistersPerThread) * sc->registers_per_thread + i % logicalRegistersPerThread; if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " mult = twiddleLUT[%" PRIu64 " + ((%s%s) %% (%" PRIu64 ")) + (%s + %" PRIu64 ") * %" PRIu64 "];\n", sc->maxStageSumLUT, sc->gl_GlobalInvocationID_x, shiftX, 
sc->stageStartSize, sc->gl_LocalInvocationID_y, i * sc->localSize[1], sc->stageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " mult.y = -mult.y;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " angle = 2 * loc_PI * ((((%s%s) %% (%" PRIu64 ")) * (%s + %" PRIu64 ")) / %f%s);\n", sc->gl_GlobalInvocationID_x, shiftX, sc->stageStartSize, sc->gl_LocalInvocationID_y, i * sc->localSize[1], (double)(sc->stageStartSize * sc->fftDim), LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inverse) { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " mult.x = %s(angle);\n", cosDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = %s(angle);\n", sinDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " mult = %s(cos(angle), sin(angle));\n", vecType); } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " mult = sincos_20(angle);\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " mult.x = %s(angle);\n", cosDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = -%s(angle);\n", sinDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " mult = %s(cos(angle), sin(angle));\n", vecType); } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " mult = sincos_20(-angle);\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } if (sc->writeFromRegisters) { sc->tempLen = sprintf(sc->tempStr, "\ w.x = %s.x * mult.x - %s.y * mult.y;\n", sc->regIDs[id], sc->regIDs[id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; 
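/* Editorial note (a sketch for readers, not written by the VkFFT authors).
   The statement just above and the two that follow emit an in-place complex
   multiplication reg = reg * mult:
     w.x   = reg.x * mult.x - reg.y * mult.y;    real part, held in a temporary
     reg.y = reg.y * mult.x + reg.x * mult.y;    imaginary part, reads the old reg.x
     reg.x = w.x;
   In the non-LUT path of this case, mult = (cos(angle), sin(angle)) for the
   inverse transform and its complex conjugate otherwise, with
     angle = 2*pi * ((globalX % stageStartSize) * (localY + i*localSize[1]))
             / (stageStartSize * fftDim).
   globalX and localY are shorthand used only in this note for the (shifted)
   global x invocation ID and the local y invocation ID. */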
sc->tempLen = sprintf(sc->tempStr, "\ %s.y = %s.y * mult.x + %s.x * mult.y;\n", sc->regIDs[id], sc->regIDs[id], sc->regIDs[id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ %s.x = w.x;\n", sc->regIDs[id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "\ %s = %s*(%" PRIu64 "+%s) + %s;\n", sc->inoutID, sc->sharedStride, i * sc->localSize[1], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ w.x = sdata[%s].x * mult.x - sdata[%s].y * mult.y;\n", sc->inoutID, sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s].y = sdata[%s].y * mult.x + sdata[%s].x * mult.y;\n", sc->inoutID, sc->inoutID, sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s].x = w.x;\n", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; } //appendBarrierVkFFT(sc, 1); break; } } return res; } static inline VkFFTResult appendBluesteinMultiplication(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* uintType, uint64_t strideType, uint64_t pre_or_post_multiplication) { VkFFTResult res = VKFFT_SUCCESS; char vecType[30]; char LFending[4] = ""; if (!strcmp(floatType, "float")) sprintf(LFending, "f"); #if(VKFFT_BACKEND==0) if (!strcmp(floatType, "float")) sprintf(vecType, "vec2"); if (!strcmp(floatType, "double")) sprintf(vecType, "dvec2"); char cosDef[20] = "cos"; char sinDef[20] = "sin"; if (!strcmp(floatType, "double")) sprintf(LFending, "LF"); #elif(VKFFT_BACKEND==1) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if 
(!strcmp(floatType, "double")) sprintf(vecType, "double2"); char cosDef[20] = "__cosf"; char sinDef[20] = "__sinf"; if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #elif(VKFFT_BACKEND==2) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); char cosDef[20] = "__cosf"; char sinDef[20] = "__sinf"; if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #elif(VKFFT_BACKEND==3) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); char cosDef[20] = "native_cos"; char sinDef[20] = "native_sin"; //if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #endif char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); char requestCoordinate[100] = ""; char index_x[2000] = ""; char index_y[2000] = ""; char requestBatch[100] = ""; char separateRegisterStore[100] = ""; char kernelName[100] = ""; sprintf(kernelName, "BluesteinMultiplication"); if (!((sc->readToRegisters && (pre_or_post_multiplication == 0)) || (sc->writeFromRegisters && (pre_or_post_multiplication == 1)))) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { switch (strideType) { case 0: case 2: case 5: case 6: case 110: case 120: case 130: case 140: case 142: case 144: { if (sc->fftDim == sc->fft_dim_full) { sc->tempLen = sprintf(sc->tempStr, " %s = %s + %" PRIu64 ";\n", sc->inoutID, sc->gl_LocalInvocationID_x, i * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sprintf(index_x, " (%s%s) %% (%" PRIu64 ") + %" PRIu64 " * (%s + %" PRIu64 ") + ((%s%s) / %" PRIu64 ") * (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX, sc->stageStartSize, sc->stageStartSize, 
sc->gl_LocalInvocationID_y, (i)*sc->localSize[1], sc->gl_GlobalInvocationID_x, shiftX, sc->stageStartSize, sc->stageStartSize * sc->fftDim); sc->tempLen = sprintf(sc->tempStr, " %s = %s;\n", sc->inoutID, index_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " inoutID = indexInput(%s+%" PRIu64 "+%s * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ")%s%s);\n", sc->gl_LocalInvocationID_x, i * sc->localSize[0], sc->gl_LocalInvocationID_y, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1] * sc->firstStageStartSize, requestCoordinate, requestBatch); } break; } case 1: case 111: case 121: case 131: case 141: case 143: case 145: { if (sc->fftDim == sc->fft_dim_full) { sc->tempLen = sprintf(sc->tempStr, " %s = %s + %" PRIu64 ";\n", sc->inoutID, sc->gl_LocalInvocationID_y, i * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " %s = (%" PRIu64 " * (%s + %" PRIu64 ") + ((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")+((%s%s) / %" PRIu64 ") * (%" PRIu64 "));\n", sc->inoutID, sc->stageStartSize, sc->gl_LocalInvocationID_y, (i)*sc->localSize[1], sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * sc->stageStartSize, sc->fftDim * sc->stageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } break; } } if ((sc->zeropadBluestein[0]) && (pre_or_post_multiplication == 0)) { sc->tempLen = sprintf(sc->tempStr, " if((%s %% %" PRIu64 ") < %" PRIu64 "){\n", sc->inoutID, sc->fft_dim_full, sc->fft_zeropad_Bluestein_left_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((sc->zeropadBluestein[1]) && (pre_or_post_multiplication == 1)) { sc->tempLen = 
sprintf(sc->tempStr, " if((%s %% %" PRIu64 ") < %" PRIu64 "){\n", sc->inoutID, sc->fft_dim_full, sc->fft_zeropad_Bluestein_left_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " w = %s[%s];\n", kernelName, sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; uint64_t k = 0; if (!((sc->readToRegisters && (pre_or_post_multiplication == 0)) || (sc->writeFromRegisters && (pre_or_post_multiplication == 1)))) { if ((strideType == 0) || (strideType == 5) || (strideType == 6) || (strideType == 110) || (strideType == 120) || (strideType == 130) || (strideType == 140) || (strideType == 142) || (strideType == 144)) { sc->tempLen = sprintf(sc->tempStr, "\ %s = sdata[sharedStride * %s + %s + %" PRIu64 " * %s];\n", sc->regIDs[i], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i, sc->gl_WorkGroupSize_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "\ %s = sdata[%s + (%s + %" PRIu64 " * %s)*sharedStride];\n", sc->regIDs[i], sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, i, sc->gl_WorkGroupSize_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->inverseBluestein) res = VkMulComplex(sc, sc->regIDs[i], sc->regIDs[i], "w", sc->temp); else res = VkMulComplexConj(sc, sc->regIDs[i], sc->regIDs[i], "w", sc->temp); if (res != VKFFT_SUCCESS) return res; if (!((sc->readToRegisters && (pre_or_post_multiplication == 0)) || (sc->writeFromRegisters && (pre_or_post_multiplication == 1)))) { if ((strideType == 0) || (strideType == 5) || (strideType == 6) || (strideType == 110) || (strideType == 120) || (strideType == 130) || (strideType == 140) || (strideType == 142) || (strideType == 144)) { sc->tempLen = sprintf(sc->tempStr, "\ sdata[sharedStride * %s + %s + %" PRIu64 " * %s] = %s;\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i, sc->gl_WorkGroupSize_x, sc->regIDs[i]); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s + (%s + %" PRIu64 " * %s)*sharedStride] = %s;\n", sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, i, sc->gl_WorkGroupSize_y, sc->regIDs[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if ((sc->zeropadBluestein[0]) && (pre_or_post_multiplication == 0)) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((sc->zeropadBluestein[1]) && (pre_or_post_multiplication == 1)) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; return res; } static inline VkFFTResult appendRadixStageNonStrided(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* uintType, uint64_t stageSize, uint64_t stageSizeSum, double stageAngle, uint64_t stageRadix) { VkFFTResult res = VKFFT_SUCCESS; char vecType[30]; char LFending[4] = ""; if (!strcmp(floatType, "float")) sprintf(LFending, "f"); #if(VKFFT_BACKEND==0) if (!strcmp(floatType, "float")) sprintf(vecType, "vec2"); if (!strcmp(floatType, "double")) sprintf(vecType, "dvec2"); if (!strcmp(floatType, "double")) sprintf(LFending, "LF"); #elif(VKFFT_BACKEND==1) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #elif(VKFFT_BACKEND==2) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #elif(VKFFT_BACKEND==3) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); //if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #endif char 
convolutionInverse[10] = ""; if (sc->convolutionStep) { if (stageAngle < 0) sprintf(convolutionInverse, ", 0"); else sprintf(convolutionInverse, ", 1"); } uint64_t logicalStoragePerThread = sc->registers_per_thread_per_radix[stageRadix] * sc->registerBoost;// (sc->registers_per_thread % stageRadix == 0) ? sc->registers_per_thread * sc->registerBoost : sc->min_registers_per_thread * sc->registerBoost; uint64_t logicalRegistersPerThread = sc->registers_per_thread_per_radix[stageRadix];// (sc->registers_per_thread % stageRadix == 0) ? sc->registers_per_thread : sc->min_registers_per_thread; uint64_t logicalGroupSize = sc->fftDim / logicalStoragePerThread; if ((!((sc->readToRegisters == 1) && (stageSize == 1) && (!(((sc->convolutionStep) || (sc->useBluesteinFFT && sc->BluesteinConvolutionStep)) && (stageAngle > 0) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)))))) && ((sc->localSize[0] * logicalStoragePerThread > sc->fftDim) || (stageSize > 1) || ((sc->localSize[1] > 1) && (!(sc->performR2C && (sc->actualInverse)))) || ((sc->convolutionStep) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)) && (stageAngle > 0)) || (sc->performDCT))) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (sc->localSize[0] * logicalStoragePerThread > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s * %" PRIu64 " < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_x, logicalStoragePerThread, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t j = 0; j < logicalRegistersPerThread / stageRadix; j++) { sc->tempLen = sprintf(sc->tempStr, "\ %s = (%s+ %" PRIu64 ") %% (%" PRIu64 ");\n", sc->stageInvocationID, sc->gl_LocalInvocationID_x, (j + k * logicalRegistersPerThread / stageRadix) * 
logicalGroupSize, stageSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->LUT) sc->tempLen = sprintf(sc->tempStr, " LUTId = stageInvocationID + %" PRIu64 ";\n", stageSizeSum); else sc->tempLen = sprintf(sc->tempStr, " angle = stageInvocationID * %.17e%s;\n", stageAngle, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if ((!((sc->readToRegisters == 1) && (stageSize == 1) && (!(((sc->convolutionStep) || (sc->useBluesteinFFT && sc->BluesteinConvolutionStep)) && (stageAngle > 0) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)))))) && ((sc->registerBoost == 1) && ((sc->localSize[0] * logicalStoragePerThread > sc->fftDim) || (stageSize > 1) || ((sc->localSize[1] > 1) && (!(sc->performR2C && (sc->actualInverse)))) || ((sc->convolutionStep) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)) && (stageAngle > 0)) || (sc->performDCT)))) { //if(sc->readToRegisters==0){ for (uint64_t i = 0; i < stageRadix; i++) { uint64_t id = j + i * logicalRegistersPerThread / stageRadix; id = (id / logicalRegistersPerThread) * sc->registers_per_thread + id % logicalRegistersPerThread; sc->tempLen = sprintf(sc->tempStr, "\ %s = %s + %" PRIu64 ";\n", sc->sdataID, sc->gl_LocalInvocationID_x, j * logicalGroupSize + i * sc->fftDim / stageRadix); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->resolveBankConflictFirstStages == 1) { sc->tempLen = sprintf(sc->tempStr, "\ %s = (%s / %" PRIu64 ") * %" PRIu64 " + %s %% %" PRIu64 ";", sc->sdataID, sc->sdataID, sc->numSharedBanks / 2, sc->numSharedBanks / 2 + 1, sc->sdataID, sc->numSharedBanks / 2); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->localSize[1] > 1) { sc->tempLen = sprintf(sc->tempStr, "\ %s = %s + sharedStride * %s;\n", sc->sdataID, sc->sdataID, sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, "\ %s = sdata[%s];\n", sc->regIDs[id], sc->sdataID); res = 
VkAppendLine(sc);
					if (res != VKFFT_SUCCESS) return res;
				}
			}
			char** regID = (char**)malloc(sizeof(char*) * stageRadix);
			if (regID) {
				for (uint64_t i = 0; i < stageRadix; i++) {
					regID[i] = (char*)malloc(sizeof(char) * 50);
					if (!regID[i]) {
						for (uint64_t j = 0; j < i; j++) {
							free(regID[j]);
							regID[j] = 0;
						}
						free(regID);
						regID = 0;
						return VKFFT_ERROR_MALLOC_FAILED;
					}
					uint64_t id = j + k * logicalRegistersPerThread / stageRadix + i * logicalStoragePerThread / stageRadix;
					id = (id / logicalRegistersPerThread) * sc->registers_per_thread + id % logicalRegistersPerThread;
					sprintf(regID[i], "%s", sc->regIDs[id]);
					/*if(j + i * logicalStoragePerThread / stageRadix < logicalRegistersPerThread)
						sprintf(regID[i], "%s", sc->regIDs[j + i * logicalStoragePerThread / stageRadix]);
					else
						sprintf(regID[i], "%" PRIu64 "[%" PRIu64 "]", (j + i * logicalStoragePerThread / stageRadix)/ logicalRegistersPerThread, (j + i * logicalStoragePerThread / stageRadix) % logicalRegistersPerThread);*/
				}
				res = inlineRadixKernelVkFFT(sc, floatType, uintType, stageRadix, stageSize, stageAngle, regID);
				if (res != VKFFT_SUCCESS) {
					// Free the temporary register-name buffers before returning, so the error path does not leak them.
					for (uint64_t i = 0; i < stageRadix; i++) {
						free(regID[i]);
						regID[i] = 0;
					}
					free(regID);
					regID = 0;
					return res;
				}
				for (uint64_t i = 0; i < stageRadix; i++) {
					uint64_t id = j + k * logicalRegistersPerThread / stageRadix + i * logicalStoragePerThread / stageRadix;
					id = (id / logicalRegistersPerThread) * sc->registers_per_thread + id % logicalRegistersPerThread;
					sprintf(sc->regIDs[id], "%s", regID[i]);
				}
				for (uint64_t i = 0; i < stageRadix; i++) {
					free(regID[i]);
					regID[i] = 0;
				}
				free(regID);
				regID = 0;
			}
			else
				return VKFFT_ERROR_MALLOC_FAILED;
		}
		if ((stageSize == 1) && (sc->cacheShuffle)) {
			for (uint64_t i = 0; i < logicalRegistersPerThread; i++) {
				uint64_t id = i + k * logicalRegistersPerThread;
				id = (id / logicalRegistersPerThread) * sc->registers_per_thread + id % logicalRegistersPerThread;
				sc->tempLen = sprintf(sc->tempStr, "\
	shuffle[%" PRIu64 "]=%s;\n", i, sc->regIDs[id]);
				res = VkAppendLine(sc);
				if (res != VKFFT_SUCCESS) return res;
			}
			for (uint64_t i = 0; i <
logicalRegistersPerThread; i++) {
				uint64_t id = i + k * logicalRegistersPerThread;
				id = (id / logicalRegistersPerThread) * sc->registers_per_thread + id % logicalRegistersPerThread;
				sc->tempLen = sprintf(sc->tempStr, "\
	%s=shuffle[(%" PRIu64 "+tshuffle)%%(%" PRIu64 ")];\n", sc->regIDs[id], i, logicalRegistersPerThread);
				res = VkAppendLine(sc);
				if (res != VKFFT_SUCCESS) return res;
			}
		}
	}
	if (sc->localSize[0] * logicalStoragePerThread > sc->fftDim) {
		sc->tempLen = sprintf(sc->tempStr, " }\n");
		res = VkAppendLine(sc);
		if (res != VKFFT_SUCCESS) return res;
	}
	res = VkAppendLineFromInput(sc, sc->disableThreadsEnd);
	if (res != VKFFT_SUCCESS) return res;
	res = appendZeropadEnd(sc);
	if (res != VKFFT_SUCCESS) return res;
	return res;
}
static inline VkFFTResult appendRadixStageStrided(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* uintType, uint64_t stageSize, uint64_t stageSizeSum, double stageAngle, uint64_t stageRadix) {
	VkFFTResult res = VKFFT_SUCCESS;
	char vecType[30];
	char LFending[4] = "";
	if (!strcmp(floatType, "float")) sprintf(LFending, "f");
#if(VKFFT_BACKEND==0)
	if (!strcmp(floatType, "float")) sprintf(vecType, "vec2");
	if (!strcmp(floatType, "double")) sprintf(vecType, "dvec2");
	if (!strcmp(floatType, "double")) sprintf(LFending, "LF");
#elif(VKFFT_BACKEND==1)
	if (!strcmp(floatType, "float")) sprintf(vecType, "float2");
	if (!strcmp(floatType, "double")) sprintf(vecType, "double2");
	if (!strcmp(floatType, "double")) sprintf(LFending, "l");
#elif(VKFFT_BACKEND==2)
	if (!strcmp(floatType, "float")) sprintf(vecType, "float2");
	if (!strcmp(floatType, "double")) sprintf(vecType, "double2");
	if (!strcmp(floatType, "double")) sprintf(LFending, "l");
#elif(VKFFT_BACKEND==3)
	if (!strcmp(floatType, "float")) sprintf(vecType, "float2");
	if (!strcmp(floatType, "double")) sprintf(vecType, "double2");
	//if (!strcmp(floatType, "double")) sprintf(LFending, "l");
#endif
	char convolutionInverse[10] = "";
	if (sc->convolutionStep) {
		if (stageAngle < 0)
sprintf(convolutionInverse, ", 0"); else sprintf(convolutionInverse, ", 1"); } uint64_t logicalStoragePerThread = sc->registers_per_thread_per_radix[stageRadix] * sc->registerBoost;// (sc->registers_per_thread % stageRadix == 0) ? sc->registers_per_thread * sc->registerBoost : sc->min_registers_per_thread * sc->registerBoost; uint64_t logicalRegistersPerThread = sc->registers_per_thread_per_radix[stageRadix];// (sc->registers_per_thread % stageRadix == 0) ? sc->registers_per_thread : sc->min_registers_per_thread; uint64_t logicalGroupSize = sc->fftDim / logicalStoragePerThread; if ((!((sc->readToRegisters == 1) && (stageSize == 1) && (!(((sc->convolutionStep) || (sc->useBluesteinFFT && sc->BluesteinConvolutionStep)) && (stageAngle > 0) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)))))) && (((sc->axis_id == 0) && (sc->axis_upload_id == 0) && (!(sc->performR2C && (sc->actualInverse)))) || (sc->localSize[1] * logicalStoragePerThread > sc->fftDim) || (stageSize > 1) || ((sc->convolutionStep) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)) && (stageAngle > 0)) || (sc->performDCT))) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (sc->localSize[1] * logicalStoragePerThread > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s * %" PRIu64 " < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_y, logicalStoragePerThread, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t j = 0; j < logicalRegistersPerThread / stageRadix; j++) { sc->tempLen = sprintf(sc->tempStr, "\ %s = (%s+ %" PRIu64 ") %% (%" PRIu64 ");\n", sc->stageInvocationID, sc->gl_LocalInvocationID_y, (j + k * logicalRegistersPerThread / stageRadix) * logicalGroupSize, stageSize); res = VkAppendLine(sc); if (res != 
VKFFT_SUCCESS) return res; if (sc->LUT) sc->tempLen = sprintf(sc->tempStr, " LUTId = stageInvocationID + %" PRIu64 ";\n", stageSizeSum); else sc->tempLen = sprintf(sc->tempStr, " angle = stageInvocationID * %.17e%s;\n", stageAngle, LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if ((!((sc->readToRegisters == 1) && (stageSize == 1) && (!(((sc->convolutionStep) || (sc->useBluesteinFFT && sc->BluesteinConvolutionStep)) && (stageAngle > 0) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)))))) && ((sc->registerBoost == 1) && (((sc->axis_id == 0) && (sc->axis_upload_id == 0) && (!(sc->performR2C && (sc->actualInverse)))) || (sc->localSize[1] * logicalStoragePerThread > sc->fftDim) || (stageSize > 1) || ((sc->convolutionStep) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)) && (stageAngle > 0)) || (sc->performDCT)))) { for (uint64_t i = 0; i < stageRadix; i++) { uint64_t id = j + i * logicalRegistersPerThread / stageRadix; id = (id / logicalRegistersPerThread) * sc->registers_per_thread + id % logicalRegistersPerThread; sc->tempLen = sprintf(sc->tempStr, "\ %s = sdata[%s*(%s+%" PRIu64 ")+%s];\n", sc->regIDs[id], sc->sharedStride, sc->gl_LocalInvocationID_y, j * logicalGroupSize + i * sc->fftDim / stageRadix, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } char** regID = (char**)malloc(sizeof(char*) * stageRadix); if (regID) { for (uint64_t i = 0; i < stageRadix; i++) { regID[i] = (char*)malloc(sizeof(char) * 50); if (!regID[i]) { for (uint64_t j = 0; j < i; j++) { free(regID[j]); regID[j] = 0; } free(regID); regID = 0; return VKFFT_ERROR_MALLOC_FAILED; } uint64_t id = j + k * logicalRegistersPerThread / stageRadix + i * logicalStoragePerThread / stageRadix; id = (id / logicalRegistersPerThread) * sc->registers_per_thread + id % logicalRegistersPerThread; sprintf(regID[i], "%s", sc->regIDs[id]); /*if (j + i * logicalStoragePerThread / stageRadix < logicalRegistersPerThread) 
sprintf(regID[i], "_%" PRIu64 "", j + i * logicalStoragePerThread / stageRadix);
					else
						sprintf(regID[i], "%" PRIu64 "[%" PRIu64 "]", (j + i * logicalStoragePerThread / stageRadix) / logicalRegistersPerThread, (j + i * logicalStoragePerThread / stageRadix) % logicalRegistersPerThread);*/
				}
				res = inlineRadixKernelVkFFT(sc, floatType, uintType, stageRadix, stageSize, stageAngle, regID);
				if (res != VKFFT_SUCCESS) {
					// Free the temporary register-name buffers before returning, so the error path does not leak them.
					for (uint64_t i = 0; i < stageRadix; i++) {
						free(regID[i]);
						regID[i] = 0;
					}
					free(regID);
					regID = 0;
					return res;
				}
				for (uint64_t i = 0; i < stageRadix; i++) {
					uint64_t id = j + k * logicalRegistersPerThread / stageRadix + i * logicalStoragePerThread / stageRadix;
					id = (id / logicalRegistersPerThread) * sc->registers_per_thread + id % logicalRegistersPerThread;
					sprintf(sc->regIDs[id], "%s", regID[i]);
				}
				for (uint64_t i = 0; i < stageRadix; i++) {
					free(regID[i]);
					regID[i] = 0;
				}
				free(regID);
				regID = 0;
			}
			else
				return VKFFT_ERROR_MALLOC_FAILED;
		}
	}
	if (sc->localSize[1] * logicalStoragePerThread > sc->fftDim) {
		sc->tempLen = sprintf(sc->tempStr, " }\n");
		res = VkAppendLine(sc);
		if (res != VKFFT_SUCCESS) return res;
	}
	res = VkAppendLineFromInput(sc, sc->disableThreadsEnd);
	if (res != VKFFT_SUCCESS) return res;
	res = appendZeropadEnd(sc);
	if (res != VKFFT_SUCCESS) return res;
	if (stageSize == 1) {
		sc->tempLen = sprintf(sc->tempStr, " %s = %" PRIu64 ";\n", sc->sharedStride, sc->localSize[0]);
		res = VkAppendLine(sc);
		if (res != VKFFT_SUCCESS) return res;
	}
	return res;
}
static inline VkFFTResult appendRadixStage(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* uintType, uint64_t stageSize, uint64_t stageSizeSum, double stageAngle, uint64_t stageRadix, uint64_t shuffleType) {
	VkFFTResult res = VKFFT_SUCCESS;
	switch (shuffleType) {
	case 0: case 5: case 6: case 110: case 120: case 130: case 140: case 142: case 144: {
		res = appendRadixStageNonStrided(sc, floatType, uintType, stageSize, stageSizeSum, stageAngle, stageRadix);
		if (res != VKFFT_SUCCESS) return res;
		//appendBarrierVkFFT(sc, 1);
		break;
	}
	case 1: case 2: case 111: case 121: case 131: case
141: case 143: case 145: {
		res = appendRadixStageStrided(sc, floatType, uintType, stageSize, stageSizeSum, stageAngle, stageRadix);
		if (res != VKFFT_SUCCESS) return res;
		//appendBarrierVkFFT(sc, 1);
		break;
	}
	}
	return res;
}
static inline VkFFTResult appendRegisterBoostShuffle(VkFFTSpecializationConstantsLayout* sc, const char* floatType, uint64_t stageSize, uint64_t stageRadixPrev, uint64_t stageRadix, double stageAngle) {
	VkFFTResult res = VKFFT_SUCCESS;
	/*if (((sc->actualInverse) && (sc->normalize)) || ((sc->convolutionStep || sc->useBluesteinFFT) && (stageAngle > 0))) {
		uint64_t bluesteinInverseNormalize = 1;
		if ((sc->useBluesteinFFT) && (stageAngle > 0) && (stageSize == 1) && (sc->normalize) && (sc->axis_upload_id == 0)) bluesteinInverseNormalize = sc->bluesteinNormalizeSize;
		char stageNormalization[50] = "";
		if ((stageSize == 1) && (sc->performDCT) && (sc->actualInverse)) {
			if (sc->performDCT == 4)
				sprintf(stageNormalization, "%" PRIu64 "", stageRadixPrev * stageRadix * 4 * bluesteinInverseNormalize);
			else
				sprintf(stageNormalization, "%" PRIu64 "", stageRadixPrev * stageRadix * 2 * bluesteinInverseNormalize);
		}
		else
			sprintf(stageNormalization, "%" PRIu64 "", stageRadixPrev * stageRadix * bluesteinInverseNormalize);
		uint64_t logicalRegistersPerThread = sc->registers_per_thread_per_radix[stageRadix];// (sc->registers_per_thread % stageRadix == 0) ?
sc->registers_per_thread : sc->min_registers_per_thread;
		for (uint64_t k = 0; k < sc->registerBoost; ++k) {
			for (uint64_t i = 0; i < logicalRegistersPerThread; i++) {
				res = VkDivComplexNumber(sc, sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], stageNormalization);
				if (res != VKFFT_SUCCESS) return res;
			}
		}
	}*/
	return res;
}
static inline VkFFTResult appendRadixShuffleNonStrided(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* uintType, uint64_t stageSize, uint64_t stageSizeSum, double stageAngle, uint64_t stageRadix, uint64_t stageRadixNext) {
	VkFFTResult res = VKFFT_SUCCESS;
	char vecType[30];
	char LFending[4] = "";
	if (!strcmp(floatType, "float")) sprintf(LFending, "f");
#if(VKFFT_BACKEND==0)
	if (!strcmp(floatType, "float")) sprintf(vecType, "vec2");
	if (!strcmp(floatType, "double")) sprintf(vecType, "dvec2");
	if (!strcmp(floatType, "double")) sprintf(LFending, "LF");
#elif(VKFFT_BACKEND==1)
	if (!strcmp(floatType, "float")) sprintf(vecType, "float2");
	if (!strcmp(floatType, "double")) sprintf(vecType, "double2");
	if (!strcmp(floatType, "double")) sprintf(LFending, "l");
#elif(VKFFT_BACKEND==2)
	if (!strcmp(floatType, "float")) sprintf(vecType, "float2");
	if (!strcmp(floatType, "double")) sprintf(vecType, "double2");
	if (!strcmp(floatType, "double")) sprintf(LFending, "l");
#elif(VKFFT_BACKEND==3)
	if (!strcmp(floatType, "float")) sprintf(vecType, "float2");
	if (!strcmp(floatType, "double")) sprintf(vecType, "double2");
#endif
	char stageNormalization[50] = "";
	uint64_t normalizationValue = 1;
	if ((((sc->actualInverse) && (sc->normalize)) || (sc->convolutionStep && (stageAngle > 0))) && (stageSize == 1) && (sc->axis_upload_id == 0) && (!(sc->useBluesteinFFT && (stageAngle < 0)))) {
		if ((sc->performDCT) && (sc->actualInverse)) {
			if (sc->performDCT == 1)
				normalizationValue = (sc->sourceFFTSize - 1) * 2;
			else
				normalizationValue = sc->sourceFFTSize * 2;
		}
		else
			normalizationValue = sc->sourceFFTSize;
	}
	if
(sc->useBluesteinFFT && (stageAngle > 0) && (stageSize == 1) && (sc->axis_upload_id == 0)) { normalizationValue *= sc->fft_dim_full; } if (normalizationValue != 1) { sprintf(stageNormalization, "%.17e%s", 1.0 / (double)(normalizationValue), LFending); } char tempNum[50] = ""; uint64_t logicalStoragePerThread = sc->registers_per_thread_per_radix[stageRadix] * sc->registerBoost;// (sc->registers_per_thread % stageRadix == 0) ? sc->registers_per_thread * sc->registerBoost : sc->min_registers_per_thread * sc->registerBoost; uint64_t logicalStoragePerThreadNext = sc->registers_per_thread_per_radix[stageRadixNext] * sc->registerBoost;// (sc->registers_per_thread % stageRadixNext == 0) ? sc->registers_per_thread * sc->registerBoost : sc->min_registers_per_thread * sc->registerBoost; uint64_t logicalRegistersPerThread = sc->registers_per_thread_per_radix[stageRadix];// (sc->registers_per_thread % stageRadix == 0) ? sc->registers_per_thread : sc->min_registers_per_thread; uint64_t logicalRegistersPerThreadNext = sc->registers_per_thread_per_radix[stageRadixNext];// (sc->registers_per_thread % stageRadixNext == 0) ? 
sc->registers_per_thread : sc->min_registers_per_thread; uint64_t logicalGroupSize = sc->fftDim / logicalStoragePerThread; uint64_t logicalGroupSizeNext = sc->fftDim / logicalStoragePerThreadNext; if ((!((sc->writeFromRegisters == 1) && (stageSize == sc->fftDim / stageRadix) && (!(((sc->convolutionStep) || (sc->useBluesteinFFT && sc->BluesteinConvolutionStep)) && (stageAngle < 0) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)))))) && (((sc->registerBoost == 1) && ((sc->localSize[0] * logicalStoragePerThread > sc->fftDim) || (stageSize < sc->fftDim / stageRadix) || ((sc->reorderFourStep) && (sc->fftDim < sc->fft_dim_full) && (sc->localSize[1] > 1)) || (sc->localSize[1] > 1) || ((sc->performR2C) && (!sc->actualInverse) && (sc->axis_id == 0)) || ((sc->convolutionStep) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)) && (stageAngle < 0)))) || (sc->performDCT))) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } //if ((sc->localSize[0] * logicalStoragePerThread > sc->fftDim) || (stageSize < sc->fftDim / stageRadix) || ((sc->reorderFourStep) && (sc->fftDim < sc->fft_dim_full) && (sc->localSize[1] > 1)) || (sc->localSize[1] > 1) || ((sc->performR2C) && (!sc->actualInverse) && (sc->axis_id == 0)) || ((sc->convolutionStep) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)) && (stageAngle < 0)) || (sc->registerBoost > 1) || (sc->performDCT)) { if ((!((sc->writeFromRegisters == 1) && (stageSize == sc->fftDim / stageRadix) && (!(((sc->convolutionStep) || (sc->useBluesteinFFT && sc->BluesteinConvolutionStep)) && (stageAngle < 0) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)))))) && ((sc->localSize[0] * logicalStoragePerThread > sc->fftDim) || (stageSize < sc->fftDim / stageRadix) || ((sc->reorderFourStep) && (sc->fftDim < sc->fft_dim_full) && (sc->localSize[1] > 1)) || (sc->localSize[1] > 1) || ((sc->performR2C) && (!sc->actualInverse) && (sc->axis_id == 0)) || ((sc->convolutionStep) && ((sc->matrixConvolution > 1) || 
(sc->numKernels > 1)) && (stageAngle < 0)) || (sc->registerBoost > 1) || (sc->performDCT))) { if (!((sc->registerBoost > 1) && (stageSize * stageRadix == sc->fftDim / sc->stageRadix[sc->numStages - 1]) && (sc->stageRadix[sc->numStages - 1] == sc->registerBoost))) { char** tempID; tempID = (char**)malloc(sizeof(char*) * sc->registers_per_thread * sc->registerBoost); if (tempID) { for (uint64_t i = 0; i < sc->registers_per_thread * sc->registerBoost; i++) { tempID[i] = (char*)malloc(sizeof(char) * 50); if (!tempID[i]) { for (uint64_t j = 0; j < i; j++) { free(tempID[j]); tempID[j] = 0; } free(tempID); tempID = 0; return VKFFT_ERROR_MALLOC_FAILED; } } res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (sc->localSize[0] * logicalStoragePerThread > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s * %" PRIu64 " < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_x, logicalStoragePerThread, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } for (uint64_t k = 0; k < sc->registerBoost; ++k) { uint64_t t = 0; if (k > 0) { res = appendBarrierVkFFT(sc, 2); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (sc->localSize[0] * logicalStoragePerThread > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s * %" PRIu64 " < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_x, logicalStoragePerThread, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } for (uint64_t j = 0; j < logicalRegistersPerThread / stageRadix; j++) { sprintf(tempNum, "%" PRIu64 "", j * logicalGroupSize); res = VkAddReal(sc, sc->stageInvocationID, sc->gl_LocalInvocationID_x, tempNum); if (res != VKFFT_SUCCESS) return res; res = VkMovReal(sc, sc->blockInvocationID, sc->stageInvocationID); if 
(res != VKFFT_SUCCESS) return res; sprintf(tempNum, "%" PRIu64 "", stageSize); res = VkModReal(sc, sc->stageInvocationID, sc->stageInvocationID, tempNum); if (res != VKFFT_SUCCESS) return res; res = VkSubReal(sc, sc->blockInvocationID, sc->blockInvocationID, sc->stageInvocationID); if (res != VKFFT_SUCCESS) return res; sprintf(tempNum, "%" PRIu64 "", stageRadix); res = VkMulReal(sc, sc->inoutID, sc->blockInvocationID, tempNum); if (res != VKFFT_SUCCESS) return res; res = VkAddReal(sc, sc->inoutID, sc->inoutID, sc->stageInvocationID); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ stageInvocationID = (gl_LocalInvocationID.x + %" PRIu64 ") %% (%" PRIu64 ");\n\ blockInvocationID = (gl_LocalInvocationID.x + %" PRIu64 ") - stageInvocationID;\n\ inoutID = stageInvocationID + blockInvocationID * %" PRIu64 ";\n", j * logicalGroupSize, stageSize, j * logicalGroupSize, stageRadix);*/ if ((stageSize == 1) && (sc->cacheShuffle)) { for (uint64_t i = 0; i < stageRadix; i++) { uint64_t id = j + k * logicalRegistersPerThread / stageRadix + i * logicalStoragePerThread / stageRadix; id = (id / logicalRegistersPerThread) * sc->registers_per_thread + id % logicalRegistersPerThread; sprintf(tempID[t + k * sc->registers_per_thread], "%s", sc->regIDs[id]); t++; sprintf(tempNum, "%" PRIu64 "", i); res = VkAddReal(sc, sc->sdataID, tempNum, sc->tshuffle); if (res != VKFFT_SUCCESS) return res; sprintf(tempNum, "%" PRIu64 "", logicalRegistersPerThread); res = VkModReal(sc, sc->sdataID, sc->sdataID, tempNum); if (res != VKFFT_SUCCESS) return res; sprintf(tempNum, "%" PRIu64 "", stageSize); res = VkMulReal(sc, sc->sdataID, sc->sdataID, tempNum); if (res != VKFFT_SUCCESS) return res; if (sc->localSize[1] > 1) { res = VkMulReal(sc, sc->combinedID, sc->gl_LocalInvocationID_y, sc->sharedStride); if (res != VKFFT_SUCCESS) return res; res = VkAddReal(sc, sc->sdataID, sc->sdataID, sc->combinedID); if (res != VKFFT_SUCCESS) return res; } res = VkAddReal(sc, sc->sdataID, 
sc->sdataID, sc->inoutID); if (res != VKFFT_SUCCESS) return res; //sprintf(sc->sdataID, "sharedStride * gl_LocalInvocationID.y + inoutID + ((%" PRIu64 "+tshuffle) %% (%" PRIu64 "))*%" PRIu64 "", i, logicalRegistersPerThread, stageSize); if (strcmp(stageNormalization, "")) { res = VkMulComplexNumber(sc, sc->regIDs[id], sc->regIDs[id], stageNormalization); if (res != VKFFT_SUCCESS) return res; } res = VkSharedStore(sc, sc->sdataID, sc->regIDs[id]); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ sdata[sharedStride * gl_LocalInvocationID.y + inoutID + ((%" PRIu64 "+tshuffle) %% (%" PRIu64 "))*%" PRIu64 "] = temp%s%s;\n", i, logicalRegistersPerThread, stageSize, sc->regIDs[id], stageNormalization);*/ } } else { for (uint64_t i = 0; i < stageRadix; i++) { uint64_t id = j + k * logicalRegistersPerThread / stageRadix + i * logicalStoragePerThread / stageRadix; id = (id / logicalRegistersPerThread) * sc->registers_per_thread + id % logicalRegistersPerThread; sprintf(tempID[t + k * sc->registers_per_thread], "%s", sc->regIDs[id]); t++; sprintf(tempNum, "%" PRIu64 "", i * stageSize); res = VkAddReal(sc, sc->sdataID, sc->inoutID, tempNum); if (res != VKFFT_SUCCESS) return res; if ((stageSize <= sc->numSharedBanks / 2) && (sc->fftDim > sc->numSharedBanks / 2) && (sc->sharedStrideBankConflictFirstStages != sc->fftDim / sc->registerBoost) && ((sc->fftDim & (sc->fftDim - 1)) == 0) && (stageSize * stageRadix != sc->fftDim)) { if (sc->resolveBankConflictFirstStages == 0) { sc->resolveBankConflictFirstStages = 1; sc->tempLen = sprintf(sc->tempStr, "\ %s = %" PRIu64 ";", sc->sharedStride, sc->sharedStrideBankConflictFirstStages); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, "\ %s = (%s / %" PRIu64 ") * %" PRIu64 " + %s %% %" PRIu64 ";", sc->sdataID, sc->sdataID, sc->numSharedBanks / 2, sc->numSharedBanks / 2 + 1, sc->sdataID, sc->numSharedBanks / 2); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) 
return res; } else { if (sc->resolveBankConflictFirstStages == 1) { sc->resolveBankConflictFirstStages = 0; sc->tempLen = sprintf(sc->tempStr, "\ %s = %" PRIu64 ";", sc->sharedStride, sc->sharedStrideReadWriteConflict); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->localSize[1] > 1) { res = VkMulReal(sc, sc->combinedID, sc->gl_LocalInvocationID_y, sc->sharedStride); if (res != VKFFT_SUCCESS) return res; res = VkAddReal(sc, sc->sdataID, sc->sdataID, sc->combinedID); if (res != VKFFT_SUCCESS) return res; } //sprintf(sc->sdataID, "sharedStride * gl_LocalInvocationID.y + inoutID + %" PRIu64 "", i * stageSize); if (strcmp(stageNormalization, "")) { res = VkMulComplexNumber(sc, sc->regIDs[id], sc->regIDs[id], stageNormalization); if (res != VKFFT_SUCCESS) return res; } res = VkSharedStore(sc, sc->sdataID, sc->regIDs[id]); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ sdata[sharedStride * gl_LocalInvocationID.y + inoutID + %" PRIu64 "] = temp%s%s;\n", i * stageSize, sc->regIDs[id], stageNormalization);*/ } } } for (uint64_t j = logicalRegistersPerThread; j < sc->registers_per_thread; j++) { sprintf(tempID[t + k * sc->registers_per_thread], "%s", sc->regIDs[t + k * sc->registers_per_thread]); t++; } t = 0; if (sc->registerBoost > 1) { if (sc->localSize[0] * logicalStoragePerThread > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 2); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (sc->localSize[0] * logicalStoragePerThreadNext > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s * %" PRIu64 " 
< %" PRIu64 ") {\n", sc->gl_LocalInvocationID_x, logicalStoragePerThreadNext, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } for (uint64_t j = 0; j < logicalRegistersPerThreadNext / stageRadixNext; j++) { for (uint64_t i = 0; i < stageRadixNext; i++) { uint64_t id = j + k * logicalRegistersPerThreadNext / stageRadixNext + i * logicalStoragePerThreadNext / stageRadixNext; id = (id / logicalRegistersPerThreadNext) * sc->registers_per_thread + id % logicalRegistersPerThreadNext; //resID[t + k * sc->registers_per_thread] = sc->regIDs[id]; sprintf(tempNum, "%" PRIu64 "", t * logicalGroupSizeNext); res = VkAddReal(sc, sc->sdataID, sc->gl_LocalInvocationID_x, tempNum); if (res != VKFFT_SUCCESS) return res; if (sc->localSize[1] > 1) { res = VkMulReal(sc, sc->combinedID, sc->gl_LocalInvocationID_y, sc->sharedStride); if (res != VKFFT_SUCCESS) return res; res = VkAddReal(sc, sc->sdataID, sc->sdataID, sc->combinedID); if (res != VKFFT_SUCCESS) return res; } //sprintf(sc->sdataID, "sharedStride * gl_LocalInvocationID.y + gl_LocalInvocationID.x + %" PRIu64 "", t * logicalGroupSizeNext); res = VkSharedLoad(sc, tempID[t + k * sc->registers_per_thread], sc->sdataID); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp%s = sdata[sharedStride * gl_LocalInvocationID.y + gl_LocalInvocationID.x + %" PRIu64 "];\n", tempID[t + k * sc->registers_per_thread], t * logicalGroupSizeNext);*/ t++; } } if (sc->localSize[0] * logicalStoragePerThreadNext > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->localSize[0] * logicalStoragePerThread > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = 
VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; } } for (uint64_t i = 0; i < sc->registers_per_thread * sc->registerBoost; i++) { //printf("0 - %s\n", resID[i]); sprintf(sc->regIDs[i], "%s", tempID[i]); //sprintf(resID[i], "%s", tempID[i]); //printf("1 - %s\n", resID[i]); } for (uint64_t i = 0; i < sc->registers_per_thread * sc->registerBoost; i++) { free(tempID[i]); tempID[i] = 0; } free(tempID); tempID = 0; } else return VKFFT_ERROR_MALLOC_FAILED; } else { char** tempID; tempID = (char**)malloc(sizeof(char*) * sc->registers_per_thread * sc->registerBoost); if (tempID) { //resID = (char**)malloc(sizeof(char*) * sc->registers_per_thread * sc->registerBoost); for (uint64_t i = 0; i < sc->registers_per_thread * sc->registerBoost; i++) { tempID[i] = (char*)malloc(sizeof(char) * 50); if (!tempID[i]) { for (uint64_t j = 0; j < i; j++) { free(tempID[j]); tempID[j] = 0; } free(tempID); tempID = 0; return VKFFT_ERROR_MALLOC_FAILED; } } for (uint64_t k = 0; k < sc->registerBoost; ++k) { for (uint64_t j = 0; j < logicalRegistersPerThread / stageRadix; j++) { for (uint64_t i = 0; i < stageRadix; i++) { uint64_t id = j + k * logicalRegistersPerThread / stageRadix + i * logicalStoragePerThread / stageRadix; id = (id / logicalRegistersPerThread) * sc->registers_per_thread + id % logicalRegistersPerThread; sprintf(tempID[j + i * logicalRegistersPerThread / stageRadix + k * sc->registers_per_thread], "%s", sc->regIDs[id]); } } for (uint64_t j = logicalRegistersPerThread; j < sc->registers_per_thread; j++) { sprintf(tempID[j + k * sc->registers_per_thread], "%s", sc->regIDs[j + k * sc->registers_per_thread]); } } for (uint64_t i = 0; i < sc->registers_per_thread * sc->registerBoost; i++) { sprintf(sc->regIDs[i], "%s", tempID[i]); } for (uint64_t i = 0; i < sc->registers_per_thread * sc->registerBoost; i++) { free(tempID[i]); tempID[i] = 0; } free(tempID); tempID = 0; } else 
return VKFFT_ERROR_MALLOC_FAILED; } } else { res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (sc->localSize[0] * logicalStoragePerThread > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s * %" PRIu64 " < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_x, logicalStoragePerThread, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (((sc->actualInverse) && (sc->normalize)) || ((sc->convolutionStep || sc->useBluesteinFFT) && (stageAngle > 0))) { for (uint64_t i = 0; i < logicalStoragePerThread; i++) { if (strcmp(stageNormalization, "")) { res = VkMulComplexNumber(sc, sc->regIDs[(i / logicalRegistersPerThread) * sc->registers_per_thread + i % logicalRegistersPerThread], sc->regIDs[(i / logicalRegistersPerThread) * sc->registers_per_thread + i % logicalRegistersPerThread], stageNormalization); } if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp%s = temp%s%s;\n", sc->regIDs[(i / logicalRegistersPerThread) * sc->registers_per_thread + i % logicalRegistersPerThread], sc->regIDs[(i / logicalRegistersPerThread) * sc->registers_per_thread + i % logicalRegistersPerThread], stageNormalization);*/ } } if (sc->localSize[0] * logicalStoragePerThread > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; } return res; } static inline VkFFTResult appendRadixShuffleStrided(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* uintType, uint64_t stageSize, uint64_t stageSizeSum, double stageAngle, uint64_t stageRadix, uint64_t stageRadixNext) { VkFFTResult res = VKFFT_SUCCESS; char vecType[30]; char LFending[4] = ""; if (!strcmp(floatType, "float")) 
sprintf(LFending, "f"); #if(VKFFT_BACKEND==0) if (!strcmp(floatType, "float")) sprintf(vecType, "vec2"); if (!strcmp(floatType, "double")) sprintf(vecType, "dvec2"); if (!strcmp(floatType, "double")) sprintf(LFending, "LF"); #elif(VKFFT_BACKEND==1) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #elif(VKFFT_BACKEND==2) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); if (!strcmp(floatType, "double")) sprintf(LFending, "l"); #elif(VKFFT_BACKEND==3) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); #endif char tempNum[50] = ""; uint64_t logicalStoragePerThread = sc->registers_per_thread_per_radix[stageRadix] * sc->registerBoost;// (sc->registers_per_thread % stageRadix == 0) ? sc->registers_per_thread * sc->registerBoost : sc->min_registers_per_thread * sc->registerBoost; uint64_t logicalStoragePerThreadNext = sc->registers_per_thread_per_radix[stageRadixNext] * sc->registerBoost;//(sc->registers_per_thread % stageRadixNext == 0) ? sc->registers_per_thread * sc->registerBoost : sc->min_registers_per_thread * sc->registerBoost; uint64_t logicalRegistersPerThread = sc->registers_per_thread_per_radix[stageRadix];//(sc->registers_per_thread % stageRadix == 0) ? sc->registers_per_thread : sc->min_registers_per_thread; uint64_t logicalRegistersPerThreadNext = sc->registers_per_thread_per_radix[stageRadixNext];//(sc->registers_per_thread % stageRadixNext == 0) ? 
sc->registers_per_thread : sc->min_registers_per_thread; uint64_t logicalGroupSize = sc->fftDim / logicalStoragePerThread; uint64_t logicalGroupSizeNext = sc->fftDim / logicalStoragePerThreadNext; char stageNormalization[50] = ""; uint64_t normalizationValue = 1; if ((((sc->actualInverse) && (sc->normalize)) || (sc->convolutionStep && (stageAngle > 0))) && (stageSize == 1) && (sc->axis_upload_id == 0) && (!(sc->useBluesteinFFT && (stageAngle < 0)))) { if ((sc->performDCT) && (sc->actualInverse)) { if (sc->performDCT == 1) normalizationValue = (sc->sourceFFTSize - 1) * 2; else normalizationValue = sc->sourceFFTSize * 2; } else normalizationValue = sc->sourceFFTSize; } if (sc->useBluesteinFFT && (stageAngle > 0) && (stageSize == 1) && (sc->axis_upload_id == 0)) { normalizationValue *= sc->fft_dim_full; } if (normalizationValue != 1) { sprintf(stageNormalization, "%.17e%s", 1.0 / (double)(normalizationValue), LFending); } if ((!((sc->writeFromRegisters == 1) && (stageSize == sc->fftDim / stageRadix) && (!(((sc->convolutionStep) || (sc->useBluesteinFFT && sc->BluesteinConvolutionStep)) && (stageAngle < 0) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)))))) && (((sc->axis_id == 0) && (sc->axis_upload_id == 0)) || (sc->localSize[1] * logicalStoragePerThread > sc->fftDim) || (stageSize < sc->fftDim / stageRadix) || ((sc->convolutionStep) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)) && (stageAngle < 0)) || (sc->performDCT))) { res = appendBarrierVkFFT(sc, 2); if (res != VKFFT_SUCCESS) return res; } if (stageSize == sc->fftDim / stageRadix) { sc->tempLen = sprintf(sc->tempStr, " %s = %" PRIu64 ";\n", sc->sharedStride, sc->sharedStrideReadWriteConflict); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((!((sc->writeFromRegisters == 1) && (stageSize == sc->fftDim / stageRadix) && (!(((sc->convolutionStep) || (sc->useBluesteinFFT && sc->BluesteinConvolutionStep)) && (stageAngle < 0) && ((sc->matrixConvolution > 1) || (sc->numKernels > 
1)))))) && (((sc->axis_id == 0) && (sc->axis_upload_id == 0)) || (sc->localSize[1] * logicalStoragePerThread > sc->fftDim) || (stageSize < sc->fftDim / stageRadix) || ((sc->convolutionStep) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)) && (stageAngle < 0)) || (sc->performDCT))) { //if (sc->writeFromRegisters == 0) { //appendBarrierVkFFT(sc, 2); if (!((sc->registerBoost > 1) && (stageSize * stageRadix == sc->fftDim / sc->stageRadix[sc->numStages - 1]) && (sc->stageRadix[sc->numStages - 1] == sc->registerBoost))) { char** tempID; tempID = (char**)malloc(sizeof(char*) * sc->registers_per_thread * sc->registerBoost); if (tempID) { for (uint64_t i = 0; i < sc->registers_per_thread * sc->registerBoost; i++) { tempID[i] = (char*)malloc(sizeof(char) * 50); if (!tempID[i]) { for (uint64_t j = 0; j < i; j++) { free(tempID[j]); tempID[j] = 0; } free(tempID); tempID = 0; return VKFFT_ERROR_MALLOC_FAILED; } } res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (sc->localSize[1] * logicalStoragePerThread > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s * %" PRIu64 " < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_y, logicalStoragePerThread, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } for (uint64_t k = 0; k < sc->registerBoost; ++k) { uint64_t t = 0; if (k > 0) { res = appendBarrierVkFFT(sc, 2); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (sc->localSize[1] * logicalStoragePerThread > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s * %" PRIu64 " < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_y, logicalStoragePerThread, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } for (uint64_t j = 0; j < logicalRegistersPerThread 
/ stageRadix; j++) { sprintf(tempNum, "%" PRIu64 "", j * logicalGroupSize); res = VkAddReal(sc, sc->stageInvocationID, sc->gl_LocalInvocationID_y, tempNum); if (res != VKFFT_SUCCESS) return res; res = VkMovReal(sc, sc->blockInvocationID, sc->stageInvocationID); if (res != VKFFT_SUCCESS) return res; sprintf(tempNum, "%" PRIu64 "", stageSize); res = VkModReal(sc, sc->stageInvocationID, sc->stageInvocationID, tempNum); if (res != VKFFT_SUCCESS) return res; res = VkSubReal(sc, sc->blockInvocationID, sc->blockInvocationID, sc->stageInvocationID); if (res != VKFFT_SUCCESS) return res; sprintf(tempNum, "%" PRIu64 "", stageRadix); res = VkMulReal(sc, sc->inoutID, sc->blockInvocationID, tempNum); if (res != VKFFT_SUCCESS) return res; res = VkAddReal(sc, sc->inoutID, sc->inoutID, sc->stageInvocationID); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ stageInvocationID = (gl_LocalInvocationID.y + %" PRIu64 ") %% (%" PRIu64 ");\n\ blockInvocationID = (gl_LocalInvocationID.y + %" PRIu64 ") - stageInvocationID;\n\ inoutID = stageInvocationID + blockInvocationID * %" PRIu64 ";\n", j * logicalGroupSize, stageSize, j * logicalGroupSize, stageRadix);*/ for (uint64_t i = 0; i < stageRadix; i++) { uint64_t id = j + k * logicalRegistersPerThread / stageRadix + i * logicalStoragePerThread / stageRadix; id = (id / logicalRegistersPerThread) * sc->registers_per_thread + id % logicalRegistersPerThread; sprintf(tempID[t + k * sc->registers_per_thread], "%s", sc->regIDs[id]); t++; sprintf(tempNum, "%" PRIu64 "", i * stageSize); res = VkAddReal(sc, sc->sdataID, sc->inoutID, tempNum); if (res != VKFFT_SUCCESS) return res; res = VkMulReal(sc, sc->sdataID, sc->sharedStride, sc->sdataID); if (res != VKFFT_SUCCESS) return res; res = VkAddReal(sc, sc->sdataID, sc->sdataID, sc->gl_LocalInvocationID_x); if (res != VKFFT_SUCCESS) return res; //sprintf(sc->sdataID, "sharedStride * gl_LocalInvocationID.y + inoutID + %" PRIu64 "", i * stageSize); if 
(strcmp(stageNormalization, "")) { res = VkMulComplexNumber(sc, sc->regIDs[id], sc->regIDs[id], stageNormalization); if (res != VKFFT_SUCCESS) return res; } res = VkSharedStore(sc, sc->sdataID, sc->regIDs[id]); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ sdata[gl_WorkGroupSize.x*(inoutID+%" PRIu64 ")+gl_LocalInvocationID.x] = temp%s%s;\n", i * stageSize, sc->regIDs[id], stageNormalization);*/ } } for (uint64_t j = logicalRegistersPerThread; j < sc->registers_per_thread; j++) { sprintf(tempID[t + k * sc->registers_per_thread], "%s", sc->regIDs[t + k * sc->registers_per_thread]); t++; } t = 0; if (sc->registerBoost > 1) { if (sc->localSize[1] * logicalStoragePerThread > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 2); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (sc->localSize[1] * logicalStoragePerThreadNext > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s * %" PRIu64 " < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_y, logicalStoragePerThreadNext, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } for (uint64_t j = 0; j < logicalRegistersPerThreadNext / stageRadixNext; j++) { for (uint64_t i = 0; i < stageRadixNext; i++) { uint64_t id = j + k * logicalRegistersPerThreadNext / stageRadixNext + i * logicalRegistersPerThreadNext / stageRadixNext; id = (id / logicalRegistersPerThreadNext) * sc->registers_per_thread + id % logicalRegistersPerThreadNext; sprintf(tempNum, "%" PRIu64 "", t * logicalGroupSizeNext); res = VkAddReal(sc, sc->sdataID, sc->gl_LocalInvocationID_y, tempNum); if 
(res != VKFFT_SUCCESS) return res; res = VkMulReal(sc, sc->sdataID, sc->sharedStride, sc->sdataID); if (res != VKFFT_SUCCESS) return res; res = VkAddReal(sc, sc->sdataID, sc->sdataID, sc->gl_LocalInvocationID_x); if (res != VKFFT_SUCCESS) return res; //sprintf(sc->sdataID, "sharedStride * gl_LocalInvocationID.y + gl_LocalInvocationID.x + %" PRIu64 "", t * logicalGroupSizeNext); res = VkSharedLoad(sc, tempID[t + k * sc->registers_per_thread], sc->sdataID); if (res != VKFFT_SUCCESS) return res; /*sc->tempLen = sprintf(sc->tempStr, "\ temp%s = sdata[gl_WorkGroupSize.x*(gl_LocalInvocationID.y+%" PRIu64 ")+gl_LocalInvocationID.x];\n", tempID[t + k * sc->registers_per_thread], t * logicalGroupSizeNext);*/ t++; } } if (sc->localSize[1] * logicalStoragePerThreadNext > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->localSize[1] * logicalStoragePerThread > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; } } for (uint64_t i = 0; i < sc->registers_per_thread * sc->registerBoost; i++) { sprintf(sc->regIDs[i], "%s", tempID[i]); } for (uint64_t i = 0; i < sc->registers_per_thread * sc->registerBoost; i++) { free(tempID[i]); tempID[i] = 0; } free(tempID); tempID = 0; } else return VKFFT_ERROR_MALLOC_FAILED; } else { char** tempID; tempID = (char**)malloc(sizeof(char*) * sc->registers_per_thread * sc->registerBoost); if (tempID) { //resID = (char**)malloc(sizeof(char*) * sc->registers_per_thread * sc->registerBoost); for (uint64_t i = 0; i < sc->registers_per_thread * sc->registerBoost; i++) { tempID[i] = 
(char*)malloc(sizeof(char) * 50); if (!tempID[i]) { for (uint64_t j = 0; j < i; j++) { free(tempID[j]); tempID[j] = 0; } free(tempID); tempID = 0; return VKFFT_ERROR_MALLOC_FAILED; } } for (uint64_t k = 0; k < sc->registerBoost; ++k) { for (uint64_t j = 0; j < logicalRegistersPerThread / stageRadix; j++) { for (uint64_t i = 0; i < stageRadix; i++) { uint64_t id = j + k * logicalRegistersPerThread / stageRadix + i * logicalStoragePerThread / stageRadix; id = (id / logicalRegistersPerThread) * sc->registers_per_thread + id % logicalRegistersPerThread; sprintf(tempID[j + i * logicalRegistersPerThread / stageRadix + k * sc->registers_per_thread], "%s", sc->regIDs[id]); } } for (uint64_t j = logicalRegistersPerThread; j < sc->registers_per_thread; j++) { sprintf(tempID[j + k * sc->registers_per_thread], "%s", sc->regIDs[j + k * sc->registers_per_thread]); } } for (uint64_t i = 0; i < sc->registers_per_thread * sc->registerBoost; i++) { sprintf(sc->regIDs[i], "%s", tempID[i]); } for (uint64_t i = 0; i < sc->registers_per_thread * sc->registerBoost; i++) { free(tempID[i]); tempID[i] = 0; } free(tempID); tempID = 0; } else return VKFFT_ERROR_MALLOC_FAILED; } } else { res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (sc->localSize[1] * logicalStoragePerThread > sc->fftDim) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s * %" PRIu64 " < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_y, logicalStoragePerThread, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (((sc->actualInverse) && (sc->normalize)) || ((sc->convolutionStep || sc->useBluesteinFFT) && (stageAngle > 0))) { for (uint64_t i = 0; i < logicalRegistersPerThread; i++) { if (strcmp(stageNormalization, "")) { res = VkMulComplexNumber(sc, sc->regIDs[(i / logicalRegistersPerThread) * sc->registers_per_thread + i % logicalRegistersPerThread], sc->regIDs[(i / 
logicalRegistersPerThread) * sc->registers_per_thread + i % logicalRegistersPerThread], stageNormalization);
				}
				if (res != VKFFT_SUCCESS) return res;
			}
		}
		// Close the thread-guard block; the condition must match the one that emitted the opening "if (...) {".
		if (sc->localSize[1] * logicalStoragePerThread > sc->fftDim) {
			sc->tempLen = sprintf(sc->tempStr, " }\n");
			res = VkAppendLine(sc);
			if (res != VKFFT_SUCCESS) return res;
		}
		res = VkAppendLineFromInput(sc, sc->disableThreadsEnd);
		if (res != VKFFT_SUCCESS) return res;
		res = appendZeropadEnd(sc);
		if (res != VKFFT_SUCCESS) return res;
	}
	return res;
}
// Dispatch to the non-strided or strided shuffle generator according to the kernel (shuffle) type.
static inline VkFFTResult appendRadixShuffle(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* uintType, uint64_t stageSize, uint64_t stageSizeSum, double stageAngle, uint64_t stageRadix, uint64_t stageRadixNext, uint64_t shuffleType) {
	VkFFTResult res = VKFFT_SUCCESS;
	switch (shuffleType) {
	case 0: case 5: case 6: case 110: case 120: case 130: case 140: case 142: case 144: {
		res = appendRadixShuffleNonStrided(sc, floatType, uintType, stageSize, stageSizeSum, stageAngle, stageRadix, stageRadixNext);
		if (res != VKFFT_SUCCESS) return res;
		//appendBarrierVkFFT(sc, 1);
		break;
	}
	case 1: case 2: case 111: case 121: case 131: case 141: case 143: case 145: {
		res = appendRadixShuffleStrided(sc, floatType, uintType, stageSize, stageSizeSum, stageAngle, stageRadix, stageRadixNext);
		if (res != VKFFT_SUCCESS) return res;
		//appendBarrierVkFFT(sc, 1);
		break;
	}
	}
	return res;
}
static inline VkFFTResult appendBoostThreadDataReorder(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* uintType, uint64_t shuffleType, uint64_t start) {
	VkFFTResult res = VKFFT_SUCCESS;
	switch (shuffleType) {
	case 0: case 5: case 6: case 110: case 120: case 130: case 140: case 142: case 144: {
		uint64_t logicalStoragePerThread;
		if (start == 1) {
			logicalStoragePerThread = sc->registers_per_thread_per_radix[sc->stageRadix[0]] * sc->registerBoost;// (sc->registers_per_thread % sc->stageRadix[0] == 0) ?
sc->registers_per_thread * sc->registerBoost : sc->min_registers_per_thread * sc->registerBoost; } else { logicalStoragePerThread = sc->registers_per_thread_per_radix[sc->stageRadix[sc->numStages - 1]] * sc->registerBoost;// (sc->registers_per_thread % sc->stageRadix[sc->numStages - 1] == 0) ? sc->registers_per_thread * sc->registerBoost : sc->min_registers_per_thread * sc->registerBoost; } uint64_t logicalGroupSize = sc->fftDim / logicalStoragePerThread; if ((sc->registerBoost > 1) && (logicalStoragePerThread != sc->min_registers_per_thread * sc->registerBoost)) { for (uint64_t k = 0; k < sc->registerBoost; k++) { if (k > 0) { res = appendBarrierVkFFT(sc, 2); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (start == 0) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s * %" PRIu64 " < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_x, logicalStoragePerThread, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < logicalStoragePerThread / sc->registerBoost; i++) { sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s + %" PRIu64 "] = %s;\n", sc->gl_LocalInvocationID_x, i * logicalGroupSize, sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s + %" PRIu64 "] = %s;\n", sc->gl_LocalInvocationID_x, i * sc->localSize[0], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; res = 
appendBarrierVkFFT(sc, 2); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (start == 1) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s * %" PRIu64 " < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_x, logicalStoragePerThread, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < logicalStoragePerThread / sc->registerBoost; i++) { sc->tempLen = sprintf(sc->tempStr, "\ %s = sdata[%s + %" PRIu64 "];\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, i * logicalGroupSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, "\ %s = sdata[%s + %" PRIu64 "];\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, i * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; } } break; } case 1: case 2: case 111: case 121: case 131: case 141: case 143: case 145: { uint64_t logicalStoragePerThread; if (start == 1) { logicalStoragePerThread = sc->registers_per_thread_per_radix[sc->stageRadix[0]] * sc->registerBoost;// (sc->registers_per_thread % sc->stageRadix[0] == 0) ? sc->registers_per_thread * sc->registerBoost : sc->min_registers_per_thread * sc->registerBoost; } else { logicalStoragePerThread = sc->registers_per_thread_per_radix[sc->stageRadix[sc->numStages - 1]] * sc->registerBoost;// (sc->registers_per_thread % sc->stageRadix[sc->numStages - 1] == 0) ? 
sc->registers_per_thread * sc->registerBoost : sc->min_registers_per_thread * sc->registerBoost; } uint64_t logicalGroupSize = sc->fftDim / logicalStoragePerThread; if ((sc->registerBoost > 1) && (logicalStoragePerThread != sc->min_registers_per_thread * sc->registerBoost)) { for (uint64_t k = 0; k < sc->registerBoost; k++) { if (k > 0) { res = appendBarrierVkFFT(sc, 2); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (start == 0) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s * %" PRIu64 " < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_y, logicalStoragePerThread, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < logicalStoragePerThread / sc->registerBoost; i++) { sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s + %s * (%s + %" PRIu64 ")] = %s;\n", sc->gl_LocalInvocationID_x, sc->sharedStride, sc->gl_LocalInvocationID_y, i * logicalGroupSize, sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s + %s * (%s + %" PRIu64 ")] = %s;\n", sc->gl_LocalInvocationID_x, sc->sharedStride, sc->gl_LocalInvocationID_y, i * sc->localSize[1], sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 2); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); 
if (res != VKFFT_SUCCESS) return res; if (start == 1) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s * %" PRIu64 " < %" PRIu64 ") {\n", sc->gl_LocalInvocationID_y, logicalStoragePerThread, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 0; i < logicalStoragePerThread / sc->registerBoost; i++) { sc->tempLen = sprintf(sc->tempStr, "\ %s = sdata[%s + %s * (%s + %" PRIu64 ")];\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, sc->sharedStride, sc->gl_LocalInvocationID_y, i * logicalGroupSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, "\ %s = sdata[%s + %s * (%s + %" PRIu64 ")];\n", sc->regIDs[i + k * sc->registers_per_thread], sc->gl_LocalInvocationID_x, sc->sharedStride, sc->gl_LocalInvocationID_y, i * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; } } break; } } return res; } static inline VkFFTResult appendCoordinateRegisterStore(VkFFTSpecializationConstantsLayout* sc, uint64_t readType) { VkFFTResult res = VKFFT_SUCCESS; if ((!sc->writeFromRegisters) || ((sc->convolutionStep) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)))) { switch (readType) { case 0: case 5: case 6: case 110: case 120: case 130: case 140: case 142: case 144://single_c2c { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (sc->matrixConvolution == 1) { sc->tempLen = sprintf(sc->tempStr, "\ %s = 
sdata[sharedStride * %s + %s];\n", sc->regIDs[0], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 1; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, "\ %s = sdata[sharedStride * %s + %s + %" PRIu64 " * %s];\n", sc->regIDs[i], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i, sc->gl_WorkGroupSize_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } //appendBarrierVkFFT(sc, 3); } else { sc->tempLen = sprintf(sc->tempStr, "\ switch (coordinate) {\n\ case 0:\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ %s = sdata[sharedStride * %s + %s];\n", sc->regIDs[0], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 1; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, "\ %s = sdata[sharedStride * %s + %s + %" PRIu64 " * %s];\n", sc->regIDs[i], sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i, sc->gl_WorkGroupSize_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } //appendBarrierVkFFT(sc, 3); sc->tempLen = sprintf(sc->tempStr, " break;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 1; i < sc->matrixConvolution; i++) { sc->tempLen = sprintf(sc->tempStr, "\ case %" PRIu64 ":\n", i); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ %s_%" PRIu64 " = sdata[sharedStride * %s + %s];\n", sc->regIDs[0], i, sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t j = 1; j < sc->min_registers_per_thread; j++) { sc->tempLen = sprintf(sc->tempStr, "\ %s_%" PRIu64 " = sdata[sharedStride * %s + %s + %" PRIu64 " * %s];\n", sc->regIDs[j], i, sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, j, 
sc->gl_WorkGroupSize_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } //appendBarrierVkFFT(sc, 3); sc->tempLen = sprintf(sc->tempStr, " break;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; break; } case 1: case 111: case 121: case 131: case 141: case 143: case 145://grouped_c2c { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (sc->matrixConvolution == 1) { sc->tempLen = sprintf(sc->tempStr, "\ %s = sdata[%s*(%s)+%s];\n", sc->regIDs[0], sc->sharedStride, sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 1; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, "\ %s = sdata[%s*(%s+%" PRIu64 "*%s)+%s];\n", sc->regIDs[i], sc->sharedStride, sc->gl_LocalInvocationID_y, i, sc->gl_WorkGroupSize_y, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } //appendBarrierVkFFT(sc, 3); } else { sc->tempLen = sprintf(sc->tempStr, "\ switch (coordinate) {\n\ case 0:\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ %s = sdata[%s*(%s)+%s];\n", sc->regIDs[0], sc->sharedStride, sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 1; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, "\ %s = sdata[%s*(%s+%" PRIu64 "*%s)+%s];\n", sc->regIDs[i], sc->sharedStride, sc->gl_LocalInvocationID_y, 
i, sc->gl_WorkGroupSize_y, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } //appendBarrierVkFFT(sc, 3); sc->tempLen = sprintf(sc->tempStr, " break;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 1; i < sc->matrixConvolution; i++) { sc->tempLen = sprintf(sc->tempStr, "\ case %" PRIu64 ":\n", i); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ %s_%" PRIu64 " = sdata[%s*(%s)+%s];\n", sc->regIDs[0], i, sc->sharedStride, sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t j = 1; j < sc->min_registers_per_thread; j++) { sc->tempLen = sprintf(sc->tempStr, "\ %s_%" PRIu64 " = sdata[%s*(%s+%" PRIu64 "*%s)+%s];\n", sc->regIDs[j], i, sc->sharedStride, sc->gl_LocalInvocationID_y, j, sc->gl_WorkGroupSize_y, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } //appendBarrierVkFFT(sc, 3); sc->tempLen = sprintf(sc->tempStr, " break;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; break; } } } return res; } static inline VkFFTResult appendCoordinateRegisterPull(VkFFTSpecializationConstantsLayout* sc, uint64_t readType) { VkFFTResult res = VKFFT_SUCCESS; if ((!sc->readToRegisters) || ((sc->convolutionStep) && ((sc->matrixConvolution > 1) || (sc->numKernels > 1)))) { switch (readType) { case 0: case 5: case 6: case 110: case 120: case 130: case 140: case 142: case 144://single_c2c { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = 
VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (sc->matrixConvolution == 1) { sc->tempLen = sprintf(sc->tempStr, "\ sdata[sharedStride * %s + %s] = %s;\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 1; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, "\ sdata[sharedStride * %s + %s + %" PRIu64 " * %s] = %s;\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i, sc->gl_WorkGroupSize_x, sc->regIDs[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } //appendBarrierVkFFT(sc, 3); } else { sc->tempLen = sprintf(sc->tempStr, "\ switch (coordinate) {\n\ case 0:\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ sdata[sharedStride * %s + %s] = %s;\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 1; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, "\ sdata[sharedStride * %s + %s + %" PRIu64 " * %s] = %s;\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, i, sc->gl_WorkGroupSize_x, sc->regIDs[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } //appendBarrierVkFFT(sc, 3); sc->tempLen = sprintf(sc->tempStr, " break;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 1; i < sc->matrixConvolution; i++) { sc->tempLen = sprintf(sc->tempStr, "\ case %" PRIu64 ":\n", i); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ sdata[sharedStride * %s + %s] = %s_%" PRIu64 ";\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, sc->regIDs[0], i); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t j = 1; j < sc->min_registers_per_thread; j++) { sc->tempLen = 
sprintf(sc->tempStr, "\ sdata[sharedStride * %s + %s + %" PRIu64 " * %s] = %s_%" PRIu64 ";\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, j, sc->gl_WorkGroupSize_x, sc->regIDs[j], i); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } //appendBarrierVkFFT(sc, 3); sc->tempLen = sprintf(sc->tempStr, " break;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; break; } case 1: case 111: case 121: case 131: case 141: case 143: case 145://grouped_c2c { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; if (sc->matrixConvolution == 1) { sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s*(%s)+%s] = %s;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 1; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s*(%s+%" PRIu64 "*%s)+%s] = %s;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, i, sc->gl_WorkGroupSize_y, sc->gl_LocalInvocationID_x, sc->regIDs[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } //appendBarrierVkFFT(sc, 3); } else { sc->tempLen = sprintf(sc->tempStr, "\ switch (coordinate) {\n\ case 0:\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "\ sdata[%s*(%s)+%s] = %s;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t i = 1; i < 
sc->min_registers_per_thread; i++) {
					sc->tempLen = sprintf(sc->tempStr, "\
		sdata[%s*(%s+%" PRIu64 "*%s)+%s] = %s;\n", sc->sharedStride, sc->gl_LocalInvocationID_y, i, sc->gl_WorkGroupSize_y, sc->gl_LocalInvocationID_x, sc->regIDs[i]);
					res = VkAppendLine(sc);
					if (res != VKFFT_SUCCESS) return res;
				}
				//appendBarrierVkFFT(sc, 3);
				sc->tempLen = sprintf(sc->tempStr, " break;\n");
				res = VkAppendLine(sc);
				if (res != VKFFT_SUCCESS) return res;
				for (uint64_t i = 1; i < sc->matrixConvolution; i++) {
					sc->tempLen = sprintf(sc->tempStr, "\
		case %" PRIu64 ":\n", i);
					res = VkAppendLine(sc);
					if (res != VKFFT_SUCCESS) return res;
					sc->tempLen = sprintf(sc->tempStr, "\
		sdata[%s*(%s)+%s] = %s_%" PRIu64 ";\n", sc->sharedStride, sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, sc->regIDs[0], i);
					res = VkAppendLine(sc);
					if (res != VKFFT_SUCCESS) return res;
					for (uint64_t j = 1; j < sc->min_registers_per_thread; j++) {
						sc->tempLen = sprintf(sc->tempStr, "\
		sdata[%s*(%s+%" PRIu64 "*%s)+%s] = %s_%" PRIu64 ";\n", sc->sharedStride, sc->gl_LocalInvocationID_y, j, sc->gl_WorkGroupSize_y, sc->gl_LocalInvocationID_x, sc->regIDs[j], i);
						res = VkAppendLine(sc);
						if (res != VKFFT_SUCCESS) return res;
					}
					//appendBarrierVkFFT(sc, 3);
					sc->tempLen = sprintf(sc->tempStr, " break;\n");
					res = VkAppendLine(sc);
					if (res != VKFFT_SUCCESS) return res;
				}
				sc->tempLen = sprintf(sc->tempStr, " }\n");
				res = VkAppendLine(sc);
				if (res != VKFFT_SUCCESS) return res;
			}
			res = VkAppendLineFromInput(sc, sc->disableThreadsEnd);
			if (res != VKFFT_SUCCESS) return res;
			res = appendZeropadEnd(sc);
			if (res != VKFFT_SUCCESS) return res;
			break;
		}
		}
	}
	return res;
}
static inline VkFFTResult appendPreparationBatchedKernelConvolution(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* floatTypeMemory, const char* uintType, uint64_t dataType) {
	VkFFTResult res = VKFFT_SUCCESS;
	char vecType[30];
#if(VKFFT_BACKEND==0)
	if (!strcmp(floatType, "float")) sprintf(vecType, "vec2");
	if (!strcmp(floatType, "double"))
		sprintf(vecType, "dvec2");
#elif(VKFFT_BACKEND==1)
	if (!strcmp(floatType, "float")) sprintf(vecType, "float2");
	if (!strcmp(floatType, "double")) sprintf(vecType, "double2");
#elif(VKFFT_BACKEND==2)
	if (!strcmp(floatType, "float")) sprintf(vecType, "float2");
	if (!strcmp(floatType, "double")) sprintf(vecType, "double2");
#elif(VKFFT_BACKEND==3)
	if (!strcmp(floatType, "float")) sprintf(vecType, "float2");
	if (!strcmp(floatType, "double")) sprintf(vecType, "double2");
#endif
	char separateRegisterStore[100] = "_store";

	for (uint64_t i = 0; i < sc->registers_per_thread; i++) {
		sc->tempLen = sprintf(sc->tempStr, " %s %s%s;\n", vecType, sc->regIDs[i], separateRegisterStore);
		res = VkAppendLine(sc);
		if (res != VKFFT_SUCCESS) return res;
		for (uint64_t j = 1; j < sc->matrixConvolution; j++) {
			sc->tempLen = sprintf(sc->tempStr, " %s %s_%" PRIu64 "%s;\n", vecType, sc->regIDs[i], j, separateRegisterStore);
			res = VkAppendLine(sc);
			if (res != VKFFT_SUCCESS) return res;
		}
	}
	for (uint64_t i = 0; i < sc->registers_per_thread; i++) {
		//sc->tempLen = sprintf(sc->tempStr, " temp%s[i]=temp[i];\n", separateRegisterStore);
		sc->tempLen = sprintf(sc->tempStr, " %s%s=%s;\n", sc->regIDs[i], separateRegisterStore, sc->regIDs[i]);
		res = VkAppendLine(sc);
		if (res != VKFFT_SUCCESS) return res;
		for (uint64_t j = 1; j < sc->matrixConvolution; j++) {
			sc->tempLen = sprintf(sc->tempStr, " %s_%" PRIu64 "%s=%s_%" PRIu64 ";\n", sc->regIDs[i], j, separateRegisterStore, sc->regIDs[i], j);
			res = VkAppendLine(sc);
			if (res != VKFFT_SUCCESS) return res;
		}
	}
	sc->tempLen = sprintf(sc->tempStr, " for (%s batchID=0; batchID < %" PRIu64 "; batchID++){\n", uintType, sc->numKernels);
	res = VkAppendLine(sc);
	if (res != VKFFT_SUCCESS) return res;
	return res;
}
static inline VkFFTResult appendBluesteinConvolution(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* floatTypeMemory, const char* uintType, uint64_t dataType) {
	VkFFTResult res = VKFFT_SUCCESS;
	char shiftX[500] = "";
	if
(sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); char requestCoordinate[100] = ""; if (sc->convolutionStep) { if (sc->matrixConvolution > 1) { sprintf(requestCoordinate, "0"); } } char index_x[2000] = ""; char index_y[2000] = ""; char requestBatch[100] = ""; char separateRegisterStore[100] = ""; if (sc->convolutionStep) { if (sc->numKernels > 1) { sprintf(requestBatch, "batchID"); sprintf(separateRegisterStore, "_store"); } } res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; for (uint64_t j = 0; j < sc->matrixConvolution; j++) { sc->tempLen = sprintf(sc->tempStr, " %s temp_real%" PRIu64 " = 0;\n", floatType, j); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s temp_imag%" PRIu64 " = 0;\n", floatType, j); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { switch (dataType) { case 0: case 5: case 6: case 110: case 120: case 130: case 140: case 142: case 144: { if (sc->fftDim == sc->fft_dim_full) { sc->tempLen = sprintf(sc->tempStr, " %s = %s + %" PRIu64 ";\n", sc->inoutID, sc->gl_LocalInvocationID_x, i * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " %s = %s+%" PRIu64 "+%s * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ");", sc->inoutID, sc->gl_LocalInvocationID_x, i * sc->localSize[0], sc->gl_LocalInvocationID_y, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1] * sc->firstStageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " inoutID = 
indexInput(%s+%" PRIu64 "+%s * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ")%s%s);\n", sc->gl_LocalInvocationID_x, i * sc->localSize[0], sc->gl_LocalInvocationID_y, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1] * sc->firstStageStartSize, requestCoordinate, requestBatch); } break; } case 1: case 111: case 121: case 131: case 141: case 143: case 145: { if (sc->fftDim == sc->fft_dim_full) { sc->tempLen = sprintf(sc->tempStr, " %s = %s + %" PRIu64 ";\n", sc->inoutID, sc->gl_LocalInvocationID_y, i * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " %s = (%" PRIu64 " * (%s + %" PRIu64 ") + ((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")+((%s%s) / %" PRIu64 ") * (%" PRIu64 "));\n", sc->inoutID, sc->stageStartSize, sc->gl_LocalInvocationID_y, (i)*sc->localSize[1], sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * sc->stageStartSize, sc->fftDim * sc->stageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } break; } } char kernelName[100] = ""; sprintf(kernelName, "BluesteinConvolutionKernel"); if ((sc->inverseBluestein) && (sc->fftDim == sc->fft_dim_full)) sc->tempLen = sprintf(sc->tempStr, " temp_real0 = %s[inoutID].x * %s%s.x + %s[inoutID].y * %s%s.y;\n", kernelName, sc->regIDs[i], separateRegisterStore, kernelName, sc->regIDs[i], separateRegisterStore); else sc->tempLen = sprintf(sc->tempStr, " temp_real0 = %s[inoutID].x * %s%s.x - %s[inoutID].y * %s%s.y;\n", kernelName, sc->regIDs[i], separateRegisterStore, kernelName, sc->regIDs[i], separateRegisterStore); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if ((sc->inverseBluestein) && (sc->fftDim == sc->fft_dim_full)) sc->tempLen = sprintf(sc->tempStr, " 
temp_imag0 = %s[inoutID].x * %s%s.y - %s[inoutID].y * %s%s.x;\n", kernelName, sc->regIDs[i], separateRegisterStore, kernelName, sc->regIDs[i], separateRegisterStore); else sc->tempLen = sprintf(sc->tempStr, " temp_imag0 = %s[inoutID].x * %s%s.y + %s[inoutID].y * %s%s.x;\n", kernelName, sc->regIDs[i], separateRegisterStore, kernelName, sc->regIDs[i], separateRegisterStore); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = temp_real0;\n", sc->regIDs[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = temp_imag0;\n", sc->regIDs[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; return res; } static inline VkFFTResult appendKernelConvolution(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* floatTypeMemory, const char* uintType, uint64_t dataType) { VkFFTResult res = VKFFT_SUCCESS; char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); char requestCoordinate[100] = ""; if (sc->convolutionStep) { if (sc->matrixConvolution > 1) { sprintf(requestCoordinate, "0"); } } char index_x[2000] = ""; char index_y[2000] = ""; char requestBatch[100] = ""; char separateRegisterStore[100] = ""; if (sc->convolutionStep) { if (sc->numKernels > 1) { sprintf(requestBatch, "batchID"); sprintf(separateRegisterStore, "_store"); } } res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; res = VkAppendLineFromInput(sc, sc->disableThreadsStart); if (res != VKFFT_SUCCESS) return res; for (uint64_t j = 0; j < sc->matrixConvolution; j++) { sc->tempLen = sprintf(sc->tempStr, " %s temp_real%" PRIu64 " = 0;\n", floatType, j); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen 
= sprintf(sc->tempStr, " %s temp_imag%" PRIu64 " = 0;\n", floatType, j); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (i > 0) { for (uint64_t j = 0; j < sc->matrixConvolution; j++) { sc->tempLen = sprintf(sc->tempStr, " temp_real%" PRIu64 " = 0;\n", j); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " temp_imag%" PRIu64 " = 0;\n", j); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } switch (dataType) { case 0: { if (sc->fftDim == sc->fft_dim_full) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, i * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, i * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputStride[0] > 1) { sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "(combinedID %% %" PRIu64 ") * %" PRIu64 " + (combinedID / %" PRIu64 ") * %" PRIu64 "", sc->fftDim, sc->inputStride[0], sc->fftDim, sc->inputStride[1]); uint64_t tempSaveInputOffset = sc->inputOffset; uint64_t tempSaveInputNumberByteSize = sc->inputNumberByteSize; sc->inputOffset = sc->kernelOffset; sc->inputNumberByteSize = sc->kernelNumberByteSize; res = indexInputVkFFT(sc, uintType, dataType + 1000, index_x, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->inputOffset = tempSaveInputOffset; sc->inputNumberByteSize = tempSaveInputNumberByteSize; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " inoutID = indexInput((combinedID %% %" PRIu64 ") * %" PRIu64 " + 
(combinedID / %" PRIu64 ") * %" PRIu64 "%s%s);\n", sc->fftDim, sc->inputStride[0], sc->fftDim, sc->inputStride[1], requestCoordinate, requestBatch); } else { sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 "", sc->fftDim, sc->fftDim, sc->inputStride[1]); uint64_t tempSaveInputOffset = sc->inputOffset; uint64_t tempSaveInputNumberByteSize = sc->inputNumberByteSize; sc->inputOffset = sc->kernelOffset; sc->inputNumberByteSize = sc->kernelNumberByteSize; res = indexInputVkFFT(sc, uintType, dataType + 1000, index_x, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->inputOffset = tempSaveInputOffset; sc->inputNumberByteSize = tempSaveInputNumberByteSize; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " inoutID = indexInput((combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 "%s%s);\n", sc->fftDim, sc->fftDim, sc->inputStride[1], requestCoordinate, requestBatch); } } else { sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "%s+%" PRIu64 "+%s * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ")", sc->gl_LocalInvocationID_x, i * sc->localSize[0], sc->gl_LocalInvocationID_y, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1] * sc->firstStageStartSize); uint64_t tempSaveInputOffset = sc->inputOffset; uint64_t tempSaveInputNumberByteSize = sc->inputNumberByteSize; sc->inputOffset = sc->kernelOffset; sc->inputNumberByteSize = sc->kernelNumberByteSize; res = indexInputVkFFT(sc, uintType, dataType + 1000, index_x, 0, 
requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->inputOffset = tempSaveInputOffset; sc->inputNumberByteSize = tempSaveInputNumberByteSize; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " inoutID = indexInput(%s+%" PRIu64 "+%s * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ")%s%s);\n", sc->gl_LocalInvocationID_x, i * sc->localSize[0], sc->gl_LocalInvocationID_y, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1] * sc->firstStageStartSize, requestCoordinate, requestBatch); } break; } case 1: { sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x); sprintf(index_y, "(%s+%" PRIu64 ")+((%s%s)/%" PRIu64 ")%%(%" PRIu64 ")+((%s%s)/%" PRIu64 ")*(%" PRIu64 ")", sc->gl_LocalInvocationID_y, i * sc->localSize[1], sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * sc->stageStartSize, sc->fftDim); uint64_t tempSaveInputOffset = sc->inputOffset; uint64_t tempSaveInputNumberByteSize = sc->inputNumberByteSize; sc->inputOffset = sc->kernelOffset; sc->inputNumberByteSize = sc->kernelNumberByteSize; res = indexInputVkFFT(sc, uintType, dataType + 1000, index_x, index_y, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->inputOffset = tempSaveInputOffset; sc->inputNumberByteSize = tempSaveInputNumberByteSize; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " inoutID = indexInput((%s%s) %% (%" PRIu64 "), (%s+%" PRIu64 ")+((%s%s)/%" PRIu64 
")%%(%" PRIu64 ")+((%s%s)/%" PRIu64 ")*(%" PRIu64 ")%s%s);\n", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->gl_LocalInvocationID_y, i * sc->localSize[1], sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * sc->stageStartSize, sc->fftDim, requestCoordinate, requestBatch); break; } } char kernelName[100] = ""; sprintf(kernelName, "kernel_obj"); if ((sc->kernelBlockNum == 1) || (sc->useBluesteinFFT)) { for (uint64_t j = 0; j < sc->matrixConvolution; j++) { for (uint64_t l = 0; l < sc->matrixConvolution; l++) { uint64_t k = 0; if (sc->symmetricKernel) { k = (l < j) ? (l * sc->matrixConvolution - l * l + j) : (j * sc->matrixConvolution - j * j + l); } else { k = (j * sc->matrixConvolution + l); } if (sc->conjugateConvolution == 0) { if (l == 0) sc->tempLen = sprintf(sc->tempStr, " temp_real%" PRIu64 " += %s[inoutID+%" PRIu64 "].x * %s%s.x - %s[inoutID+%" PRIu64 "].y * %s%s.y;\n", j, kernelName, k * sc->inputStride[3], sc->regIDs[i], separateRegisterStore, kernelName, k * sc->inputStride[3], sc->regIDs[i], separateRegisterStore); else sc->tempLen = sprintf(sc->tempStr, " temp_real%" PRIu64 " += %s[inoutID+%" PRIu64 "].x * %s_%" PRIu64 "%s.x - %s[inoutID+%" PRIu64 "].y * %s_%" PRIu64 "%s.y;\n", j, kernelName, k * sc->inputStride[3], sc->regIDs[i], l, separateRegisterStore, kernelName, k * sc->inputStride[3], sc->regIDs[i], l, separateRegisterStore); } else { if (l == 0) sc->tempLen = sprintf(sc->tempStr, " temp_real%" PRIu64 " += %s[inoutID+%" PRIu64 "].x * %s%s.x + %s[inoutID+%" PRIu64 "].y * %s%s.y;\n", j, kernelName, k * sc->inputStride[3], sc->regIDs[i], separateRegisterStore, kernelName, k * sc->inputStride[3], sc->regIDs[i], separateRegisterStore); else sc->tempLen = sprintf(sc->tempStr, " temp_real%" PRIu64 " += %s[inoutID+%" PRIu64 "].x * %s_%" PRIu64 "%s.x + %s[inoutID+%" PRIu64 "].y * %s_%" PRIu64 "%s.y;\n", j, kernelName, k * sc->inputStride[3], sc->regIDs[i], l, 
separateRegisterStore, kernelName, k * sc->inputStride[3], sc->regIDs[i], l, separateRegisterStore); } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } for (uint64_t l = 0; l < sc->matrixConvolution; l++) { uint64_t k = 0; if (sc->symmetricKernel) { k = (l < j) ? (l * sc->matrixConvolution - l * l + j) : (j * sc->matrixConvolution - j * j + l); } else { k = (j * sc->matrixConvolution + l); } if (sc->conjugateConvolution == 0) { if (l == 0) sc->tempLen = sprintf(sc->tempStr, " temp_imag%" PRIu64 " += %s[inoutID+%" PRIu64 "].x * %s%s.y + %s[inoutID+%" PRIu64 "].y * %s%s.x;\n", j, kernelName, k * sc->inputStride[3], sc->regIDs[i], separateRegisterStore, kernelName, k * sc->inputStride[3], sc->regIDs[i], separateRegisterStore); else sc->tempLen = sprintf(sc->tempStr, " temp_imag%" PRIu64 " += %s[inoutID+%" PRIu64 "].x * %s_%" PRIu64 "%s.y + %s[inoutID+%" PRIu64 "].y * %s_%" PRIu64 "%s.x;\n", j, kernelName, k * sc->inputStride[3], sc->regIDs[i], l, separateRegisterStore, kernelName, k * sc->inputStride[3], sc->regIDs[i], l, separateRegisterStore); } else { if (sc->conjugateConvolution == 1) { if (l == 0) sc->tempLen = sprintf(sc->tempStr, " temp_imag%" PRIu64 " += %s[inoutID+%" PRIu64 "].y * %s%s.x - %s[inoutID+%" PRIu64 "].x * %s%s.y ;\n", j, kernelName, k * sc->inputStride[3], sc->regIDs[i], separateRegisterStore, kernelName, k * sc->inputStride[3], sc->regIDs[i], separateRegisterStore); else sc->tempLen = sprintf(sc->tempStr, " temp_imag%" PRIu64 " += %s[inoutID+%" PRIu64 "].y * %s_%" PRIu64 "%s.x - %s[inoutID+%" PRIu64 "].x * %s_%" PRIu64 "%s.y;\n", j, kernelName, k * sc->inputStride[3], sc->regIDs[i], l, separateRegisterStore, kernelName, k * sc->inputStride[3], sc->regIDs[i], l, separateRegisterStore); } else { if (l == 0) sc->tempLen = sprintf(sc->tempStr, " temp_imag%" PRIu64 " += %s[inoutID+%" PRIu64 "].x * %s%s.y - %s[inoutID+%" PRIu64 "].y * %s%s.x;\n", j, kernelName, k * sc->inputStride[3], sc->regIDs[i], separateRegisterStore, kernelName, k * 
sc->inputStride[3], sc->regIDs[i], separateRegisterStore); else sc->tempLen = sprintf(sc->tempStr, " temp_imag%" PRIu64 " += %s[inoutID+%" PRIu64 "].x * %s_%" PRIu64 "%s.y - %s[inoutID+%" PRIu64 "].y * %s_%" PRIu64 "%s.x;\n", j, kernelName, k * sc->inputStride[3], sc->regIDs[i], l, separateRegisterStore, kernelName, k * sc->inputStride[3], sc->regIDs[i], l, separateRegisterStore); } } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->crossPowerSpectrumNormalization) { #if(VKFFT_BACKEND==0) sc->tempLen = sprintf(sc->tempStr, " w.x = inversesqrt(temp_real0*temp_real0+temp_imag0*temp_imag0);\n"); #elif(VKFFT_BACKEND==1) sc->tempLen = sprintf(sc->tempStr, " w.x = rsqrt(temp_real0*temp_real0+temp_imag0*temp_imag0);\n"); #elif(VKFFT_BACKEND==2) sc->tempLen = sprintf(sc->tempStr, " w.x = rsqrt(temp_real0*temp_real0+temp_imag0*temp_imag0);\n"); #elif(VKFFT_BACKEND==3) sc->tempLen = sprintf(sc->tempStr, " w.x = rsqrt(temp_real0*temp_real0+temp_imag0*temp_imag0);\n"); #endif res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = temp_real0 * w.x;\n", sc->regIDs[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = temp_imag0 * w.x;\n", sc->regIDs[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " %s.x = temp_real0;\n", sc->regIDs[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = temp_imag0;\n", sc->regIDs[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } for (uint64_t l = 1; l < sc->matrixConvolution; l++) { if (sc->crossPowerSpectrumNormalization) { #if(VKFFT_BACKEND==0) sc->tempLen = sprintf(sc->tempStr, " w.x = inversesqrt(temp_real%" PRIu64 "*temp_real%" PRIu64 "+temp_imag%" PRIu64 "*temp_imag%" PRIu64 ");\n", l, l, l, l); #elif(VKFFT_BACKEND==1) sc->tempLen = sprintf(sc->tempStr, " w.x = 
rsqrt(temp_real%" PRIu64 "*temp_real%" PRIu64 "+temp_imag%" PRIu64 "*temp_imag%" PRIu64 ");\n", l, l, l, l); #elif(VKFFT_BACKEND==2) sc->tempLen = sprintf(sc->tempStr, " w.x = rsqrt(temp_real%" PRIu64 "*temp_real%" PRIu64 "+temp_imag%" PRIu64 "*temp_imag%" PRIu64 ");\n", l, l, l, l); #elif(VKFFT_BACKEND==3) sc->tempLen = sprintf(sc->tempStr, " w.x = rsqrt(temp_real%" PRIu64 "*temp_real%" PRIu64 "+temp_imag%" PRIu64 "*temp_imag%" PRIu64 ");\n", l, l, l, l); #endif res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s_%" PRIu64 ".x = temp_real%" PRIu64 " * w.x;\n", sc->regIDs[i], l, l); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s_%" PRIu64 ".y = temp_imag%" PRIu64 " * w.x;\n", sc->regIDs[i], l, l); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " %s_%" PRIu64 ".x = temp_real%" PRIu64 ";\n", sc->regIDs[i], l, l); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s_%" PRIu64 ".y = temp_imag%" PRIu64 ";\n", sc->regIDs[i], l, l); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } else { for (uint64_t j = 0; j < sc->matrixConvolution; j++) { sc->tempLen = sprintf(sc->tempStr, " %s temp_real%" PRIu64 " = 0;\n", floatType, j); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t l = 0; l < sc->matrixConvolution; l++) { uint64_t k = 0; if (sc->symmetricKernel) { k = (l < j) ? 
(l * sc->matrixConvolution - l * l + j) : (j * sc->matrixConvolution - j * j + l); } else { k = (j * sc->matrixConvolution + l); } if (l == 0) sc->tempLen = sprintf(sc->tempStr, " temp_real%" PRIu64 " += kernelBlocks[(inoutID+%" PRIu64 ")/%" PRIu64 "].%s[(inoutID+%" PRIu64 ") %% %" PRIu64 "].x * %s%s.x - kernelBlocks[(inoutID+%" PRIu64 ")/%" PRIu64 "].%s[(inoutID+%" PRIu64 ") %% %" PRIu64 "].y * %s%s.y;\n", j, k * sc->inputStride[3], sc->kernelBlockSize, kernelName, k * sc->inputStride[3], sc->kernelBlockSize, sc->regIDs[i], separateRegisterStore, k * sc->inputStride[3], sc->kernelBlockSize, kernelName, k * sc->inputStride[3], sc->kernelBlockSize, sc->regIDs[i], separateRegisterStore); else sc->tempLen = sprintf(sc->tempStr, " temp_real%" PRIu64 " += kernelBlocks[(inoutID+%" PRIu64 ")/%" PRIu64 "].%s[(inoutID+%" PRIu64 ") %% %" PRIu64 "].x * %s_%" PRIu64 "%s.x - kernelBlocks[(inoutID+%" PRIu64 ")/%" PRIu64 "].%s[(inoutID+%" PRIu64 ") %% %" PRIu64 "].y * %s_%" PRIu64 "%s.y;\n", j, k * sc->inputStride[3], sc->kernelBlockSize, kernelName, k * sc->inputStride[3], sc->kernelBlockSize, sc->regIDs[i], l, separateRegisterStore, k * sc->inputStride[3], sc->kernelBlockSize, kernelName, k * sc->inputStride[3], sc->kernelBlockSize, sc->regIDs[i], l, separateRegisterStore); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s temp_imag%" PRIu64 " = 0;\n", floatType, j); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t l = 0; l < sc->matrixConvolution; l++) { uint64_t k = 0; if (sc->symmetricKernel) { k = (l < j) ? 
(l * sc->matrixConvolution - l * l + j) : (j * sc->matrixConvolution - j * j + l); } else { k = (j * sc->matrixConvolution + l); } if (l == 0) sc->tempLen = sprintf(sc->tempStr, " temp_imag%" PRIu64 " += kernelBlocks[(inoutID+%" PRIu64 ")/%" PRIu64 "].%s[(inoutID+%" PRIu64 ") %% %" PRIu64 "].x * %s%s.y + kernelBlocks[(inoutID+%" PRIu64 ")/%" PRIu64 "].%s[(inoutID+%" PRIu64 ") %% %" PRIu64 "].y * %s%s.x;\n", j, k * sc->inputStride[3], sc->kernelBlockSize, kernelName, k * sc->inputStride[3], sc->kernelBlockSize, sc->regIDs[i], separateRegisterStore, k * sc->inputStride[3], sc->kernelBlockSize, kernelName, k * sc->inputStride[3], sc->kernelBlockSize, sc->regIDs[i], separateRegisterStore); else sc->tempLen = sprintf(sc->tempStr, " temp_imag%" PRIu64 " += kernelBlocks[(inoutID+%" PRIu64 ")/%" PRIu64 "].%s[(inoutID+%" PRIu64 ") %% %" PRIu64 "].x * %s_%" PRIu64 "%s.y + kernelBlocks[(inoutID+%" PRIu64 ")/%" PRIu64 "].%s[(inoutID+%" PRIu64 ") %% %" PRIu64 "].y * %s_%" PRIu64 "%s.x;\n", j, k * sc->inputStride[3], sc->kernelBlockSize, kernelName, k * sc->inputStride[3], sc->kernelBlockSize, sc->regIDs[i], l, separateRegisterStore, k * sc->inputStride[3], sc->kernelBlockSize, kernelName, k * sc->inputStride[3], sc->kernelBlockSize, sc->regIDs[i], l, separateRegisterStore); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } sc->tempLen = sprintf(sc->tempStr, " %s.x = temp_real0;\n", sc->regIDs[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = temp_imag0;\n", sc->regIDs[i]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t l = 1; l < sc->matrixConvolution; l++) { sc->tempLen = sprintf(sc->tempStr, " %s_%" PRIu64 ".x = temp_real%" PRIu64 ";\n", sc->regIDs[i], l, l); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s_%" PRIu64 ".y = temp_imag%" PRIu64 ";\n", sc->regIDs[i], l, l); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) 
return res; } } } res = VkAppendLineFromInput(sc, sc->disableThreadsEnd); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; return res; } static inline VkFFTResult setWriteFromRegisters(VkFFTSpecializationConstantsLayout* sc, uint64_t writeType) { VkFFTResult res = VKFFT_SUCCESS; switch (writeType) { case 0: //single_c2c { if ((sc->localSize[1] > 1) || (sc->localSize[0] * sc->stageRadix[sc->numStages - 1] * (sc->registers_per_thread_per_radix[sc->stageRadix[sc->numStages - 1]] / sc->stageRadix[sc->numStages - 1]) > sc->fftDim)) { sc->writeFromRegisters = 0; } else sc->writeFromRegisters = 1; break; } case 1: //grouped_c2c { if (sc->localSize[1] * sc->stageRadix[sc->numStages - 1] * (sc->registers_per_thread_per_radix[sc->stageRadix[sc->numStages - 1]] / sc->stageRadix[sc->numStages - 1]) > sc->fftDim) { sc->writeFromRegisters = 0; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } else sc->writeFromRegisters = 1; break; } case 2: //single_c2c_strided { if (sc->localSize[1] * sc->stageRadix[sc->numStages - 1] * (sc->registers_per_thread_per_radix[sc->stageRadix[sc->numStages - 1]] / sc->stageRadix[sc->numStages - 1]) > sc->fftDim) { sc->writeFromRegisters = 0; } else sc->writeFromRegisters = 1; break; } case 5://single_r2c { sc->writeFromRegisters = 0; break; } case 6: //single_c2r { if ((sc->axisSwapped) || (sc->localSize[1] > 1) || (sc->localSize[0] * sc->stageRadix[sc->numStages - 1] * (sc->registers_per_thread_per_radix[sc->stageRadix[sc->numStages - 1]] / sc->stageRadix[sc->numStages - 1]) > sc->fftDim)) { sc->writeFromRegisters = 0; } else sc->writeFromRegisters = 1; break; } case 110: case 111: case 120: case 121: case 130: case 131: case 140: case 141: case 142: case 143: case 144: case 145: { sc->writeFromRegisters = 0; break; } } return res; } static inline VkFFTResult appendWriteDataVkFFT(VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* floatTypeMemory, 
const char* uintType, uint64_t writeType) { VkFFTResult res = VKFFT_SUCCESS; double double_PI = 3.1415926535897932384626433832795; char vecType[30]; char outputsStruct[20] = ""; char LFending[4] = ""; if (!strcmp(floatType, "float")) sprintf(LFending, "f"); #if(VKFFT_BACKEND==0) if (!strcmp(floatType, "float")) sprintf(vecType, "vec2"); if (!strcmp(floatType, "double")) sprintf(vecType, "dvec2"); if (sc->outputBufferBlockNum == 1) sprintf(outputsStruct, "outputs"); else sprintf(outputsStruct, ".outputs"); if (!strcmp(floatType, "double")) sprintf(LFending, "LF"); char cosDef[20] = "cos"; char sinDef[20] = "sin"; #elif(VKFFT_BACKEND==1) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); sprintf(outputsStruct, "outputs"); if (!strcmp(floatType, "double")) sprintf(LFending, "l"); char cosDef[20] = "__cosf"; char sinDef[20] = "__sinf"; #elif(VKFFT_BACKEND==2) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); sprintf(outputsStruct, "outputs"); if (!strcmp(floatType, "double")) sprintf(LFending, "l"); char cosDef[20] = "__cosf"; char sinDef[20] = "__sinf"; #elif(VKFFT_BACKEND==3) if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); sprintf(outputsStruct, "outputs"); //if (!strcmp(floatType, "double")) sprintf(LFending, "l"); char cosDef[20] = "native_cos"; char sinDef[20] = "native_sin"; #endif char convTypeLeft[20] = ""; char convTypeRight[20] = ""; if ((!strcmp(floatTypeMemory, "half")) && (strcmp(floatType, "half"))) { if ((writeType == 6) || (writeType == 110) || (writeType == 111) || (writeType == 120) || (writeType == 121) || (writeType == 130) || (writeType == 131) || (writeType == 140) || (writeType == 141) || (writeType == 142) || (writeType == 143) || (writeType == 144) || (writeType == 145)) { sprintf(convTypeLeft, "float16_t("); 
sprintf(convTypeRight, ")"); } else { sprintf(convTypeLeft, "f16vec2("); sprintf(convTypeRight, ")"); } } if ((!strcmp(floatTypeMemory, "float")) && (strcmp(floatType, "float"))) { if ((writeType == 6) || (writeType == 110) || (writeType == 111) || (writeType == 120) || (writeType == 121) || (writeType == 130) || (writeType == 131) || (writeType == 140) || (writeType == 141) || (writeType == 142) || (writeType == 143) || (writeType == 144) || (writeType == 145)) { #if(VKFFT_BACKEND==0) sprintf(convTypeLeft, "float("); sprintf(convTypeRight, ")"); #elif(VKFFT_BACKEND==1) sprintf(convTypeLeft, "(float)"); //sprintf(convTypeRight, ""); #elif(VKFFT_BACKEND==2) sprintf(convTypeLeft, "(float)"); //sprintf(convTypeRight, ""); #elif(VKFFT_BACKEND==3) sprintf(convTypeLeft, "(float)"); //sprintf(convTypeRight, ""); #endif } else { #if(VKFFT_BACKEND==0) sprintf(convTypeLeft, "vec2("); sprintf(convTypeRight, ")"); #elif(VKFFT_BACKEND==1) sprintf(convTypeLeft, "conv_float2("); sprintf(convTypeRight, ")"); #elif(VKFFT_BACKEND==2) sprintf(convTypeLeft, "conv_float2("); sprintf(convTypeRight, ")"); #elif(VKFFT_BACKEND==3) sprintf(convTypeLeft, "conv_float2("); sprintf(convTypeRight, ")"); #endif } } if ((!strcmp(floatTypeMemory, "double")) && (strcmp(floatType, "double"))) { if ((writeType == 6) || (writeType == 110) || (writeType == 111) || (writeType == 120) || (writeType == 121) || (writeType == 130) || (writeType == 131) || (writeType == 140) || (writeType == 141) || (writeType == 142) || (writeType == 143) || (writeType == 144) || (writeType == 145)) { #if(VKFFT_BACKEND==0) sprintf(convTypeLeft, "double("); sprintf(convTypeRight, ")"); #elif(VKFFT_BACKEND==1) sprintf(convTypeLeft, "(double)"); //sprintf(convTypeRight, ""); #elif(VKFFT_BACKEND==2) sprintf(convTypeLeft, "(double)"); //sprintf(convTypeRight, ""); #elif(VKFFT_BACKEND==3) sprintf(convTypeLeft, "(double)"); //sprintf(convTypeRight, ""); #endif } else { #if(VKFFT_BACKEND==0) sprintf(convTypeLeft, "dvec2("); 
sprintf(convTypeRight, ")"); #elif(VKFFT_BACKEND==1) sprintf(convTypeLeft, "conv_double2("); sprintf(convTypeRight, ")"); #elif(VKFFT_BACKEND==2) sprintf(convTypeLeft, "conv_double2("); sprintf(convTypeRight, ")"); #elif(VKFFT_BACKEND==3) sprintf(convTypeLeft, "conv_double2("); sprintf(convTypeRight, ")"); #endif } } char index_x[2000] = ""; char index_y[2000] = ""; char requestCoordinate[100] = ""; if (sc->convolutionStep) { if (sc->matrixConvolution > 1) { sprintf(requestCoordinate, "coordinate"); } } char requestBatch[100] = ""; if (sc->convolutionStep) { if (sc->numKernels > 1) { sprintf(requestBatch, "batchID");//if one buffer - multiple kernel convolution } } switch (writeType) { case 0: //single_c2c { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftY[500] = ""; if (sc->axisSwapped) { if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_x); } else { if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_y); } char shiftY2[100] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY "); if (sc->fftDim < sc->fft_dim_full) { if (sc->axisSwapped) { if (!sc->reorderFourStep) { sc->tempLen = sprintf(sc->tempStr, " if((%s+%" PRIu64 "*%s)< numActiveThreads) {\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " if (((%s + %" PRIu64 " * %s) %% %" PRIu64 " + ((%s%s) / %" PRIu64 ")*%" PRIu64 " < %" PRIu64 ")){\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, sc->localSize[0], sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, 
sc->localSize[0], sc->fft_dim_full / sc->firstStageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!sc->reorderFourStep) { res = VkAppendLineFromInput(sc, sc->disableThreadsStart); } else { sc->tempLen = sprintf(sc->tempStr, " if (((%s + %" PRIu64 " * %s) %% %" PRIu64 " + ((%s%s) / %" PRIu64 ")*%" PRIu64 " < %" PRIu64 ")){\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, sc->localSize[1], sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1], sc->fft_dim_full / sc->firstStageStartSize); res = VkAppendLine(sc); } if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " { \n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->reorderFourStep) { if (sc->fftDim == sc->fft_dim_full) { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputStride[0] > 1) sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") * %" PRIu64 " + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->outputStride[0], sc->fftDim, sc->outputStride[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->fftDim, sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if (sc->size[sc->axis_id + 1] % sc->localSize[0] != 
0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, shiftY2, sc->localSize[0], sc->size[sc->axis_id + 1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->size[sc->axis_id + 1] % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, shiftY2, sc->localSize[1], sc->size[sc->axis_id + 1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " if((combinedID %% %" PRIu64 ") < %" PRIu64 "){\n", sc->fft_dim_full, sc->fft_zeropad_Bluestein_left_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->outputStride[1], sc->fft_zeropad_left_write[sc->axis_id], sc->outputStride[1], sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->writeFromRegisters) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s%s%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s%s%s;\n", sc->inoutID, sc->outputBufferBlockSize, 
outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")]%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")]%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride]%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride]%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if (sc->size[sc->axis_id + 1] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " 
}"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->size[sc->axis_id + 1] % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } } else { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " inoutID = combinedID %% %" PRIu64 " + ((%s%s) / %" PRIu64 ")*%" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ")+ ((%s%s) %% %" PRIu64 ") * %" PRIu64 ";\n", sc->localSize[0], sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[0], sc->localSize[0], sc->fft_dim_full / sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fft_dim_full / sc->firstStageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " inoutID = (%s%s)/%" PRIu64 "+ (combinedID * %" PRIu64 ")+ ((%s%s) %% %" PRIu64 ") * %" PRIu64 ";\n", sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fft_dim_full / sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fft_dim_full / sc->firstStageStartSize); else sc->tempLen = sprintf(sc->tempStr, " inoutID = combinedID %% %" PRIu64 " + ((%s%s) / %" PRIu64 ")*%" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ")+ ((%s%s) %% %" PRIu64 ") * %" PRIu64 
";\n", sc->localSize[1], sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1], sc->localSize[1], sc->fft_dim_full / sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fft_dim_full / sc->firstStageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->fft_dim_full, sc->fft_zeropad_left_write[sc->axis_id], sc->fft_dim_full, sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->writeFromRegisters) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s%s%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s%s%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[(combinedID %% %s)+(combinedID/%s)*sharedStride]%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->gl_WorkGroupSize_x, sc->gl_WorkGroupSize_x, convTypeRight); else sc->tempLen = 
sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[(combinedID %% %s)+(combinedID/%s)*sharedStride]%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->gl_WorkGroupSize_x, sc->gl_WorkGroupSize_x, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[(combinedID %% %s)*sharedStride+combinedID/%s]%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->gl_WorkGroupSize_y, sc->gl_WorkGroupSize_y, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[(combinedID %% %s)*sharedStride+combinedID/%s]%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->gl_WorkGroupSize_y, sc->gl_WorkGroupSize_y, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; /* if (sc->outputBufferBlockNum == 1) if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " %s[indexOutput(inoutID%s%s)] = %stemp_%" PRIu64 "%s;\n", requestCoordinate, requestBatch, convTypeLeft, i, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " %s[indexOutput(inoutID%s%s)] = %stemp_%" PRIu64 "%s;\n", requestCoordinate, requestBatch, convTypeLeft, i, convTypeRight); else if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " outputBlocks[indexOutput(inoutID%s%s) / %" PRIu64 "]%s[indexOutput(inoutID%s%s) %% %" PRIu64 "] = %stemp_%" PRIu64 "%s;\n", requestCoordinate, requestBatch, sc->outputBufferBlockSize, outputsStruct, requestCoordinate, requestBatch, sc->outputBufferBlockSize, convTypeLeft, i, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[indexOutput(inoutID%s%s) / %" PRIu64 "]%s[indexOutput(inoutID%s%s) %% %" PRIu64 "] 
= %stemp_%" PRIu64 "%s;\n", requestCoordinate, requestBatch, sc->outputBufferBlockSize, outputsStruct, requestCoordinate, requestBatch, sc->outputBufferBlockSize, convTypeLeft, i, convTypeRight); */ if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } } else { if (sc->fftDim == sc->fft_dim_full) { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputStride[0] > 1) sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") * %" PRIu64 " + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->outputStride[0], sc->fftDim, sc->outputStride[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->fftDim, sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if (sc->size[sc->axis_id + 1] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, sc->localSize[0], sc->size[sc->axis_id + 1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->size[sc->axis_id + 1] % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, 
sc->localSize[1], sc->size[sc->axis_id + 1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " if((combinedID %% %" PRIu64 ") < %" PRIu64 "){\n", sc->fft_dim_full, sc->fft_zeropad_Bluestein_left_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->outputStride[1], sc->fft_zeropad_left_write[sc->axis_id], sc->outputStride[1], sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->writeFromRegisters) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s%s%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s%s%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")]%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->fftDim, sc->fftDim, 
convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")]%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride]%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride]%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if (sc->size[sc->axis_id + 1] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->size[sc->axis_id + 1] % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } } else { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 
";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else { if (!sc->axisSwapped) sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 " * numActiveThreads;\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread)); } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ")+(combinedID / %" PRIu64 ") * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ");", sc->fftDim, sc->fftDim, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[0] * sc->firstStageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " inoutID = %s+%" PRIu64 "+%s * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ");", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0], sc->gl_LocalInvocationID_y, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1] * sc->firstStageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->fft_dim_full, sc->fft_zeropad_left_write[sc->axis_id], sc->fft_dim_full, 
sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " inoutID = indexOutput(%s+i*%" PRIu64 "+%s * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ")%s%s);\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1] * sc->firstStageStartSize, requestCoordinate, requestBatch); res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->writeFromRegisters) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID]=%s%s%s;\n", outputsStruct, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s%s%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID]=%ssdata[%s + sharedStride*(%s + %" PRIu64 ")]%s;\n", outputsStruct, convTypeLeft, sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID 
/ %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %ssdata[%s + sharedStride*(%s + %" PRIu64 ")]%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID]=%ssdata[sharedStride*%s + (%s + %" PRIu64 ")]%s;\n", outputsStruct, convTypeLeft, sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %ssdata[sharedStride*%s + (%s + %" PRIu64 ")]%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; break; } case 1: //grouped_c2c { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); sc->tempLen = sprintf(sc->tempStr, " if (((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")+((%s%s) / %" PRIu64 ") * (%" PRIu64 ") < %" PRIu64 ") {\n", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, 
sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * sc->stageStartSize, sc->fftDim * sc->stageStartSize, sc->size[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if ((sc->reorderFourStep) && (sc->stageStartSize == 1)) { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, " inoutID = (%s + %" PRIu64 ") * (%" PRIu64 ") + (((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")) * (%" PRIu64 ") + ((%s%s) / %" PRIu64 ");\n", sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->fft_dim_full / sc->fftDim, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->firstStageStartSize / sc->fftDim, sc->fft_dim_full / sc->firstStageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * (sc->firstStageStartSize / sc->fftDim)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->fft_dim_full, sc->fft_zeropad_left_write[sc->axis_id], sc->fft_dim_full, sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x); res = indexOutputVkFFT(sc, uintType, writeType, index_x, sc->inoutID, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->writeFromRegisters) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s%s%s;\n", outputsStruct, sc->inoutID, convTypeLeft, 
sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s%s%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[%s*(%s+%" PRIu64 ") + %s]%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[%s*(%s+%" PRIu64 ") + %s]%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } else { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " inoutID = (%s + %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")+((%s%s) / %" PRIu64 ") * (%" PRIu64 ");\n", sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * sc->stageStartSize, sc->stageStartSize * 
sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 ") < %" PRIu64 "){\n", sc->fft_dim_full, sc->fft_zeropad_Bluestein_left_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[1]) { if (!sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " inoutID = (%s + %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")+((%s%s) / %" PRIu64 ") * (%" PRIu64 ");\n", sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * sc->stageStartSize, sc->stageStartSize * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->fft_dim_full, sc->fft_zeropad_left_write[sc->axis_id], sc->fft_dim_full, sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x); sprintf(index_y, "%" PRIu64 " * (%s + %" PRIu64 ") + ((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")+((%s%s) / %" PRIu64 ") * (%" PRIu64 ")", sc->stageStartSize, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * sc->stageStartSize, sc->stageStartSize * sc->fftDim); res = indexOutputVkFFT(sc, uintType, writeType, index_x, index_y, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " inoutID = indexOutput((%s%s) %% (%" PRIu64 "), %" PRIu64 " * (%s + %" PRIu64 ") + ((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")+((%s%s) / %" PRIu64 ") * (%" PRIu64 ")%s%s);\n", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * sc->stageStartSize, sc->stageStartSize * sc->fftDim, requestCoordinate, requestBatch); if (sc->writeFromRegisters) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s%s%s;\n", outputsStruct, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s%s%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %ssdata[%s*(%s+%" PRIu64 ") + %s]%s;\n", outputsStruct, convTypeLeft, sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %ssdata[%s*(%s+%" PRIu64 ") + %s]%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, convTypeRight); res = VkAppendLine(sc); if (res != 
VKFFT_SUCCESS) return res; } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; break; } case 2: //single_c2c_strided { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); sc->tempLen = sprintf(sc->tempStr, " if (((%s%s) / %" PRIu64 ") * (%" PRIu64 ") < %" PRIu64 ") {\n", sc->gl_GlobalInvocationID_x, shiftX, sc->stageStartSize, sc->stageStartSize * sc->fftDim, sc->fft_dim_full); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, " inoutID = (%s%s) %% (%" PRIu64 ") + %" PRIu64 " * (%s + %" PRIu64 ") + ((%s%s) / %" PRIu64 ") * (%" PRIu64 ");\n", sc->gl_GlobalInvocationID_x, shiftX, sc->stageStartSize, sc->stageStartSize, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_GlobalInvocationID_x, shiftX, sc->stageStartSize, sc->stageStartSize * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 ") < %" PRIu64 "){\n", sc->fft_dim_full, sc->fft_zeropad_Bluestein_left_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, 
" if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->fft_dim_full, sc->fft_zeropad_left_write[sc->axis_id], sc->fft_dim_full, sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->writeFromRegisters) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s%s%s;\n", outputsStruct, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s%s%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %ssdata[%s*(%s+%" PRIu64 ") + %s]%s;\n", outputsStruct, convTypeLeft, sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %ssdata[%s*(%s+%" PRIu64 ") + %s]%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, convTypeRight); res = VkAppendLine(sc); if (res != 
VKFFT_SUCCESS) return res; } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; break; } case 5://single_r2c { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_y); if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY "); uint64_t mult = (sc->mergeSequencesR2C) ? 
2 : 1; if (sc->reorderFourStep) { //Not implemented } else { //appendBarrierVkFFT(sc, 1); //appendZeropadStart(sc); if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_zeropad_Bluestein_left_write[sc->axis_id]; for (uint64_t k = 0; k < sc->registerBoost; k++) { if (sc->mergeSequencesR2C) { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s==0)\n\ {\n\ sdata[%s + %" PRIu64 "* sharedStride] = sdata[%s];\n\ }\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, sc->fftDim, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadEnd(sc); //if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "\ if (%s==0)\n\ {\n\ sdata[%s * sharedStride + %" PRIu64 "] = sdata[%s * sharedStride];\n\ }\n", sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, sc->fftDim, sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadEnd(sc); //if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; } } uint64_t num_out = (sc->axisSwapped) ? 
(uint64_t)ceil(mult * (sc->fftDim / 2 + 1) / (double)sc->localSize[1]) : (uint64_t)ceil(mult * (sc->fftDim / 2 + 1) / (double)sc->localSize[0]); //num_out = (uint64_t)ceil(num_out / (double)sc->min_registers_per_thread); for (uint64_t i = 0; i < num_out; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * num_out) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * num_out) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " %s = combinedID %% %" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ");", sc->inoutID, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1, sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " %s = combinedID %% %" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ");", sc->inoutID, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1, sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){", mult * (sc->fftDim / 2 + 1), sc->gl_WorkGroupID_y, sc->localSize[0], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= mult * (sc->fftDim / 2 + 1) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){", mult * (sc->fftDim / 2 + 1) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % 
sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){", mult * (sc->fftDim / 2 + 1), sc->gl_WorkGroupID_y, sc->localSize[1], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= mult * (sc->fftDim / 2 + 1) * sc->localSize[1]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){", mult * (sc->fftDim / 2 + 1) * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->outputStride[1], sc->fft_zeropad_left_write[sc->axis_id], sc->outputStride[1], sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->writeFromRegisters) { //not working yet if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s%s%s;\n", outputsStruct, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s%s%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->mergeSequencesR2C) { if (sc->axisSwapped) { sc->tempLen = 
sprintf(sc->tempStr, "if ( (combinedID / %" PRIu64 ") %% 2 == 0){\n", sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].x+sdata[(%" PRIu64 "-combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].x);\n", sc->regIDs[0], LFending, sc->fftDim / 2 + 1, 2 * (sc->fftDim / 2 + 1), sc->fftDim, sc->fftDim / 2 + 1, 2 * (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(sdata[(combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].y-sdata[(%" PRIu64 "-combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].y);\n", sc->regIDs[0], LFending, sc->fftDim / 2 + 1, 2 * (sc->fftDim / 2 + 1), sc->fftDim, sc->fftDim / 2 + 1, 2 * (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].y+sdata[(%" PRIu64 "-combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].y);\n", sc->regIDs[0], LFending, sc->fftDim / 2 + 1, 2 * (sc->fftDim / 2 + 1), sc->fftDim, sc->fftDim / 2 + 1, 2 * (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(-sdata[(combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].x+sdata[(%" PRIu64 "-combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].x);\n", sc->regIDs[0], LFending, sc->fftDim / 2 + 1, 2 * (sc->fftDim / 2 + 1), sc->fftDim, sc->fftDim / 2 + 1, 2 * (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}\n"); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s%s%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[0], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s%s%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[0], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "if ( (combinedID / %" PRIu64 ") %% 2 == 0){\n", sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x+sdata[(%" PRIu64 "-combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x);\n", sc->regIDs[0], LFending, sc->fftDim / 2 + 1, 2 * (sc->fftDim / 2 + 1), sc->fftDim, sc->fftDim / 2 + 1, 2 * (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y-sdata[(%" PRIu64 "-combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y);\n", sc->regIDs[0], LFending, sc->fftDim / 2 + 1, 2 * (sc->fftDim / 2 + 1), sc->fftDim, sc->fftDim / 2 + 1, 2 * (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y+sdata[(%" PRIu64 "-combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y);\n", sc->regIDs[0], LFending, sc->fftDim / 2 + 1, 2 * (sc->fftDim / 2 + 1), sc->fftDim, sc->fftDim / 2 + 1, 2 
* (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(-sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x+sdata[(%" PRIu64 "-combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x);\n", sc->regIDs[0], LFending, sc->fftDim / 2 + 1, 2 * (sc->fftDim / 2 + 1), sc->fftDim, sc->fftDim / 2 + 1, 2 * (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s%s%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[0], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s%s%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[0], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!sc->axisSwapped) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %ssdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride]%s;\n", outputsStruct, convTypeLeft, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %ssdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride]%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %ssdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")]%s;\n", outputsStruct, convTypeLeft, 
sc->fftDim / 2 + 1, sc->fftDim / 2 + 1, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %ssdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")]%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= mult * (sc->fftDim / 2 + 1) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= mult * (sc->fftDim / 2 + 1) * sc->localSize[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_dim_full; } else { } /*sc->tempLen = sprintf(sc->tempStr, "\ if (%s==%" PRIu64 ") \n\ {\n", sc->gl_LocalInvocationID_x, sc->localSize[0] - 1); sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); sprintf(index_x, "%" PRIu64 "", sc->fftDim / 2); sprintf(index_y, "%s%s", sc->gl_GlobalInvocationID_y, shiftY); indexInputVkFFT(sc, uintType, writeType, index_x, index_y, requestCoordinate, 
requestBatch); sc->tempLen = sprintf(sc->tempStr, ";\n"); //sc->tempLen = sprintf(sc->tempStr, " inoutID = indexInput(2 * (%s%s), %" PRIu64 ");\n", sc->gl_GlobalInvocationID_y, shiftY, sc->inputStride[2] / (sc->inputStride[1] + 2)); if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID]=%ssdata[(%" PRIu64 " + %s * sharedStride)]%s;\n", outputsStruct, convTypeLeft,sc->fftDim / 2, sc->gl_LocalInvocationID_y, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]=%ssdata[(%" PRIu64 " + %s * sharedStride)]%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->fftDim / 2, sc->gl_LocalInvocationID_y, convTypeRight); VkAppendLine(sc, " }\n");*/ } break; } case 6: //single_c2r { char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY * %" PRIu64 "", sc->localSize[1]); if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } uint64_t mult = (sc->mergeSequencesR2C) ? 
2 : 1; //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; if (sc->reorderFourStep) { //Not implemented } else { if (sc->fftDim == sc->fft_dim_full) { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputStride[0] > 1) sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") * %" PRIu64 " + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->outputStride[0], sc->fftDim, mult * sc->outputStride[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->fftDim, mult * sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, shiftY, sc->localSize[0], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, shiftY, sc->localSize[1], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if 
(sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " if((combinedID %% %" PRIu64 ") < %" PRIu64 "){\n", sc->fft_dim_full, sc->fft_zeropad_Bluestein_left_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->outputStride[1], sc->fft_zeropad_left_write[sc->axis_id], sc->outputStride[1], sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->writeFromRegisters) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s%s.x%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[i], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s%s.x%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " %s = %s + %" PRIu64 ";", sc->inoutID, sc->inoutID, sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s%s.y%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[i], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" 
PRIu64 "] = %s%s.y%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->axisSwapped) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].x%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].x%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " %s = %s + %" PRIu64 ";", sc->inoutID, sc->inoutID, sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[(combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].y%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[(combinedID %% %" PRIu64 ") * sharedStride+ (combinedID / %" PRIu64 ")].y%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->fftDim, 
sc->fftDim, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " %s = %s + %" PRIu64 ";", sc->inoutID, sc->inoutID, sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != 
VKFFT_SUCCESS) return res; } } } } } else { } } break; } case 110://DCT-I nonstrided { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_y); char shiftY2[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY2, " + consts.workGroupShiftY "); uint64_t mult = (sc->mergeSequencesR2C) ? 2 : 1; if (sc->reorderFourStep) { //Not implemented } else { //appendBarrierVkFFT(sc, 1); //appendZeropadStart(sc); if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_zeropad_Bluestein_left_write[sc->axis_id]; sc->fftDim = (sc->fftDim + 2) / 2; for (uint64_t k = 0; k < sc->registerBoost; k++) { if (sc->mergeSequencesR2C) { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s==0)\n\ {\n\ sdata[%s + %" PRIu64 "* sharedStride] = sdata[%s];\n\ }\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, sc->fftDim, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadEnd(sc); //if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "\ if (%s==0)\n\ {\n\ sdata[%s * sharedStride + %" PRIu64 "] = sdata[%s * sharedStride];\n\ }\n", sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, sc->fftDim, sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadEnd(sc); //if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadStart(sc); //if 
(res != VKFFT_SUCCESS) return res; } } uint64_t num_out = (sc->axisSwapped) ? (uint64_t)ceil((sc->fftDim) / (double)sc->localSize[1]) : (uint64_t)ceil((sc->fftDim) / (double)sc->localSize[0]); //num_out = (uint64_t)ceil(num_out / (double)sc->min_registers_per_thread); for (uint64_t i = 0; i < num_out; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * num_out) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * num_out) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " %s = combinedID %% %" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->fftDim, sc->fftDim, mult * sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " %s = combinedID %% %" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->fftDim, sc->fftDim, mult * sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){\n", (sc->fftDim), sc->gl_WorkGroupID_y, sc->localSize[0], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] 
!= 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){\n", (sc->fftDim), sc->gl_WorkGroupID_y, sc->localSize[1], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim) * sc->localSize[1]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim) * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->outputStride[1], sc->fft_zeropad_left_write[sc->axis_id], sc->outputStride[1], sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " %s = (sdata[(combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")]);\n", sc->regIDs[0], sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s(%s.x)%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[0], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s(%s.x)%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[0], convTypeRight); res 
= VkAppendLine(sc);
                            if (res != VKFFT_SUCCESS) return res;
                            if (sc->outputBufferBlockNum == 1)
                                sc->tempLen = sprintf(sc->tempStr, " %s[%s+%" PRIu64 "] = %s(%s.y)%s;\n", outputsStruct, sc->inoutID, sc->outputStride[1], convTypeLeft, sc->regIDs[0], convTypeRight);
                            else
                                sc->tempLen = sprintf(sc->tempStr, " outputBlocks[(%s + %" PRIu64 ")/ %" PRIu64 "]%s[(%s+%" PRIu64 ") %% %" PRIu64 "] = %s(%s.y)%s;\n", sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[0], convTypeRight);
                            res = VkAppendLine(sc);
                            if (res != VKFFT_SUCCESS) return res;
                            if (sc->zeropad[1]) {
                                sc->tempLen = sprintf(sc->tempStr, " }\n");
                                res = VkAppendLine(sc);
                                if (res != VKFFT_SUCCESS) return res;
                            }
                        }
                        else {
                            sc->tempLen = sprintf(sc->tempStr, " %s = (sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride]);\n", sc->regIDs[0], sc->fftDim, sc->fftDim);
                            res = VkAppendLine(sc);
                            if (res != VKFFT_SUCCESS) return res;
                            if (sc->outputBufferBlockNum == 1)
                                sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s(%s.x)%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[0], convTypeRight);
                            else
                                sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s(%s.x)%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[0], convTypeRight);
                            res = VkAppendLine(sc);
                            if (res != VKFFT_SUCCESS) return res;
                            if (sc->outputBufferBlockNum == 1)
                                sc->tempLen = sprintf(sc->tempStr, " %s[%s+%" PRIu64 "] = %s(%s.y)%s;\n", outputsStruct, sc->inoutID, sc->outputStride[1], convTypeLeft, sc->regIDs[0], convTypeRight);
                            else
                                sc->tempLen = sprintf(sc->tempStr, " outputBlocks[(%s + %" PRIu64 ")/ %" PRIu64 "]%s[(%s+%" PRIu64 ") %% %" PRIu64 "] = %s(%s.y)%s;\n", sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, convTypeLeft,
sc->regIDs[0], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } else { if (!sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s(sdata[sdataID].x)%s;\n", outputsStruct, convTypeLeft, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s(sdata[sdataID].x) %s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s(sdata[sdataID].x)%s;\n", outputsStruct, convTypeLeft, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s(sdata[sdataID].x)%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if ((1 + i + k * num_out) * 
sc->localSize[0] * sc->localSize[1] >= (sc->fftDim) * sc->localSize[0]) {
                            sc->tempLen = sprintf(sc->tempStr, " }\n");
                            res = VkAppendLine(sc);
                            if (res != VKFFT_SUCCESS) return res;
                        }
                    }
                    else {
                        if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim) * sc->localSize[1]) {
                            sc->tempLen = sprintf(sc->tempStr, " }\n");
                            res = VkAppendLine(sc);
                            if (res != VKFFT_SUCCESS) return res;
                        }
                    }
                    if (sc->axisSwapped) {
                        if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) {
                            sc->tempLen = sprintf(sc->tempStr, " }\n");
                            res = VkAppendLine(sc);
                            if (res != VKFFT_SUCCESS) return res;
                        }
                    }
                    else {
                        if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) {
                            sc->tempLen = sprintf(sc->tempStr, " }\n");
                            res = VkAppendLine(sc);
                            if (res != VKFFT_SUCCESS) return res;
                        }
                    }
                }
            }
            sc->fftDim = 2 * sc->fftDim - 2;
            if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_dim_full;
        }
        else {
        }
    }
    break;
}
case 111://DCT-I strided
{
    if (!sc->writeFromRegisters) {
        res = appendBarrierVkFFT(sc, 1);
        if (res != VKFFT_SUCCESS) return res;
    }
    //res = appendZeropadStart(sc);
    //if (res != VKFFT_SUCCESS) return res;
    char shiftX[500] = "";
    if (sc->performWorkGroupShift[0])
        sprintf(shiftX, " + consts.workGroupShiftX*%s ", sc->gl_WorkGroupSize_x);
    char shiftY[500] = "";
    if (sc->performWorkGroupShift[1])
        sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_y);
    char shiftY2[500] = "";
    if (sc->performWorkGroupShift[1])
        sprintf(shiftY2, " + consts.workGroupShiftY ");
    uint64_t mult = (sc->mergeSequencesR2C) ?
2 : 1; if (sc->reorderFourStep) { //Not implemented } else { //appendBarrierVkFFT(sc, 1); //appendZeropadStart(sc); if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_zeropad_Bluestein_left_write[sc->axis_id]; sc->fftDim = (sc->fftDim + 2) / 2; for (uint64_t k = 0; k < sc->registerBoost; k++) { if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s==0)\n\ {\n\ sdata[%s + %" PRIu64 "* sharedStride] = sdata[%s];\n\ }\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, sc->fftDim, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadEnd(sc); //if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; } uint64_t num_out = (uint64_t)ceil(mult * (sc->fftDim) / (double)sc->localSize[1]); //num_out = (uint64_t)ceil(num_out / (double)sc->min_registers_per_thread); for (uint64_t i = 0; i < num_out; i++) { sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * num_out) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = %s%s + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->gl_GlobalInvocationID_x, shiftX, sc->localSize[0], sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->size[0] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID %% %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){\n", sc->localSize[0], sc->gl_WorkGroupID_x, sc->localSize[0], sc->size[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim) * sc->localSize[0]) { sc->tempLen = 
sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if(((combinedID/%" PRIu64 ") %% %" PRIu64 " < %" PRIu64 ")||((combinedID/%" PRIu64 ") %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->localSize[0], sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], sc->localSize[0], sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].x+sdata[(%" PRIu64 "-combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].x);\n", sc->regIDs[0], LFending, sc->localSize[0], sc->localSize[0], sc->fftDim, sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(sdata[(combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].y-sdata[(%" PRIu64 "-combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].y);\n", sc->regIDs[0], LFending, sc->localSize[0], sc->localSize[0], sc->fftDim, sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); 
else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].y+sdata[(%" PRIu64 "-combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].y);\n", sc->regIDs[1], LFending, sc->localSize[0], sc->localSize[0], sc->fftDim, sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(-sdata[(combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].x+sdata[(%" PRIu64 "-combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].x);\n", sc->regIDs[1], LFending, sc->localSize[0], sc->localSize[0], sc->fftDim, sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s+%" PRIu64 "] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", outputsStruct, sc->inoutID, sc->outputStride[1], convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[(%s %" PRIu64 ")/ %" PRIu64 "]%s[(%s+%" PRIu64 ") %% %" PRIu64 "] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " > 0){\n", sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " 
%s = (%" PRIu64 " - combinedID / %" PRIu64 ") + %s%s * %" PRIu64 ";\n", sc->inoutID, sc->fftDim, sc->localSize[0], sc->gl_GlobalInvocationID_x, shiftX, 2 * sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s+%" PRIu64 "] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", outputsStruct, sc->inoutID, sc->outputStride[1], convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[(%s %" PRIu64 ")/ %" PRIu64 "]%s[(%s+%" PRIu64 ") %% %" PRIu64 "] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s(sdata[sdataID].x)%s;\n", outputsStruct, convTypeLeft, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " 
outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s(sdata[sdataID].x)%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->size[0] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } sc->fftDim = 2 * sc->fftDim - 2; if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_dim_full; } else { } } break; } case 120://DCT-II nonstrided { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_y); char shiftY2[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY2, " + consts.workGroupShiftY "); uint64_t mult = (sc->mergeSequencesR2C) ? 
2 : 1;
    if (sc->reorderFourStep) {
        //Not implemented
    }
    else {
        //appendBarrierVkFFT(sc, 1);
        //appendZeropadStart(sc);
        if (sc->fftDim == sc->fft_dim_full) {
            if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_zeropad_Bluestein_left_write[sc->axis_id];
            for (uint64_t k = 0; k < sc->registerBoost; k++) {
                if (sc->mergeSequencesR2C) {
                    if (sc->axisSwapped) {
                        sc->tempLen = sprintf(sc->tempStr, "\
    if (%s==0)\n\
    {\n\
        sdata[%s + %" PRIu64 "* sharedStride] = sdata[%s];\n\
    }\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, sc->fftDim, sc->gl_LocalInvocationID_x);
                        res = VkAppendLine(sc);
                        if (res != VKFFT_SUCCESS) return res;
                        //res = appendZeropadEnd(sc);
                        //if (res != VKFFT_SUCCESS) return res;
                        res = appendBarrierVkFFT(sc, 1);
                        if (res != VKFFT_SUCCESS) return res;
                        //res = appendZeropadStart(sc);
                        //if (res != VKFFT_SUCCESS) return res;
                    }
                    else {
                        sc->tempLen = sprintf(sc->tempStr, "\
    if (%s==0)\n\
    {\n\
        sdata[%s * sharedStride + %" PRIu64 "] = sdata[%s * sharedStride];\n\
    }\n", sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, sc->fftDim, sc->gl_LocalInvocationID_y);
                        res = VkAppendLine(sc);
                        if (res != VKFFT_SUCCESS) return res;
                        //res = appendZeropadEnd(sc);
                        //if (res != VKFFT_SUCCESS) return res;
                        res = appendBarrierVkFFT(sc, 1);
                        if (res != VKFFT_SUCCESS) return res;
                        //res = appendZeropadStart(sc);
                        //if (res != VKFFT_SUCCESS) return res;
                    }
                }
                uint64_t num_out = (sc->axisSwapped) ?
(uint64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[1]) : (uint64_t)ceil((sc->fftDim / 2 + 1) / (double)sc->localSize[0]); //num_out = (uint64_t)ceil(num_out / (double)sc->min_registers_per_thread); for (uint64_t i = 0; i < num_out; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * num_out) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * num_out) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " %s = combinedID %% %" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1, mult * sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " %s = combinedID %% %" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1, mult * sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){\n", (sc->fftDim / 2 + 1), sc->gl_WorkGroupID_y, sc->localSize[0], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim / 2 + 1) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim / 2 + 1) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { 
sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){\n", (sc->fftDim / 2 + 1), sc->gl_WorkGroupID_y, sc->localSize[1], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim / 2 + 1) * sc->localSize[1]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim / 2 + 1) * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->outputStride[1], sc->fft_zeropad_left_write[sc->axis_id], sc->outputStride[1], sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " mult = twiddleLUT[%" PRIu64 " + combinedID %% %" PRIu64 "];\n", sc->startDCT3LUT, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.x = 2*mult.x;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = -2*mult.y;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " mult.x = 2*%s(%.17e%s * (combinedID %% %" PRIu64 ") );\n", cosDef, -double_PI / 2 / sc->fftDim, LFending, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) 
return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = 2*%s(%.17e%s * (combinedID %% %" PRIu64 ") );\n", sinDef, -double_PI / 2 / sc->fftDim, LFending, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->mergeSequencesR2C) { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].x+sdata[(%" PRIu64 "-combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].x);\n", sc->regIDs[0], LFending, sc->fftDim / 2 + 1, (sc->fftDim / 2 + 1), sc->fftDim, sc->fftDim / 2 + 1, (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(sdata[(combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].y-sdata[(%" PRIu64 "-combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].y);\n", sc->regIDs[0], LFending, sc->fftDim / 2 + 1, (sc->fftDim / 2 + 1), sc->fftDim, sc->fftDim / 2 + 1, (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].y+sdata[(%" PRIu64 "-combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].y);\n", sc->regIDs[1], LFending, sc->fftDim / 2 + 1, (sc->fftDim / 2 + 1), sc->fftDim, sc->fftDim / 2 + 1, 
(sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(-sdata[(combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].x+sdata[(%" PRIu64 "-combinedID %% %" PRIu64 ")* sharedStride + (combinedID / %" PRIu64 ")].x);\n", sc->regIDs[1], LFending, sc->fftDim / 2 + 1, (sc->fftDim / 2 + 1), sc->fftDim, sc->fftDim / 2 + 1, (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s+%" PRIu64 "] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", outputsStruct, sc->inoutID, sc->outputStride[1], convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[(%s %" PRIu64 ")/ %" PRIu64 "]%s[(%s+%" PRIu64 ") %% %" PRIu64 "] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if(combinedID %% %" PRIu64 " > 0){\n", sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = (%" PRIu64 " - combinedID %% %" PRIu64 ") + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->fftDim, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1, mult * sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); sc->tempLen = sprintf(sc->tempStr, ";\n"); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->outputStride[1], sc->fft_zeropad_left_write[sc->axis_id], sc->outputStride[1], sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s+%" PRIu64 "] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", outputsStruct, sc->inoutID, sc->outputStride[1], convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[(%s + %" PRIu64 ")/ %" PRIu64 "]%s[(%s+%" PRIu64 ") %% %" PRIu64 "] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x+sdata[(%" PRIu64 
"-combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x);\n", sc->regIDs[0], LFending, sc->fftDim / 2 + 1, (sc->fftDim / 2 + 1), sc->fftDim, sc->fftDim / 2 + 1, (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y-sdata[(%" PRIu64 "-combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y);\n", sc->regIDs[0], LFending, sc->fftDim / 2 + 1, (sc->fftDim / 2 + 1), sc->fftDim, sc->fftDim / 2 + 1, (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y+sdata[(%" PRIu64 "-combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y);\n", sc->regIDs[1], LFending, sc->fftDim / 2 + 1, (sc->fftDim / 2 + 1), sc->fftDim, sc->fftDim / 2 + 1, (sc->fftDim / 2 + 1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(-sdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x+sdata[(%" PRIu64 "-combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x);\n", sc->regIDs[1], LFending, sc->fftDim / 2 + 1, (sc->fftDim / 2 + 1), sc->fftDim, sc->fftDim / 2 + 1, (sc->fftDim / 2 + 
1)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s+%" PRIu64 "] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", outputsStruct, sc->inoutID, sc->outputStride[1], convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[(%s + %" PRIu64 ")/ %" PRIu64 "]%s[(%s+%" PRIu64 ") %% %" PRIu64 "] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if(combinedID %% %" PRIu64 " > 0){\n", sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = (%" PRIu64 " - combinedID %% %" PRIu64 ") + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->fftDim, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1, 2 * sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->outputStride[1], sc->fft_zeropad_left_write[sc->axis_id], sc->outputStride[1], sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->outputBufferBlockNum == 1) 
sc->tempLen = sprintf(sc->tempStr, " %s[%s] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s+%" PRIu64 "] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", outputsStruct, sc->inoutID, sc->outputStride[1], convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[(%s + %" PRIu64 ")/ %" PRIu64 "]%s[(%s+%" PRIu64 ") %% %" PRIu64 "] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim / 2 + 1, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s(sdata[sdataID].x*mult.x - sdata[sdataID].y*mult.y)%s;\n", outputsStruct, convTypeLeft, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s(sdata[sdataID].x*mult.x - 
sdata[sdataID].y*mult.y) %s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if(combinedID %% %" PRIu64 " > 0){\n", sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = (%" PRIu64 " - combinedID %% %" PRIu64 ") + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->fftDim, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1, sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->outputStride[1], sc->fft_zeropad_left_write[sc->axis_id], sc->outputStride[1], sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = -%s(sdata[sdataID].y*mult.x + sdata[sdataID].x*mult.y)%s;\n", outputsStruct, convTypeLeft, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = -%s(sdata[sdataID].y*mult.x + sdata[sdataID].x*mult.y) %s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = 
sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim / 2 + 1, sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s(sdata[sdataID].x*mult.x -sdata[sdataID].y*mult.y)%s;\n", outputsStruct, convTypeLeft, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s(sdata[sdataID].x*mult.x - sdata[sdataID].y*mult.y)%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if(combinedID %% %" PRIu64 " > 0){\n", sc->fftDim / 2 + 1); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = (%" PRIu64 " - combinedID %% %" PRIu64 ") + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->fftDim, sc->fftDim / 2 + 1, sc->fftDim / 2 + 1, sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", 
sc->outputStride[1], sc->fft_zeropad_left_write[sc->axis_id], sc->outputStride[1], sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = -%s(sdata[sdataID].y*mult.x +sdata[sdataID].x*mult.y)%s;\n", outputsStruct, convTypeLeft, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = -%s(sdata[sdataID].y*mult.x + sdata[sdataID].x*mult.y)%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim / 2 + 1) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim / 2 + 1) * sc->localSize[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_dim_full; } else { } } break; } case 
121://DCT-II strided { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX*%s ", sc->gl_WorkGroupSize_x); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_y); char shiftY2[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY2, " + consts.workGroupShiftY "); uint64_t mult = (sc->mergeSequencesR2C) ? 2 : 1; if (sc->reorderFourStep) { //Not implemented } else { //appendBarrierVkFFT(sc, 1); //appendZeropadStart(sc); if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_zeropad_Bluestein_left_write[sc->axis_id]; for (uint64_t k = 0; k < sc->registerBoost; k++) { if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s==0)\n\ {\n\ sdata[%s + %" PRIu64 "* sharedStride] = sdata[%s];\n\ }\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, sc->fftDim, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadEnd(sc); //if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; } uint64_t num_out = (uint64_t)ceil(mult * (sc->fftDim / 2 + 1) / (double)sc->localSize[1]); //num_out = (uint64_t)ceil(num_out / (double)sc->min_registers_per_thread); for (uint64_t i = 0; i < num_out; i++) { sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * num_out) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = %s%s + ((combinedID/%" PRIu64 ") * 
%" PRIu64 ");\n", sc->inoutID, sc->gl_GlobalInvocationID_x, shiftX, sc->localSize[0], sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->size[0] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID %% %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){\n", sc->localSize[0], sc->gl_WorkGroupID_x, sc->localSize[0], sc->size[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim / 2 + 1) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim / 2 + 1) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if(((combinedID/%" PRIu64 ") %% %" PRIu64 " < %" PRIu64 ")||((combinedID/%" PRIu64 ") %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->localSize[0], sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], sc->localSize[0], sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " mult = twiddleLUT[%" PRIu64 " + combinedID / %" PRIu64 "];\n", sc->startDCT3LUT, sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.x = 2*mult.x;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = -2*mult.y;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) 
return res; } else { sc->tempLen = sprintf(sc->tempStr, " mult.x = 2*%s(%.17e%s * (combinedID / %" PRIu64 ") );\n", cosDef, -double_PI / 2 / sc->fftDim, LFending, sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = 2*%s(%.17e%s * (combinedID / %" PRIu64 ") );\n", sinDef, -double_PI / 2 / sc->fftDim, LFending, sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].x+sdata[(%" PRIu64 "-combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].x);\n", sc->regIDs[0], LFending, sc->localSize[0], sc->localSize[0], sc->fftDim, sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(sdata[(combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].y-sdata[(%" PRIu64 "-combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].y);\n", sc->regIDs[0], LFending, sc->localSize[0], sc->localSize[0], sc->fftDim, sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].y+sdata[(%" PRIu64 
"-combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].y);\n", sc->regIDs[1], LFending, sc->localSize[0], sc->localSize[0], sc->fftDim, sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(-sdata[(combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].x+sdata[(%" PRIu64 "-combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].x);\n", sc->regIDs[1], LFending, sc->localSize[0], sc->localSize[0], sc->fftDim, sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s+%" PRIu64 "] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", outputsStruct, sc->inoutID, sc->outputStride[1], convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[(%s + %" PRIu64 ")/ %" PRIu64 "]%s[(%s+%" PRIu64 ") %% %" PRIu64 "] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " > 0){\n", sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = (%" PRIu64 " - combinedID / %" PRIu64 ") + %s%s * %" PRIu64 ";\n", sc->inoutID, sc->fftDim, sc->localSize[0], sc->gl_GlobalInvocationID_x, shiftX, 2 * sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " 
outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s+%" PRIu64 "] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", outputsStruct, sc->inoutID, sc->outputStride[1], convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[(%s + %" PRIu64 ")/ %" PRIu64 "]%s[(%s+%" PRIu64 ") %% %" PRIu64 "] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s(sdata[sdataID].x*mult.x -sdata[sdataID].y*mult.y)%s;\n", outputsStruct, convTypeLeft, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s(sdata[sdataID].x*mult.x - sdata[sdataID].y*mult.y)%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " 
if((combinedID/ %" PRIu64 ")> 0){\n", sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = (%" PRIu64 " - combinedID / %" PRIu64 ") * %" PRIu64 " + %s%s;\n", sc->inoutID, sc->fftDim, sc->localSize[0], sc->outputStride[1], sc->gl_GlobalInvocationID_x, shiftX); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if(( (%" PRIu64 " - combinedID / %" PRIu64 ") %% %" PRIu64 " < %" PRIu64 ")||( (%" PRIu64 " - combinedID / %" PRIu64 ") %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->fftDim, sc->localSize[0], sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], sc->fftDim, sc->localSize[0], sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = -%s(sdata[sdataID].y*mult.x +sdata[sdataID].x*mult.y)%s;\n", outputsStruct, convTypeLeft, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = -%s(sdata[sdataID].y*mult.x + sdata[sdataID].x*mult.y)%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadEndReadWriteStage(sc); if (res != 
VKFFT_SUCCESS) return res; if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim / 2 + 1) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->size[0] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_dim_full; } else { } } break; } case 130://DCT-III nonstrided { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_y); char shiftY2[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY2, " + consts.workGroupShiftY "); uint64_t mult = (sc->mergeSequencesR2C) ? 
2 : 1; if (sc->reorderFourStep) { //Not implemented } else { //appendBarrierVkFFT(sc, 1); //appendZeropadStart(sc); if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_zeropad_Bluestein_left_write[sc->axis_id]; uint64_t maxBluesteinCutOff = 1; if (sc->zeropadBluestein[1]) { if (sc->axisSwapped) maxBluesteinCutOff = sc->fftDim * sc->localSize[0]; else maxBluesteinCutOff = sc->fftDim * sc->localSize[1]; } for (uint64_t k = 0; k < sc->registerBoost; k++) { //num_out = (uint64_t)ceil(num_out / (double)sc->min_registers_per_thread); for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", maxBluesteinCutOff); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (!sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " %s = combinedID %% %" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->fftDim, sc->fftDim, mult * sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " %s = combinedID %% %" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->fftDim, sc->fftDim, mult * sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" 
PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){\n", sc->fftDim, sc->gl_WorkGroupID_y, sc->localSize[0], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){\n", sc->fftDim, sc->gl_WorkGroupID_y, sc->localSize[1], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->outputStride[1], sc->fft_zeropad_left_write[sc->axis_id], sc->outputStride[1], sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) + (combinedID / %" PRIu64 ")* sharedStride;\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); 
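// Reading aid for the sdataID expression formatted just above. This is an interpretation of the format string, not upstream documentation.
// With N = sc->fftDim and n = combinedID % N, the generated index within one sequence appears to be:
//   n even: n / 2
//   n odd : (N - 1) - n / 2   (integer division)
// so consecutive output elements alternate between the front and the back of the shared-memory sequence.
// Worked example, assuming N = 8: n = 0,1,2,3,4,5,6,7 reads slots 0,7,1,6,2,5,3,4.
// The (combinedID / N) * sharedStride term then selects which sequence of the batch is read.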
res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s(sdata[sdataID].x)%s;\n", outputsStruct, sc->inoutID, convTypeLeft, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s(sdata[sdataID].x)%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = %s + %" PRIu64 ";\n", sc->inoutID, sc->inoutID, sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s(sdata[sdataID].y)%s;\n", outputsStruct, sc->inoutID, convTypeLeft, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s(sdata[sdataID].y)%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (!sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) + (combinedID / %" PRIu64 ") * sharedStride;\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s(sdata[sdataID].x)%s;\n", outputsStruct, convTypeLeft, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s(sdata[sdataID].x)%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, convTypeRight); res = VkAppendLine(sc); if 
(res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s(sdata[sdataID].x)%s;\n", outputsStruct, convTypeLeft, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s(sdata[sdataID].x)%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_dim_full; } else { } } break; } case 131://DCT-III strided { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX*%s ", 
sc->gl_WorkGroupSize_x); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_y); char shiftY2[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY2, " + consts.workGroupShiftY "); //uint64_t mult = (sc->mergeSequencesR2C) ? 2 : 1; if (sc->reorderFourStep) { //Not implemented } else { //appendBarrierVkFFT(sc, 1); //appendZeropadStart(sc); if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_zeropad_Bluestein_left_write[sc->axis_id]; for (uint64_t k = 0; k < sc->registerBoost; k++) { if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s==0)\n\ {\n\ sdata[%s + %" PRIu64 "* sharedStride] = sdata[%s];\n\ }\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, sc->fftDim, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadEnd(sc); //if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; } //num_out = (uint64_t)ceil(num_out / (double)sc->min_registers_per_thread); for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = %s%s + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->gl_GlobalInvocationID_x, shiftX, sc->localSize[0], sc->outputStride[1]); res = VkAppendLine(sc); if (res != 
VKFFT_SUCCESS) return res; if (sc->size[0] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID %% %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){\n", sc->localSize[0], sc->gl_WorkGroupID_x, sc->localSize[0], sc->size[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } /*if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= mult * (sc->fftDim / 2 + 1) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", mult * (sc->fftDim / 2 + 1) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; }*/ if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if(((combinedID/%" PRIu64 ") %% %" PRIu64 " < %" PRIu64 ")||((combinedID/%" PRIu64 ") %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->localSize[0], sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], sc->localSize[0], sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].x+sdata[(%" PRIu64 "-combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].x);\n", sc->regIDs[0], LFending, sc->localSize[0], sc->localSize[0], sc->fftDim, sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(sdata[(combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].y-sdata[(%" PRIu64
"-combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].y);\n", sc->regIDs[0], LFending, sc->localSize[0], sc->localSize[0], sc->fftDim, sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].y+sdata[(%" PRIu64 "-combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].y);\n", sc->regIDs[1], LFending, sc->localSize[0], sc->localSize[0], sc->fftDim, sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(-sdata[(combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].x+sdata[(%" PRIu64 "-combinedID / %" PRIu64 ")* sharedStride + (combinedID %% %" PRIu64 ")].x);\n", sc->regIDs[1], LFending, sc->localSize[0], sc->localSize[0], sc->fftDim, sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s+%" PRIu64 "] = %s(%s.x*mult.x-%s.y*mult.y)%s;\n", outputsStruct, sc->inoutID, sc->outputStride[1], convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[(%s+%" PRIu64 ")/ %" PRIu64 "]%s[(%s+%" PRIu64 ") %% %" PRIu64 "] = 
%s(%s.x*mult.x-%s.y*mult.y)%s;\n", sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " > 0){\n", sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = (%" PRIu64 " - combinedID / %" PRIu64 ") + %s%s * %" PRIu64 ";\n", sc->inoutID, sc->fftDim, sc->localSize[0], sc->gl_GlobalInvocationID_x, shiftX, 2 * sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[0], sc->regIDs[0], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s+%" PRIu64 "] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", outputsStruct, sc->inoutID, sc->outputStride[1], convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[(%s+%" PRIu64 ")/ %" PRIu64 "]%s[(%s+%" PRIu64 ") %% %" PRIu64 "] = -%s(%s.y*mult.x+%s.x*mult.y)%s;\n", sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputStride[1], sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[1], sc->regIDs[1], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n"); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID / %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID / %" PRIu64 ") %% 2)) * ((combinedID / %" PRIu64 ")/2)) * sharedStride + (combinedID %% %" PRIu64 ");\n", sc->localSize[0], sc->fftDim - 1, sc->localSize[0], sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s(sdata[sdataID].x)%s;\n", outputsStruct, convTypeLeft, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s(sdata[sdataID].x)%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } /*if ((1 + i + k * num_out) * sc->localSize[0] * sc->localSize[1] >= mult * (sc->fftDim / 2 + 1) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; }*/ if (sc->size[0] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_dim_full; } else { } } break; } case 140: //DCT-IV nonstrided as 8N DFT { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + 
consts.workGroupShiftX "); char shiftY[500] = ""; if (sc->axisSwapped) { if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_x); } else { if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_y); } char shiftY2[100] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY2, " + consts.workGroupShiftY "); if (sc->fftDim < sc->fft_dim_full) { if (sc->axisSwapped) { if (!sc->reorderFourStep) { sc->tempLen = sprintf(sc->tempStr, " if((%s+%" PRIu64 "*%s)< numActiveThreads) {\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " if (((%s + %" PRIu64 " * %s) %% %" PRIu64 " + ((%s%s) / %" PRIu64 ")*%" PRIu64 " < %" PRIu64 ")){\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, sc->localSize[0], sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[0], sc->fft_dim_full / sc->firstStageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " if (((%s + %" PRIu64 " * %s) %% %" PRIu64 " + ((%s%s) / %" PRIu64 ")*%" PRIu64 " < %" PRIu64 ")){\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, sc->localSize[1], sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1], sc->fft_dim_full / sc->firstStageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { sc->tempLen = sprintf(sc->tempStr, " { \n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } //if (sc->reorderFourStep) { if (sc->fftDim == sc->fft_dim_full) { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < (uint64_t)ceil(sc->min_registers_per_thread / 8.0); i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64
";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputStride[0] > 1) sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") * %" PRIu64 " + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim / 8, sc->outputStride[0], sc->fftDim / 8, sc->outputStride[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim / 8, sc->fftDim / 8, sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if (sc->size[sc->axis_id + 1] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim / 8, sc->gl_WorkGroupID_y, shiftY2, sc->localSize[0], sc->size[sc->axis_id + 1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim / 8 * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->size[sc->axis_id + 1] % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + (%s%s)*%" PRIu64 "< %" PRIu64 "){", sc->fftDim / 8, sc->gl_WorkGroupID_y, shiftY2, sc->localSize[1], sc->size[sc->axis_id + 1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim / 8 * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 
")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->outputStride[1], sc->fft_zeropad_left_write[sc->axis_id], sc->outputStride[1], sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->writeFromRegisters) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s%s%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s%s%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[(2*(combinedID %% %" PRIu64 ")+1) * sharedStride + (combinedID / %" PRIu64 ")].x/2%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->fftDim / 8, sc->fftDim / 8, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[(2*(combinedID %% %" PRIu64 ")+1) * sharedStride + (combinedID / %" PRIu64 ")].x/2%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->fftDim / 8, sc->fftDim / 8, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if 
(sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[2*(combinedID %% %" PRIu64 ")+1 + (combinedID / %" PRIu64 ") * sharedStride].x/2%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->fftDim / 8, sc->fftDim / 8, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[2*(combinedID %% %" PRIu64 ")+1 + (combinedID / %" PRIu64 ") * sharedStride].x/2%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->fftDim / 8, sc->fftDim / 8, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if (sc->size[sc->axis_id + 1] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->size[sc->axis_id + 1] % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } } /*else { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if 
(sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " inoutID = combinedID %% %" PRIu64 " + ((%s%s) / %" PRIu64 ")*%" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ")+ ((%s%s) %% %" PRIu64 ") * %" PRIu64 ";\n", sc->localSize[0], sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[0], sc->localSize[0], sc->fft_dim_full / sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fft_dim_full / sc->firstStageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " inoutID = (%s%s)/%" PRIu64 "+ (combinedID * %" PRIu64 ")+ ((%s%s) %% %" PRIu64 ") * %" PRIu64 ";\n", sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fft_dim_full / sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fft_dim_full / sc->firstStageStartSize); else sc->tempLen = sprintf(sc->tempStr, " inoutID = combinedID %% %" PRIu64 " + ((%s%s) / %" PRIu64 ")*%" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ")+ ((%s%s) %% %" PRIu64 ") * %" PRIu64 ";\n", sc->localSize[1], sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1], sc->localSize[1], sc->fft_dim_full / sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fft_dim_full / sc->firstStageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->fft_dim_full, sc->fft_zeropad_left_write[sc->axis_id], sc->fft_dim_full, sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, 
requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->writeFromRegisters) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s%s%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s%s%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[(combinedID %% %s)+(combinedID/%s)*sharedStride]%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->gl_WorkGroupSize_x, sc->gl_WorkGroupSize_x, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[(combinedID %% %s)+(combinedID/%s)*sharedStride]%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->gl_WorkGroupSize_x, sc->gl_WorkGroupSize_x, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[(combinedID %% %s)*sharedStride+combinedID/%s]%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->gl_WorkGroupSize_y, sc->gl_WorkGroupSize_y, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[(combinedID %% %s)*sharedStride+combinedID/%s]%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, 
convTypeLeft, sc->gl_WorkGroupSize_y, sc->gl_WorkGroupSize_y, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } }*/ /*} else { if (sc->fftDim == sc->fft_dim_full) { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputStride[0] > 1) sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") * %" PRIu64 " + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->outputStride[0], sc->fftDim, sc->outputStride[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * %" PRIu64 ";\n", sc->fftDim, sc->fftDim, sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if (sc->size[sc->axis_id + 1] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, sc->gl_WorkGroupID_y, sc->localSize[0], sc->size[sc->axis_id + 1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->size[sc->axis_id + 1] % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){", sc->fftDim, 
sc->gl_WorkGroupID_y, sc->localSize[1], sc->size[sc->axis_id + 1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->outputStride[1], sc->fft_zeropad_left_write[sc->axis_id], sc->outputStride[1], sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->writeFromRegisters) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %s%s%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %s%s%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")]%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[(combinedID %% %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")]%s;\n", sc->inoutID, sc->outputBufferBlockSize, 
outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride]%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = %ssdata[(combinedID %% %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride]%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->fftDim, sc->fftDim, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if (sc->size[sc->axis_id + 1] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->size[sc->axis_id + 1] % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } } else { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 " * numActiveThreads;\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) 
return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " inoutID = (combinedID %% %" PRIu64 ")+(combinedID / %" PRIu64 ") * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ");", sc->fftDim, sc->fftDim, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[0] * sc->firstStageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " inoutID = %s+%" PRIu64 "+%s * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ");", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0], sc->gl_LocalInvocationID_y, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1] * sc->firstStageStartSize); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->fft_dim_full, sc->fft_zeropad_left_write[sc->axis_id], sc->fft_dim_full, sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " inoutID = indexOutput(%s+i*%" PRIu64 "+%s * %" PRIu64 " + (((%s%s) %% %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") * %" PRIu64 ")%s%s);\n", 
sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, sc->firstStageStartSize, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->fftDim, sc->gl_WorkGroupID_x, shiftX, sc->firstStageStartSize / sc->fftDim, sc->localSize[1] * sc->firstStageStartSize, requestCoordinate, requestBatch); res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->writeFromRegisters) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID]=%s%s%s;\n", outputsStruct, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s%s%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->axisSwapped) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID]=%ssdata[%s + sharedStride*(%s + %" PRIu64 ")]%s;\n", outputsStruct, convTypeLeft, sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %ssdata[%s + sharedStride*(%s + %" PRIu64 ")]%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID]=%ssdata[sharedStride*%s + (%s + %" PRIu64 ")]%s;\n", outputsStruct, convTypeLeft, sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * 
sc->localSize[0], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %ssdata[sharedStride*%s + (%s + %" PRIu64 ")]%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } }*/ sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; break; } case 141: //DCT-IV strided as 8N DFT { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); if (sc->fftDim != sc->fft_dim_full) sc->tempLen = sprintf(sc->tempStr, " if (((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")+((%s%s) / %" PRIu64 ") * (%" PRIu64 ") < %" PRIu64 ") {\n", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * sc->stageStartSize, sc->fftDim * sc->stageStartSize, sc->size[sc->axis_id]); else sc->tempLen = sprintf(sc->tempStr, " {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //if ((sc->reorderFourStep) && (sc->stageStartSize == 1)) { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < (uint64_t)ceil(sc->min_registers_per_thread / 8.0); i++) { if (sc->fftDim == sc->fft_dim_full) sc->tempLen = sprintf(sc->tempStr, " inoutID = (%s + %" PRIu64 ");\n", sc->gl_LocalInvocationID_y, (i + k * 
sc->min_registers_per_thread) * sc->localSize[1]); else sc->tempLen = sprintf(sc->tempStr, " inoutID = (%s + %" PRIu64 ") * (%" PRIu64 ") + (((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")) * (%" PRIu64 ") + ((%s%s) / %" PRIu64 ");\n", sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->fft_dim_full / sc->fftDim, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->firstStageStartSize / sc->fftDim, sc->fft_dim_full / sc->firstStageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * (sc->firstStageStartSize / sc->fftDim)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if(inoutID < %" PRIu64 "){\n", sc->fftDim / 8); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->fft_dim_full, sc->fft_zeropad_left_write[sc->axis_id], sc->fft_dim_full, sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x); res = indexOutputVkFFT(sc, uintType, writeType, index_x, sc->inoutID, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[%s] = %ssdata[%s*(2*(%s+%" PRIu64 ")+1) + %s].x/2%s;\n", outputsStruct, sc->inoutID, convTypeLeft, sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[%s / %" PRIu64 "]%s[%s %% %" PRIu64 "] = 
%ssdata[%s*(2*(%s+%" PRIu64 ")+1) + %s].x/2%s;\n", sc->inoutID, sc->outputBufferBlockSize, outputsStruct, sc->inoutID, sc->outputBufferBlockSize, convTypeLeft, sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } /*} else { for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " inoutID = (%s + %" PRIu64 ") * %" PRIu64 " + ((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")+((%s%s) / %" PRIu64 ") * (%" PRIu64 ");\n", sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * sc->stageStartSize, sc->stageStartSize * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->fft_dim_full, sc->fft_zeropad_left_write[sc->axis_id], sc->fft_dim_full, sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x); sprintf(index_y, "%" PRIu64 " * (%s + %" PRIu64 ") + ((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")+((%s%s) / %" PRIu64 ") * (%" PRIu64 ")", sc->stageStartSize, sc->gl_LocalInvocationID_y, (i + k * 
sc->min_registers_per_thread) * sc->localSize[1], sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * sc->stageStartSize, sc->stageStartSize * sc->fftDim); res = indexOutputVkFFT(sc, uintType, writeType, index_x, index_y, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, " inoutID = indexOutput((%s%s) %% (%" PRIu64 "), %" PRIu64 " * (%s + %" PRIu64 ") + ((%s%s) / %" PRIu64 ") %% (%" PRIu64 ")+((%s%s) / %" PRIu64 ") * (%" PRIu64 ")%s%s);\n", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x * sc->stageStartSize, sc->stageStartSize * sc->fftDim, requestCoordinate, requestBatch); if (sc->writeFromRegisters) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s%s%s;\n", outputsStruct, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s%s%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %ssdata[%s*(%s+%" PRIu64 ") + %s]%s;\n", outputsStruct, convTypeLeft, sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID 
%% %" PRIu64 "] = %ssdata[%s*(%s+%" PRIu64 ") + %s]%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->sharedStride, sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[1], sc->gl_LocalInvocationID_x, convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } }*/ sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; break; } case 142://DCT-IV nonstrided as 2xN/2 DCT-II { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_y); char shiftY2[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY2, " + consts.workGroupShiftY "); if (sc->reorderFourStep) { //Not implemented } else { //appendBarrierVkFFT(sc, 1); //appendZeropadStart(sc); if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_zeropad_Bluestein_left_write[sc->axis_id]; uint64_t maxBluesteinCutOff = 1; if (sc->zeropadBluestein[1]) { if (sc->axisSwapped) maxBluesteinCutOff = sc->fftDim * sc->localSize[0]; else maxBluesteinCutOff = sc->fftDim * sc->localSize[1]; } for (uint64_t k = 0; k < sc->registerBoost; k++) { //num_out = (uint64_t)ceil(num_out / (double)sc->min_registers_per_thread); for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else 
sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", maxBluesteinCutOff); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) * sharedStride + (combinedID / %" PRIu64 ");\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID %% %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID %% %" PRIu64 ") %% 2)) * ((combinedID %% %" PRIu64 ")/2)) + (combinedID / %" PRIu64 ")* sharedStride;\n", sc->fftDim, sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = sdata[sdataID];\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s.y * (1.0%s - 2 * ((combinedID %% %" PRIu64 ")%%2));\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], LFending, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, 
" combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if (sc->size[1] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){\n", (sc->fftDim), sc->gl_WorkGroupID_y, sc->localSize[0], sc->size[sc->axis_id + 1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->size[1] % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){\n", (sc->fftDim), sc->gl_WorkGroupID_y, sc->localSize[1], sc->size[sc->axis_id + 1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim) * sc->localSize[1]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim) * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } sc->tempLen = sprintf(index_x, "combinedID %% %" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ")", sc->fftDim, sc->fftDim, sc->outputStride[1]); sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, index_x, 0, requestCoordinate, requestBatch); 
if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " mult = twiddleLUT[%" PRIu64 " + combinedID %% %" PRIu64 "];\n", sc->startDCT4LUT, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " mult.x = %s(%.17e%s * (2*(combinedID %% %" PRIu64 ")+1) );\n", cosDef, -double_PI / 8 / sc->fftDim, LFending, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = %s(%.17e%s * (2*(combinedID %% %" PRIu64 ")+1) );\n", sinDef, -double_PI / 8 / sc->fftDim, LFending, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->outputStride[1], sc->fft_zeropad_left_write[sc->axis_id], sc->outputStride[1], sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s(%s.x*mult.x - %s.y*mult.y)%s;\n", outputsStruct, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s(%s.x*mult.x - %s.y*mult.y)%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = 
sprintf(index_x, "%" PRIu64 " - combinedID %% %" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ")", 2 * sc->fftDim - 1, sc->fftDim, sc->fftDim, sc->outputStride[1]); sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, index_x, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->outputStride[1], sc->fft_zeropad_left_write[sc->axis_id], sc->outputStride[1], sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s(-%s.x*mult.y - %s.y*mult.x)%s;\n", outputsStruct, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s(-%s.x*mult.y - %s.y*mult.x)%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((1 + i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((1 + i + k * sc->min_registers_per_thread) * sc->localSize[0] * 
sc->localSize[1] >= (sc->fftDim) * sc->localSize[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->axisSwapped) { if (sc->size[1] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (sc->size[1] % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_dim_full; } else { } } break; } case 143://DCT-IV strided as 2xN/2 DCT-II { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX "); char shiftX2[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX2, " + consts.workGroupShiftX * %s ", sc->gl_WorkGroupSize_x); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_y); char shiftY2[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY2, " + consts.workGroupShiftY "); if (sc->reorderFourStep) { //Not implemented } else { //appendBarrierVkFFT(sc, 1); //appendZeropadStart(sc); if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_zeropad_Bluestein_left_write[sc->axis_id]; for (uint64_t k = 0; k < sc->registerBoost; k++) { //num_out = (uint64_t)ceil(num_out / (double)sc->min_registers_per_thread); for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", 
sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " sdataID = (((combinedID / %" PRIu64 ") %% 2) * %" PRIu64 " + (1-2*((combinedID / %" PRIu64 ") %% 2)) * ((combinedID / %" PRIu64 ")/2)) * sharedStride + (combinedID %% %" PRIu64 ");\n", sc->localSize[0], sc->fftDim - 1, sc->localSize[0], sc->localSize[0], sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = sdata[sdataID];\n", sc->regIDs[i + k * sc->registers_per_thread]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = %s.y * (1.0%s - 2 * ((combinedID / %" PRIu64 ")%%2));\n", sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], LFending, sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; 
if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((uint64_t)ceil(sc->size[0]) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if ((%s%s) < %" PRIu64 ") {\n", sc->gl_GlobalInvocationID_x, shiftX2, (uint64_t)ceil(sc->size[0])); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", (sc->fftDim) * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX2, sc->fft_dim_x); sprintf(index_y, "(%s + %" PRIu64 ")", sc->gl_LocalInvocationID_y, (i + k * 2 * sc->min_registers_per_thread) * sc->localSize[1]); res = indexOutputVkFFT(sc, uintType, writeType, index_x, index_y, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " mult = twiddleLUT[%" PRIu64 " + combinedID / %" PRIu64 "];\n", sc->startDCT4LUT, sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " mult.x = %s(%.17e%s * (2*(combinedID / %" PRIu64 ")+1) );\n", cosDef, -double_PI / 8 / sc->fftDim, LFending, sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " mult.y = %s(%.17e%s * (2*(combinedID / %" PRIu64 ")+1) );\n", sinDef, -double_PI / 8 
/ sc->fftDim, LFending, sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((%s %% %" PRIu64 " < %" PRIu64 ")||(%s %% %" PRIu64 " >= %" PRIu64 ")){\n", index_y, sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], index_y, sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s(%s.x*mult.x - %s.y*mult.y)%s;\n", outputsStruct, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s(%s.x*mult.x - %s.y*mult.y)%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "(%s%s) %% (%" PRIu64 ")", sc->gl_GlobalInvocationID_x, shiftX2, sc->fft_dim_x); sprintf(index_y, "(%" PRIu64 " - (%s + %" PRIu64 "))", 2 * sc->fftDim - 1, sc->gl_LocalInvocationID_y, (i + k * 2 * sc->min_registers_per_thread) * sc->localSize[1]); res = indexOutputVkFFT(sc, uintType, writeType, index_x, index_y, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((%s %% %" PRIu64 " < %" PRIu64 ")||(%s %% %" PRIu64 " >= %" PRIu64 ")){\n", 
index_y, sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], index_y, sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s(-%s.x*mult.y - %s.y*mult.x)%s;\n", outputsStruct, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s(-%s.x*mult.y - %s.y*mult.x)%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1] >= (sc->fftDim) * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((uint64_t)ceil(sc->size[0]) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropadBluestein[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_dim_full; } else { } } break; } case 144://odd DCT-IV nonstrided as N FFT { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + 
consts.workGroupShiftX "); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_y); char shiftY2[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY2, " + consts.workGroupShiftY "); uint64_t mult = (sc->mergeSequencesR2C) ? 2 : 1; if (sc->reorderFourStep) { //Not implemented } else { //appendBarrierVkFFT(sc, 1); //appendZeropadStart(sc); if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_zeropad_Bluestein_left_write[sc->axis_id]; for (uint64_t k = 0; k < sc->registerBoost; k++) { if (sc->mergeSequencesR2C) { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, "\ if (%s==0)\n\ {\n\ sdata[%s + %" PRIu64 "* sharedStride] = sdata[%s];\n\ }\n", sc->gl_LocalInvocationID_y, sc->gl_LocalInvocationID_x, sc->fftDim, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadEnd(sc); //if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "\ if (%s==0)\n\ {\n\ sdata[%s * sharedStride + %" PRIu64 "] = sdata[%s * sharedStride];\n\ }\n", sc->gl_LocalInvocationID_x, sc->gl_LocalInvocationID_y, sc->fftDim, sc->gl_LocalInvocationID_y); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadEnd(sc); //if (res != VKFFT_SUCCESS) return res; res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; } } //uint64_t num_out = (sc->axisSwapped) ? 
(uint64_t)ceil(mult * (sc->fftDim / 2 + 1) / (double)sc->localSize[1]) : (uint64_t)ceil(mult * (sc->fftDim / 2 + 1) / (double)sc->localSize[0]); //num_out = (uint64_t)ceil(num_out / (double)sc->min_registers_per_thread); for (uint64_t i = 0; i < mult * sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = combinedID %% %" PRIu64 " + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->fftDim, sc->fftDim, sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){\n", mult * sc->fftDim, sc->gl_WorkGroupID_y, sc->localSize[0], (uint64_t)ceil(sc->size[1] / (double)mult)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1] >= mult * sc->fftDim * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", mult * sc->fftDim * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID / %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){\n", mult * sc->fftDim, sc->gl_WorkGroupID_y, sc->localSize[1], (uint64_t)ceil(sc->size[1] / (double)mult)); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1] >= mult * sc->fftDim * sc->localSize[1]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", mult * sc->fftDim * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if((inoutID %% %" PRIu64 " < %" PRIu64 ")||(inoutID %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->outputStride[1], sc->fft_zeropad_left_write[sc->axis_id], sc->outputStride[1], sc->fft_zeropad_right_write[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; if (sc->writeFromRegisters) { /* not working yet */ if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s%s%s;\n", outputsStruct, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s%s%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[i + k * sc->registers_per_thread], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " sdataID = combinedID %% %" PRIu64 ";\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if(sdataID < %" PRIu64 "){\n", sc->fftDim / 4); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if 
(sc->mergeSequencesR2C) { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, "if ( (combinedID / %" PRIu64 ") %% 2 == 0){\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(2*sdataID+1) * sharedStride + (combinedID / %" PRIu64 ")].x+sdata[(%" PRIu64 "- (2*sdataID+1)) * sharedStride + (combinedID / %" PRIu64 ")].x);\n", sc->regIDs[0], LFending, mult * sc->fftDim, sc->fftDim, mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(sdata[(2*sdataID+1) * sharedStride + (combinedID / %" PRIu64 ")].y-sdata[(%" PRIu64 "- (2*sdataID+1)) * sharedStride + (combinedID / %" PRIu64 ")].y);\n", sc->regIDs[0], LFending, mult * sc->fftDim, sc->fftDim, mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(2*sdataID+1) * sharedStride + (combinedID / %" PRIu64 ")].y+sdata[(%" PRIu64 "- (2*sdataID+1)) * sharedStride + (combinedID / %" PRIu64 ")].y);\n", sc->regIDs[0], LFending, mult * sc->fftDim, sc->fftDim, mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(-sdata[(2*sdataID+1) * sharedStride + (combinedID / %" PRIu64 ")].x+sdata[(%" PRIu64 "- (2*sdataID+1)) * sharedStride + (combinedID / %" PRIu64 ")].x);\n", sc->regIDs[0], LFending, mult * sc->fftDim, sc->fftDim, mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "if ( (combinedID / %" PRIu64 ") %% 2 == 0){\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = 
sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(2*sdataID+1) + (combinedID / %" PRIu64 ") * sharedStride].x+sdata[(%" PRIu64 "- (2*sdataID+1)) + (combinedID / %" PRIu64 ") * sharedStride].x);\n", sc->regIDs[0], LFending, mult * sc->fftDim, sc->fftDim, mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(sdata[(2*sdataID+1) + (combinedID / %" PRIu64 ") * sharedStride].y-sdata[(%" PRIu64 "- (2*sdataID+1)) + (combinedID / %" PRIu64 ") * sharedStride].y);\n", sc->regIDs[0], LFending, mult * sc->fftDim, sc->fftDim, mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(2*sdataID+1) + (combinedID / %" PRIu64 ") * sharedStride].y+sdata[(%" PRIu64 "- (2*sdataID+1)) + (combinedID / %" PRIu64 ") * sharedStride].y);\n", sc->regIDs[0], LFending, mult * sc->fftDim, sc->fftDim, mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(-sdata[(2*sdataID+1) + (combinedID / %" PRIu64 ") * sharedStride].x+sdata[(%" PRIu64 "- (2*sdataID+1)) + (combinedID / %" PRIu64 ") * sharedStride].x);\n", sc->regIDs[0], LFending, mult * sc->fftDim, sc->fftDim, mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!sc->axisSwapped) sc->tempLen = sprintf(sc->tempStr, " %s = sdata[(2*sdataID+1) + (combinedID / %" PRIu64 ") * sharedStride];\n", sc->regIDs[0], sc->fftDim); else sc->tempLen = sprintf(sc->tempStr, " %s = sdata[(2*sdataID+1) * sharedStride + (combinedID / %" PRIu64 ")];\n", sc->regIDs[0], sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if 
((((sdataID + 1)/2) %% 2) != 0) \n\ %s.x = -%s.x;\n\ else\n\ %s.x = %s.x;\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((((sdataID)/2) %% 2) != 0) \n\ %s.x += %s.y;\n\ else\n\ %s.x -= %s.y;\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if((sdataID < %" PRIu64 ")&&(sdataID >= %" PRIu64 ")){\n", sc->fftDim / 2, sc->fftDim / 4); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, "if ( (combinedID / %" PRIu64 ") %% 2 == 0){\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(%" PRIu64 " - 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")].x+sdata[(%" PRIu64 " + 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")].x);\n", sc->regIDs[0], LFending, 2 * (sc->fftDim / 2), mult * sc->fftDim, sc->fftDim - 2 * (sc->fftDim / 2), mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(sdata[(%" PRIu64 " - 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")].y-sdata[(%" PRIu64 " + 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")].y);\n", sc->regIDs[0], LFending, 2 * (sc->fftDim / 2), mult * sc->fftDim, sc->fftDim - 2 * (sc->fftDim / 2), mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(%" PRIu64 " - 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")].y+sdata[(%" 
PRIu64 " + 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")].y);\n", sc->regIDs[0], LFending, 2 * (sc->fftDim / 2), mult * sc->fftDim, sc->fftDim - 2 * (sc->fftDim / 2), mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(-sdata[(%" PRIu64 " - 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")].x+sdata[(%" PRIu64 " + 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")].x);\n", sc->regIDs[0], LFending, 2 * (sc->fftDim / 2), mult * sc->fftDim, sc->fftDim - 2 * (sc->fftDim / 2), mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "if ( (combinedID / %" PRIu64 ") %% 2 == 0){\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(%" PRIu64 " - 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride].x+sdata[(%" PRIu64 " + 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride].x);\n", sc->regIDs[0], LFending, 2 * (sc->fftDim / 2), mult * sc->fftDim, sc->fftDim - 2 * (sc->fftDim / 2), mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(sdata[(%" PRIu64 " - 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride].y-sdata[(%" PRIu64 " + 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride].y);\n", sc->regIDs[0], LFending, 2 * (sc->fftDim / 2), mult * sc->fftDim, sc->fftDim - 2 * (sc->fftDim / 2), mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(%" PRIu64 " - 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride].y+sdata[(%" PRIu64 " + 2*sdataID) + 
(combinedID / %" PRIu64 ") * sharedStride].y);\n", sc->regIDs[0], LFending, 2 * (sc->fftDim / 2), mult * sc->fftDim, sc->fftDim - 2 * (sc->fftDim / 2), mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(-sdata[(%" PRIu64 " - 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride].x+sdata[(%" PRIu64 " + 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride].x);\n", sc->regIDs[0], LFending, 2 * (sc->fftDim / 2), mult * sc->fftDim, sc->fftDim - 2 * (sc->fftDim / 2), mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!sc->axisSwapped) sc->tempLen = sprintf(sc->tempStr, " %s = sdata[(%" PRIu64 " - 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride];\n", sc->regIDs[0], 2 * (sc->fftDim / 2), sc->fftDim); else sc->tempLen = sprintf(sc->tempStr, " %s = sdata[(%" PRIu64 " - 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")];\n", sc->regIDs[0], 2 * (sc->fftDim / 2), sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if ((((sdataID+1)/2) %% 2) != 0) \n\ %s.x = -%s.x;\n\ else\n\ %s.x = %s.x;\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((((sdataID)/2) %% 2) != 0) \n\ %s.x -= %s.y;\n\ else\n\ %s.x += %s.y;\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if((sdataID < %" PRIu64 ")&&(sdataID >= %" PRIu64 ")){\n", 3 * sc->fftDim / 4, sc->fftDim / 2); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { if 
(sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, "if ( (combinedID / %" PRIu64 ") %% 2 == 0){\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(2*sdataID - %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].x+sdata[(%" PRIu64 " - 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")].x);\n", sc->regIDs[0], LFending, 2 * (sc->fftDim / 2), mult * sc->fftDim, sc->fftDim + 2 * (sc->fftDim / 2), mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(sdata[(2*sdataID - %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].y-sdata[(%" PRIu64 " - 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")].y);\n", sc->regIDs[0], LFending, 2 * (sc->fftDim / 2), mult * sc->fftDim, sc->fftDim + 2 * (sc->fftDim / 2), mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(2*sdataID - %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].y+sdata[(%" PRIu64 " - 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")].y);\n", sc->regIDs[0], LFending, 2 * (sc->fftDim / 2), mult * sc->fftDim, sc->fftDim + 2 * (sc->fftDim / 2), mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(-sdata[(2*sdataID - %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].x+sdata[(%" PRIu64 " - 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")].x);\n", sc->regIDs[0], LFending, 2 * (sc->fftDim / 2), mult * sc->fftDim, sc->fftDim + 2 * (sc->fftDim / 2), mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { 
sc->tempLen = sprintf(sc->tempStr, "if ( (combinedID / %" PRIu64 ") %% 2 == 0){\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(2*sdataID - %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x+sdata[(%" PRIu64 " - 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride].x);\n", sc->regIDs[0], LFending, 2 * (sc->fftDim / 2), mult * sc->fftDim, sc->fftDim + 2 * (sc->fftDim / 2), mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(sdata[(2*sdataID - %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y-sdata[(%" PRIu64 " - 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride].y);\n", sc->regIDs[0], LFending, 2 * (sc->fftDim / 2), mult * sc->fftDim, sc->fftDim + 2 * (sc->fftDim / 2), mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(2*sdataID - %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y+sdata[(%" PRIu64 " - 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride].y);\n", sc->regIDs[0], LFending, 2 * (sc->fftDim / 2), mult * sc->fftDim, sc->fftDim + 2 * (sc->fftDim / 2), mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(-sdata[(2*sdataID - %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x+sdata[(%" PRIu64 " - 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride].x);\n", sc->regIDs[0], LFending, 2 * (sc->fftDim / 2), mult * sc->fftDim, sc->fftDim + 2 * (sc->fftDim / 2), mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!sc->axisSwapped) 
sc->tempLen = sprintf(sc->tempStr, " %s = sdata[(2*sdataID - %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride];\n", sc->regIDs[0], 2 * (sc->fftDim / 2), sc->fftDim); else sc->tempLen = sprintf(sc->tempStr, " %s = sdata[(2*sdataID - %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")];\n", sc->regIDs[0], 2 * (sc->fftDim / 2), sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if ((((sdataID+1)/2) %% 2) != 0) \n\ %s.x = -%s.x;\n\ else\n\ %s.x = %s.x;\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((((sdataID)/2) %% 2) != 0) \n\ %s.x += %s.y;\n\ else\n\ %s.x -= %s.y;\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if((sdataID >= %" PRIu64 ")){\n", 3 * sc->fftDim / 4); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->mergeSequencesR2C) { if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, "if ( (combinedID / %" PRIu64 ") %% 2 == 0){\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(%" PRIu64 " - 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")].x+sdata[(2*sdataID - %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].x);\n", sc->regIDs[0], LFending, 2 * sc->fftDim - 1, mult * sc->fftDim, sc->fftDim - 1, mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(sdata[(%" PRIu64 " - 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")].y-sdata[(2*sdataID - %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].y);\n", sc->regIDs[0], LFending, 2 * sc->fftDim 
- 1, mult * sc->fftDim, sc->fftDim - 1, mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(%" PRIu64 " - 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")].y+sdata[(2*sdataID - %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].y);\n", sc->regIDs[0], LFending, 2 * sc->fftDim - 1, mult * sc->fftDim, sc->fftDim - 1, mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(-sdata[(%" PRIu64 " - 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")].x+sdata[(2*sdataID - %" PRIu64 ") * sharedStride + (combinedID / %" PRIu64 ")].x);\n", sc->regIDs[0], LFending, 2 * sc->fftDim - 1, mult * sc->fftDim, sc->fftDim - 1, mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "if ( (combinedID / %" PRIu64 ") %% 2 == 0){\n", sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(%" PRIu64 " - 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride].x+sdata[(2*sdataID - %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x);\n", sc->regIDs[0], LFending, 2 * sc->fftDim - 1, mult * sc->fftDim, sc->fftDim - 1, mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(sdata[(%" PRIu64 " - 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride].y-sdata[(2*sdataID - %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y);\n", sc->regIDs[0], LFending, 2 * sc->fftDim - 1, mult * sc->fftDim, sc->fftDim - 1, mult * sc->fftDim); res = VkAppendLine(sc); if (res != 
VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x = 0.5%s*(sdata[(%" PRIu64 " - 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride].y+sdata[(2*sdataID - %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].y);\n", sc->regIDs[0], LFending, 2 * sc->fftDim - 1, mult * sc->fftDim, sc->fftDim - 1, mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.y = 0.5%s*(-sdata[(%" PRIu64 " - 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride].x+sdata[(2*sdataID - %" PRIu64 ") + (combinedID / %" PRIu64 ") * sharedStride].x);\n", sc->regIDs[0], LFending, 2 * sc->fftDim - 1, mult * sc->fftDim, sc->fftDim - 1, mult * sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if (!sc->axisSwapped) sc->tempLen = sprintf(sc->tempStr, " %s = sdata[(%" PRIu64 " - 2*sdataID) + (combinedID / %" PRIu64 ") * sharedStride];\n", sc->regIDs[0], 2 * sc->fftDim - 1, sc->fftDim); else sc->tempLen = sprintf(sc->tempStr, " %s = sdata[(%" PRIu64 " - 2*sdataID) * sharedStride + (combinedID / %" PRIu64 ")];\n", sc->regIDs[0], 2 * sc->fftDim - 1, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " if ((((sdataID+1)/2) %% 2) != 0) \n\ %s.x = -%s.x;\n\ else\n\ %s.x = %s.x;\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((((sdataID)/2) %% 2) != 0) \n\ %s.x -= %s.y;\n\ else\n\ %s.x += %s.y;\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n\n"); res = VkAppendLine(sc); if (res 
!= VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x *= 1.41421356237309504880%s;\n", sc->regIDs[1], LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s%s.x%s;\n", outputsStruct, convTypeLeft, sc->regIDs[1], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s%s.x%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[1], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->axisSwapped) { if ((1 + i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1] >= mult * sc->fftDim * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((1 + i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1] >= mult * sc->fftDim * sc->localSize[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (sc->axisSwapped) { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } else { if ((uint64_t)ceil(sc->size[1] / (double)mult) % sc->localSize[1] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } } if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_dim_full; /*for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { if (sc->localSize[1] == 1) sc->tempLen = 
sprintf(sc->tempStr, " combinedID = %s + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0]); else sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if(%s + %" PRIu64 " < %" PRIu64 "){\n", sc->gl_LocalInvocationID_x, (i + k * sc->min_registers_per_thread) * sc->localSize[0], (sc->fftDim-1)/2); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " w = sdata[(2*(combinedID %% %" PRIu64 ")+1)* sharedStride + (combinedID / %" PRIu64 ")];\n",sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = sdata[(2*(combinedID %% %" PRIu64 ")+2)* sharedStride + (combinedID / %" PRIu64 ")];\n", sc->regIDs[i + k * sc->min_registers_per_thread], sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " w = sdata[(2*(combinedID %% %" PRIu64 ")+1) + (combinedID / %" PRIu64 ") * sharedStride];\n", sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = sdata[(2*(combinedID %% %" PRIu64 ")+2) + (combinedID / %" PRIu64 ") * sharedStride];\n", sc->regIDs[i + k * sc->min_registers_per_thread], sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }else{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->axisSwapped) { sc->tempLen = sprintf(sc->tempStr, " w = sdata[(2*(%" PRIu64 " - combinedID %% %" PRIu64 ")-1)* sharedStride + (combinedID / %" PRIu64 ")];\n", sc->fftDim, 
sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = sdata[(2*(%" PRIu64 " - combinedID %% %" PRIu64 "))* sharedStride + (combinedID / %" PRIu64 ")];\n", sc->regIDs[i + k * sc->min_registers_per_thread], sc->fftDim, sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " w = sdata[(2*(%" PRIu64 " - combinedID %% %" PRIu64 ")-1) + (combinedID / %" PRIu64 ")* sharedStride];\n", sc->fftDim, sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = sdata[(2*(%" PRIu64 " - combinedID %% %" PRIu64 ")) + (combinedID / %" PRIu64 ")* sharedStride];\n", sc->regIDs[i + k * sc->min_registers_per_thread], sc->fftDim, sc->fftDim, sc->fftDim); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } }*/ } else { } } break; } case 145://odd DCT-IV strided as N FFT { if (!sc->writeFromRegisters) { res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) return res; } //res = appendZeropadStart(sc); //if (res != VKFFT_SUCCESS) return res; char shiftX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(shiftX, " + consts.workGroupShiftX*%s ", sc->gl_WorkGroupSize_x); char shiftY[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY, " + consts.workGroupShiftY*%s ", sc->gl_WorkGroupSize_y); char shiftY2[500] = ""; if (sc->performWorkGroupShift[1]) sprintf(shiftY2, " + consts.workGroupShiftY "); uint64_t mult = (sc->mergeSequencesR2C) ? 
2 : 1; if (sc->reorderFourStep) { //Not implemented } else { //appendBarrierVkFFT(sc, 1); //appendZeropadStart(sc); if (sc->fftDim == sc->fft_dim_full) { if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_zeropad_Bluestein_left_write[sc->axis_id]; for (uint64_t k = 0; k < sc->registerBoost; k++) { for (uint64_t i = 0; i < sc->min_registers_per_thread; i++) { sc->tempLen = sprintf(sc->tempStr, " combinedID = (%s + %" PRIu64 " * %s) + %" PRIu64 ";\n", sc->gl_LocalInvocationID_x, sc->localSize[0], sc->gl_LocalInvocationID_y, (i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = %s%s + ((combinedID/%" PRIu64 ") * %" PRIu64 ");\n", sc->inoutID, sc->gl_GlobalInvocationID_x, shiftX, sc->localSize[0], sc->outputStride[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->size[0] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID %% %" PRIu64 " + %s*%" PRIu64 "< %" PRIu64 "){\n", sc->localSize[0], sc->gl_WorkGroupID_x, sc->localSize[0], sc->size[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if ((1 + i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1] >= sc->fftDim * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " if(combinedID < %" PRIu64 "){\n", sc->fftDim * sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " if(((combinedID/%" PRIu64 ") %% %" PRIu64 " < %" PRIu64 ")||((combinedID/%" PRIu64 ") %% %" PRIu64 " >= %" PRIu64 ")){\n", sc->localSize[0], sc->fft_dim_full, sc->fft_zeropad_left_read[sc->axis_id], sc->localSize[0], sc->fft_dim_full, sc->fft_zeropad_right_read[sc->axis_id]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " %s = ", sc->inoutID); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return 
res; res = indexOutputVkFFT(sc, uintType, writeType, sc->inoutID, 0, requestCoordinate, requestBatch); sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadStartReadWriteStage(sc, 0); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " sdataID = combinedID / %" PRIu64 ";\n", sc->localSize[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if(sdataID < %" PRIu64 "){\n", sc->fftDim / 4); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = sdata[(2*sdataID+1) * sharedStride + %s];\n", sc->regIDs[0], sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((((sdataID + 1)/2) %% 2) != 0) \n\ %s.x = -%s.x;\n\ else\n\ %s.x = %s.x;\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((((sdataID)/2) %% 2) != 0) \n\ %s.x += %s.y;\n\ else\n\ %s.x -= %s.y;\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if((sdataID < %" PRIu64 ")&&(sdataID >= %" PRIu64 ")){\n", sc->fftDim / 2, sc->fftDim / 4); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = sdata[(%" PRIu64 " - 2*sdataID) * sharedStride + %s];\n", sc->regIDs[0], 2 * (sc->fftDim / 2), sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((((sdataID+1)/2) %% 2) != 0) \n\ %s.x = -%s.x;\n\ else\n\ %s.x = %s.x;\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); 
if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((((sdataID)/2) %% 2) != 0) \n\ %s.x -= %s.y;\n\ else\n\ %s.x += %s.y;\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if((sdataID < %" PRIu64 ")&&(sdataID >= %" PRIu64 ")){\n", 3 * sc->fftDim / 4, sc->fftDim / 2); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = sdata[(2*sdataID - %" PRIu64 ") * sharedStride + %s];\n", sc->regIDs[0], 2 * (sc->fftDim / 2), sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((((sdataID+1)/2) %% 2) != 0) \n\ %s.x = -%s.x;\n\ else\n\ %s.x = %s.x;\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((((sdataID)/2) %% 2) != 0) \n\ %s.x += %s.y;\n\ else\n\ %s.x -= %s.y;\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if((sdataID >= %" PRIu64 ")){\n", 3 * sc->fftDim / 4); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s = sdata[(%" PRIu64 " - 2*sdataID) * sharedStride + %s];\n", sc->regIDs[0], 2 * sc->fftDim - 1, sc->gl_LocalInvocationID_x); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " if ((((sdataID+1)/2) %% 2) != 0) \n\ %s.x = -%s.x;\n\ else\n\ %s.x = %s.x;\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) 
return res; sc->tempLen = sprintf(sc->tempStr, " if ((((sdataID)/2) %% 2) != 0) \n\ %s.x -= %s.y;\n\ else\n\ %s.x += %s.y;\n", sc->regIDs[1], sc->regIDs[0], sc->regIDs[1], sc->regIDs[0]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " }\n\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s.x *= 1.41421356237309504880%s;\n", sc->regIDs[1], LFending); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %s%s.x%s;\n", outputsStruct, convTypeLeft, sc->regIDs[1], convTypeRight); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %s%s.x%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeft, sc->regIDs[1], convTypeRight); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->zeropad[1]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } res = appendZeropadEndReadWriteStage(sc); if (res != VKFFT_SUCCESS) return res; if ((1 + i + k * sc->min_registers_per_thread) * sc->localSize[0] * sc->localSize[1] >= sc->fftDim * sc->localSize[0]) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->size[0] % sc->localSize[0] != 0) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } } if (sc->zeropadBluestein[1]) sc->fftDim = sc->fft_dim_full; } else { } } break; } } //res = appendZeropadEnd(sc); //if (res != VKFFT_SUCCESS) return res; return res; } static inline VkFFTResult shaderGenVkFFT_R2C_decomposition(char* output, VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* floatTypeInputMemory, const char* floatTypeOutputMemory, const char* floatTypeKernelMemory, const char* 
uintType, uint64_t type) { VkFFTResult res = VKFFT_SUCCESS; //appendLicense(output); sc->oldLocale = setlocale(LC_ALL, NULL); setlocale(LC_ALL, "C"); sc->output = output; sc->tempStr = (char*)malloc(sizeof(char) * sc->maxTempLength); if (!sc->tempStr) return VKFFT_ERROR_MALLOC_FAILED; sc->tempLen = 0; sc->currentLen = 0; char vecType[30]; char vecTypeInput[30]; char vecTypeOutput[30]; char inputsStruct[20] = ""; char outputsStruct[20] = ""; char LFending[4] = ""; if (!strcmp(floatType, "float")) sprintf(LFending, "f"); #if(VKFFT_BACKEND==0) if (sc->inputBufferBlockNum == 1) sprintf(inputsStruct, "inputs"); else sprintf(inputsStruct, ".inputs"); if (sc->outputBufferBlockNum == 1) sprintf(outputsStruct, "outputs"); else sprintf(outputsStruct, ".outputs"); if (!strcmp(floatType, "half")) sprintf(vecType, "f16vec2"); if (!strcmp(floatType, "float")) sprintf(vecType, "vec2"); if (!strcmp(floatType, "double")) sprintf(vecType, "dvec2"); if (!strcmp(floatTypeInputMemory, "half")) sprintf(vecTypeInput, "f16vec2"); if (!strcmp(floatTypeInputMemory, "float")) sprintf(vecTypeInput, "vec2"); if (!strcmp(floatTypeInputMemory, "double")) sprintf(vecTypeInput, "dvec2"); if (!strcmp(floatTypeOutputMemory, "half")) sprintf(vecTypeOutput, "f16vec2"); if (!strcmp(floatTypeOutputMemory, "float")) sprintf(vecTypeOutput, "vec2"); if (!strcmp(floatTypeOutputMemory, "double")) sprintf(vecTypeOutput, "dvec2"); sprintf(sc->gl_LocalInvocationID_x, "gl_LocalInvocationID.x"); sprintf(sc->gl_LocalInvocationID_y, "gl_LocalInvocationID.y"); sprintf(sc->gl_LocalInvocationID_z, "gl_LocalInvocationID.z"); sprintf(sc->gl_GlobalInvocationID_x, "gl_GlobalInvocationID.x"); sprintf(sc->gl_GlobalInvocationID_y, "gl_GlobalInvocationID.y"); sprintf(sc->gl_GlobalInvocationID_z, "gl_GlobalInvocationID.z"); sprintf(sc->gl_WorkGroupID_x, "gl_WorkGroupID.x"); sprintf(sc->gl_WorkGroupID_y, "gl_WorkGroupID.y"); sprintf(sc->gl_WorkGroupID_z, "gl_WorkGroupID.z"); sprintf(sc->gl_WorkGroupSize_x, 
"gl_WorkGroupSize.x"); sprintf(sc->gl_WorkGroupSize_y, "gl_WorkGroupSize.y"); sprintf(sc->gl_WorkGroupSize_z, "gl_WorkGroupSize.z"); if (!strcmp(floatType, "double")) sprintf(LFending, "LF"); char cosDef[20] = "cos"; char sinDef[20] = "sin"; #elif(VKFFT_BACKEND==1) sprintf(inputsStruct, "inputs"); sprintf(outputsStruct, "outputs"); if (!strcmp(floatType, "half")) sprintf(vecType, "f16vec2"); if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); if (!strcmp(floatTypeInputMemory, "half")) sprintf(vecTypeInput, "f16vec2"); if (!strcmp(floatTypeInputMemory, "float")) sprintf(vecTypeInput, "float2"); if (!strcmp(floatTypeInputMemory, "double")) sprintf(vecTypeInput, "double2"); if (!strcmp(floatTypeOutputMemory, "half")) sprintf(vecTypeOutput, "f16vec2"); if (!strcmp(floatTypeOutputMemory, "float")) sprintf(vecTypeOutput, "float2"); if (!strcmp(floatTypeOutputMemory, "double")) sprintf(vecTypeOutput, "double2"); sprintf(sc->gl_LocalInvocationID_x, "threadIdx.x"); sprintf(sc->gl_LocalInvocationID_y, "threadIdx.y"); sprintf(sc->gl_LocalInvocationID_z, "threadIdx.z"); sprintf(sc->gl_GlobalInvocationID_x, "(threadIdx.x + blockIdx.x * blockDim.x)"); sprintf(sc->gl_GlobalInvocationID_y, "(threadIdx.y + blockIdx.y * blockDim.y)"); sprintf(sc->gl_GlobalInvocationID_z, "(threadIdx.z + blockIdx.z * blockDim.z)"); sprintf(sc->gl_WorkGroupID_x, "blockIdx.x"); sprintf(sc->gl_WorkGroupID_y, "blockIdx.y"); sprintf(sc->gl_WorkGroupID_z, "blockIdx.z"); sprintf(sc->gl_WorkGroupSize_x, "blockDim.x"); sprintf(sc->gl_WorkGroupSize_y, "blockDim.y"); sprintf(sc->gl_WorkGroupSize_z, "blockDim.z"); if (!strcmp(floatType, "double")) sprintf(LFending, "l"); char cosDef[20] = "__cosf"; char sinDef[20] = "__sinf"; #elif(VKFFT_BACKEND==2) sprintf(inputsStruct, "inputs"); sprintf(outputsStruct, "outputs"); if (!strcmp(floatType, "half")) sprintf(vecType, "f16vec2"); if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if 
(!strcmp(floatType, "double")) sprintf(vecType, "double2"); if (!strcmp(floatTypeInputMemory, "half")) sprintf(vecTypeInput, "f16vec2"); if (!strcmp(floatTypeInputMemory, "float")) sprintf(vecTypeInput, "float2"); if (!strcmp(floatTypeInputMemory, "double")) sprintf(vecTypeInput, "double2"); if (!strcmp(floatTypeOutputMemory, "half")) sprintf(vecTypeOutput, "f16vec2"); if (!strcmp(floatTypeOutputMemory, "float")) sprintf(vecTypeOutput, "float2"); if (!strcmp(floatTypeOutputMemory, "double")) sprintf(vecTypeOutput, "double2"); sprintf(sc->gl_LocalInvocationID_x, "threadIdx.x"); sprintf(sc->gl_LocalInvocationID_y, "threadIdx.y"); sprintf(sc->gl_LocalInvocationID_z, "threadIdx.z"); sprintf(sc->gl_GlobalInvocationID_x, "(threadIdx.x + blockIdx.x * blockDim.x)"); sprintf(sc->gl_GlobalInvocationID_y, "(threadIdx.y + blockIdx.y * blockDim.y)"); sprintf(sc->gl_GlobalInvocationID_z, "(threadIdx.z + blockIdx.z * blockDim.z)"); sprintf(sc->gl_WorkGroupID_x, "blockIdx.x"); sprintf(sc->gl_WorkGroupID_y, "blockIdx.y"); sprintf(sc->gl_WorkGroupID_z, "blockIdx.z"); sprintf(sc->gl_WorkGroupSize_x, "blockDim.x"); sprintf(sc->gl_WorkGroupSize_y, "blockDim.y"); sprintf(sc->gl_WorkGroupSize_z, "blockDim.z"); if (!strcmp(floatType, "double")) sprintf(LFending, "l"); char cosDef[20] = "__cosf"; char sinDef[20] = "__sinf"; #elif(VKFFT_BACKEND==3) sprintf(inputsStruct, "inputs"); sprintf(outputsStruct, "outputs"); if (!strcmp(floatType, "half")) sprintf(vecType, "f16vec2"); if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); if (!strcmp(floatTypeInputMemory, "half")) sprintf(vecTypeInput, "f16vec2"); if (!strcmp(floatTypeInputMemory, "float")) sprintf(vecTypeInput, "float2"); if (!strcmp(floatTypeInputMemory, "double")) sprintf(vecTypeInput, "double2"); if (!strcmp(floatTypeOutputMemory, "half")) sprintf(vecTypeOutput, "f16vec2"); if (!strcmp(floatTypeOutputMemory, "float")) sprintf(vecTypeOutput, "float2"); if 
(!strcmp(floatTypeOutputMemory, "double")) sprintf(vecTypeOutput, "double2"); sprintf(sc->gl_LocalInvocationID_x, "get_local_id(0)"); sprintf(sc->gl_LocalInvocationID_y, "get_local_id(1)"); sprintf(sc->gl_LocalInvocationID_z, "get_local_id(2)"); sprintf(sc->gl_GlobalInvocationID_x, "get_global_id(0)"); sprintf(sc->gl_GlobalInvocationID_y, "get_global_id(1)"); sprintf(sc->gl_GlobalInvocationID_z, "get_global_id(2)"); sprintf(sc->gl_WorkGroupID_x, "get_group_id(0)"); sprintf(sc->gl_WorkGroupID_y, "get_group_id(1)"); sprintf(sc->gl_WorkGroupID_z, "get_group_id(2)"); sprintf(sc->gl_WorkGroupSize_x, "get_local_size(0)"); sprintf(sc->gl_WorkGroupSize_y, "get_local_size(1)"); sprintf(sc->gl_WorkGroupSize_z, "get_local_size(2)"); //if (!strcmp(floatType, "double")) sprintf(LFending, "l"); char cosDef[20] = "native_cos"; char sinDef[20] = "native_sin"; #endif sprintf(sc->stageInvocationID, "stageInvocationID"); sprintf(sc->blockInvocationID, "blockInvocationID"); sprintf(sc->tshuffle, "tshuffle"); sprintf(sc->sharedStride, "sharedStride"); sprintf(sc->combinedID, "combinedID"); sprintf(sc->inoutID, "inoutID"); sprintf(sc->sdataID, "sdataID"); char convTypeLeftInput[20] = ""; char convTypeRightInput[20] = ""; if ((!strcmp(floatType, "float")) && (strcmp(floatTypeInputMemory, "float"))) { #if(VKFFT_BACKEND==0) sprintf(convTypeLeftInput, "vec2("); sprintf(convTypeRightInput, ")"); #elif(VKFFT_BACKEND==1) sprintf(convTypeLeftInput, "conv_float2("); sprintf(convTypeRightInput, ")"); #elif(VKFFT_BACKEND==2) sprintf(convTypeLeftInput, "conv_float2("); sprintf(convTypeRightInput, ")"); #elif(VKFFT_BACKEND==3) sprintf(convTypeLeftInput, "conv_float2("); sprintf(convTypeRightInput, ")"); #endif } if ((!strcmp(floatType, "double")) && (strcmp(floatTypeInputMemory, "double"))) { #if(VKFFT_BACKEND==0) sprintf(convTypeLeftInput, "dvec2("); sprintf(convTypeRightInput, ")"); #elif(VKFFT_BACKEND==1) sprintf(convTypeLeftInput, "conv_double2("); sprintf(convTypeRightInput, ")"); 
#elif(VKFFT_BACKEND==2) sprintf(convTypeLeftInput, "conv_double2("); sprintf(convTypeRightInput, ")"); #elif(VKFFT_BACKEND==3) sprintf(convTypeLeftInput, "conv_double2("); sprintf(convTypeRightInput, ")"); #endif } char convTypeLeftOutput[20] = ""; char convTypeRightOutput[20] = ""; if ((!strcmp(floatTypeOutputMemory, "half")) && (strcmp(floatType, "half"))) { sprintf(convTypeLeftOutput, "f16vec2("); sprintf(convTypeRightOutput, ")"); } if ((!strcmp(floatTypeOutputMemory, "float")) && (strcmp(floatType, "float"))) { #if(VKFFT_BACKEND==0) sprintf(convTypeLeftOutput, "vec2("); sprintf(convTypeRightOutput, ")"); #elif(VKFFT_BACKEND==1) sprintf(convTypeLeftOutput, "(float2)"); #elif(VKFFT_BACKEND==2) sprintf(convTypeLeftOutput, "(float2)"); #elif(VKFFT_BACKEND==3) sprintf(convTypeLeftOutput, "conv_float2("); sprintf(convTypeRightOutput, ")"); #endif } if ((!strcmp(floatTypeOutputMemory, "double")) && (strcmp(floatType, "double"))) { #if(VKFFT_BACKEND==0) sprintf(convTypeLeftOutput, "dvec2("); sprintf(convTypeRightOutput, ")"); #elif(VKFFT_BACKEND==1) sprintf(convTypeLeftOutput, "(double2)"); #elif(VKFFT_BACKEND==2) sprintf(convTypeLeftOutput, "(double2)"); #elif(VKFFT_BACKEND==3) sprintf(convTypeLeftOutput, "conv_double2("); sprintf(convTypeRightOutput, ")"); #endif } //sprintf(sc->tempReg, "temp"); res = appendVersion(sc); if (res != VKFFT_SUCCESS) return res; res = appendExtensions(sc, floatType, floatTypeInputMemory, floatTypeOutputMemory, floatTypeKernelMemory); if (res != VKFFT_SUCCESS) return res; res = appendLayoutVkFFT(sc); if (res != VKFFT_SUCCESS) return res; res = appendConstantsVkFFT(sc, floatType, uintType); if (res != VKFFT_SUCCESS) return res; if ((!sc->LUT) && (!strcmp(floatType, "double"))) { res = appendSinCos20(sc, floatType, uintType); if (res != VKFFT_SUCCESS) return res; } if (strcmp(floatType, floatTypeInputMemory)) { res = appendConversion(sc, floatType, floatTypeInputMemory); if (res != VKFFT_SUCCESS) return res; } if (strcmp(floatType, 
floatTypeOutputMemory) && strcmp(floatTypeInputMemory, floatTypeOutputMemory)) { res = appendConversion(sc, floatType, floatTypeOutputMemory); if (res != VKFFT_SUCCESS) return res; } res = appendPushConstantsVkFFT(sc, floatType, uintType); if (res != VKFFT_SUCCESS) return res; uint64_t id = 0; res = appendInputLayoutVkFFT(sc, id, floatTypeInputMemory, 0); if (res != VKFFT_SUCCESS) return res; id++; res = appendOutputLayoutVkFFT(sc, id, floatTypeOutputMemory, 0); if (res != VKFFT_SUCCESS) return res; id++; if (sc->convolutionStep) { res = appendKernelLayoutVkFFT(sc, id, floatTypeKernelMemory); if (res != VKFFT_SUCCESS) return res; id++; } if (sc->LUT) { res = appendLUTLayoutVkFFT(sc, id, floatType); if (res != VKFFT_SUCCESS) return res; id++; } //appendIndexInputVkFFT(sc, uintType, type); //appendIndexOutputVkFFT(sc, uintType, type); /*uint64_t appendedRadix[10] = { 0,0,0,0,0,0,0,0,0,0 }; for (uint64_t i = 0; i < sc->numStages; i++) { if (appendedRadix[sc->stageRadix[i]] == 0) { appendedRadix[sc->stageRadix[i]] = 1; appendRadixKernelVkFFT(sc, floatType, uintType, sc->stageRadix[i]); } }*/ #if(VKFFT_BACKEND==0) sc->tempLen = sprintf(sc->tempStr, "void main() {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; #elif(VKFFT_BACKEND==1) sc->tempLen = sprintf(sc->tempStr, "extern \"C\" __global__ __launch_bounds__(%" PRIu64 ") void VkFFT_main_R2C ", sc->localSize[0] * sc->localSize[1] * sc->localSize[2]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", vecTypeInput, vecTypeOutput); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->convolutionStep) { sc->tempLen = sprintf(sc->tempStr, ", %s* kernel_obj", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, ", %s* twiddleLUT", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, ") 
{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, ", const PushConsts consts) {\n"); #elif(VKFFT_BACKEND==2) sc->tempLen = sprintf(sc->tempStr, "extern \"C\" __launch_bounds__(%" PRIu64 ") __global__ void VkFFT_main_R2C ", sc->localSize[0] * sc->localSize[1] * sc->localSize[2]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", vecTypeInput, vecTypeOutput); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->convolutionStep) { sc->tempLen = sprintf(sc->tempStr, ", %s* kernel_obj", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, ", %s* twiddleLUT", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, ") {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, ", const PushConsts consts) {\n"); #elif(VKFFT_BACKEND==3) sc->tempLen = sprintf(sc->tempStr, "__kernel __attribute__((reqd_work_group_size(%" PRIu64 ", %" PRIu64 ", %" PRIu64 "))) void VkFFT_main_R2C ", sc->localSize[0], sc->localSize[1], sc->localSize[2]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "(__global %s* inputs, __global %s* outputs", vecTypeInput, vecTypeOutput); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->convolutionStep) { sc->tempLen = sprintf(sc->tempStr, ", __global %s* kernel_obj", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, ", __global %s* twiddleLUT", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->pushConstantsStructSize > 0) { sc->tempLen = sprintf(sc->tempStr, ", PushConsts consts"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = 
sprintf(sc->tempStr, ") {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //sc->tempLen = sprintf(sc->tempStr, ", const PushConsts consts) {\n"); #endif char index_x[2000] = ""; char idX[500] = ""; if (sc->performWorkGroupShift[0]) sprintf(idX, "(%s + consts.workGroupShiftX * %s)", sc->gl_GlobalInvocationID_x, sc->gl_WorkGroupSize_x); else sprintf(idX, "%s", sc->gl_GlobalInvocationID_x); res = appendZeropadStart(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "%s id_x = %s %% %" PRIu64 ";\n", uintType, idX, (uint64_t)ceil(sc->size[0] / 4.0)); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "%s id_y = (%s / %" PRIu64 ") %% %" PRIu64 ";\n", uintType, idX, (uint64_t)ceil(sc->size[0] / 4.0), sc->size[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "%s id_z = (%s / %" PRIu64 ") / %" PRIu64 ";\n", uintType, idX, (uint64_t)ceil(sc->size[0] / 4.0), sc->size[1]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "if (%s < %" PRIu64 "){\n", idX, (uint64_t)ceil(sc->size[0] / 4.0) * sc->size[1] * sc->size[2]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "%s inoutID = ", uintType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "id_x + id_y*%" PRIu64 " +id_z*%" PRIu64 "", sc->inputStride[1], sc->inputStride[2]); res = indexInputVkFFT(sc, uintType, 0, index_x, 0, 0, 0); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "%s inoutID2;\n", uintType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "%s inoutID3;\n", uintType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) 
sc->tempLen = sprintf(sc->tempStr, " %s t0 = %s%s[inoutID]%s;\n", vecType, convTypeLeftInput, inputsStruct, convTypeRightInput); else sc->tempLen = sprintf(sc->tempStr, " %s t0 = %sinputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "]%s;\n", vecType, convTypeLeftInput, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRightInput); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s tf;\n", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->size[0] % 4 == 0) { sc->tempLen = sprintf(sc->tempStr, "if (id_x == 0) {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " inoutID2 = "); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "%" PRIu64 " + id_y*%" PRIu64 " +id_z*%" PRIu64 "", (sc->size[0] / 2), sc->inputStride[1], sc->inputStride[2]); res = indexInputVkFFT(sc, uintType, 0, index_x, 0, 0, 0); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " inoutID3 = "); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "%" PRIu64 " + id_y*%" PRIu64 " +id_z*%" PRIu64 "", (uint64_t)ceil(sc->size[0] / 4.0), sc->inputStride[1], sc->inputStride[2]); res = indexInputVkFFT(sc, uintType, 0, index_x, 0, 0, 0); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " tf = %s%s[inoutID3]%s;\n", convTypeLeftInput, inputsStruct, convTypeRightInput); else sc->tempLen = sprintf(sc->tempStr, " tf = %sinputBlocks[inoutID3 / %" PRIu64 "]%s[inoutID3 %% %" PRIu64 "]%s;\n", convTypeLeftInput, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRightInput); res = VkAppendLine(sc); 
if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "} else {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " inoutID2 = "); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "(%" PRIu64 "-id_x) + id_y*%" PRIu64 " +id_z*%" PRIu64 "", (sc->size[0] / 2), sc->inputStride[1], sc->inputStride[2]); res = indexInputVkFFT(sc, uintType, 0, index_x, 0, 0, 0); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, "inoutID2 = "); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sprintf(index_x, "(%" PRIu64 "-id_x) + id_y*%" PRIu64 " +id_z*%" PRIu64 "", (sc->size[0] / 2), sc->inputStride[1], sc->inputStride[2]); res = indexInputVkFFT(sc, uintType, 0, index_x, 0, 0, 0); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, ";\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->inputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s t1 = %s%s[inoutID2]%s;\n", vecType, convTypeLeftInput, inputsStruct, convTypeRightInput); else sc->tempLen = sprintf(sc->tempStr, " %s t1 = %sinputBlocks[inoutID2 / %" PRIu64 "]%s[inoutID2 %% %" PRIu64 "]%s;\n", vecType, convTypeLeftInput, sc->inputBufferBlockSize, inputsStruct, sc->inputBufferBlockSize, convTypeRightInput); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s t2;\n", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " %s t3;\n", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "if (id_x == 0) {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if 
(sc->size[0] % 4 == 0) { if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " t2.x = t0.x+t0.y;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t2.y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t3.x = t0.x-t0.y;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t3.y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " t2.x = (t0.x+t1.x);\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t2.y = (t0.x-t1.x);\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } sc->tempLen = sprintf(sc->tempStr, " tf.y = -tf.y;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->inverse) { res = VkMulComplexNumber(sc, "tf", "tf", "2"); if (res != VKFFT_SUCCESS) return res; } if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %st2%s;\n", outputsStruct, convTypeLeftOutput, convTypeRightOutput); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %st2%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeftOutput, convTypeRightOutput); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID2] = %st3%s;\n", outputsStruct, convTypeLeftOutput, convTypeRightOutput); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID2 / %" PRIu64 "]%s[inoutID2 %% %" PRIu64 "] = %st3%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeftOutput, convTypeRightOutput); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " 
%s[inoutID3] = %stf%s;\n", outputsStruct, convTypeLeftOutput, convTypeRightOutput); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID3 / %" PRIu64 "]%s[inoutID3 %% %" PRIu64 "] = %stf%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeftOutput, convTypeRightOutput); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " t2.x = t0.x+t0.y;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t2.y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t3.x = t0.x-t0.y;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t3.y = 0;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " t2.x = (t0.x+t1.x);\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t2.y = (t0.x-t1.x);\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %st2%s;\n", outputsStruct, convTypeLeftOutput, convTypeRightOutput); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %st2%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeftOutput, convTypeRightOutput); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID2] = %st3%s;\n", outputsStruct, convTypeLeftOutput, convTypeRightOutput); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID2 / %" PRIu64 "]%s[inoutID2 %% %" PRIu64 "] = %st3%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeftOutput, convTypeRightOutput); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } sc->tempLen = sprintf(sc->tempStr, "} else {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = VkAddComplex(sc, "t2", "t0", "t1"); if (res != VKFFT_SUCCESS) return res; res = VkSubComplex(sc, "t3", "t0", "t1"); if (res != VKFFT_SUCCESS) return res; if (!sc->inverse) { res = VkMulComplexNumber(sc, "t2", "t2", "0.5"); if (res != VKFFT_SUCCESS) return res; res = VkMulComplexNumber(sc, "t3", "t3", "0.5"); if (res != VKFFT_SUCCESS) return res; } if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, " tf = twiddleLUT[id_x];\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " %s angle = (loc_PI*id_x)/%" PRIu64 ";\n", floatType, sc->size[0] / 2); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (!strcmp(floatType, "float")) { sc->tempLen = sprintf(sc->tempStr, " tf.x = %s(angle);\n", cosDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " tf.y = %s(angle);\n", sinDef); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } if (!strcmp(floatType, "double")) { sc->tempLen = sprintf(sc->tempStr, " tf = sincos_20(angle);\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } } if (!sc->inverse) { sc->tempLen = sprintf(sc->tempStr, " t0.x = tf.x*t2.y-tf.y*t3.x;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t0.y = -tf.y*t2.y-tf.x*t3.x;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t1.x = t2.x-t0.x;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t1.y = -t3.y+t0.y;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t0.x = t2.x+t0.x;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, 
" t0.y = t3.y+t0.y;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } else { sc->tempLen = sprintf(sc->tempStr, " t0.x = tf.x*t2.y+tf.y*t3.x;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t0.y = -tf.y*t2.y+tf.x*t3.x;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t1.x = t2.x+t0.x;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t1.y = -t3.y+t0.y;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t0.x = t2.x-t0.x;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, " t0.y = t3.y+t0.y;\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; } //sc->tempLen = sprintf(sc->tempStr, " t0.x = t2.x+tf.x*t2.y-tf.y*t3.x;\n"); //sc->tempLen = sprintf(sc->tempStr, " t0.y = t3.y-tf.y*t2.y-tf.x*t3.x;\n"); //sc->tempLen = sprintf(sc->tempStr, " t1.x = t2.x-tf.x*t2.y+tf.y*t3.x;\n"); //sc->tempLen = sprintf(sc->tempStr, " t1.y = -t3.y-tf.y*t2.y-tf.x*t3.x;\n"); if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID] = %st0%s;\n", outputsStruct, convTypeLeftOutput, convTypeRightOutput); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID / %" PRIu64 "]%s[inoutID %% %" PRIu64 "] = %st0%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeftOutput, convTypeRightOutput); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; if (sc->outputBufferBlockNum == 1) sc->tempLen = sprintf(sc->tempStr, " %s[inoutID2] = %st1%s;\n", outputsStruct, convTypeLeftOutput, convTypeRightOutput); else sc->tempLen = sprintf(sc->tempStr, " outputBlocks[inoutID2 / %" PRIu64 "]%s[inoutID2 %% %" PRIu64 "] = %st1%s;\n", sc->outputBufferBlockSize, outputsStruct, sc->outputBufferBlockSize, convTypeLeftOutput, convTypeRightOutput); res = 
VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; res = appendZeropadEnd(sc); if (res != VKFFT_SUCCESS) return res; sc->tempLen = sprintf(sc->tempStr, "}\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) return res; //printf("%s", output); return res; } static inline void freeShaderGenVkFFT(VkFFTSpecializationConstantsLayout* sc) { if (sc->tempStr) { free(sc->tempStr); sc->tempStr = 0; } if (sc->disableThreadsStart) { free(sc->disableThreadsStart); sc->disableThreadsStart = 0; } if (sc->disableThreadsEnd) { free(sc->disableThreadsEnd); sc->disableThreadsEnd = 0; } if (sc->regIDs) { for (uint64_t i = 0; i < sc->registers_per_thread * sc->registerBoost; i++) { if (sc->regIDs[i]) { free(sc->regIDs[i]); sc->regIDs[i] = 0; } } free(sc->regIDs); sc->regIDs = 0; } if (sc->oldLocale) { setlocale(LC_ALL, sc->oldLocale); } } static inline VkFFTResult shaderGenVkFFT(char* output, VkFFTSpecializationConstantsLayout* sc, const char* floatType, const char* floatTypeInputMemory, const char* floatTypeOutputMemory, const char* floatTypeKernelMemory, const char* uintType, uint64_t type) { VkFFTResult res = VKFFT_SUCCESS; //appendLicense(output); sc->oldLocale = setlocale(LC_ALL, NULL); setlocale(LC_ALL, "C"); sc->output = output; sc->tempStr = (char*)malloc(sizeof(char) * sc->maxTempLength); if (!sc->tempStr) return VKFFT_ERROR_MALLOC_FAILED; sc->tempLen = 0; sc->currentLen = 0; char vecType[30]; char vecTypeInput[30]; char vecTypeOutput[30]; #if(VKFFT_BACKEND==0) if (!strcmp(floatType, "half")) sprintf(vecType, "f16vec2"); if (!strcmp(floatType, "float")) sprintf(vecType, "vec2"); if (!strcmp(floatType, "double")) sprintf(vecType, "dvec2"); if (!strcmp(floatTypeInputMemory, "half")) sprintf(vecTypeInput, "f16vec2"); if (!strcmp(floatTypeInputMemory, "float")) 
sprintf(vecTypeInput, "vec2"); if (!strcmp(floatTypeInputMemory, "double")) sprintf(vecTypeInput, "dvec2"); if (!strcmp(floatTypeOutputMemory, "half")) sprintf(vecTypeOutput, "f16vec2"); if (!strcmp(floatTypeOutputMemory, "float")) sprintf(vecTypeOutput, "vec2"); if (!strcmp(floatTypeOutputMemory, "double")) sprintf(vecTypeOutput, "dvec2"); sprintf(sc->gl_LocalInvocationID_x, "gl_LocalInvocationID.x"); sprintf(sc->gl_LocalInvocationID_y, "gl_LocalInvocationID.y"); sprintf(sc->gl_LocalInvocationID_z, "gl_LocalInvocationID.z"); sprintf(sc->gl_GlobalInvocationID_x, "gl_GlobalInvocationID.x"); sprintf(sc->gl_GlobalInvocationID_y, "gl_GlobalInvocationID.y"); sprintf(sc->gl_GlobalInvocationID_z, "gl_GlobalInvocationID.z"); sprintf(sc->gl_WorkGroupID_x, "gl_WorkGroupID.x"); sprintf(sc->gl_WorkGroupID_y, "gl_WorkGroupID.y"); sprintf(sc->gl_WorkGroupID_z, "gl_WorkGroupID.z"); sprintf(sc->gl_WorkGroupSize_x, "gl_WorkGroupSize.x"); sprintf(sc->gl_WorkGroupSize_y, "gl_WorkGroupSize.y"); sprintf(sc->gl_WorkGroupSize_z, "gl_WorkGroupSize.z"); #elif(VKFFT_BACKEND==1) if (!strcmp(floatType, "half")) sprintf(vecType, "f16vec2"); if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); if (!strcmp(floatTypeInputMemory, "half")) sprintf(vecTypeInput, "f16vec2"); if (!strcmp(floatTypeInputMemory, "float")) sprintf(vecTypeInput, "float2"); if (!strcmp(floatTypeInputMemory, "double")) sprintf(vecTypeInput, "double2"); if (!strcmp(floatTypeOutputMemory, "half")) sprintf(vecTypeOutput, "f16vec2"); if (!strcmp(floatTypeOutputMemory, "float")) sprintf(vecTypeOutput, "float2"); if (!strcmp(floatTypeOutputMemory, "double")) sprintf(vecTypeOutput, "double2"); sprintf(sc->gl_LocalInvocationID_x, "threadIdx.x"); sprintf(sc->gl_LocalInvocationID_y, "threadIdx.y"); sprintf(sc->gl_LocalInvocationID_z, "threadIdx.z"); sprintf(sc->gl_GlobalInvocationID_x, "(threadIdx.x + blockIdx.x * blockDim.x)"); sprintf(sc->gl_GlobalInvocationID_y, 
"(threadIdx.y + blockIdx.y * blockDim.y)"); sprintf(sc->gl_GlobalInvocationID_z, "(threadIdx.z + blockIdx.z * blockDim.z)"); sprintf(sc->gl_WorkGroupID_x, "blockIdx.x"); sprintf(sc->gl_WorkGroupID_y, "blockIdx.y"); sprintf(sc->gl_WorkGroupID_z, "blockIdx.z"); sprintf(sc->gl_WorkGroupSize_x, "blockDim.x"); sprintf(sc->gl_WorkGroupSize_y, "blockDim.y"); sprintf(sc->gl_WorkGroupSize_z, "blockDim.z"); #elif(VKFFT_BACKEND==2) if (!strcmp(floatType, "half")) sprintf(vecType, "f16vec2"); if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); if (!strcmp(floatTypeInputMemory, "half")) sprintf(vecTypeInput, "f16vec2"); if (!strcmp(floatTypeInputMemory, "float")) sprintf(vecTypeInput, "float2"); if (!strcmp(floatTypeInputMemory, "double")) sprintf(vecTypeInput, "double2"); if (!strcmp(floatTypeOutputMemory, "half")) sprintf(vecTypeOutput, "f16vec2"); if (!strcmp(floatTypeOutputMemory, "float")) sprintf(vecTypeOutput, "float2"); if (!strcmp(floatTypeOutputMemory, "double")) sprintf(vecTypeOutput, "double2"); sprintf(sc->gl_LocalInvocationID_x, "threadIdx.x"); sprintf(sc->gl_LocalInvocationID_y, "threadIdx.y"); sprintf(sc->gl_LocalInvocationID_z, "threadIdx.z"); sprintf(sc->gl_GlobalInvocationID_x, "(threadIdx.x + blockIdx.x * blockDim.x)"); sprintf(sc->gl_GlobalInvocationID_y, "(threadIdx.y + blockIdx.y * blockDim.y)"); sprintf(sc->gl_GlobalInvocationID_z, "(threadIdx.z + blockIdx.z * blockDim.z)"); sprintf(sc->gl_WorkGroupID_x, "blockIdx.x"); sprintf(sc->gl_WorkGroupID_y, "blockIdx.y"); sprintf(sc->gl_WorkGroupID_z, "blockIdx.z"); sprintf(sc->gl_WorkGroupSize_x, "blockDim.x"); sprintf(sc->gl_WorkGroupSize_y, "blockDim.y"); sprintf(sc->gl_WorkGroupSize_z, "blockDim.z"); #elif(VKFFT_BACKEND==3) if (!strcmp(floatType, "half")) sprintf(vecType, "f16vec2"); if (!strcmp(floatType, "float")) sprintf(vecType, "float2"); if (!strcmp(floatType, "double")) sprintf(vecType, "double2"); if (!strcmp(floatTypeInputMemory, 
"half")) sprintf(vecTypeInput, "f16vec2"); if (!strcmp(floatTypeInputMemory, "float")) sprintf(vecTypeInput, "float2"); if (!strcmp(floatTypeInputMemory, "double")) sprintf(vecTypeInput, "double2"); if (!strcmp(floatTypeOutputMemory, "half")) sprintf(vecTypeOutput, "f16vec2"); if (!strcmp(floatTypeOutputMemory, "float")) sprintf(vecTypeOutput, "float2"); if (!strcmp(floatTypeOutputMemory, "double")) sprintf(vecTypeOutput, "double2"); sprintf(sc->gl_LocalInvocationID_x, "get_local_id(0)"); sprintf(sc->gl_LocalInvocationID_y, "get_local_id(1)"); sprintf(sc->gl_LocalInvocationID_z, "get_local_id(2)"); sprintf(sc->gl_GlobalInvocationID_x, "get_global_id(0)"); sprintf(sc->gl_GlobalInvocationID_y, "get_global_id(1)"); sprintf(sc->gl_GlobalInvocationID_z, "get_global_id(2)"); sprintf(sc->gl_WorkGroupID_x, "get_group_id(0)"); sprintf(sc->gl_WorkGroupID_y, "get_group_id(1)"); sprintf(sc->gl_WorkGroupID_z, "get_group_id(2)"); sprintf(sc->gl_WorkGroupSize_x, "get_local_size(0)"); sprintf(sc->gl_WorkGroupSize_y, "get_local_size(1)"); sprintf(sc->gl_WorkGroupSize_z, "get_local_size(2)"); #endif sprintf(sc->stageInvocationID, "stageInvocationID"); sprintf(sc->blockInvocationID, "blockInvocationID"); sprintf(sc->tshuffle, "tshuffle"); sprintf(sc->sharedStride, "sharedStride"); sprintf(sc->combinedID, "combinedID"); sprintf(sc->inoutID, "inoutID"); sprintf(sc->sdataID, "sdataID"); //sprintf(sc->tempReg, "temp"); sc->disableThreadsStart = (char*)malloc(sizeof(char) * 500); if (!sc->disableThreadsStart) { freeShaderGenVkFFT(sc); return VKFFT_ERROR_MALLOC_FAILED; } sc->disableThreadsEnd = (char*)malloc(sizeof(char) * 2); if (!sc->disableThreadsEnd) { freeShaderGenVkFFT(sc); return VKFFT_ERROR_MALLOC_FAILED; } sc->disableThreadsStart[0] = 0; sc->disableThreadsEnd[0] = 0; res = appendVersion(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } res = appendExtensions(sc, floatType, floatTypeInputMemory, floatTypeOutputMemory, floatTypeKernelMemory); if (res != 
VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } res = appendLayoutVkFFT(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } res = appendConstantsVkFFT(sc, floatType, uintType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } if ((!sc->LUT) && (!strcmp(floatType, "double"))) { res = appendSinCos20(sc, floatType, uintType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } if (strcmp(floatType, floatTypeInputMemory)) { res = appendConversion(sc, floatType, floatTypeInputMemory); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } if (strcmp(floatType, floatTypeOutputMemory) && strcmp(floatTypeInputMemory, floatTypeOutputMemory)) { res = appendConversion(sc, floatType, floatTypeOutputMemory); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } res = appendPushConstantsVkFFT(sc, floatType, uintType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } uint64_t id = 0; res = appendInputLayoutVkFFT(sc, id, floatTypeInputMemory, type); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } id++; res = appendOutputLayoutVkFFT(sc, id, floatTypeOutputMemory, type); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } id++; if (sc->convolutionStep) { res = appendKernelLayoutVkFFT(sc, id, floatTypeKernelMemory); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } id++; } if (sc->LUT) { res = appendLUTLayoutVkFFT(sc, id, floatType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } id++; } if (sc->useBluesteinFFT) { res = appendBluesteinLayoutVkFFT(sc, id, floatType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } if (sc->BluesteinConvolutionStep) id++; if (sc->BluesteinPreMultiplication || sc->BluesteinPostMultiplication) id++; } //appendIndexInputVkFFT(sc, uintType, type); //appendIndexOutputVkFFT(sc, uintType, type); /*uint64_t appendedRadix[10] = { 0,0,0,0,0,0,0,0,0,0 }; for (uint64_t i = 0; i < 
sc->numStages; i++) { if (appendedRadix[sc->stageRadix[i]] == 0) { appendedRadix[sc->stageRadix[i]] = 1; appendRadixKernelVkFFT(sc, floatType, uintType, sc->stageRadix[i]); } }*/ uint64_t locType = (((type == 0) || (type == 5) || (type == 6) || (type == 110) || (type == 120) || (type == 130) || (type == 140) || (type == 142) || (type == 144)) && (sc->axisSwapped)) ? 1 : type; #if(VKFFT_BACKEND==0) res = appendSharedMemoryVkFFT(sc, floatType, uintType, locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } sc->tempLen = sprintf(sc->tempStr, "void main() {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } #elif(VKFFT_BACKEND==1) sc->tempLen = sprintf(sc->tempStr, "extern __shared__ float shared[];\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } sc->tempLen = sprintf(sc->tempStr, "extern \"C\" __global__ void __launch_bounds__(%" PRIu64 ") VkFFT_main ", sc->localSize[0] * sc->localSize[1] * sc->localSize[2]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } switch (type) { case 5: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, vecTypeOutput); break; } case 6: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", vecTypeInput, floatTypeOutputMemory); break; } case 110: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 111: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 120: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 121: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 130: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, 
floatTypeOutputMemory); break; } case 131: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 140: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 141: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 142: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 143: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 144: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 145: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } default: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", vecTypeInput, vecTypeOutput); break; } } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } if (sc->convolutionStep) { sc->tempLen = sprintf(sc->tempStr, ", %s* kernel_obj", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, ", %s* twiddleLUT", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } if (sc->BluesteinConvolutionStep) { sc->tempLen = sprintf(sc->tempStr, ", %s* BluesteinConvolutionKernel", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } if (sc->BluesteinPreMultiplication || sc->BluesteinPostMultiplication) { sc->tempLen = sprintf(sc->tempStr, ", %s* BluesteinMultiplication", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } sc->tempLen = sprintf(sc->tempStr, ") 
{\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } //sc->tempLen = sprintf(sc->tempStr, ", const PushConsts consts) {\n"); res = appendSharedMemoryVkFFT(sc, floatType, uintType, locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } #elif(VKFFT_BACKEND==2) sc->tempLen = sprintf(sc->tempStr, "extern __shared__ float shared[];\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } sc->tempLen = sprintf(sc->tempStr, "extern \"C\" __launch_bounds__(%" PRIu64 ") __global__ void VkFFT_main ", sc->localSize[0] * sc->localSize[1] * sc->localSize[2]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } switch (type) { case 5: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, vecTypeOutput); break; } case 6: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", vecTypeInput, floatTypeOutputMemory); break; } case 110: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 111: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 120: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 121: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 130: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 131: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 140: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 141: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", 
floatTypeInputMemory, floatTypeOutputMemory); break; } case 142: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 143: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 144: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 145: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } default: { sc->tempLen = sprintf(sc->tempStr, "(%s* inputs, %s* outputs", vecTypeInput, vecTypeOutput); break; } } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } if (sc->convolutionStep) { sc->tempLen = sprintf(sc->tempStr, ", %s* kernel_obj", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, ", %s* twiddleLUT", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } if (sc->BluesteinConvolutionStep) { sc->tempLen = sprintf(sc->tempStr, ", %s* BluesteinConvolutionKernel", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } if (sc->BluesteinPreMultiplication || sc->BluesteinPostMultiplication) { sc->tempLen = sprintf(sc->tempStr, ", %s* BluesteinMultiplication", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } sc->tempLen = sprintf(sc->tempStr, ") {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } //sc->tempLen = sprintf(sc->tempStr, ", const PushConsts consts) {\n"); res = appendSharedMemoryVkFFT(sc, floatType, uintType, locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } #elif(VKFFT_BACKEND==3) sc->tempLen = sprintf(sc->tempStr, "__kernel 
__attribute__((reqd_work_group_size(%" PRIu64 ", %" PRIu64 ", %" PRIu64 "))) void VkFFT_main ", sc->localSize[0], sc->localSize[1], sc->localSize[2]); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } switch (type) { case 5: { sc->tempLen = sprintf(sc->tempStr, "(__global %s* inputs, __global %s* outputs", floatTypeInputMemory, vecTypeOutput); break; } case 6: { sc->tempLen = sprintf(sc->tempStr, "(__global %s* inputs, __global %s* outputs", vecTypeInput, floatTypeOutputMemory); break; } case 110: { sc->tempLen = sprintf(sc->tempStr, "(__global %s* inputs, __global %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 111: { sc->tempLen = sprintf(sc->tempStr, "(__global %s* inputs, __global %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 120: { sc->tempLen = sprintf(sc->tempStr, "(__global %s* inputs, __global %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 121: { sc->tempLen = sprintf(sc->tempStr, "(__global %s* inputs, __global %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 130: { sc->tempLen = sprintf(sc->tempStr, "(__global %s* inputs, __global %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 131: { sc->tempLen = sprintf(sc->tempStr, "(__global %s* inputs, __global %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 140: { sc->tempLen = sprintf(sc->tempStr, "(__global %s* inputs, __global %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 141: { sc->tempLen = sprintf(sc->tempStr, "(__global %s* inputs, __global %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 142: { sc->tempLen = sprintf(sc->tempStr, "(__global %s* inputs, __global %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 143: { sc->tempLen = sprintf(sc->tempStr, "(__global %s* inputs, __global %s* outputs", floatTypeInputMemory, 
floatTypeOutputMemory); break; } case 144: { sc->tempLen = sprintf(sc->tempStr, "(__global %s* inputs, __global %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } case 145: { sc->tempLen = sprintf(sc->tempStr, "(__global %s* inputs, __global %s* outputs", floatTypeInputMemory, floatTypeOutputMemory); break; } default: { sc->tempLen = sprintf(sc->tempStr, "(__global %s* inputs, __global %s* outputs", vecTypeInput, vecTypeOutput); break; } } res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } if (sc->convolutionStep) { sc->tempLen = sprintf(sc->tempStr, ", __global %s* kernel_obj", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } if (sc->LUT) { sc->tempLen = sprintf(sc->tempStr, ", __global %s* twiddleLUT", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } if (sc->BluesteinConvolutionStep) { sc->tempLen = sprintf(sc->tempStr, ", __global %s* BluesteinConvolutionKernel", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } if (sc->BluesteinPreMultiplication || sc->BluesteinPostMultiplication) { sc->tempLen = sprintf(sc->tempStr, ", __global %s* BluesteinMultiplication", vecType); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } if (sc->pushConstantsStructSize > 0) { sc->tempLen = sprintf(sc->tempStr, ", PushConsts consts"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } sc->tempLen = sprintf(sc->tempStr, ") {\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } //sc->tempLen = sprintf(sc->tempStr, ", const PushConsts consts) {\n"); res = appendSharedMemoryVkFFT(sc, floatType, uintType, locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } #endif //if (type==0) sc->tempLen = sprintf(sc->tempStr, "return;\n"); res 
= appendInitialization(sc, floatType, uintType, type); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } res = setReadToRegisters(sc, type); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } res = setWriteFromRegisters(sc, type); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } if ((sc->convolutionStep) && (sc->matrixConvolution > 1)) { sc->tempLen = sprintf(sc->tempStr, " for (%s coordinate=%" PRIu64 "; coordinate > 0; coordinate--){\n\ coordinate--;\n", uintType, sc->matrixConvolution); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } res = appendReadDataVkFFT(sc, floatType, floatTypeInputMemory, uintType, type); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } if (sc->useBluesteinFFT && sc->BluesteinPreMultiplication) { res = appendBluesteinMultiplication(sc, floatType, uintType, locType, 0); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } //appendBarrierVkFFT(sc, 1); res = appendReorder4StepRead(sc, floatType, uintType, locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } res = appendBoostThreadDataReorder(sc, floatType, uintType, locType, 1); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } uint64_t stageSize = 1; uint64_t stageSizeSum = 0; double PI_const = 3.1415926535897932384626433832795; double stageAngle = (sc->inverse) ? 
PI_const : -PI_const; for (uint64_t i = 0; i < sc->numStages; i++) { if ((i == sc->numStages - 1) && (sc->registerBoost > 1)) { res = appendRadixStage(sc, floatType, uintType, stageSize, stageSizeSum, stageAngle, sc->stageRadix[i], locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } res = appendRegisterBoostShuffle(sc, floatType, stageSize, sc->stageRadix[i - 1], sc->stageRadix[i], stageAngle); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } else { res = appendRadixStage(sc, floatType, uintType, stageSize, stageSizeSum, stageAngle, sc->stageRadix[i], locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } switch (sc->stageRadix[i]) { case 2: stageSizeSum += stageSize; break; case 3: stageSizeSum += stageSize * 2; break; case 4: stageSizeSum += stageSize * 2; break; case 5: stageSizeSum += stageSize * 4; break; case 7: stageSizeSum += stageSize * 6; break; case 8: stageSizeSum += stageSize * 3; break; case 11: stageSizeSum += stageSize * 10; break; case 13: stageSizeSum += stageSize * 12; break; } if (i == sc->numStages - 1) { res = appendRadixShuffle(sc, floatType, uintType, stageSize, stageSizeSum, stageAngle, sc->stageRadix[i], sc->stageRadix[i], locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } else { res = appendRadixShuffle(sc, floatType, uintType, stageSize, stageSizeSum, stageAngle, sc->stageRadix[i], sc->stageRadix[i + 1], locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } stageSize *= sc->stageRadix[i]; stageAngle /= sc->stageRadix[i]; } } if ((sc->convolutionStep) || (sc->useBluesteinFFT && sc->BluesteinConvolutionStep)) { res = appendCoordinateRegisterStore(sc, locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } if (sc->matrixConvolution > 1) { sc->tempLen = sprintf(sc->tempStr, " coordinate++;}\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } if (sc->numKernels > 1) 
{ res = appendPreparationBatchedKernelConvolution(sc, floatType, floatTypeKernelMemory, uintType, locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } if (sc->useBluesteinFFT && sc->BluesteinConvolutionStep) { res = appendBluesteinConvolution(sc, floatType, floatTypeKernelMemory, uintType, locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } else { res = appendKernelConvolution(sc, floatType, floatTypeKernelMemory, uintType, locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } res = appendBarrierVkFFT(sc, 1); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } if (sc->matrixConvolution > 1) { sc->tempLen = sprintf(sc->tempStr, " for (%s coordinate=0; coordinate < %" PRIu64 "; coordinate++){\n", uintType, sc->matrixConvolution); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } res = appendCoordinateRegisterPull(sc, locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } stageSize = 1; stageSizeSum = 0; stageAngle = PI_const; sc->inverse = 1; for (uint64_t i = 0; i < sc->numStages; i++) { res = appendRadixStage(sc, floatType, uintType, stageSize, stageSizeSum, stageAngle, sc->stageRadix[i], locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } switch (sc->stageRadix[i]) { case 2: stageSizeSum += stageSize; break; case 3: stageSizeSum += stageSize * 2; break; case 4: stageSizeSum += stageSize * 2; break; case 5: stageSizeSum += stageSize * 4; break; case 7: stageSizeSum += stageSize * 6; break; case 8: stageSizeSum += stageSize * 3; break; case 11: stageSizeSum += stageSize * 10; break; case 13: stageSizeSum += stageSize * 12; break; } if (i == sc->numStages - 1) { res = appendRadixShuffle(sc, floatType, uintType, stageSize, stageSizeSum, stageAngle, sc->stageRadix[i], sc->stageRadix[i], locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } else { res = appendRadixShuffle(sc, 
floatType, uintType, stageSize, stageSizeSum, stageAngle, sc->stageRadix[i], sc->stageRadix[i + 1], locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } stageSize *= sc->stageRadix[i]; stageAngle /= sc->stageRadix[i]; } } res = appendBoostThreadDataReorder(sc, floatType, uintType, locType, 0); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } res = appendReorder4StepWrite(sc, floatType, uintType, locType); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } if (sc->useBluesteinFFT && sc->BluesteinPostMultiplication) { res = appendBluesteinMultiplication(sc, floatType, uintType, locType, 1); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } res = appendWriteDataVkFFT(sc, floatType, floatTypeOutputMemory, uintType, type); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } if ((sc->convolutionStep) && (sc->matrixConvolution > 1)) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } if ((sc->convolutionStep) && (sc->numKernels > 1)) { sc->tempLen = sprintf(sc->tempStr, " }\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } } sc->tempLen = sprintf(sc->tempStr, "}\n"); res = VkAppendLine(sc); if (res != VKFFT_SUCCESS) { freeShaderGenVkFFT(sc); return res; } freeShaderGenVkFFT(sc); //if (sc->useBluesteinFFT) //printf("%s", output); return res; } #if(VKFFT_BACKEND==0) static inline VkFFTResult findMemoryType(VkFFTApplication* app, uint64_t memoryTypeBits, uint64_t memorySize, VkMemoryPropertyFlags properties, uint32_t* memoryTypeIndex) { VkPhysicalDeviceMemoryProperties memoryProperties = { 0 }; vkGetPhysicalDeviceMemoryProperties(app->configuration.physicalDevice[0], &memoryProperties); for (uint64_t i = 0; i < memoryProperties.memoryTypeCount; ++i) { if ((memoryTypeBits & ((uint64_t)1 << i)) && ((memoryProperties.memoryTypes[i].propertyFlags & properties) == 
properties) && (memoryProperties.memoryHeaps[memoryProperties.memoryTypes[i].heapIndex].size >= memorySize)) { memoryTypeIndex[0] = (uint32_t)i; return VKFFT_SUCCESS; } } return VKFFT_ERROR_FAILED_TO_FIND_MEMORY; } static inline VkFFTResult allocateFFTBuffer(VkFFTApplication* app, VkBuffer* buffer, VkDeviceMemory* deviceMemory, VkBufferUsageFlags usageFlags, VkMemoryPropertyFlags propertyFlags, VkDeviceSize size) { VkFFTResult resFFT = VKFFT_SUCCESS; VkResult res = VK_SUCCESS; uint32_t queueFamilyIndices; VkBufferCreateInfo bufferCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO }; bufferCreateInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE; bufferCreateInfo.queueFamilyIndexCount = 1; bufferCreateInfo.pQueueFamilyIndices = &queueFamilyIndices; bufferCreateInfo.size = size; bufferCreateInfo.usage = usageFlags; res = vkCreateBuffer(app->configuration.device[0], &bufferCreateInfo, 0, buffer); if (res != VK_SUCCESS) return VKFFT_ERROR_FAILED_TO_CREATE_BUFFER; VkMemoryRequirements memoryRequirements = { 0 }; vkGetBufferMemoryRequirements(app->configuration.device[0], buffer[0], &memoryRequirements); VkMemoryAllocateInfo memoryAllocateInfo = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO }; memoryAllocateInfo.allocationSize = memoryRequirements.size; resFFT = findMemoryType(app, memoryRequirements.memoryTypeBits, memoryRequirements.size, propertyFlags, &memoryAllocateInfo.memoryTypeIndex); if (resFFT != VKFFT_SUCCESS) return resFFT; res = vkAllocateMemory(app->configuration.device[0], &memoryAllocateInfo, 0, deviceMemory); if (res != VK_SUCCESS) return VKFFT_ERROR_FAILED_TO_ALLOCATE_MEMORY; res = vkBindBufferMemory(app->configuration.device[0], buffer[0], deviceMemory[0], 0); if (res != VK_SUCCESS) return VKFFT_ERROR_FAILED_TO_BIND_BUFFER_MEMORY; return resFFT; } static inline VkFFTResult transferDataFromCPU(VkFFTApplication* app, void* arr, VkBuffer* buffer, VkDeviceSize bufferSize) { VkResult res = VK_SUCCESS; VkFFTResult resFFT = VKFFT_SUCCESS; VkDeviceSize stagingBufferSize 
= bufferSize; VkBuffer stagingBuffer = { 0 }; VkDeviceMemory stagingBufferMemory = { 0 }; resFFT = allocateFFTBuffer(app, &stagingBuffer, &stagingBufferMemory, VK_BUFFER_USAGE_TRANSFER_SRC_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, stagingBufferSize); if (resFFT != VKFFT_SUCCESS) return resFFT; void* data; res = vkMapMemory(app->configuration.device[0], stagingBufferMemory, 0, stagingBufferSize, 0, &data); if (res != VK_SUCCESS) return VKFFT_ERROR_FAILED_TO_MAP_MEMORY; memcpy(data, arr, stagingBufferSize); vkUnmapMemory(app->configuration.device[0], stagingBufferMemory); VkCommandBufferAllocateInfo commandBufferAllocateInfo = { VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO }; commandBufferAllocateInfo.commandPool = app->configuration.commandPool[0]; commandBufferAllocateInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY; commandBufferAllocateInfo.commandBufferCount = 1; VkCommandBuffer commandBuffer = { 0 }; res = vkAllocateCommandBuffers(app->configuration.device[0], &commandBufferAllocateInfo, &commandBuffer); if (res != VK_SUCCESS) return VKFFT_ERROR_FAILED_TO_ALLOCATE_COMMAND_BUFFERS; VkCommandBufferBeginInfo commandBufferBeginInfo = { VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO }; commandBufferBeginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT; res = vkBeginCommandBuffer(commandBuffer, &commandBufferBeginInfo); if (res != VK_SUCCESS) return VKFFT_ERROR_FAILED_TO_BEGIN_COMMAND_BUFFER; VkBufferCopy copyRegion = { 0 }; copyRegion.srcOffset = 0; copyRegion.dstOffset = 0; copyRegion.size = stagingBufferSize; vkCmdCopyBuffer(commandBuffer, stagingBuffer, buffer[0], 1, &copyRegion); res = vkEndCommandBuffer(commandBuffer); if (res != VK_SUCCESS) return VKFFT_ERROR_FAILED_TO_END_COMMAND_BUFFER; VkSubmitInfo submitInfo = { VK_STRUCTURE_TYPE_SUBMIT_INFO }; submitInfo.commandBufferCount = 1; submitInfo.pCommandBuffers = &commandBuffer; res = vkQueueSubmit(app->configuration.queue[0], 1, &submitInfo, app->configuration.fence[0]); if 
(res != VK_SUCCESS) return VKFFT_ERROR_FAILED_TO_SUBMIT_QUEUE; res = vkWaitForFences(app->configuration.device[0], 1, app->configuration.fence, VK_TRUE, 100000000000); if (res != VK_SUCCESS) return VKFFT_ERROR_FAILED_TO_WAIT_FOR_FENCES; res = vkResetFences(app->configuration.device[0], 1, app->configuration.fence); if (res != VK_SUCCESS) return VKFFT_ERROR_FAILED_TO_RESET_FENCES; vkFreeCommandBuffers(app->configuration.device[0], app->configuration.commandPool[0], 1, &commandBuffer); vkDestroyBuffer(app->configuration.device[0], stagingBuffer, 0); vkFreeMemory(app->configuration.device[0], stagingBufferMemory, 0); return resFFT; } #endif static inline void deleteAxis(VkFFTApplication* app, VkFFTAxis* axis) { #if(VKFFT_BACKEND==0) if ((app->configuration.useLUT) && (!axis->referenceLUT)) { if (axis->bufferLUT != 0) { vkDestroyBuffer(app->configuration.device[0], axis->bufferLUT, 0); axis->bufferLUT = 0; } if (axis->bufferLUTDeviceMemory != 0) { vkFreeMemory(app->configuration.device[0], axis->bufferLUTDeviceMemory, 0); axis->bufferLUTDeviceMemory = 0; } } if (axis->descriptorPool != 0) { vkDestroyDescriptorPool(app->configuration.device[0], axis->descriptorPool, 0); axis->descriptorPool = 0; } if (axis->descriptorSetLayout != 0) { vkDestroyDescriptorSetLayout(app->configuration.device[0], axis->descriptorSetLayout, 0); axis->descriptorSetLayout = 0; } if (axis->pipelineLayout != 0) { vkDestroyPipelineLayout(app->configuration.device[0], axis->pipelineLayout, 0); axis->pipelineLayout = 0; } if (axis->pipeline != 0) { vkDestroyPipeline(app->configuration.device[0], axis->pipeline, 0); axis->pipeline = 0; } #elif(VKFFT_BACKEND==1) CUresult res = CUDA_SUCCESS; cudaError_t res_t = cudaSuccess; if ((app->configuration.useLUT) && (!axis->referenceLUT) && (axis->bufferLUT != 0)) { res_t = cudaFree(axis->bufferLUT); axis->bufferLUT = 0; } if (axis->VkFFTModule != 0) { res = cuModuleUnload(axis->VkFFTModule); axis->VkFFTModule = 0; } #elif(VKFFT_BACKEND==2) hipError_t res = 
hipSuccess; if ((app->configuration.useLUT) && (!axis->referenceLUT) && (axis->bufferLUT != 0)) { res = hipFree(axis->bufferLUT); axis->bufferLUT = 0; } if (axis->VkFFTModule != 0) { res = hipModuleUnload(axis->VkFFTModule); axis->VkFFTModule = 0; } #elif(VKFFT_BACKEND==3) cl_int res = 0; if ((app->configuration.useLUT) && (!axis->referenceLUT) && (axis->bufferLUT != 0)) { res = clReleaseMemObject(axis->bufferLUT); axis->bufferLUT = 0; } if (axis->program != 0) { res = clReleaseProgram(axis->program); axis->program = 0; } if (axis->kernel != 0) { res = clReleaseKernel(axis->kernel); axis->kernel = 0; } #endif if (app->configuration.saveApplicationToString) { if (axis->binary != 0) { free(axis->binary); axis->binary = 0; } } } static inline void deleteVkFFT(VkFFTApplication* app) { #if(VKFFT_BACKEND==0) if (app->configuration.isCompilerInitialized) { glslang_finalize_process(); app->configuration.isCompilerInitialized = 0; } #elif(VKFFT_BACKEND==1) CUresult res = CUDA_SUCCESS; cudaError_t res_t = cudaSuccess; if (app->configuration.num_streams > 1) { for (uint64_t i = 0; i < app->configuration.num_streams; i++) { if (app->configuration.stream_event[i] != 0) { res_t = cudaEventDestroy(app->configuration.stream_event[i]); app->configuration.stream_event[i] = 0; } } if (app->configuration.stream_event != 0) { free(app->configuration.stream_event); app->configuration.stream_event = 0; } } #elif(VKFFT_BACKEND==2) hipError_t res_t = hipSuccess; if (app->configuration.num_streams > 1) { for (uint64_t i = 0; i < app->configuration.num_streams; i++) { if (app->configuration.stream_event[i] != 0) { res_t = hipEventDestroy(app->configuration.stream_event[i]); app->configuration.stream_event[i] = 0; } } if (app->configuration.stream_event != 0) { free(app->configuration.stream_event); app->configuration.stream_event = 0; } } #endif if (!app->configuration.userTempBuffer) { if (app->configuration.allocateTempBuffer) { app->configuration.allocateTempBuffer = 0; 
#if(VKFFT_BACKEND==0) if (app->configuration.tempBuffer[0] != 0) { vkDestroyBuffer(app->configuration.device[0], app->configuration.tempBuffer[0], 0); app->configuration.tempBuffer[0] = 0; } if (app->configuration.tempBufferDeviceMemory != 0) { vkFreeMemory(app->configuration.device[0], app->configuration.tempBufferDeviceMemory, 0); app->configuration.tempBufferDeviceMemory = 0; } #elif(VKFFT_BACKEND==1) if (app->configuration.tempBuffer[0] != 0) { res_t = cudaFree(app->configuration.tempBuffer[0]); app->configuration.tempBuffer[0] = 0; } #elif(VKFFT_BACKEND==2) if (app->configuration.tempBuffer[0] != 0) { res_t = hipFree(app->configuration.tempBuffer[0]); app->configuration.tempBuffer[0] = 0; } #elif(VKFFT_BACKEND==3) cl_int res = 0; if (app->configuration.tempBuffer[0] != 0) { res = clReleaseMemObject(app->configuration.tempBuffer[0]); app->configuration.tempBuffer[0] = 0; } #endif if (app->configuration.tempBuffer != 0) { free(app->configuration.tempBuffer); app->configuration.tempBuffer = 0; } } if (app->configuration.tempBufferSize != 0) { free(app->configuration.tempBufferSize); app->configuration.tempBufferSize = 0; } } for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { if (app->useBluesteinFFT[i]) { #if(VKFFT_BACKEND==0) if (app->bufferBluestein[i] != 0) { vkDestroyBuffer(app->configuration.device[0], app->bufferBluestein[i], 0); app->bufferBluestein[i] = 0; } if (app->bufferBluesteinDeviceMemory[i] != 0) { vkFreeMemory(app->configuration.device[0], app->bufferBluesteinDeviceMemory[i], 0); app->bufferBluesteinDeviceMemory[i] = 0; } if (app->bufferBluesteinFFT[i] != 0) { vkDestroyBuffer(app->configuration.device[0], app->bufferBluesteinFFT[i], 0); app->bufferBluesteinFFT[i] = 0; } if (app->bufferBluesteinFFTDeviceMemory[i] != 0) { vkFreeMemory(app->configuration.device[0], app->bufferBluesteinFFTDeviceMemory[i], 0); app->bufferBluesteinFFTDeviceMemory[i] = 0; } if (app->bufferBluesteinIFFT[i] != 0) { vkDestroyBuffer(app->configuration.device[0], 
app->bufferBluesteinIFFT[i], 0); app->bufferBluesteinIFFT[i] = 0; } if (app->bufferBluesteinIFFTDeviceMemory[i] != 0) { vkFreeMemory(app->configuration.device[0], app->bufferBluesteinIFFTDeviceMemory[i], 0); app->bufferBluesteinIFFTDeviceMemory[i] = 0; } #elif(VKFFT_BACKEND==1) if (app->bufferBluestein[i] != 0) { res_t = cudaFree(app->bufferBluestein[i]); app->bufferBluestein[i] = 0; } if (app->bufferBluesteinFFT[i] != 0) { res_t = cudaFree(app->bufferBluesteinFFT[i]); app->bufferBluesteinFFT[i] = 0; } if (app->bufferBluesteinIFFT[i] != 0) { res_t = cudaFree(app->bufferBluesteinIFFT[i]); app->bufferBluesteinIFFT[i] = 0; } #elif(VKFFT_BACKEND==2) if (app->bufferBluestein[i] != 0) { res_t = hipFree(app->bufferBluestein[i]); app->bufferBluestein[i] = 0; } if (app->bufferBluesteinFFT[i] != 0) { res_t = hipFree(app->bufferBluesteinFFT[i]); app->bufferBluesteinFFT[i] = 0; } if (app->bufferBluesteinIFFT[i] != 0) { res_t = hipFree(app->bufferBluesteinIFFT[i]); app->bufferBluesteinIFFT[i] = 0; } #elif(VKFFT_BACKEND==3) cl_int res = 0; if (app->bufferBluestein[i] != 0) { res = clReleaseMemObject(app->bufferBluestein[i]); app->bufferBluestein[i] = 0; } if (app->bufferBluesteinFFT[i] != 0) { res = clReleaseMemObject(app->bufferBluesteinFFT[i]); app->bufferBluesteinFFT[i] = 0; } if (app->bufferBluesteinIFFT[i] != 0) { res = clReleaseMemObject(app->bufferBluesteinIFFT[i]); app->bufferBluesteinIFFT[i] = 0; } #endif } } if (!app->configuration.makeInversePlanOnly) { if (app->localFFTPlan != 0) { for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { if (app->localFFTPlan->numAxisUploads[i] > 0) { for (uint64_t j = 0; j < app->localFFTPlan->numAxisUploads[i]; j++) deleteAxis(app, &app->localFFTPlan->axes[i][j]); } } if (app->localFFTPlan->multiUploadR2C) { deleteAxis(app, &app->localFFTPlan->R2Cdecomposition); } if (app->localFFTPlan != 0) { free(app->localFFTPlan); app->localFFTPlan = 0; } } } if (!app->configuration.makeForwardPlanOnly) { if (app->localFFTPlan_inverse != 0) { 
/* Inverse-plan teardown mirrors the forward plan: delete every uploaded axis, then the R2C decomposition axis (if multiUploadR2C), then free the plan struct. */
for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { if (app->localFFTPlan_inverse->numAxisUploads[i] > 0) { for (uint64_t j = 0; j < app->localFFTPlan_inverse->numAxisUploads[i]; j++) deleteAxis(app, &app->localFFTPlan_inverse->axes[i][j]); } } if (app->localFFTPlan_inverse->multiUploadR2C) { deleteAxis(app, &app->localFFTPlan_inverse->R2Cdecomposition); } if (app->localFFTPlan_inverse != 0) { free(app->localFFTPlan_inverse); app->localFFTPlan_inverse = 0; } } } if (app->configuration.saveApplicationToString) { if (app->saveApplicationString != 0) { free(app->saveApplicationString); app->saveApplicationString = 0; } for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { if (app->applicationBluesteinString[i] != 0) { free(app->applicationBluesteinString[i]); app->applicationBluesteinString[i] = 0; } } } } static inline VkFFTResult VkFFTGetRegistersPerThread(uint64_t* loc_multipliers, uint64_t* registers_per_thread_per_radix, uint64_t* registers_per_thread, uint64_t* min_registers_per_thread, uint64_t* isGoodSequence) { for (uint64_t i = 0; i < 14; i++) { registers_per_thread_per_radix[i] = 0; } registers_per_thread[0] = 0; min_registers_per_thread[0] = -1; if (loc_multipliers[2] > 0) { if (loc_multipliers[3] > 0) { if (loc_multipliers[5] > 0) { if (loc_multipliers[7] > 0) { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 14; registers_per_thread_per_radix[3] = 15; break; case 2: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; break; case 3: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; break; default: registers_per_thread_per_radix[2] = 16; registers_per_thread_per_radix[3] = 12; break; } registers_per_thread_per_radix[5] = 15; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; } else { switch (loc_multipliers[2]) { case 1: 
registers_per_thread_per_radix[2] = 14; registers_per_thread_per_radix[3] = 15; break; case 2: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; break; case 3: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; break; default: registers_per_thread_per_radix[2] = 16; registers_per_thread_per_radix[3] = 12; break; } registers_per_thread_per_radix[5] = 15; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; } } else { if (loc_multipliers[13] > 0) { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 14; registers_per_thread_per_radix[3] = 15; break; case 2: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; break; case 3: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; break; default: registers_per_thread_per_radix[2] = 16; registers_per_thread_per_radix[3] = 12; break; } registers_per_thread_per_radix[5] = 15; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; } else { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 14; registers_per_thread_per_radix[3] = 15; break; case 2: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; break; case 3: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; break; default: registers_per_thread_per_radix[2] = 16; registers_per_thread_per_radix[3] = 12; break; } registers_per_thread_per_radix[5] = 15; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; } } } else { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 15; break; case 2: registers_per_thread_per_radix[2] = 12; 
registers_per_thread_per_radix[3] = 12; break; default: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; break; } registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; } else { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 15; break; case 2: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; break; default: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; break; } registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; } } else { if (loc_multipliers[13] > 0) { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 15; break; case 2: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; break; default: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; break; } registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; } else { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 6; registers_per_thread_per_radix[3] = 6; registers_per_thread_per_radix[5] = 5; break; case 2: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; registers_per_thread_per_radix[5] = 10; break; default: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; registers_per_thread_per_radix[5] = 10; break; } registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; } } } } else { if (loc_multipliers[7] > 0) { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { switch 
(loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 22; registers_per_thread_per_radix[3] = 21; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 21; registers_per_thread_per_radix[11] = 22; registers_per_thread_per_radix[13] = 26; break; case 2: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; default: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; } } else { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 22; registers_per_thread_per_radix[3] = 21; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 21; registers_per_thread_per_radix[11] = 22; registers_per_thread_per_radix[13] = 0; break; case 2: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; default: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; } } } else { if (loc_multipliers[13] > 0) { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 26; registers_per_thread_per_radix[3] = 21; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 21; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 26; break; case 2: registers_per_thread_per_radix[2] = 12; 
registers_per_thread_per_radix[3] = 12; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; default: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; } } else { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 6; registers_per_thread_per_radix[3] = 6; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; break; case 2: registers_per_thread_per_radix[2] = 6; registers_per_thread_per_radix[3] = 6; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; break; default: registers_per_thread_per_radix[2] = 8; registers_per_thread_per_radix[3] = 6; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; break; } } } } else { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 6; registers_per_thread_per_radix[3] = 6; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; case 2: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; default: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; registers_per_thread_per_radix[5] = 
0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; } } else { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 6; registers_per_thread_per_radix[3] = 6; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; case 2: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; default: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; } } } else { if (loc_multipliers[13] > 0) { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 6; registers_per_thread_per_radix[3] = 6; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; case 2: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; default: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; } } else { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 6; registers_per_thread_per_radix[3] = 6; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; 
registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; break; case 2: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; break; default: registers_per_thread_per_radix[2] = 12; registers_per_thread_per_radix[3] = 12; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; break; } } } } } } else { if (loc_multipliers[5] > 0) { if (loc_multipliers[7] > 0) { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; case 2: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; case 3: registers_per_thread_per_radix[2] = 8; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; default: registers_per_thread_per_radix[2] = 16; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; } } else { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 14; 
registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; case 2: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; case 3: registers_per_thread_per_radix[2] = 8; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; default: registers_per_thread_per_radix[2] = 16; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; } } } else { if (loc_multipliers[13] > 0) { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; case 2: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; case 3: registers_per_thread_per_radix[2] = 8; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; default: registers_per_thread_per_radix[2] = 16; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; } } else { switch (loc_multipliers[2]) { case 1: 
registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; break; case 2: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; break; default: registers_per_thread_per_radix[2] = 8; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; break; } } } } else { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; case 2: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; default: registers_per_thread_per_radix[2] = 8; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; } } else { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; case 2: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] 
= 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; default: registers_per_thread_per_radix[2] = 8; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; } } } else { if (loc_multipliers[13] > 0) { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; case 2: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; default: registers_per_thread_per_radix[2] = 8; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; } } else { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; break; case 2: registers_per_thread_per_radix[2] = 10; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; break; default: registers_per_thread_per_radix[2] = 8; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 10; registers_per_thread_per_radix[7] = 0; 
registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; break; } } } } } else { if (loc_multipliers[7] > 0) { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 14; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; case 2: registers_per_thread_per_radix[2] = 14; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; case 3: registers_per_thread_per_radix[2] = 8; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; default: registers_per_thread_per_radix[2] = 16; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; } } else { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 14; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; case 2: registers_per_thread_per_radix[2] = 14; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; case 3: registers_per_thread_per_radix[2] = 8; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 
11; registers_per_thread_per_radix[13] = 0; break; default: registers_per_thread_per_radix[2] = 16; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; } } } else { if (loc_multipliers[13] > 0) { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 14; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; case 2: registers_per_thread_per_radix[2] = 14; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; case 3: registers_per_thread_per_radix[2] = 8; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; default: registers_per_thread_per_radix[2] = 16; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; } } else { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 14; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; break; case 2: registers_per_thread_per_radix[2] = 14; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; break; case 3: registers_per_thread_per_radix[2] = 14; 
registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; break; default: registers_per_thread_per_radix[2] = 14; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 14; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; break; } } } } else { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 22; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 22; registers_per_thread_per_radix[13] = 26; break; case 2: registers_per_thread_per_radix[2] = 22; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 22; registers_per_thread_per_radix[13] = 26; break; default: registers_per_thread_per_radix[2] = 8; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; break; } } else { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 22; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 22; registers_per_thread_per_radix[13] = 0; break; case 2: registers_per_thread_per_radix[2] = 22; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 22; registers_per_thread_per_radix[13] = 0; break; case 3: registers_per_thread_per_radix[2] = 8; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; 
registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; default: registers_per_thread_per_radix[2] = 8; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; break; } } } else { if (loc_multipliers[13] > 0) { switch (loc_multipliers[2]) { case 1: registers_per_thread_per_radix[2] = 26; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 26; break; case 2: registers_per_thread_per_radix[2] = 26; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 26; break; default: registers_per_thread_per_radix[2] = 8; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; break; } } else { registers_per_thread_per_radix[2] = (loc_multipliers[2] > 2) ? 
8 : (uint64_t)pow(2, loc_multipliers[2]); registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; } } } } } } else { if (loc_multipliers[3] > 0) { if (loc_multipliers[5] > 0) { if (loc_multipliers[7] > 0) { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 15; registers_per_thread_per_radix[5] = 15; registers_per_thread_per_radix[7] = 21; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 15; registers_per_thread_per_radix[5] = 15; registers_per_thread_per_radix[7] = 21; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; } } else { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 15; registers_per_thread_per_radix[5] = 15; registers_per_thread_per_radix[7] = 21; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 15; registers_per_thread_per_radix[5] = 15; registers_per_thread_per_radix[7] = 21; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; } } } else { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 15; registers_per_thread_per_radix[5] = 15; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 15; registers_per_thread_per_radix[5] = 15; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; } } 
else { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 15; registers_per_thread_per_radix[5] = 15; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 15; registers_per_thread_per_radix[5] = 15; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; } } } } else { if (loc_multipliers[7] > 0) { if (loc_multipliers[3] == 1) { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 21; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 21; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 21; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 21; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; } } else { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 21; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 21; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 21; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 21; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; } } } else { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 9; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; } else { 
registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 9; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; } } else { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 9; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 9; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; } } } } else { if (loc_multipliers[3] == 1) { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 33; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 33; registers_per_thread_per_radix[13] = 39; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 33; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 33; registers_per_thread_per_radix[13] = 0; } } else { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 39; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 39; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 3; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; } } } else { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; 
registers_per_thread_per_radix[3] = 9; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 9; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; } } else { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 9; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 9; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; } } } } } } else { if (loc_multipliers[5] > 0) { if (loc_multipliers[7] > 0) { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 5; registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 5; registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; } } else { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 5; registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 5; 
registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; } } } else { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 5; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 5; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; } } else { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 5; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 5; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; } } } } else { if (loc_multipliers[7] > 0) { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; } } else { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 7; 
registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 7; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 0; } } } else { if (loc_multipliers[11] > 0) { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 13; } else { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 11; registers_per_thread_per_radix[13] = 0; } } else { if (loc_multipliers[13] > 0) { registers_per_thread_per_radix[2] = 0; registers_per_thread_per_radix[3] = 0; registers_per_thread_per_radix[5] = 0; registers_per_thread_per_radix[7] = 0; registers_per_thread_per_radix[11] = 0; registers_per_thread_per_radix[13] = 13; } else { return VKFFT_ERROR_UNSUPPORTED_RADIX; } } } } } } for (uint64_t i = 0; i < 14; i++) { if ((registers_per_thread_per_radix[i] != 0) && (registers_per_thread_per_radix[i] < min_registers_per_thread[0])) min_registers_per_thread[0] = registers_per_thread_per_radix[i]; if ((registers_per_thread_per_radix[i] != 0) && (registers_per_thread_per_radix[i] > registers_per_thread[0])) registers_per_thread[0] = registers_per_thread_per_radix[i]; } if ((registers_per_thread[0] > 16) || (registers_per_thread[0] >= 2 * min_registers_per_thread[0])) isGoodSequence[0] = 0; else isGoodSequence[0] = 1; return VKFFT_SUCCESS; } static inline VkFFTResult VkFFTScheduler(VkFFTApplication* app, VkFFTPlan* FFTPlan, uint64_t axis_id, uint64_t supportAxis) { VkFFTResult res = VKFFT_SUCCESS; VkFFTAxis* axes = FFTPlan->axes[axis_id]; uint64_t 
complexSize; if (app->configuration.doublePrecision || app->configuration.doublePrecisionFloatMemory) complexSize = (2 * sizeof(double)); else if (app->configuration.halfPrecision) complexSize = (2 * sizeof(float)); else complexSize = (2 * sizeof(float)); uint64_t maxSequenceLengthSharedMemory = app->configuration.sharedMemorySize / complexSize; uint64_t maxSingleSizeNonStrided = maxSequenceLengthSharedMemory; uint64_t nonStridedAxisId = (app->configuration.considerAllAxesStrided) ? -1 : 0; for (uint64_t i = 0; i < 3; i++) { FFTPlan->actualFFTSizePerAxis[axis_id][i] = app->configuration.size[i]; } FFTPlan->actualPerformR2CPerAxis[axis_id] = app->configuration.performR2C; if ((axis_id == 0) && (app->configuration.performR2C) && (app->configuration.size[axis_id] > maxSingleSizeNonStrided)) { FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] = app->configuration.size[axis_id] / 2; // now in actualFFTSize - modified dimension size for R2C/DCT
FFTPlan->actualPerformR2CPerAxis[axis_id] = 0; FFTPlan->multiUploadR2C = 1; } if (app->configuration.performDCT == 1) { FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] = 2 * app->configuration.size[axis_id] - 2; // now in actualFFTSize - modified dimension size for R2C/DCT
} if ((app->configuration.performDCT == 4) && (app->configuration.size[axis_id] % 2 == 0)) { FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] = app->configuration.size[axis_id] / 2; // now in actualFFTSize - modified dimension size for R2C/DCT
//FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] = app->configuration.size[axis_id] * 8; // now in actualFFTSize - modified dimension size for R2C/DCT
} if ((axis_id > 0) && (app->configuration.performR2C)) { FFTPlan->actualFFTSizePerAxis[axis_id][0] = FFTPlan->actualFFTSizePerAxis[axis_id][0] / 2 + 1; } if (axis_id != nonStridedAxisId) { if (app->configuration.performBandwidthBoost > 0) axes->specializationConstants.performBandwidthBoost = app->configuration.performBandwidthBoost; } uint64_t multipliers[20] = {
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 };//split the sequence
uint64_t tempSequence = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id]; for (uint64_t i = 2; i < 14; i++) { if (tempSequence % i == 0) { tempSequence /= i; multipliers[i]++; i--; } } if (tempSequence != 1) { app->useBluesteinFFT[axis_id] = 1; if (axis_id != nonStridedAxisId) { if (app->configuration.performBandwidthBoost == 0) axes->specializationConstants.performBandwidthBoost = 2; } app->configuration.registerBoost = 1; tempSequence = 2 * FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] - 1; uint64_t FFTSizeSelected = 0; if (app->configuration.fixMaxRadixBluestein > 0) { while (!FFTSizeSelected) { uint64_t testSequence = tempSequence; for (uint64_t i = 0; i < 20; i++) { multipliers[i] = 0; } for (uint64_t i = 2; i < app->configuration.fixMaxRadixBluestein + 1; i++) { if (testSequence % i == 0) { testSequence /= i; multipliers[i]++; i--; } } if (testSequence == 1) FFTSizeSelected = 1; else tempSequence++; } } else { while (!FFTSizeSelected) { if (axis_id == nonStridedAxisId) { if ((FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] < 128) || ((((uint64_t)pow(2, (uint64_t)ceil(log2(tempSequence))) * 0.75) <= tempSequence) && (((uint64_t)pow(2, (uint64_t)ceil(log2(tempSequence))) <= maxSequenceLengthSharedMemory) || ((2 * FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] - 1) > maxSequenceLengthSharedMemory)))) tempSequence = (uint64_t)pow(2, (uint64_t)ceil(log2(tempSequence))); } else { uint64_t maxSequenceLengthSharedMemoryStrided_temp = (app->configuration.coalescedMemory > complexSize) ?
app->configuration.sharedMemorySize / (app->configuration.coalescedMemory) : app->configuration.sharedMemorySize / complexSize; if ((FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] < 128) || ((((uint64_t)pow(2, (uint64_t)ceil(log2(tempSequence))) * 0.75) <= tempSequence) && (((uint64_t)pow(2, (uint64_t)ceil(log2(tempSequence))) <= maxSequenceLengthSharedMemoryStrided_temp) || ((2 * FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] - 1) > maxSequenceLengthSharedMemoryStrided_temp)))) tempSequence = (uint64_t)pow(2, (uint64_t)ceil(log2(tempSequence))); } uint64_t testSequence = tempSequence; for (uint64_t i = 0; i < 20; i++) { multipliers[i] = 0; } for (uint64_t i = 2; i < 8; i++) { if (testSequence % i == 0) { testSequence /= i; multipliers[i]++; i--; } } if (testSequence != 1) tempSequence++; else { uint64_t registers_per_thread_per_radix[14]; uint64_t registers_per_thread = 0; uint64_t min_registers_per_thread = -1; uint64_t isGoodSequence = 0; res = VkFFTGetRegistersPerThread(multipliers, registers_per_thread_per_radix, &registers_per_thread, &min_registers_per_thread, &isGoodSequence); if (res != VKFFT_SUCCESS) return res; if (isGoodSequence) FFTSizeSelected = 1; else tempSequence++; } } } FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] = tempSequence; //check if padded system still single upload for r2c - else redo the optimization
if ((axis_id == 0) && (app->configuration.performR2C) && (!FFTPlan->multiUploadR2C) && (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] > maxSingleSizeNonStrided)) { FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] = app->configuration.size[axis_id] / 2; // now in actualFFTSize - modified dimension size for R2C/DCT
FFTPlan->actualPerformR2CPerAxis[axis_id] = 0; FFTPlan->multiUploadR2C = 1; tempSequence = 2 * FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] - 1; uint64_t FFTSizeSelected = 0; if (app->configuration.fixMaxRadixBluestein > 0) { while (!FFTSizeSelected) { uint64_t testSequence = tempSequence; for (uint64_t i = 0; i < 20; i++) {
multipliers[i] = 0; } for (uint64_t i = 2; i < app->configuration.fixMaxRadixBluestein + 1; i++) { if (testSequence % i == 0) { testSequence /= i; multipliers[i]++; i--; } } if (testSequence == 1) FFTSizeSelected = 1; else tempSequence++; } } else { while (!FFTSizeSelected) { if (axis_id == nonStridedAxisId) { if ((FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] < 128) || ((((uint64_t)pow(2, (uint64_t)ceil(log2(tempSequence))) * 0.75) <= tempSequence) && (((uint64_t)pow(2, (uint64_t)ceil(log2(tempSequence))) <= maxSequenceLengthSharedMemory) || ((2 * FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] - 1) > maxSequenceLengthSharedMemory)))) tempSequence = (uint64_t)pow(2, (uint64_t)ceil(log2(tempSequence))); } else { uint64_t maxSequenceLengthSharedMemoryStrided_temp = (app->configuration.coalescedMemory > complexSize) ? app->configuration.sharedMemorySize / (app->configuration.coalescedMemory) : app->configuration.sharedMemorySize / complexSize; if ((FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] < 128) || ((((uint64_t)pow(2, (uint64_t)ceil(log2(tempSequence))) * 0.75) <= tempSequence) && (((uint64_t)pow(2, (uint64_t)ceil(log2(tempSequence))) <= maxSequenceLengthSharedMemoryStrided_temp) || ((2 * FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] - 1) > maxSequenceLengthSharedMemoryStrided_temp)))) tempSequence = (uint64_t)pow(2, (uint64_t)ceil(log2(tempSequence))); } uint64_t testSequence = tempSequence; for (uint64_t i = 0; i < 20; i++) { multipliers[i] = 0; } for (uint64_t i = 2; i < 8; i++) { if (testSequence % i == 0) { testSequence /= i; multipliers[i]++; i--; } } if (testSequence != 1) tempSequence++; else { uint64_t registers_per_thread_per_radix[14]; uint64_t registers_per_thread = 0; uint64_t min_registers_per_thread = -1; uint64_t isGoodSequence = 0; res = VkFFTGetRegistersPerThread(multipliers, registers_per_thread_per_radix, &registers_per_thread, &min_registers_per_thread, &isGoodSequence); if (res != VKFFT_SUCCESS) return res; if (isGoodSequence)
FFTSizeSelected = 1; else tempSequence++; } } } FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] = tempSequence; } if ((FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] & (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] - 1)) == 0) { app->configuration.sharedMemorySize = app->configuration.sharedMemorySizePow2; maxSequenceLengthSharedMemory = app->configuration.sharedMemorySize / complexSize; maxSingleSizeNonStrided = maxSequenceLengthSharedMemory; } } uint64_t isPowOf2 = (pow(2, (uint64_t)log2(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id])) == FFTPlan->actualFFTSizePerAxis[axis_id][axis_id]) ? 1 : 0; if (app->configuration.tempBufferSize[0] == 0) { if ((app->configuration.performR2C) && (axis_id == 0)) { if (FFTPlan->multiUploadR2C) app->configuration.tempBufferSize[0] = (FFTPlan->actualFFTSizePerAxis[axis_id][0] + 1) * FFTPlan->actualFFTSizePerAxis[axis_id][1] * FFTPlan->actualFFTSizePerAxis[axis_id][2] * app->configuration.coordinateFeatures * app->configuration.numberBatches * app->configuration.numberKernels * complexSize; } else { app->configuration.tempBufferSize[0] = FFTPlan->actualFFTSizePerAxis[axis_id][0] * FFTPlan->actualFFTSizePerAxis[axis_id][1] * FFTPlan->actualFFTSizePerAxis[axis_id][2] * app->configuration.coordinateFeatures * app->configuration.numberBatches * app->configuration.numberKernels * complexSize; } } if (app->useBluesteinFFT[axis_id]) { if ((app->configuration.performR2C) && (axis_id == 0)) { if (FFTPlan->multiUploadR2C) { if ((FFTPlan->actualFFTSizePerAxis[axis_id][0] + 1) * FFTPlan->actualFFTSizePerAxis[axis_id][1] * FFTPlan->actualFFTSizePerAxis[axis_id][2] * app->configuration.coordinateFeatures * app->configuration.numberBatches * app->configuration.numberKernels * complexSize > app->configuration.tempBufferSize[0]) app->configuration.tempBufferSize[0] = (FFTPlan->actualFFTSizePerAxis[axis_id][0] + 1) * FFTPlan->actualFFTSizePerAxis[axis_id][1] * FFTPlan->actualFFTSizePerAxis[axis_id][2] * app->configuration.coordinateFeatures * 
app->configuration.numberBatches * app->configuration.numberKernels * complexSize; } } else { if (FFTPlan->actualFFTSizePerAxis[axis_id][0] * FFTPlan->actualFFTSizePerAxis[axis_id][1] * FFTPlan->actualFFTSizePerAxis[axis_id][2] * app->configuration.coordinateFeatures * app->configuration.numberBatches * app->configuration.numberKernels * complexSize > app->configuration.tempBufferSize[0]) app->configuration.tempBufferSize[0] = FFTPlan->actualFFTSizePerAxis[axis_id][0] * FFTPlan->actualFFTSizePerAxis[axis_id][1] * FFTPlan->actualFFTSizePerAxis[axis_id][2] * app->configuration.coordinateFeatures * app->configuration.numberBatches * app->configuration.numberKernels * complexSize; } } //return VKFFT_ERROR_UNSUPPORTED_RADIX; uint64_t registerBoost = 1; for (uint64_t i = 1; i <= app->configuration.registerBoost; i++) { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] % (i * i) == 0) registerBoost = i; } if ((axis_id == nonStridedAxisId) && (!app->configuration.performConvolution)) maxSingleSizeNonStrided *= registerBoost; uint64_t maxSequenceLengthSharedMemoryStrided = (app->configuration.coalescedMemory > complexSize) ? app->configuration.sharedMemorySize / (app->configuration.coalescedMemory) : app->configuration.sharedMemorySize / complexSize; uint64_t maxSingleSizeStrided = (!app->configuration.performConvolution) ? maxSequenceLengthSharedMemoryStrided * registerBoost : maxSequenceLengthSharedMemoryStrided; uint64_t numPasses = 1; uint64_t numPassesHalfBandwidth = 1; uint64_t temp; temp = (axis_id == nonStridedAxisId) ? 
(uint64_t)ceil(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (double)maxSingleSizeNonStrided) : (uint64_t)ceil(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (double)maxSingleSizeStrided); if (temp > 1) {//more passes than one for (uint64_t i = 1; i <= app->configuration.registerBoost4Step; i++) { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] % (i * i) == 0) { registerBoost = i; } } if ((!app->configuration.performConvolution)) maxSingleSizeNonStrided = maxSequenceLengthSharedMemory * registerBoost; if ((!app->configuration.performConvolution)) maxSingleSizeStrided = maxSequenceLengthSharedMemoryStrided * registerBoost; temp = ((axis_id == nonStridedAxisId) && ((!app->configuration.reorderFourStep) || (app->useBluesteinFFT[axis_id]))) ? FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / maxSingleSizeNonStrided : FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / maxSingleSizeStrided; if (app->configuration.reorderFourStep && (!app->useBluesteinFFT[axis_id])) numPasses = (uint64_t)ceil(log2(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id]) / log2(maxSingleSizeStrided)); else numPasses += (uint64_t)ceil(log2(temp) / log2(maxSingleSizeStrided)); } registerBoost = ((axis_id == nonStridedAxisId) && ((app->useBluesteinFFT[axis_id]) || (!app->configuration.reorderFourStep) || (numPasses == 1))) ? 
(uint64_t)ceil(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (double)(pow(maxSequenceLengthSharedMemoryStrided, numPasses - 1) * maxSequenceLengthSharedMemory)) : (uint64_t)ceil(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (double)pow(maxSequenceLengthSharedMemoryStrided, numPasses)); uint64_t canBoost = 0; for (uint64_t i = registerBoost; i <= app->configuration.registerBoost; i++) { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] % (i * i) == 0) { registerBoost = i; i = app->configuration.registerBoost + 1; canBoost = 1; } } if (((canBoost == 0) || (((FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] & (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] - 1)) != 0) && (!app->configuration.registerBoostNonPow2))) && (registerBoost > 1)) { registerBoost = 1; numPasses++; } maxSingleSizeNonStrided = maxSequenceLengthSharedMemory * registerBoost; maxSingleSizeStrided = maxSequenceLengthSharedMemoryStrided * registerBoost; uint64_t maxSingleSizeStridedHalfBandwidth = maxSingleSizeStrided; if ((axes->specializationConstants.performBandwidthBoost)) { maxSingleSizeStridedHalfBandwidth = (app->configuration.coalescedMemory / axes->specializationConstants.performBandwidthBoost > complexSize) ? app->configuration.sharedMemorySizePow2 / (app->configuration.coalescedMemory / axes->specializationConstants.performBandwidthBoost) : app->configuration.sharedMemorySizePow2 / complexSize; temp = (axis_id == nonStridedAxisId) ? (uint64_t)ceil(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (double)maxSingleSizeNonStrided) : (uint64_t)ceil(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (double)maxSingleSizeStridedHalfBandwidth); //temp = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / maxSingleSizeNonStrided; if (temp > 1) {//more passes than two temp = ((!app->configuration.reorderFourStep) || (app->useBluesteinFFT[axis_id])) ? 
(uint64_t)ceil(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (double)maxSingleSizeNonStrided) : (uint64_t)ceil(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (double)maxSingleSizeStridedHalfBandwidth); for (uint64_t i = 0; i < 5; i++) { temp = (uint64_t)ceil(temp / (double)maxSingleSizeStrided); numPassesHalfBandwidth++; if (temp == 1) i = 5; } /* temp = ((axis_id == 0) && (!app->configuration.reorderFourStep)) ? FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / maxSingleSizeNonStrided : FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / maxSingleSizeStridedHalfBandwidth; if (app->configuration.reorderFourStep) numPassesHalfBandwidth = (uint64_t)ceil(log2(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id]) / log2(maxSingleSizeStridedHalfBandwidth)); else numPassesHalfBandwidth = 1 + (uint64_t)ceil(log2(temp) / log2(maxSingleSizeStridedHalfBandwidth)); if ((numPassesHalfBandwidth == 2)&& (!app->configuration.reorderFourStep)&&(registerBoost>1)) //switch back for two step and don't do half bandwidth on strided accesses if register boost and no 4-step reordering */ } if (numPassesHalfBandwidth < numPasses) numPasses = numPassesHalfBandwidth; else maxSingleSizeStridedHalfBandwidth = maxSingleSizeStrided; } if (((uint64_t)log2(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id]) >= app->configuration.swapTo3Stage4Step) && (app->configuration.swapTo3Stage4Step >= 17)) numPasses = 3;//Force set to 3 stage 4 step algorithm uint64_t* locAxisSplit = FFTPlan->axisSplit[axis_id]; if (numPasses == 1) { locAxisSplit[0] = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id]; } if (numPasses == 2) { if (isPowOf2) { if ((axis_id == nonStridedAxisId) && ((!app->configuration.reorderFourStep) || (app->useBluesteinFFT[axis_id]))) { uint64_t maxPow8SharedMemory = (uint64_t)pow(8, ((uint64_t)log2(maxSequenceLengthSharedMemory)) / 3); //unit stride if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / maxPow8SharedMemory <= maxSingleSizeStrided) { locAxisSplit[0] = maxPow8SharedMemory; } else 
{ if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / maxSequenceLengthSharedMemory <= maxSingleSizeStrided) { locAxisSplit[0] = maxSequenceLengthSharedMemory; } else { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (maxSequenceLengthSharedMemory * registerBoost) < maxSingleSizeStridedHalfBandwidth) { for (uint64_t i = 1; i <= (uint64_t)log2(registerBoost); i++) { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (maxSequenceLengthSharedMemory * (uint64_t)pow(2, i)) <= maxSingleSizeStrided) { locAxisSplit[0] = (maxSequenceLengthSharedMemory * (uint64_t)pow(2, i)); i = (uint64_t)log2(registerBoost) + 1; } } } else { locAxisSplit[0] = (maxSequenceLengthSharedMemory * registerBoost); } } } } else { uint64_t maxPow8Strided = (uint64_t)pow(8, ((uint64_t)log2(maxSingleSizeStrided)) / 3); //all FFTs are considered as non-unit stride if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / maxPow8Strided <= maxSingleSizeStrided) { locAxisSplit[0] = maxPow8Strided; } else { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / maxSingleSizeStrided < maxSingleSizeStridedHalfBandwidth) { locAxisSplit[0] = maxSingleSizeStrided; } else { locAxisSplit[0] = maxSingleSizeStridedHalfBandwidth; } } } locAxisSplit[1] = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / locAxisSplit[0]; if (locAxisSplit[1] < 64) { locAxisSplit[0] = (locAxisSplit[1] == 0) ? 
locAxisSplit[0] / (64) : locAxisSplit[0] / (64 / locAxisSplit[1]); locAxisSplit[1] = 64; } if (locAxisSplit[1] > locAxisSplit[0]) { uint64_t swap = locAxisSplit[0]; locAxisSplit[0] = locAxisSplit[1]; locAxisSplit[1] = swap; } } else { uint64_t successSplit = 0; if ((axis_id == nonStridedAxisId) && ((!app->configuration.reorderFourStep) || (app->useBluesteinFFT[axis_id]))) { /*for (uint64_t i = 0; i < maxSequenceLengthSharedMemory; i++) { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] % (maxSequenceLengthSharedMemory - i) == 0) { if (((maxSequenceLengthSharedMemory - i) <= maxSequenceLengthSharedMemory) && (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (maxSequenceLengthSharedMemory - i) <= maxSingleSizeStrided)) { locAxisSplit[0] = (maxSequenceLengthSharedMemory - i); locAxisSplit[1] = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (maxSequenceLengthSharedMemory - i); i = maxSequenceLengthSharedMemory; successSplit = 1; } } }*/ uint64_t sqrtSequence = (uint64_t)ceil(sqrt(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id])); for (uint64_t i = 0; i < sqrtSequence; i++) { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] % (sqrtSequence - i) == 0) { if ((sqrtSequence - i <= maxSingleSizeStrided) && (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (sqrtSequence - i) <= maxSequenceLengthSharedMemory)) { locAxisSplit[0] = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (sqrtSequence - i); locAxisSplit[1] = sqrtSequence - i; i = sqrtSequence; successSplit = 1; } } } } else { uint64_t sqrtSequence = (uint64_t)ceil(sqrt(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id])); for (uint64_t i = 0; i < sqrtSequence; i++) { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] % (sqrtSequence - i) == 0) { if ((sqrtSequence - i <= maxSingleSizeStrided) && (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (sqrtSequence - i) <= maxSingleSizeStridedHalfBandwidth)) { locAxisSplit[0] = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (sqrtSequence - i); locAxisSplit[1] = 
sqrtSequence - i; i = sqrtSequence; successSplit = 1; } } } } if (successSplit == 0) numPasses = 3; } } if (numPasses == 3) { if (isPowOf2) { uint64_t maxPow8Strided = (uint64_t)pow(8, ((uint64_t)log2(maxSingleSizeStrided)) / 3); if ((axis_id == nonStridedAxisId) && ((!app->configuration.reorderFourStep) || (app->useBluesteinFFT[axis_id]))) { //unit stride uint64_t maxPow8SharedMemory = (uint64_t)pow(8, ((uint64_t)log2(maxSequenceLengthSharedMemory)) / 3); if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / maxPow8SharedMemory <= maxPow8Strided * maxPow8Strided) locAxisSplit[0] = maxPow8SharedMemory; else { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / maxSequenceLengthSharedMemory <= maxSingleSizeStrided * maxSingleSizeStrided) locAxisSplit[0] = maxSequenceLengthSharedMemory; else { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (maxSequenceLengthSharedMemory * registerBoost) <= maxSingleSizeStrided * maxSingleSizeStrided) { for (uint64_t i = 0; i <= (uint64_t)log2(registerBoost); i++) { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (maxSequenceLengthSharedMemory * (uint64_t)pow(2, i)) <= maxSingleSizeStrided * maxSingleSizeStrided) { locAxisSplit[0] = (maxSequenceLengthSharedMemory * (uint64_t)pow(2, i)); i = (uint64_t)log2(registerBoost) + 1; } } } else { locAxisSplit[0] = (maxSequenceLengthSharedMemory * registerBoost); } } } } else { //to account for TLB misses, it is best to coalesce the unit-strided stage to 128 bytes /*uint64_t log2axis = (uint64_t)log2(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id]); locAxisSplit[0] = (uint64_t)pow(2, (uint64_t)log2axis / 3); if (log2axis % 3 > 0) locAxisSplit[0] *= 2; locAxisSplit[1] = (uint64_t)pow(2, (uint64_t)log2axis / 3); if (log2axis % 3 > 1) locAxisSplit[1] *= 2; locAxisSplit[2] = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / locAxisSplit[0] / locAxisSplit[1];*/ uint64_t maxSingleSizeStrided128 = app->configuration.sharedMemorySize / (128); uint64_t maxPow8_128 = (uint64_t)pow(8, 
((uint64_t)log2(maxSingleSizeStrided128)) / 3); //unit stride if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / maxPow8_128 <= maxPow8Strided * maxSingleSizeStrided) locAxisSplit[0] = maxPow8_128; //non-unit stride else { if ((FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (maxPow8_128 * 2) <= maxPow8Strided * maxSingleSizeStrided) && (maxPow8_128 * 2 <= maxSingleSizeStrided128)) { locAxisSplit[0] = maxPow8_128 * 2; } else { if ((FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (maxPow8_128 * 4) <= maxPow8Strided * maxSingleSizeStrided) && (maxPow8_128 * 4 <= maxSingleSizeStrided128)) { locAxisSplit[0] = maxPow8_128 * 4; } else { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / maxSingleSizeStrided <= maxSingleSizeStrided * maxSingleSizeStrided) { for (uint64_t i = 0; i <= (uint64_t)log2(maxSingleSizeStrided / maxSingleSizeStrided128); i++) { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (maxSingleSizeStrided128 * (uint64_t)pow(2, i)) <= maxSingleSizeStrided * maxSingleSizeStrided) { locAxisSplit[0] = (maxSingleSizeStrided128 * (uint64_t)pow(2, i)); i = (uint64_t)log2(maxSingleSizeStrided / maxSingleSizeStrided128) + 1; } } } else locAxisSplit[0] = maxSingleSizeStridedHalfBandwidth; } } } } if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / locAxisSplit[0] / maxPow8Strided <= maxSingleSizeStrided) { locAxisSplit[1] = maxPow8Strided; locAxisSplit[2] = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / locAxisSplit[1] / locAxisSplit[0]; } else { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / locAxisSplit[0] / maxSingleSizeStrided <= maxSingleSizeStrided) { locAxisSplit[1] = maxSingleSizeStrided; locAxisSplit[2] = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / locAxisSplit[1] / locAxisSplit[0]; } else { locAxisSplit[1] = maxSingleSizeStridedHalfBandwidth; locAxisSplit[2] = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / locAxisSplit[1] / locAxisSplit[0]; } } if (locAxisSplit[2] < 64) { locAxisSplit[1] = (locAxisSplit[2] == 0) ? 
locAxisSplit[1] / (64) : locAxisSplit[1] / (64 / locAxisSplit[2]); locAxisSplit[2] = 64; } if (locAxisSplit[2] > locAxisSplit[1]) { uint64_t swap = locAxisSplit[1]; locAxisSplit[1] = locAxisSplit[2]; locAxisSplit[2] = swap; } } else { uint64_t successSplit = 0; if ((axis_id == nonStridedAxisId) && ((!app->configuration.reorderFourStep) || (app->useBluesteinFFT[axis_id]))) { for (uint64_t i = 0; i < maxSequenceLengthSharedMemory; i++) { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] % (maxSequenceLengthSharedMemory - i) == 0) { uint64_t sqrt3Sequence = (uint64_t)ceil(sqrt(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (maxSequenceLengthSharedMemory - i))); for (uint64_t j = 0; j < sqrt3Sequence; j++) { if ((FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (maxSequenceLengthSharedMemory - i)) % (sqrt3Sequence - j) == 0) { if (((maxSequenceLengthSharedMemory - i) <= maxSequenceLengthSharedMemory) && (sqrt3Sequence - j <= maxSingleSizeStrided) && (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (maxSequenceLengthSharedMemory - i) / (sqrt3Sequence - j) <= maxSingleSizeStrided)) { locAxisSplit[0] = (maxSequenceLengthSharedMemory - i); locAxisSplit[1] = sqrt3Sequence - j; locAxisSplit[2] = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (maxSequenceLengthSharedMemory - i) / (sqrt3Sequence - j); i = maxSequenceLengthSharedMemory; j = sqrt3Sequence; successSplit = 1; } } } } } } else { uint64_t sqrt3Sequence = (uint64_t)ceil(pow(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id], 1.0 / 3.0)); for (uint64_t i = 0; i < sqrt3Sequence; i++) { if (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] % (sqrt3Sequence - i) == 0) { uint64_t sqrt2Sequence = (uint64_t)ceil(sqrt(FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (sqrt3Sequence - i))); for (uint64_t j = 0; j < sqrt2Sequence; j++) { if ((FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (sqrt3Sequence - i)) % (sqrt2Sequence - j) == 0) { if ((sqrt3Sequence - i <= maxSingleSizeStrided) && (sqrt2Sequence - j <= 
maxSingleSizeStrided) && (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (sqrt3Sequence - i) / (sqrt2Sequence - j) <= maxSingleSizeStridedHalfBandwidth)) { locAxisSplit[0] = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / (sqrt3Sequence - i) / (sqrt2Sequence - j); locAxisSplit[1] = sqrt3Sequence - i; locAxisSplit[2] = sqrt2Sequence - j; i = sqrt3Sequence; j = sqrt2Sequence; successSplit = 1; } } } } } } if (successSplit == 0) numPasses = 4; } } if (numPasses > 3) { //printf("sequence length exceeds boundaries\n");
return VKFFT_ERROR_UNSUPPORTED_FFT_LENGTH; } if ((numPasses > 1) && (app->configuration.performDCT > 0)) { //printf("sequence length exceeds boundaries\n");
return VKFFT_ERROR_UNSUPPORTED_FFT_LENGTH_DCT; } if ((numPasses > 1) && (app->configuration.performR2C > 0) && (axis_id == 0) && (app->configuration.size[axis_id] % 2 != 0)) { //printf("sequence length exceeds boundaries\n");
return VKFFT_ERROR_UNSUPPORTED_FFT_LENGTH_R2C; } if (((app->configuration.reorderFourStep) && (!app->useBluesteinFFT[axis_id]))) { for (uint64_t i = 0; i < numPasses; i++) { if ((locAxisSplit[0] % 2 != 0) && (locAxisSplit[i] % 2 == 0)) { uint64_t swap = locAxisSplit[0]; locAxisSplit[0] = locAxisSplit[i]; locAxisSplit[i] = swap; } } for (uint64_t i = 0; i < numPasses; i++) { if ((locAxisSplit[0] % 4 != 0) && (locAxisSplit[i] % 4 == 0)) { uint64_t swap = locAxisSplit[0]; locAxisSplit[0] = locAxisSplit[i]; locAxisSplit[i] = swap; } } for (uint64_t i = 0; i < numPasses; i++) { if ((locAxisSplit[0] % 8 != 0) && (locAxisSplit[i] % 8 == 0)) { uint64_t swap = locAxisSplit[0]; locAxisSplit[0] = locAxisSplit[i]; locAxisSplit[i] = swap; } } } FFTPlan->numAxisUploads[axis_id] = numPasses; for (uint64_t k = 0; k < numPasses; k++) { tempSequence = locAxisSplit[k]; uint64_t loc_multipliers[20] = { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 }; //split the smaller sequence
for (uint64_t i = 2; i < 14; i++) { if (tempSequence % i == 0) { tempSequence /= i; loc_multipliers[i]++; i--; } } uint64_t
registers_per_thread_per_radix[14]; uint64_t registers_per_thread = 0; uint64_t min_registers_per_thread = -1; uint64_t isGoodSequence = 0; res = VkFFTGetRegistersPerThread(loc_multipliers, registers_per_thread_per_radix, &registers_per_thread, &min_registers_per_thread, &isGoodSequence); if (res != VKFFT_SUCCESS) return res; registers_per_thread_per_radix[8] = registers_per_thread_per_radix[2]; registers_per_thread_per_radix[4] = registers_per_thread_per_radix[2]; if ((registerBoost == 4) && (registers_per_thread % 4 != 0)) { registers_per_thread *= 2; for (uint64_t i = 2; i < 14; i++) { registers_per_thread_per_radix[i] *= 2; } min_registers_per_thread *= 2; } if (registers_per_thread_per_radix[8] % 8 == 0) { loc_multipliers[8] = loc_multipliers[2] / 3; loc_multipliers[2] = loc_multipliers[2] - loc_multipliers[8] * 3; } if (registers_per_thread_per_radix[4] % 4 == 0) { loc_multipliers[4] = loc_multipliers[2] / 2; loc_multipliers[2] = loc_multipliers[2] - loc_multipliers[4] * 2; } if ((registerBoost == 2) && (loc_multipliers[2] == 0)) { if (loc_multipliers[4] > 0) { loc_multipliers[4]--; loc_multipliers[2] = 2; } else { loc_multipliers[8]--; loc_multipliers[4]++; loc_multipliers[2]++; } } if ((registerBoost == 4) && (loc_multipliers[4] == 0)) { loc_multipliers[8]--; loc_multipliers[4]++; loc_multipliers[2]++; } uint64_t maxBatchCoalesced = ((axis_id == 0) && (((k == 0) && ((!app->configuration.reorderFourStep) || (app->useBluesteinFFT[axis_id]))) || (numPasses == 1))) ?
1 : app->configuration.coalescedMemory / complexSize; if (maxBatchCoalesced * locAxisSplit[k] / (min_registers_per_thread * registerBoost) > app->configuration.maxThreadsNum) { uint64_t scaleRegistersNum = 1; if ((maxBatchCoalesced * locAxisSplit[k] / (min_registers_per_thread * registerBoost * scaleRegistersNum)) > app->configuration.maxThreadsNum) { for (uint64_t i = 2; i < locAxisSplit[k]; i++) { if ((locAxisSplit[k] / (min_registers_per_thread * registerBoost * scaleRegistersNum) % i == 0) && ((maxBatchCoalesced * locAxisSplit[k] / (min_registers_per_thread * registerBoost * i)) <= app->configuration.maxThreadsNum)) { scaleRegistersNum = i; i = locAxisSplit[k]; } } } min_registers_per_thread *= scaleRegistersNum; uint64_t temp_scaleRegistersNum = scaleRegistersNum; while ((locAxisSplit[k] / (registers_per_thread * registerBoost)) % temp_scaleRegistersNum != 0) temp_scaleRegistersNum++; registers_per_thread *= temp_scaleRegistersNum; for (uint64_t i = 2; i < 14; i++) { if (registers_per_thread_per_radix[i] != 0) { temp_scaleRegistersNum = scaleRegistersNum; while ((locAxisSplit[k] / (registers_per_thread_per_radix[i] * registerBoost)) % temp_scaleRegistersNum != 0) temp_scaleRegistersNum++; registers_per_thread_per_radix[i] *= temp_scaleRegistersNum; } } if (min_registers_per_thread > registers_per_thread) { uint64_t temp = min_registers_per_thread; min_registers_per_thread = registers_per_thread; registers_per_thread = temp; } for (uint64_t i = 2; i < 14; i++) { if (registers_per_thread_per_radix[i] > registers_per_thread) { registers_per_thread = registers_per_thread_per_radix[i]; } if ((registers_per_thread_per_radix[i] > 0) && (registers_per_thread_per_radix[i] < min_registers_per_thread)) { min_registers_per_thread = registers_per_thread_per_radix[i]; } } if ((loc_multipliers[3] >= 2) && (((registers_per_thread / min_registers_per_thread) % 3) == 0)) { registers_per_thread /= 3; for (uint64_t i = 2; i < 14; i++) { if (registers_per_thread_per_radix[i] % 9 
== 0) { registers_per_thread_per_radix[i] /= 3; } } for (uint64_t i = 2; i < 14; i++) { if (registers_per_thread_per_radix[i] > registers_per_thread) { registers_per_thread = registers_per_thread_per_radix[i]; } if ((registers_per_thread_per_radix[i] > 0) && (registers_per_thread_per_radix[i] < min_registers_per_thread)) { min_registers_per_thread = registers_per_thread_per_radix[i]; } } } } uint64_t j = 0; axes[k].specializationConstants.registerBoost = registerBoost; axes[k].specializationConstants.registers_per_thread = registers_per_thread; axes[k].specializationConstants.min_registers_per_thread = min_registers_per_thread; for (uint64_t i = 2; i < 14; i++) { axes[k].specializationConstants.registers_per_thread_per_radix[i] = registers_per_thread_per_radix[i]; } axes[k].specializationConstants.numStages = 0; axes[k].specializationConstants.fftDim = locAxisSplit[k]; uint64_t tempRegisterBoost = registerBoost;// ((axis_id == nonStridedAxisId) && ((!app->configuration.reorderFourStep)||(app->useBluesteinFFT[axis_id]))) ? 
// (uint64_t)ceil(axes[k].specializationConstants.fftDim / (double)maxSingleSizeNonStrided) : (uint64_t)ceil(axes[k].specializationConstants.fftDim / (double)maxSingleSizeStrided);
uint64_t switchRegisterBoost = 0; if (tempRegisterBoost > 1) { if (loc_multipliers[tempRegisterBoost] > 0) { loc_multipliers[tempRegisterBoost]--; switchRegisterBoost = tempRegisterBoost; } else { for (uint64_t i = 14; i > 1; i--) { if (loc_multipliers[i] > 0) { loc_multipliers[i]--; switchRegisterBoost = i; i = 1; } } } } for (uint64_t i = 14; i > 1; i--) { if (loc_multipliers[i] > 0) { axes[k].specializationConstants.stageRadix[j] = i; loc_multipliers[i]--; i++; j++; axes[k].specializationConstants.numStages++; } } if (switchRegisterBoost > 0) { axes[k].specializationConstants.stageRadix[axes[k].specializationConstants.numStages] = switchRegisterBoost; axes[k].specializationConstants.numStages++; } else { if (min_registers_per_thread != registers_per_thread) { for (uint64_t i = 0; i < axes[k].specializationConstants.numStages; i++) { if (axes[k].specializationConstants.registers_per_thread_per_radix[axes[k].specializationConstants.stageRadix[i]] == min_registers_per_thread) { j = axes[k].specializationConstants.stageRadix[i]; axes[k].specializationConstants.stageRadix[i] = axes[k].specializationConstants.stageRadix[0]; axes[k].specializationConstants.stageRadix[0] = j; i = axes[k].specializationConstants.numStages; } } } } } return VKFFT_SUCCESS; } static inline VkFFTResult VkFFTGeneratePhaseVectors(VkFFTApplication* app, VkFFTPlan* FFTPlan, uint64_t axis_id, uint64_t supportAxis) { //generate two arrays used for Bluestein convolution and post-convolution multiplication
double double_PI = 3.1415926535897932384626433832795; VkFFTResult resFFT = VKFFT_SUCCESS; VkFFTApplication kernelPreparationApplication = {}; VkFFTConfiguration kernelPreparationConfiguration = {}; kernelPreparationConfiguration.FFTdim = 1; kernelPreparationConfiguration.size[0] =
FFTPlan->actualFFTSizePerAxis[axis_id][axis_id]; kernelPreparationConfiguration.size[1] = 1; kernelPreparationConfiguration.size[2] = 1; kernelPreparationConfiguration.doublePrecision = app->configuration.doublePrecision; kernelPreparationConfiguration.useLUT = 1; kernelPreparationConfiguration.registerBoost = 1; kernelPreparationConfiguration.disableReorderFourStep = 1; kernelPreparationConfiguration.saveApplicationToString = app->configuration.saveApplicationToString; kernelPreparationConfiguration.loadApplicationFromString = app->configuration.loadApplicationFromString; if (kernelPreparationConfiguration.loadApplicationFromString) {
#if((VKFFT_BACKEND==0)||(VKFFT_BACKEND==2))
kernelPreparationConfiguration.loadApplicationString = (void*)((uint32_t*)app->configuration.loadApplicationString + app->currentApplicationStringPos);
#else
kernelPreparationConfiguration.loadApplicationString = (void*)((char*)app->configuration.loadApplicationString + app->currentApplicationStringPos);
#endif
} kernelPreparationConfiguration.performBandwidthBoost = (app->configuration.performBandwidthBoost > 0) ?
app->configuration.performBandwidthBoost : 2; if (axis_id == 0) kernelPreparationConfiguration.performBandwidthBoost = 0; if (axis_id > 0) kernelPreparationConfiguration.considerAllAxesStrided = 1; if (app->configuration.tempBuffer) { kernelPreparationConfiguration.userTempBuffer = 1; kernelPreparationConfiguration.tempBuffer = app->configuration.tempBuffer; kernelPreparationConfiguration.tempBufferSize = app->configuration.tempBufferSize; kernelPreparationConfiguration.tempBufferNum = app->configuration.tempBufferNum; } kernelPreparationConfiguration.device = app->configuration.device;
#if(VKFFT_BACKEND==0)
kernelPreparationConfiguration.queue = app->configuration.queue; //to allocate memory for LUT, we have to pass a queue, vkGPU->fence, commandPool and physicalDevice pointers
kernelPreparationConfiguration.fence = app->configuration.fence; kernelPreparationConfiguration.commandPool = app->configuration.commandPool; kernelPreparationConfiguration.physicalDevice = app->configuration.physicalDevice; kernelPreparationConfiguration.isCompilerInitialized = 1;//compiler can be initialized before VkFFT plan creation. if not, VkFFT will create and destroy one after initialization
kernelPreparationConfiguration.tempBufferDeviceMemory = app->configuration.tempBufferDeviceMemory;
#elif(VKFFT_BACKEND==3)
kernelPreparationConfiguration.platform = app->configuration.platform; kernelPreparationConfiguration.context = app->configuration.context;
#endif
uint64_t bufferSize = (uint64_t)sizeof(float) * 2 * kernelPreparationConfiguration.size[0] * kernelPreparationConfiguration.size[1] * kernelPreparationConfiguration.size[2]; if (kernelPreparationConfiguration.doublePrecision) bufferSize *= sizeof(double) / sizeof(float); app->bufferBluesteinSize[axis_id] = bufferSize; kernelPreparationConfiguration.inputBufferSize = &app->bufferBluesteinSize[axis_id]; kernelPreparationConfiguration.bufferSize = &app->bufferBluesteinSize[axis_id]; kernelPreparationConfiguration.isInputFormatted = 1; resFFT = initializeVkFFT(&kernelPreparationApplication, kernelPreparationConfiguration); if (resFFT != VKFFT_SUCCESS) return resFFT; if (kernelPreparationConfiguration.loadApplicationFromString) { app->currentApplicationStringPos += kernelPreparationApplication.currentApplicationStringPos; }
#if(VKFFT_BACKEND==0)
VkResult res = VK_SUCCESS; resFFT = allocateFFTBuffer(app, &app->bufferBluestein[axis_id], &app->bufferBluesteinDeviceMemory[axis_id], VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT, VK_MEMORY_HEAP_DEVICE_LOCAL_BIT, bufferSize); if (resFFT != VKFFT_SUCCESS) return resFFT; if (!app->configuration.makeInversePlanOnly) { resFFT = allocateFFTBuffer(app, &app->bufferBluesteinFFT[axis_id], &app->bufferBluesteinFFTDeviceMemory[axis_id], VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT, VK_MEMORY_HEAP_DEVICE_LOCAL_BIT, bufferSize); if (resFFT != VKFFT_SUCCESS) return resFFT; } if (!app->configuration.makeForwardPlanOnly) { resFFT = allocateFFTBuffer(app, &app->bufferBluesteinIFFT[axis_id],
&app->bufferBluesteinIFFTDeviceMemory[axis_id], VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT, VK_MEMORY_HEAP_DEVICE_LOCAL_BIT, bufferSize); if (resFFT != VKFFT_SUCCESS) return resFFT; } #elif(VKFFT_BACKEND==1) cudaError_t res = cudaSuccess; res = cudaMalloc((void**)&app->bufferBluestein[axis_id], bufferSize); if (res != cudaSuccess) return VKFFT_ERROR_FAILED_TO_ALLOCATE; if (!app->configuration.makeInversePlanOnly) { res = cudaMalloc((void**)&app->bufferBluesteinFFT[axis_id], bufferSize); if (res != cudaSuccess) return VKFFT_ERROR_FAILED_TO_ALLOCATE; } if (!app->configuration.makeForwardPlanOnly) { res = cudaMalloc((void**)&app->bufferBluesteinIFFT[axis_id], bufferSize); if (res != cudaSuccess) return VKFFT_ERROR_FAILED_TO_ALLOCATE; } #elif(VKFFT_BACKEND==2) hipError_t res = hipSuccess; res = hipMalloc((void**)&app->bufferBluestein[axis_id], bufferSize); if (res != hipSuccess) return VKFFT_ERROR_FAILED_TO_ALLOCATE; if (!app->configuration.makeInversePlanOnly) { res = hipMalloc((void**)&app->bufferBluesteinFFT[axis_id], bufferSize); if (res != hipSuccess) return VKFFT_ERROR_FAILED_TO_ALLOCATE; } if (!app->configuration.makeForwardPlanOnly) { res = hipMalloc((void**)&app->bufferBluesteinIFFT[axis_id], bufferSize); if (res != hipSuccess) return VKFFT_ERROR_FAILED_TO_ALLOCATE; } #elif(VKFFT_BACKEND==3) cl_int res = CL_SUCCESS; app->bufferBluestein[axis_id] = clCreateBuffer(app->configuration.context[0], CL_MEM_READ_WRITE, bufferSize, 0, &res); if (res != CL_SUCCESS) return VKFFT_ERROR_FAILED_TO_ALLOCATE; if (!app->configuration.makeInversePlanOnly) { app->bufferBluesteinFFT[axis_id] = clCreateBuffer(app->configuration.context[0], CL_MEM_READ_WRITE, bufferSize, 0, &res); if (res != CL_SUCCESS) return VKFFT_ERROR_FAILED_TO_ALLOCATE; } if (!app->configuration.makeForwardPlanOnly) { app->bufferBluesteinIFFT[axis_id] = clCreateBuffer(app->configuration.context[0], CL_MEM_READ_WRITE, bufferSize, 0, &res); if (res != 
CL_SUCCESS) return VKFFT_ERROR_FAILED_TO_ALLOCATE; } cl_command_queue commandQueue = clCreateCommandQueue(app->configuration.context[0], app->configuration.device[0], 0, &res); if (res != CL_SUCCESS) return VKFFT_ERROR_FAILED_TO_CREATE_COMMAND_QUEUE; #endif void* phaseVectors = malloc(bufferSize); if (!phaseVectors) { deleteVkFFT(&kernelPreparationApplication); deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } uint64_t phaseVectorsNonZeroSize = (((app->configuration.performDCT == 4) && (app->configuration.size[axis_id] % 2 == 0)) || ((FFTPlan->multiUploadR2C) && (axis_id == 0))) ? app->configuration.size[axis_id] / 2 : app->configuration.size[axis_id]; if (app->configuration.performDCT == 1) phaseVectorsNonZeroSize = 2 * app->configuration.size[axis_id] - 2; if ((FFTPlan->numAxisUploads[axis_id] > 1) && (!app->configuration.makeForwardPlanOnly)) { if (kernelPreparationConfiguration.doublePrecision) { double* phaseVectors_cast = (double*)phaseVectors; for (uint64_t i = 0; i < FFTPlan->actualFFTSizePerAxis[axis_id][axis_id]; i++) { uint64_t rm = (i * i) % (2 * phaseVectorsNonZeroSize); double angle = double_PI * rm / phaseVectorsNonZeroSize; phaseVectors_cast[2 * i] = (i < phaseVectorsNonZeroSize) ? (double)cos(angle) : 0; phaseVectors_cast[2 * i + 1] = (i < phaseVectorsNonZeroSize) ? (double)-sin(angle) : 0; } for (uint64_t i = 1; i < phaseVectorsNonZeroSize; i++) { phaseVectors_cast[2 * (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] - i)] = phaseVectors_cast[2 * i]; phaseVectors_cast[2 * (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] - i) + 1] = phaseVectors_cast[2 * i + 1]; } } else { float* phaseVectors_cast = (float*)phaseVectors; for (uint64_t i = 0; i < FFTPlan->actualFFTSizePerAxis[axis_id][axis_id]; i++) { uint64_t rm = (i * i) % (2 * phaseVectorsNonZeroSize); double angle = double_PI * rm / phaseVectorsNonZeroSize; phaseVectors_cast[2 * i] = (i < phaseVectorsNonZeroSize) ? 
(float)cos(angle) : 0; phaseVectors_cast[2 * i + 1] = (i < phaseVectorsNonZeroSize) ? (float)-sin(angle) : 0; } for (uint64_t i = 1; i < phaseVectorsNonZeroSize; i++) { phaseVectors_cast[2 * (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] - i)] = phaseVectors_cast[2 * i]; phaseVectors_cast[2 * (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] - i) + 1] = phaseVectors_cast[2 * i + 1]; } } #if(VKFFT_BACKEND==0) resFFT = transferDataFromCPU(&kernelPreparationApplication, phaseVectors, &app->bufferBluestein[axis_id], bufferSize); if (resFFT != VKFFT_SUCCESS) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return resFFT; } #elif(VKFFT_BACKEND==1) res = cudaMemcpy(app->bufferBluestein[axis_id], phaseVectors, bufferSize, cudaMemcpyHostToDevice); if (res != cudaSuccess) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_COPY; } #elif(VKFFT_BACKEND==2) res = hipMemcpy(app->bufferBluestein[axis_id], phaseVectors, bufferSize, hipMemcpyHostToDevice); if (res != hipSuccess) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_COPY; } #elif(VKFFT_BACKEND==3) res = clEnqueueWriteBuffer(commandQueue, app->bufferBluestein[axis_id], CL_TRUE, 0, bufferSize, phaseVectors, 0, NULL, NULL); if (res != CL_SUCCESS) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_COPY; } #endif #if(VKFFT_BACKEND==0) { VkCommandBufferAllocateInfo commandBufferAllocateInfo = { VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO }; commandBufferAllocateInfo.commandPool = kernelPreparationApplication.configuration.commandPool[0]; commandBufferAllocateInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY; commandBufferAllocateInfo.commandBufferCount = 1; VkCommandBuffer commandBuffer = {}; res = vkAllocateCommandBuffers(kernelPreparationApplication.configuration.device[0], &commandBufferAllocateInfo, &commandBuffer); if (res != 0) { free(phaseVectors); 
deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_ALLOCATE_COMMAND_BUFFERS; } VkCommandBufferBeginInfo commandBufferBeginInfo = { VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO }; commandBufferBeginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT; res = vkBeginCommandBuffer(commandBuffer, &commandBufferBeginInfo); if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_BEGIN_COMMAND_BUFFER; } VkFFTLaunchParams launchParams = {}; launchParams.commandBuffer = &commandBuffer; launchParams.inputBuffer = &app->bufferBluestein[axis_id]; launchParams.buffer = &app->bufferBluesteinIFFT[axis_id]; //Record commands resFFT = VkFFTAppend(&kernelPreparationApplication, -1, &launchParams); if (resFFT != VKFFT_SUCCESS) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return resFFT; } res = vkEndCommandBuffer(commandBuffer); if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_END_COMMAND_BUFFER; } VkSubmitInfo submitInfo = { VK_STRUCTURE_TYPE_SUBMIT_INFO }; submitInfo.commandBufferCount = 1; submitInfo.pCommandBuffers = &commandBuffer; res = vkQueueSubmit(kernelPreparationApplication.configuration.queue[0], 1, &submitInfo, kernelPreparationApplication.configuration.fence[0]); if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_SUBMIT_QUEUE; } res = vkWaitForFences(kernelPreparationApplication.configuration.device[0], 1, kernelPreparationApplication.configuration.fence, VK_TRUE, 100000000000); if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_WAIT_FOR_FENCES; } res = vkResetFences(kernelPreparationApplication.configuration.device[0], 1, kernelPreparationApplication.configuration.fence); if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_RESET_FENCES; } 
vkFreeCommandBuffers(kernelPreparationApplication.configuration.device[0], kernelPreparationApplication.configuration.commandPool[0], 1, &commandBuffer); } #elif(VKFFT_BACKEND==1) VkFFTLaunchParams launchParams = {}; launchParams.inputBuffer = &app->bufferBluestein[axis_id]; launchParams.buffer = &app->bufferBluesteinIFFT[axis_id]; resFFT = VkFFTAppend(&kernelPreparationApplication, -1, &launchParams); if (resFFT != VKFFT_SUCCESS) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return resFFT; } res = cudaDeviceSynchronize(); if (res != cudaSuccess) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE; } #elif(VKFFT_BACKEND==2) VkFFTLaunchParams launchParams = {}; launchParams.inputBuffer = &app->bufferBluestein[axis_id]; launchParams.buffer = &app->bufferBluesteinIFFT[axis_id]; resFFT = VkFFTAppend(&kernelPreparationApplication, -1, &launchParams); if (resFFT != VKFFT_SUCCESS) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return resFFT; } res = hipDeviceSynchronize(); if (res != hipSuccess) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE; } #elif(VKFFT_BACKEND==3) VkFFTLaunchParams launchParams = {}; launchParams.commandQueue = &commandQueue; launchParams.inputBuffer = &app->bufferBluestein[axis_id]; launchParams.buffer = &app->bufferBluesteinIFFT[axis_id]; resFFT = VkFFTAppend(&kernelPreparationApplication, -1, &launchParams); if (resFFT != VKFFT_SUCCESS) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return resFFT; } res = clFinish(commandQueue); if (res != CL_SUCCESS) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE; } #endif } if (kernelPreparationConfiguration.doublePrecision) { double* phaseVectors_cast = (double*)phaseVectors; for (uint64_t i = 0; i < FFTPlan->actualFFTSizePerAxis[axis_id][axis_id]; i++) { uint64_t rm = (i * i) % (2 * 
phaseVectorsNonZeroSize); double angle = double_PI * rm / phaseVectorsNonZeroSize; phaseVectors_cast[2 * i] = (i < phaseVectorsNonZeroSize) ? (double)cos(angle) : 0; phaseVectors_cast[2 * i + 1] = (i < phaseVectorsNonZeroSize) ? (double)sin(angle) : 0; } for (uint64_t i = 1; i < phaseVectorsNonZeroSize; i++) { phaseVectors_cast[2 * (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] - i)] = phaseVectors_cast[2 * i]; phaseVectors_cast[2 * (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] - i) + 1] = phaseVectors_cast[2 * i + 1]; } } else { float* phaseVectors_cast = (float*)phaseVectors; for (uint64_t i = 0; i < FFTPlan->actualFFTSizePerAxis[axis_id][axis_id]; i++) { uint64_t rm = (i * i) % (2 * phaseVectorsNonZeroSize); double angle = double_PI * rm / phaseVectorsNonZeroSize; phaseVectors_cast[2 * i] = (i < phaseVectorsNonZeroSize) ? (float)cos(angle) : 0; phaseVectors_cast[2 * i + 1] = (i < phaseVectorsNonZeroSize) ? (float)sin(angle) : 0; } for (uint64_t i = 1; i < phaseVectorsNonZeroSize; i++) { phaseVectors_cast[2 * (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] - i)] = phaseVectors_cast[2 * i]; phaseVectors_cast[2 * (FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] - i) + 1] = phaseVectors_cast[2 * i + 1]; } } #if(VKFFT_BACKEND==0) resFFT = transferDataFromCPU(&kernelPreparationApplication, phaseVectors, &app->bufferBluestein[axis_id], bufferSize); if (resFFT != VKFFT_SUCCESS) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return resFFT; } #elif(VKFFT_BACKEND==1) res = cudaMemcpy(app->bufferBluestein[axis_id], phaseVectors, bufferSize, cudaMemcpyHostToDevice); if (res != cudaSuccess) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_COPY; } #elif(VKFFT_BACKEND==2) res = hipMemcpy(app->bufferBluestein[axis_id], phaseVectors, bufferSize, hipMemcpyHostToDevice); if (res != hipSuccess) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_COPY; } 
#elif(VKFFT_BACKEND==3) res = clEnqueueWriteBuffer(commandQueue, app->bufferBluestein[axis_id], CL_TRUE, 0, bufferSize, phaseVectors, 0, NULL, NULL); if (res != CL_SUCCESS) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_COPY; } #endif #if(VKFFT_BACKEND==0) if (!app->configuration.makeInversePlanOnly) { VkCommandBufferAllocateInfo commandBufferAllocateInfo = { VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO }; commandBufferAllocateInfo.commandPool = kernelPreparationApplication.configuration.commandPool[0]; commandBufferAllocateInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY; commandBufferAllocateInfo.commandBufferCount = 1; VkCommandBuffer commandBuffer = {}; res = vkAllocateCommandBuffers(kernelPreparationApplication.configuration.device[0], &commandBufferAllocateInfo, &commandBuffer); if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_ALLOCATE_COMMAND_BUFFERS; } VkCommandBufferBeginInfo commandBufferBeginInfo = { VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO }; commandBufferBeginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT; res = vkBeginCommandBuffer(commandBuffer, &commandBufferBeginInfo); if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_BEGIN_COMMAND_BUFFER; } VkFFTLaunchParams launchParams = {}; launchParams.commandBuffer = &commandBuffer; launchParams.inputBuffer = &app->bufferBluestein[axis_id]; launchParams.buffer = &app->bufferBluesteinFFT[axis_id]; //Record commands resFFT = VkFFTAppend(&kernelPreparationApplication, -1, &launchParams); if (resFFT != VKFFT_SUCCESS) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return resFFT; } res = vkEndCommandBuffer(commandBuffer); if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_END_COMMAND_BUFFER; } VkSubmitInfo submitInfo = { VK_STRUCTURE_TYPE_SUBMIT_INFO }; 
submitInfo.commandBufferCount = 1; submitInfo.pCommandBuffers = &commandBuffer; res = vkQueueSubmit(kernelPreparationApplication.configuration.queue[0], 1, &submitInfo, kernelPreparationApplication.configuration.fence[0]); if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_SUBMIT_QUEUE; } res = vkWaitForFences(kernelPreparationApplication.configuration.device[0], 1, kernelPreparationApplication.configuration.fence, VK_TRUE, 100000000000); if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_WAIT_FOR_FENCES; } res = vkResetFences(kernelPreparationApplication.configuration.device[0], 1, kernelPreparationApplication.configuration.fence); if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_RESET_FENCES; } vkFreeCommandBuffers(kernelPreparationApplication.configuration.device[0], kernelPreparationApplication.configuration.commandPool[0], 1, &commandBuffer); } if ((FFTPlan->numAxisUploads[axis_id] == 1) && (!app->configuration.makeForwardPlanOnly)) { VkCommandBufferAllocateInfo commandBufferAllocateInfo = { VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO }; commandBufferAllocateInfo.commandPool = kernelPreparationApplication.configuration.commandPool[0]; commandBufferAllocateInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY; commandBufferAllocateInfo.commandBufferCount = 1; VkCommandBuffer commandBuffer = {}; res = vkAllocateCommandBuffers(kernelPreparationApplication.configuration.device[0], &commandBufferAllocateInfo, &commandBuffer); if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_ALLOCATE_COMMAND_BUFFERS; } VkCommandBufferBeginInfo commandBufferBeginInfo = { VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO }; commandBufferBeginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT; res = vkBeginCommandBuffer(commandBuffer, &commandBufferBeginInfo); 
if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_BEGIN_COMMAND_BUFFER; } VkFFTLaunchParams launchParams = {}; launchParams.commandBuffer = &commandBuffer; launchParams.inputBuffer = &app->bufferBluestein[axis_id]; launchParams.buffer = &app->bufferBluesteinIFFT[axis_id]; //Record commands resFFT = VkFFTAppend(&kernelPreparationApplication, 1, &launchParams); if (resFFT != VKFFT_SUCCESS) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return resFFT; } res = vkEndCommandBuffer(commandBuffer); if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_END_COMMAND_BUFFER; } VkSubmitInfo submitInfo = { VK_STRUCTURE_TYPE_SUBMIT_INFO }; submitInfo.commandBufferCount = 1; submitInfo.pCommandBuffers = &commandBuffer; res = vkQueueSubmit(kernelPreparationApplication.configuration.queue[0], 1, &submitInfo, kernelPreparationApplication.configuration.fence[0]); if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_SUBMIT_QUEUE; } res = vkWaitForFences(kernelPreparationApplication.configuration.device[0], 1, kernelPreparationApplication.configuration.fence, VK_TRUE, 100000000000); if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_WAIT_FOR_FENCES; } res = vkResetFences(kernelPreparationApplication.configuration.device[0], 1, kernelPreparationApplication.configuration.fence); if (res != 0) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_RESET_FENCES; } vkFreeCommandBuffers(kernelPreparationApplication.configuration.device[0], kernelPreparationApplication.configuration.commandPool[0], 1, &commandBuffer); } #elif(VKFFT_BACKEND==1) VkFFTLaunchParams launchParams = {}; launchParams.inputBuffer = &app->bufferBluestein[axis_id]; if (!app->configuration.makeInversePlanOnly) { launchParams.buffer = 
&app->bufferBluesteinFFT[axis_id]; resFFT = VkFFTAppend(&kernelPreparationApplication, -1, &launchParams); if (resFFT != VKFFT_SUCCESS) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return resFFT; } res = cudaDeviceSynchronize(); if (res != cudaSuccess) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE; } } if ((FFTPlan->numAxisUploads[axis_id] == 1) && (!app->configuration.makeForwardPlanOnly)) { launchParams.buffer = &app->bufferBluesteinIFFT[axis_id]; resFFT = VkFFTAppend(&kernelPreparationApplication, 1, &launchParams); if (resFFT != VKFFT_SUCCESS) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return resFFT; } res = cudaDeviceSynchronize(); if (res != cudaSuccess) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE; } } #elif(VKFFT_BACKEND==2) VkFFTLaunchParams launchParams = {}; launchParams.inputBuffer = &app->bufferBluestein[axis_id]; if (!app->configuration.makeInversePlanOnly) { launchParams.buffer = &app->bufferBluesteinFFT[axis_id]; resFFT = VkFFTAppend(&kernelPreparationApplication, -1, &launchParams); if (resFFT != VKFFT_SUCCESS) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return resFFT; } res = hipDeviceSynchronize(); if (res != hipSuccess) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE; } } if ((FFTPlan->numAxisUploads[axis_id] == 1) && (!app->configuration.makeForwardPlanOnly)) { launchParams.buffer = &app->bufferBluesteinIFFT[axis_id]; resFFT = VkFFTAppend(&kernelPreparationApplication, 1, &launchParams); if (resFFT != VKFFT_SUCCESS) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return resFFT; } res = hipDeviceSynchronize(); if (res != hipSuccess) { free(phaseVectors); deleteVkFFT(&kernelPreparationApplication); return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE; } } #elif(VKFFT_BACKEND==3) VkFFTLaunchParams 
	launchParams = {};
	launchParams.commandQueue = &commandQueue;
	launchParams.inputBuffer = &app->bufferBluestein[axis_id];
	if (!app->configuration.makeInversePlanOnly) {
		// Forward transform (direction -1): bufferBluestein -> bufferBluesteinFFT.
		launchParams.buffer = &app->bufferBluesteinFFT[axis_id];
		resFFT = VkFFTAppend(&kernelPreparationApplication, -1, &launchParams);
		if (resFFT != VKFFT_SUCCESS) {
			free(phaseVectors);
			deleteVkFFT(&kernelPreparationApplication);
			return resFFT;
		}
		res = clFinish(commandQueue);
		if (res != CL_SUCCESS) {
			free(phaseVectors);
			deleteVkFFT(&kernelPreparationApplication);
			return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE;
		}
	}
	if ((FFTPlan->numAxisUploads[axis_id] == 1) && (!app->configuration.makeForwardPlanOnly)) {
		// Inverse transform (direction 1): bufferBluestein -> bufferBluesteinIFFT.
		launchParams.buffer = &app->bufferBluesteinIFFT[axis_id];
		resFFT = VkFFTAppend(&kernelPreparationApplication, 1, &launchParams);
		if (resFFT != VKFFT_SUCCESS) {
			free(phaseVectors);
			deleteVkFFT(&kernelPreparationApplication);
			return resFFT;
		}
		res = clFinish(commandQueue);
		if (res != CL_SUCCESS) {
			free(phaseVectors);
			deleteVkFFT(&kernelPreparationApplication);
			return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE;
		}
	}
#endif
	free(phaseVectors);
#if(VKFFT_BACKEND==0)
	kernelPreparationApplication.configuration.isCompilerInitialized = 0;
#elif(VKFFT_BACKEND==3)
	res = clReleaseCommandQueue(commandQueue);
	if (res != CL_SUCCESS) return VKFFT_ERROR_FAILED_TO_RELEASE_COMMAND_QUEUE;
#endif
	// If requested, keep a copy of the generated application string for this axis.
	if (kernelPreparationConfiguration.saveApplicationToString) {
		app->applicationBluesteinStringSize[axis_id] = kernelPreparationApplication.applicationStringSize;
		app->applicationBluesteinString[axis_id] = calloc(app->applicationBluesteinStringSize[axis_id], 1);
		if (!app->applicationBluesteinString[axis_id]) {
			deleteVkFFT(&kernelPreparationApplication);
			return VKFFT_ERROR_MALLOC_FAILED;
		}
		memcpy(app->applicationBluesteinString[axis_id], kernelPreparationApplication.saveApplicationString, app->applicationBluesteinStringSize[axis_id]);
	}
	deleteVkFFT(&kernelPreparationApplication);
	return resFFT;
}
static inline VkFFTResult
VkFFTCheckUpdateBufferSet(VkFFTApplication* app, VkFFTAxis* axis, uint64_t planStage, VkFFTLaunchParams* launchParams) { uint64_t performBufferSetUpdate = planStage; uint64_t performOffsetUpdate = planStage; if (!planStage) { if (launchParams != 0) { if ((launchParams->buffer != 0) && (app->configuration.buffer != launchParams->buffer)) { app->configuration.buffer = launchParams->buffer; performBufferSetUpdate = 1; } if ((launchParams->inputBuffer != 0) && (app->configuration.inputBuffer != launchParams->inputBuffer)) { app->configuration.inputBuffer = launchParams->inputBuffer; performBufferSetUpdate = 1; } if ((launchParams->outputBuffer != 0) && (app->configuration.outputBuffer != launchParams->outputBuffer)) { app->configuration.outputBuffer = launchParams->outputBuffer; performBufferSetUpdate = 1; } if ((launchParams->tempBuffer != 0) && (app->configuration.tempBuffer != launchParams->tempBuffer)) { app->configuration.tempBuffer = launchParams->tempBuffer; performBufferSetUpdate = 1; } if ((launchParams->kernel != 0) && (app->configuration.kernel != launchParams->kernel)) { app->configuration.kernel = launchParams->kernel; performBufferSetUpdate = 1; } if (app->configuration.inputBuffer == 0) app->configuration.inputBuffer = app->configuration.buffer; if (app->configuration.outputBuffer == 0) app->configuration.outputBuffer = app->configuration.buffer; if (app->configuration.bufferOffset != launchParams->bufferOffset) { app->configuration.bufferOffset = launchParams->bufferOffset; performOffsetUpdate = 1; } if (app->configuration.inputBufferOffset != launchParams->inputBufferOffset) { app->configuration.inputBufferOffset = launchParams->inputBufferOffset; performOffsetUpdate = 1; } if (app->configuration.outputBufferOffset != launchParams->outputBufferOffset) { app->configuration.outputBufferOffset = launchParams->outputBufferOffset; performOffsetUpdate = 1; } if (app->configuration.tempBufferOffset != launchParams->tempBufferOffset) { 
app->configuration.tempBufferOffset = launchParams->tempBufferOffset; performOffsetUpdate = 1; } if (app->configuration.kernelOffset != launchParams->kernelOffset) { app->configuration.kernelOffset = launchParams->kernelOffset; performOffsetUpdate = 1; } } } if (planStage) { if (app->configuration.buffer == 0) { performBufferSetUpdate = 0; } if ((app->configuration.isInputFormatted) && (app->configuration.inputBuffer == 0)) { performBufferSetUpdate = 0; } if ((app->configuration.isOutputFormatted) && (app->configuration.outputBuffer == 0)) { performBufferSetUpdate = 0; } if ((app->configuration.userTempBuffer) && (app->configuration.tempBuffer == 0)) { performBufferSetUpdate = 0; } if ((app->configuration.performConvolution) && (app->configuration.kernel == 0)) { performBufferSetUpdate = 0; } } else { if (app->configuration.buffer == 0) { return VKFFT_ERROR_EMPTY_buffer; } if ((app->configuration.isInputFormatted) && (app->configuration.inputBuffer == 0)) { return VKFFT_ERROR_EMPTY_inputBuffer; } if ((app->configuration.isOutputFormatted) && (app->configuration.outputBuffer == 0)) { return VKFFT_ERROR_EMPTY_outputBuffer; } if ((app->configuration.userTempBuffer) && (app->configuration.tempBuffer == 0)) { return VKFFT_ERROR_EMPTY_tempBuffer; } if ((app->configuration.performConvolution) && (app->configuration.kernel == 0)) { return VKFFT_ERROR_EMPTY_kernel; } } if (performBufferSetUpdate) { if (planStage) axis->specializationConstants.performBufferSetUpdate = 1; else { if (!app->configuration.makeInversePlanOnly) { for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { for (uint64_t j = 0; j < app->localFFTPlan->numAxisUploads[i]; j++) app->localFFTPlan->axes[i][j].specializationConstants.performBufferSetUpdate = 1; if (app->useBluesteinFFT[i] && (app->localFFTPlan->numAxisUploads[i] > 1)) { for (uint64_t j = 1; j < app->localFFTPlan->numAxisUploads[i]; j++) app->localFFTPlan->inverseBluesteinAxes[i][j].specializationConstants.performBufferSetUpdate = 1; } } if 
(app->localFFTPlan->multiUploadR2C) { app->localFFTPlan->R2Cdecomposition.specializationConstants.performBufferSetUpdate = 1; } } if (!app->configuration.makeForwardPlanOnly) { for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { for (uint64_t j = 0; j < app->localFFTPlan_inverse->numAxisUploads[i]; j++) app->localFFTPlan_inverse->axes[i][j].specializationConstants.performBufferSetUpdate = 1; if (app->useBluesteinFFT[i] && (app->localFFTPlan_inverse->numAxisUploads[i] > 1)) { for (uint64_t j = 1; j < app->localFFTPlan_inverse->numAxisUploads[i]; j++) app->localFFTPlan_inverse->inverseBluesteinAxes[i][j].specializationConstants.performBufferSetUpdate = 1; } } if (app->localFFTPlan_inverse->multiUploadR2C) { app->localFFTPlan_inverse->R2Cdecomposition.specializationConstants.performBufferSetUpdate = 1; } } } } if (performOffsetUpdate) { if (planStage) axis->specializationConstants.performOffsetUpdate = 1; else { if (!app->configuration.makeInversePlanOnly) { for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { for (uint64_t j = 0; j < app->localFFTPlan->numAxisUploads[i]; j++) app->localFFTPlan->axes[i][j].specializationConstants.performOffsetUpdate = 1; if (app->useBluesteinFFT[i] && (app->localFFTPlan->numAxisUploads[i] > 1)) { for (uint64_t j = 1; j < app->localFFTPlan->numAxisUploads[i]; j++) app->localFFTPlan->inverseBluesteinAxes[i][j].specializationConstants.performOffsetUpdate = 1; } } if (app->localFFTPlan->multiUploadR2C) { app->localFFTPlan->R2Cdecomposition.specializationConstants.performOffsetUpdate = 1; } } if (!app->configuration.makeForwardPlanOnly) { for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { for (uint64_t j = 0; j < app->localFFTPlan_inverse->numAxisUploads[i]; j++) app->localFFTPlan_inverse->axes[i][j].specializationConstants.performOffsetUpdate = 1; if (app->useBluesteinFFT[i] && (app->localFFTPlan_inverse->numAxisUploads[i] > 1)) { for (uint64_t j = 1; j < app->localFFTPlan_inverse->numAxisUploads[i]; j++) 
app->localFFTPlan_inverse->inverseBluesteinAxes[i][j].specializationConstants.performOffsetUpdate = 1; } } if (app->localFFTPlan_inverse->multiUploadR2C) { app->localFFTPlan_inverse->R2Cdecomposition.specializationConstants.performOffsetUpdate = 1; } } } } return VKFFT_SUCCESS; } static inline VkFFTResult VkFFTUpdateBufferSet(VkFFTApplication* app, VkFFTPlan* FFTPlan, VkFFTAxis* axis, uint64_t axis_id, uint64_t axis_upload_id, uint64_t inverse) { if (axis->specializationConstants.performOffsetUpdate || axis->specializationConstants.performBufferSetUpdate) { #if(VKFFT_BACKEND==0) const VkDescriptorType descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER; #endif uint64_t storageComplexSize; if (app->configuration.doublePrecision) storageComplexSize = (2 * sizeof(double)); else if (app->configuration.halfPrecision) storageComplexSize = (2 * 2); else storageComplexSize = (2 * sizeof(float)); for (uint64_t i = 0; i < axis->numBindings; ++i) { for (uint64_t j = 0; j < axis->specializationConstants.numBuffersBound[i]; ++j) { #if(VKFFT_BACKEND==0) VkDescriptorBufferInfo descriptorBufferInfo = { 0 }; #endif if (i == 0) { if ((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (app->configuration.isInputFormatted) && (!axis->specializationConstants.reverseBluesteinMultiUpload) && ( ((axis_id == app->firstAxis) && (!inverse)) || ((axis_id == app->lastAxis) && (inverse) && (!((axis_id == 0) && (axis->specializationConstants.performR2CmultiUpload))) && (!app->configuration.performConvolution) && (!app->configuration.inverseReturnToInputBuffer))) ) { if (axis->specializationConstants.performBufferSetUpdate) { uint64_t bufferId = 0; uint64_t offset = j; if (app->configuration.inputBufferSize) { for (uint64_t l = 0; l < app->configuration.inputBufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.inputBufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= 
(uint64_t)ceil(app->configuration.inputBufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.inputBufferNum; } } } axis->inputBuffer = app->configuration.inputBuffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.inputBuffer[bufferId]; descriptorBufferInfo.range = (axis->specializationConstants.inputBufferBlockSize * storageComplexSize); descriptorBufferInfo.offset = offset * (axis->specializationConstants.inputBufferBlockSize * storageComplexSize); #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.inputOffset = app->configuration.inputBufferOffset; } } else { if ((axis_upload_id == 0) && (app->configuration.numberKernels > 1) && (inverse) && (!app->configuration.performConvolution)) { if (axis->specializationConstants.performBufferSetUpdate) { uint64_t bufferId = 0; uint64_t offset = j; if (app->configuration.outputBufferSize) { for (uint64_t l = 0; l < app->configuration.outputBufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.outputBufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.outputBufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.outputBufferNum; } } } axis->inputBuffer = app->configuration.outputBuffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.outputBuffer[bufferId]; descriptorBufferInfo.range = (axis->specializationConstants.inputBufferBlockSize * storageComplexSize); descriptorBufferInfo.offset = offset * (axis->specializationConstants.inputBufferBlockSize * storageComplexSize); #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.inputOffset = app->configuration.outputBufferOffset; } } else { uint64_t bufferId = 0; uint64_t offset = 
j; if (((axis->specializationConstants.reorderFourStep == 1) || (app->useBluesteinFFT[axis_id])) && (FFTPlan->numAxisUploads[axis_id] > 1)) { if ((((axis->specializationConstants.reorderFourStep == 1) && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1)) || (app->useBluesteinFFT[axis_id] && (axis->specializationConstants.reverseBluesteinMultiUpload == 0) && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1))) && (!((axis_id == 0) && (axis->specializationConstants.performR2CmultiUpload) && (axis->specializationConstants.reorderFourStep == 1) && (inverse == 1)))) { if (axis->specializationConstants.performBufferSetUpdate) { if (app->configuration.bufferSize) { for (uint64_t l = 0; l < app->configuration.bufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.bufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.bufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.bufferNum; } } } axis->inputBuffer = app->configuration.buffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.buffer[bufferId]; #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.inputOffset = app->configuration.bufferOffset; } } else { if (axis->specializationConstants.performBufferSetUpdate) { if (app->configuration.tempBufferSize) { for (uint64_t l = 0; l < app->configuration.tempBufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.tempBufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.tempBufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.tempBufferNum; } } } axis->inputBuffer = app->configuration.tempBuffer; #if(VKFFT_BACKEND==0) 
										descriptorBufferInfo.buffer = app->configuration.tempBuffer[bufferId];
#endif
									}
									if (axis->specializationConstants.performOffsetUpdate) {
										axis->specializationConstants.inputOffset = app->configuration.tempBufferOffset;
									}
								}
							}
							else {
								if (axis->specializationConstants.performBufferSetUpdate) {
									if (app->configuration.bufferSize) {
										for (uint64_t l = 0; l < app->configuration.bufferNum; ++l) {
											if (offset >= (uint64_t)ceil(app->configuration.bufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize))) {
												bufferId++;
												offset -= (uint64_t)ceil(app->configuration.bufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize));
											}
											else {
												l = app->configuration.bufferNum;
											}
										}
									}
									axis->inputBuffer = app->configuration.buffer;
#if(VKFFT_BACKEND==0)
									descriptorBufferInfo.buffer = app->configuration.buffer[bufferId];
#endif
								}
								if (axis->specializationConstants.performOffsetUpdate) {
									axis->specializationConstants.inputOffset = app->configuration.bufferOffset;
								}
							}
#if(VKFFT_BACKEND==0)
							if (axis->specializationConstants.performBufferSetUpdate) {
								descriptorBufferInfo.range = (axis->specializationConstants.inputBufferBlockSize * storageComplexSize);
								descriptorBufferInfo.offset = offset * (axis->specializationConstants.inputBufferBlockSize * storageComplexSize);
							}
#endif
						}
					}
					//descriptorBufferInfo.offset = 0;
				}
				if (i == 1) {
					if (((axis_upload_id == 0) && (!app->useBluesteinFFT[axis_id]) && (app->configuration.isOutputFormatted && (
						((axis_id == app->firstAxis) && (inverse))
						|| ((axis_id == app->lastAxis) && (!inverse) && (!app->configuration.performConvolution))
						|| ((axis_id == app->firstAxis) && (app->configuration.performConvolution) && (app->configuration.FFTdim == 1)))
						)) ||
						((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (app->useBluesteinFFT[axis_id]) && (axis->specializationConstants.reverseBluesteinMultiUpload || (FFTPlan->numAxisUploads[axis_id] == 1)) && (app->configuration.isOutputFormatted && (
((axis_id == app->firstAxis) && (inverse)) || ((axis_id == app->lastAxis) && (!inverse) && (!app->configuration.performConvolution))) )) || ((app->configuration.numberKernels > 1) && ( (inverse) || (axis_id == app->lastAxis))) ) { if (axis->specializationConstants.performBufferSetUpdate) { uint64_t bufferId = 0; uint64_t offset = j; if (app->configuration.outputBufferSize) { for (uint64_t l = 0; l < app->configuration.outputBufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.outputBufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.outputBufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.outputBufferNum; } } } axis->outputBuffer = app->configuration.outputBuffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.outputBuffer[bufferId]; descriptorBufferInfo.range = (axis->specializationConstants.outputBufferBlockSize * storageComplexSize); descriptorBufferInfo.offset = offset * (axis->specializationConstants.outputBufferBlockSize * storageComplexSize); #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.outputOffset = app->configuration.outputBufferOffset; } } else { uint64_t bufferId = 0; uint64_t offset = j; if (((axis->specializationConstants.reorderFourStep == 1) || (app->useBluesteinFFT[axis_id])) && (FFTPlan->numAxisUploads[axis_id] > 1)) { if ((inverse) && (axis_id == app->firstAxis) && ( ((axis_upload_id == 0) && (app->configuration.isInputFormatted) && (app->configuration.inverseReturnToInputBuffer) && (!app->useBluesteinFFT[axis_id])) || ((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (app->configuration.isInputFormatted) && (axis->specializationConstants.actualInverse) && (app->configuration.inverseReturnToInputBuffer) && (app->useBluesteinFFT[axis_id]) && 
(axis->specializationConstants.reverseBluesteinMultiUpload || (FFTPlan->numAxisUploads[axis_id] == 1)))) ) { if (axis->specializationConstants.performBufferSetUpdate) { if (app->configuration.inputBufferSize) { for (uint64_t l = 0; l < app->configuration.inputBufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.inputBufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.inputBufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.inputBufferNum; } } } axis->outputBuffer = app->configuration.inputBuffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.inputBuffer[bufferId]; #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.outputOffset = app->configuration.inputBufferOffset; } } else { if (((axis->specializationConstants.reorderFourStep == 1) && (axis_upload_id > 0)) || (app->useBluesteinFFT[axis_id] && (!((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (axis->specializationConstants.reverseBluesteinMultiUpload == 1))))) { if (axis->specializationConstants.performBufferSetUpdate) { if (app->configuration.tempBufferSize) { for (uint64_t l = 0; l < app->configuration.tempBufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.tempBufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.tempBufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.tempBufferNum; } } } axis->outputBuffer = app->configuration.tempBuffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.tempBuffer[bufferId]; #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.outputOffset 
= app->configuration.tempBufferOffset; } } else { if (axis->specializationConstants.performBufferSetUpdate) { if (app->configuration.bufferSize) { for (uint64_t l = 0; l < app->configuration.bufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.bufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.bufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.bufferNum; } } } axis->outputBuffer = app->configuration.buffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.buffer[bufferId]; #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.outputOffset = app->configuration.bufferOffset; } } } } else { if ((inverse) && (axis_id == app->firstAxis) && (axis_upload_id == 0) && (app->configuration.isInputFormatted) && (app->configuration.inverseReturnToInputBuffer)) { if (axis->specializationConstants.performBufferSetUpdate) { if (app->configuration.inputBufferSize) { for (uint64_t l = 0; l < app->configuration.inputBufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.inputBufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.inputBufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.inputBufferNum; } } } axis->outputBuffer = app->configuration.inputBuffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.inputBuffer[bufferId]; #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.outputOffset = app->configuration.inputBufferOffset; } } else { if (axis->specializationConstants.performBufferSetUpdate) { if (app->configuration.bufferSize) { for (uint64_t l = 0; l < 
										app->configuration.bufferNum; ++l) {
											if (offset >= (uint64_t)ceil(app->configuration.bufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize))) {
												bufferId++;
												offset -= (uint64_t)ceil(app->configuration.bufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize));
											}
											else {
												l = app->configuration.bufferNum;
											}
										}
									}
									axis->outputBuffer = app->configuration.buffer;
#if(VKFFT_BACKEND==0)
									descriptorBufferInfo.buffer = app->configuration.buffer[bufferId];
#endif
								}
								if (axis->specializationConstants.performOffsetUpdate) {
									axis->specializationConstants.outputOffset = app->configuration.bufferOffset;
								}
							}
						}
#if(VKFFT_BACKEND==0)
						if (axis->specializationConstants.performBufferSetUpdate) {
							descriptorBufferInfo.range = (axis->specializationConstants.outputBufferBlockSize * storageComplexSize);
							descriptorBufferInfo.offset = offset * (axis->specializationConstants.outputBufferBlockSize * storageComplexSize);
						}
#endif
					}
					//descriptorBufferInfo.offset = 0;
				}
				if ((i == axis->specializationConstants.convolutionBindingID) && (app->configuration.performConvolution)) {
					if (axis->specializationConstants.performBufferSetUpdate) {
						uint64_t bufferId = 0;
						uint64_t offset = j;
						if (app->configuration.kernelSize) {
							for (uint64_t l = 0; l < app->configuration.kernelNum; ++l) {
								if (offset >= (uint64_t)ceil(app->configuration.kernelSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize))) {
									bufferId++;
									offset -= (uint64_t)ceil(app->configuration.kernelSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize));
								}
								else {
									l = app->configuration.kernelNum;
								}
							}
						}
#if(VKFFT_BACKEND==0)
						descriptorBufferInfo.buffer = app->configuration.kernel[bufferId];
						descriptorBufferInfo.range = (axis->specializationConstants.kernelBlockSize * storageComplexSize);
						descriptorBufferInfo.offset = offset * (axis->specializationConstants.kernelBlockSize * storageComplexSize);
#endif
					}
					if (axis->specializationConstants.performOffsetUpdate) {
						axis->specializationConstants.kernelOffset = app->configuration.kernelOffset;
					}
				}
				// Lookup-table binding: in this function only the Vulkan backend (VKFFT_BACKEND==0) fills a descriptor for it.
				if ((i == axis->specializationConstants.LUTBindingID) && (app->configuration.useLUT)) {
#if(VKFFT_BACKEND==0)
					if (axis->specializationConstants.performBufferSetUpdate) {
						descriptorBufferInfo.buffer = axis->bufferLUT;
						descriptorBufferInfo.offset = 0;
						descriptorBufferInfo.range = axis->bufferLUTSize;
					}
#endif
				}
				// Bluestein convolution binding: uses the inverse or forward FFT buffer, selected by inverseBluestein.
				if ((i == axis->specializationConstants.BluesteinConvolutionBindingID) && (app->useBluesteinFFT[axis_id]) && (axis_upload_id == 0)) {
#if(VKFFT_BACKEND==0)
					if (axis->specializationConstants.performBufferSetUpdate) {
						if (axis->specializationConstants.inverseBluestein)
							descriptorBufferInfo.buffer = app->bufferBluesteinIFFT[axis_id];
						else
							descriptorBufferInfo.buffer = app->bufferBluesteinFFT[axis_id];
						descriptorBufferInfo.offset = 0;
						descriptorBufferInfo.range = app->bufferBluesteinSize[axis_id];
					}
#endif
				}
				// Bluestein multiplication binding: bound only for the last upload of the axis.
				if ((i == axis->specializationConstants.BluesteinMultiplicationBindingID) && (app->useBluesteinFFT[axis_id]) && (axis_upload_id == (FFTPlan->numAxisUploads[axis_id] - 1))) {
#if(VKFFT_BACKEND==0)
					if (axis->specializationConstants.performBufferSetUpdate) {
						descriptorBufferInfo.buffer = app->bufferBluestein[axis_id];
						descriptorBufferInfo.offset = 0;
						descriptorBufferInfo.range = app->bufferBluesteinSize[axis_id];
					}
#endif
				}
#if(VKFFT_BACKEND==0)
				// Vulkan only: write the descriptor for binding i, array element j.
				if (axis->specializationConstants.performBufferSetUpdate) {
					VkWriteDescriptorSet writeDescriptorSet = { VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET };
					writeDescriptorSet.dstSet = axis->descriptorSet;
					writeDescriptorSet.dstBinding = (uint32_t)i;
					writeDescriptorSet.dstArrayElement = (uint32_t)j;
					writeDescriptorSet.descriptorType = descriptorType;
					writeDescriptorSet.descriptorCount = 1;
					writeDescriptorSet.pBufferInfo = &descriptorBufferInfo;
					vkUpdateDescriptorSets(app->configuration.device[0], 1, &writeDescriptorSet, 0, 0);
				}
#endif
			}
		}
	}
	if
	(axis->specializationConstants.performBufferSetUpdate) {
		axis->specializationConstants.performBufferSetUpdate = 0;
	}
	if (axis->specializationConstants.performOffsetUpdate) {
		axis->specializationConstants.performOffsetUpdate = 0;
	}
	return VKFFT_SUCCESS;
}
static inline VkFFTResult VkFFTUpdateBufferSetR2CMultiUploadDecomposition(VkFFTApplication* app, VkFFTPlan* FFTPlan, VkFFTAxis* axis, uint64_t axis_id, uint64_t axis_upload_id, uint64_t inverse) {
	if (axis->specializationConstants.performOffsetUpdate || axis->specializationConstants.performBufferSetUpdate) {
#if(VKFFT_BACKEND==0)
		const VkDescriptorType descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
#endif
		// Bytes per complex element for the configured precision (double, half or single).
		uint64_t storageComplexSize;
		if (app->configuration.doublePrecision)
			storageComplexSize = (2 * sizeof(double));
		else
			if (app->configuration.halfPrecision)
				storageComplexSize = (2 * 2);
			else
				storageComplexSize = (2 * sizeof(float));
		for (uint64_t i = 0; i < axis->numBindings; ++i) {
			for (uint64_t j = 0; j < axis->specializationConstants.numBuffersBound[i]; ++j) {
#if(VKFFT_BACKEND==0)
				VkDescriptorBufferInfo descriptorBufferInfo = { 0 };
#endif
				if (i == 0) {
					uint64_t bufferId = 0;
					uint64_t offset = j;
					if (inverse) {
						if ((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (app->configuration.isInputFormatted) && (!axis->specializationConstants.reverseBluesteinMultiUpload) && (
							((axis_id == app->firstAxis) && (!inverse))
							|| ((axis_id == app->lastAxis) && (inverse) && (!app->configuration.performConvolution) && (!app->configuration.inverseReturnToInputBuffer)))
							) {
							if (axis->specializationConstants.performBufferSetUpdate) {
								uint64_t bufferId = 0;
								uint64_t offset = j;
								if (app->configuration.inputBufferSize) {
									for (uint64_t l = 0; l < app->configuration.inputBufferNum; ++l) {
										if (offset >= (uint64_t)ceil(app->configuration.inputBufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize))) {
											bufferId++;
											offset -= (uint64_t)ceil(app->configuration.inputBufferSize[l] /
(double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.inputBufferNum; } } } axis->inputBuffer = app->configuration.inputBuffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.inputBuffer[bufferId]; descriptorBufferInfo.range = (axis->specializationConstants.inputBufferBlockSize * storageComplexSize); descriptorBufferInfo.offset = offset * (axis->specializationConstants.inputBufferBlockSize * storageComplexSize); #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.inputOffset = app->configuration.inputBufferOffset; } } else { if ((axis_upload_id == 0) && (app->configuration.numberKernels > 1) && (inverse) && (!app->configuration.performConvolution)) { if (axis->specializationConstants.performBufferSetUpdate) { uint64_t bufferId = 0; uint64_t offset = j; if (app->configuration.outputBufferSize) { for (uint64_t l = 0; l < app->configuration.outputBufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.outputBufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.outputBufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.outputBufferNum; } } } axis->inputBuffer = app->configuration.outputBuffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.outputBuffer[bufferId]; descriptorBufferInfo.range = (axis->specializationConstants.inputBufferBlockSize * storageComplexSize); descriptorBufferInfo.offset = offset * (axis->specializationConstants.inputBufferBlockSize * storageComplexSize); #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.inputOffset = app->configuration.outputBufferOffset; } } else { if (axis->specializationConstants.performBufferSetUpdate) { uint64_t bufferId = 0; uint64_t 
offset = j; if (app->configuration.bufferSize) { for (uint64_t l = 0; l < app->configuration.bufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.bufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.bufferSize[l] / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.bufferNum; } } } axis->inputBuffer = app->configuration.buffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.buffer[bufferId]; descriptorBufferInfo.range = (axis->specializationConstants.inputBufferBlockSize * storageComplexSize); descriptorBufferInfo.offset = offset * (axis->specializationConstants.inputBufferBlockSize * storageComplexSize); #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.inputOffset = app->configuration.bufferOffset; } } } } else { if (((axis_upload_id == 0) && (!app->useBluesteinFFT[axis_id]) && (app->configuration.isOutputFormatted && ( ((axis_id == app->firstAxis) && (inverse)) || ((axis_id == app->lastAxis) && (!inverse) && (!app->configuration.performConvolution)) || ((axis_id == app->firstAxis) && (app->configuration.performConvolution) && (app->configuration.FFTdim == 1))) )) || ((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (app->useBluesteinFFT[axis_id]) && (axis->specializationConstants.reverseBluesteinMultiUpload || (FFTPlan->numAxisUploads[axis_id] == 1)) && (app->configuration.isOutputFormatted && ( ((axis_id == app->firstAxis) && (inverse)) || ((axis_id == app->lastAxis) && (!inverse) && (!app->configuration.performConvolution))) )) || ((app->configuration.numberKernels > 1) && ( (inverse) || (axis_id == app->lastAxis))) ) { if (axis->specializationConstants.performBufferSetUpdate) { uint64_t bufferId = 0; uint64_t offset = j; if (app->configuration.outputBufferSize) { for (uint64_t l = 0; l < 
app->configuration.outputBufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.outputBufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.outputBufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.outputBufferNum; } } } axis->inputBuffer = app->configuration.outputBuffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.outputBuffer[bufferId]; descriptorBufferInfo.range = (axis->specializationConstants.outputBufferBlockSize * storageComplexSize); descriptorBufferInfo.offset = offset * (axis->specializationConstants.outputBufferBlockSize * storageComplexSize); #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.inputOffset = app->configuration.outputBufferOffset; } } else { if (axis->specializationConstants.performBufferSetUpdate) { uint64_t bufferId = 0; uint64_t offset = j; if (app->configuration.bufferSize) { for (uint64_t l = 0; l < app->configuration.bufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.bufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.bufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.bufferNum; } } } axis->inputBuffer = app->configuration.buffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.buffer[bufferId]; descriptorBufferInfo.range = (axis->specializationConstants.outputBufferBlockSize * storageComplexSize); descriptorBufferInfo.offset = offset * (axis->specializationConstants.outputBufferBlockSize * storageComplexSize); #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.inputOffset = 
app->configuration.bufferOffset; } } } } if (i == 1) { if (inverse) { if ((axis_upload_id == 0) && (app->configuration.numberKernels > 1) && (inverse) && (!app->configuration.performConvolution)) { if (axis->specializationConstants.performBufferSetUpdate) { uint64_t bufferId = 0; uint64_t offset = j; if (app->configuration.outputBufferSize) { for (uint64_t l = 0; l < app->configuration.outputBufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.outputBufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.outputBufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.outputBufferNum; } } } axis->outputBuffer = app->configuration.outputBuffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.outputBuffer[bufferId]; descriptorBufferInfo.range = (axis->specializationConstants.outputBufferBlockSize * storageComplexSize); descriptorBufferInfo.offset = offset * (axis->specializationConstants.outputBufferBlockSize * storageComplexSize); #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.outputOffset = app->configuration.outputBufferOffset; } } else { uint64_t bufferId = 0; uint64_t offset = j; if (axis->specializationConstants.reorderFourStep == 1) { if (axis->specializationConstants.performBufferSetUpdate) { if (app->configuration.tempBufferSize) { for (uint64_t l = 0; l < app->configuration.tempBufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.tempBufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.tempBufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.tempBufferNum; } } } axis->outputBuffer = 
app->configuration.tempBuffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.tempBuffer[bufferId]; #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.outputOffset = app->configuration.tempBufferOffset; } } else { if (axis->specializationConstants.performBufferSetUpdate) { if (app->configuration.bufferSize) { for (uint64_t l = 0; l < app->configuration.bufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.bufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.bufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.bufferNum; } } } axis->outputBuffer = app->configuration.buffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.buffer[bufferId]; descriptorBufferInfo.range = (axis->specializationConstants.outputBufferBlockSize * storageComplexSize); descriptorBufferInfo.offset = offset * (axis->specializationConstants.outputBufferBlockSize * storageComplexSize); #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.outputOffset = app->configuration.bufferOffset; } } } } else { if (((axis_upload_id == 0) && (!app->useBluesteinFFT[axis_id]) && (app->configuration.isOutputFormatted && ( ((axis_id == app->firstAxis) && (inverse)) || ((axis_id == app->lastAxis) && (!inverse) && (!app->configuration.performConvolution)) || ((axis_id == app->firstAxis) && (app->configuration.performConvolution) && (app->configuration.FFTdim == 1))) )) || ((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (app->useBluesteinFFT[axis_id]) && (axis->specializationConstants.reverseBluesteinMultiUpload || (FFTPlan->numAxisUploads[axis_id] == 1)) && (app->configuration.isOutputFormatted && ( ((axis_id == app->firstAxis) && (inverse)) || ((axis_id == app->lastAxis) && 
(!inverse) && (!app->configuration.performConvolution))) )) || ((app->configuration.numberKernels > 1) && ( (inverse) || (axis_id == app->lastAxis))) ) { if (axis->specializationConstants.performBufferSetUpdate) { uint64_t bufferId = 0; uint64_t offset = j; if (app->configuration.outputBufferSize) { for (uint64_t l = 0; l < app->configuration.outputBufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.outputBufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.outputBufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.outputBufferNum; } } } axis->outputBuffer = app->configuration.outputBuffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.outputBuffer[bufferId]; descriptorBufferInfo.range = (axis->specializationConstants.outputBufferBlockSize * storageComplexSize); descriptorBufferInfo.offset = offset * (axis->specializationConstants.outputBufferBlockSize * storageComplexSize); #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.outputOffset = app->configuration.outputBufferOffset; } } else { if (axis->specializationConstants.performBufferSetUpdate) { uint64_t bufferId = 0; uint64_t offset = j; if (app->configuration.bufferSize) { for (uint64_t l = 0; l < app->configuration.bufferNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.bufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.bufferSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.bufferNum; } } } axis->outputBuffer = app->configuration.buffer; #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.buffer[bufferId]; descriptorBufferInfo.range = 
(axis->specializationConstants.outputBufferBlockSize * storageComplexSize); descriptorBufferInfo.offset = offset * (axis->specializationConstants.outputBufferBlockSize * storageComplexSize); #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.outputOffset = app->configuration.bufferOffset; } } } } if ((i == 2) && (app->configuration.performConvolution)) { if (axis->specializationConstants.performBufferSetUpdate) { uint64_t bufferId = 0; uint64_t offset = j; if (app->configuration.kernelSize) { for (uint64_t l = 0; l < app->configuration.kernelNum; ++l) { if (offset >= (uint64_t)ceil(app->configuration.kernelSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize))) { bufferId++; offset -= (uint64_t)ceil(app->configuration.kernelSize[l] / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); } else { l = app->configuration.kernelNum; } } } #if(VKFFT_BACKEND==0) descriptorBufferInfo.buffer = app->configuration.kernel[bufferId]; descriptorBufferInfo.range = (axis->specializationConstants.kernelBlockSize * storageComplexSize); descriptorBufferInfo.offset = offset * (axis->specializationConstants.kernelBlockSize * storageComplexSize); #endif } if (axis->specializationConstants.performOffsetUpdate) { axis->specializationConstants.kernelOffset = app->configuration.kernelOffset; } } if ((i == axis->numBindings - 1) && (app->configuration.useLUT)) { #if(VKFFT_BACKEND==0) if (axis->specializationConstants.performBufferSetUpdate) { descriptorBufferInfo.buffer = axis->bufferLUT; descriptorBufferInfo.offset = 0; descriptorBufferInfo.range = axis->bufferLUTSize; } #endif } #if(VKFFT_BACKEND==0) if (axis->specializationConstants.performBufferSetUpdate) { VkWriteDescriptorSet writeDescriptorSet = { VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET }; writeDescriptorSet.dstSet = axis->descriptorSet; writeDescriptorSet.dstBinding = (uint32_t)i; writeDescriptorSet.dstArrayElement = 
(uint32_t)j;
					writeDescriptorSet.descriptorType = descriptorType;
					writeDescriptorSet.descriptorCount = 1;
					writeDescriptorSet.pBufferInfo = &descriptorBufferInfo;
					vkUpdateDescriptorSets(app->configuration.device[0], 1, &writeDescriptorSet, 0, 0);
				}
#endif
			}
		}
	}
	if (axis->specializationConstants.performBufferSetUpdate) {
		axis->specializationConstants.performBufferSetUpdate = 0;
	}
	if (axis->specializationConstants.performOffsetUpdate) {
		axis->specializationConstants.performOffsetUpdate = 0;
	}
	return VKFFT_SUCCESS;
}
static inline VkFFTResult VkFFTPlanR2CMultiUploadDecomposition(VkFFTApplication* app, VkFFTPlan* FFTPlan, uint64_t inverse) {
	//get radix stages
	VkFFTResult resFFT = VKFFT_SUCCESS;
#if(VKFFT_BACKEND==0)
	VkResult res = VK_SUCCESS;
#elif(VKFFT_BACKEND==1)
	cudaError_t res = cudaSuccess;
#elif(VKFFT_BACKEND==2)
	hipError_t res = hipSuccess;
#elif(VKFFT_BACKEND==3)
	cl_int res = CL_SUCCESS;
#endif
	VkFFTAxis* axis = &FFTPlan->R2Cdecomposition;
	axis->specializationConstants.warpSize = app->configuration.warpSize;
	axis->specializationConstants.numSharedBanks = app->configuration.numSharedBanks;
	axis->specializationConstants.useUint64 = app->configuration.useUint64;
	axis->specializationConstants.numAxisUploads = FFTPlan->numAxisUploads[0];
	axis->specializationConstants.reorderFourStep = ((FFTPlan->numAxisUploads[0] > 1) && (!app->useBluesteinFFT[0])) ?
app->configuration.reorderFourStep : 0;
	uint64_t complexSize;
	if (app->configuration.doublePrecision || app->configuration.doublePrecisionFloatMemory)
		complexSize = (2 * sizeof(double));
	else
		if (app->configuration.halfPrecision)
			complexSize = (2 * sizeof(float));
		else
			complexSize = (2 * sizeof(float));
	axis->specializationConstants.complexSize = complexSize;
	axis->specializationConstants.supportAxis = 0;
	axis->specializationConstants.symmetricKernel = app->configuration.symmetricKernel;
	axis->specializationConstants.conjugateConvolution = app->configuration.conjugateConvolution;
	axis->specializationConstants.crossPowerSpectrumNormalization = app->configuration.crossPowerSpectrumNormalization;
	axis->specializationConstants.fft_dim_full = app->configuration.size[0];
	axis->specializationConstants.dispatchZactualFFTSize = 1;
	//allocate LUT
	if (app->configuration.useLUT) {
		double double_PI = 3.1415926535897932384626433832795;
		if (app->configuration.doublePrecision || app->configuration.doublePrecisionFloatMemory) {
			axis->bufferLUTSize = (app->configuration.size[0] / 2) * 2 * sizeof(double);
			double* tempLUT = (double*)malloc(axis->bufferLUTSize);
			if (!tempLUT) {
				deleteVkFFT(app);
				return VKFFT_ERROR_MALLOC_FAILED;
			}
			for (uint64_t i = 0; i < app->configuration.size[0] / 2; i++) {
				double angle = double_PI * i / (app->configuration.size[0] / 2);
				tempLUT[2 * i] = (double)cos(angle);
				tempLUT[2 * i + 1] = (double)sin(angle);
			}
			axis->referenceLUT = 0;
			if ((!inverse) && (!app->configuration.makeForwardPlanOnly)) {
				axis->bufferLUT = app->localFFTPlan_inverse->R2Cdecomposition.bufferLUT;
#if(VKFFT_BACKEND==0)
				axis->bufferLUTDeviceMemory = app->localFFTPlan_inverse->R2Cdecomposition.bufferLUTDeviceMemory;
#endif
				axis->bufferLUTSize = app->localFFTPlan_inverse->R2Cdecomposition.bufferLUTSize;
				axis->referenceLUT = 1;
				//the device LUT is shared with the inverse plan, so the host copy is not uploaded; free it to avoid a leak
				free(tempLUT);
				tempLUT = 0;
			}
			else {
#if(VKFFT_BACKEND==0)
				resFFT = allocateFFTBuffer(app, &axis->bufferLUT, &axis->bufferLUTDeviceMemory, VK_BUFFER_USAGE_STORAGE_BUFFER_BIT |
VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT, VK_MEMORY_HEAP_DEVICE_LOCAL_BIT, axis->bufferLUTSize); if (resFFT != VKFFT_SUCCESS) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return resFFT; } resFFT = transferDataFromCPU(app, tempLUT, &axis->bufferLUT, axis->bufferLUTSize); if (resFFT != VKFFT_SUCCESS) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return resFFT; } #elif(VKFFT_BACKEND==1) res = cudaMalloc((void**)&axis->bufferLUT, axis->bufferLUTSize); if (res != cudaSuccess) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return VKFFT_ERROR_FAILED_TO_ALLOCATE; } res = cudaMemcpy(axis->bufferLUT, tempLUT, axis->bufferLUTSize, cudaMemcpyHostToDevice); if (res != cudaSuccess) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return VKFFT_ERROR_FAILED_TO_ALLOCATE; } #elif(VKFFT_BACKEND==2) res = hipMalloc((void**)&axis->bufferLUT, axis->bufferLUTSize); if (res != hipSuccess) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return VKFFT_ERROR_FAILED_TO_ALLOCATE; } res = hipMemcpy(axis->bufferLUT, tempLUT, axis->bufferLUTSize, hipMemcpyHostToDevice); if (res != hipSuccess) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return VKFFT_ERROR_FAILED_TO_ALLOCATE; } #elif(VKFFT_BACKEND==3) axis->bufferLUT = clCreateBuffer(app->configuration.context[0], CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, axis->bufferLUTSize, tempLUT, &res); if (res != CL_SUCCESS) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return VKFFT_ERROR_FAILED_TO_ALLOCATE; } #endif free(tempLUT); tempLUT = 0; } } else { axis->bufferLUTSize = (app->configuration.size[0] / 2) * 2 * sizeof(float); float* tempLUT = (float*)malloc(axis->bufferLUTSize); if (!tempLUT) { deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } for (uint64_t i = 0; i < app->configuration.size[0] / 2; i++) { double angle = double_PI * i / (app->configuration.size[0] / 2); tempLUT[2 * i] = (float)cos(angle); tempLUT[2 * i + 1] = (float)sin(angle); } axis->referenceLUT = 0; if ((!inverse) && 
(!app->configuration.makeForwardPlanOnly)) {
				axis->bufferLUT = app->localFFTPlan_inverse->R2Cdecomposition.bufferLUT;
#if(VKFFT_BACKEND==0)
				axis->bufferLUTDeviceMemory = app->localFFTPlan_inverse->R2Cdecomposition.bufferLUTDeviceMemory;
#endif
				axis->bufferLUTSize = app->localFFTPlan_inverse->R2Cdecomposition.bufferLUTSize;
				axis->referenceLUT = 1;
				//the device LUT is shared with the inverse plan, so the host copy is not uploaded; free it to avoid a leak
				free(tempLUT);
				tempLUT = 0;
			}
			else {
#if(VKFFT_BACKEND==0)
				resFFT = allocateFFTBuffer(app, &axis->bufferLUT, &axis->bufferLUTDeviceMemory, VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT, VK_MEMORY_HEAP_DEVICE_LOCAL_BIT, axis->bufferLUTSize);
				if (resFFT != VKFFT_SUCCESS) {
					deleteVkFFT(app);
					free(tempLUT);
					tempLUT = 0;
					return resFFT;
				}
				resFFT = transferDataFromCPU(app, tempLUT, &axis->bufferLUT, axis->bufferLUTSize);
				if (resFFT != VKFFT_SUCCESS) {
					deleteVkFFT(app);
					free(tempLUT);
					tempLUT = 0;
					return resFFT;
				}
#elif(VKFFT_BACKEND==1)
				res = cudaMalloc((void**)&axis->bufferLUT, axis->bufferLUTSize);
				if (res != cudaSuccess) {
					deleteVkFFT(app);
					free(tempLUT);
					tempLUT = 0;
					return VKFFT_ERROR_FAILED_TO_ALLOCATE;
				}
				res = cudaMemcpy(axis->bufferLUT, tempLUT, axis->bufferLUTSize, cudaMemcpyHostToDevice);
				if (res != cudaSuccess) {
					deleteVkFFT(app);
					free(tempLUT);
					tempLUT = 0;
					return VKFFT_ERROR_FAILED_TO_ALLOCATE;
				}
#elif(VKFFT_BACKEND==2)
				res = hipMalloc((void**)&axis->bufferLUT, axis->bufferLUTSize);
				if (res != hipSuccess) {
					deleteVkFFT(app);
					free(tempLUT);
					tempLUT = 0;
					return VKFFT_ERROR_FAILED_TO_ALLOCATE;
				}
				res = hipMemcpy(axis->bufferLUT, tempLUT, axis->bufferLUTSize, hipMemcpyHostToDevice);
				if (res != hipSuccess) {
					deleteVkFFT(app);
					free(tempLUT);
					tempLUT = 0;
					return VKFFT_ERROR_FAILED_TO_ALLOCATE;
				}
#elif(VKFFT_BACKEND==3)
				axis->bufferLUT = clCreateBuffer(app->configuration.context[0], CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, axis->bufferLUTSize, tempLUT, &res);
				if (res != CL_SUCCESS) {
					deleteVkFFT(app);
					free(tempLUT);
					tempLUT = 0;
					return VKFFT_ERROR_FAILED_TO_ALLOCATE;
				}
#endif
				free(tempLUT);
				tempLUT = 0;
			}
		}
	}
	//configure strides
	uint64_t* axisStride = axis->specializationConstants.inputStride;
	uint64_t* usedStride = 0;
	if (app->useBluesteinFFT[0] && (FFTPlan->numAxisUploads[0] > 1)) {
		if (inverse)
			usedStride = FFTPlan->axes[0][FFTPlan->numAxisUploads[0] - 1].specializationConstants.inputStride;
		else
			usedStride = FFTPlan->inverseBluesteinAxes[0][FFTPlan->numAxisUploads[0] - 1].specializationConstants.outputStride;
	}
	else {
		if (inverse)
			usedStride = FFTPlan->axes[0][FFTPlan->numAxisUploads[0] - 1].specializationConstants.inputStride;
		else
			usedStride = FFTPlan->axes[0][0].specializationConstants.outputStride;
	}
	axisStride[0] = usedStride[0];
	axisStride[1] = usedStride[1];
	axisStride[2] = usedStride[2];
	axisStride[3] = usedStride[3];
	axisStride[4] = usedStride[4];
	axisStride = axis->specializationConstants.outputStride;
	usedStride = axis->specializationConstants.inputStride;
	axisStride[0] = usedStride[0];
	axisStride[1] = usedStride[1];
	axisStride[2] = usedStride[2];
	axisStride[3] = usedStride[3];
	axisStride[4] = usedStride[4];
	axis->specializationConstants.inverse = inverse;
	uint64_t storageComplexSize;
	if (app->configuration.doublePrecision)
		storageComplexSize = (2 * sizeof(double));
	else
		if (app->configuration.halfPrecision)
			storageComplexSize = (2 * 2);
		else
			storageComplexSize = (2 * sizeof(float));
	uint64_t initPageSize = -1;
	uint64_t locBufferNum = 1;
	uint64_t locBufferSize = 0;
	/*for (uint64_t i = 0; i < app->configuration.bufferNum; i++) {
		initPageSize += app->configuration.bufferSize[i];
	}
	if (app->configuration.performConvolution) {
		uint64_t initPageSizeKernel = 0;
		for (uint64_t i = 0; i < app->configuration.kernelNum; i++) {
			initPageSizeKernel += app->configuration.kernelSize[i];
		}
		if (initPageSizeKernel > initPageSize) initPageSize = initPageSizeKernel;
	}
	if ((!((!app->configuration.reorderFourStep))) && (axis->specializationConstants.inputStride[1] * storageComplexSize > app->configuration.devicePageSize * 1024) &&
(app->configuration.devicePageSize > 0)) { initPageSize = app->configuration.localPageSize * 1024; }*/ uint64_t axis_id = 0; uint64_t axis_upload_id = 0; { uint64_t totalSize = 0; uint64_t locPageSize = initPageSize; if (inverse) { if ((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (app->configuration.isInputFormatted) && (!axis->specializationConstants.reverseBluesteinMultiUpload) && ( ((axis_id == app->firstAxis) && (!inverse)) || ((axis_id == app->lastAxis) && (inverse) && (!app->configuration.performConvolution) && (!app->configuration.inverseReturnToInputBuffer))) ) { uint64_t totalSize = 0; uint64_t locPageSize = initPageSize; locBufferNum = app->configuration.inputBufferNum; if (app->configuration.inputBufferSize) { locBufferSize = (uint64_t)ceil(app->configuration.inputBufferSize[0] / (double)storageComplexSize); for (uint64_t i = 0; i < app->configuration.inputBufferNum; i++) { totalSize += app->configuration.inputBufferSize[i]; if (app->configuration.inputBufferSize[i] < locPageSize) locPageSize = app->configuration.inputBufferSize[i]; } } axis->specializationConstants.inputBufferBlockSize = (locBufferNum == 1) ? locBufferSize : (uint64_t)ceil(locPageSize / (double)storageComplexSize); axis->specializationConstants.inputBufferBlockNum = (locBufferNum == 1) ? 
1 : (uint64_t)ceil(totalSize / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize)); //if (axis->specializationConstants.inputBufferBlockNum == 1) axis->specializationConstants.inputBufferBlockSize = totalSize / storageComplexSize; } else { if ((axis_upload_id == 0) && (app->configuration.numberKernels > 1) && (inverse) && (!app->configuration.performConvolution)) { uint64_t totalSize = 0; uint64_t locPageSize = initPageSize; locBufferNum = app->configuration.outputBufferNum; if (app->configuration.outputBufferSize) { locBufferSize = (uint64_t)ceil(app->configuration.outputBufferSize[0] / (double)storageComplexSize); for (uint64_t i = 0; i < app->configuration.outputBufferNum; i++) { totalSize += app->configuration.outputBufferSize[i]; if (app->configuration.outputBufferSize[i] < locPageSize) locPageSize = app->configuration.outputBufferSize[i]; } } axis->specializationConstants.inputBufferBlockSize = (locBufferNum == 1) ? locBufferSize : (uint64_t)ceil(locPageSize / (double)storageComplexSize); axis->specializationConstants.inputBufferBlockNum = (locBufferNum == 1) ? 1 : (uint64_t)ceil(totalSize / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize)); //if (axis->specializationConstants.inputBufferBlockNum == 1) axis->specializationConstants.outputBufferBlockSize = totalSize / storageComplexSize; } else { uint64_t totalSize = 0; uint64_t locPageSize = initPageSize; locBufferNum = app->configuration.bufferNum; if (app->configuration.bufferSize) { locBufferSize = (uint64_t)ceil(app->configuration.bufferSize[0] / (double)storageComplexSize); for (uint64_t i = 0; i < app->configuration.bufferNum; i++) { totalSize += app->configuration.bufferSize[i]; if (app->configuration.bufferSize[i] < locPageSize) locPageSize = app->configuration.bufferSize[i]; } } axis->specializationConstants.inputBufferBlockSize = (locBufferNum == 1) ? 
locBufferSize : (uint64_t)ceil(locPageSize / (double)storageComplexSize);
					axis->specializationConstants.inputBufferBlockNum = (locBufferNum == 1) ? 1 : (uint64_t)ceil(totalSize / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize));
					//if (axis->specializationConstants.inputBufferBlockNum == 1) axis->specializationConstants.inputBufferBlockSize = totalSize / storageComplexSize;
				}
			}
		}
		else {
			if (((axis_upload_id == 0) && (!app->useBluesteinFFT[axis_id]) && (app->configuration.isOutputFormatted && ( ((axis_id == app->firstAxis) && (inverse)) || ((axis_id == app->lastAxis) && (!inverse) && (!app->configuration.performConvolution)) || ((axis_id == app->firstAxis) && (app->configuration.performConvolution) && (app->configuration.FFTdim == 1))) )) || ((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (app->useBluesteinFFT[axis_id]) && (axis->specializationConstants.reverseBluesteinMultiUpload || (FFTPlan->numAxisUploads[axis_id] == 1)) && (app->configuration.isOutputFormatted && ( ((axis_id == app->firstAxis) && (inverse)) || ((axis_id == app->lastAxis) && (!inverse) && (!app->configuration.performConvolution))) )) || ((app->configuration.numberKernels > 1) && ( (inverse) || (axis_id == app->lastAxis))) ) {
				uint64_t totalSize = 0;
				uint64_t locPageSize = initPageSize;
				locBufferNum = app->configuration.outputBufferNum;
				if (app->configuration.outputBufferSize) {
					locBufferSize = (uint64_t)ceil(app->configuration.outputBufferSize[0] / (double)storageComplexSize);
					for (uint64_t i = 0; i < app->configuration.outputBufferNum; i++) {
						totalSize += app->configuration.outputBufferSize[i];
						if (app->configuration.outputBufferSize[i] < locPageSize) locPageSize = app->configuration.outputBufferSize[i];
					}
				}
				axis->specializationConstants.inputBufferBlockSize = (locBufferNum == 1) ? locBufferSize : (uint64_t)ceil(locPageSize / (double)storageComplexSize);
				//this section configures the input binding, so the block count belongs in inputBufferBlockNum (outputBufferBlockNum is set in the next section)
				axis->specializationConstants.inputBufferBlockNum = (locBufferNum == 1) ?
					1 : (uint64_t)ceil(totalSize / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize));
				//if (axis->specializationConstants.outputBufferBlockNum == 1) axis->specializationConstants.outputBufferBlockSize = totalSize / storageComplexSize;
			}
			else {
				uint64_t totalSize = 0;
				uint64_t locPageSize = initPageSize;
				locBufferNum = app->configuration.bufferNum;
				if (app->configuration.bufferSize) {
					locBufferSize = (uint64_t)ceil(app->configuration.bufferSize[0] / (double)storageComplexSize);
					for (uint64_t i = 0; i < app->configuration.bufferNum; i++) {
						totalSize += app->configuration.bufferSize[i];
						if (app->configuration.bufferSize[i] < locPageSize) locPageSize = app->configuration.bufferSize[i];
					}
				}
				axis->specializationConstants.inputBufferBlockSize = (locBufferNum == 1) ? locBufferSize : (uint64_t)ceil(locPageSize / (double)storageComplexSize);
				axis->specializationConstants.inputBufferBlockNum = (locBufferNum == 1) ? 1 : (uint64_t)ceil(totalSize / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize));
				//if (axis->specializationConstants.outputBufferBlockNum == 1) axis->specializationConstants.outputBufferBlockSize = totalSize / storageComplexSize;
			}
		}
	}
	initPageSize = -1;
	locBufferNum = 1;
	locBufferSize = -1;
	{
		if (inverse) {
			if ((axis_upload_id == 0) && (app->configuration.numberKernels > 1) && (inverse) && (!app->configuration.performConvolution)) {
				uint64_t totalSize = 0;
				uint64_t locPageSize = initPageSize;
				locBufferNum = app->configuration.outputBufferNum;
				if (app->configuration.outputBufferSize) {
					locBufferSize = (uint64_t)ceil(app->configuration.outputBufferSize[0] / (double)storageComplexSize);
					for (uint64_t i = 0; i < app->configuration.outputBufferNum; i++) {
						totalSize += app->configuration.outputBufferSize[i];
						if (app->configuration.outputBufferSize[i] < locPageSize) locPageSize = app->configuration.outputBufferSize[i];
					}
				}
				axis->specializationConstants.outputBufferBlockSize = (locBufferNum == 1) ?
locBufferSize : (uint64_t)ceil(locPageSize / (double)storageComplexSize); axis->specializationConstants.outputBufferBlockNum = (locBufferNum == 1) ? 1 : (uint64_t)ceil(totalSize / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); //if (axis->specializationConstants.outputBufferBlockNum == 1) axis->specializationConstants.outputBufferBlockSize = totalSize / storageComplexSize; } else { uint64_t totalSize = 0; uint64_t locPageSize = initPageSize; locBufferNum = app->configuration.bufferNum; if (app->configuration.bufferSize) { locBufferSize = (uint64_t)ceil(app->configuration.bufferSize[0] / (double)storageComplexSize); for (uint64_t i = 0; i < app->configuration.bufferNum; i++) { totalSize += app->configuration.bufferSize[i]; if (app->configuration.bufferSize[i] < locPageSize) locPageSize = app->configuration.bufferSize[i]; } } axis->specializationConstants.outputBufferBlockSize = (locBufferNum == 1) ? locBufferSize : (uint64_t)ceil(locPageSize / (double)storageComplexSize); axis->specializationConstants.outputBufferBlockNum = (locBufferNum == 1) ? 
1 : (uint64_t)ceil(totalSize / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); //if (axis->specializationConstants.outputBufferBlockNum == 1) axis->specializationConstants.outputBufferBlockSize = totalSize / storageComplexSize; } } else { if (((axis_upload_id == 0) && (!app->useBluesteinFFT[axis_id]) && (app->configuration.isOutputFormatted && ( ((axis_id == app->firstAxis) && (inverse)) || ((axis_id == app->lastAxis) && (!inverse) && (!app->configuration.performConvolution)) || ((axis_id == app->firstAxis) && (app->configuration.performConvolution) && (app->configuration.FFTdim == 1))) )) || ((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (app->useBluesteinFFT[axis_id]) && (axis->specializationConstants.reverseBluesteinMultiUpload || (FFTPlan->numAxisUploads[axis_id] == 1)) && (app->configuration.isOutputFormatted && ( ((axis_id == app->firstAxis) && (inverse)) || ((axis_id == app->lastAxis) && (!inverse) && (!app->configuration.performConvolution))) )) || ((app->configuration.numberKernels > 1) && ( (inverse) || (axis_id == app->lastAxis))) ) { uint64_t totalSize = 0; uint64_t locPageSize = initPageSize; locBufferNum = app->configuration.outputBufferNum; if (app->configuration.outputBufferSize) { locBufferSize = (uint64_t)ceil(app->configuration.outputBufferSize[0] / (double)storageComplexSize); for (uint64_t i = 0; i < app->configuration.outputBufferNum; i++) { totalSize += app->configuration.outputBufferSize[i]; if (app->configuration.outputBufferSize[i] < locPageSize) locPageSize = app->configuration.outputBufferSize[i]; } } axis->specializationConstants.outputBufferBlockSize = (locBufferNum == 1) ? locBufferSize : (uint64_t)ceil(locPageSize / (double)storageComplexSize); axis->specializationConstants.outputBufferBlockNum = (locBufferNum == 1) ? 
1 : (uint64_t)ceil(totalSize / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); //if (axis->specializationConstants.outputBufferBlockNum == 1) axis->specializationConstants.outputBufferBlockSize = totalSize / storageComplexSize; } else { uint64_t totalSize = 0; uint64_t locPageSize = initPageSize; locBufferNum = app->configuration.bufferNum; if (app->configuration.bufferSize) { locBufferSize = (uint64_t)ceil(app->configuration.bufferSize[0] / (double)storageComplexSize); for (uint64_t i = 0; i < app->configuration.bufferNum; i++) { totalSize += app->configuration.bufferSize[i]; if (app->configuration.bufferSize[i] < locPageSize) locPageSize = app->configuration.bufferSize[i]; } } axis->specializationConstants.outputBufferBlockSize = (locBufferNum == 1) ? locBufferSize : (uint64_t)ceil(locPageSize / (double)storageComplexSize); axis->specializationConstants.outputBufferBlockNum = (locBufferNum == 1) ? 1 : (uint64_t)ceil(totalSize / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); //if (axis->specializationConstants.outputBufferBlockNum == 1) axis->specializationConstants.outputBufferBlockSize = totalSize / storageComplexSize; } } } if (axis->specializationConstants.inputBufferBlockNum == 0) axis->specializationConstants.inputBufferBlockNum = 1; if (axis->specializationConstants.outputBufferBlockNum == 0) axis->specializationConstants.outputBufferBlockNum = 1; if (app->configuration.performConvolution) { //need fixing (not used now) uint64_t totalSize = 0; uint64_t locPageSize = initPageSize; if (app->configuration.kernelSize) { for (uint64_t i = 0; i < app->configuration.kernelNum; i++) { totalSize += app->configuration.kernelSize[i]; if (app->configuration.kernelSize[i] < locPageSize) locPageSize = app->configuration.kernelSize[i]; } } axis->specializationConstants.kernelBlockSize = (uint64_t)ceil(locPageSize / (double)storageComplexSize); axis->specializationConstants.kernelBlockNum = 
(uint64_t)ceil(totalSize / (double)(axis->specializationConstants.kernelBlockSize * storageComplexSize)); //if (axis->specializationConstants.kernelBlockNum == 1) axis->specializationConstants.inputBufferBlockSize = totalSize / storageComplexSize; if (axis->specializationConstants.kernelBlockNum == 0) axis->specializationConstants.kernelBlockNum = 1; } else { axis->specializationConstants.kernelBlockSize = 0; axis->specializationConstants.kernelBlockNum = 0; } axis->numBindings = 2; axis->specializationConstants.numBuffersBound[0] = axis->specializationConstants.inputBufferBlockNum; axis->specializationConstants.numBuffersBound[1] = axis->specializationConstants.outputBufferBlockNum; axis->specializationConstants.numBuffersBound[2] = 0; axis->specializationConstants.numBuffersBound[3] = 0; #if(VKFFT_BACKEND==0) VkDescriptorPoolSize descriptorPoolSize = { VK_DESCRIPTOR_TYPE_STORAGE_BUFFER }; descriptorPoolSize.descriptorCount = (uint32_t)(axis->specializationConstants.numBuffersBound[0] + axis->specializationConstants.numBuffersBound[1]); #endif if ((axis_id == 0) && (axis_upload_id == 0) && (app->configuration.FFTdim == 1) && (app->configuration.performConvolution)) { axis->specializationConstants.numBuffersBound[axis->numBindings] = axis->specializationConstants.kernelBlockNum; #if(VKFFT_BACKEND==0) descriptorPoolSize.descriptorCount += (uint32_t)axis->specializationConstants.kernelBlockNum; #endif axis->numBindings++; } if (app->configuration.useLUT) { axis->specializationConstants.numBuffersBound[axis->numBindings] = 1; #if(VKFFT_BACKEND==0) descriptorPoolSize.descriptorCount++; #endif axis->numBindings++; } #if(VKFFT_BACKEND==0) VkDescriptorPoolCreateInfo descriptorPoolCreateInfo = { VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO }; descriptorPoolCreateInfo.poolSizeCount = 1; descriptorPoolCreateInfo.pPoolSizes = &descriptorPoolSize; descriptorPoolCreateInfo.maxSets = 1; res = vkCreateDescriptorPool(app->configuration.device[0], &descriptorPoolCreateInfo, 0, 
&axis->descriptorPool); if (res != VK_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_DESCRIPTOR_POOL; } const VkDescriptorType descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER; VkDescriptorSetLayoutBinding* descriptorSetLayoutBindings; descriptorSetLayoutBindings = (VkDescriptorSetLayoutBinding*)malloc(axis->numBindings * sizeof(VkDescriptorSetLayoutBinding)); if (!descriptorSetLayoutBindings) { deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } for (uint64_t i = 0; i < axis->numBindings; ++i) { descriptorSetLayoutBindings[i].binding = (uint32_t)i; descriptorSetLayoutBindings[i].descriptorType = descriptorType; descriptorSetLayoutBindings[i].descriptorCount = (uint32_t)axis->specializationConstants.numBuffersBound[i]; descriptorSetLayoutBindings[i].stageFlags = VK_SHADER_STAGE_COMPUTE_BIT; } VkDescriptorSetLayoutCreateInfo descriptorSetLayoutCreateInfo = { VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO }; descriptorSetLayoutCreateInfo.bindingCount = (uint32_t)axis->numBindings; descriptorSetLayoutCreateInfo.pBindings = descriptorSetLayoutBindings; res = vkCreateDescriptorSetLayout(app->configuration.device[0], &descriptorSetLayoutCreateInfo, 0, &axis->descriptorSetLayout); if (res != VK_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_DESCRIPTOR_SET_LAYOUT; } free(descriptorSetLayoutBindings); descriptorSetLayoutBindings = 0; VkDescriptorSetAllocateInfo descriptorSetAllocateInfo = { VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO }; descriptorSetAllocateInfo.descriptorPool = axis->descriptorPool; descriptorSetAllocateInfo.descriptorSetCount = 1; descriptorSetAllocateInfo.pSetLayouts = &axis->descriptorSetLayout; res = vkAllocateDescriptorSets(app->configuration.device[0], &descriptorSetAllocateInfo, &axis->descriptorSet); if (res != VK_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_ALLOCATE_DESCRIPTOR_SETS; } #endif if (app->configuration.specifyOffsetsAtLaunch) { 
axis->specializationConstants.performPostCompilationInputOffset = 1; axis->specializationConstants.performPostCompilationOutputOffset = 1; if (app->configuration.performConvolution) axis->specializationConstants.performPostCompilationKernelOffset = 1; } resFFT = VkFFTCheckUpdateBufferSet(app, axis, 1, 0); if (resFFT != VKFFT_SUCCESS) { deleteVkFFT(app); return resFFT; } resFFT = VkFFTUpdateBufferSetR2CMultiUploadDecomposition(app, FFTPlan, axis, axis_id, axis_upload_id, inverse); if (resFFT != VKFFT_SUCCESS) { deleteVkFFT(app); return resFFT; } { axis->axisBlock[0] = 128; if (axis->axisBlock[0] > app->configuration.maxThreadsNum) axis->axisBlock[0] = app->configuration.maxThreadsNum; axis->axisBlock[1] = 1; axis->axisBlock[2] = 1; uint64_t tempSize[3] = { (uint64_t)ceil((app->configuration.size[0] * app->configuration.size[1] * app->configuration.size[2]) / (double)(2 * axis->axisBlock[0])), 1, 1 }; tempSize[2] *= app->configuration.numberKernels * app->configuration.numberBatches * app->configuration.coordinateFeatures; if (tempSize[0] > app->configuration.maxComputeWorkGroupCount[0]) axis->specializationConstants.performWorkGroupShift[0] = 1; else axis->specializationConstants.performWorkGroupShift[0] = 0; if (tempSize[1] > app->configuration.maxComputeWorkGroupCount[1]) axis->specializationConstants.performWorkGroupShift[1] = 1; else axis->specializationConstants.performWorkGroupShift[1] = 0; if (tempSize[2] > app->configuration.maxComputeWorkGroupCount[2]) axis->specializationConstants.performWorkGroupShift[2] = 1; else axis->specializationConstants.performWorkGroupShift[2] = 0; axis->specializationConstants.localSize[0] = axis->axisBlock[0]; axis->specializationConstants.localSize[1] = axis->axisBlock[1]; axis->specializationConstants.localSize[2] = axis->axisBlock[2]; axis->specializationConstants.numCoordinates = (app->configuration.matrixConvolution > 1) ? 
1 : app->configuration.coordinateFeatures; axis->specializationConstants.matrixConvolution = app->configuration.matrixConvolution; axis->specializationConstants.size[0] = app->configuration.size[0]; axis->specializationConstants.size[1] = app->configuration.size[1]; axis->specializationConstants.size[2] = app->configuration.size[2]; axis->specializationConstants.numBatches = app->configuration.numberBatches; if ((app->configuration.FFTdim == 1) && (app->configuration.size[1] == 1) && ((app->configuration.numberBatches == 1) && (app->actualNumBatches > 1)) && (!app->configuration.performConvolution) && (app->configuration.coordinateFeatures == 1)) { axis->specializationConstants.numBatches = app->actualNumBatches; } axis->specializationConstants.numKernels = app->configuration.numberKernels; axis->specializationConstants.sharedMemSize = app->configuration.sharedMemorySize; axis->specializationConstants.sharedMemSizePow2 = app->configuration.sharedMemorySizePow2; axis->specializationConstants.normalize = app->configuration.normalize; axis->specializationConstants.axis_id = 0; axis->specializationConstants.axis_upload_id = 0; for (uint64_t i = 0; i < 3; i++) { axis->specializationConstants.frequencyZeropadding = app->configuration.frequencyZeroPadding; axis->specializationConstants.performZeropaddingFull[i] = app->configuration.performZeropadding[i]; // don't read if input is zeropadded (0 - off, 1 - on) axis->specializationConstants.fft_zeropad_left_full[i] = app->configuration.fft_zeropad_left[i]; axis->specializationConstants.fft_zeropad_right_full[i] = app->configuration.fft_zeropad_right[i]; } if ((inverse)) { if ((app->configuration.frequencyZeroPadding) && (((!app->configuration.reorderFourStep) && (axis_upload_id == 0)) || ((app->configuration.reorderFourStep) && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1)))) { axis->specializationConstants.zeropad[0] = app->configuration.performZeropadding[axis_id]; 
axis->specializationConstants.fft_zeropad_left_read[axis_id] = app->configuration.fft_zeropad_left[axis_id]; axis->specializationConstants.fft_zeropad_right_read[axis_id] = app->configuration.fft_zeropad_right[axis_id]; } else axis->specializationConstants.zeropad[0] = 0; if ((!app->configuration.frequencyZeroPadding) && (((!app->configuration.reorderFourStep) && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1)) || ((app->configuration.reorderFourStep) && (axis_upload_id == 0)))) { axis->specializationConstants.zeropad[1] = app->configuration.performZeropadding[axis_id]; axis->specializationConstants.fft_zeropad_left_write[axis_id] = app->configuration.fft_zeropad_left[axis_id]; axis->specializationConstants.fft_zeropad_right_write[axis_id] = app->configuration.fft_zeropad_right[axis_id]; } else axis->specializationConstants.zeropad[1] = 0; } else { if ((!app->configuration.frequencyZeroPadding) && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1)) { axis->specializationConstants.zeropad[0] = app->configuration.performZeropadding[axis_id]; axis->specializationConstants.fft_zeropad_left_read[axis_id] = app->configuration.fft_zeropad_left[axis_id]; axis->specializationConstants.fft_zeropad_right_read[axis_id] = app->configuration.fft_zeropad_right[axis_id]; } else axis->specializationConstants.zeropad[0] = 0; if (((app->configuration.frequencyZeroPadding) && (axis_upload_id == 0)) || (((app->configuration.FFTdim - 1 == axis_id) && (axis_upload_id == 0) && (app->configuration.performConvolution)))) { axis->specializationConstants.zeropad[1] = app->configuration.performZeropadding[axis_id]; axis->specializationConstants.fft_zeropad_left_write[axis_id] = app->configuration.fft_zeropad_left[axis_id]; axis->specializationConstants.fft_zeropad_right_write[axis_id] = app->configuration.fft_zeropad_right[axis_id]; } else axis->specializationConstants.zeropad[1] = 0; } if ((app->configuration.FFTdim - 1 == axis_id) && (axis_upload_id == 0) && 
(app->configuration.performConvolution)) { axis->specializationConstants.convolutionStep = 1; } else axis->specializationConstants.convolutionStep = 0; char floatTypeInputMemory[10]; char floatTypeOutputMemory[10]; char floatTypeKernelMemory[10]; char floatType[10]; axis->specializationConstants.unroll = 1; axis->specializationConstants.LUT = app->configuration.useLUT; if (app->configuration.doublePrecision) { sprintf(floatType, "double"); sprintf(floatTypeInputMemory, "double"); sprintf(floatTypeOutputMemory, "double"); sprintf(floatTypeKernelMemory, "double"); //axis->specializationConstants.unroll = 1; } else { //axis->specializationConstants.unroll = 0; if (app->configuration.halfPrecision) { sprintf(floatType, "float"); if (app->configuration.halfPrecisionMemoryOnly) { //only out of place mode, input/output buffer must be different sprintf(floatTypeInputMemory, "float"); sprintf(floatTypeOutputMemory, "float"); sprintf(floatTypeKernelMemory, "float"); } else { sprintf(floatTypeInputMemory, "half"); sprintf(floatTypeOutputMemory, "half"); sprintf(floatTypeKernelMemory, "half"); } } else { if (app->configuration.doublePrecisionFloatMemory) { sprintf(floatType, "double"); sprintf(floatTypeInputMemory, "float"); sprintf(floatTypeOutputMemory, "float"); sprintf(floatTypeKernelMemory, "float"); } else { sprintf(floatType, "float"); sprintf(floatTypeInputMemory, "float"); sprintf(floatTypeOutputMemory, "float"); sprintf(floatTypeKernelMemory, "float"); } } } char uintType[20] = ""; if (!app->configuration.useUint64) { #if(VKFFT_BACKEND==0) sprintf(uintType, "uint"); #elif(VKFFT_BACKEND==1) sprintf(uintType, "unsigned int"); #elif(VKFFT_BACKEND==2) sprintf(uintType, "unsigned int"); #elif(VKFFT_BACKEND==3) sprintf(uintType, "unsigned int"); #endif } else { #if(VKFFT_BACKEND==0) sprintf(uintType, "uint64_t"); #elif(VKFFT_BACKEND==1) sprintf(uintType, "unsigned long long"); #elif(VKFFT_BACKEND==2) sprintf(uintType, "unsigned long long"); #elif(VKFFT_BACKEND==3) 
sprintf(uintType, "unsigned long"); #endif } { axis->pushConstants.structSize = 0; if (axis->specializationConstants.performWorkGroupShift[0]) { axis->pushConstants.performWorkGroupShift[0] = 1; axis->pushConstants.structSize += 1; } if (axis->specializationConstants.performWorkGroupShift[1]) { axis->pushConstants.performWorkGroupShift[1] = 1; axis->pushConstants.structSize += 1; } if (axis->specializationConstants.performWorkGroupShift[2]) { axis->pushConstants.performWorkGroupShift[2] = 1; axis->pushConstants.structSize += 1; } if (axis->specializationConstants.performPostCompilationInputOffset) { axis->pushConstants.performPostCompilationInputOffset = 1; axis->pushConstants.structSize += 1; } if (axis->specializationConstants.performPostCompilationOutputOffset) { axis->pushConstants.performPostCompilationOutputOffset = 1; axis->pushConstants.structSize += 1; } if (axis->specializationConstants.performPostCompilationKernelOffset) { axis->pushConstants.performPostCompilationKernelOffset = 1; axis->pushConstants.structSize += 1; } if (app->configuration.useUint64) axis->pushConstants.structSize *= sizeof(uint64_t); else axis->pushConstants.structSize *= sizeof(uint32_t); axis->specializationConstants.pushConstantsStructSize = axis->pushConstants.structSize; } //uint64_t LUT = app->configuration.useLUT; uint64_t type = 0; axis->specializationConstants.maxCodeLength = app->configuration.maxCodeLength; axis->specializationConstants.maxTempLength = app->configuration.maxTempLength; axis->specializationConstants.code0 = (char*)malloc(sizeof(char) * app->configuration.maxCodeLength); char* code0 = axis->specializationConstants.code0; if (!code0) { deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } resFFT = shaderGenVkFFT_R2C_decomposition(code0, &axis->specializationConstants, floatType, floatTypeInputMemory, floatTypeOutputMemory, floatTypeKernelMemory, uintType, type); freeShaderGenVkFFT(&axis->specializationConstants); if (resFFT != VKFFT_SUCCESS) { 
deleteVkFFT(app); return resFFT; } #if(VKFFT_BACKEND==0) uint32_t* code; size_t codeSize; if (app->configuration.loadApplicationFromString) { uint32_t* localStrPointer = (uint32_t*)app->configuration.loadApplicationString + app->currentApplicationStringPos; codeSize = localStrPointer[0]; code = (uint32_t*)malloc(codeSize); if (!code) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } memcpy(code, localStrPointer + 1, codeSize); app->currentApplicationStringPos += codeSize / (sizeof(uint32_t)) + 1; } else { const glslang_resource_t default_resource = { /* .MaxLights = */ 32, /* .MaxClipPlanes = */ 6, /* .MaxTextureUnits = */ 32, /* .MaxTextureCoords = */ 32, /* .MaxVertexAttribs = */ 64, /* .MaxVertexUniformComponents = */ 4096, /* .MaxVaryingFloats = */ 64, /* .MaxVertexTextureImageUnits = */ 32, /* .MaxCombinedTextureImageUnits = */ 80, /* .MaxTextureImageUnits = */ 32, /* .MaxFragmentUniformComponents = */ 4096, /* .MaxDrawBuffers = */ 32, /* .MaxVertexUniformVectors = */ 128, /* .MaxVaryingVectors = */ 8, /* .MaxFragmentUniformVectors = */ 16, /* .MaxVertexOutputVectors = */ 16, /* .MaxFragmentInputVectors = */ 15, /* .MinProgramTexelOffset = */ -8, /* .MaxProgramTexelOffset = */ 7, /* .MaxClipDistances = */ 8, /* .MaxComputeWorkGroupCountX = */ 65535, /* .MaxComputeWorkGroupCountY = */ 65535, /* .MaxComputeWorkGroupCountZ = */ 65535, /* .MaxComputeWorkGroupSizeX = */ 1024, /* .MaxComputeWorkGroupSizeY = */ 1024, /* .MaxComputeWorkGroupSizeZ = */ 64, /* .MaxComputeUniformComponents = */ 1024, /* .MaxComputeTextureImageUnits = */ 16, /* .MaxComputeImageUniforms = */ 8, /* .MaxComputeAtomicCounters = */ 8, /* .MaxComputeAtomicCounterBuffers = */ 1, /* .MaxVaryingComponents = */ 60, /* .MaxVertexOutputComponents = */ 64, /* .MaxGeometryInputComponents = */ 64, /* .MaxGeometryOutputComponents = */ 128, /* .MaxFragmentInputComponents = */ 128, /* .MaxImageUnits = */ 8, /* .MaxCombinedImageUnitsAndFragmentOutputs = */ 8, /* 
.MaxCombinedShaderOutputResources = */ 8, /* .MaxImageSamples = */ 0, /* .MaxVertexImageUniforms = */ 0, /* .MaxTessControlImageUniforms = */ 0, /* .MaxTessEvaluationImageUniforms = */ 0, /* .MaxGeometryImageUniforms = */ 0, /* .MaxFragmentImageUniforms = */ 8, /* .MaxCombinedImageUniforms = */ 8, /* .MaxGeometryTextureImageUnits = */ 16, /* .MaxGeometryOutputVertices = */ 256, /* .MaxGeometryTotalOutputComponents = */ 1024, /* .MaxGeometryUniformComponents = */ 1024, /* .MaxGeometryVaryingComponents = */ 64, /* .MaxTessControlInputComponents = */ 128, /* .MaxTessControlOutputComponents = */ 128, /* .MaxTessControlTextureImageUnits = */ 16, /* .MaxTessControlUniformComponents = */ 1024, /* .MaxTessControlTotalOutputComponents = */ 4096, /* .MaxTessEvaluationInputComponents = */ 128, /* .MaxTessEvaluationOutputComponents = */ 128, /* .MaxTessEvaluationTextureImageUnits = */ 16, /* .MaxTessEvaluationUniformComponents = */ 1024, /* .MaxTessPatchComponents = */ 120, /* .MaxPatchVertices = */ 32, /* .MaxTessGenLevel = */ 64, /* .MaxViewports = */ 16, /* .MaxVertexAtomicCounters = */ 0, /* .MaxTessControlAtomicCounters = */ 0, /* .MaxTessEvaluationAtomicCounters = */ 0, /* .MaxGeometryAtomicCounters = */ 0, /* .MaxFragmentAtomicCounters = */ 8, /* .MaxCombinedAtomicCounters = */ 8, /* .MaxAtomicCounterBindings = */ 1, /* .MaxVertexAtomicCounterBuffers = */ 0, /* .MaxTessControlAtomicCounterBuffers = */ 0, /* .MaxTessEvaluationAtomicCounterBuffers = */ 0, /* .MaxGeometryAtomicCounterBuffers = */ 0, /* .MaxFragmentAtomicCounterBuffers = */ 1, /* .MaxCombinedAtomicCounterBuffers = */ 1, /* .MaxAtomicCounterBufferSize = */ 16384, /* .MaxTransformFeedbackBuffers = */ 4, /* .MaxTransformFeedbackInterleavedComponents = */ 64, /* .MaxCullDistances = */ 8, /* .MaxCombinedClipAndCullDistances = */ 8, /* .MaxSamples = */ 4, /* .maxMeshOutputVerticesNV = */ 256, /* .maxMeshOutputPrimitivesNV = */ 512, /* .maxMeshWorkGroupSizeX_NV = */ 32, /* .maxMeshWorkGroupSizeY_NV = */ 1, /* 
.maxMeshWorkGroupSizeZ_NV = */ 1, /* .maxTaskWorkGroupSizeX_NV = */ 32, /* .maxTaskWorkGroupSizeY_NV = */ 1, /* .maxTaskWorkGroupSizeZ_NV = */ 1, /* .maxMeshViewCountNV = */ 4, /* .maxDualSourceDrawBuffersEXT = */ 1, /* .limits = */ { /* .nonInductiveForLoops = */ 1, /* .whileLoops = */ 1, /* .doWhileLoops = */ 1, /* .generalUniformIndexing = */ 1, /* .generalAttributeMatrixVectorIndexing = */ 1, /* .generalVaryingIndexing = */ 1, /* .generalSamplerIndexing = */ 1, /* .generalVariableIndexing = */ 1, /* .generalConstantMatrixVectorIndexing = */ 1, } }; glslang_target_client_version_t client_version = (app->configuration.halfPrecision) ? GLSLANG_TARGET_VULKAN_1_1 : GLSLANG_TARGET_VULKAN_1_0; glslang_target_language_version_t target_language_version = (app->configuration.halfPrecision) ? GLSLANG_TARGET_SPV_1_3 : GLSLANG_TARGET_SPV_1_0; const glslang_input_t input = { GLSLANG_SOURCE_GLSL, GLSLANG_STAGE_COMPUTE, GLSLANG_CLIENT_VULKAN, client_version, GLSLANG_TARGET_SPV, target_language_version, code0, 450, GLSLANG_NO_PROFILE, 1, 0, GLSLANG_MSG_DEFAULT_BIT, &default_resource, }; //printf("%s\n", code0); glslang_shader_t* shader = glslang_shader_create(&input); const char* err; if (!glslang_shader_preprocess(shader, &input)) { err = glslang_shader_get_info_log(shader); printf("%s\n", code0); printf("%s\nVkFFT shader type: %" PRIu64 "\n", err, type); glslang_shader_delete(shader); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_SHADER_PREPROCESS; } if (!glslang_shader_parse(shader, &input)) { err = glslang_shader_get_info_log(shader); printf("%s\n", code0); printf("%s\nVkFFT shader type: %" PRIu64 "\n", err, type); glslang_shader_delete(shader); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_SHADER_PARSE; } glslang_program_t* program = glslang_program_create(); glslang_program_add_shader(program, shader); if (!glslang_program_link(program, GLSLANG_MSG_SPV_RULES_BIT | GLSLANG_MSG_VULKAN_RULES_BIT)) { err = 
glslang_program_get_info_log(program); printf("%s\n", code0); printf("%s\nVkFFT shader type: %" PRIu64 "\n", err, type); glslang_shader_delete(shader); glslang_program_delete(program); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_SHADER_LINK; } glslang_program_SPIRV_generate(program, input.stage); if (glslang_program_SPIRV_get_messages(program)) { printf("%s", glslang_program_SPIRV_get_messages(program)); glslang_shader_delete(shader); glslang_program_delete(program); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_SPIRV_GENERATE; } glslang_shader_delete(shader); uint32_t* tempCode = glslang_program_SPIRV_get_ptr(program); codeSize = glslang_program_SPIRV_get_size(program) * sizeof(uint32_t); axis->binarySize = codeSize; code = (uint32_t*)malloc(codeSize); if (!code) { free(code0); code0 = 0; glslang_program_delete(program); deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } axis->binary = code; memcpy(code, tempCode, codeSize); glslang_program_delete(program); } VkPipelineShaderStageCreateInfo pipelineShaderStageCreateInfo = { VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO }; VkComputePipelineCreateInfo computePipelineCreateInfo = { VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO }; pipelineShaderStageCreateInfo.stage = VK_SHADER_STAGE_COMPUTE_BIT; VkShaderModuleCreateInfo createInfo = { VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO }; createInfo.pCode = code; createInfo.codeSize = codeSize; res = vkCreateShaderModule(app->configuration.device[0], &createInfo, 0, &pipelineShaderStageCreateInfo.module); if (res != VK_SUCCESS) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_SHADER_MODULE; } VkPipelineLayoutCreateInfo pipelineLayoutCreateInfo = { VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO }; pipelineLayoutCreateInfo.setLayoutCount = 1; pipelineLayoutCreateInfo.pSetLayouts = &axis->descriptorSetLayout; VkPushConstantRange pushConstantRange = { VK_SHADER_STAGE_COMPUTE_BIT }; 
pushConstantRange.offset = 0; pushConstantRange.size = (uint32_t)axis->pushConstants.structSize; // Push constant ranges are part of the pipeline layout if (axis->pushConstants.structSize) { pipelineLayoutCreateInfo.pushConstantRangeCount = 1; pipelineLayoutCreateInfo.pPushConstantRanges = &pushConstantRange; } res = vkCreatePipelineLayout(app->configuration.device[0], &pipelineLayoutCreateInfo, 0, &axis->pipelineLayout); if (res != VK_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_PIPELINE_LAYOUT; } pipelineShaderStageCreateInfo.pName = "main"; pipelineShaderStageCreateInfo.pSpecializationInfo = 0;// &specializationInfo; computePipelineCreateInfo.stage = pipelineShaderStageCreateInfo; computePipelineCreateInfo.layout = axis->pipelineLayout; res = vkCreateComputePipelines(app->configuration.device[0], VK_NULL_HANDLE, 1, &computePipelineCreateInfo, 0, &axis->pipeline); if (res != VK_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_PIPELINE; } vkDestroyShaderModule(app->configuration.device[0], pipelineShaderStageCreateInfo.module, 0); if (!app->configuration.saveApplicationToString) { free(code); code = 0; } #elif(VKFFT_BACKEND==1) char* code; size_t codeSize; if (app->configuration.loadApplicationFromString) { char* localStrPointer = (char*)app->configuration.loadApplicationString + app->currentApplicationStringPos; codeSize = strtol(localStrPointer, &localStrPointer, 10); code = (char*)malloc(codeSize); if (!code) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } memcpy(code, localStrPointer + 1, codeSize - 1); code[codeSize - 1] = '\0'; app->currentApplicationStringPos += codeSize + (uint64_t)(floor(log10((double)codeSize))) + 1; } else { nvrtcProgram prog; nvrtcResult result = nvrtcCreateProgram(&prog, // prog code0, // buffer "VkFFT.cu", // name 0, // numHeaders 0, // headers 0); // includeNames //free(includeNames); //free(headers); if (result != NVRTC_SUCCESS) { printf("nvrtcCreateProgram error: 
%s\n", nvrtcGetErrorString(result)); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_PROGRAM; } //const char opts[20] = "--fmad=false"; //result = nvrtcAddNameExpression(prog, "&consts"); //if (result != NVRTC_SUCCESS) printf("1.5 error: %s\n", nvrtcGetErrorString(result)); result = nvrtcCompileProgram(prog, // prog 0, // numOptions 0); // options if (result != NVRTC_SUCCESS) { printf("nvrtcCompileProgram error: %s\n", nvrtcGetErrorString(result)); char* log = (char*)malloc(sizeof(char) * 4000000); if (!log) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM; } else { nvrtcGetProgramLog(prog, log); printf("%s\n", log); free(log); log = 0; printf("%s\n", code0); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM; } } result = nvrtcGetPTXSize(prog, &codeSize); if (result != NVRTC_SUCCESS) { printf("nvrtcGetPTXSize error: %s\n", nvrtcGetErrorString(result)); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_CODE_SIZE; } axis->binarySize = codeSize; code = (char*)malloc(codeSize); if (!code) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } axis->binary = code; result = nvrtcGetPTX(prog, code); if (result != NVRTC_SUCCESS) { printf("nvrtcGetPTX error: %s\n", nvrtcGetErrorString(result)); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_CODE; } result = nvrtcDestroyProgram(&prog); if (result != NVRTC_SUCCESS) { printf("nvrtcDestroyProgram error: %s\n", nvrtcGetErrorString(result)); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_DESTROY_PROGRAM; } } CUresult result2 = cuModuleLoadDataEx(&axis->VkFFTModule, code, 0, 0, 0); if (result2 != CUDA_SUCCESS) { printf("cuModuleLoadDataEx error: %d\n", result2); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_LOAD_MODULE; } result2 = 
cuModuleGetFunction(&axis->VkFFTKernel, axis->VkFFTModule, "VkFFT_main_R2C"); if (result2 != CUDA_SUCCESS) { printf("cuModuleGetFunction error: %d\n", result2); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_FUNCTION; } if (axis->specializationConstants.usedSharedMemory > app->configuration.sharedMemorySizeStatic) { result2 = cuFuncSetAttribute(axis->VkFFTKernel, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, (int)axis->specializationConstants.usedSharedMemory); if (result2 != CUDA_SUCCESS) { printf("cuFuncSetAttribute error: %d\n", result2); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_SET_DYNAMIC_SHARED_MEMORY; } } if (axis->pushConstants.structSize) { size_t size = axis->pushConstants.structSize; result2 = cuModuleGetGlobal(&axis->consts_addr, &size, axis->VkFFTModule, "consts"); if (result2 != CUDA_SUCCESS) { printf("cuModuleGetGlobal error: %d\n", result2); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_MODULE_GET_GLOBAL; } } if (!app->configuration.saveApplicationToString) { free(code); code = 0; } #elif(VKFFT_BACKEND==2) uint32_t* code; size_t codeSize; if (app->configuration.loadApplicationFromString) { uint32_t* localStrPointer = (uint32_t*)app->configuration.loadApplicationString + app->currentApplicationStringPos; codeSize = localStrPointer[0]; code = (uint32_t*)malloc(codeSize); if (!code) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } memcpy(code, localStrPointer + 1, codeSize); app->currentApplicationStringPos += codeSize / (sizeof(uint32_t)) + 1; } else { hiprtcProgram prog; enum hiprtcResult result = hiprtcCreateProgram(&prog, // prog code0, // buffer "VkFFT.hip", // name 0, // numHeaders 0, // headers 0); // includeNames if (result != HIPRTC_SUCCESS) { printf("hiprtcCreateProgram error: %s\n", hiprtcGetErrorString(result)); free(code0); code0 = 0; deleteVkFFT(app); return 
VKFFT_ERROR_FAILED_TO_CREATE_PROGRAM; } if (axis->pushConstants.structSize) { result = hiprtcAddNameExpression(prog, "&consts"); if (result != HIPRTC_SUCCESS) { printf("hiprtcAddNameExpression error: %s\n", hiprtcGetErrorString(result)); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_ADD_NAME_EXPRESSION; } } result = hiprtcCompileProgram(prog, /* prog */ 0, /* numOptions */ 0); /* options */ if (result != HIPRTC_SUCCESS) { printf("hiprtcCompileProgram error: %s\n", hiprtcGetErrorString(result)); char* log = (char*)malloc(sizeof(char) * 100000); if (!log) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM; } else { hiprtcGetProgramLog(prog, log); printf("%s\n", log); free(log); log = 0; printf("%s\n", code0); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM; } } result = hiprtcGetCodeSize(prog, &codeSize); if (result != HIPRTC_SUCCESS) { printf("hiprtcGetCodeSize error: %s\n", hiprtcGetErrorString(result)); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_CODE_SIZE; } axis->binarySize = codeSize; code = (uint32_t*)malloc(codeSize); if (!code) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } axis->binary = code; result = hiprtcGetCode(prog, (char*)code); if (result != HIPRTC_SUCCESS) { printf("hiprtcGetCode error: %s\n", hiprtcGetErrorString(result)); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_CODE; } //printf("%s\n", code); // Destroy the program. 
result = hiprtcDestroyProgram(&prog); if (result != HIPRTC_SUCCESS) { printf("hiprtcDestroyProgram error: %s\n", hiprtcGetErrorString(result)); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_DESTROY_PROGRAM; } } hipError_t result2 = hipModuleLoadDataEx(&axis->VkFFTModule, code, 0, 0, 0); if (result2 != hipSuccess) { printf("hipModuleLoadDataEx error: %d\n", result2); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_LOAD_MODULE; } result2 = hipModuleGetFunction(&axis->VkFFTKernel, axis->VkFFTModule, "VkFFT_main_R2C"); if (result2 != hipSuccess) { printf("hipModuleGetFunction error: %d\n", result2); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_FUNCTION; } if (axis->specializationConstants.usedSharedMemory > app->configuration.sharedMemorySizeStatic) { result2 = hipFuncSetAttribute(axis->VkFFTKernel, hipFuncAttributeMaxDynamicSharedMemorySize, (int)axis->specializationConstants.usedSharedMemory); //result2 = hipFuncSetCacheConfig(axis->VkFFTKernel, hipFuncCachePreferShared); if (result2 != hipSuccess) { printf("hipFuncSetAttribute error: %d\n", result2); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_SET_DYNAMIC_SHARED_MEMORY; } } if (axis->pushConstants.structSize) { size_t size = axis->pushConstants.structSize; result2 = hipModuleGetGlobal(&axis->consts_addr, &size, axis->VkFFTModule, "consts"); if (result2 != hipSuccess) { printf("hipModuleGetGlobal error: %d\n", result2); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_MODULE_GET_GLOBAL; } } if (!app->configuration.saveApplicationToString) { free(code); code = 0; } #elif(VKFFT_BACKEND==3) if (app->configuration.loadApplicationFromString) { char* code; size_t codeSize; char* localStrPointer = (char*)app->configuration.loadApplicationString + app->currentApplicationStringPos; codeSize = 
strtol(localStrPointer, &localStrPointer, 10); code = (char*)malloc(codeSize); if (!code) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } memcpy(code, localStrPointer + 1, codeSize - 2); code[codeSize - 2] = '\0'; app->currentApplicationStringPos += codeSize + (uint64_t)(floor(log10((double)codeSize))); axis->program = clCreateProgramWithBinary(app->configuration.context[0], 1, app->configuration.device, &codeSize, (const unsigned char**)(&code), 0, &res); if (res != CL_SUCCESS) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_PROGRAM; } } else { size_t codelen = strlen(code0); axis->program = clCreateProgramWithSource(app->configuration.context[0], 1, (const char**)&code0, &codelen, &res); if (res != CL_SUCCESS) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_PROGRAM; } } res = clBuildProgram(axis->program, 1, app->configuration.device, 0, 0, 0); if (res != CL_SUCCESS) { size_t log_size; clGetProgramBuildInfo(axis->program, app->configuration.device[0], CL_PROGRAM_BUILD_LOG, 0, 0, &log_size); char* log = (char*)malloc(log_size); if (!log) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM; } else { clGetProgramBuildInfo(axis->program, app->configuration.device[0], CL_PROGRAM_BUILD_LOG, log_size, log, 0); printf("%s\n", log); free(log); log = 0; printf("%s\n", code0); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM; } } if (app->configuration.saveApplicationToString) { size_t codeSize; res = clGetProgramInfo(axis->program, CL_PROGRAM_BINARY_SIZES, sizeof(size_t), &codeSize, NULL); if (res != CL_SUCCESS) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM; } axis->binarySize = codeSize; axis->binary = (char*)malloc(axis->binarySize); if (!axis->binary) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } res = 
clGetProgramInfo(axis->program, CL_PROGRAM_BINARIES, axis->binarySize, &axis->binary, NULL); if (res != CL_SUCCESS) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM; } } axis->kernel = clCreateKernel(axis->program, "VkFFT_main_R2C", &res); if (res != CL_SUCCESS) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_SHADER_MODULE; } #endif if (!app->configuration.keepShaderCode) { free(code0); code0 = 0; axis->specializationConstants.code0 = 0; } } return resFFT; } static inline VkFFTResult VkFFTPlanAxis(VkFFTApplication* app, VkFFTPlan* FFTPlan, uint64_t axis_id, uint64_t axis_upload_id, uint64_t inverse, uint64_t reverseBluesteinMultiUpload) { //get radix stages VkFFTResult resFFT = VKFFT_SUCCESS; #if(VKFFT_BACKEND==0) VkResult res = VK_SUCCESS; #elif(VKFFT_BACKEND==1) cudaError_t res = cudaSuccess; #elif(VKFFT_BACKEND==2) hipError_t res = hipSuccess; #elif(VKFFT_BACKEND==3) cl_int res = CL_SUCCESS; #endif VkFFTAxis* axis = (reverseBluesteinMultiUpload) ? 
&FFTPlan->inverseBluesteinAxes[axis_id][axis_upload_id] : &FFTPlan->axes[axis_id][axis_upload_id]; axis->specializationConstants.sourceFFTSize = app->configuration.size[axis_id]; axis->specializationConstants.numBatches = app->configuration.numberBatches; if ((app->configuration.FFTdim == 1) && (FFTPlan->actualFFTSizePerAxis[axis_id][1] == 1) && ((app->configuration.numberBatches > 1) || (app->actualNumBatches > 1)) && (!app->configuration.performConvolution) && (app->configuration.coordinateFeatures == 1)) { if (app->configuration.numberBatches > 1) { app->actualNumBatches = app->configuration.numberBatches; app->configuration.numberBatches = 1; } FFTPlan->actualFFTSizePerAxis[axis_id][1] = app->actualNumBatches; } axis->specializationConstants.warpSize = app->configuration.warpSize; axis->specializationConstants.numSharedBanks = app->configuration.numSharedBanks; axis->specializationConstants.useUint64 = app->configuration.useUint64; axis->specializationConstants.numAxisUploads = FFTPlan->numAxisUploads[axis_id]; uint64_t complexSize; if (app->configuration.doublePrecision || app->configuration.doublePrecisionFloatMemory) complexSize = (2 * sizeof(double)); else if (app->configuration.halfPrecision) complexSize = (2 * sizeof(float)); else complexSize = (2 * sizeof(float)); axis->specializationConstants.complexSize = complexSize; axis->specializationConstants.supportAxis = 0; axis->specializationConstants.symmetricKernel = app->configuration.symmetricKernel; axis->specializationConstants.conjugateConvolution = app->configuration.conjugateConvolution; axis->specializationConstants.crossPowerSpectrumNormalization = app->configuration.crossPowerSpectrumNormalization; uint64_t maxSequenceLengthSharedMemory = app->configuration.sharedMemorySize / complexSize; uint64_t maxSequenceLengthSharedMemoryPow2 = app->configuration.sharedMemorySizePow2 / complexSize; uint64_t maxSingleSizeStrided = (app->configuration.coalescedMemory > complexSize) ? 
app->configuration.sharedMemorySize / (app->configuration.coalescedMemory) : app->configuration.sharedMemorySize / complexSize; uint64_t maxSingleSizeStridedPow2 = (app->configuration.coalescedMemory > complexSize) ? app->configuration.sharedMemorySizePow2 / (app->configuration.coalescedMemory) : app->configuration.sharedMemorySizePow2 / complexSize; axis->specializationConstants.stageStartSize = 1; for (uint64_t i = 0; i < axis_upload_id; i++) axis->specializationConstants.stageStartSize *= FFTPlan->axisSplit[axis_id][i]; axis->specializationConstants.firstStageStartSize = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id] / FFTPlan->axisSplit[axis_id][FFTPlan->numAxisUploads[axis_id] - 1]; axis->specializationConstants.dispatchZactualFFTSize = (axis_id < 2) ? FFTPlan->actualFFTSizePerAxis[axis_id][2] : FFTPlan->actualFFTSizePerAxis[axis_id][1]; if (axis_id == 0) { //configure radix stages axis->specializationConstants.fft_dim_x = axis->specializationConstants.stageStartSize; } else { axis->specializationConstants.fft_dim_x = FFTPlan->actualFFTSizePerAxis[axis_id][0]; } if (app->useBluesteinFFT[axis_id]) { axis->specializationConstants.useBluesteinFFT = 1; } if (app->configuration.performDCT == 3) { axis->specializationConstants.actualInverse = inverse; axis->specializationConstants.inverse = !inverse; } else { if (app->configuration.performDCT == 4) { axis->specializationConstants.actualInverse = inverse; axis->specializationConstants.inverse = 1; } else { axis->specializationConstants.actualInverse = inverse; axis->specializationConstants.inverse = inverse; } } if (app->useBluesteinFFT[axis_id]) { axis->specializationConstants.actualInverse = inverse; axis->specializationConstants.inverse = reverseBluesteinMultiUpload; if (app->configuration.performDCT == 3) { axis->specializationConstants.inverseBluestein = !inverse; } else { if (app->configuration.performDCT == 4) { axis->specializationConstants.inverseBluestein = 1; } else { 
axis->specializationConstants.inverseBluestein = inverse; } } } axis->specializationConstants.reverseBluesteinMultiUpload = reverseBluesteinMultiUpload; axis->specializationConstants.reorderFourStep = ((FFTPlan->numAxisUploads[axis_id] > 1) && (!app->useBluesteinFFT[axis_id])) ? app->configuration.reorderFourStep : 0; if ((axis_id == 0) && ((FFTPlan->numAxisUploads[axis_id] == 1) || ((axis_upload_id == 0) && (!axis->specializationConstants.reorderFourStep)))) { maxSequenceLengthSharedMemory *= axis->specializationConstants.registerBoost; maxSequenceLengthSharedMemoryPow2 = (uint64_t)pow(2, (uint64_t)log2(maxSequenceLengthSharedMemory)); } else { maxSingleSizeStrided *= axis->specializationConstants.registerBoost; maxSingleSizeStridedPow2 = (uint64_t)pow(2, (uint64_t)log2(maxSingleSizeStrided)); } axis->specializationConstants.performR2C = FFTPlan->actualPerformR2CPerAxis[axis_id]; axis->specializationConstants.performR2CmultiUpload = FFTPlan->multiUploadR2C; if (app->configuration.performDCT == 3) { axis->specializationConstants.performDCT = 2; } else { axis->specializationConstants.performDCT = app->configuration.performDCT; } if ((axis->specializationConstants.performR2CmultiUpload) && (app->configuration.size[0] % 2 != 0)) return VKFFT_ERROR_UNSUPPORTED_FFT_LENGTH_R2C; axis->specializationConstants.mergeSequencesR2C = ((axis->specializationConstants.fftDim < maxSequenceLengthSharedMemory) && ((FFTPlan->actualFFTSizePerAxis[axis_id][1] % 2) == 0) && ((FFTPlan->actualPerformR2CPerAxis[axis_id]) || (((app->configuration.performDCT == 3) || (app->configuration.performDCT == 2) || (app->configuration.performDCT == 1) || ((app->configuration.performDCT == 4) && ((app->configuration.size[axis_id] % 2) != 0))) && (axis_id == 0)))) ? 
(1 - app->configuration.disableMergeSequencesR2C) : 0;
	//uint64_t passID = FFTPlan->numAxisUploads[axis_id] - 1 - axis_upload_id;
	axis->specializationConstants.fft_dim_full = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id];
	if ((FFTPlan->numAxisUploads[axis_id] > 1) && (axis->specializationConstants.reorderFourStep || app->useBluesteinFFT[axis_id]) && (!app->configuration.userTempBuffer) && (app->configuration.allocateTempBuffer == 0)) {
		app->configuration.allocateTempBuffer = 1;
#if(VKFFT_BACKEND==0)
		app->configuration.tempBuffer = (VkBuffer*)malloc(sizeof(VkBuffer));
		if (!app->configuration.tempBuffer) {
			deleteVkFFT(app);
			return VKFFT_ERROR_MALLOC_FAILED;
		}
		resFFT = allocateFFTBuffer(app, app->configuration.tempBuffer, &app->configuration.tempBufferDeviceMemory, VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT, VK_MEMORY_HEAP_DEVICE_LOCAL_BIT, app->configuration.tempBufferSize[0]);
		if (resFFT != VKFFT_SUCCESS) {
			deleteVkFFT(app);
			return resFFT;
		}
#elif(VKFFT_BACKEND==1)
		app->configuration.tempBuffer = (void**)malloc(sizeof(void*));
		if (!app->configuration.tempBuffer) {
			deleteVkFFT(app);
			return VKFFT_ERROR_MALLOC_FAILED;
		}
		res = cudaMalloc(app->configuration.tempBuffer, app->configuration.tempBufferSize[0]);
		if (res != cudaSuccess) {
			deleteVkFFT(app);
			return VKFFT_ERROR_FAILED_TO_ALLOCATE;
		}
#elif(VKFFT_BACKEND==2)
		app->configuration.tempBuffer = (void**)malloc(sizeof(void*));
		if (!app->configuration.tempBuffer) {
			deleteVkFFT(app);
			return VKFFT_ERROR_MALLOC_FAILED;
		}
		res = hipMalloc(app->configuration.tempBuffer, app->configuration.tempBufferSize[0]);
		if (res != hipSuccess) {
			deleteVkFFT(app);
			return VKFFT_ERROR_FAILED_TO_ALLOCATE;
		}
#elif(VKFFT_BACKEND==3)
		app->configuration.tempBuffer = (cl_mem*)malloc(sizeof(cl_mem));
		if (!app->configuration.tempBuffer) {
			deleteVkFFT(app);
			return VKFFT_ERROR_MALLOC_FAILED;
		}
		app->configuration.tempBuffer[0] = clCreateBuffer(app->configuration.context[0], CL_MEM_READ_WRITE,
app->configuration.tempBufferSize[0], 0, &res); if (res != CL_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_ALLOCATE; } #endif } //allocate LUT if (app->configuration.useLUT) { double double_PI = 3.1415926535897932384626433832795; uint64_t dimMult = 1; uint64_t maxStageSum = 0; for (uint64_t i = 0; i < axis->specializationConstants.numStages; i++) { switch (axis->specializationConstants.stageRadix[i]) { case 2: maxStageSum += dimMult; break; case 3: maxStageSum += dimMult * 2; break; case 4: maxStageSum += dimMult * 2; break; case 5: maxStageSum += dimMult * 4; break; case 7: maxStageSum += dimMult * 6; break; case 8: maxStageSum += dimMult * 3; break; case 11: maxStageSum += dimMult * 10; break; case 13: maxStageSum += dimMult * 12; break; } dimMult *= axis->specializationConstants.stageRadix[i]; } axis->specializationConstants.maxStageSumLUT = maxStageSum; dimMult = 1; if (app->configuration.doublePrecision || app->configuration.doublePrecisionFloatMemory) { if (axis_upload_id > 0) { if ((app->configuration.performDCT == 2) || (app->configuration.performDCT == 3)) { axis->specializationConstants.startDCT3LUT = (maxStageSum + axis->specializationConstants.stageStartSize * axis->specializationConstants.fftDim); axis->bufferLUTSize = (maxStageSum + axis->specializationConstants.stageStartSize * axis->specializationConstants.fftDim + (app->configuration.size[axis_id] / 2 + 2)) * 2 * sizeof(double); } else { if ((app->configuration.performDCT == 4) && (app->configuration.size[axis_id] % 2 == 0)) { axis->specializationConstants.startDCT3LUT = (maxStageSum + axis->specializationConstants.stageStartSize * axis->specializationConstants.fftDim); axis->specializationConstants.startDCT4LUT = (axis->specializationConstants.startDCT3LUT + (app->configuration.size[axis_id] / 4 + 2)); axis->bufferLUTSize = (maxStageSum + axis->specializationConstants.stageStartSize * axis->specializationConstants.fftDim + (app->configuration.size[axis_id] / 4 + 2) + 
app->configuration.size[axis_id] / 2) * 2 * sizeof(double); } else axis->bufferLUTSize = (maxStageSum + axis->specializationConstants.stageStartSize * axis->specializationConstants.fftDim) * 2 * sizeof(double); } } else { if ((app->configuration.performDCT == 2) || (app->configuration.performDCT == 3)) { axis->specializationConstants.startDCT3LUT = (maxStageSum); axis->bufferLUTSize = (maxStageSum + (app->configuration.size[axis_id] / 2 + 2)) * 2 * sizeof(double); } else { if ((app->configuration.performDCT == 4) && (app->configuration.size[axis_id] % 2 == 0)) { axis->specializationConstants.startDCT3LUT = (maxStageSum); axis->specializationConstants.startDCT4LUT = (axis->specializationConstants.startDCT3LUT + (app->configuration.size[axis_id] / 4 + 2)); axis->bufferLUTSize = (maxStageSum + (app->configuration.size[axis_id] / 4 + 2) + app->configuration.size[axis_id] / 2) * 2 * sizeof(double); } else axis->bufferLUTSize = (maxStageSum) * 2 * sizeof(double); } } double* tempLUT = (double*)malloc(axis->bufferLUTSize); if (!tempLUT) { deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } uint64_t localStageSize = 1; uint64_t localStageSum = 0; for (uint64_t i = 0; i < axis->specializationConstants.numStages; i++) { if ((axis->specializationConstants.stageRadix[i] & (axis->specializationConstants.stageRadix[i] - 1)) == 0) { for (uint64_t k = 0; k < log2(axis->specializationConstants.stageRadix[i]); k++) { for (uint64_t j = 0; j < localStageSize; j++) { tempLUT[2 * (j + localStageSum)] = cos(j * double_PI / localStageSize / pow(2, k)); tempLUT[2 * (j + localStageSum) + 1] = sin(j * double_PI / localStageSize / pow(2, k)); } localStageSum += localStageSize; } localStageSize *= axis->specializationConstants.stageRadix[i]; } else { for (uint64_t k = (axis->specializationConstants.stageRadix[i] - 1); k > 0; k--) { for (uint64_t j = 0; j < localStageSize; j++) { tempLUT[2 * (j + localStageSum)] = cos(j * 2.0 * k / axis->specializationConstants.stageRadix[i] * double_PI / 
localStageSize); tempLUT[2 * (j + localStageSum) + 1] = sin(j * 2.0 * k / axis->specializationConstants.stageRadix[i] * double_PI / localStageSize); } localStageSum += localStageSize; } localStageSize *= axis->specializationConstants.stageRadix[i]; } } if (axis_upload_id > 0) { for (uint64_t i = 0; i < axis->specializationConstants.stageStartSize; i++) { for (uint64_t j = 0; j < axis->specializationConstants.fftDim; j++) { double angle = 2 * double_PI * ((i * j) / (double)(axis->specializationConstants.stageStartSize * axis->specializationConstants.fftDim)); tempLUT[maxStageSum * 2 + 2 * (i + j * axis->specializationConstants.stageStartSize)] = cos(angle); tempLUT[maxStageSum * 2 + 2 * (i + j * axis->specializationConstants.stageStartSize) + 1] = sin(angle); } } } if ((app->configuration.performDCT == 2) || (app->configuration.performDCT == 3)) { for (uint64_t j = 0; j < app->configuration.size[axis_id] / 2 + 2; j++) { double angle = (double_PI / 2.0 / (double)(app->configuration.size[axis_id])) * j; tempLUT[2 * axis->specializationConstants.startDCT3LUT + 2 * j] = cos(angle); tempLUT[2 * axis->specializationConstants.startDCT3LUT + 2 * j + 1] = sin(angle); } } if ((app->configuration.performDCT == 4) && (app->configuration.size[axis_id] % 2 == 0)) { for (uint64_t j = 0; j < app->configuration.size[axis_id] / 4 + 2; j++) { double angle = (double_PI / 2.0 / (double)(app->configuration.size[axis_id] / 2)) * j; tempLUT[2 * axis->specializationConstants.startDCT3LUT + 2 * j] = cos(angle); tempLUT[2 * axis->specializationConstants.startDCT3LUT + 2 * j + 1] = sin(angle); } for (uint64_t j = 0; j < app->configuration.size[axis_id] / 2; j++) { double angle = (-double_PI / 8.0 / (double)(app->configuration.size[axis_id] / 2)) * (2 * j + 1); tempLUT[2 * axis->specializationConstants.startDCT4LUT + 2 * j] = cos(angle); tempLUT[2 * axis->specializationConstants.startDCT4LUT + 2 * j + 1] = sin(angle); } } axis->referenceLUT = 0; if (reverseBluesteinMultiUpload == 1) { 
					// Reverse Bluestein pass: reuse the LUT of the matching forward axis rather than allocating a new one.
axis->bufferLUT = FFTPlan->axes[axis_id][axis_upload_id].bufferLUT; #if(VKFFT_BACKEND==0) axis->bufferLUTDeviceMemory = FFTPlan->axes[axis_id][axis_upload_id].bufferLUTDeviceMemory; #endif axis->bufferLUTSize = FFTPlan->axes[axis_id][axis_upload_id].bufferLUTSize; axis->referenceLUT = 1; } else { if ((!inverse) && (!app->configuration.makeForwardPlanOnly)) { axis->bufferLUT = app->localFFTPlan_inverse->axes[axis_id][axis_upload_id].bufferLUT; #if(VKFFT_BACKEND==0) axis->bufferLUTDeviceMemory = app->localFFTPlan_inverse->axes[axis_id][axis_upload_id].bufferLUTDeviceMemory; #endif axis->bufferLUTSize = app->localFFTPlan_inverse->axes[axis_id][axis_upload_id].bufferLUTSize; axis->referenceLUT = 1; } else { if (((axis_id == 1) || (axis_id == 2)) && (!((!axis->specializationConstants.reorderFourStep) && (FFTPlan->numAxisUploads[axis_id] > 1))) && ((axis->specializationConstants.fft_dim_full == FFTPlan->axes[0][0].specializationConstants.fft_dim_full) && (FFTPlan->numAxisUploads[axis_id] == 1) && (axis->specializationConstants.fft_dim_full < maxSingleSizeStrided / axis->specializationConstants.registerBoost)) && ((!app->configuration.performDCT) || (app->configuration.size[axis_id] == app->configuration.size[0]))) { axis->bufferLUT = FFTPlan->axes[0][axis_upload_id].bufferLUT; #if(VKFFT_BACKEND==0) axis->bufferLUTDeviceMemory = FFTPlan->axes[0][axis_upload_id].bufferLUTDeviceMemory; #endif axis->bufferLUTSize = FFTPlan->axes[0][axis_upload_id].bufferLUTSize; axis->referenceLUT = 1; } else { if ((axis_id == 2) && (axis->specializationConstants.fft_dim_full == FFTPlan->axes[1][0].specializationConstants.fft_dim_full) && ((!app->configuration.performDCT) || (app->configuration.size[2] == app->configuration.size[1]))) { axis->bufferLUT = FFTPlan->axes[1][axis_upload_id].bufferLUT; #if(VKFFT_BACKEND==0) axis->bufferLUTDeviceMemory = FFTPlan->axes[1][axis_upload_id].bufferLUTDeviceMemory; #endif axis->bufferLUTSize = FFTPlan->axes[1][axis_upload_id].bufferLUTSize; 
axis->referenceLUT = 1; } else { #if(VKFFT_BACKEND==0) resFFT = allocateFFTBuffer(app, &axis->bufferLUT, &axis->bufferLUTDeviceMemory, VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT, VK_MEMORY_HEAP_DEVICE_LOCAL_BIT, axis->bufferLUTSize); if (resFFT != VKFFT_SUCCESS) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return resFFT; } resFFT = transferDataFromCPU(app, tempLUT, &axis->bufferLUT, axis->bufferLUTSize); if (resFFT != VKFFT_SUCCESS) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return resFFT; } #elif(VKFFT_BACKEND==1) res = cudaMalloc((void**)&axis->bufferLUT, axis->bufferLUTSize); if (res != cudaSuccess) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return VKFFT_ERROR_FAILED_TO_ALLOCATE; } res = cudaMemcpy(axis->bufferLUT, tempLUT, axis->bufferLUTSize, cudaMemcpyHostToDevice); if (res != cudaSuccess) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return VKFFT_ERROR_FAILED_TO_ALLOCATE; } #elif(VKFFT_BACKEND==2) res = hipMalloc((void**)&axis->bufferLUT, axis->bufferLUTSize); if (res != hipSuccess) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return VKFFT_ERROR_FAILED_TO_ALLOCATE; } res = hipMemcpy(axis->bufferLUT, tempLUT, axis->bufferLUTSize, hipMemcpyHostToDevice); if (res != hipSuccess) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return VKFFT_ERROR_FAILED_TO_ALLOCATE; } #elif(VKFFT_BACKEND==3) axis->bufferLUT = clCreateBuffer(app->configuration.context[0], CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, axis->bufferLUTSize, tempLUT, &res); if (res != CL_SUCCESS) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return VKFFT_ERROR_FAILED_TO_ALLOCATE; } #endif } } } } free(tempLUT); tempLUT = 0; } else { if (axis_upload_id > 0) { if ((app->configuration.performDCT == 2) || (app->configuration.performDCT == 3)) { axis->specializationConstants.startDCT3LUT = (maxStageSum + axis->specializationConstants.stageStartSize * axis->specializationConstants.fftDim); axis->bufferLUTSize = (maxStageSum + 
axis->specializationConstants.stageStartSize * axis->specializationConstants.fftDim + (app->configuration.size[axis_id] / 2 + 2)) * 2 * sizeof(float);
				}
				else {
					if ((app->configuration.performDCT == 4) && (app->configuration.size[axis_id] % 2 == 0)) {
						axis->specializationConstants.startDCT3LUT = (maxStageSum + axis->specializationConstants.stageStartSize * axis->specializationConstants.fftDim);
						// The DCT-IV table starts after the size / 4 + 2 DCT-III entries, matching the double-precision branch and bufferLUTSize below.
						axis->specializationConstants.startDCT4LUT = (axis->specializationConstants.startDCT3LUT + (app->configuration.size[axis_id] / 4 + 2));
						axis->bufferLUTSize = (maxStageSum + axis->specializationConstants.stageStartSize * axis->specializationConstants.fftDim + (app->configuration.size[axis_id] / 4 + 2) + app->configuration.size[axis_id] / 2) * 2 * sizeof(float);
					}
					else
						axis->bufferLUTSize = (maxStageSum + axis->specializationConstants.stageStartSize * axis->specializationConstants.fftDim) * 2 * sizeof(float);
				}
			}
			else {
				if ((app->configuration.performDCT == 2) || (app->configuration.performDCT == 3)) {
					axis->specializationConstants.startDCT3LUT = (maxStageSum);
					axis->bufferLUTSize = (maxStageSum + (app->configuration.size[axis_id] / 2 + 2)) * 2 * sizeof(float);
				}
				else {
					if ((app->configuration.performDCT == 4) && (app->configuration.size[axis_id] % 2 == 0)) {
						axis->specializationConstants.startDCT3LUT = (maxStageSum);
						axis->specializationConstants.startDCT4LUT = (axis->specializationConstants.startDCT3LUT + (app->configuration.size[axis_id] / 4 + 2));
						axis->bufferLUTSize = (maxStageSum + (app->configuration.size[axis_id] / 4 + 2) + app->configuration.size[axis_id] / 2) * 2 * sizeof(float);
					}
					else
						axis->bufferLUTSize = (maxStageSum) * 2 * sizeof(float);
				}
			}
			float* tempLUT = (float*)malloc(axis->bufferLUTSize);
			if (!tempLUT) {
				deleteVkFFT(app);
				return VKFFT_ERROR_MALLOC_FAILED;
			}
			uint64_t localStageSize = 1;
			uint64_t localStageSum = 0;
			for (uint64_t i = 0; i < axis->specializationConstants.numStages; i++) {
				if ((axis->specializationConstants.stageRadix[i] &
(axis->specializationConstants.stageRadix[i] - 1)) == 0) { for (uint64_t k = 0; k < log2(axis->specializationConstants.stageRadix[i]); k++) { for (uint64_t j = 0; j < localStageSize; j++) { tempLUT[2 * (j + localStageSum)] = (float)cos(j * double_PI / localStageSize / pow(2, k)); tempLUT[2 * (j + localStageSum) + 1] = (float)sin(j * double_PI / localStageSize / pow(2, k)); } localStageSum += localStageSize; } localStageSize *= axis->specializationConstants.stageRadix[i]; } else { for (uint64_t k = (axis->specializationConstants.stageRadix[i] - 1); k > 0; k--) { for (uint64_t j = 0; j < localStageSize; j++) { tempLUT[2 * (j + localStageSum)] = (float)cos(j * 2.0 * k / axis->specializationConstants.stageRadix[i] * double_PI / localStageSize); tempLUT[2 * (j + localStageSum) + 1] = (float)sin(j * 2.0 * k / axis->specializationConstants.stageRadix[i] * double_PI / localStageSize); } localStageSum += localStageSize; } localStageSize *= axis->specializationConstants.stageRadix[i]; } } if (axis_upload_id > 0) { for (uint64_t i = 0; i < axis->specializationConstants.stageStartSize; i++) { for (uint64_t j = 0; j < axis->specializationConstants.fftDim; j++) { double angle = 2 * double_PI * ((i * j) / (double)(axis->specializationConstants.stageStartSize * axis->specializationConstants.fftDim)); tempLUT[maxStageSum * 2 + 2 * (i + j * axis->specializationConstants.stageStartSize)] = (float)cos(angle); tempLUT[maxStageSum * 2 + 2 * (i + j * axis->specializationConstants.stageStartSize) + 1] = (float)sin(angle); } } } if ((app->configuration.performDCT == 2) || (app->configuration.performDCT == 3)) { for (uint64_t j = 0; j < app->configuration.size[axis_id] / 2 + 2; j++) { double angle = (double_PI / 2.0 / (double)(app->configuration.size[axis_id])) * j; tempLUT[2 * axis->specializationConstants.startDCT3LUT + 2 * j] = (float)cos(angle); tempLUT[2 * axis->specializationConstants.startDCT3LUT + 2 * j + 1] = (float)sin(angle); } } if ((app->configuration.performDCT == 4) && 
(app->configuration.size[axis_id] % 2 == 0)) { for (uint64_t j = 0; j < app->configuration.size[axis_id] / 4 + 2; j++) { double angle = (double_PI / 2.0 / (double)(app->configuration.size[axis_id] / 2)) * j; tempLUT[2 * axis->specializationConstants.startDCT3LUT + 2 * j] = (float)cos(angle); tempLUT[2 * axis->specializationConstants.startDCT3LUT + 2 * j + 1] = (float)sin(angle); } for (uint64_t j = 0; j < app->configuration.size[axis_id] / 2; j++) { double angle = (-double_PI / 8.0 / (double)(app->configuration.size[axis_id] / 2)) * (2 * j + 1); tempLUT[2 * axis->specializationConstants.startDCT4LUT + 2 * j] = (float)cos(angle); tempLUT[2 * axis->specializationConstants.startDCT4LUT + 2 * j + 1] = (float)sin(angle); } } axis->referenceLUT = 0; if (reverseBluesteinMultiUpload == 1) { axis->bufferLUT = FFTPlan->axes[axis_id][axis_upload_id].bufferLUT; #if(VKFFT_BACKEND==0) axis->bufferLUTDeviceMemory = FFTPlan->axes[axis_id][axis_upload_id].bufferLUTDeviceMemory; #endif axis->bufferLUTSize = FFTPlan->axes[axis_id][axis_upload_id].bufferLUTSize; axis->referenceLUT = 1; } else { if ((!inverse) && (!app->configuration.makeForwardPlanOnly)) { axis->bufferLUT = app->localFFTPlan_inverse->axes[axis_id][axis_upload_id].bufferLUT; #if(VKFFT_BACKEND==0) axis->bufferLUTDeviceMemory = app->localFFTPlan_inverse->axes[axis_id][axis_upload_id].bufferLUTDeviceMemory; #endif axis->bufferLUTSize = app->localFFTPlan_inverse->axes[axis_id][axis_upload_id].bufferLUTSize; axis->referenceLUT = 1; } else { if (((axis_id == 1) || (axis_id == 2)) && (!((!axis->specializationConstants.reorderFourStep) && (FFTPlan->numAxisUploads[axis_id] > 1))) && ((axis->specializationConstants.fft_dim_full == FFTPlan->axes[0][0].specializationConstants.fft_dim_full) && (FFTPlan->numAxisUploads[axis_id] == 1) && (axis->specializationConstants.fft_dim_full < maxSingleSizeStrided / axis->specializationConstants.registerBoost)) && ((!app->configuration.performDCT) || (app->configuration.size[axis_id] == 
app->configuration.size[0]))) { axis->bufferLUT = FFTPlan->axes[0][axis_upload_id].bufferLUT; #if(VKFFT_BACKEND==0) axis->bufferLUTDeviceMemory = FFTPlan->axes[0][axis_upload_id].bufferLUTDeviceMemory; #endif axis->bufferLUTSize = FFTPlan->axes[0][axis_upload_id].bufferLUTSize; axis->referenceLUT = 1; } else { if ((axis_id == 2) && (axis->specializationConstants.fft_dim_full == FFTPlan->axes[1][0].specializationConstants.fft_dim_full) && ((!app->configuration.performDCT) || (app->configuration.size[2] == app->configuration.size[1]))) { axis->bufferLUT = FFTPlan->axes[1][axis_upload_id].bufferLUT; #if(VKFFT_BACKEND==0) axis->bufferLUTDeviceMemory = FFTPlan->axes[1][axis_upload_id].bufferLUTDeviceMemory; #endif axis->bufferLUTSize = FFTPlan->axes[1][axis_upload_id].bufferLUTSize; axis->referenceLUT = 1; } else { #if(VKFFT_BACKEND==0) resFFT = allocateFFTBuffer(app, &axis->bufferLUT, &axis->bufferLUTDeviceMemory, VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT, VK_MEMORY_HEAP_DEVICE_LOCAL_BIT, axis->bufferLUTSize); if (resFFT != VKFFT_SUCCESS) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return resFFT; } resFFT = transferDataFromCPU(app, tempLUT, &axis->bufferLUT, axis->bufferLUTSize); if (resFFT != VKFFT_SUCCESS) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return resFFT; } #elif(VKFFT_BACKEND==1) res = cudaMalloc((void**)&axis->bufferLUT, axis->bufferLUTSize); if (res != cudaSuccess) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return VKFFT_ERROR_FAILED_TO_ALLOCATE; } res = cudaMemcpy(axis->bufferLUT, tempLUT, axis->bufferLUTSize, cudaMemcpyHostToDevice); if (res != cudaSuccess) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return VKFFT_ERROR_FAILED_TO_ALLOCATE; } #elif(VKFFT_BACKEND==2) res = hipMalloc((void**)&axis->bufferLUT, axis->bufferLUTSize); if (res != hipSuccess) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return VKFFT_ERROR_FAILED_TO_ALLOCATE; } res = hipMemcpy(axis->bufferLUT, 
tempLUT, axis->bufferLUTSize, hipMemcpyHostToDevice); if (res != hipSuccess) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return VKFFT_ERROR_FAILED_TO_ALLOCATE; } #elif(VKFFT_BACKEND==3) axis->bufferLUT = clCreateBuffer(app->configuration.context[0], CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, axis->bufferLUTSize, tempLUT, &res); if (res != CL_SUCCESS) { deleteVkFFT(app); free(tempLUT); tempLUT = 0; return VKFFT_ERROR_FAILED_TO_ALLOCATE; } #endif } } } } free(tempLUT); tempLUT = 0; } } //configure strides uint64_t* axisStride = axis->specializationConstants.inputStride; uint64_t* usedStride = app->configuration.bufferStride; if ((!inverse) && (axis_id == app->firstAxis) && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (app->configuration.isInputFormatted)) usedStride = app->configuration.inputBufferStride; if ((inverse) && (axis_id == app->lastAxis) && (((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (axis->specializationConstants.reorderFourStep)) || ((axis_upload_id == 0) && (!axis->specializationConstants.reorderFourStep))) && (app->configuration.isInputFormatted) && (!app->configuration.inverseReturnToInputBuffer)) usedStride = app->configuration.inputBufferStride; axisStride[0] = 1; if (axis_id == 0) { axisStride[1] = usedStride[0]; axisStride[2] = usedStride[1]; } if (axis_id == 1) { axisStride[1] = usedStride[0]; axisStride[2] = usedStride[1]; } if (axis_id == 2) { axisStride[1] = usedStride[1]; axisStride[2] = usedStride[0]; } axisStride[3] = usedStride[2]; axisStride[4] = axisStride[3] * app->configuration.coordinateFeatures; if (app->useBluesteinFFT[axis_id] && (FFTPlan->numAxisUploads[axis_id] > 1) && (!((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (reverseBluesteinMultiUpload == 0)))) { axisStride[0] = 1; if (axis_id == 0) { axisStride[1] = FFTPlan->actualFFTSizePerAxis[axis_id][0]; axisStride[2] = FFTPlan->actualFFTSizePerAxis[axis_id][0] * FFTPlan->actualFFTSizePerAxis[axis_id][1]; } if (axis_id == 1) { 
axisStride[1] = FFTPlan->actualFFTSizePerAxis[axis_id][0]; axisStride[2] = FFTPlan->actualFFTSizePerAxis[axis_id][0] * FFTPlan->actualFFTSizePerAxis[axis_id][1]; } if (axis_id == 2) { axisStride[1] = FFTPlan->actualFFTSizePerAxis[axis_id][0] * FFTPlan->actualFFTSizePerAxis[axis_id][1]; axisStride[2] = FFTPlan->actualFFTSizePerAxis[axis_id][0]; } axisStride[3] = axisStride[2] * FFTPlan->actualFFTSizePerAxis[axis_id][2]; axisStride[4] = axisStride[3] * app->configuration.coordinateFeatures; } if ((!inverse) && (axis_id == 0) && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (reverseBluesteinMultiUpload == 0) && (axis->specializationConstants.performR2C || FFTPlan->multiUploadR2C) && (!(app->configuration.isInputFormatted))) { axisStride[1] *= 2; axisStride[2] *= 2; axisStride[3] *= 2; axisStride[4] *= 2; } if ((FFTPlan->multiUploadR2C) && (!inverse) && (axis_id == 0) && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (reverseBluesteinMultiUpload == 0)) { for (uint64_t i = 1; i < 5; i++) { axisStride[i] /= 2; } } axisStride = axis->specializationConstants.outputStride; usedStride = app->configuration.bufferStride; if ((!inverse) && (axis_id == app->lastAxis) && (axis_upload_id == 0) && (app->configuration.isOutputFormatted)) usedStride = app->configuration.outputBufferStride; if ((inverse) && (axis_id == app->firstAxis) && (((axis_upload_id == 0) && (axis->specializationConstants.reorderFourStep)) || ((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (!axis->specializationConstants.reorderFourStep))) && ((app->configuration.isOutputFormatted))) usedStride = app->configuration.outputBufferStride; if ((inverse) && (axis_id == app->firstAxis) && (((axis_upload_id == 0) && (app->configuration.isInputFormatted)) || ((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (!axis->specializationConstants.reorderFourStep))) && (app->configuration.inverseReturnToInputBuffer)) usedStride = app->configuration.inputBufferStride; 
	// Fill the output strides from usedStride; for axis 2 the first two buffer strides are swapped.
axisStride[0] = 1; if (axis_id == 0) { axisStride[1] = usedStride[0]; axisStride[2] = usedStride[1]; } if (axis_id == 1) { axisStride[1] = usedStride[0]; axisStride[2] = usedStride[1]; } if (axis_id == 2) { axisStride[1] = usedStride[1]; axisStride[2] = usedStride[0]; } axisStride[3] = usedStride[2]; axisStride[4] = axisStride[3] * app->configuration.coordinateFeatures; if (app->useBluesteinFFT[axis_id] && (FFTPlan->numAxisUploads[axis_id] > 1) && (!((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (reverseBluesteinMultiUpload == 1)))) { axisStride[0] = 1; if (axis_id == 0) { axisStride[1] = FFTPlan->actualFFTSizePerAxis[axis_id][0]; axisStride[2] = FFTPlan->actualFFTSizePerAxis[axis_id][0] * FFTPlan->actualFFTSizePerAxis[axis_id][1]; } if (axis_id == 1) { axisStride[1] = FFTPlan->actualFFTSizePerAxis[axis_id][0]; axisStride[2] = FFTPlan->actualFFTSizePerAxis[axis_id][0] * FFTPlan->actualFFTSizePerAxis[axis_id][1]; } if (axis_id == 2) { axisStride[1] = FFTPlan->actualFFTSizePerAxis[axis_id][0] * FFTPlan->actualFFTSizePerAxis[axis_id][1]; axisStride[2] = FFTPlan->actualFFTSizePerAxis[axis_id][0]; } axisStride[3] = axisStride[2] * FFTPlan->actualFFTSizePerAxis[axis_id][2]; axisStride[4] = axisStride[3] * app->configuration.coordinateFeatures; } if ((inverse) && (axis_id == 0) && (((!app->useBluesteinFFT[axis_id]) && (axis_upload_id == 0)) || ((app->useBluesteinFFT[axis_id]) && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && ((reverseBluesteinMultiUpload == 1) || (FFTPlan->numAxisUploads[axis_id] == 1)))) && (axis->specializationConstants.performR2C || FFTPlan->multiUploadR2C) && (!((app->configuration.isInputFormatted) && (app->configuration.inverseReturnToInputBuffer))) && (!app->configuration.isOutputFormatted)) { axisStride[1] *= 2; axisStride[2] *= 2; axisStride[3] *= 2; axisStride[4] *= 2; } if ((FFTPlan->multiUploadR2C) && (inverse) && (axis_id == 0) && (((!app->useBluesteinFFT[axis_id]) && (axis_upload_id == 0)) || 
((app->useBluesteinFFT[axis_id]) && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && ((reverseBluesteinMultiUpload == 1) || (FFTPlan->numAxisUploads[axis_id] == 1))))) { for (uint64_t i = 1; i < 5; i++) { axisStride[i] /= 2; } } /*axis->specializationConstants.inputStride[3] = (app->configuration.coordinateFeatures == 1) ? 0 : axis->specializationConstants.inputStride[3]; axis->specializationConstants.outputStride[3] = (app->configuration.coordinateFeatures == 1) ? 0 : axis->specializationConstants.outputStride[3]; axis->specializationConstants.inputStride[4] = ((app->configuration.numberBatches == 1) && (app->configuration.numberKernels == 1)) ? 0 : axis->specializationConstants.inputStride[3] * app->configuration.coordinateFeatures; axis->specializationConstants.outputStride[4] = ((app->configuration.numberBatches == 1) && (app->configuration.numberKernels == 1)) ? 0 : axis->specializationConstants.outputStride[3] * app->configuration.coordinateFeatures; */ uint64_t storageComplexSize; if (app->configuration.doublePrecision) storageComplexSize = (2 * sizeof(double)); else if (app->configuration.halfPrecision) storageComplexSize = (2 * 2); else storageComplexSize = (2 * sizeof(float)); uint64_t initPageSize = -1; uint64_t locBufferNum = 1; uint64_t locBufferSize = -1; /*for (uint64_t i = 0; i < app->configuration.bufferNum; i++) { initPageSize += app->configuration.bufferSize[i]; }*/ /*if (app->configuration.performConvolution) { uint64_t initPageSizeKernel = 0; for (uint64_t i = 0; i < app->configuration.kernelNum; i++) { initPageSizeKernel += app->configuration.kernelSize[i]; } if (initPageSizeKernel > initPageSize) initPageSize = initPageSizeKernel; } if (axis_id == 0) { if ((!((!axis->specializationConstants.reorderFourStep) && (axis_upload_id == 0))) && (axis->specializationConstants.inputStride[1] * storageComplexSize > app->configuration.devicePageSize * 1024) && (app->configuration.devicePageSize > 0)) { initPageSize = 
app->configuration.localPageSize * 1024; } } if (axis_id == 1) { if ((app->configuration.bufferStride[1] * storageComplexSize > app->configuration.devicePageSize * 1024) && (app->configuration.devicePageSize > 0)) { initPageSize = app->configuration.localPageSize * 1024; } } if (axis_id == 2) { if ((app->configuration.bufferStride[2] * storageComplexSize > app->configuration.devicePageSize * 1024) && (app->configuration.devicePageSize > 0)) { initPageSize = app->configuration.localPageSize * 1024; } } */ if ((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (app->configuration.isInputFormatted) && (!axis->specializationConstants.reverseBluesteinMultiUpload) && ( ((axis_id == app->firstAxis) && (!inverse)) || ((axis_id == app->lastAxis) && (inverse) && (!((axis_id == 0) && (axis->specializationConstants.performR2CmultiUpload))) && (!app->configuration.performConvolution) && (!app->configuration.inverseReturnToInputBuffer))) ) { uint64_t totalSize = 0; uint64_t locPageSize = initPageSize; locBufferNum = app->configuration.inputBufferNum; if (app->configuration.inputBufferSize) { locBufferSize = (uint64_t)ceil(app->configuration.inputBufferSize[0] / (double)storageComplexSize); for (uint64_t i = 0; i < app->configuration.inputBufferNum; i++) { totalSize += app->configuration.inputBufferSize[i]; if (app->configuration.inputBufferSize[i] < locPageSize) locPageSize = app->configuration.inputBufferSize[i]; } } axis->specializationConstants.inputBufferBlockSize = (locBufferNum == 1) ? locBufferSize : (uint64_t)ceil(locPageSize / (double)storageComplexSize); axis->specializationConstants.inputBufferBlockNum = (locBufferNum == 1) ? 
1 : (uint64_t)ceil(totalSize / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize)); //if (axis->specializationConstants.inputBufferBlockNum == 1) axis->specializationConstants.inputBufferBlockSize = totalSize / storageComplexSize; } else { if ((axis_upload_id == 0) && (app->configuration.numberKernels > 1) && (inverse) && (!app->configuration.performConvolution)) { uint64_t totalSize = 0; uint64_t locPageSize = initPageSize; locBufferNum = app->configuration.outputBufferNum; if (app->configuration.outputBufferSize) { locBufferSize = (uint64_t)ceil(app->configuration.outputBufferSize[0] / (double)storageComplexSize); for (uint64_t i = 0; i < app->configuration.outputBufferNum; i++) { totalSize += app->configuration.outputBufferSize[i]; if (app->configuration.outputBufferSize[i] < locPageSize) locPageSize = app->configuration.outputBufferSize[i]; } } axis->specializationConstants.inputBufferBlockSize = (locBufferNum == 1) ? locBufferSize : (uint64_t)ceil(locPageSize / (double)storageComplexSize); axis->specializationConstants.inputBufferBlockNum = (locBufferNum == 1) ? 
1 : (uint64_t)ceil(totalSize / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize)); //if (axis->specializationConstants.inputBufferBlockNum == 1) axis->specializationConstants.outputBufferBlockSize = totalSize / storageComplexSize; } else { uint64_t totalSize = 0; uint64_t locPageSize = initPageSize; if (((axis->specializationConstants.reorderFourStep == 1) || (app->useBluesteinFFT[axis_id])) && (FFTPlan->numAxisUploads[axis_id] > 1)) { if (((axis->specializationConstants.reorderFourStep == 1) && (axis_upload_id > 0)) || (app->useBluesteinFFT[axis_id] && (reverseBluesteinMultiUpload == 0) && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1))) { locBufferNum = app->configuration.bufferNum; if (app->configuration.bufferSize) { locBufferSize = (uint64_t)ceil(app->configuration.bufferSize[0] / (double)storageComplexSize); for (uint64_t i = 0; i < app->configuration.bufferNum; i++) { totalSize += app->configuration.bufferSize[i]; if (app->configuration.bufferSize[i] < locPageSize) locPageSize = app->configuration.bufferSize[i]; } } } else { locBufferNum = app->configuration.tempBufferNum; if (app->configuration.tempBufferSize) { locBufferSize = (uint64_t)ceil(app->configuration.tempBufferSize[0] / (double)storageComplexSize); for (uint64_t i = 0; i < app->configuration.tempBufferNum; i++) { totalSize += app->configuration.tempBufferSize[i]; if (app->configuration.tempBufferSize[i] < locPageSize) locPageSize = app->configuration.tempBufferSize[i]; } } } } else { locBufferNum = app->configuration.bufferNum; if (app->configuration.bufferSize) { locBufferSize = (uint64_t)ceil(app->configuration.bufferSize[0] / (double)storageComplexSize); for (uint64_t i = 0; i < app->configuration.bufferNum; i++) { totalSize += app->configuration.bufferSize[i]; if (app->configuration.bufferSize[i] < locPageSize) locPageSize = app->configuration.bufferSize[i]; } } } axis->specializationConstants.inputBufferBlockSize = (locBufferNum == 1) ? 
locBufferSize : (uint64_t)ceil(locPageSize / (double)storageComplexSize); axis->specializationConstants.inputBufferBlockNum = (locBufferNum == 1) ? 1 : (uint64_t)ceil(totalSize / (double)(axis->specializationConstants.inputBufferBlockSize * storageComplexSize)); //if (axis->specializationConstants.inputBufferBlockNum == 1) axis->specializationConstants.inputBufferBlockSize = totalSize / storageComplexSize; } } initPageSize = -1; locBufferNum = 1; locBufferSize = -1; if (((axis_upload_id == 0) && (!app->useBluesteinFFT[axis_id]) && (app->configuration.isOutputFormatted && ( ((axis_id == app->firstAxis) && (inverse)) || ((axis_id == app->lastAxis) && (!inverse) && (!app->configuration.performConvolution)) || ((axis_id == app->firstAxis) && (app->configuration.performConvolution) && (app->configuration.FFTdim == 1))) )) || ((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (app->useBluesteinFFT[axis_id]) && (axis->specializationConstants.reverseBluesteinMultiUpload || (FFTPlan->numAxisUploads[axis_id] == 1)) && (app->configuration.isOutputFormatted && ( ((axis_id == app->firstAxis) && (inverse)) || ((axis_id == app->lastAxis) && (!inverse) && (!app->configuration.performConvolution))) )) || ((app->configuration.numberKernels > 1) && ( (inverse) || (axis_id == app->lastAxis))) ) { uint64_t totalSize = 0; uint64_t locPageSize = initPageSize; locBufferNum = app->configuration.outputBufferNum; if (app->configuration.outputBufferSize) { locBufferSize = (uint64_t)ceil(app->configuration.outputBufferSize[0] / (double)storageComplexSize); for (uint64_t i = 0; i < app->configuration.outputBufferNum; i++) { totalSize += app->configuration.outputBufferSize[i]; if (app->configuration.outputBufferSize[i] < locPageSize) locPageSize = app->configuration.outputBufferSize[i]; } } axis->specializationConstants.outputBufferBlockSize = (locBufferNum == 1) ? 
locBufferSize : (uint64_t)ceil(locPageSize / (double)storageComplexSize); axis->specializationConstants.outputBufferBlockNum = (locBufferNum == 1) ? 1 : (uint64_t)ceil(totalSize / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); //if (axis->specializationConstants.outputBufferBlockNum == 1) axis->specializationConstants.outputBufferBlockSize = totalSize / storageComplexSize; } else { uint64_t totalSize = 0; uint64_t locPageSize = initPageSize; if (((axis->specializationConstants.reorderFourStep == 1) || (app->useBluesteinFFT[axis_id])) && (FFTPlan->numAxisUploads[axis_id] > 1)) { if (((axis->specializationConstants.reorderFourStep == 1) && (axis_upload_id == 1)) || (app->useBluesteinFFT[axis_id] && (!((axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (axis->specializationConstants.reverseBluesteinMultiUpload == 1))))) { locBufferNum = app->configuration.tempBufferNum; if (app->configuration.tempBufferSize) { locBufferSize = (uint64_t)ceil(app->configuration.tempBufferSize[0] / (double)storageComplexSize); for (uint64_t i = 0; i < app->configuration.tempBufferNum; i++) { totalSize += app->configuration.tempBufferSize[i]; if (app->configuration.tempBufferSize[i] < locPageSize) locPageSize = app->configuration.tempBufferSize[i]; } } } else { locBufferNum = app->configuration.bufferNum; if (app->configuration.bufferSize) { locBufferSize = (uint64_t)ceil(app->configuration.bufferSize[0] / (double)storageComplexSize); for (uint64_t i = 0; i < app->configuration.bufferNum; i++) { totalSize += app->configuration.bufferSize[i]; if (app->configuration.bufferSize[i] < locPageSize) locPageSize = app->configuration.bufferSize[i]; } } } } else { locBufferNum = app->configuration.bufferNum; if (app->configuration.bufferSize) { locBufferSize = (uint64_t)ceil(app->configuration.bufferSize[0] / (double)storageComplexSize); for (uint64_t i = 0; i < app->configuration.bufferNum; i++) { totalSize += app->configuration.bufferSize[i]; if 
(app->configuration.bufferSize[i] < locPageSize) locPageSize = app->configuration.bufferSize[i]; } } } axis->specializationConstants.outputBufferBlockSize = (locBufferNum == 1) ? locBufferSize : (uint64_t)ceil(locPageSize / (double)storageComplexSize); axis->specializationConstants.outputBufferBlockNum = (locBufferNum == 1) ? 1 : (uint64_t)ceil(totalSize / (double)(axis->specializationConstants.outputBufferBlockSize * storageComplexSize)); //if (axis->specializationConstants.outputBufferBlockNum == 1) axis->specializationConstants.outputBufferBlockSize = totalSize / storageComplexSize; } if (axis->specializationConstants.inputBufferBlockNum == 0) axis->specializationConstants.inputBufferBlockNum = 1; if (axis->specializationConstants.outputBufferBlockNum == 0) axis->specializationConstants.outputBufferBlockNum = 1; if (app->configuration.performConvolution) { uint64_t totalSize = 0; uint64_t locPageSize = initPageSize; locBufferNum = app->configuration.kernelNum; if (app->configuration.kernelSize) { locBufferSize = (uint64_t)ceil(app->configuration.kernelSize[0] / (double)storageComplexSize); for (uint64_t i = 0; i < app->configuration.kernelNum; i++) { totalSize += app->configuration.kernelSize[i]; if (app->configuration.kernelSize[i] < locPageSize) locPageSize = app->configuration.kernelSize[i]; } } axis->specializationConstants.kernelBlockSize = (locBufferNum == 1) ? locBufferSize : (uint64_t)ceil(locPageSize / (double)storageComplexSize); axis->specializationConstants.kernelBlockNum = (locBufferNum == 1) ? 
1 : (uint64_t)ceil(totalSize / (double)(axis->specializationConstants.kernelBlockSize * storageComplexSize)); //if (axis->specializationConstants.kernelBlockNum == 1) axis->specializationConstants.inputBufferBlockSize = totalSize / storageComplexSize; if (axis->specializationConstants.kernelBlockNum == 0) axis->specializationConstants.kernelBlockNum = 1; } else { axis->specializationConstants.kernelBlockSize = 0; axis->specializationConstants.kernelBlockNum = 0; } axis->numBindings = 2; axis->specializationConstants.numBuffersBound[0] = axis->specializationConstants.inputBufferBlockNum; axis->specializationConstants.numBuffersBound[1] = axis->specializationConstants.outputBufferBlockNum; axis->specializationConstants.numBuffersBound[2] = 0; axis->specializationConstants.numBuffersBound[3] = 0; #if(VKFFT_BACKEND==0) VkDescriptorPoolSize descriptorPoolSize = { VK_DESCRIPTOR_TYPE_STORAGE_BUFFER }; descriptorPoolSize.descriptorCount = (uint32_t)(axis->specializationConstants.inputBufferBlockNum + axis->specializationConstants.outputBufferBlockNum); #endif axis->specializationConstants.convolutionBindingID = -1; if ((axis_id == 0) && (axis_upload_id == 0) && (app->configuration.FFTdim == 1) && (app->configuration.performConvolution)) { axis->specializationConstants.convolutionBindingID = axis->numBindings; axis->specializationConstants.numBuffersBound[axis->numBindings] = axis->specializationConstants.kernelBlockNum; #if(VKFFT_BACKEND==0) descriptorPoolSize.descriptorCount += (uint32_t)axis->specializationConstants.kernelBlockNum; #endif axis->numBindings++; } if ((axis_id == 1) && (axis_upload_id == 0) && (app->configuration.FFTdim == 2) && (app->configuration.performConvolution)) { axis->specializationConstants.convolutionBindingID = axis->numBindings; axis->specializationConstants.numBuffersBound[axis->numBindings] = axis->specializationConstants.kernelBlockNum; #if(VKFFT_BACKEND==0) descriptorPoolSize.descriptorCount += 
(uint32_t)axis->specializationConstants.kernelBlockNum; #endif axis->numBindings++; } if ((axis_id == 2) && (axis_upload_id == 0) && (app->configuration.FFTdim == 3) && (app->configuration.performConvolution)) { axis->specializationConstants.convolutionBindingID = axis->numBindings; axis->specializationConstants.numBuffersBound[axis->numBindings] = axis->specializationConstants.kernelBlockNum; #if(VKFFT_BACKEND==0) descriptorPoolSize.descriptorCount += (uint32_t)axis->specializationConstants.kernelBlockNum; #endif axis->numBindings++; } if (app->configuration.useLUT) { axis->specializationConstants.LUTBindingID = axis->numBindings; axis->specializationConstants.numBuffersBound[axis->numBindings] = 1; #if(VKFFT_BACKEND==0) descriptorPoolSize.descriptorCount++; #endif axis->numBindings++; } if ((app->useBluesteinFFT[axis_id]) && (axis_upload_id == 0)) { if (axis->specializationConstants.inverseBluestein) axis->bufferBluesteinFFT = &app->bufferBluesteinIFFT[axis_id]; else axis->bufferBluesteinFFT = &app->bufferBluesteinFFT[axis_id]; axis->specializationConstants.BluesteinConvolutionBindingID = axis->numBindings; axis->specializationConstants.numBuffersBound[axis->numBindings] = 1; #if(VKFFT_BACKEND==0) descriptorPoolSize.descriptorCount++; #endif axis->numBindings++; } if ((app->useBluesteinFFT[axis_id]) && (axis_upload_id == (FFTPlan->numAxisUploads[axis_id] - 1))) { axis->bufferBluestein = &app->bufferBluestein[axis_id]; axis->specializationConstants.BluesteinMultiplicationBindingID = axis->numBindings; axis->specializationConstants.numBuffersBound[axis->numBindings] = 1; #if(VKFFT_BACKEND==0) descriptorPoolSize.descriptorCount++; #endif axis->numBindings++; } #if(VKFFT_BACKEND==0) VkDescriptorPoolCreateInfo descriptorPoolCreateInfo = { VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO }; descriptorPoolCreateInfo.poolSizeCount = 1; descriptorPoolCreateInfo.pPoolSizes = &descriptorPoolSize; descriptorPoolCreateInfo.maxSets = 1; res = 
vkCreateDescriptorPool(app->configuration.device[0], &descriptorPoolCreateInfo, 0, &axis->descriptorPool); if (res != VK_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_DESCRIPTOR_POOL; } const VkDescriptorType descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER; VkDescriptorSetLayoutBinding* descriptorSetLayoutBindings; descriptorSetLayoutBindings = (VkDescriptorSetLayoutBinding*)malloc(axis->numBindings * sizeof(VkDescriptorSetLayoutBinding)); if (!descriptorSetLayoutBindings) { deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } for (uint64_t i = 0; i < axis->numBindings; ++i) { descriptorSetLayoutBindings[i].binding = (uint32_t)i; descriptorSetLayoutBindings[i].descriptorType = descriptorType; descriptorSetLayoutBindings[i].descriptorCount = (uint32_t)axis->specializationConstants.numBuffersBound[i]; descriptorSetLayoutBindings[i].stageFlags = VK_SHADER_STAGE_COMPUTE_BIT; } VkDescriptorSetLayoutCreateInfo descriptorSetLayoutCreateInfo = { VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO }; descriptorSetLayoutCreateInfo.bindingCount = (uint32_t)axis->numBindings; descriptorSetLayoutCreateInfo.pBindings = descriptorSetLayoutBindings; res = vkCreateDescriptorSetLayout(app->configuration.device[0], &descriptorSetLayoutCreateInfo, 0, &axis->descriptorSetLayout); if (res != VK_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_DESCRIPTOR_SET_LAYOUT; } free(descriptorSetLayoutBindings); descriptorSetLayoutBindings = 0; VkDescriptorSetAllocateInfo descriptorSetAllocateInfo = { VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO }; descriptorSetAllocateInfo.descriptorPool = axis->descriptorPool; descriptorSetAllocateInfo.descriptorSetCount = 1; descriptorSetAllocateInfo.pSetLayouts = &axis->descriptorSetLayout; res = vkAllocateDescriptorSets(app->configuration.device[0], &descriptorSetAllocateInfo, &axis->descriptorSet); if (res != VK_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_ALLOCATE_DESCRIPTOR_SETS; } #endif if 
(app->configuration.specifyOffsetsAtLaunch) { axis->specializationConstants.performPostCompilationInputOffset = 1; axis->specializationConstants.performPostCompilationOutputOffset = 1; if (app->configuration.performConvolution) axis->specializationConstants.performPostCompilationKernelOffset = 1; } resFFT = VkFFTCheckUpdateBufferSet(app, axis, 1, 0); if (resFFT != VKFFT_SUCCESS) { deleteVkFFT(app); return resFFT; } resFFT = VkFFTUpdateBufferSet(app, FFTPlan, axis, axis_id, axis_upload_id, inverse); if (resFFT != VKFFT_SUCCESS) { deleteVkFFT(app); return resFFT; } { uint64_t maxBatchCoalesced = app->configuration.coalescedMemory / complexSize; axis->groupedBatch = maxBatchCoalesced; /*if ((FFTPlan->actualFFTSizePerAxis[axis_id][0] < 4096) && (FFTPlan->actualFFTSizePerAxis[axis_id][1] < 512) && (FFTPlan->actualFFTSizePerAxis[axis_id][2] == 1)) { if (app->configuration.sharedMemorySize / axis->specializationConstants.fftDim >= app->configuration.coalescedMemory) { if (1024 / axis->specializationConstants.fftDim < maxSequenceLengthSharedMemory / axis->specializationConstants.fftDim) { if (1024 / axis->specializationConstants.fftDim > axis->groupedBatch) axis->groupedBatch = 1024 / axis->specializationConstants.fftDim; else axis->groupedBatch = maxSequenceLengthSharedMemory / axis->specializationConstants.fftDim; } } } else { axis->groupedBatch = (app->configuration.sharedMemorySize / axis->specializationConstants.fftDim >= app->configuration.coalescedMemory) ? maxSequenceLengthSharedMemory / axis->specializationConstants.fftDim : axis->groupedBatch; }*/ //if (axis->groupedBatch * (uint64_t)ceil(axis->specializationConstants.fftDim / 8.0) < app->configuration.warpSize) axis->groupedBatch = app->configuration.warpSize / (uint64_t)ceil(axis->specializationConstants.fftDim / 8.0); //axis->groupedBatch = (app->configuration.sharedMemorySize / axis->specializationConstants.fftDim >= app->configuration.coalescedMemory) ? 
maxSequenceLengthSharedMemory / axis->specializationConstants.fftDim : axis->groupedBatch; if (((FFTPlan->numAxisUploads[axis_id] == 1) && (axis_id == 0)) || ((axis_id == 0) && (!axis->specializationConstants.reorderFourStep) && (axis_upload_id == 0))) { axis->groupedBatch = (maxSequenceLengthSharedMemoryPow2 / axis->specializationConstants.fftDim > axis->groupedBatch) ? maxSequenceLengthSharedMemoryPow2 / axis->specializationConstants.fftDim : axis->groupedBatch; } else { axis->groupedBatch = (maxSingleSizeStridedPow2 / axis->specializationConstants.fftDim > 1) ? maxSingleSizeStridedPow2 / axis->specializationConstants.fftDim * axis->groupedBatch : axis->groupedBatch; } //axis->groupedBatch = 8; //shared memory bank conflict resolve //#if(VKFFT_BACKEND!=2)//for some reason, hip doesn't get performance increase from having variable shared memory strides. if ((FFTPlan->numAxisUploads[axis_id] == 2) && (axis_upload_id == 0) && (axis->specializationConstants.fftDim * maxBatchCoalesced <= maxSequenceLengthSharedMemory)) { axis->groupedBatch = (uint64_t)ceil(axis->groupedBatch / 2.0); } //#endif if ((FFTPlan->numAxisUploads[axis_id] == 3) && (axis_upload_id == 0) && (axis->specializationConstants.fftDim < maxSequenceLengthSharedMemory / (2 * complexSize))) { axis->groupedBatch = (uint64_t)ceil(axis->groupedBatch / 2.0); } if (axis->groupedBatch < maxBatchCoalesced) axis->groupedBatch = maxBatchCoalesced; axis->groupedBatch = (axis->groupedBatch / maxBatchCoalesced) * maxBatchCoalesced; //half bandwidth technique if (!((axis_id == 0) && (FFTPlan->numAxisUploads[axis_id] == 1)) && !((axis_id == 0) && (axis_upload_id == 0) && (!axis->specializationConstants.reorderFourStep)) && (axis->specializationConstants.fftDim > maxSingleSizeStrided)) { axis->groupedBatch = (uint64_t)ceil(axis->groupedBatch / 2.0); } if ((app->configuration.halfThreads) && (axis->groupedBatch * axis->specializationConstants.fftDim * complexSize >= app->configuration.sharedMemorySize)) 
axis->groupedBatch = (uint64_t)ceil(axis->groupedBatch / 2.0); if (axis->groupedBatch > app->configuration.warpSize) axis->groupedBatch = (axis->groupedBatch / app->configuration.warpSize) * app->configuration.warpSize; if (axis->groupedBatch > 2 * maxBatchCoalesced) axis->groupedBatch = (axis->groupedBatch / (2 * maxBatchCoalesced)) * (2 * maxBatchCoalesced); if (axis->groupedBatch > 4 * maxBatchCoalesced) axis->groupedBatch = (axis->groupedBatch / (4 * maxBatchCoalesced)) * (2 * maxBatchCoalesced); uint64_t maxThreadNum = maxSequenceLengthSharedMemory / (axis->specializationConstants.min_registers_per_thread * axis->specializationConstants.registerBoost); if (maxThreadNum > app->configuration.maxThreadsNum) maxThreadNum = app->configuration.maxThreadsNum; axis->specializationConstants.axisSwapped = 0; uint64_t r2cmult = (axis->specializationConstants.mergeSequencesR2C) ? 2 : 1; if (axis_id == 0) { if (axis_upload_id == 0) { axis->axisBlock[0] = (axis->specializationConstants.fftDim / axis->specializationConstants.min_registers_per_thread / axis->specializationConstants.registerBoost > 1) ? axis->specializationConstants.fftDim / axis->specializationConstants.min_registers_per_thread / axis->specializationConstants.registerBoost : 1; if (axis->axisBlock[0] > maxThreadNum) axis->axisBlock[0] = maxThreadNum; if (axis->axisBlock[0] > app->configuration.maxComputeWorkGroupSize[0]) axis->axisBlock[0] = app->configuration.maxComputeWorkGroupSize[0]; if (axis->specializationConstants.reorderFourStep && (FFTPlan->numAxisUploads[axis_id] > 1)) axis->axisBlock[1] = axis->groupedBatch; else { //axis->axisBlock[1] = (axis->axisBlock[0] < app->configuration.warpSize) ? app->configuration.warpSize / axis->axisBlock[0] : 1; axis->axisBlock[1] = ((axis->axisBlock[0] < app->configuration.aimThreads) && ((axis->axisBlock[0] < 32) || ((axis->axisBlock[0] & (axis->axisBlock[0] - 1)) != 0))) ? 
app->configuration.aimThreads / axis->axisBlock[0] : 1; } uint64_t currentAxisBlock1 = axis->axisBlock[1]; for (uint64_t i = currentAxisBlock1; i < 2 * currentAxisBlock1; i++) { if (((FFTPlan->numAxisUploads[0] > 1) && (((FFTPlan->actualFFTSizePerAxis[axis_id][0] / axis->specializationConstants.fftDim) % i) == 0)) || ((FFTPlan->numAxisUploads[0] == 1) && (((FFTPlan->actualFFTSizePerAxis[axis_id][1] / r2cmult) % i) == 0))) { if (i * axis->specializationConstants.fftDim * complexSize <= app->configuration.sharedMemorySize) axis->axisBlock[1] = i; i = 2 * currentAxisBlock1; } } if ((FFTPlan->numAxisUploads[0] > 1) && ((uint64_t)ceil(FFTPlan->actualFFTSizePerAxis[axis_id][0] / axis->specializationConstants.fftDim) < axis->axisBlock[1])) axis->axisBlock[1] = (uint64_t)ceil(FFTPlan->actualFFTSizePerAxis[axis_id][0] / axis->specializationConstants.fftDim); if ((axis->specializationConstants.mergeSequencesR2C != 0) && (axis->specializationConstants.fftDim * axis->axisBlock[1] >= maxSequenceLengthSharedMemory)) { axis->specializationConstants.mergeSequencesR2C = 0; /*if ((!inverse) && (axis_id == 0) && (axis_upload_id == 0) && (!(app->configuration.isInputFormatted))) { axis->specializationConstants.inputStride[1] /= 2; axis->specializationConstants.inputStride[2] /= 2; axis->specializationConstants.inputStride[3] /= 2; axis->specializationConstants.inputStride[4] /= 2; } if ((inverse) && (axis_id == 0) && (axis_upload_id == 0) && (!((app->configuration.isInputFormatted) && (app->configuration.inverseReturnToInputBuffer))) && (!app->configuration.isOutputFormatted)) { axis->specializationConstants.outputStride[1] /= 2; axis->specializationConstants.outputStride[2] /= 2; axis->specializationConstants.outputStride[3] /= 2; axis->specializationConstants.outputStride[4] /= 2; }*/ r2cmult = 1; } if ((FFTPlan->numAxisUploads[0] == 1) && ((uint64_t)ceil(FFTPlan->actualFFTSizePerAxis[axis_id][1] / (double)r2cmult) < axis->axisBlock[1])) axis->axisBlock[1] = 
(uint64_t)ceil(FFTPlan->actualFFTSizePerAxis[axis_id][1] / (double)r2cmult); if (axis->axisBlock[1] > app->configuration.maxComputeWorkGroupSize[1]) axis->axisBlock[1] = app->configuration.maxComputeWorkGroupSize[1]; //if (axis->axisBlock[0] * axis->axisBlock[1] > app->configuration.maxThreadsNum) axis->axisBlock[1] /= 2; if (axis->axisBlock[0] * axis->axisBlock[1] > maxThreadNum) { for (uint64_t i = 1; i <= axis->axisBlock[1]; i++) { if ((axis->axisBlock[1] / i) * axis->axisBlock[0] <= maxThreadNum) { axis->axisBlock[1] /= i; i = axis->axisBlock[1] + 1; } } } while ((axis->axisBlock[1] * (axis->specializationConstants.fftDim / axis->specializationConstants.registerBoost)) > maxSequenceLengthSharedMemory) axis->axisBlock[1] /= 2; if (((axis->specializationConstants.fftDim % 2 == 0) || (axis->axisBlock[0] < app->configuration.numSharedBanks / 4)) && (!(((!axis->specializationConstants.reorderFourStep) || (axis->specializationConstants.useBluesteinFFT)) && (FFTPlan->numAxisUploads[0] > 1))) && (axis->axisBlock[1] > 1) && (axis->axisBlock[1] * axis->specializationConstants.fftDim < maxSequenceLengthSharedMemoryPow2) && (!((app->configuration.performZeropadding[0] || app->configuration.performZeropadding[1] || app->configuration.performZeropadding[2])))) { /*#if (VKFFT_BACKEND==0) if (((axis->specializationConstants.fftDim & (axis->specializationConstants.fftDim - 1)) != 0)) { uint64_t temp = axis->axisBlock[1]; axis->axisBlock[1] = axis->axisBlock[0]; axis->axisBlock[0] = temp; axis->specializationConstants.axisSwapped = 1; } #else*/ uint64_t temp = axis->axisBlock[1]; axis->axisBlock[1] = axis->axisBlock[0]; axis->axisBlock[0] = temp; axis->specializationConstants.axisSwapped = 1; //#endif } axis->axisBlock[2] = 1; axis->axisBlock[3] = axis->specializationConstants.fftDim; } else { axis->axisBlock[1] = (axis->specializationConstants.fftDim / axis->specializationConstants.min_registers_per_thread / axis->specializationConstants.registerBoost > 1) ? 
axis->specializationConstants.fftDim / axis->specializationConstants.min_registers_per_thread / axis->specializationConstants.registerBoost : 1; uint64_t scale = app->configuration.aimThreads / axis->axisBlock[1] / axis->groupedBatch; if (scale > 1) axis->groupedBatch *= scale; axis->axisBlock[0] = (axis->specializationConstants.stageStartSize > axis->groupedBatch) ? axis->groupedBatch : axis->specializationConstants.stageStartSize; if (axis->axisBlock[0] > app->configuration.maxComputeWorkGroupSize[0]) axis->axisBlock[0] = app->configuration.maxComputeWorkGroupSize[0]; if (axis->axisBlock[0] * axis->axisBlock[1] > maxThreadNum) { for (uint64_t i = 1; i <= axis->axisBlock[0]; i++) { if ((axis->axisBlock[0] / i) * axis->axisBlock[1] <= maxThreadNum) { axis->axisBlock[0] /= i; i = axis->axisBlock[0] + 1; } } } axis->axisBlock[2] = 1; axis->axisBlock[3] = axis->specializationConstants.fftDim; } } if (axis_id == 1) { axis->axisBlock[1] = (axis->specializationConstants.fftDim / axis->specializationConstants.min_registers_per_thread / axis->specializationConstants.registerBoost > 1) ? axis->specializationConstants.fftDim / axis->specializationConstants.min_registers_per_thread / axis->specializationConstants.registerBoost : 1; axis->axisBlock[0] = (FFTPlan->actualFFTSizePerAxis[axis_id][0] > axis->groupedBatch) ? 
axis->groupedBatch : FFTPlan->actualFFTSizePerAxis[axis_id][0]; if (axis->axisBlock[0] > app->configuration.maxComputeWorkGroupSize[0]) axis->axisBlock[0] = app->configuration.maxComputeWorkGroupSize[0]; if (axis->axisBlock[0] * axis->axisBlock[1] > maxThreadNum) { for (uint64_t i = 1; i <= axis->axisBlock[0]; i++) { if ((axis->axisBlock[0] / i) * axis->axisBlock[1] <= maxThreadNum) { axis->axisBlock[0] /= i; i = axis->axisBlock[0] + 1; } } } axis->axisBlock[2] = 1; axis->axisBlock[3] = axis->specializationConstants.fftDim; } if (axis_id == 2) { axis->axisBlock[1] = (axis->specializationConstants.fftDim / axis->specializationConstants.min_registers_per_thread / axis->specializationConstants.registerBoost > 1) ? axis->specializationConstants.fftDim / axis->specializationConstants.min_registers_per_thread / axis->specializationConstants.registerBoost : 1; axis->axisBlock[0] = (FFTPlan->actualFFTSizePerAxis[axis_id][0] > axis->groupedBatch) ? axis->groupedBatch : FFTPlan->actualFFTSizePerAxis[axis_id][0]; if (axis->axisBlock[0] > app->configuration.maxComputeWorkGroupSize[0]) axis->axisBlock[0] = app->configuration.maxComputeWorkGroupSize[0]; if (axis->axisBlock[0] * axis->axisBlock[1] > maxThreadNum) { for (uint64_t i = 1; i <= axis->axisBlock[0]; i++) { if ((axis->axisBlock[0] / i) * axis->axisBlock[1] <= maxThreadNum) { axis->axisBlock[0] /= i; i = axis->axisBlock[0] + 1; } } } axis->axisBlock[2] = 1; axis->axisBlock[3] = axis->specializationConstants.fftDim; } /*VkSpecializationMapEntry specializationMapEntries[36] = { {} }; for (uint64_t i = 0; i < 36; i++) { specializationMapEntries[i].constantID = i + 1; specializationMapEntries[i].size = sizeof(uint64_t); specializationMapEntries[i].offset = i * sizeof(uint64_t); } VkSpecializationInfo specializationInfo = { 0 }; specializationInfo.dataSize = 36 * sizeof(uint64_t); specializationInfo.mapEntryCount = 36; specializationInfo.pMapEntries = specializationMapEntries;*/ axis->specializationConstants.localSize[0] = 
axis->axisBlock[0]; axis->specializationConstants.localSize[1] = axis->axisBlock[1]; axis->specializationConstants.localSize[2] = axis->axisBlock[2]; //specializationInfo.pData = &axis->specializationConstants; //uint64_t registerBoost = (FFTPlan->numAxisUploads[axis_id] > 1) ? app->configuration.registerBoost4Step : app->configuration.registerBoost; axis->specializationConstants.numCoordinates = (app->configuration.matrixConvolution > 1) ? 1 : app->configuration.coordinateFeatures; axis->specializationConstants.matrixConvolution = app->configuration.matrixConvolution; axis->specializationConstants.numKernels = app->configuration.numberKernels; axis->specializationConstants.sharedMemSize = app->configuration.sharedMemorySize; axis->specializationConstants.sharedMemSizePow2 = app->configuration.sharedMemorySizePow2; axis->specializationConstants.normalize = (reverseBluesteinMultiUpload) ? 1 : app->configuration.normalize; axis->specializationConstants.size[0] = FFTPlan->actualFFTSizePerAxis[axis_id][0]; axis->specializationConstants.size[1] = FFTPlan->actualFFTSizePerAxis[axis_id][1]; axis->specializationConstants.size[2] = FFTPlan->actualFFTSizePerAxis[axis_id][2]; axis->specializationConstants.axis_id = axis_id; axis->specializationConstants.axis_upload_id = axis_upload_id; for (uint64_t i = 0; i < 3; i++) { axis->specializationConstants.frequencyZeropadding = app->configuration.frequencyZeroPadding; axis->specializationConstants.performZeropaddingFull[i] = app->configuration.performZeropadding[i]; // don't read if input is zeropadded (0 - off, 1 - on) axis->specializationConstants.fft_zeropad_left_full[i] = app->configuration.fft_zeropad_left[i]; axis->specializationConstants.fft_zeropad_right_full[i] = app->configuration.fft_zeropad_right[i]; } if (axis->specializationConstants.useBluesteinFFT && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && ((reverseBluesteinMultiUpload == 0) || (FFTPlan->numAxisUploads[axis_id] == 1))) { 
axis->specializationConstants.zeropadBluestein[0] = 1; axis->specializationConstants.fft_zeropad_Bluestein_left_read[axis_id] = app->configuration.size[axis_id]; if ((FFTPlan->multiUploadR2C) && (axis_id == 0)) axis->specializationConstants.fft_zeropad_Bluestein_left_read[axis_id] /= 2; if (app->configuration.performDCT == 1) axis->specializationConstants.fft_zeropad_Bluestein_left_read[axis_id] = 2 * axis->specializationConstants.fft_zeropad_Bluestein_left_read[axis_id] - 2; if ((app->configuration.performDCT == 4) && (app->configuration.size[axis_id] % 2 == 0)) axis->specializationConstants.fft_zeropad_Bluestein_left_read[axis_id] /= 2; axis->specializationConstants.fft_zeropad_Bluestein_right_read[axis_id] = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id]; } if (axis->specializationConstants.useBluesteinFFT && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && ((reverseBluesteinMultiUpload == 1) || (FFTPlan->numAxisUploads[axis_id] == 1))) { axis->specializationConstants.zeropadBluestein[1] = 1; axis->specializationConstants.fft_zeropad_Bluestein_left_write[axis_id] = app->configuration.size[axis_id]; if ((FFTPlan->multiUploadR2C) && (axis_id == 0)) axis->specializationConstants.fft_zeropad_Bluestein_left_write[axis_id] /= 2; if (app->configuration.performDCT == 1) axis->specializationConstants.fft_zeropad_Bluestein_left_write[axis_id] = 2 * axis->specializationConstants.fft_zeropad_Bluestein_left_write[axis_id] - 2; if ((app->configuration.performDCT == 4) && (app->configuration.size[axis_id] % 2 == 0)) axis->specializationConstants.fft_zeropad_Bluestein_left_write[axis_id] /= 2; axis->specializationConstants.fft_zeropad_Bluestein_right_write[axis_id] = FFTPlan->actualFFTSizePerAxis[axis_id][axis_id]; } if ((inverse)) { if ((app->configuration.frequencyZeroPadding) && (((!axis->specializationConstants.reorderFourStep) && (axis_upload_id == 0)) || ((axis->specializationConstants.reorderFourStep) && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 
1)))) { axis->specializationConstants.zeropad[0] = app->configuration.performZeropadding[axis_id]; axis->specializationConstants.fft_zeropad_left_read[axis_id] = app->configuration.fft_zeropad_left[axis_id]; axis->specializationConstants.fft_zeropad_right_read[axis_id] = app->configuration.fft_zeropad_right[axis_id]; } else axis->specializationConstants.zeropad[0] = 0; if ((!app->configuration.frequencyZeroPadding) && (((!axis->specializationConstants.reorderFourStep) && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1)) || ((axis->specializationConstants.reorderFourStep) && (axis_upload_id == 0)))) { axis->specializationConstants.zeropad[1] = app->configuration.performZeropadding[axis_id]; axis->specializationConstants.fft_zeropad_left_write[axis_id] = app->configuration.fft_zeropad_left[axis_id]; axis->specializationConstants.fft_zeropad_right_write[axis_id] = app->configuration.fft_zeropad_right[axis_id]; } else axis->specializationConstants.zeropad[1] = 0; } else { if ((!app->configuration.frequencyZeroPadding) && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1)) { axis->specializationConstants.zeropad[0] = app->configuration.performZeropadding[axis_id]; axis->specializationConstants.fft_zeropad_left_read[axis_id] = app->configuration.fft_zeropad_left[axis_id]; axis->specializationConstants.fft_zeropad_right_read[axis_id] = app->configuration.fft_zeropad_right[axis_id]; } else axis->specializationConstants.zeropad[0] = 0; if (((app->configuration.frequencyZeroPadding) && (axis_upload_id == 0)) || (((app->configuration.FFTdim - 1 == axis_id) && (axis_upload_id == 0) && (app->configuration.performConvolution)))) { axis->specializationConstants.zeropad[1] = app->configuration.performZeropadding[axis_id]; axis->specializationConstants.fft_zeropad_left_write[axis_id] = app->configuration.fft_zeropad_left[axis_id]; axis->specializationConstants.fft_zeropad_right_write[axis_id] = app->configuration.fft_zeropad_right[axis_id]; } else 
axis->specializationConstants.zeropad[1] = 0; } if ((app->configuration.FFTdim - 1 == axis_id) && (axis_upload_id == 0) && (app->configuration.performConvolution)) { axis->specializationConstants.convolutionStep = 1; } else axis->specializationConstants.convolutionStep = 0; if (app->useBluesteinFFT[axis_id] && (axis_upload_id == 0)) axis->specializationConstants.BluesteinConvolutionStep = 1; else axis->specializationConstants.BluesteinConvolutionStep = 0; if (app->useBluesteinFFT[axis_id] && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && (reverseBluesteinMultiUpload == 0)) axis->specializationConstants.BluesteinPreMultiplication = 1; else axis->specializationConstants.BluesteinPreMultiplication = 0; if (app->useBluesteinFFT[axis_id] && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && ((reverseBluesteinMultiUpload == 1) || (FFTPlan->numAxisUploads[axis_id] == 1))) axis->specializationConstants.BluesteinPostMultiplication = 1; else axis->specializationConstants.BluesteinPostMultiplication = 0; uint64_t tempSize[3] = { FFTPlan->actualFFTSizePerAxis[axis_id][0], FFTPlan->actualFFTSizePerAxis[axis_id][1], FFTPlan->actualFFTSizePerAxis[axis_id][2] }; if (axis_id == 0) { if (axis_upload_id == 0) tempSize[0] = FFTPlan->actualFFTSizePerAxis[axis_id][0] / axis->specializationConstants.fftDim / axis->axisBlock[1]; else tempSize[0] = FFTPlan->actualFFTSizePerAxis[axis_id][0] / axis->specializationConstants.fftDim / axis->axisBlock[0]; if ((FFTPlan->actualPerformR2CPerAxis[axis_id] == 1) && (axis->specializationConstants.mergeSequencesR2C)) tempSize[1] = (uint64_t)ceil(tempSize[1] / 2.0); tempSize[2] *= app->configuration.numberKernels * app->configuration.numberBatches; if (!(axis->specializationConstants.convolutionStep && (app->configuration.matrixConvolution > 1))) tempSize[2] *= app->configuration.coordinateFeatures; //if (app->configuration.performZeropadding[1]) tempSize[1] = (uint64_t)ceil(tempSize[1] / 2.0); //if 
(app->configuration.performZeropadding[2]) tempSize[2] = (uint64_t)ceil(tempSize[2] / 2.0); if (tempSize[0] > app->configuration.maxComputeWorkGroupCount[0]) axis->specializationConstants.performWorkGroupShift[0] = 1; else axis->specializationConstants.performWorkGroupShift[0] = 0; if (tempSize[1] > app->configuration.maxComputeWorkGroupCount[1]) axis->specializationConstants.performWorkGroupShift[1] = 1; else axis->specializationConstants.performWorkGroupShift[1] = 0; if (tempSize[2] > app->configuration.maxComputeWorkGroupCount[2]) axis->specializationConstants.performWorkGroupShift[2] = 1; else axis->specializationConstants.performWorkGroupShift[2] = 0; } if (axis_id == 1) { tempSize[0] = (uint64_t)ceil(FFTPlan->actualFFTSizePerAxis[axis_id][0] / (double)axis->axisBlock[0] * FFTPlan->actualFFTSizePerAxis[axis_id][1] / (double)axis->specializationConstants.fftDim); tempSize[1] = 1; tempSize[2] = FFTPlan->actualFFTSizePerAxis[axis_id][2]; tempSize[2] *= app->configuration.numberKernels * app->configuration.numberBatches; if (!(axis->specializationConstants.convolutionStep && (app->configuration.matrixConvolution > 1))) tempSize[2] *= app->configuration.coordinateFeatures; //if (app->configuration.actualPerformR2C == 1) tempSize[0] = (uint64_t)ceil(tempSize[0] / 2.0); //if (app->configuration.performZeropadding[2]) tempSize[2] = (uint64_t)ceil(tempSize[2] / 2.0); if (tempSize[0] > app->configuration.maxComputeWorkGroupCount[0]) axis->specializationConstants.performWorkGroupShift[0] = 1; else axis->specializationConstants.performWorkGroupShift[0] = 0; if (tempSize[1] > app->configuration.maxComputeWorkGroupCount[1]) axis->specializationConstants.performWorkGroupShift[1] = 1; else axis->specializationConstants.performWorkGroupShift[1] = 0; if (tempSize[2] > app->configuration.maxComputeWorkGroupCount[2]) axis->specializationConstants.performWorkGroupShift[2] = 1; else axis->specializationConstants.performWorkGroupShift[2] = 0; } if (axis_id == 2) { tempSize[0] = 
(uint64_t)ceil(FFTPlan->actualFFTSizePerAxis[axis_id][0] / (double)axis->axisBlock[0] * FFTPlan->actualFFTSizePerAxis[axis_id][2] / (double)axis->specializationConstants.fftDim); tempSize[1] = 1; tempSize[2] = FFTPlan->actualFFTSizePerAxis[axis_id][1]; tempSize[2] *= app->configuration.numberKernels * app->configuration.numberBatches; if (!(axis->specializationConstants.convolutionStep && (app->configuration.matrixConvolution > 1))) tempSize[2] *= app->configuration.coordinateFeatures; //if (app->configuration.actualPerformR2C == 1) tempSize[0] = (uint64_t)ceil(tempSize[0] / 2.0); if (tempSize[0] > app->configuration.maxComputeWorkGroupCount[0]) axis->specializationConstants.performWorkGroupShift[0] = 1; else axis->specializationConstants.performWorkGroupShift[0] = 0; if (tempSize[1] > app->configuration.maxComputeWorkGroupCount[1]) axis->specializationConstants.performWorkGroupShift[1] = 1; else axis->specializationConstants.performWorkGroupShift[1] = 0; if (tempSize[2] > app->configuration.maxComputeWorkGroupCount[2]) axis->specializationConstants.performWorkGroupShift[2] = 1; else axis->specializationConstants.performWorkGroupShift[2] = 0; } char floatTypeInputMemory[10]; char floatTypeOutputMemory[10]; char floatTypeKernelMemory[10]; char floatType[10]; axis->specializationConstants.unroll = 1; axis->specializationConstants.LUT = app->configuration.useLUT; if (app->configuration.doublePrecision) { sprintf(floatType, "double"); sprintf(floatTypeInputMemory, "double"); sprintf(floatTypeOutputMemory, "double"); sprintf(floatTypeKernelMemory, "double"); //axis->specializationConstants.unroll = 1; } else { //axis->specializationConstants.unroll = 0; if (app->configuration.halfPrecision) { sprintf(floatType, "float"); if (app->configuration.halfPrecisionMemoryOnly) { //only out of place mode, input/output buffer must be different sprintf(floatTypeKernelMemory, "float"); if ((axis_id == app->firstAxis) && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1) && 
(!axis->specializationConstants.actualInverse)) sprintf(floatTypeInputMemory, "half"); else sprintf(floatTypeInputMemory, "float"); if ((axis_id == app->firstAxis) && (((!axis->specializationConstants.reorderFourStep) && (axis_upload_id == FFTPlan->numAxisUploads[axis_id] - 1)) || ((axis->specializationConstants.reorderFourStep) && (axis_upload_id == 0))) && (axis->specializationConstants.actualInverse)) sprintf(floatTypeOutputMemory, "half"); else sprintf(floatTypeOutputMemory, "float"); } else { sprintf(floatTypeInputMemory, "half"); sprintf(floatTypeOutputMemory, "half"); sprintf(floatTypeKernelMemory, "half"); } } else { if (app->configuration.doublePrecisionFloatMemory) { sprintf(floatType, "double"); sprintf(floatTypeInputMemory, "float"); sprintf(floatTypeOutputMemory, "float"); sprintf(floatTypeKernelMemory, "float"); } else { sprintf(floatType, "float"); sprintf(floatTypeInputMemory, "float"); sprintf(floatTypeOutputMemory, "float"); sprintf(floatTypeKernelMemory, "float"); } } } char uintType[20] = ""; if (!app->configuration.useUint64) { #if(VKFFT_BACKEND==0) sprintf(uintType, "uint"); #elif(VKFFT_BACKEND==1) sprintf(uintType, "unsigned int"); #elif(VKFFT_BACKEND==2) sprintf(uintType, "unsigned int"); #elif(VKFFT_BACKEND==3) sprintf(uintType, "unsigned int"); #endif } else { #if(VKFFT_BACKEND==0) sprintf(uintType, "uint64_t"); #elif(VKFFT_BACKEND==1) sprintf(uintType, "unsigned long long"); #elif(VKFFT_BACKEND==2) sprintf(uintType, "unsigned long long"); #elif(VKFFT_BACKEND==3) sprintf(uintType, "unsigned long"); #endif } { axis->pushConstants.structSize = 0; if (axis->specializationConstants.performWorkGroupShift[0]) { axis->pushConstants.performWorkGroupShift[0] = 1; axis->pushConstants.structSize += 1; } if (axis->specializationConstants.performWorkGroupShift[1]) { axis->pushConstants.performWorkGroupShift[1] = 1; axis->pushConstants.structSize += 1; } if (axis->specializationConstants.performWorkGroupShift[2]) { 
axis->pushConstants.performWorkGroupShift[2] = 1; axis->pushConstants.structSize += 1; } if (axis->specializationConstants.performPostCompilationInputOffset) { axis->pushConstants.performPostCompilationInputOffset = 1; axis->pushConstants.structSize += 1; } if (axis->specializationConstants.performPostCompilationOutputOffset) { axis->pushConstants.performPostCompilationOutputOffset = 1; axis->pushConstants.structSize += 1; } if (axis->specializationConstants.performPostCompilationKernelOffset) { axis->pushConstants.performPostCompilationKernelOffset = 1; axis->pushConstants.structSize += 1; } if (app->configuration.useUint64) axis->pushConstants.structSize *= sizeof(uint64_t); else axis->pushConstants.structSize *= sizeof(uint32_t); axis->specializationConstants.pushConstantsStructSize = axis->pushConstants.structSize; } //uint64_t LUT = app->configuration.useLUT; uint64_t type = 0; if ((axis_id == 0) && (axis_upload_id == 0)) type = 0; if (axis_id != 0) type = 1; if ((axis_id == 0) && (axis_upload_id > 0)) type = 2; //if ((axis->specializationConstants.fftDim == 8 * maxSequenceLengthSharedMemory) && (app->configuration.registerBoost >= 8)) axis->specializationConstants.registerBoost = 8; if ((axis_id == 0) && (!axis->specializationConstants.actualInverse) && (FFTPlan->actualPerformR2CPerAxis[axis_id])) type = 5; if ((axis_id == 0) && (axis->specializationConstants.actualInverse) && (FFTPlan->actualPerformR2CPerAxis[axis_id])) type = 6; if ((axis_id == 0) && (app->configuration.performDCT == 1)) type = 110; if ((axis_id != 0) && (app->configuration.performDCT == 1)) type = 111; if ((axis_id == 0) && (((app->configuration.performDCT == 2) && (!inverse)) || ((app->configuration.performDCT == 3) && (inverse)))) type = 120; if ((axis_id != 0) && (((app->configuration.performDCT == 2) && (!inverse)) || ((app->configuration.performDCT == 3) && (inverse)))) type = 121; if ((axis_id == 0) && (((app->configuration.performDCT == 2) && (inverse)) || 
((app->configuration.performDCT == 3) && (!inverse)))) type = 130; if ((axis_id != 0) && (((app->configuration.performDCT == 2) && (inverse)) || ((app->configuration.performDCT == 3) && (!inverse)))) type = 131; if ((axis_id == 0) && (app->configuration.performDCT == 4) && ((app->configuration.size[axis_id] % 2) == 0)) type = 142; if ((axis_id == 0) && (app->configuration.performDCT == 4) && ((app->configuration.size[axis_id] % 2) == 1)) type = 144; if ((axis_id != 0) && (app->configuration.performDCT == 4) && ((app->configuration.size[axis_id] % 2) == 0)) type = 143; if ((axis_id != 0) && (app->configuration.performDCT == 4) && ((app->configuration.size[axis_id] % 2) == 1)) type = 145; #if(VKFFT_BACKEND==0) axis->specializationConstants.cacheShuffle = ((FFTPlan->numAxisUploads[axis_id] > 1) && ((axis->specializationConstants.fftDim & (axis->specializationConstants.fftDim - 1)) == 0) && (!app->configuration.doublePrecision) && (!axis->specializationConstants.useBluesteinFFT) && (!app->configuration.doublePrecisionFloatMemory) && ((type == 0) || (type == 5) || (type == 6))) ? 
1 : 0; #elif(VKFFT_BACKEND==1) axis->specializationConstants.cacheShuffle = 0; #elif(VKFFT_BACKEND==2) axis->specializationConstants.cacheShuffle = 0; #elif(VKFFT_BACKEND==3) axis->specializationConstants.cacheShuffle = 0; #endif axis->specializationConstants.maxCodeLength = app->configuration.maxCodeLength; axis->specializationConstants.maxTempLength = app->configuration.maxTempLength; axis->specializationConstants.code0 = (char*)malloc(sizeof(char) * app->configuration.maxCodeLength); char* code0 = axis->specializationConstants.code0; if (!code0) { deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } resFFT = shaderGenVkFFT(code0, &axis->specializationConstants, floatType, floatTypeInputMemory, floatTypeOutputMemory, floatTypeKernelMemory, uintType, type); freeShaderGenVkFFT(&axis->specializationConstants); if (resFFT != VKFFT_SUCCESS) { deleteVkFFT(app); return resFFT; } #if(VKFFT_BACKEND==0) uint32_t* code; size_t codeSize; if (app->configuration.loadApplicationFromString) { uint32_t* localStrPointer = (uint32_t*)app->configuration.loadApplicationString + app->currentApplicationStringPos; codeSize = localStrPointer[0]; code = (uint32_t*)malloc(codeSize); if (!code) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } memcpy(code, localStrPointer + 1, codeSize); app->currentApplicationStringPos += codeSize / (sizeof(uint32_t)) + 1; } else { const glslang_resource_t default_resource = { /* .MaxLights = */ 32, /* .MaxClipPlanes = */ 6, /* .MaxTextureUnits = */ 32, /* .MaxTextureCoords = */ 32, /* .MaxVertexAttribs = */ 64, /* .MaxVertexUniformComponents = */ 4096, /* .MaxVaryingFloats = */ 64, /* .MaxVertexTextureImageUnits = */ 32, /* .MaxCombinedTextureImageUnits = */ 80, /* .MaxTextureImageUnits = */ 32, /* .MaxFragmentUniformComponents = */ 4096, /* .MaxDrawBuffers = */ 32, /* .MaxVertexUniformVectors = */ 128, /* .MaxVaryingVectors = */ 8, /* .MaxFragmentUniformVectors = */ 16, /* .MaxVertexOutputVectors = */ 16, /* 
.MaxFragmentInputVectors = */ 15, /* .MinProgramTexelOffset = */ -8, /* .MaxProgramTexelOffset = */ 7, /* .MaxClipDistances = */ 8, /* .MaxComputeWorkGroupCountX = */ 65535, /* .MaxComputeWorkGroupCountY = */ 65535, /* .MaxComputeWorkGroupCountZ = */ 65535, /* .MaxComputeWorkGroupSizeX = */ 1024, /* .MaxComputeWorkGroupSizeY = */ 1024, /* .MaxComputeWorkGroupSizeZ = */ 64, /* .MaxComputeUniformComponents = */ 1024, /* .MaxComputeTextureImageUnits = */ 16, /* .MaxComputeImageUniforms = */ 8, /* .MaxComputeAtomicCounters = */ 8, /* .MaxComputeAtomicCounterBuffers = */ 1, /* .MaxVaryingComponents = */ 60, /* .MaxVertexOutputComponents = */ 64, /* .MaxGeometryInputComponents = */ 64, /* .MaxGeometryOutputComponents = */ 128, /* .MaxFragmentInputComponents = */ 128, /* .MaxImageUnits = */ 8, /* .MaxCombinedImageUnitsAndFragmentOutputs = */ 8, /* .MaxCombinedShaderOutputResources = */ 8, /* .MaxImageSamples = */ 0, /* .MaxVertexImageUniforms = */ 0, /* .MaxTessControlImageUniforms = */ 0, /* .MaxTessEvaluationImageUniforms = */ 0, /* .MaxGeometryImageUniforms = */ 0, /* .MaxFragmentImageUniforms = */ 8, /* .MaxCombinedImageUniforms = */ 8, /* .MaxGeometryTextureImageUnits = */ 16, /* .MaxGeometryOutputVertices = */ 256, /* .MaxGeometryTotalOutputComponents = */ 1024, /* .MaxGeometryUniformComponents = */ 1024, /* .MaxGeometryVaryingComponents = */ 64, /* .MaxTessControlInputComponents = */ 128, /* .MaxTessControlOutputComponents = */ 128, /* .MaxTessControlTextureImageUnits = */ 16, /* .MaxTessControlUniformComponents = */ 1024, /* .MaxTessControlTotalOutputComponents = */ 4096, /* .MaxTessEvaluationInputComponents = */ 128, /* .MaxTessEvaluationOutputComponents = */ 128, /* .MaxTessEvaluationTextureImageUnits = */ 16, /* .MaxTessEvaluationUniformComponents = */ 1024, /* .MaxTessPatchComponents = */ 120, /* .MaxPatchVertices = */ 32, /* .MaxTessGenLevel = */ 64, /* .MaxViewports = */ 16, /* .MaxVertexAtomicCounters = */ 0, /* .MaxTessControlAtomicCounters = */ 0, /* 
.MaxTessEvaluationAtomicCounters = */ 0, /* .MaxGeometryAtomicCounters = */ 0, /* .MaxFragmentAtomicCounters = */ 8, /* .MaxCombinedAtomicCounters = */ 8, /* .MaxAtomicCounterBindings = */ 1, /* .MaxVertexAtomicCounterBuffers = */ 0, /* .MaxTessControlAtomicCounterBuffers = */ 0, /* .MaxTessEvaluationAtomicCounterBuffers = */ 0, /* .MaxGeometryAtomicCounterBuffers = */ 0, /* .MaxFragmentAtomicCounterBuffers = */ 1, /* .MaxCombinedAtomicCounterBuffers = */ 1, /* .MaxAtomicCounterBufferSize = */ 16384, /* .MaxTransformFeedbackBuffers = */ 4, /* .MaxTransformFeedbackInterleavedComponents = */ 64, /* .MaxCullDistances = */ 8, /* .MaxCombinedClipAndCullDistances = */ 8, /* .MaxSamples = */ 4, /* .maxMeshOutputVerticesNV = */ 256, /* .maxMeshOutputPrimitivesNV = */ 512, /* .maxMeshWorkGroupSizeX_NV = */ 32, /* .maxMeshWorkGroupSizeY_NV = */ 1, /* .maxMeshWorkGroupSizeZ_NV = */ 1, /* .maxTaskWorkGroupSizeX_NV = */ 32, /* .maxTaskWorkGroupSizeY_NV = */ 1, /* .maxTaskWorkGroupSizeZ_NV = */ 1, /* .maxMeshViewCountNV = */ 4, /* .maxDualSourceDrawBuffersEXT = */ 1, /* .limits = */ { /* .nonInductiveForLoops = */ 1, /* .whileLoops = */ 1, /* .doWhileLoops = */ 1, /* .generalUniformIndexing = */ 1, /* .generalAttributeMatrixVectorIndexing = */ 1, /* .generalVaryingIndexing = */ 1, /* .generalSamplerIndexing = */ 1, /* .generalVariableIndexing = */ 1, /* .generalConstantMatrixVectorIndexing = */ 1, } }; glslang_target_client_version_t client_version = (app->configuration.halfPrecision) ? GLSLANG_TARGET_VULKAN_1_1 : GLSLANG_TARGET_VULKAN_1_0; glslang_target_language_version_t target_language_version = (app->configuration.halfPrecision) ? 
GLSLANG_TARGET_SPV_1_3 : GLSLANG_TARGET_SPV_1_0; const glslang_input_t input = { GLSLANG_SOURCE_GLSL, GLSLANG_STAGE_COMPUTE, GLSLANG_CLIENT_VULKAN, client_version, GLSLANG_TARGET_SPV, target_language_version, code0, 450, GLSLANG_NO_PROFILE, 1, 0, GLSLANG_MSG_DEFAULT_BIT, &default_resource, }; //printf("%s\n", code0); glslang_shader_t* shader = glslang_shader_create(&input); const char* err; if (!glslang_shader_preprocess(shader, &input)) { err = glslang_shader_get_info_log(shader); printf("%s\n", code0); printf("%s\nVkFFT shader type: %" PRIu64 "\n", err, type); glslang_shader_delete(shader); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_SHADER_PREPROCESS; } if (!glslang_shader_parse(shader, &input)) { err = glslang_shader_get_info_log(shader); printf("%s\n", code0); printf("%s\nVkFFT shader type: %" PRIu64 "\n", err, type); glslang_shader_delete(shader); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_SHADER_PARSE; } glslang_program_t* program = glslang_program_create(); glslang_program_add_shader(program, shader); if (!glslang_program_link(program, GLSLANG_MSG_SPV_RULES_BIT | GLSLANG_MSG_VULKAN_RULES_BIT)) { err = glslang_program_get_info_log(program); printf("%s\n", code0); printf("%s\nVkFFT shader type: %" PRIu64 "\n", err, type); glslang_shader_delete(shader); glslang_program_delete(program); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_SHADER_LINK; } glslang_program_SPIRV_generate(program, input.stage); if (glslang_program_SPIRV_get_messages(program)) { printf("%s", glslang_program_SPIRV_get_messages(program)); glslang_shader_delete(shader); glslang_program_delete(program); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_SPIRV_GENERATE; } glslang_shader_delete(shader); uint32_t* tempCode = glslang_program_SPIRV_get_ptr(program); codeSize = glslang_program_SPIRV_get_size(program) * sizeof(uint32_t); axis->binarySize = codeSize; code = (uint32_t*)malloc(codeSize); if (!code) { 
				free(code0);
				code0 = 0;
				glslang_program_delete(program);
				deleteVkFFT(app);
				return VKFFT_ERROR_MALLOC_FAILED;
			}
			axis->binary = code;
			memcpy(code, tempCode, codeSize);
			glslang_program_delete(program);
		}
		VkPipelineShaderStageCreateInfo pipelineShaderStageCreateInfo = { VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO };
		VkComputePipelineCreateInfo computePipelineCreateInfo = { VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO };
		pipelineShaderStageCreateInfo.stage = VK_SHADER_STAGE_COMPUTE_BIT;
		VkShaderModuleCreateInfo createInfo = { VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO };
		createInfo.pCode = code;
		createInfo.codeSize = codeSize;
		res = vkCreateShaderModule(app->configuration.device[0], &createInfo, 0, &pipelineShaderStageCreateInfo.module);
		if (res != VK_SUCCESS) {
			free(code);
			code = 0;
			free(code0);
			code0 = 0;
			deleteVkFFT(app);
			return VKFFT_ERROR_FAILED_TO_CREATE_SHADER_MODULE;
		}
		VkPipelineLayoutCreateInfo pipelineLayoutCreateInfo = { VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO };
		pipelineLayoutCreateInfo.setLayoutCount = 1;
		pipelineLayoutCreateInfo.pSetLayouts = &axis->descriptorSetLayout;
		VkPushConstantRange pushConstantRange = { VK_SHADER_STAGE_COMPUTE_BIT };
		pushConstantRange.offset = 0;
		pushConstantRange.size = (uint32_t)axis->pushConstants.structSize;
		// Push constant ranges are part of the pipeline layout
		if (axis->pushConstants.structSize) {
			pipelineLayoutCreateInfo.pushConstantRangeCount = 1;
			pipelineLayoutCreateInfo.pPushConstantRanges = &pushConstantRange;
		}
		res = vkCreatePipelineLayout(app->configuration.device[0], &pipelineLayoutCreateInfo, 0, &axis->pipelineLayout);
		if (res != VK_SUCCESS) {
			deleteVkFFT(app);
			return VKFFT_ERROR_FAILED_TO_CREATE_PIPELINE_LAYOUT;
		}
		pipelineShaderStageCreateInfo.pName = "main";
		pipelineShaderStageCreateInfo.pSpecializationInfo = 0;// &specializationInfo;
		computePipelineCreateInfo.stage = pipelineShaderStageCreateInfo;
		computePipelineCreateInfo.layout = axis->pipelineLayout;
		res =
vkCreateComputePipelines(app->configuration.device[0], VK_NULL_HANDLE, 1, &computePipelineCreateInfo, 0, &axis->pipeline); if (res != VK_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_PIPELINE; } vkDestroyShaderModule(app->configuration.device[0], pipelineShaderStageCreateInfo.module, 0); if (!app->configuration.saveApplicationToString) { free(code); code = 0; } #elif(VKFFT_BACKEND==1) char* code; size_t codeSize; if (app->configuration.loadApplicationFromString) { char* localStrPointer = (char*)app->configuration.loadApplicationString + app->currentApplicationStringPos; codeSize = strtol(localStrPointer, &localStrPointer, 10); code = (char*)malloc(codeSize); if (!code) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } memcpy(code, localStrPointer + 1, codeSize - 1); code[codeSize - 1] = '\0'; //printf("%s\n", code); app->currentApplicationStringPos += codeSize + (uint64_t)(floor(log10((double)codeSize))) + 1; } else { nvrtcProgram prog; nvrtcResult result = nvrtcCreateProgram(&prog, // prog code0, // buffer "VkFFT.cu", // name 0, // numHeaders 0, // headers 0); // includeNames //free(includeNames); //free(headers); if (result != NVRTC_SUCCESS) { printf("nvrtcCreateProgram error: %s\n", nvrtcGetErrorString(result)); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_PROGRAM; } //const char opts[20] = "--fmad=false"; //result = nvrtcAddNameExpression(prog, "&consts"); //if (result != NVRTC_SUCCESS) printf("1.5 error: %s\n", nvrtcGetErrorString(result)); result = nvrtcCompileProgram(prog, // prog 0, // numOptions 0); // options if (result != NVRTC_SUCCESS) { printf("nvrtcCompileProgram error: %s\n", nvrtcGetErrorString(result)); char* log = (char*)malloc(sizeof(char) * 4000000); if (!log) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM; } else { nvrtcGetProgramLog(prog, log); printf("%s\n", log); free(log); log = 0; printf("%s\n", code0); free(code0); 
code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM; } } result = nvrtcGetPTXSize(prog, &codeSize); if (result != NVRTC_SUCCESS) { printf("nvrtcGetPTXSize error: %s\n", nvrtcGetErrorString(result)); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_CODE_SIZE; } axis->binarySize = codeSize; code = (char*)malloc(codeSize); if (!code) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } axis->binary = code; result = nvrtcGetPTX(prog, code); if (result != NVRTC_SUCCESS) { printf("nvrtcGetPTX error: %s\n", nvrtcGetErrorString(result)); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_CODE; } result = nvrtcDestroyProgram(&prog); if (result != NVRTC_SUCCESS) { printf("nvrtcDestroyProgram error: %s\n", nvrtcGetErrorString(result)); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_DESTROY_PROGRAM; } } CUresult result2 = cuModuleLoadDataEx(&axis->VkFFTModule, code, 0, 0, 0); if (result2 != CUDA_SUCCESS) { printf("cuModuleLoadDataEx error: %d\n", result2); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_LOAD_MODULE; } result2 = cuModuleGetFunction(&axis->VkFFTKernel, axis->VkFFTModule, "VkFFT_main"); if (result2 != CUDA_SUCCESS) { printf("cuModuleGetFunction error: %d\n", result2); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_FUNCTION; } if (axis->specializationConstants.usedSharedMemory > app->configuration.sharedMemorySizeStatic) { result2 = cuFuncSetAttribute(axis->VkFFTKernel, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, (int)axis->specializationConstants.usedSharedMemory); if (result2 != CUDA_SUCCESS) { printf("cuFuncSetAttribute error: %d\n", result2); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_SET_DYNAMIC_SHARED_MEMORY; } } if 
(axis->pushConstants.structSize) { size_t size = axis->pushConstants.structSize; result2 = cuModuleGetGlobal(&axis->consts_addr, &size, axis->VkFFTModule, "consts"); if (result2 != CUDA_SUCCESS) { printf("cuModuleGetGlobal error: %d\n", result2); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_MODULE_GET_GLOBAL; } } if (!app->configuration.saveApplicationToString) { free(code); code = 0; } #elif(VKFFT_BACKEND==2) uint32_t* code; size_t codeSize; if (app->configuration.loadApplicationFromString) { uint32_t* localStrPointer = (uint32_t*)app->configuration.loadApplicationString + app->currentApplicationStringPos; codeSize = localStrPointer[0]; code = (uint32_t*)malloc(codeSize); if (!code) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } memcpy(code, localStrPointer + 1, codeSize); app->currentApplicationStringPos += codeSize / (sizeof(uint32_t)) + 1; } else { hiprtcProgram prog; enum hiprtcResult result = hiprtcCreateProgram(&prog, // prog code0, // buffer "VkFFT.hip", // name 0, // numHeaders 0, // headers 0); // includeNames if (result != HIPRTC_SUCCESS) { printf("hiprtcCreateProgram error: %s\n", hiprtcGetErrorString(result)); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_PROGRAM; } if (axis->pushConstants.structSize) { result = hiprtcAddNameExpression(prog, "&consts"); if (result != HIPRTC_SUCCESS) { printf("hiprtcAddNameExpression error: %s\n", hiprtcGetErrorString(result)); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_ADD_NAME_EXPRESSION; } } result = hiprtcCompileProgram(prog, // prog 0, // numOptions 0); // options if (result != HIPRTC_SUCCESS) { printf("hiprtcCompileProgram error: %s\n", hiprtcGetErrorString(result)); char* log = (char*)malloc(sizeof(char) * 100000); if (!log) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM; } else { hiprtcGetProgramLog(prog, log); printf("%s\n", log); 
				free(log);
				log = 0;
				printf("%s\n", code0);
				free(code0);
				code0 = 0;
				deleteVkFFT(app);
				return VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM;
			}
		}
		result = hiprtcGetCodeSize(prog, &codeSize);
		if (result != HIPRTC_SUCCESS) {
			printf("hiprtcGetCodeSize error: %s\n", hiprtcGetErrorString(result));
			free(code0);
			code0 = 0;
			deleteVkFFT(app);
			return VKFFT_ERROR_FAILED_TO_GET_CODE_SIZE;
		}
		axis->binarySize = codeSize;
		code = (uint32_t*)malloc(codeSize);
		if (!code) {
			free(code0);
			code0 = 0;
			deleteVkFFT(app);
			return VKFFT_ERROR_MALLOC_FAILED;
		}
		axis->binary = code;
		result = hiprtcGetCode(prog, (char*)code);
		if (result != HIPRTC_SUCCESS) {
			printf("hiprtcGetCode error: %s\n", hiprtcGetErrorString(result));
			free(code);
			code = 0;
			free(code0);
			code0 = 0;
			deleteVkFFT(app);
			return VKFFT_ERROR_FAILED_TO_GET_CODE;
		}
		//printf("%s\n", code);
		// Destroy the program.
		result = hiprtcDestroyProgram(&prog);
		if (result != HIPRTC_SUCCESS) {
			printf("hiprtcDestroyProgram error: %s\n", hiprtcGetErrorString(result));
			free(code);
			code = 0;
			free(code0);
			code0 = 0;
			deleteVkFFT(app);
			return VKFFT_ERROR_FAILED_TO_DESTROY_PROGRAM;
		}
	}
	hipError_t result2 = hipModuleLoadDataEx(&axis->VkFFTModule, code, 0, 0, 0);
	if (result2 != hipSuccess) {
		printf("hipModuleLoadDataEx error: %d\n", result2);
		free(code);
		code = 0;
		free(code0);
		code0 = 0;
		deleteVkFFT(app);
		return VKFFT_ERROR_FAILED_TO_LOAD_MODULE;
	}
	result2 = hipModuleGetFunction(&axis->VkFFTKernel, axis->VkFFTModule, "VkFFT_main");
	if (result2 != hipSuccess) {
		printf("hipModuleGetFunction error: %d\n", result2);
		free(code);
		code = 0;
		free(code0);
		code0 = 0;
		deleteVkFFT(app);
		return VKFFT_ERROR_FAILED_TO_GET_FUNCTION;
	}
	if (axis->specializationConstants.usedSharedMemory > app->configuration.sharedMemorySizeStatic) {
		result2 = hipFuncSetAttribute(axis->VkFFTKernel, hipFuncAttributeMaxDynamicSharedMemorySize, (int)axis->specializationConstants.usedSharedMemory);
		//result2 = hipFuncSetCacheConfig(axis->VkFFTKernel, hipFuncCachePreferShared);
		if (result2 != hipSuccess) {
printf("hipFuncSetAttribute error: %d\n", result2); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_SET_DYNAMIC_SHARED_MEMORY; } } if (axis->pushConstants.structSize) { size_t size = axis->pushConstants.structSize; result2 = hipModuleGetGlobal(&axis->consts_addr, &size, axis->VkFFTModule, "consts"); if (result2 != hipSuccess) { printf("hipModuleGetGlobal error: %d\n", result2); free(code); code = 0; free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_MODULE_GET_GLOBAL; } } if (!app->configuration.saveApplicationToString) { free(code); code = 0; } #elif(VKFFT_BACKEND==3) if (app->configuration.loadApplicationFromString) { char* code; size_t codeSize; char* localStrPointer = (char*)app->configuration.loadApplicationString + app->currentApplicationStringPos; codeSize = strtol(localStrPointer, &localStrPointer, 10); code = (char*)malloc(codeSize - 1); if (!code) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } memcpy(code, localStrPointer + 1, codeSize - 2); code[codeSize - 2] = '\0'; app->currentApplicationStringPos += codeSize + (uint64_t)(floor(log10((double)codeSize))); axis->program = clCreateProgramWithBinary(app->configuration.context[0], 1, app->configuration.device, &codeSize, (const unsigned char**)(&code), 0, &res); if (res != CL_SUCCESS) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_PROGRAM; } } else { size_t codelen = strlen(code0); axis->program = clCreateProgramWithSource(app->configuration.context[0], 1, (const char**)&code0, &codelen, &res); if (res != CL_SUCCESS) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_PROGRAM; } } res = clBuildProgram(axis->program, 1, app->configuration.device, 0, 0, 0); if (res != CL_SUCCESS) { size_t log_size; clGetProgramBuildInfo(axis->program, app->configuration.device[0], CL_PROGRAM_BUILD_LOG, 0, 0, &log_size); char* log = (char*)malloc(log_size); if (!log) { 
free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM; } else { clGetProgramBuildInfo(axis->program, app->configuration.device[0], CL_PROGRAM_BUILD_LOG, log_size, log, 0); printf("%s\n", log); free(log); log = 0; printf("%s\n", code0); free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM; } } if (app->configuration.saveApplicationToString) { size_t codeSize; res = clGetProgramInfo(axis->program, CL_PROGRAM_BINARY_SIZES, sizeof(size_t), &codeSize, NULL); if (res != CL_SUCCESS) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM; } axis->binarySize = codeSize; axis->binary = (char*)malloc(axis->binarySize); if (!axis->binary) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } res = clGetProgramInfo(axis->program, CL_PROGRAM_BINARIES, axis->binarySize, &axis->binary, NULL); if (res != CL_SUCCESS) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_COMPILE_PROGRAM; } } axis->kernel = clCreateKernel(axis->program, "VkFFT_main", &res); if (res != CL_SUCCESS) { free(code0); code0 = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_SHADER_MODULE; } #endif if (!app->configuration.keepShaderCode) { free(code0); code0 = 0; axis->specializationConstants.code0 = 0; } } if (axis->specializationConstants.axisSwapped) {//swap back for correct dispatch uint64_t temp = axis->axisBlock[1]; axis->axisBlock[1] = axis->axisBlock[0]; axis->axisBlock[0] = temp; axis->specializationConstants.axisSwapped = 0; } return resFFT; } static inline VkFFTResult initializeVkFFT(VkFFTApplication* app, VkFFTConfiguration inputLaunchConfiguration) { //app->configuration = {};// inputLaunchConfiguration; if (inputLaunchConfiguration.doublePrecision != 0) app->configuration.doublePrecision = inputLaunchConfiguration.doublePrecision; if (inputLaunchConfiguration.doublePrecisionFloatMemory != 0) app->configuration.doublePrecisionFloatMemory = 
inputLaunchConfiguration.doublePrecisionFloatMemory; if (inputLaunchConfiguration.halfPrecision != 0) app->configuration.halfPrecision = inputLaunchConfiguration.halfPrecision; if (inputLaunchConfiguration.halfPrecisionMemoryOnly != 0) app->configuration.halfPrecisionMemoryOnly = inputLaunchConfiguration.halfPrecisionMemoryOnly; //set device parameters #if(VKFFT_BACKEND==0) if (!inputLaunchConfiguration.isCompilerInitialized) { if (!app->configuration.isCompilerInitialized) { int resGlslangInitialize = glslang_initialize_process(); if (!resGlslangInitialize) return VKFFT_ERROR_FAILED_TO_INITIALIZE; app->configuration.isCompilerInitialized = 1; } } if (inputLaunchConfiguration.physicalDevice == 0) { deleteVkFFT(app); return VKFFT_ERROR_INVALID_PHYSICAL_DEVICE; } app->configuration.physicalDevice = inputLaunchConfiguration.physicalDevice; if (inputLaunchConfiguration.device == 0) { deleteVkFFT(app); return VKFFT_ERROR_INVALID_DEVICE; } app->configuration.device = inputLaunchConfiguration.device; if (inputLaunchConfiguration.queue == 0) { deleteVkFFT(app); return VKFFT_ERROR_INVALID_QUEUE; } app->configuration.queue = inputLaunchConfiguration.queue; if (inputLaunchConfiguration.commandPool == 0) { deleteVkFFT(app); return VKFFT_ERROR_INVALID_COMMAND_POOL; } app->configuration.commandPool = inputLaunchConfiguration.commandPool; if (inputLaunchConfiguration.fence == 0) { deleteVkFFT(app); return VKFFT_ERROR_INVALID_FENCE; } app->configuration.fence = inputLaunchConfiguration.fence; VkPhysicalDeviceProperties physicalDeviceProperties = { 0 }; vkGetPhysicalDeviceProperties(app->configuration.physicalDevice[0], &physicalDeviceProperties); app->configuration.maxThreadsNum = physicalDeviceProperties.limits.maxComputeWorkGroupInvocations; if (physicalDeviceProperties.vendorID == 0x8086) app->configuration.maxThreadsNum = 256; //Intel fix app->configuration.maxComputeWorkGroupCount[0] = physicalDeviceProperties.limits.maxComputeWorkGroupCount[0]; 
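/* Illustration only, not part of VkFFT: a minimal sketch of the thread-limit selection performed just above in the Vulkan branch, where the driver-reported maxComputeWorkGroupInvocations is replaced by 256 when the vendor ID is 0x8086 (Intel). The names pick_max_threads and VENDOR_INTEL are hypothetical and exist only for this example. */

```c
#include <assert.h>
#include <stdint.h>

/* PCI vendor ID tested by the "Intel fix" above. */
#define VENDOR_INTEL 0x8086u

/* Hypothetical helper: returns the thread limit to use for a device.
   Mirrors the source: Intel devices are pinned to 256 regardless of the
   reported limit, every other vendor keeps the reported value. */
static uint64_t pick_max_threads(uint32_t vendor_id, uint64_t reported_limit) {
    if (vendor_id == VENDOR_INTEL) return 256;
    return reported_limit;
}

/* Small self-check showing the intended behaviour. */
static void pick_max_threads_self_check(void) {
    assert(pick_max_threads(VENDOR_INTEL, 1024) == 256);
    assert(pick_max_threads(0x10DEu, 1024) == 1024); /* NVIDIA keeps its limit */
}
```

/* Keeping the override in one place like this is only a readability choice; the behaviour is the same as the inline assignment above. */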
app->configuration.maxComputeWorkGroupCount[1] = physicalDeviceProperties.limits.maxComputeWorkGroupCount[1]; app->configuration.maxComputeWorkGroupCount[2] = physicalDeviceProperties.limits.maxComputeWorkGroupCount[2]; app->configuration.maxComputeWorkGroupSize[0] = physicalDeviceProperties.limits.maxComputeWorkGroupSize[0]; app->configuration.maxComputeWorkGroupSize[1] = physicalDeviceProperties.limits.maxComputeWorkGroupSize[1]; app->configuration.maxComputeWorkGroupSize[2] = physicalDeviceProperties.limits.maxComputeWorkGroupSize[2]; //if ((physicalDeviceProperties.vendorID == 0x8086) && (!app->configuration.doublePrecision) && (!app->configuration.doublePrecisionFloatMemory)) app->configuration.halfThreads = 1; app->configuration.sharedMemorySize = physicalDeviceProperties.limits.maxComputeSharedMemorySize; app->configuration.sharedMemorySizePow2 = (uint64_t)pow(2, (uint64_t)log2(physicalDeviceProperties.limits.maxComputeSharedMemorySize)); switch (physicalDeviceProperties.vendorID) { case 0x10DE://NVIDIA app->configuration.coalescedMemory = (app->configuration.halfPrecision) ? 64 : 32;//the coalesced memory is equal to 32 bytes between L2 and VRAM. app->configuration.useLUT = (app->configuration.doublePrecision || app->configuration.doublePrecisionFloatMemory) ? 1 : 0; app->configuration.warpSize = 32; app->configuration.registerBoostNonPow2 = 0; app->configuration.registerBoost = 4; app->configuration.registerBoost4Step = 1; app->configuration.swapTo3Stage4Step = 0; break; case 0x8086://INTEL app->configuration.coalescedMemory = (app->configuration.halfPrecision) ? 128 : 64; app->configuration.useLUT = 1; app->configuration.warpSize = 32; app->configuration.registerBoostNonPow2 = 0; app->configuration.registerBoost = (physicalDeviceProperties.limits.maxComputeSharedMemorySize >= 65536) ? 
1 : 2; app->configuration.registerBoost4Step = 1; app->configuration.swapTo3Stage4Step = 0; break; case 0x1002://AMD app->configuration.coalescedMemory = (app->configuration.halfPrecision) ? 64 : 32; app->configuration.useLUT = (app->configuration.doublePrecision || app->configuration.doublePrecisionFloatMemory) ? 1 : 0; app->configuration.warpSize = 64; app->configuration.registerBoostNonPow2 = 0; app->configuration.registerBoost = (physicalDeviceProperties.limits.maxComputeSharedMemorySize >= 65536) ? 2 : 4; app->configuration.registerBoost4Step = 1; app->configuration.swapTo3Stage4Step = 0; break; default: app->configuration.coalescedMemory = (app->configuration.halfPrecision) ? 128 : 64; app->configuration.useLUT = (app->configuration.doublePrecision || app->configuration.doublePrecisionFloatMemory) ? 1 : 0; app->configuration.warpSize = 32; app->configuration.registerBoostNonPow2 = 0; app->configuration.registerBoost = 1; app->configuration.registerBoost4Step = 1; app->configuration.swapTo3Stage4Step = 0; break; } #elif(VKFFT_BACKEND==1) CUresult res = CUDA_SUCCESS; cudaError_t res_t = cudaSuccess; if (inputLaunchConfiguration.device == 0) { deleteVkFFT(app); return VKFFT_ERROR_INVALID_DEVICE; } app->configuration.device = inputLaunchConfiguration.device; if (inputLaunchConfiguration.num_streams != 0) app->configuration.num_streams = inputLaunchConfiguration.num_streams; if (inputLaunchConfiguration.stream != 0) app->configuration.stream = inputLaunchConfiguration.stream; app->configuration.streamID = 0; int value = 0; res = cuDeviceGetAttribute(&value, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, app->configuration.device[0]); if (res != CUDA_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.maxThreadsNum = value; res = cuDeviceGetAttribute(&value, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X, app->configuration.device[0]); if (res != CUDA_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } 
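/* Illustration only, not part of VkFFT: sharedMemorySizePow2 is computed above as (uint64_t)pow(2, (uint64_t)log2(size)), i.e. the largest power of two that does not exceed the shared memory size. The sketch below is an integer-only alternative, assuming the same rounding-down behaviour is wanted; floor_pow2 is a hypothetical name, and returning 0 for a zero input is a choice made for this example (the floating-point expression is not defined for 0). */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: largest power of two <= x, or 0 when x == 0.
   The loop compares against x / 2 so that p never overflows. */
static uint64_t floor_pow2(uint64_t x) {
    uint64_t p = 1;
    if (x == 0) return 0;
    while (p <= x / 2) p *= 2;
    return p;
}

/* Small self-check: 48 KiB rounds down to 32 KiB, 64 KiB is unchanged. */
static void floor_pow2_self_check(void) {
    assert(floor_pow2(49152) == 32768);
    assert(floor_pow2(65536) == 65536);
}
```

/* Later in this function the power-of-two value is used as the shared memory budget for axes whose size is itself a power of two, and the unrounded value for all other sizes. */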
app->configuration.maxComputeWorkGroupCount[0] = value; res = cuDeviceGetAttribute(&value, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_Y, app->configuration.device[0]); if (res != CUDA_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.maxComputeWorkGroupCount[1] = value; res = cuDeviceGetAttribute(&value, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_Z, app->configuration.device[0]); if (res != CUDA_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.maxComputeWorkGroupCount[2] = value; res = cuDeviceGetAttribute(&value, CU_DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X, app->configuration.device[0]); if (res != CUDA_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.maxComputeWorkGroupSize[0] = value; res = cuDeviceGetAttribute(&value, CU_DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Y, app->configuration.device[0]); if (res != CUDA_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.maxComputeWorkGroupSize[1] = value; res = cuDeviceGetAttribute(&value, CU_DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Z, app->configuration.device[0]); if (res != CUDA_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.maxComputeWorkGroupSize[2] = value; res = cuDeviceGetAttribute(&value, CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK, app->configuration.device[0]); if (res != CUDA_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.sharedMemorySizeStatic = value; res = cuDeviceGetAttribute(&value, CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN, app->configuration.device[0]); if (res != CUDA_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.sharedMemorySize = value;// (value > 65536) ? 
65536 : value; res = cuDeviceGetAttribute(&value, CU_DEVICE_ATTRIBUTE_WARP_SIZE, app->configuration.device[0]); if (res != CUDA_SUCCESS) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.warpSize = value; app->configuration.sharedMemorySizePow2 = (uint64_t)pow(2, (uint64_t)log2(app->configuration.sharedMemorySize)); if (app->configuration.num_streams > 1) { app->configuration.stream_event = (cudaEvent_t*)malloc(app->configuration.num_streams * sizeof(cudaEvent_t)); if (!app->configuration.stream_event) { deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } for (uint64_t i = 0; i < app->configuration.num_streams; i++) { res_t = cudaEventCreate(&app->configuration.stream_event[i]); if (res_t != cudaSuccess) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_EVENT; } } } app->configuration.coalescedMemory = (app->configuration.halfPrecision) ? 64 : 32;//the coalesced memory is equal to 32 bytes between L2 and VRAM. app->configuration.useLUT = (app->configuration.doublePrecision || app->configuration.doublePrecisionFloatMemory) ? 
1 : 0; app->configuration.registerBoostNonPow2 = 0; app->configuration.registerBoost = 1; app->configuration.registerBoost4Step = 1; app->configuration.swapTo3Stage4Step = 0; #elif(VKFFT_BACKEND==2) hipError_t res = hipSuccess; if (inputLaunchConfiguration.device == 0) { deleteVkFFT(app); return VKFFT_ERROR_INVALID_DEVICE; } app->configuration.device = inputLaunchConfiguration.device; if (inputLaunchConfiguration.num_streams != 0) app->configuration.num_streams = inputLaunchConfiguration.num_streams; if (inputLaunchConfiguration.stream != 0) app->configuration.stream = inputLaunchConfiguration.stream; app->configuration.streamID = 0; int value = 0; res = hipDeviceGetAttribute(&value, hipDeviceAttributeMaxThreadsPerBlock, app->configuration.device[0]); if (res != hipSuccess) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.maxThreadsNum = value; res = hipDeviceGetAttribute(&value, hipDeviceAttributeMaxGridDimX, app->configuration.device[0]); if (res != hipSuccess) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.maxComputeWorkGroupCount[0] = value; res = hipDeviceGetAttribute(&value, hipDeviceAttributeMaxGridDimY, app->configuration.device[0]); if (res != hipSuccess) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.maxComputeWorkGroupCount[1] = value; res = hipDeviceGetAttribute(&value, hipDeviceAttributeMaxGridDimZ, app->configuration.device[0]); if (res != hipSuccess) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.maxComputeWorkGroupCount[2] = value; res = hipDeviceGetAttribute(&value, hipDeviceAttributeMaxBlockDimX, app->configuration.device[0]); if (res != hipSuccess) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.maxComputeWorkGroupSize[0] = value; res = hipDeviceGetAttribute(&value, hipDeviceAttributeMaxBlockDimY, app->configuration.device[0]); if (res != hipSuccess) { 
deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.maxComputeWorkGroupSize[1] = value; res = hipDeviceGetAttribute(&value, hipDeviceAttributeMaxBlockDimZ, app->configuration.device[0]); if (res != hipSuccess) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.maxComputeWorkGroupSize[2] = value; res = hipDeviceGetAttribute(&value, hipDeviceAttributeMaxSharedMemoryPerBlock, app->configuration.device[0]); if (res != hipSuccess) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.sharedMemorySizeStatic = value; //hipDeviceGetAttribute(&value, hipDeviceAttributeMaxSharedMemoryPerBlockOptin, app->configuration.device[0]); app->configuration.sharedMemorySize = value;// (value > 65536) ? 65536 : value; res = hipDeviceGetAttribute(&value, hipDeviceAttributeWarpSize, app->configuration.device[0]); if (res != hipSuccess) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.warpSize = value; app->configuration.sharedMemorySizePow2 = (uint64_t)pow(2, (uint64_t)log2(app->configuration.sharedMemorySize)); if (app->configuration.num_streams > 1) { app->configuration.stream_event = (hipEvent_t*)malloc(app->configuration.num_streams * sizeof(hipEvent_t)); if (!app->configuration.stream_event) { deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } for (uint64_t i = 0; i < app->configuration.num_streams; i++) { res = hipEventCreate(&app->configuration.stream_event[i]); if (res != hipSuccess) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_CREATE_EVENT; } } } app->configuration.coalescedMemory = (app->configuration.halfPrecision) ? 64 : 32; app->configuration.useLUT = (app->configuration.doublePrecision || app->configuration.doublePrecisionFloatMemory) ? 
1 : 0; app->configuration.registerBoostNonPow2 = 0; app->configuration.registerBoost = 1; app->configuration.registerBoost4Step = 1; app->configuration.swapTo3Stage4Step = 0; #elif(VKFFT_BACKEND==3) cl_int res = 0; if (inputLaunchConfiguration.device == 0) { deleteVkFFT(app); return VKFFT_ERROR_INVALID_DEVICE; } app->configuration.device = inputLaunchConfiguration.device; if (inputLaunchConfiguration.context == 0) { deleteVkFFT(app); return VKFFT_ERROR_INVALID_CONTEXT; } app->configuration.context = inputLaunchConfiguration.context; if (inputLaunchConfiguration.platform == 0) { deleteVkFFT(app); return VKFFT_ERROR_INVALID_PLATFORM; } app->configuration.platform = inputLaunchConfiguration.platform; cl_uint vendorID; size_t value_int64; cl_uint value_cl_uint; res = clGetDeviceInfo(app->configuration.device[0], CL_DEVICE_VENDOR_ID, sizeof(cl_uint), &vendorID, 0); if (res != 0) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } res = clGetDeviceInfo(app->configuration.device[0], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &value_int64, 0); if (res != 0) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.maxThreadsNum = value_int64; res = clGetDeviceInfo(app->configuration.device[0], CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, sizeof(cl_uint), &value_cl_uint, 0); if (res != 0) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } size_t* dims = (size_t*)malloc(sizeof(size_t) * value_cl_uint); if (dims) { res = clGetDeviceInfo(app->configuration.device[0], CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(size_t) * value_cl_uint, dims, 0); if (res != 0) { free(dims); dims = 0; deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.maxComputeWorkGroupSize[0] = dims[0]; app->configuration.maxComputeWorkGroupSize[1] = dims[1]; app->configuration.maxComputeWorkGroupSize[2] = dims[2]; free(dims); dims = 0; } else { deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } app->configuration.maxComputeWorkGroupCount[0] = UINT64_MAX; 
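/* Illustration only, not part of VkFFT: the OpenCL branch above allocates a temporary array, fills it through a device query, and copies the first three entries into maxComputeWorkGroupSize. The sketch below shows one way to structure such a step so that the temporary array is released on every exit path. The query is replaced by a function pointer so the example compiles without OpenCL headers; fill_fn, read_work_item_sizes, and the fake_fill_* functions are hypothetical names used only here. */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for a device query such as clGetDeviceInfo: returns 0 on success. */
typedef int (*fill_fn)(size_t* dims, size_t count);

/* Hypothetical helper. Returns 0 on success, 1 if the allocation fails,
   2 if fewer than three dimensions are reported or the query fails. */
static int read_work_item_sizes(fill_fn fill, size_t count, uint64_t out[3]) {
    size_t* dims;
    int status = 0;
    if (count < 3) return 2;
    dims = (size_t*)malloc(sizeof(size_t) * count);
    if (!dims) return 1;
    if (fill(dims, count) != 0) {
        status = 2; /* fall through to the single free below */
    } else {
        out[0] = dims[0];
        out[1] = dims[1];
        out[2] = dims[2];
    }
    free(dims); /* released on both the success and the failure path */
    return status;
}

/* Fake queries used only to exercise the helper. */
static int fake_fill_ok(size_t* dims, size_t count) {
    for (size_t i = 0; i < count; i++) dims[i] = (size_t)1024 >> i;
    return 0;
}
static int fake_fill_fail(size_t* dims, size_t count) {
    (void)dims; (void)count;
    return 1;
}

/* Small self-check of the success path. */
static void read_work_item_sizes_self_check(void) {
    uint64_t out[3] = { 0, 0, 0 };
    assert(read_work_item_sizes(fake_fill_ok, 3, out) == 0);
    assert(out[0] == 1024 && out[1] == 512 && out[2] == 256);
}
```

/* The single free() after the branch is the point of the sketch: no early return sits between the allocation and the release. */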
app->configuration.maxComputeWorkGroupCount[1] = UINT64_MAX; app->configuration.maxComputeWorkGroupCount[2] = UINT64_MAX; //if ((vendorID == 0x8086) && (!app->configuration.doublePrecision) && (!app->configuration.doublePrecisionFloatMemory)) app->configuration.halfThreads = 1; cl_ulong sharedMemorySize; res = clGetDeviceInfo(app->configuration.device[0], CL_DEVICE_LOCAL_MEM_SIZE, sizeof(cl_ulong), &sharedMemorySize, 0); if (res != 0) { deleteVkFFT(app); return VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE; } app->configuration.sharedMemorySize = sharedMemorySize; app->configuration.sharedMemorySizePow2 = (uint64_t)pow(2, (uint64_t)log2(sharedMemorySize)); switch (vendorID) { case 0x10DE://NVIDIA app->configuration.coalescedMemory = (app->configuration.halfPrecision) ? 64 : 32;//the coalesced memory is equal to 32 bytes between L2 and VRAM. app->configuration.useLUT = (app->configuration.doublePrecision || app->configuration.doublePrecisionFloatMemory) ? 1 : 0; app->configuration.warpSize = 32; app->configuration.registerBoostNonPow2 = 0; app->configuration.registerBoost = 4; app->configuration.registerBoost4Step = 1; app->configuration.swapTo3Stage4Step = 0; app->configuration.sharedMemorySize -= 0x10;//reserved by system app->configuration.sharedMemorySizePow2 = (uint64_t)pow(2, (uint64_t)log2(app->configuration.sharedMemorySize)); break; case 0x8086://INTEL app->configuration.coalescedMemory = (app->configuration.halfPrecision) ? 128 : 64; app->configuration.useLUT = 1; app->configuration.warpSize = 32; app->configuration.registerBoostNonPow2 = 0; app->configuration.registerBoost = (sharedMemorySize >= 65536) ? 1 : 2; app->configuration.registerBoost4Step = 1; app->configuration.swapTo3Stage4Step = 0; break; case 0x1002://AMD app->configuration.coalescedMemory = (app->configuration.halfPrecision) ? 64 : 32; app->configuration.useLUT = (app->configuration.doublePrecision || app->configuration.doublePrecisionFloatMemory) ? 
1 : 0; app->configuration.warpSize = 64; app->configuration.registerBoostNonPow2 = 0; app->configuration.registerBoost = (sharedMemorySize >= 65536) ? 2 : 4; app->configuration.registerBoost4Step = 1; app->configuration.swapTo3Stage4Step = 0; break; default: app->configuration.coalescedMemory = (app->configuration.halfPrecision) ? 128 : 64; app->configuration.useLUT = (app->configuration.doublePrecision || app->configuration.doublePrecisionFloatMemory) ? 1 : 0; app->configuration.warpSize = 32; app->configuration.registerBoostNonPow2 = 0; app->configuration.registerBoost = 1; app->configuration.registerBoost4Step = 1; app->configuration.swapTo3Stage4Step = 0; break; } #endif //set main parameters: if (inputLaunchConfiguration.FFTdim == 0) { deleteVkFFT(app); return VKFFT_ERROR_EMPTY_FFTdim; } app->configuration.FFTdim = inputLaunchConfiguration.FFTdim; if (inputLaunchConfiguration.size[0] == 0) { deleteVkFFT(app); return VKFFT_ERROR_EMPTY_size; } app->configuration.size[0] = inputLaunchConfiguration.size[0]; if (inputLaunchConfiguration.bufferStride[0] == 0) { if (inputLaunchConfiguration.performR2C) app->configuration.bufferStride[0] = app->configuration.size[0] / 2 + 1; else app->configuration.bufferStride[0] = app->configuration.size[0]; } else app->configuration.bufferStride[0] = inputLaunchConfiguration.bufferStride[0]; if (inputLaunchConfiguration.inputBufferStride[0] == 0) { if (inputLaunchConfiguration.performR2C) app->configuration.inputBufferStride[0] = app->configuration.size[0] + 2; else app->configuration.inputBufferStride[0] = app->configuration.size[0]; } else app->configuration.inputBufferStride[0] = inputLaunchConfiguration.inputBufferStride[0]; if (inputLaunchConfiguration.outputBufferStride[0] == 0) { if (inputLaunchConfiguration.performR2C) app->configuration.outputBufferStride[0] = app->configuration.size[0] + 2; else app->configuration.outputBufferStride[0] = app->configuration.size[0]; } else app->configuration.outputBufferStride[0] = 
inputLaunchConfiguration.outputBufferStride[0]; for (uint64_t i = 1; i < 3; i++) { if (inputLaunchConfiguration.size[i] == 0) app->configuration.size[i] = 1; else app->configuration.size[i] = inputLaunchConfiguration.size[i]; if (inputLaunchConfiguration.bufferStride[i] == 0) app->configuration.bufferStride[i] = app->configuration.bufferStride[i - 1] * app->configuration.size[i]; else app->configuration.bufferStride[i] = inputLaunchConfiguration.bufferStride[i]; if (inputLaunchConfiguration.inputBufferStride[i] == 0) app->configuration.inputBufferStride[i] = app->configuration.inputBufferStride[i - 1] * app->configuration.size[i]; else app->configuration.inputBufferStride[i] = inputLaunchConfiguration.inputBufferStride[i]; if (inputLaunchConfiguration.outputBufferStride[i] == 0) app->configuration.outputBufferStride[i] = app->configuration.outputBufferStride[i - 1] * app->configuration.size[i]; else app->configuration.outputBufferStride[i] = inputLaunchConfiguration.outputBufferStride[i]; } app->configuration.isInputFormatted = inputLaunchConfiguration.isInputFormatted; app->configuration.isOutputFormatted = inputLaunchConfiguration.isOutputFormatted; app->configuration.performConvolution = inputLaunchConfiguration.performConvolution; if (inputLaunchConfiguration.bufferNum == 0) app->configuration.bufferNum = 1; else app->configuration.bufferNum = inputLaunchConfiguration.bufferNum; #if(VKFFT_BACKEND==0) if (inputLaunchConfiguration.bufferSize == 0) { deleteVkFFT(app); return VKFFT_ERROR_EMPTY_bufferSize; } #endif app->configuration.bufferSize = inputLaunchConfiguration.bufferSize; if (app->configuration.bufferSize != 0) { for (uint64_t i = 0; i < app->configuration.bufferNum; i++) { if (app->configuration.bufferSize[i] == 0) { deleteVkFFT(app); return VKFFT_ERROR_EMPTY_bufferSize; } } } app->configuration.buffer = inputLaunchConfiguration.buffer; if (inputLaunchConfiguration.userTempBuffer != 0) app->configuration.userTempBuffer = 
inputLaunchConfiguration.userTempBuffer; if (app->configuration.userTempBuffer != 0) { if (inputLaunchConfiguration.tempBufferNum == 0) app->configuration.tempBufferNum = 1; else app->configuration.tempBufferNum = inputLaunchConfiguration.tempBufferNum; #if(VKFFT_BACKEND==0) if (inputLaunchConfiguration.tempBufferSize == 0) { deleteVkFFT(app); return VKFFT_ERROR_EMPTY_tempBufferSize; } #endif app->configuration.tempBufferSize = inputLaunchConfiguration.tempBufferSize; if (app->configuration.tempBufferSize != 0) { for (uint64_t i = 0; i < app->configuration.tempBufferNum; i++) { if (app->configuration.tempBufferSize[i] == 0) { deleteVkFFT(app); return VKFFT_ERROR_EMPTY_tempBufferSize; } } } app->configuration.tempBuffer = inputLaunchConfiguration.tempBuffer; } else { app->configuration.tempBufferNum = 1; app->configuration.tempBufferSize = (uint64_t*)malloc(sizeof(uint64_t)); if (!app->configuration.tempBufferSize) { deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } app->configuration.tempBufferSize[0] = 0; } if (app->configuration.isInputFormatted) { if (inputLaunchConfiguration.inputBufferNum == 0) app->configuration.inputBufferNum = 1; else app->configuration.inputBufferNum = inputLaunchConfiguration.inputBufferNum; #if(VKFFT_BACKEND==0) if (inputLaunchConfiguration.inputBufferSize == 0) { deleteVkFFT(app); return VKFFT_ERROR_EMPTY_inputBufferSize; } #endif app->configuration.inputBufferSize = inputLaunchConfiguration.inputBufferSize; if (app->configuration.inputBufferSize != 0) { for (uint64_t i = 0; i < app->configuration.inputBufferNum; i++) { if (app->configuration.inputBufferSize[i] == 0) { deleteVkFFT(app); return VKFFT_ERROR_EMPTY_inputBufferSize; } } } app->configuration.inputBuffer = inputLaunchConfiguration.inputBuffer; } else { app->configuration.inputBufferNum = app->configuration.bufferNum; app->configuration.inputBufferSize = app->configuration.bufferSize; app->configuration.inputBuffer = app->configuration.buffer; } if 
(app->configuration.isOutputFormatted) { if (inputLaunchConfiguration.outputBufferNum == 0) app->configuration.outputBufferNum = 1; else app->configuration.outputBufferNum = inputLaunchConfiguration.outputBufferNum; #if(VKFFT_BACKEND==0) if (inputLaunchConfiguration.outputBufferSize == 0) { deleteVkFFT(app); return VKFFT_ERROR_EMPTY_outputBufferSize; } #endif app->configuration.outputBufferSize = inputLaunchConfiguration.outputBufferSize; if (app->configuration.outputBufferSize != 0) { for (uint64_t i = 0; i < app->configuration.outputBufferNum; i++) { if (app->configuration.outputBufferSize[i] == 0) { deleteVkFFT(app); return VKFFT_ERROR_EMPTY_outputBufferSize; } } } app->configuration.outputBuffer = inputLaunchConfiguration.outputBuffer; } else { app->configuration.outputBufferNum = app->configuration.bufferNum; app->configuration.outputBufferSize = app->configuration.bufferSize; app->configuration.outputBuffer = app->configuration.buffer; } if (app->configuration.performConvolution) { if (inputLaunchConfiguration.kernelNum == 0) app->configuration.kernelNum = 1; else app->configuration.kernelNum = inputLaunchConfiguration.kernelNum; #if(VKFFT_BACKEND==0) if (inputLaunchConfiguration.kernelSize == 0) { deleteVkFFT(app); return VKFFT_ERROR_EMPTY_kernelSize; } #endif app->configuration.kernelSize = inputLaunchConfiguration.kernelSize; if (app->configuration.kernelSize != 0) { for (uint64_t i = 0; i < app->configuration.kernelNum; i++) { if (app->configuration.kernelSize[i] == 0) { deleteVkFFT(app); return VKFFT_ERROR_EMPTY_kernelSize; } } } app->configuration.kernel = inputLaunchConfiguration.kernel; } if (inputLaunchConfiguration.bufferOffset != 0) app->configuration.bufferOffset = inputLaunchConfiguration.bufferOffset; if (inputLaunchConfiguration.tempBufferOffset != 0) app->configuration.tempBufferOffset = inputLaunchConfiguration.tempBufferOffset; if (inputLaunchConfiguration.inputBufferOffset != 0) app->configuration.inputBufferOffset = 
inputLaunchConfiguration.inputBufferOffset; if (inputLaunchConfiguration.outputBufferOffset != 0) app->configuration.outputBufferOffset = inputLaunchConfiguration.outputBufferOffset; if (inputLaunchConfiguration.kernelOffset != 0) app->configuration.kernelOffset = inputLaunchConfiguration.kernelOffset; if (inputLaunchConfiguration.specifyOffsetsAtLaunch != 0) app->configuration.specifyOffsetsAtLaunch = inputLaunchConfiguration.specifyOffsetsAtLaunch; //set optional parameters: uint64_t checkBufferSizeFor64BitAddressing = 0; for (uint64_t i = 0; i < app->configuration.bufferNum; i++) { if (app->configuration.bufferSize) checkBufferSizeFor64BitAddressing += app->configuration.bufferSize[i]; else { checkBufferSizeFor64BitAddressing = app->configuration.size[0] * app->configuration.size[1] * app->configuration.size[2] * app->configuration.coordinateFeatures * app->configuration.numberBatches * app->configuration.numberKernels * 8; if (app->configuration.doublePrecision) checkBufferSizeFor64BitAddressing *= 2; } } if (checkBufferSizeFor64BitAddressing >= (uint64_t)pow((uint64_t)2, (uint64_t)34)) app->configuration.useUint64 = 1; checkBufferSizeFor64BitAddressing = 0; for (uint64_t i = 0; i < app->configuration.inputBufferNum; i++) { if (app->configuration.inputBufferSize) checkBufferSizeFor64BitAddressing += app->configuration.inputBufferSize[i]; } if (checkBufferSizeFor64BitAddressing >= (uint64_t)pow((uint64_t)2, (uint64_t)34)) app->configuration.useUint64 = 1; checkBufferSizeFor64BitAddressing = 0; for (uint64_t i = 0; i < app->configuration.outputBufferNum; i++) { if (app->configuration.outputBufferSize) checkBufferSizeFor64BitAddressing += app->configuration.outputBufferSize[i]; } if (checkBufferSizeFor64BitAddressing >= (uint64_t)pow((uint64_t)2, (uint64_t)34)) app->configuration.useUint64 = 1; checkBufferSizeFor64BitAddressing = 0; for (uint64_t i = 0; i < app->configuration.kernelNum; i++) { if (app->configuration.kernelSize) checkBufferSizeFor64BitAddressing += 
app->configuration.kernelSize[i]; } if (checkBufferSizeFor64BitAddressing >= (uint64_t)pow((uint64_t)2, (uint64_t)34)) app->configuration.useUint64 = 1; if (inputLaunchConfiguration.useUint64 != 0) app->configuration.useUint64 = inputLaunchConfiguration.useUint64; if (inputLaunchConfiguration.coalescedMemory != 0) app->configuration.coalescedMemory = inputLaunchConfiguration.coalescedMemory; app->configuration.aimThreads = 128; if (inputLaunchConfiguration.aimThreads != 0) app->configuration.aimThreads = inputLaunchConfiguration.aimThreads; app->configuration.numSharedBanks = 32; if (inputLaunchConfiguration.numSharedBanks != 0) app->configuration.numSharedBanks = inputLaunchConfiguration.numSharedBanks; if (inputLaunchConfiguration.inverseReturnToInputBuffer != 0) app->configuration.inverseReturnToInputBuffer = inputLaunchConfiguration.inverseReturnToInputBuffer; if (inputLaunchConfiguration.useLUT != 0) app->configuration.useLUT = inputLaunchConfiguration.useLUT; if (inputLaunchConfiguration.fixMaxRadixBluestein != 0) app->configuration.fixMaxRadixBluestein = inputLaunchConfiguration.fixMaxRadixBluestein; if (inputLaunchConfiguration.performR2C != 0) { app->configuration.performR2C = inputLaunchConfiguration.performR2C; } if (inputLaunchConfiguration.performDCT != 0) { app->configuration.performDCT = inputLaunchConfiguration.performDCT; } if (inputLaunchConfiguration.disableMergeSequencesR2C != 0) { app->configuration.disableMergeSequencesR2C = inputLaunchConfiguration.disableMergeSequencesR2C; } app->configuration.normalize = 0; if (inputLaunchConfiguration.normalize != 0) app->configuration.normalize = inputLaunchConfiguration.normalize; if (inputLaunchConfiguration.makeForwardPlanOnly != 0) app->configuration.makeForwardPlanOnly = inputLaunchConfiguration.makeForwardPlanOnly; if (inputLaunchConfiguration.makeInversePlanOnly != 0) app->configuration.makeInversePlanOnly = inputLaunchConfiguration.makeInversePlanOnly; app->configuration.reorderFourStep = 1; if 
(inputLaunchConfiguration.disableReorderFourStep != 0) app->configuration.reorderFourStep = 0; if (inputLaunchConfiguration.frequencyZeroPadding != 0) app->configuration.frequencyZeroPadding = inputLaunchConfiguration.frequencyZeroPadding; for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { if (inputLaunchConfiguration.performZeropadding[i] != 0) { app->configuration.performZeropadding[i] = inputLaunchConfiguration.performZeropadding[i]; app->configuration.fft_zeropad_left[i] = inputLaunchConfiguration.fft_zeropad_left[i]; app->configuration.fft_zeropad_right[i] = inputLaunchConfiguration.fft_zeropad_right[i]; } } if (inputLaunchConfiguration.registerBoost != 0) app->configuration.registerBoost = inputLaunchConfiguration.registerBoost; if (inputLaunchConfiguration.registerBoostNonPow2 != 0) app->configuration.registerBoostNonPow2 = inputLaunchConfiguration.registerBoostNonPow2; if (inputLaunchConfiguration.registerBoost4Step != 0) app->configuration.registerBoost4Step = inputLaunchConfiguration.registerBoost4Step; if (app->configuration.performR2C != 0) { app->configuration.registerBoost = 1; app->configuration.registerBoostNonPow2 = 0; app->configuration.registerBoost4Step = 1; } app->configuration.coordinateFeatures = 1; app->configuration.numberBatches = 1; if (inputLaunchConfiguration.coordinateFeatures != 0) app->configuration.coordinateFeatures = inputLaunchConfiguration.coordinateFeatures; if (inputLaunchConfiguration.numberBatches != 0) app->configuration.numberBatches = inputLaunchConfiguration.numberBatches; app->configuration.matrixConvolution = 1; app->configuration.numberKernels = 1; if (inputLaunchConfiguration.kernelConvolution != 0) { app->configuration.kernelConvolution = inputLaunchConfiguration.kernelConvolution; app->configuration.reorderFourStep = 0; app->configuration.registerBoost = 1; app->configuration.registerBoostNonPow2 = 0; app->configuration.registerBoost4Step = 1; } if (app->configuration.performConvolution) { if 
(inputLaunchConfiguration.matrixConvolution != 0) app->configuration.matrixConvolution = inputLaunchConfiguration.matrixConvolution; if (inputLaunchConfiguration.numberKernels != 0) app->configuration.numberKernels = inputLaunchConfiguration.numberKernels; if (inputLaunchConfiguration.symmetricKernel != 0) app->configuration.symmetricKernel = inputLaunchConfiguration.symmetricKernel; if (inputLaunchConfiguration.conjugateConvolution != 0) app->configuration.conjugateConvolution = inputLaunchConfiguration.conjugateConvolution; if (inputLaunchConfiguration.crossPowerSpectrumNormalization != 0) app->configuration.crossPowerSpectrumNormalization = inputLaunchConfiguration.crossPowerSpectrumNormalization; app->configuration.reorderFourStep = 0; app->configuration.registerBoost = 1; app->configuration.registerBoostNonPow2 = 0; app->configuration.registerBoost4Step = 1; if (app->configuration.matrixConvolution > 1) app->configuration.coordinateFeatures = app->configuration.matrixConvolution; } app->firstAxis = 0; app->lastAxis = app->configuration.FFTdim - 1; if (inputLaunchConfiguration.omitDimension[0] != 0) { app->configuration.omitDimension[0] = inputLaunchConfiguration.omitDimension[0]; app->firstAxis++; if (app->configuration.performConvolution) { deleteVkFFT(app); return VKFFT_ERROR_UNSUPPORTED_FFT_OMIT; } if (app->configuration.performR2C) { deleteVkFFT(app); return VKFFT_ERROR_UNSUPPORTED_FFT_OMIT; } } if (inputLaunchConfiguration.omitDimension[2] != 0) { app->configuration.omitDimension[2] = inputLaunchConfiguration.omitDimension[2]; app->lastAxis--; if (app->configuration.performConvolution) { deleteVkFFT(app); return VKFFT_ERROR_UNSUPPORTED_FFT_OMIT; } } if (inputLaunchConfiguration.omitDimension[1] != 0) { app->configuration.omitDimension[1] = inputLaunchConfiguration.omitDimension[1]; if (app->configuration.omitDimension[0] == 1) app->firstAxis++; if (app->configuration.omitDimension[2] == 1) app->lastAxis--; if (app->configuration.performConvolution) { 
deleteVkFFT(app); return VKFFT_ERROR_UNSUPPORTED_FFT_OMIT; } } if (app->firstAxis > app->lastAxis) { deleteVkFFT(app); return VKFFT_ERROR_UNSUPPORTED_FFT_OMIT; } if (inputLaunchConfiguration.reorderFourStep != 0) app->configuration.reorderFourStep = inputLaunchConfiguration.reorderFourStep; app->configuration.maxCodeLength = 4000000; if (inputLaunchConfiguration.maxCodeLength != 0) app->configuration.maxCodeLength = inputLaunchConfiguration.maxCodeLength; app->configuration.maxTempLength = 5000; if (inputLaunchConfiguration.maxTempLength != 0) app->configuration.maxTempLength = inputLaunchConfiguration.maxTempLength; if (inputLaunchConfiguration.halfThreads != 0) app->configuration.halfThreads = inputLaunchConfiguration.halfThreads; if (inputLaunchConfiguration.swapTo3Stage4Step != 0) app->configuration.swapTo3Stage4Step = inputLaunchConfiguration.swapTo3Stage4Step; if (app->configuration.performDCT > 0) app->configuration.performBandwidthBoost = -1; if (inputLaunchConfiguration.performBandwidthBoost != 0) app->configuration.performBandwidthBoost = inputLaunchConfiguration.performBandwidthBoost; if (inputLaunchConfiguration.devicePageSize != 0) app->configuration.devicePageSize = inputLaunchConfiguration.devicePageSize; if (inputLaunchConfiguration.localPageSize != 0) app->configuration.localPageSize = inputLaunchConfiguration.localPageSize; if (inputLaunchConfiguration.keepShaderCode != 0) app->configuration.keepShaderCode = inputLaunchConfiguration.keepShaderCode; if (inputLaunchConfiguration.printMemoryLayout != 0) app->configuration.printMemoryLayout = inputLaunchConfiguration.printMemoryLayout; if (inputLaunchConfiguration.considerAllAxesStrided != 0) app->configuration.considerAllAxesStrided = inputLaunchConfiguration.considerAllAxesStrided; if (inputLaunchConfiguration.loadApplicationString != 0) app->configuration.loadApplicationString = inputLaunchConfiguration.loadApplicationString; if (inputLaunchConfiguration.saveApplicationToString != 0) 
		app->configuration.saveApplicationToString = inputLaunchConfiguration.saveApplicationToString;
	if (inputLaunchConfiguration.loadApplicationFromString != 0) {
		app->configuration.loadApplicationFromString = inputLaunchConfiguration.loadApplicationFromString;
		if (app->configuration.saveApplicationToString != 0) {
			deleteVkFFT(app);
			return VKFFT_ERROR_ENABLED_saveApplicationToString;
		}
		if (app->configuration.loadApplicationString == 0) {
			deleteVkFFT(app);
			return VKFFT_ERROR_EMPTY_applicationString;
		}
		app->currentApplicationStringPos = 0;
	}
	//temporary set:
	app->configuration.registerBoost4Step = 1;
#if(VKFFT_BACKEND==0)
	app->configuration.useUint64 = 0; //No physical addressing mode in Vulkan shaders. Use multiple-buffer support to achieve emulation of physical addressing.
#endif
	VkFFTResult resFFT = VKFFT_SUCCESS;
	uint64_t initSharedMemory = app->configuration.sharedMemorySize;
	if (!app->configuration.makeForwardPlanOnly) {
		app->localFFTPlan_inverse = (VkFFTPlan*)calloc(1, sizeof(VkFFTPlan));
		if (app->localFFTPlan_inverse) {
			for (uint64_t i = 0; i < app->configuration.FFTdim; i++) {
				app->configuration.sharedMemorySize = ((app->configuration.size[i] & (app->configuration.size[i] - 1)) == 0) ? app->configuration.sharedMemorySizePow2 : initSharedMemory;
				resFFT = VkFFTScheduler(app, app->localFFTPlan_inverse, i, 0);
				if (resFFT != VKFFT_SUCCESS) {
					deleteVkFFT(app);
					return resFFT;
				}
				if (app->useBluesteinFFT[i] && (app->localFFTPlan_inverse->numAxisUploads[i] > 1)) {
					for (uint64_t j = 0; j < app->localFFTPlan_inverse->numAxisUploads[i]; j++) {
						app->localFFTPlan_inverse->inverseBluesteinAxes[i][j] = app->localFFTPlan_inverse->axes[i][j];
					}
				}
			}
			for (uint64_t i = 0; i < app->configuration.FFTdim; i++) {
				app->configuration.sharedMemorySize = ((app->configuration.size[i] & (app->configuration.size[i] - 1)) == 0) ?
app->configuration.sharedMemorySizePow2 : initSharedMemory; for (uint64_t j = 0; j < app->localFFTPlan_inverse->numAxisUploads[i]; j++) { resFFT = VkFFTPlanAxis(app, app->localFFTPlan_inverse, i, j, 1, 0); if (resFFT != VKFFT_SUCCESS) { deleteVkFFT(app); return resFFT; } } if (app->useBluesteinFFT[i] && (app->localFFTPlan_inverse->numAxisUploads[i] > 1)) { for (uint64_t j = 1; j < app->localFFTPlan_inverse->numAxisUploads[i]; j++) { resFFT = VkFFTPlanAxis(app, app->localFFTPlan_inverse, i, j, 1, 1); if (resFFT != VKFFT_SUCCESS) { deleteVkFFT(app); return resFFT; } } } if ((app->localFFTPlan_inverse->multiUploadR2C) && (i == 0)) { resFFT = VkFFTPlanR2CMultiUploadDecomposition(app, app->localFFTPlan_inverse, 1); if (resFFT != VKFFT_SUCCESS) { deleteVkFFT(app); return resFFT; } } } } else { deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } } if (!app->configuration.makeInversePlanOnly) { app->localFFTPlan = (VkFFTPlan*)calloc(1, sizeof(VkFFTPlan)); if (app->localFFTPlan) { for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { app->configuration.sharedMemorySize = ((app->configuration.size[i] & (app->configuration.size[i] - 1)) == 0) ? app->configuration.sharedMemorySizePow2 : initSharedMemory; resFFT = VkFFTScheduler(app, app->localFFTPlan, i, 0); if (resFFT != VKFFT_SUCCESS) { deleteVkFFT(app); return resFFT; } if (app->useBluesteinFFT[i] && (app->localFFTPlan->numAxisUploads[i] > 1)) { for (uint64_t j = 0; j < app->localFFTPlan->numAxisUploads[i]; j++) { app->localFFTPlan->inverseBluesteinAxes[i][j] = app->localFFTPlan->axes[i][j]; } } } for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { app->configuration.sharedMemorySize = ((app->configuration.size[i] & (app->configuration.size[i] - 1)) == 0) ? 
					app->configuration.sharedMemorySizePow2 : initSharedMemory;
				for (uint64_t j = 0; j < app->localFFTPlan->numAxisUploads[i]; j++) {
					resFFT = VkFFTPlanAxis(app, app->localFFTPlan, i, j, 0, 0);
					if (resFFT != VKFFT_SUCCESS) {
						deleteVkFFT(app);
						return resFFT;
					}
				}
				if (app->useBluesteinFFT[i] && (app->localFFTPlan->numAxisUploads[i] > 1)) {
					for (uint64_t j = 1; j < app->localFFTPlan->numAxisUploads[i]; j++) {
						resFFT = VkFFTPlanAxis(app, app->localFFTPlan, i, j, 0, 1);
						if (resFFT != VKFFT_SUCCESS) {
							deleteVkFFT(app);
							return resFFT;
						}
					}
				}
				if ((app->localFFTPlan->multiUploadR2C) && (i == 0)) {
					resFFT = VkFFTPlanR2CMultiUploadDecomposition(app, app->localFFTPlan, 0);
					if (resFFT != VKFFT_SUCCESS) {
						deleteVkFFT(app);
						return resFFT;
					}
				}
			}
		}
		else {
			deleteVkFFT(app);
			return VKFFT_ERROR_MALLOC_FAILED;
		}
	}
	for (uint64_t i = 0; i < app->configuration.FFTdim; i++) {
		if (app->useBluesteinFFT[i]) {
			if (!app->configuration.makeInversePlanOnly)
				resFFT = VkFFTGeneratePhaseVectors(app, app->localFFTPlan, i, 0);
			else
				resFFT = VkFFTGeneratePhaseVectors(app, app->localFFTPlan_inverse, i, 0);
			if (resFFT != VKFFT_SUCCESS) {
				deleteVkFFT(app);
				return resFFT;
			}
		}
	}
	if (inputLaunchConfiguration.saveApplicationToString != 0) {
#if((VKFFT_BACKEND==0)||(VKFFT_BACKEND==2))
		uint64_t totalBinarySize = 0;
		if (!app->configuration.makeForwardPlanOnly) {
			for (uint64_t i = 0; i < app->configuration.FFTdim; i++) {
				for (uint64_t j = 0; j < app->localFFTPlan_inverse->numAxisUploads[i]; j++) {
					totalBinarySize += app->localFFTPlan_inverse->axes[i][j].binarySize + sizeof(uint32_t);
				}
				if (app->useBluesteinFFT[i] && (app->localFFTPlan_inverse->numAxisUploads[i] > 1)) {
					for (uint64_t j = 1; j < app->localFFTPlan_inverse->numAxisUploads[i]; j++) {
						totalBinarySize += app->localFFTPlan_inverse->inverseBluesteinAxes[i][j].binarySize + sizeof(uint32_t);
					}
				}
				if ((app->localFFTPlan_inverse->multiUploadR2C) && (i == 0)) {
					totalBinarySize += app->localFFTPlan_inverse->R2Cdecomposition.binarySize + sizeof(uint32_t);
				}
			}
		}
		if
(!app->configuration.makeInversePlanOnly) { for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { for (uint64_t j = 0; j < app->localFFTPlan->numAxisUploads[i]; j++) { totalBinarySize += app->localFFTPlan->axes[i][j].binarySize + sizeof(uint32_t); } if (app->useBluesteinFFT[i] && (app->localFFTPlan->numAxisUploads[i] > 1)) { for (uint64_t j = 1; j < app->localFFTPlan->numAxisUploads[i]; j++) { totalBinarySize += app->localFFTPlan->inverseBluesteinAxes[i][j].binarySize + sizeof(uint32_t); } } if ((app->localFFTPlan->multiUploadR2C) && (i == 0)) { totalBinarySize += app->localFFTPlan->R2Cdecomposition.binarySize + sizeof(uint32_t); } } } for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { if (app->useBluesteinFFT[i]) { totalBinarySize += app->applicationBluesteinStringSize[i]; } } app->saveApplicationString = calloc(totalBinarySize, 1); if (!app->saveApplicationString) { deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } app->applicationStringSize = totalBinarySize; uint64_t currentPos = 0; uint32_t* localApplicationStringCast = (uint32_t*)app->saveApplicationString; if (!app->configuration.makeForwardPlanOnly) { for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { for (uint64_t j = 0; j < app->localFFTPlan_inverse->numAxisUploads[i]; j++) { localApplicationStringCast[currentPos] = (uint32_t)app->localFFTPlan_inverse->axes[i][j].binarySize; currentPos++; memcpy(localApplicationStringCast + currentPos, app->localFFTPlan_inverse->axes[i][j].binary, app->localFFTPlan_inverse->axes[i][j].binarySize); currentPos += app->localFFTPlan_inverse->axes[i][j].binarySize / sizeof(uint32_t); } if (app->useBluesteinFFT[i] && (app->localFFTPlan_inverse->numAxisUploads[i] > 1)) { for (uint64_t j = 1; j < app->localFFTPlan_inverse->numAxisUploads[i]; j++) { localApplicationStringCast[currentPos] = (uint32_t)app->localFFTPlan_inverse->inverseBluesteinAxes[i][j].binarySize; currentPos++; memcpy(localApplicationStringCast + currentPos, 
app->localFFTPlan_inverse->inverseBluesteinAxes[i][j].binary, app->localFFTPlan_inverse->inverseBluesteinAxes[i][j].binarySize); currentPos += app->localFFTPlan_inverse->inverseBluesteinAxes[i][j].binarySize / sizeof(uint32_t); } } if ((app->localFFTPlan_inverse->multiUploadR2C) && (i == 0)) { localApplicationStringCast[currentPos] = (uint32_t)app->localFFTPlan_inverse->R2Cdecomposition.binarySize; currentPos++; memcpy(localApplicationStringCast + currentPos, app->localFFTPlan_inverse->R2Cdecomposition.binary, app->localFFTPlan_inverse->R2Cdecomposition.binarySize); currentPos += app->localFFTPlan_inverse->R2Cdecomposition.binarySize / sizeof(uint32_t); } } } if (!app->configuration.makeInversePlanOnly) { for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { for (uint64_t j = 0; j < app->localFFTPlan->numAxisUploads[i]; j++) { localApplicationStringCast[currentPos] = (uint32_t)app->localFFTPlan->axes[i][j].binarySize; currentPos++; memcpy(localApplicationStringCast + currentPos, app->localFFTPlan->axes[i][j].binary, app->localFFTPlan->axes[i][j].binarySize); currentPos += app->localFFTPlan->axes[i][j].binarySize / sizeof(uint32_t); } if (app->useBluesteinFFT[i] && (app->localFFTPlan->numAxisUploads[i] > 1)) { for (uint64_t j = 1; j < app->localFFTPlan->numAxisUploads[i]; j++) { localApplicationStringCast[currentPos] = (uint32_t)app->localFFTPlan->inverseBluesteinAxes[i][j].binarySize; currentPos++; memcpy(localApplicationStringCast + currentPos, app->localFFTPlan->inverseBluesteinAxes[i][j].binary, app->localFFTPlan->inverseBluesteinAxes[i][j].binarySize); currentPos += app->localFFTPlan->inverseBluesteinAxes[i][j].binarySize / sizeof(uint32_t); } } if ((app->localFFTPlan->multiUploadR2C) && (i == 0)) { localApplicationStringCast[currentPos] = (uint32_t)app->localFFTPlan->R2Cdecomposition.binarySize; currentPos++; memcpy(localApplicationStringCast + currentPos, app->localFFTPlan->R2Cdecomposition.binary, app->localFFTPlan->R2Cdecomposition.binarySize); 
					currentPos += app->localFFTPlan->R2Cdecomposition.binarySize / sizeof(uint32_t);
				}
			}
		}
		for (uint64_t i = 0; i < app->configuration.FFTdim; i++) {
			if (app->useBluesteinFFT[i]) {
				memcpy(localApplicationStringCast + currentPos, app->applicationBluesteinString[i], app->applicationBluesteinStringSize[i]);
				currentPos += app->applicationBluesteinStringSize[i] / sizeof(uint32_t);
			}
		}
		for (uint64_t i = 0; i < app->configuration.FFTdim; i++) {
			if (app->applicationBluesteinString[i] != 0) {
				free(app->applicationBluesteinString[i]);
				app->applicationBluesteinString[i] = 0;
			}
		}
#else
		uint64_t totalBinarySize = 1;
		if (!app->configuration.makeForwardPlanOnly) {
			for (uint64_t i = 0; i < app->configuration.FFTdim; i++) {
				for (uint64_t j = 0; j < app->localFFTPlan_inverse->numAxisUploads[i]; j++) {
					totalBinarySize += app->localFFTPlan_inverse->axes[i][j].binarySize + (uint64_t)(floor(log10((double)app->localFFTPlan_inverse->axes[i][j].binarySize))) + 1;
				}
				if (app->useBluesteinFFT[i] && (app->localFFTPlan_inverse->numAxisUploads[i] > 1)) {
					for (uint64_t j = 1; j < app->localFFTPlan_inverse->numAxisUploads[i]; j++) {
						totalBinarySize += app->localFFTPlan_inverse->inverseBluesteinAxes[i][j].binarySize + (uint64_t)(floor(log10((double)app->localFFTPlan_inverse->inverseBluesteinAxes[i][j].binarySize))) + 1;
					}
				}
				if ((app->localFFTPlan_inverse->multiUploadR2C) && (i == 0)) {
					totalBinarySize += app->localFFTPlan_inverse->R2Cdecomposition.binarySize + (uint64_t)(floor(log10((double)app->localFFTPlan_inverse->R2Cdecomposition.binarySize))) + 1;
				}
			}
		}
		if (!app->configuration.makeInversePlanOnly) {
			for (uint64_t i = 0; i < app->configuration.FFTdim; i++) {
				for (uint64_t j = 0; j < app->localFFTPlan->numAxisUploads[i]; j++) {
					totalBinarySize += app->localFFTPlan->axes[i][j].binarySize + (uint64_t)(floor(log10((double)app->localFFTPlan->axes[i][j].binarySize))) + 1;
				}
				if (app->useBluesteinFFT[i] && (app->localFFTPlan->numAxisUploads[i] > 1)) {
					for (uint64_t j = 1; j <
app->localFFTPlan->numAxisUploads[i]; j++) { totalBinarySize += app->localFFTPlan->inverseBluesteinAxes[i][j].binarySize + (uint64_t)(floor(log10((double)app->localFFTPlan->inverseBluesteinAxes[i][j].binarySize))) + 1; } } if ((app->localFFTPlan->multiUploadR2C) && (i == 0)) { totalBinarySize += app->localFFTPlan->R2Cdecomposition.binarySize + (uint64_t)(floor(log10((double)app->localFFTPlan->R2Cdecomposition.binarySize))) + 1; } } } for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { if (app->useBluesteinFFT[i]) { totalBinarySize += app->applicationBluesteinStringSize[i] - 1; } } app->saveApplicationString = (char*)calloc(totalBinarySize, 1); if (!app->saveApplicationString) { deleteVkFFT(app); return VKFFT_ERROR_MALLOC_FAILED; } app->applicationStringSize = totalBinarySize; uint64_t currentPos = 0; char* localApplicationStringCast = (char*)app->saveApplicationString; if (!app->configuration.makeForwardPlanOnly) { for (uint64_t i = 0; i < app->configuration.FFTdim; i++) { for (uint64_t j = 0; j < app->localFFTPlan_inverse->numAxisUploads[i]; j++) { currentPos += sprintf(localApplicationStringCast + currentPos, "%" PRIu64 "\n", app->localFFTPlan_inverse->axes[i][j].binarySize); currentPos += sprintf(localApplicationStringCast + currentPos, "%s", (char*)app->localFFTPlan_inverse->axes[i][j].binary); } if (app->useBluesteinFFT[i] && (app->localFFTPlan_inverse->numAxisUploads[i] > 1)) { for (uint64_t j = 1; j < app->localFFTPlan_inverse->numAxisUploads[i]; j++) { currentPos += sprintf(localApplicationStringCast + currentPos, "%" PRIu64 "\n", app->localFFTPlan_inverse->inverseBluesteinAxes[i][j].binarySize); currentPos += sprintf(localApplicationStringCast + currentPos, "%s", (char*)app->localFFTPlan_inverse->inverseBluesteinAxes[i][j].binary); } } if ((app->localFFTPlan_inverse->multiUploadR2C) && (i == 0)) { currentPos += sprintf(localApplicationStringCast + currentPos, "%" PRIu64 "\n", app->localFFTPlan_inverse->R2Cdecomposition.binarySize); currentPos += 
						sprintf(localApplicationStringCast + currentPos, "%s", (char*)app->localFFTPlan_inverse->R2Cdecomposition.binary);
				}
			}
		}
		if (!app->configuration.makeInversePlanOnly) {
			for (uint64_t i = 0; i < app->configuration.FFTdim; i++) {
				for (uint64_t j = 0; j < app->localFFTPlan->numAxisUploads[i]; j++) {
					currentPos += sprintf(localApplicationStringCast + currentPos, "%" PRIu64 "\n", app->localFFTPlan->axes[i][j].binarySize);
					currentPos += sprintf(localApplicationStringCast + currentPos, "%s", (char*)app->localFFTPlan->axes[i][j].binary);
				}
				if (app->useBluesteinFFT[i] && (app->localFFTPlan->numAxisUploads[i] > 1)) {
					for (uint64_t j = 1; j < app->localFFTPlan->numAxisUploads[i]; j++) {
						currentPos += sprintf(localApplicationStringCast + currentPos, "%" PRIu64 "\n", app->localFFTPlan->inverseBluesteinAxes[i][j].binarySize);
						currentPos += sprintf(localApplicationStringCast + currentPos, "%s", (char*)app->localFFTPlan->inverseBluesteinAxes[i][j].binary);
					}
				}
				if ((app->localFFTPlan->multiUploadR2C) && (i == 0)) {
					currentPos += sprintf(localApplicationStringCast + currentPos, "%" PRIu64 "\n", app->localFFTPlan->R2Cdecomposition.binarySize);
					currentPos += sprintf(localApplicationStringCast + currentPos, "%s", (char*)app->localFFTPlan->R2Cdecomposition.binary);
				}
			}
		}
		for (uint64_t i = 0; i < app->configuration.FFTdim; i++) {
			if (app->useBluesteinFFT[i]) {
				currentPos += sprintf(localApplicationStringCast + currentPos, "%s", (char*)app->applicationBluesteinString[i]);
			}
		}
		for (uint64_t i = 0; i < app->configuration.FFTdim; i++) {
			if (app->applicationBluesteinString[i] != 0) {
				free(app->applicationBluesteinString[i]);
				app->applicationBluesteinString[i] = 0;
			}
		}
#endif
	}
#if(VKFFT_BACKEND==0)
	if (app->configuration.isCompilerInitialized) {
		glslang_finalize_process();
		app->configuration.isCompilerInitialized = 0;
	}
#endif
	return resFFT;
}
static inline VkFFTResult dispatchEnhanced(VkFFTApplication* app, VkFFTAxis* axis, uint64_t* dispatchBlock) {
	VkFFTResult resFFT = VKFFT_SUCCESS;
	uint64_t maxBlockSize[3] = {
		(uint64_t)pow(2,(uint64_t)log2(app->configuration.maxComputeWorkGroupCount[0])),
		(uint64_t)pow(2,(uint64_t)log2(app->configuration.maxComputeWorkGroupCount[1])),
		(uint64_t)pow(2,(uint64_t)log2(app->configuration.maxComputeWorkGroupCount[2]))
	};
	uint64_t blockNumber[3] = {
		(uint64_t)ceil(dispatchBlock[0] / (double)maxBlockSize[0]),
		(uint64_t)ceil(dispatchBlock[1] / (double)maxBlockSize[1]),
		(uint64_t)ceil(dispatchBlock[2] / (double)maxBlockSize[2])
	};
	if (blockNumber[0] == 0) blockNumber[0] = 1;
	if (blockNumber[1] == 0) blockNumber[1] = 1;
	if (blockNumber[2] == 0) blockNumber[2] = 1;
	if ((blockNumber[0] > 1) && (blockNumber[0] * maxBlockSize[0] != dispatchBlock[0])) {
		for (uint64_t i = app->configuration.maxComputeWorkGroupCount[0]; i > 0; i--) {
			if (dispatchBlock[0] % i == 0) {
				maxBlockSize[0] = i;
				blockNumber[0] = dispatchBlock[0] / i;
				i = 1;
			}
		}
	}
	if ((blockNumber[1] > 1) && (blockNumber[1] * maxBlockSize[1] != dispatchBlock[1])) {
		for (uint64_t i = app->configuration.maxComputeWorkGroupCount[1]; i > 0; i--) {
			if (dispatchBlock[1] % i == 0) {
				maxBlockSize[1] = i;
				blockNumber[1] = dispatchBlock[1] / i;
				i = 1;
			}
		}
	}
	if ((blockNumber[2] > 1) && (blockNumber[2] * maxBlockSize[2] != dispatchBlock[2])) {
		for (uint64_t i = app->configuration.maxComputeWorkGroupCount[2]; i > 0; i--) {
			if (dispatchBlock[2] % i == 0) {
				maxBlockSize[2] = i;
				blockNumber[2] = dispatchBlock[2] / i;
				i = 1;
			}
		}
	}
	if (app->configuration.specifyOffsetsAtLaunch) {
		axis->updatePushConstants = 1;
	}
	//printf("%" PRIu64 " %" PRIu64 " %" PRIu64 "\n", dispatchBlock[0], dispatchBlock[1], dispatchBlock[2]);
	//printf("%" PRIu64 " %" PRIu64 " %" PRIu64 "\n", blockNumber[0], blockNumber[1], blockNumber[2]);
	for (uint64_t i = 0; i < 3; i++)
		if (blockNumber[i] == 1) maxBlockSize[i] = dispatchBlock[i];
	for (uint64_t i = 0; i < blockNumber[0]; i++) {
		for (uint64_t j = 0; j < blockNumber[1]; j++) {
			for (uint64_t k = 0; k < blockNumber[2]; k++) {
				if (axis->pushConstants.workGroupShift[0] !=
					i * maxBlockSize[0]) {
					axis->pushConstants.workGroupShift[0] = i * maxBlockSize[0];
					axis->updatePushConstants = 1;
				}
				if (axis->pushConstants.workGroupShift[1] != j * maxBlockSize[1]) {
					axis->pushConstants.workGroupShift[1] = j * maxBlockSize[1];
					axis->updatePushConstants = 1;
				}
				if (axis->pushConstants.workGroupShift[2] != k * maxBlockSize[2]) {
					axis->pushConstants.workGroupShift[2] = k * maxBlockSize[2];
					axis->updatePushConstants = 1;
				}
				if (axis->updatePushConstants) {
					if (app->configuration.useUint64) {
						uint64_t pushConstID = 0;
						if (axis->specializationConstants.performWorkGroupShift[0]) {
							axis->pushConstants.dataUint64[pushConstID] = axis->pushConstants.workGroupShift[0];
							pushConstID++;
						}
						if (axis->specializationConstants.performWorkGroupShift[1]) {
							axis->pushConstants.dataUint64[pushConstID] = axis->pushConstants.workGroupShift[1];
							pushConstID++;
						}
						if (axis->specializationConstants.performWorkGroupShift[2]) {
							axis->pushConstants.dataUint64[pushConstID] = axis->pushConstants.workGroupShift[2];
							pushConstID++;
						}
						if (axis->specializationConstants.performPostCompilationInputOffset) {
							axis->pushConstants.dataUint64[pushConstID] = axis->specializationConstants.inputOffset / axis->specializationConstants.inputNumberByteSize;
							pushConstID++;
						}
						if (axis->specializationConstants.performPostCompilationOutputOffset) {
							axis->pushConstants.dataUint64[pushConstID] = axis->specializationConstants.outputOffset / axis->specializationConstants.outputNumberByteSize;
							pushConstID++;
						}
						if (axis->specializationConstants.performPostCompilationKernelOffset) {
							if (axis->specializationConstants.kernelNumberByteSize != 0)
								axis->pushConstants.dataUint64[pushConstID] = axis->specializationConstants.kernelOffset / axis->specializationConstants.kernelNumberByteSize;
							else
								axis->pushConstants.dataUint64[pushConstID] = 0;
							pushConstID++;
						}
					}
					else {
						uint64_t pushConstID = 0;
						if (axis->specializationConstants.performWorkGroupShift[0]) {
							axis->pushConstants.dataUint32[pushConstID] =
								(uint32_t)axis->pushConstants.workGroupShift[0];
							pushConstID++;
						}
						if (axis->specializationConstants.performWorkGroupShift[1]) {
							axis->pushConstants.dataUint32[pushConstID] = (uint32_t)axis->pushConstants.workGroupShift[1];
							pushConstID++;
						}
						if (axis->specializationConstants.performWorkGroupShift[2]) {
							axis->pushConstants.dataUint32[pushConstID] = (uint32_t)axis->pushConstants.workGroupShift[2];
							pushConstID++;
						}
						if (axis->specializationConstants.performPostCompilationInputOffset) {
							axis->pushConstants.dataUint32[pushConstID] = (uint32_t)(axis->specializationConstants.inputOffset / axis->specializationConstants.inputNumberByteSize);
							pushConstID++;
						}
						if (axis->specializationConstants.performPostCompilationOutputOffset) {
							axis->pushConstants.dataUint32[pushConstID] = (uint32_t)(axis->specializationConstants.outputOffset / axis->specializationConstants.outputNumberByteSize);
							pushConstID++;
						}
						if (axis->specializationConstants.performPostCompilationKernelOffset) {
							//avoid dividing by a zero kernelNumberByteSize, matching the 64-bit branch
							if (axis->specializationConstants.kernelNumberByteSize != 0)
								axis->pushConstants.dataUint32[pushConstID] = (uint32_t)(axis->specializationConstants.kernelOffset / axis->specializationConstants.kernelNumberByteSize);
							else
								axis->pushConstants.dataUint32[pushConstID] = 0;
							pushConstID++;
						}
					}
				}
#if(VKFFT_BACKEND==0)
				if (axis->pushConstants.structSize > 0) {
					if (app->configuration.useUint64) {
						vkCmdPushConstants(app->configuration.commandBuffer[0], axis->pipelineLayout, VK_SHADER_STAGE_COMPUTE_BIT, 0, (uint32_t)axis->pushConstants.structSize, axis->pushConstants.dataUint64);
					}
					else {
						vkCmdPushConstants(app->configuration.commandBuffer[0], axis->pipelineLayout, VK_SHADER_STAGE_COMPUTE_BIT, 0, (uint32_t)axis->pushConstants.structSize, axis->pushConstants.dataUint32);
					}
				}
				vkCmdDispatch(app->configuration.commandBuffer[0], (uint32_t)maxBlockSize[0], (uint32_t)maxBlockSize[1], (uint32_t)maxBlockSize[2]);
#elif(VKFFT_BACKEND==1)
				void* args[6];
				CUresult result = CUDA_SUCCESS;
				args[0] = axis->inputBuffer;
				args[1] = axis->outputBuffer;
				uint64_t args_id = 2;
				if (axis->specializationConstants.convolutionStep) {
					args[args_id] =
						app->configuration.kernel;
					args_id++;
				}
				if (axis->specializationConstants.LUT) {
					args[args_id] = &axis->bufferLUT;
					args_id++;
				}
				if (axis->specializationConstants.useBluesteinFFT && axis->specializationConstants.BluesteinConvolutionStep) {
					if (axis->specializationConstants.inverseBluestein)
						args[args_id] = &app->bufferBluesteinIFFT[axis->specializationConstants.axis_id];
					else
						args[args_id] = &app->bufferBluesteinFFT[axis->specializationConstants.axis_id];
					args_id++;
				}
				if (axis->specializationConstants.useBluesteinFFT && (axis->specializationConstants.BluesteinPreMultiplication || axis->specializationConstants.BluesteinPostMultiplication)) {
					args[args_id] = &app->bufferBluestein[axis->specializationConstants.axis_id];
					args_id++;
				}
				//args[args_id] = &axis->pushConstants;
				if (axis->updatePushConstants) {
					axis->updatePushConstants = 0;
					if (axis->pushConstants.structSize > 0) {
						if (app->configuration.useUint64) {
							result = cuMemcpyHtoD(axis->consts_addr, axis->pushConstants.dataUint64, axis->pushConstants.structSize);
						}
						else {
							result = cuMemcpyHtoD(axis->consts_addr, axis->pushConstants.dataUint32, axis->pushConstants.structSize);
						}
						if (result != CUDA_SUCCESS) {
							printf("cuMemcpyHtoD error: %d\n", result);
							return VKFFT_ERROR_FAILED_TO_COPY;
						}
					}
				}
				if (app->configuration.num_streams >= 1) {
					result = cuLaunchKernel(axis->VkFFTKernel,
						(unsigned int)maxBlockSize[0], (unsigned int)maxBlockSize[1], (unsigned int)maxBlockSize[2], // grid dim
						(unsigned int)axis->specializationConstants.localSize[0], (unsigned int)axis->specializationConstants.localSize[1], (unsigned int)axis->specializationConstants.localSize[2], // block dim
						(unsigned int)axis->specializationConstants.usedSharedMemory, app->configuration.stream[app->configuration.streamID], // shared mem and stream
						args, 0);
				}
				else {
					result = cuLaunchKernel(axis->VkFFTKernel,
						(unsigned int)maxBlockSize[0], (unsigned int)maxBlockSize[1], (unsigned int)maxBlockSize[2], // grid dim
						(unsigned int)axis->specializationConstants.localSize[0], (unsigned int)axis->specializationConstants.localSize[1], (unsigned int)axis->specializationConstants.localSize[2], // block dim
						(unsigned int)axis->specializationConstants.usedSharedMemory, 0, // shared mem and stream
						args, 0);
				}
				if (result != CUDA_SUCCESS) {
					printf("cuLaunchKernel error: %d, %" PRIu64 " %" PRIu64 " %" PRIu64 " - %" PRIu64 " %" PRIu64 " %" PRIu64 "\n", result, maxBlockSize[0], maxBlockSize[1], maxBlockSize[2], axis->specializationConstants.localSize[0], axis->specializationConstants.localSize[1], axis->specializationConstants.localSize[2]);
					return VKFFT_ERROR_FAILED_TO_LAUNCH_KERNEL;
				}
				if (app->configuration.num_streams > 1) {
					app->configuration.streamID = app->configuration.streamCounter % app->configuration.num_streams;
					if (app->configuration.streamCounter == 0) {
						cudaError_t res2 = cudaEventRecord(app->configuration.stream_event[app->configuration.streamID], app->configuration.stream[app->configuration.streamID]);
						if (res2 != cudaSuccess) return VKFFT_ERROR_FAILED_TO_EVENT_RECORD;
					}
					app->configuration.streamCounter++;
				}
#elif(VKFFT_BACKEND==2)
				hipError_t result = hipSuccess;
				void* args[6];
				args[0] = axis->inputBuffer;
				args[1] = axis->outputBuffer;
				uint64_t args_id = 2;
				if (axis->specializationConstants.convolutionStep) {
					args[args_id] = app->configuration.kernel;
					args_id++;
				}
				if (axis->specializationConstants.LUT) {
					args[args_id] = &axis->bufferLUT;
					args_id++;
				}
				if (axis->specializationConstants.useBluesteinFFT && axis->specializationConstants.BluesteinConvolutionStep) {
					if (axis->specializationConstants.inverseBluestein)
						args[args_id] = &app->bufferBluesteinIFFT[axis->specializationConstants.axis_id];
					else
						args[args_id] = &app->bufferBluesteinFFT[axis->specializationConstants.axis_id];
					args_id++;
				}
				if (axis->specializationConstants.useBluesteinFFT && (axis->specializationConstants.BluesteinPreMultiplication || axis->specializationConstants.BluesteinPostMultiplication)) {
					args[args_id] = &app->bufferBluestein[axis->specializationConstants.axis_id];
					args_id++;
				}
				//args[args_id] = &axis->pushConstants;
				if (axis->updatePushConstants) {
					axis->updatePushConstants = 0;
					if (axis->pushConstants.structSize > 0) {
						if (app->configuration.useUint64) {
							result = hipMemcpyHtoD(axis->consts_addr, axis->pushConstants.dataUint64, axis->pushConstants.structSize);
						}
						else {
							result = hipMemcpyHtoD(axis->consts_addr, axis->pushConstants.dataUint32, axis->pushConstants.structSize);
						}
						if (result != hipSuccess) {
							printf("hipMemcpyHtoD error: %d\n", result);
							return VKFFT_ERROR_FAILED_TO_COPY;
						}
					}
				}
				//printf("%" PRIu64 " %" PRIu64 " %" PRIu64 " %" PRIu64 " %" PRIu64 " %" PRIu64 "\n",maxBlockSize[0], maxBlockSize[1], maxBlockSize[2], axis->specializationConstants.localSize[0], axis->specializationConstants.localSize[1], axis->specializationConstants.localSize[2]);
				if (app->configuration.num_streams >= 1) {
					result = hipModuleLaunchKernel(axis->VkFFTKernel,
						(unsigned int)maxBlockSize[0], (unsigned int)maxBlockSize[1], (unsigned int)maxBlockSize[2], // grid dim
						(unsigned int)axis->specializationConstants.localSize[0], (unsigned int)axis->specializationConstants.localSize[1], (unsigned int)axis->specializationConstants.localSize[2], // block dim
						(unsigned int)axis->specializationConstants.usedSharedMemory, app->configuration.stream[app->configuration.streamID], // shared mem and stream
						args, 0);
				}
				else {
					result = hipModuleLaunchKernel(axis->VkFFTKernel,
						(unsigned int)maxBlockSize[0], (unsigned int)maxBlockSize[1], (unsigned int)maxBlockSize[2], // grid dim
						(unsigned int)axis->specializationConstants.localSize[0], (unsigned int)axis->specializationConstants.localSize[1], (unsigned int)axis->specializationConstants.localSize[2], // block dim
						(unsigned int)axis->specializationConstants.usedSharedMemory, 0, // shared mem and stream
						args, 0);
				}
				if (result != hipSuccess) {
					printf("hipModuleLaunchKernel error: %d, %" PRIu64 " %" PRIu64 " %" PRIu64 " - %" PRIu64 " %" PRIu64 " %" PRIu64 "\n", result, maxBlockSize[0], maxBlockSize[1], maxBlockSize[2], axis->specializationConstants.localSize[0], axis->specializationConstants.localSize[1], axis->specializationConstants.localSize[2]);
					return VKFFT_ERROR_FAILED_TO_LAUNCH_KERNEL;
				}
				if (app->configuration.num_streams > 1) {
					app->configuration.streamID = app->configuration.streamCounter % app->configuration.num_streams;
					if (app->configuration.streamCounter == 0) {
						result = hipEventRecord(app->configuration.stream_event[app->configuration.streamID], app->configuration.stream[app->configuration.streamID]);
						if (result != hipSuccess) return VKFFT_ERROR_FAILED_TO_EVENT_RECORD;
					}
					app->configuration.streamCounter++;
				}
#elif(VKFFT_BACKEND==3)
				cl_int result = CL_SUCCESS;
				void* args[6];
				args[0] = axis->inputBuffer;
				result = clSetKernelArg(axis->kernel, 0, sizeof(cl_mem), args[0]);
				if (result != CL_SUCCESS) {
					return VKFFT_ERROR_FAILED_TO_SET_KERNEL_ARG;
				}
				args[1] = axis->outputBuffer;
				result = clSetKernelArg(axis->kernel, 1, sizeof(cl_mem), args[1]);
				if (result != CL_SUCCESS) {
					return VKFFT_ERROR_FAILED_TO_SET_KERNEL_ARG;
				}
				uint64_t args_id = 2;
				if (axis->specializationConstants.convolutionStep) {
					args[args_id] = app->configuration.kernel;
					result = clSetKernelArg(axis->kernel, (cl_uint)args_id, sizeof(cl_mem), args[args_id]);
					if (result != CL_SUCCESS) {
						return VKFFT_ERROR_FAILED_TO_SET_KERNEL_ARG;
					}
					args_id++;
				}
				if (axis->specializationConstants.LUT) {
					args[args_id] = &axis->bufferLUT;
					result = clSetKernelArg(axis->kernel, (cl_uint)args_id, sizeof(cl_mem), args[args_id]);
					if (result != CL_SUCCESS) {
						return VKFFT_ERROR_FAILED_TO_SET_KERNEL_ARG;
					}
					args_id++;
				}
				if (axis->specializationConstants.useBluesteinFFT && axis->specializationConstants.BluesteinConvolutionStep) {
					if (axis->specializationConstants.inverseBluestein)
						args[args_id] = &app->bufferBluesteinIFFT[axis->specializationConstants.axis_id];
					else
						args[args_id] = &app->bufferBluesteinFFT[axis->specializationConstants.axis_id];
					result =
						clSetKernelArg(axis->kernel, (cl_uint)args_id, sizeof(cl_mem), args[args_id]);
					if (result != CL_SUCCESS) {
						return VKFFT_ERROR_FAILED_TO_SET_KERNEL_ARG;
					}
					args_id++;
				}
				if (axis->specializationConstants.useBluesteinFFT && (axis->specializationConstants.BluesteinPreMultiplication || axis->specializationConstants.BluesteinPostMultiplication)) {
					args[args_id] = &app->bufferBluestein[axis->specializationConstants.axis_id];
					result = clSetKernelArg(axis->kernel, (cl_uint)args_id, sizeof(cl_mem), args[args_id]);
					if (result != CL_SUCCESS) {
						return VKFFT_ERROR_FAILED_TO_SET_KERNEL_ARG;
					}
					args_id++;
				}
				if (axis->pushConstants.structSize > 0) {
					if (app->configuration.useUint64) {
						result = clSetKernelArg(axis->kernel, (cl_uint)args_id, axis->pushConstants.structSize, axis->pushConstants.dataUint64);
					}
					else {
						result = clSetKernelArg(axis->kernel, (cl_uint)args_id, axis->pushConstants.structSize, axis->pushConstants.dataUint32);
					}
					if (result != CL_SUCCESS) {
						return VKFFT_ERROR_FAILED_TO_SET_KERNEL_ARG;
					}
					args_id++;
				}
				size_t local_work_size[3] = { (size_t)axis->specializationConstants.localSize[0], (size_t)axis->specializationConstants.localSize[1], (size_t)axis->specializationConstants.localSize[2] };
				size_t global_work_size[3] = { (size_t)maxBlockSize[0] * local_work_size[0], (size_t)maxBlockSize[1] * local_work_size[1], (size_t)maxBlockSize[2] * local_work_size[2] };
				result = clEnqueueNDRangeKernel(app->configuration.commandQueue[0], axis->kernel, 3, 0, global_work_size, local_work_size, 0, 0, 0);
				//printf("%" PRIu64 " %" PRIu64 " %" PRIu64 " - %" PRIu64 " %" PRIu64 " %" PRIu64 "\n", maxBlockSize[0], maxBlockSize[1], maxBlockSize[2], axis->specializationConstants.localSize[0], axis->specializationConstants.localSize[1], axis->specializationConstants.localSize[2]);
				if (result != CL_SUCCESS) {
					return VKFFT_ERROR_FAILED_TO_LAUNCH_KERNEL;
				}
#endif
			}
		}
	}
	return resFFT;
}
static inline VkFFTResult VkFFTSync(VkFFTApplication* app) {
#if(VKFFT_BACKEND==0)
	vkCmdPipelineBarrier(app->configuration.commandBuffer[0], VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, 0, 1, app->configuration.memory_barrier, 0, 0, 0, 0);
#elif(VKFFT_BACKEND==1)
	if (app->configuration.num_streams > 1) {
		cudaError_t res = cudaSuccess;
		for (uint64_t s = 0; s < app->configuration.num_streams; s++) {
			res = cudaEventSynchronize(app->configuration.stream_event[s]);
			if (res != cudaSuccess) return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE;
		}
		app->configuration.streamCounter = 0;
	}
#elif(VKFFT_BACKEND==2)
	if (app->configuration.num_streams > 1) {
		hipError_t res = hipSuccess;
		for (uint64_t s = 0; s < app->configuration.num_streams; s++) {
			res = hipEventSynchronize(app->configuration.stream_event[s]);
			if (res != hipSuccess) return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE;
		}
		app->configuration.streamCounter = 0;
	}
#elif(VKFFT_BACKEND==3)
#endif
	return VKFFT_SUCCESS;
}
static inline void printDebugInformation(VkFFTApplication* app, VkFFTAxis* axis) {
	if (app->configuration.keepShaderCode) printf("%s\n", axis->specializationConstants.code0);
	if (app->configuration.printMemoryLayout) {
		if ((axis->inputBuffer == app->configuration.inputBuffer) && (app->configuration.inputBuffer != app->configuration.buffer))
			printf("read: inputBuffer\n");
		if (axis->inputBuffer == app->configuration.buffer)
			printf("read: buffer\n");
		if (axis->inputBuffer == app->configuration.tempBuffer)
			printf("read: tempBuffer\n");
		if ((axis->inputBuffer == app->configuration.outputBuffer) && (app->configuration.outputBuffer != app->configuration.buffer))
			printf("read: outputBuffer\n");
		if ((axis->outputBuffer == app->configuration.inputBuffer) && (app->configuration.inputBuffer != app->configuration.buffer))
			printf("write: inputBuffer\n");
		if (axis->outputBuffer == app->configuration.buffer)
			printf("write: buffer\n");
		if (axis->outputBuffer == app->configuration.tempBuffer)
			printf("write: tempBuffer\n");
		if ((axis->outputBuffer == app->configuration.outputBuffer) &&
			(app->configuration.outputBuffer != app->configuration.buffer))
			printf("write: outputBuffer\n");
	}
}
static inline VkFFTResult VkFFTAppend(VkFFTApplication* app, int inverse, VkFFTLaunchParams* launchParams) {
	VkFFTResult resFFT = VKFFT_SUCCESS;
#if(VKFFT_BACKEND==0)
	app->configuration.commandBuffer = launchParams->commandBuffer;
	VkMemoryBarrier memory_barrier = {
			VK_STRUCTURE_TYPE_MEMORY_BARRIER,
			0,
			VK_ACCESS_SHADER_WRITE_BIT,
			VK_ACCESS_SHADER_READ_BIT,
	};
	app->configuration.memory_barrier = &memory_barrier;
#elif(VKFFT_BACKEND==1)
	app->configuration.streamCounter = 0;
#elif(VKFFT_BACKEND==2)
	app->configuration.streamCounter = 0;
#elif(VKFFT_BACKEND==3)
	app->configuration.commandQueue = launchParams->commandQueue;
#endif
	uint64_t localSize0[3];
	if ((inverse != 1) && (app->configuration.makeInversePlanOnly)) return VKFFT_ERROR_ONLY_INVERSE_FFT_INITIALIZED;
	if ((inverse == 1) && (app->configuration.makeForwardPlanOnly)) return VKFFT_ERROR_ONLY_FORWARD_FFT_INITIALIZED;
	if ((inverse != 1) && (!app->configuration.makeInversePlanOnly) && (!app->localFFTPlan)) return VKFFT_ERROR_PLAN_NOT_INITIALIZED;
	if ((inverse == 1) && (!app->configuration.makeForwardPlanOnly) && (!app->localFFTPlan_inverse)) return VKFFT_ERROR_PLAN_NOT_INITIALIZED;
	if (inverse == 1) {
		localSize0[0] = app->localFFTPlan_inverse->actualFFTSizePerAxis[0][0];
		localSize0[1] = app->localFFTPlan_inverse->actualFFTSizePerAxis[1][0];
		localSize0[2] = app->localFFTPlan_inverse->actualFFTSizePerAxis[2][0];
	}
	else {
		localSize0[0] = app->localFFTPlan->actualFFTSizePerAxis[0][0];
		localSize0[1] = app->localFFTPlan->actualFFTSizePerAxis[1][0];
		localSize0[2] = app->localFFTPlan->actualFFTSizePerAxis[2][0];
	}
	resFFT = VkFFTCheckUpdateBufferSet(app, 0, 0, launchParams);
	if (resFFT != VKFFT_SUCCESS) {
		return resFFT;
	}
	if (inverse != 1) {
		//FFT axis 0
		if (!app->configuration.omitDimension[0]) {
			for (int64_t l = (int64_t)app->localFFTPlan->numAxisUploads[0] - 1; l >= 0; l--) {
				VkFFTAxis* axis =
&app->localFFTPlan->axes[0][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan, axis, 0, l, 0); if (resFFT != VKFFT_SUCCESS) return resFFT; uint64_t maxCoordinate = ((app->configuration.matrixConvolution > 1) && (app->configuration.performConvolution) && (app->configuration.FFTdim == 1)) ? 1 : app->configuration.coordinateFeatures; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; if (l == 0) { if (app->localFFTPlan->numAxisUploads[0] > 2) { dispatchBlock[0] = (uint64_t)ceil((uint64_t)ceil(app->localFFTPlan->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim / (double)axis->axisBlock[1]) / (double)app->localFFTPlan->axisSplit[0][1]) * app->localFFTPlan->axisSplit[0][1]; dispatchBlock[1] = app->localFFTPlan->actualFFTSizePerAxis[0][1]; } else { if (app->localFFTPlan->numAxisUploads[0] > 1) { dispatchBlock[0] = (uint64_t)ceil((uint64_t)ceil(app->localFFTPlan->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim / (double)axis->axisBlock[1])); dispatchBlock[1] = app->localFFTPlan->actualFFTSizePerAxis[0][1]; } else { dispatchBlock[0] = app->localFFTPlan->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim; dispatchBlock[1] = (uint64_t)ceil(app->localFFTPlan->actualFFTSizePerAxis[0][1] / (double)axis->axisBlock[1]); } } } else { dispatchBlock[0] = (uint64_t)ceil(app->localFFTPlan->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim / (double)axis->axisBlock[0]); dispatchBlock[1] = app->localFFTPlan->actualFFTSizePerAxis[0][1]; } dispatchBlock[2] = app->localFFTPlan->actualFFTSizePerAxis[0][2] * maxCoordinate * app->configuration.numberBatches; if (axis->specializationConstants.mergeSequencesR2C == 1) dispatchBlock[1] = (uint64_t)ceil(dispatchBlock[1] / 
2.0); //if (app->configuration.performZeropadding[1]) dispatchBlock[1] = (uint64_t)ceil(dispatchBlock[1] / 2.0); //if (app->configuration.performZeropadding[2]) dispatchBlock[2] = (uint64_t)ceil(dispatchBlock[2] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } if (app->useBluesteinFFT[0] && (app->localFFTPlan->numAxisUploads[0] > 1)) { for (int64_t l = 1; l < (int64_t)app->localFFTPlan->numAxisUploads[0]; l++) { VkFFTAxis* axis = &app->localFFTPlan->inverseBluesteinAxes[0][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan, axis, 0, l, 0); if (resFFT != VKFFT_SUCCESS) return resFFT; uint64_t maxCoordinate = ((app->configuration.matrixConvolution > 1) && (app->configuration.performConvolution) && (app->configuration.FFTdim == 1)) ? 1 : app->configuration.coordinateFeatures; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; if (l == 0) { if (app->localFFTPlan->numAxisUploads[0] > 2) { dispatchBlock[0] = (uint64_t)ceil((uint64_t)ceil(app->localFFTPlan->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim / (double)axis->axisBlock[1]) / (double)app->localFFTPlan->axisSplit[0][1]) * app->localFFTPlan->axisSplit[0][1]; dispatchBlock[1] = app->localFFTPlan->actualFFTSizePerAxis[0][1]; } else { if (app->localFFTPlan->numAxisUploads[0] > 1) { dispatchBlock[0] = (uint64_t)ceil((uint64_t)ceil(app->localFFTPlan->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim / (double)axis->axisBlock[1])); dispatchBlock[1] = app->localFFTPlan->actualFFTSizePerAxis[0][1]; } else { dispatchBlock[0] = app->localFFTPlan->actualFFTSizePerAxis[0][0] 
/ axis->specializationConstants.fftDim; dispatchBlock[1] = (uint64_t)ceil(app->localFFTPlan->actualFFTSizePerAxis[0][1] / (double)axis->axisBlock[1]); } } } else { dispatchBlock[0] = (uint64_t)ceil(app->localFFTPlan->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim / (double)axis->axisBlock[0]); dispatchBlock[1] = app->localFFTPlan->actualFFTSizePerAxis[0][1]; } dispatchBlock[2] = app->localFFTPlan->actualFFTSizePerAxis[0][2] * maxCoordinate * app->configuration.numberBatches; if (axis->specializationConstants.mergeSequencesR2C == 1) dispatchBlock[1] = (uint64_t)ceil(dispatchBlock[1] / 2.0); //if (app->configuration.performZeropadding[1]) dispatchBlock[1] = (uint64_t)ceil(dispatchBlock[1] / 2.0); //if (app->configuration.performZeropadding[2]) dispatchBlock[2] = (uint64_t)ceil(dispatchBlock[2] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } } if (app->localFFTPlan->multiUploadR2C) { VkFFTAxis* axis = &app->localFFTPlan->R2Cdecomposition; resFFT = VkFFTUpdateBufferSetR2CMultiUploadDecomposition(app, app->localFFTPlan, axis, 0, 0, 0); if (resFFT != VKFFT_SUCCESS) return resFFT; uint64_t maxCoordinate = ((app->configuration.matrixConvolution > 1) && (app->configuration.performConvolution) && (app->configuration.FFTdim == 1)) ? 
					1 : app->configuration.coordinateFeatures;
#if(VKFFT_BACKEND==0)
				vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline);
				vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0);
#endif
				uint64_t dispatchBlock[3];
				dispatchBlock[0] = (uint64_t)ceil(((app->configuration.size[0] / 2 + 1) * app->configuration.size[1] * app->configuration.size[2]) / (double)(2 * axis->axisBlock[0]));
				dispatchBlock[1] = 1;
				dispatchBlock[2] = maxCoordinate * app->configuration.numberBatches;
				resFFT = dispatchEnhanced(app, axis, dispatchBlock);
				if (resFFT != VKFFT_SUCCESS) return resFFT;
				printDebugInformation(app, axis);
				resFFT = VkFFTSync(app);
				if (resFFT != VKFFT_SUCCESS) return resFFT;
				//app->configuration.size[0] *= 2;
			}
		}
		if (app->configuration.FFTdim > 1) {
			//FFT axis 1
			if (!app->configuration.omitDimension[1]) {
				if ((app->configuration.FFTdim == 2) && (app->configuration.performConvolution)) {
					for (int64_t l = (int64_t)app->localFFTPlan->numAxisUploads[1] - 1; l >= 0; l--) {
						VkFFTAxis* axis = &app->localFFTPlan->axes[1][l];
						resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan, axis, 1, l, 0);
						if (resFFT != VKFFT_SUCCESS) return resFFT;
						uint64_t maxCoordinate = ((app->configuration.matrixConvolution > 1) && (l == 0)) ?
1 : app->configuration.coordinateFeatures; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; dispatchBlock[0] = (uint64_t)ceil(localSize0[1] / (double)axis->axisBlock[0] * app->localFFTPlan->actualFFTSizePerAxis[1][1] / (double)axis->specializationConstants.fftDim); dispatchBlock[1] = 1; dispatchBlock[2] = app->localFFTPlan->actualFFTSizePerAxis[1][2] * maxCoordinate * app->configuration.numberBatches; //if (app->configuration.mergeSequencesR2C == 1) dispatchBlock[0] = (uint64_t)ceil(dispatchBlock[0] / 2.0); //if (app->configuration.performZeropadding[2]) dispatchBlock[2] = (uint64_t)ceil(dispatchBlock[2] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } } else { for (int64_t l = (int64_t)app->localFFTPlan->numAxisUploads[1] - 1; l >= 0; l--) { VkFFTAxis* axis = &app->localFFTPlan->axes[1][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan, axis, 1, l, 0); if (resFFT != VKFFT_SUCCESS) return resFFT; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; dispatchBlock[0] = (uint64_t)ceil(localSize0[1] / (double)axis->axisBlock[0] * app->localFFTPlan->actualFFTSizePerAxis[1][1] / (double)axis->specializationConstants.fftDim); dispatchBlock[1] = 1; dispatchBlock[2] = app->localFFTPlan->actualFFTSizePerAxis[1][2] * app->configuration.coordinateFeatures * app->configuration.numberBatches; //if 
(app->configuration.mergeSequencesR2C == 1) dispatchBlock[0] = (uint64_t)ceil(dispatchBlock[0] / 2.0); //if (app->configuration.performZeropadding[2]) dispatchBlock[2] = (uint64_t)ceil(dispatchBlock[2] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } if (app->useBluesteinFFT[1] && (app->localFFTPlan->numAxisUploads[1] > 1)) { for (int64_t l = 1; l < (int64_t)app->localFFTPlan->numAxisUploads[1]; l++) { VkFFTAxis* axis = &app->localFFTPlan->inverseBluesteinAxes[1][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan, axis, 1, l, 0); if (resFFT != VKFFT_SUCCESS) return resFFT; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; dispatchBlock[0] = (uint64_t)ceil(localSize0[1] / (double)axis->axisBlock[0] * app->localFFTPlan->actualFFTSizePerAxis[1][1] / (double)axis->specializationConstants.fftDim); dispatchBlock[1] = 1; dispatchBlock[2] = app->localFFTPlan->actualFFTSizePerAxis[1][2] * app->configuration.coordinateFeatures * app->configuration.numberBatches; //if (app->configuration.performZeropadding[1]) dispatchBlock[1] = (uint64_t)ceil(dispatchBlock[1] / 2.0); //if (app->configuration.performZeropadding[2]) dispatchBlock[2] = (uint64_t)ceil(dispatchBlock[2] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } } } } } //FFT axis 2 if (app->configuration.FFTdim > 2) { if (!app->configuration.omitDimension[2]) { if ((app->configuration.FFTdim == 3) && (app->configuration.performConvolution)) { for 
(int64_t l = (int64_t)app->localFFTPlan->numAxisUploads[2] - 1; l >= 0; l--) { VkFFTAxis* axis = &app->localFFTPlan->axes[2][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan, axis, 2, l, 0); if (resFFT != VKFFT_SUCCESS) return resFFT; uint64_t maxCoordinate = ((app->configuration.matrixConvolution > 1) && (l == 0)) ? 1 : app->configuration.coordinateFeatures; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; dispatchBlock[0] = (uint64_t)ceil(localSize0[2] / (double)axis->axisBlock[0] * app->localFFTPlan->actualFFTSizePerAxis[2][2] / (double)axis->specializationConstants.fftDim); dispatchBlock[1] = 1; dispatchBlock[2] = app->localFFTPlan->actualFFTSizePerAxis[2][1] * maxCoordinate * app->configuration.numberBatches; //if (app->configuration.mergeSequencesR2C == 1) dispatchBlock[0] = (uint64_t)ceil(dispatchBlock[0] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } } else { for (int64_t l = (int64_t)app->localFFTPlan->numAxisUploads[2] - 1; l >= 0; l--) { VkFFTAxis* axis = &app->localFFTPlan->axes[2][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan, axis, 2, l, 0); if (resFFT != VKFFT_SUCCESS) return resFFT; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; dispatchBlock[0] = (uint64_t)ceil(localSize0[2] / (double)axis->axisBlock[0] * app->localFFTPlan->actualFFTSizePerAxis[2][2] / 
(double)axis->specializationConstants.fftDim); dispatchBlock[1] = 1; dispatchBlock[2] = app->localFFTPlan->actualFFTSizePerAxis[2][1] * app->configuration.coordinateFeatures * app->configuration.numberBatches; //if (app->configuration.mergeSequencesR2C == 1) dispatchBlock[0] = (uint64_t)ceil(dispatchBlock[0] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } if (app->useBluesteinFFT[2] && (app->localFFTPlan->numAxisUploads[2] > 1)) { for (int64_t l = 1; l < (int64_t)app->localFFTPlan->numAxisUploads[2]; l++) { VkFFTAxis* axis = &app->localFFTPlan->inverseBluesteinAxes[2][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan, axis, 2, l, 0); if (resFFT != VKFFT_SUCCESS) return resFFT; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; dispatchBlock[0] = (uint64_t)ceil(localSize0[2] / (double)axis->axisBlock[0] * app->localFFTPlan->actualFFTSizePerAxis[2][2] / (double)axis->specializationConstants.fftDim); dispatchBlock[1] = 1; dispatchBlock[2] = app->localFFTPlan->actualFFTSizePerAxis[2][1] * app->configuration.coordinateFeatures * app->configuration.numberBatches; //if (app->configuration.performZeropadding[1]) dispatchBlock[1] = (uint64_t)ceil(dispatchBlock[1] / 2.0); //if (app->configuration.performZeropadding[2]) dispatchBlock[2] = (uint64_t)ceil(dispatchBlock[2] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } } } } } } if (app->configuration.performConvolution) { if 
(app->configuration.FFTdim > 2) { //multiple upload ifft leftovers if (app->configuration.FFTdim == 3) { for (int64_t l = (int64_t)1; l < (int64_t)app->localFFTPlan_inverse->numAxisUploads[2]; l++) { VkFFTAxis* axis = &app->localFFTPlan_inverse->axes[2][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan_inverse, axis, 2, l, 1); if (resFFT != VKFFT_SUCCESS) return resFFT; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; dispatchBlock[0] = (uint64_t)ceil(localSize0[2] / (double)axis->axisBlock[0] * app->localFFTPlan_inverse->actualFFTSizePerAxis[2][2] / (double)axis->specializationConstants.fftDim); dispatchBlock[1] = 1; dispatchBlock[2] = app->localFFTPlan_inverse->actualFFTSizePerAxis[2][1] * app->configuration.coordinateFeatures * app->configuration.numberKernels; //if (app->configuration.mergeSequencesR2C == 1) dispatchBlock[0] = (uint64_t)ceil(dispatchBlock[0] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } } for (int64_t l = 0; l < (int64_t)app->localFFTPlan_inverse->numAxisUploads[1]; l++) { VkFFTAxis* axis = &app->localFFTPlan_inverse->axes[1][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan_inverse, axis, 1, l, 1); if (resFFT != VKFFT_SUCCESS) return resFFT; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; dispatchBlock[0] = (uint64_t)ceil(localSize0[2] / 
(double)axis->axisBlock[0] * app->localFFTPlan_inverse->actualFFTSizePerAxis[1][1] / (double)axis->specializationConstants.fftDim); dispatchBlock[1] = 1; dispatchBlock[2] = app->localFFTPlan_inverse->actualFFTSizePerAxis[1][2] * app->configuration.coordinateFeatures * app->configuration.numberKernels; //if (app->configuration.mergeSequencesR2C == 1) dispatchBlock[0] = (uint64_t)ceil(dispatchBlock[0] / 2.0); //if (app->configuration.performZeropadding[2]) dispatchBlock[2] = (uint64_t)ceil(dispatchBlock[2] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } } if (app->configuration.FFTdim > 1) { if (app->configuration.FFTdim == 2) { for (int64_t l = (int64_t)1; l < (int64_t)app->localFFTPlan_inverse->numAxisUploads[1]; l++) { VkFFTAxis* axis = &app->localFFTPlan_inverse->axes[1][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan_inverse, axis, 1, l, 1); if (resFFT != VKFFT_SUCCESS) return resFFT; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; dispatchBlock[0] = (uint64_t)ceil(localSize0[1] / (double)axis->axisBlock[0] * app->localFFTPlan_inverse->actualFFTSizePerAxis[1][1] / (double)axis->specializationConstants.fftDim); dispatchBlock[1] = 1; dispatchBlock[2] = app->localFFTPlan_inverse->actualFFTSizePerAxis[1][2] * app->configuration.coordinateFeatures * app->configuration.numberKernels; //if (app->configuration.mergeSequencesR2C == 1) dispatchBlock[0] = (uint64_t)ceil(dispatchBlock[0] / 2.0); //if (app->configuration.performZeropadding[2]) dispatchBlock[2] = (uint64_t)ceil(dispatchBlock[2] / 2.0); resFFT = dispatchEnhanced(app, axis, 
dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } } for (int64_t l = 0; l < (int64_t)app->localFFTPlan_inverse->numAxisUploads[0]; l++) { VkFFTAxis* axis = &app->localFFTPlan_inverse->axes[0][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan_inverse, axis, 0, l, 1); if (resFFT != VKFFT_SUCCESS) return resFFT; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; if (l == 0) { if (app->localFFTPlan_inverse->numAxisUploads[0] > 2) { dispatchBlock[0] = (uint64_t)ceil((uint64_t)ceil(app->localFFTPlan_inverse->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim / (double)axis->axisBlock[1]) / (double)app->localFFTPlan_inverse->axisSplit[0][1]) * app->localFFTPlan_inverse->axisSplit[0][1]; dispatchBlock[1] = app->localFFTPlan_inverse->actualFFTSizePerAxis[0][1]; } else { if (app->localFFTPlan_inverse->numAxisUploads[0] > 1) { dispatchBlock[0] = (uint64_t)ceil((uint64_t)ceil(app->localFFTPlan_inverse->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim / (double)axis->axisBlock[1])); dispatchBlock[1] = app->localFFTPlan_inverse->actualFFTSizePerAxis[0][1]; } else { dispatchBlock[0] = app->localFFTPlan_inverse->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim; dispatchBlock[1] = (uint64_t)ceil(app->localFFTPlan_inverse->actualFFTSizePerAxis[0][1] / (double)axis->axisBlock[1]); } } } else { dispatchBlock[0] = (uint64_t)ceil(app->localFFTPlan_inverse->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim / (double)axis->axisBlock[0]); dispatchBlock[1] = app->localFFTPlan_inverse->actualFFTSizePerAxis[0][1]; } dispatchBlock[2] = 
app->localFFTPlan_inverse->actualFFTSizePerAxis[0][2] * app->configuration.coordinateFeatures * app->configuration.numberKernels; if (axis->specializationConstants.mergeSequencesR2C == 1) dispatchBlock[1] = (uint64_t)ceil(dispatchBlock[1] / 2.0); //if (app->configuration.performZeropadding[1]) dispatchBlock[1] = (uint64_t)ceil(dispatchBlock[1] / 2.0); //if (app->configuration.performZeropadding[2]) dispatchBlock[2] = (uint64_t)ceil(dispatchBlock[2] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } } if (app->configuration.FFTdim == 1) { for (int64_t l = (int64_t)1; l < (int64_t)app->localFFTPlan_inverse->numAxisUploads[0]; l++) { VkFFTAxis* axis = &app->localFFTPlan_inverse->axes[0][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan_inverse, axis, 0, l, 1); if (resFFT != VKFFT_SUCCESS) return resFFT; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; dispatchBlock[0] = (uint64_t)ceil(app->localFFTPlan_inverse->actualFFTSizePerAxis[0][0] / (double)axis->axisBlock[0] * app->localFFTPlan_inverse->actualFFTSizePerAxis[0][1] / (double)axis->specializationConstants.fftDim); dispatchBlock[1] = 1; dispatchBlock[2] = app->localFFTPlan_inverse->actualFFTSizePerAxis[0][2] * app->configuration.coordinateFeatures * app->configuration.numberKernels; //if (app->configuration.mergeSequencesR2C == 1) dispatchBlock[0] = (uint64_t)ceil(dispatchBlock[0] / 2.0); //if (app->configuration.performZeropadding[2]) dispatchBlock[2] = (uint64_t)ceil(dispatchBlock[2] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; 
printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } } } if (inverse == 1) { //we start from axis 2 and go back to axis 0 //FFT axis 2 if (app->configuration.FFTdim > 2) { if (!app->configuration.omitDimension[2]) { for (int64_t l = (int64_t)app->localFFTPlan_inverse->numAxisUploads[2] - 1; l >= 0; l--) { //if ((!app->configuration.reorderFourStep) && (!app->useBluesteinFFT[2])) l = app->localFFTPlan_inverse->numAxisUploads[2] - 1 - l; VkFFTAxis* axis = &app->localFFTPlan_inverse->axes[2][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan_inverse, axis, 2, l, 1); if (resFFT != VKFFT_SUCCESS) return resFFT; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; dispatchBlock[0] = (uint64_t)ceil(localSize0[2] / (double)axis->axisBlock[0] * app->localFFTPlan_inverse->actualFFTSizePerAxis[2][2] / (double)axis->specializationConstants.fftDim); dispatchBlock[1] = 1; dispatchBlock[2] = app->localFFTPlan_inverse->actualFFTSizePerAxis[2][1] * app->configuration.coordinateFeatures * app->configuration.numberBatches; //if (app->configuration.performZeropaddingInverse[0]) dispatchBlock[0] = (uint64_t)ceil(dispatchBlock[0] / 2.0); //if (app->configuration.performZeropaddingInverse[1]) dispatchBlock[1] = (uint64_t)ceil(dispatchBlock[1] / 2.0); //if (app->configuration.mergeSequencesR2C == 1) dispatchBlock[0] = (uint64_t)ceil(dispatchBlock[0] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; //if ((!app->configuration.reorderFourStep) && (!app->useBluesteinFFT[2])) l = app->localFFTPlan_inverse->numAxisUploads[2] - 1 
- l; } if (app->useBluesteinFFT[2] && (app->localFFTPlan_inverse->numAxisUploads[2] > 1)) { for (int64_t l = 1; l < (int64_t)app->localFFTPlan_inverse->numAxisUploads[2]; l++) { VkFFTAxis* axis = &app->localFFTPlan_inverse->inverseBluesteinAxes[2][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan_inverse, axis, 2, l, 1); if (resFFT != VKFFT_SUCCESS) return resFFT; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; dispatchBlock[0] = (uint64_t)ceil(localSize0[2] / (double)axis->axisBlock[0] * app->localFFTPlan_inverse->actualFFTSizePerAxis[2][2] / (double)axis->specializationConstants.fftDim); dispatchBlock[1] = 1; dispatchBlock[2] = app->localFFTPlan_inverse->actualFFTSizePerAxis[2][1] * app->configuration.coordinateFeatures * app->configuration.numberBatches; //if (app->configuration.performZeropadding[1]) dispatchBlock[1] = (uint64_t)ceil(dispatchBlock[1] / 2.0); //if (app->configuration.performZeropadding[2]) dispatchBlock[2] = (uint64_t)ceil(dispatchBlock[2] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } } } } if (app->configuration.FFTdim > 1) { //FFT axis 1 if (!app->configuration.omitDimension[1]) { for (int64_t l = (int64_t)app->localFFTPlan_inverse->numAxisUploads[1] - 1; l >= 0; l--) { //if ((!app->configuration.reorderFourStep) && (!app->useBluesteinFFT[1])) l = app->localFFTPlan_inverse->numAxisUploads[1] - 1 - l; VkFFTAxis* axis = &app->localFFTPlan_inverse->axes[1][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan_inverse, axis, 1, l, 1); if (resFFT != VKFFT_SUCCESS) return resFFT; #if(VKFFT_BACKEND==0) 
vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; dispatchBlock[0] = (uint64_t)ceil(localSize0[1] / (double)axis->axisBlock[0] * app->localFFTPlan_inverse->actualFFTSizePerAxis[1][1] / (double)axis->specializationConstants.fftDim); dispatchBlock[1] = 1; dispatchBlock[2] = app->localFFTPlan_inverse->actualFFTSizePerAxis[1][2] * app->configuration.coordinateFeatures * app->configuration.numberBatches; //if (app->configuration.mergeSequencesR2C == 1) dispatchBlock[0] = (uint64_t)ceil(dispatchBlock[0] / 2.0); //if (app->configuration.performZeropadding[2]) dispatchBlock[2] = (uint64_t)ceil(dispatchBlock[2] / 2.0); //if (app->configuration.performZeropaddingInverse[0]) dispatchBlock[0] = (uint64_t)ceil(dispatchBlock[0] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); //if ((!app->configuration.reorderFourStep) && (!app->useBluesteinFFT[1])) l = app->localFFTPlan_inverse->numAxisUploads[1] - 1 - l; resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } if (app->useBluesteinFFT[1] && (app->localFFTPlan_inverse->numAxisUploads[1] > 1)) { for (int64_t l = 1; l < (int64_t)app->localFFTPlan_inverse->numAxisUploads[1]; l++) { VkFFTAxis* axis = &app->localFFTPlan_inverse->inverseBluesteinAxes[1][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan_inverse, axis, 1, l, 1); if (resFFT != VKFFT_SUCCESS) return resFFT; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; dispatchBlock[0] 
= (uint64_t)ceil(localSize0[1] / (double)axis->axisBlock[0] * app->localFFTPlan_inverse->actualFFTSizePerAxis[1][1] / (double)axis->specializationConstants.fftDim); dispatchBlock[1] = 1; dispatchBlock[2] = app->localFFTPlan_inverse->actualFFTSizePerAxis[1][2] * app->configuration.coordinateFeatures * app->configuration.numberBatches; //if (app->configuration.performZeropadding[1]) dispatchBlock[1] = (uint64_t)ceil(dispatchBlock[1] / 2.0); //if (app->configuration.performZeropadding[2]) dispatchBlock[2] = (uint64_t)ceil(dispatchBlock[2] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } } } } if (!app->configuration.omitDimension[0]) { if (app->localFFTPlan_inverse->multiUploadR2C) { //app->configuration.size[0] /= 2; VkFFTAxis* axis = &app->localFFTPlan_inverse->R2Cdecomposition; resFFT = VkFFTUpdateBufferSetR2CMultiUploadDecomposition(app, app->localFFTPlan_inverse, axis, 0, 0, 1); if (resFFT != VKFFT_SUCCESS) return resFFT; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; dispatchBlock[0] = (uint64_t)ceil(((app->configuration.size[0] / 2 + 1) * app->configuration.size[1] * app->configuration.size[2]) / (double)(2 * axis->axisBlock[0])); dispatchBlock[1] = 1; dispatchBlock[2] = app->configuration.coordinateFeatures * app->configuration.numberBatches; resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } //FFT axis 0 for (int64_t l = (int64_t)app->localFFTPlan_inverse->numAxisUploads[0] - 1; l >= 0; l--) { 
//if ((!app->configuration.reorderFourStep) && (!app->useBluesteinFFT[0])) l = app->localFFTPlan_inverse->numAxisUploads[0] - 1 - l; VkFFTAxis* axis = &app->localFFTPlan_inverse->axes[0][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan_inverse, axis, 0, l, 1); if (resFFT != VKFFT_SUCCESS) return resFFT; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; if (l == 0) { if (app->localFFTPlan_inverse->numAxisUploads[0] > 2) { dispatchBlock[0] = (uint64_t)ceil((uint64_t)ceil(app->localFFTPlan_inverse->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim / (double)axis->axisBlock[1]) / (double)app->localFFTPlan_inverse->axisSplit[0][1]) * app->localFFTPlan_inverse->axisSplit[0][1]; dispatchBlock[1] = app->localFFTPlan_inverse->actualFFTSizePerAxis[0][1]; } else { if (app->localFFTPlan_inverse->numAxisUploads[0] > 1) { dispatchBlock[0] = (uint64_t)ceil((uint64_t)ceil(app->localFFTPlan_inverse->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim / (double)axis->axisBlock[1])); dispatchBlock[1] = app->localFFTPlan_inverse->actualFFTSizePerAxis[0][1]; } else { dispatchBlock[0] = app->localFFTPlan_inverse->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim; dispatchBlock[1] = (uint64_t)ceil(app->localFFTPlan_inverse->actualFFTSizePerAxis[0][1] / (double)axis->axisBlock[1]); } } } else { dispatchBlock[0] = (uint64_t)ceil(app->localFFTPlan_inverse->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim / (double)axis->axisBlock[0]); dispatchBlock[1] = app->localFFTPlan_inverse->actualFFTSizePerAxis[0][1]; } dispatchBlock[2] = app->localFFTPlan_inverse->actualFFTSizePerAxis[0][2] * app->configuration.coordinateFeatures * app->configuration.numberBatches; if 
(axis->specializationConstants.mergeSequencesR2C == 1) dispatchBlock[1] = (uint64_t)ceil(dispatchBlock[1] / 2.0); //if (app->configuration.performZeropadding[1]) dispatchBlock[1] = (uint64_t)ceil(dispatchBlock[1] / 2.0); //if (app->configuration.performZeropadding[2]) dispatchBlock[2] = (uint64_t)ceil(dispatchBlock[2] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); //if ((!app->configuration.reorderFourStep) && (!app->useBluesteinFFT[0])) l = app->localFFTPlan_inverse->numAxisUploads[0] - 1 - l; resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } if (app->useBluesteinFFT[0] && (app->localFFTPlan_inverse->numAxisUploads[0] > 1)) { for (int64_t l = 1; l < (int64_t)app->localFFTPlan_inverse->numAxisUploads[0]; l++) { VkFFTAxis* axis = &app->localFFTPlan_inverse->inverseBluesteinAxes[0][l]; resFFT = VkFFTUpdateBufferSet(app, app->localFFTPlan_inverse, axis, 0, l, 1); if (resFFT != VKFFT_SUCCESS) return resFFT; #if(VKFFT_BACKEND==0) vkCmdBindPipeline(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipeline); vkCmdBindDescriptorSets(app->configuration.commandBuffer[0], VK_PIPELINE_BIND_POINT_COMPUTE, axis->pipelineLayout, 0, 1, &axis->descriptorSet, 0, 0); #endif uint64_t dispatchBlock[3]; if (l == 0) { if (app->localFFTPlan_inverse->numAxisUploads[0] > 2) { dispatchBlock[0] = (uint64_t)ceil((uint64_t)ceil(app->localFFTPlan_inverse->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim / (double)axis->axisBlock[1]) / (double)app->localFFTPlan_inverse->axisSplit[0][1]) * app->localFFTPlan_inverse->axisSplit[0][1]; dispatchBlock[1] = app->localFFTPlan_inverse->actualFFTSizePerAxis[0][1]; } else { if (app->localFFTPlan_inverse->numAxisUploads[0] > 1) { dispatchBlock[0] = (uint64_t)ceil((uint64_t)ceil(app->localFFTPlan_inverse->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim / (double)axis->axisBlock[1])); 
dispatchBlock[1] = app->localFFTPlan_inverse->actualFFTSizePerAxis[0][1]; } else { dispatchBlock[0] = app->localFFTPlan_inverse->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim; dispatchBlock[1] = (uint64_t)ceil(app->localFFTPlan_inverse->actualFFTSizePerAxis[0][1] / (double)axis->axisBlock[1]); } } } else { dispatchBlock[0] = (uint64_t)ceil(app->localFFTPlan_inverse->actualFFTSizePerAxis[0][0] / axis->specializationConstants.fftDim / (double)axis->axisBlock[0]); dispatchBlock[1] = app->localFFTPlan_inverse->actualFFTSizePerAxis[0][1]; } dispatchBlock[2] = app->localFFTPlan_inverse->actualFFTSizePerAxis[0][2] * app->configuration.coordinateFeatures * app->configuration.numberBatches; if (axis->specializationConstants.mergeSequencesR2C == 1) dispatchBlock[1] = (uint64_t)ceil(dispatchBlock[1] / 2.0); //if (app->configuration.performZeropadding[1]) dispatchBlock[1] = (uint64_t)ceil(dispatchBlock[1] / 2.0); //if (app->configuration.performZeropadding[2]) dispatchBlock[2] = (uint64_t)ceil(dispatchBlock[2] / 2.0); resFFT = dispatchEnhanced(app, axis, dispatchBlock); if (resFFT != VKFFT_SUCCESS) return resFFT; printDebugInformation(app, axis); resFFT = VkFFTSync(app); if (resFFT != VKFFT_SUCCESS) return resFFT; } } } //if (app->localFFTPlan_inverse->multiUploadR2C) app->configuration.size[0] *= 2; } return resFFT; } static inline int VkFFTGetVersion() { return 10221; //X.XX.XX format } #endifpyvkfft-2022.1.1/src/vkfft_cuda.cu0000644000076500000240000002104014202465263017456 0ustar vincentstaff00000000000000/* PyVkFFT (c) 2021- : ESRF-European Synchrotron Radiation Facility authors: Vincent Favre-Nicolin, favre@esrf.fr */ // We use the CUDA backend #define VKFFT_BACKEND 1 #include #include #include using namespace std; #include "vkFFT.h" typedef float2 Complex; #ifdef _WIN32 #define LIBRARY_API extern "C" __declspec(dllexport) #else #define LIBRARY_API extern "C" #endif LIBRARY_API VkFFTConfiguration* make_config(const size_t, const size_t, const size_t, 
const size_t, void*, void*, void*, const int, const size_t, const int, const int, const int, const int, const int, const int, const size_t, const int, const int, const int);

LIBRARY_API VkFFTApplication* init_app(const VkFFTConfiguration*, int*);

LIBRARY_API int fft(VkFFTApplication* app, void*, void*);

LIBRARY_API int ifft(VkFFTApplication* app, void*, void*);

LIBRARY_API void free_app(VkFFTApplication* app);

LIBRARY_API void free_config(VkFFTConfiguration *config);

LIBRARY_API uint32_t vkfft_version();

class PyVkFFT
{
  public:
    PyVkFFT(const int nx, const int ny, const int nz, const int fftdim, void* hstream,
            const int norm, const int precision, const int r2c)
    {
    };
  private:
    VkFFTConfiguration mConf;
    VkFFTApplication mApp;
    VkFFTApplication mLaunchParams;
};

/** Create the VkFFTConfiguration from the array parameters
*
* \param nx, ny, nz: dimensions of the array. The fast axis is x. In the corresponding numpy array,
*                    this corresponds to a shape of (nz, ny, nx)
* \param fftdim: the dimension of the transform. If nz>1 and fftdim=2, the transform is only made
*                on the x and y axes
* \param buffer, buffer_out: pointers to the GPU data source and destination arrays. These
*   can be fake and the actual buffers supplied in fft() and ifft(). However buffer should be non-zero,
*   and buffer_out should be non-zero only for an out-of-place transform.
* \param hstream: the stream handle (CUstream)
* \param norm: if 0, the L2 norm is multiplied by the size on each transform; if 1, the inverse transform
*   divides the L2 norm by the size.
* \param precision: number of bytes per real value, 2=half, 4=single, 8=double precision
* \param r2c: if True, create a configuration for a real<->complex transform
* \return: the pointer to the newly created VkFFTConfiguration, or 0 if an error occurred.
*/
VkFFTConfiguration* make_config(const size_t nx, const size_t ny, const size_t nz, const size_t fftdim,
                                void *buffer, void *buffer_out, void* hstream,
                                const int norm, const size_t precision, const int r2c, const int dct,
                                const int disableReorderFourStep, const int registerBoost,
                                const int useLUT, const int keepShaderCode, const size_t n_batch,
                                const int skipx, const int skipy, const int skipz)
{
  VkFFTConfiguration *config = new VkFFTConfiguration({});
  config->FFTdim = fftdim;
  config->size[0] = nx;
  config->size[1] = ny;
  config->size[2] = nz;
  config->numberBatches = n_batch;

  config->omitDimension[0] = skipx;
  config->omitDimension[1] = skipy;
  config->omitDimension[2] = skipz;

  config->normalize = norm;
  config->performR2C = r2c;
  config->performDCT = dct;

  // Negative values keep the VkFFT defaults
  if(disableReorderFourStep>=0) config->disableReorderFourStep = disableReorderFourStep;
  if(registerBoost>=0) config->registerBoost = registerBoost;
  if(useLUT>=0) config->useLUT = useLUT;
  if(keepShaderCode>=0) config->keepShaderCode = keepShaderCode;

  // precision is the number of bytes per real value; 4 (single) needs no flag
  switch(precision)
  {
      case 2 : config->halfPrecision = 1;
               break;  // break so that half precision does not fall through to the double-precision case
      case 8 : config->doublePrecision = 1;
               break;
  };

  CUdevice *dev = new CUdevice;
  if(hstream != 0)
  {
    // Get context then device from current context
    CUcontext ctx = nullptr;
    CUresult res = cuStreamGetCtx ((CUstream)hstream, &ctx);
    if(res != CUDA_SUCCESS)
    {
      cout << "Could not get the current device from given stream"<stream = new CUstream((CUstream) hstream);
    config->num_streams = 1;
  }
  else
  {
    // Get device from current context
    CUresult res = cuCtxGetDevice(dev);
    if(res != CUDA_SUCCESS)
    {
      cout << "Could not get the current device.
Was a CUDA context created ?"<device = dev; void ** pbuf = new void*; *pbuf = buffer; uint64_t* psize = new uint64_t; uint64_t* psizein = psize; if(r2c) { *psize = (uint64_t)((nx / 2 + 1) * ny * nz * precision * (size_t)2); if(buffer_out != NULL) { psizein = new uint64_t; *psizein = (uint64_t)(nx * ny * nz * precision); config->inverseReturnToInputBuffer = 1; config->inputBufferStride[0] = nx; config->inputBufferStride[1] = nx * ny; config->inputBufferStride[2] = nx * ny * nz; } } else { if(dct) *psize = (uint64_t)(nx * ny * nz * precision); else *psize = (uint64_t)(nx * ny * nz * precision * (size_t)2); } config->bufferSize = psize; if(buffer_out != NULL) { // Calculations are made in buffer, so with buffer != inputBuffer we keep the original data void ** pbufout = new void*; *pbufout = buffer_out; config->buffer = pbufout; config->inputBuffer = pbuf; config->inputBufferSize = psizein; config->isInputFormatted = 1; } else { config->buffer = pbuf; } /* cout << "make_config: "<buffer<<", "<< *(config->buffer)<<", " << config->size[0] << " " << config->size[1] << " " << config->size[2] << " "<< config->FFTdim << " " << *(config->bufferSize) << endl; */ return config; } /** Initialise the VkFFTApplication from the given configuration. 
*
* \param config: the pointer to the VkFFTConfiguration
* \param res: receives the result code returned by initializeVkFFT()
* \return: the pointer to the newly created VkFFTApplication
*/
VkFFTApplication* init_app(const VkFFTConfiguration* config, int *res)
{
  VkFFTApplication* app = new VkFFTApplication({});
  *res = initializeVkFFT(app, *config);
  /*
  cout << "init_app: "<buffer<<", "<< *(config->buffer)<<", " << config->size[0] << " " << config->size[1] << " " << config->size[2] << " "<< config->FFTdim << " " << *(config->bufferSize) << endl<configuration.buffer) = out;
  *(app->configuration.inputBuffer) = in;
  *(app->configuration.outputBuffer) = out;

  VkFFTLaunchParams par = {};
  par.buffer = app->configuration.buffer;
  par.inputBuffer = app->configuration.inputBuffer;
  par.outputBuffer = app->configuration.outputBuffer;

  // -1: forward transform
  return VkFFTAppend(app, -1, &par);
}

int ifft(VkFFTApplication* app, void *in, void *out)
{
  // Modify the original app only to avoid allocating
  // new buffer pointers in memory
  *(app->configuration.buffer) = out;
  *(app->configuration.inputBuffer) = in;
  *(app->configuration.outputBuffer) = out;

  VkFFTLaunchParams par = {};
  par.buffer = app->configuration.buffer;
  par.inputBuffer = app->configuration.inputBuffer;
  par.outputBuffer = app->configuration.outputBuffer;

  // 1: inverse transform
  return VkFFTAppend(app, 1, &par);
}

/** Free memory associated to the vkFFT app
*
*/
void free_app(VkFFTApplication* app)
{
  if(app != NULL)
  {
    deleteVkFFT(app);
    // app was allocated with new in init_app()
    delete app;
  }
}

/** Free memory allocated during make_config()
*
*/
void free_config(VkFFTConfiguration *config)
{
  // Everything below was allocated with new in make_config(), so it is released with delete
  delete config->device;
  // Only frees the pointer to the buffer pointer, not the buffer itself.
  delete config->buffer;
  delete config->bufferSize;

  if((config->outputBuffer != NULL) && (config->buffer != config->outputBuffer)) delete config->outputBuffer;

  if((config->inputBuffer != NULL) && (config->buffer != config->inputBuffer)
     && (config->outputBuffer != config->inputBuffer)) delete config->inputBuffer;

  if((config->inputBufferSize != NULL) && (config->inputBufferSize != config->bufferSize))
    delete config->inputBufferSize;

  if((config->outputBufferSize != NULL) && (config->outputBufferSize != config->bufferSize)
     && (config->outputBufferSize != config->inputBufferSize)) delete config->outputBufferSize;

  if(config->stream != 0) delete config->stream;
  delete config;
}

/// Get VkFFT version
uint32_t vkfft_version()
{
  return VkFFTGetVersion();
};

pyvkfft-2022.1.1/src/vkfft_opencl.cpp0000644000076500000240000001751414202465263020210 0ustar vincentstaff00000000000000/* PyVkFFT
   (c) 2021- : ESRF-European Synchrotron Radiation Facility
       authors:
         Vincent Favre-Nicolin, favre@esrf.fr
*/
// We use the OpenCL backend
#define VKFFT_BACKEND 3

#include #include #include #include
using namespace std;
#include "vkFFT.h"

#ifdef _WIN32
#define LIBRARY_API extern "C" __declspec(dllexport)
#else
#define LIBRARY_API extern "C"
#endif

LIBRARY_API VkFFTConfiguration* make_config(const size_t, const size_t, const size_t, const size_t, void*, void*, void*, void*, void*, const int, const size_t, const int, const int, const int, const int, const int, const int, const size_t, const int, const int, const int);

LIBRARY_API VkFFTApplication* init_app(const VkFFTConfiguration*, void*, int*);

LIBRARY_API int fft(VkFFTApplication* app, void*, void*, void*);

LIBRARY_API int ifft(VkFFTApplication* app, void*, void*, void*);

LIBRARY_API void free_app(VkFFTApplication* app);

LIBRARY_API void free_config(VkFFTConfiguration *config);

LIBRARY_API uint32_t vkfft_version();

/** Create the VkFFTConfiguration from the array parameters
*
* \param nx, ny, nz: dimensions of the array. The fast axis is x.
* In the corresponding numpy array, this corresponds to a shape of (nz, ny, nx)
* \param fftdim: the dimension of the transform. If nz>1 and fftdim=2, the transform is only made
*                on the x and y axes
* \param buffer, buffer_out: pointers to the GPU data source and destination arrays. These
*   can be fake and the actual buffers supplied in fft() and ifft(). However buffer should be non-zero,
*   and buffer_out should be non-zero only for an out-of-place transform.
* \param platform: the cl_platform
* \param device: the cl_device
* \param ctx: the cl_context
* \param norm: if 0, the L2 norm is multiplied by the size on each transform; if 1, the inverse transform
*   divides the L2 norm by the size.
* \param precision: number of bytes per real value, 2=half, 4=single, 8=double precision
* \param r2c: if True, create a configuration for a real<->complex transform
* \return: the pointer to the newly created VkFFTConfiguration, or 0 if an error occurred.
*/
VkFFTConfiguration* make_config(const size_t nx, const size_t ny, const size_t nz, const size_t fftdim,
                                void *buffer, void *buffer_out, void* platform, void* device, void* ctx,
                                const int norm, const size_t precision, const int r2c, const int dct,
                                const int disableReorderFourStep, const int registerBoost,
                                const int useLUT, const int keepShaderCode, const size_t n_batch,
                                const int skipx, const int skipy, const int skipz)
{
  VkFFTConfiguration *config = new VkFFTConfiguration({});
  config->FFTdim = fftdim;
  config->size[0] = nx;
  config->size[1] = ny;
  config->size[2] = nz;
  config->numberBatches = n_batch;

  config->omitDimension[0] = skipx;
  config->omitDimension[1] = skipy;
  config->omitDimension[2] = skipz;

  config->normalize = norm;
  config->performR2C = r2c;
  config->performDCT = dct;

  // Negative values keep the VkFFT defaults
  if(disableReorderFourStep>=0) config->disableReorderFourStep = disableReorderFourStep;
  if(registerBoost>=0) config->registerBoost = registerBoost;
  if(useLUT>=0) config->useLUT = useLUT;
  if(keepShaderCode>=0) config->keepShaderCode = keepShaderCode;
  // precision is the number of bytes per real value; 4 (single) needs no flag
  switch(precision)
  {
      case 2 : config->halfPrecision = 1;
               break;  // break so that half precision does not fall through to the double-precision case
      case 8 : config->doublePrecision = 1;
               break;
  };

  cl_device_id *pdev = new cl_device_id;
  *pdev = (cl_device_id)device;
  config->device = pdev;

  cl_platform_id *pplatform = new cl_platform_id;
  *pplatform = (cl_platform_id)platform;
  config->platform = pplatform;

  cl_context * pctx = new cl_context;
  *pctx = (cl_context) ctx;
  config->context = pctx;

  void ** pbuf = new void*;
  *pbuf = buffer;

  // Buffer sizes are in bytes
  uint64_t* psize = new uint64_t;
  uint64_t* psizein = psize;

  if(r2c)
  {
    // Half-Hermitian complex array: (nx/2+1) complex values along x
    *psize = (uint64_t)((nx / 2 + 1) * ny * nz * precision * (size_t)2);
    if(buffer_out != NULL)
    {
      // Out-of-place: the real input array has its own size and strides
      psizein = new uint64_t;
      *psizein = (uint64_t)(nx * ny * nz * precision);
      config->inverseReturnToInputBuffer = 1;
      config->inputBufferStride[0] = nx;
      config->inputBufferStride[1] = nx * ny;
      config->inputBufferStride[2] = nx * ny * nz;
    }
  }
  else
  {
    if(dct) *psize = (uint64_t)(nx * ny * nz * precision);
    else *psize = (uint64_t)(nx * ny * nz * precision * (size_t)2);
  }

  config->bufferSize = psize;

  if(buffer_out != NULL)
  {
    // Calculations are made in buffer, so with buffer != inputBuffer we keep the original data
    void ** pbufout = new void*;
    *pbufout = buffer_out;

    config->buffer = (cl_mem*)pbufout;
    config->inputBuffer = (cl_mem*)pbuf;
    config->inputBufferSize = psizein;
    config->isInputFormatted = 1;
  }
  else
  {
    config->buffer = (cl_mem*)pbuf;
  }

  return config;
}

/** Initialise the VkFFTApplication from the given configuration.
*
* \param config: the pointer to the VkFFTConfiguration
* \param queue: the cl_command_queue
* \param res: receives the result code returned by initializeVkFFT()
* \return: the pointer to the newly created VkFFTApplication, or 0 if the initialisation failed
*/
VkFFTApplication* init_app(const VkFFTConfiguration* config, void *queue, int *res)
{
  VkFFTApplication* app = new VkFFTApplication({});
  *res = initializeVkFFT(app, *config);
  if(*res!=0)
  {
    delete app;
    return 0;
  }
  return app;
}

int fft(VkFFTApplication* app, void *in, void *out, void* queue)
{
  cl_command_queue q = (cl_command_queue) queue;
  // Modify the original app only to avoid allocating
  // new buffer pointers in memory
  *(app->configuration.buffer) = (cl_mem)out;
  *(app->configuration.inputBuffer) = (cl_mem)in;
  *(app->configuration.outputBuffer) = (cl_mem)out;
  app->configuration.commandQueue = &q;

  VkFFTLaunchParams par = {};
  par.commandQueue = &q;
  par.buffer = app->configuration.buffer;
  par.inputBuffer = app->configuration.inputBuffer;
  par.outputBuffer = app->configuration.outputBuffer;

  // -1: forward transform
  return VkFFTAppend(app, -1, &par);
}

int ifft(VkFFTApplication* app, void *in, void *out, void* queue)
{
  cl_command_queue q = (cl_command_queue) queue;
  // Modify the original app only to avoid allocating
  // new buffer pointers in memory
  *(app->configuration.buffer) = (cl_mem)out;
  *(app->configuration.inputBuffer) = (cl_mem)in;
  *(app->configuration.outputBuffer) = (cl_mem)out;
  app->configuration.commandQueue = &q;

  VkFFTLaunchParams par = {};
  par.commandQueue = &q;
  par.buffer = app->configuration.buffer;
  par.inputBuffer = app->configuration.inputBuffer;
  par.outputBuffer = app->configuration.outputBuffer;

  // 1: inverse transform
  return VkFFTAppend(app, 1, &par);
}

/** Free memory associated to the vkFFT app
*
*/
void free_app(VkFFTApplication* app)
{
  if(app != NULL)
  {
    deleteVkFFT(app);
    // app was allocated with new in init_app()
    delete app;
  }
}

/** Free memory allocated during make_config()
*
*/
void free_config(VkFFTConfiguration *config)
{
  // Everything below was allocated with new in make_config(), so it is released with delete
  delete config->platform;
  delete config->device;
  delete config->context;
  // Only frees the pointer to the buffer pointer, not the buffer itself.
  // The buffer pointers were allocated as void* in make_config(), hence the casts.
  delete (void**)config->buffer;
  delete config->bufferSize;

  if((config->outputBuffer != NULL) && (config->buffer != config->outputBuffer))
    delete (void**)config->outputBuffer;

  if((config->inputBuffer != NULL) && (config->buffer != config->inputBuffer)
     && (config->outputBuffer != config->inputBuffer)) delete (void**)config->inputBuffer;

  if((config->inputBufferSize != NULL) && (config->inputBufferSize != config->bufferSize))
    delete config->inputBufferSize;

  if((config->outputBufferSize != NULL) && (config->outputBufferSize != config->bufferSize)
     && (config->outputBufferSize != config->inputBufferSize)) delete config->outputBufferSize;

  delete config;
}

/// Get VkFFT version
uint32_t vkfft_version()
{
  return VkFFTGetVersion();
};
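The buffer-size arithmetic used in make_config() can be isolated as a small helper. The sketch below is only an illustration: `vkfft_buffer_bytes` and `vkfft_r2c_input_bytes` are hypothetical names that are not part of pyvkfft, and it assumes, as the code above does, that `precision` is the number of bytes per real value (2, 4 or 8).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of pyvkfft): size in bytes of the main VkFFT
// buffer, following the same branches as make_config().
//  - r2c: half-Hermitian complex array, (nx/2+1) complex values along x
//  - dct: real array, one value per point
//  - otherwise: complex array, two values per point
constexpr std::uint64_t vkfft_buffer_bytes(std::size_t nx, std::size_t ny, std::size_t nz,
                                           std::size_t precision, bool r2c, bool dct)
{
    return r2c ? static_cast<std::uint64_t>((nx / 2 + 1) * ny * nz * precision * 2)
         : dct ? static_cast<std::uint64_t>(nx * ny * nz * precision)
               : static_cast<std::uint64_t>(nx * ny * nz * precision * 2);
}

// Hypothetical helper: size in bytes of the separate real input array used
// for an out-of-place R2C transform (buffer_out != NULL in make_config()).
constexpr std::uint64_t vkfft_r2c_input_bytes(std::size_t nx, std::size_t ny, std::size_t nz,
                                              std::size_t precision)
{
    return static_cast<std::uint64_t>(nx * ny * nz * precision);
}
```

For an array with nx=8, ny=4, nz=2 in single precision (4 bytes), these helpers give 512 bytes for a C2C transform, 320 bytes for the R2C complex buffer, and 256 bytes for both a DCT buffer and the out-of-place R2C real input.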
GPU backend transform ndim range radix dtype inplace LUT norm time-duration FAIL ERROR
%s' % (args.gpu if args.gpu is not None else '-') if args.backend is not None: for a in args.backend: html += a + ' ' else: html += 'all' html += ' 0 or t.range_nd_narrow[1] > 0) and t.ndim > 1: html += " [|Ni-N1|<={%d; %d%%N1}]" % \ (t.range_nd_narrow[1], int(t.range_nd_narrow[0] * 100)) elif t.ndim > 1: html += ' (' + 'N,' * (t.ndim - 1) + 'N)' html += "Bluestein-" for i in (args.radix if len(args.radix) else [2, 3, 5, 7, 11, 13]): html += "%d," % i html = html[:-1] if t.max_pow is not None: html += '[^N,N<=%d]' % t.max_pow html += '%s%s%s%dRegular multi-dimensional C2C/R2C/DCT test%s +%s %s 0 0