pax_global_header00006660000000000000000000000064145631342670014525gustar00rootroot0000000000000052 comment=39baaf56bb0bc871f271247e18f2f1dd7e2c187a GiMMiK-3.2.1/000077500000000000000000000000001456313426700126055ustar00rootroot00000000000000GiMMiK-3.2.1/.github/000077500000000000000000000000001456313426700141455ustar00rootroot00000000000000GiMMiK-3.2.1/.github/workflows/000077500000000000000000000000001456313426700162025ustar00rootroot00000000000000GiMMiK-3.2.1/.github/workflows/python-publish.yml000066400000000000000000000015401456313426700217120ustar00rootroot00000000000000# This workflow will upload a Python Package using Twine when a release is created # For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries name: Upload Python Package on: release: types: [created] jobs: deploy: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 - name: Set up Python uses: actions/setup-python@v3 with: python-version: '3.x' - name: Install dependencies run: | python -m pip install --upgrade pip pip install setuptools wheel twine - name: Build and publish env: TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }} TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }} run: | python setup.py sdist bdist_wheel twine upload dist/* GiMMiK-3.2.1/.gitignore000066400000000000000000000010401456313426700145700ustar00rootroot00000000000000# Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] # C extensions *.so # Distribution / packaging .Python env/ bin/ build/ develop-eggs/ dist/ eggs/ lib/ lib64/ parts/ sdist/ var/ *.egg-info/ .installed.cfg *.egg # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .coverage .cache nosetests.xml coverage.xml # Translations *.mo # Mr Developer .mr.developer.cfg .project .pydevproject # Rope .ropeproject # Django stuff: *.log *.pot # Sphinx documentation docs/_build/ 
GiMMiK-3.2.1/AUTHORS000066400000000000000000000001351456313426700136540ustar00rootroot00000000000000Freddie Witherden Bartosz Wozniak GiMMiK-3.2.1/LICENSE000066400000000000000000000027471456313426700136240ustar00rootroot00000000000000Copyright (c) 2014, 2015, 2016 Freddie Witherden and Bartosz Wozniak All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of GiMMiK nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
GiMMiK-3.2.1/README.rst000066400000000000000000000031561456313426700143010ustar00rootroot00000000000000GiMMiK ====== Generator of Matrix Multiplication Kernels - GiMMiK - is a tool for generation of high performance matrix multiplication kernel code for various accelerator platforms. Currently C, CUDA, HIP, ISPC, Metal, and OpenCL are supported. What does GiMMiK do? -------------------- Consider matrix multiplication of the form C = α∙A×B + β∙C GiMMiK generates fully unrolled kernels, highly specialised to a given operator matrix. The generated code is fully unrolled - each kernel computes a single column of the output matrix. GiMMiK was designed to perform well in a Block by Panel type of matrix multiplication where the operator matrix is small. GiMMiK also removes any sparsity from the operator matrix and attempts to reduce common sub-expressions. How do I install GiMMiK? ------------------------ Clone the git repository and use `setup.py` to install the GiMMiK package. You will need the following dependencies: * `mako `_ * `numpy >= 1.7 `_ Once obtained, you can install GiMMiK by running :: python setup.py install to perform a system-wide install. Alternatively, run :: python setup.py install --user to install the package locally. How do I use GiMMiK? -------------------- Once installed, you are ready to use GiMMiK. .. code:: python from gimmik import generate_mm ... # Generate a CUDA kernel for C = 2*mat*B src = generate_mm(mat, np.float32, platform='cuda', alpha=2.0, beta=0.0) ... Who uses GiMMiK? ---------------- GiMMiK was developed to improve performance of the `PyFR `_ framework. 
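To make the unrolling and sparsity removal described under "What does GiMMiK do?" more concrete, the following is a minimal, dependency-free sketch of the idea. It is not GiMMiK's implementation: the helper names ``unrolled_row`` and ``sketch_kernel`` are hypothetical, the emitted C is simplified, and the real package renders its kernels from Mako templates. The sketch only shows how zero entries of the operator matrix can be dropped at generation time, so that each row becomes a fixed expression.

```python
# Illustrative sketch only: a simplified imitation of the idea behind
# GiMMiK's generated kernels. The function names below are hypothetical
# and are not part of the gimmik package.

def unrolled_row(row, bfn):
    """Return a C expression for one row of A dotted with a column of B.

    Zero entries of the row are skipped, so they cost nothing at run time.
    """
    terms = [f'{a}*{bfn(k)}' for k, a in enumerate(row) if a != 0]
    return ' + '.join(terms) if terms else '0.0'


def sketch_kernel(mat, beta=0.0, kname='sketch_mm'):
    """Emit a fully unrolled C function computing C = mat*B + beta*C."""
    lines = [
        f'void {kname}(int n, const double* b, int ldb, double* c, int ldc)',
        '{',
        '    for (int i = 0; i < n; i++)',
        '    {',
    ]
    for j, row in enumerate(mat):
        expr = unrolled_row(row, lambda k: f'b[i + {k}*ldb]')
        if beta == 0:
            lines.append(f'        c[i + {j}*ldc] = {expr};')
        elif beta == 1:
            lines.append(f'        c[i + {j}*ldc] += {expr};')
        else:
            lines.append(
                f'        c[i + {j}*ldc] = {expr} + {beta}*c[i + {j}*ldc];'
            )
    lines += ['    }', '}']
    return '\n'.join(lines)


if __name__ == '__main__':
    # A made-up 2x3 operator matrix with several zero entries
    print(sketch_kernel([[2.0, 0.0, 0.5], [0.0, 0.0, 0.0]]))
```

For the made-up matrix above, the first output row reads only rows 0 and 2 of B, and the all-zero second row reduces to a plain store of ``0.0``. Because the matrix entries are baked into the source as constants, a new kernel has to be generated for every distinct operator matrix, which is why this approach suits small, fixed operators.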
GiMMiK-3.2.1/gimmik/000077500000000000000000000000001456313426700140625ustar00rootroot00000000000000GiMMiK-3.2.1/gimmik/__init__.py000066400000000000000000000015141456313426700161740ustar00rootroot00000000000000# -*- coding: utf-8 -*- from gimmik._version import __version__ from gimmik.c import CMatMul from gimmik.copenmp import COpenMPMatMul from gimmik.cuda import CUDAMatMul from gimmik.ispc import ISPCMatMul from gimmik.hip import HIPMatMul from gimmik.metal import MetalMatMul from gimmik.opencl import OpenCLMatMul def generate_mm(mat, dtype, platform, alpha=1.0, beta=0.0, funcn='gimmik_mm', n=None, ldb=None, ldc=None): import warnings warnings.warn('generate_mm is deprecated, use MatMul', DeprecationWarning) platmap = { 'c': CMatMul, 'c-omp': COpenMPMatMul, 'cuda': CUDAMatMul, 'ispc': ISPCMatMul, 'hip': HIPMatMul, 'opencl': OpenCLMatMul } mm = platmap[platform](alpha*mat, beta, None, n, ldb, ldc) return next(mm.kernels(dtype, kname=funcn))[0] GiMMiK-3.2.1/gimmik/_version.py000066400000000000000000000000571456313426700162620ustar00rootroot00000000000000# -*- coding: utf-8 -*- __version__ = '3.2.1' GiMMiK-3.2.1/gimmik/base.py000066400000000000000000000115211456313426700153460ustar00rootroot00000000000000# -*- coding: utf-8 -*- import itertools as it import pkgutil import re from mako.lookup import TemplateLookup from mako.template import Template import numpy as np class _PlatformTemplateLookup(TemplateLookup): def __init__(self, platform): self.platform = platform def adjust_uri(self, uri, relto): return uri def get_template(self, name): platform = self.platform src = pkgutil.get_data(__name__, f'kernels/{platform}/{name}.mako') return Template(src, lookup=self) def _dot(bfn, row, maxsplit=1): nzixs, = np.nonzero(row) if not nzixs.size: return '0.0' nsplit = max(min(maxsplit, nzixs.size // 3), 1) snzixs = np.array_split(nzixs, nsplit) frags = [' + '.join(f'{row[i]}*{bfn(i)}' for i in ix) for ix in snzixs] return ' + '.join(f'({f})' for f in frags) def 
_partition(mat, into, by): if by == 'rows': return [list(range(i, len(mat), into)) for i in range(into)] elif by == 'cols': return [list(range(i, len(mat.T), into)) for i in range(into)] else: raise ValueError('Invalid partition by') def _chunk(l, chunksz): l, n = iter(l), len(l) nchunks = -(-n // chunksz) return [list(it.islice(l, chunksz)) for i in range(nchunks)] class MatMul: platform = None def __init__(self, A, beta=0.0, aligne=None, n=None, ldb=None, ldc=None): self.A = A self.beta = beta self.aligne = aligne if n is None and ldb is None and ldc is None: self.n = self.ldb = self.ldc = None elif n is not None and ldb is not None and ldc is not None: if aligne is not None and (ldb % aligne or ldc % aligne): raise ValueError('ldb/ldc not compatible with aligne') self.n, self.ldb, self.ldc = n, ldb, ldc else: raise ValueError('Must provide all of (n, ldb, ldc) or none') # Check the matrix has a non-zero if not A.any(): raise ValueError('A can not be empty') # Extract the shape of A self.m, self.k = m, k = A.shape # Determine the index of the first and last non-zero in each row of A self.afix = (A != 0).argmax(axis=1) self.alix = k - 1 - (A != 0)[:, ::-1].argmax(axis=1) # Mark rows of A which are all zero self.afix = np.where(np.any(A != 0, axis=1), self.afix, -1) self.alix = np.where(np.any(A != 0, axis=1), self.alix, -1) self.has_zero_rows = np.any(self.afix == -1) # Determine which entries of B partake in the multiplication self.bix = np.nonzero(np.any(A != 0, axis=0))[0] self.bix = {kx: k for k, kx in enumerate(self.bix)} def kernels(self, dtype, kname='gimmik_mm', **kwargs): basemeta = self.basemeta # Process the data type dtype = np.dtype(dtype).type if dtype == np.float32: dtype, dsize = 'float', 4 elif dtype == np.float64: dtype, dsize = 'double', 8 else: raise ValueError('Invalid floating point data type') # Common template arguments baseargs = { 'dtype': dtype, 'kname': kname, 'A': self.A, 'beta': self.beta, 'width': 1, 'm': self.m, 'n': self.n, 'k': 
self.k, 'ldb': self.ldb, 'ldc': self.ldc, 'afix': self.afix, 'alix': self.alix, 'bix': self.bix, 'dot': _dot, 'partition': _partition, 'chunk': _chunk } # Incrementally generate and render the kernels gen = self._kernel_generators(dtype, dsize, **kwargs) try: resp = None while True: # Generate the next kernel in the sequence name, exargs, exmeta = gen.send(resp) # Merge in the base arguments and metadata args = baseargs | exargs meta = basemeta | exmeta # Render the kernel template src = self._render_kernel(dtype, name, args) # Post-process the metadata meta['tplname'] = name self._process_meta(meta) # Yield the source and metadata and await a response resp = yield (src, meta) except StopIteration: pass def _process_meta(self, meta): pass def _render_kernel(self, dtype, tplname, tplargs): tpl = _PlatformTemplateLookup(self.platform).get_template(tplname) src = tpl.render(**tplargs) # At single precision suffix all floating point constants by 'f' if dtype == 'float': src = re.sub(r'(?=\d*[.eE])(?=\.?\d)\d*\.?\d*(?:[eE][+-]?\d+)?', r'\g<0>f', src) # Cleanup src = re.sub(r'^\w+\n$', '', src.strip()) src = re.sub(r'\n\n+', r'\n\n', src) + '\n' src = re.sub(r'\w+$', '', src) return src GiMMiK-3.2.1/gimmik/c.py000066400000000000000000000003111456313426700146510ustar00rootroot00000000000000# -*- coding: utf-8 -*- from gimmik.base import MatMul class CMatMul(MatMul): platform = 'c' basemeta = {} def _kernel_generators(self, dtype, dsize): yield ('cstream', {}, {}) GiMMiK-3.2.1/gimmik/copenmp.py000066400000000000000000000003271456313426700160770ustar00rootroot00000000000000# -*- coding: utf-8 -*- from gimmik.base import MatMul class COpenMPMatMul(MatMul): platform = 'c-openmp' basemeta = {} def _kernel_generators(self, dtype, dsize): yield ('cstream', {}, {}) GiMMiK-3.2.1/gimmik/cuda.py000066400000000000000000000043001456313426700153450ustar00rootroot00000000000000# -*- coding: utf-8 -*- from gimmik.base import MatMul class CUDAMatMul(MatMul): platform = 'cuda' basemeta = 
{'block': (128, 1, 1), 'width': 1, 'shared': 0, 'dynamic_shared': 0} def _kernel_generators(self, dtype, dsize, *, compute_capability=None): # B loading, C streaming kernel yield ('cstream', {}, {}) # B streaming, C accumulation kernel yield ('bstream', {}, {}) # Four-way m-split B streaming, C accumulation kernel ms, bsz, blkx = 4, 24, 32 args = {'msplit': ms, 'bsz': bsz, 'blockx': blkx} meta = {'block': (blkx, ms, 1), 'shared': 2*bsz*blkx*dsize} yield ('bstream-msplit', args, meta) # Two-way k-split B loading, C streaming kernel ks, csz, blkx = 2, 24, 32 args = {'ksplit': ks, 'csz': csz, 'blockx': blkx} meta = {'block': (blkx, ks, 1), 'shared': (ks - 1)*csz*blkx*dsize} yield ('cstream-ksplit', args, meta) # At single precision also consider vectorized kernels if (dtype == 'float' and self.aligne is not None and self.aligne % 2 == 0): # Vector B loading, C streaming kernel args = {'dtype': 'float2', 'width': 2} meta = {'width': 2} yield ('cstream', args, meta) # Vector four-way m-split B streaming, C accumulation kernel ms, bsz, blkx = 4, 16, 32 args = {'dtype': 'float2', 'width': 2, 'msplit': ms, 'bsz': bsz, 'blockx': blkx} meta = {'block': (blkx, ms, 1), 'width': 2, 'shared': 2*blkx*bsz*2*dsize} yield ('bstream-msplit', args, meta) # Vector two-way k-split B loading, C streaming kernel ks, csz, blkx = 2, 24, 32 args = {'dtype': 'float2', 'width': 2, 'ksplit': ks, 'csz': csz, 'blockx': blkx} meta = {'block': (blkx, ks, 1), 'width': 2, 'shared': 2*(ks - 1)*csz*blkx*dsize} yield ('cstream-ksplit', args, meta) def _process_meta(self, meta): if self.n is not None: div = meta['block'][0]*meta['width'] meta['grid'] = (-(-self.n // div), 1, 1) GiMMiK-3.2.1/gimmik/hip.py000066400000000000000000000021241456313426700152130ustar00rootroot00000000000000# -*- coding: utf-8 -*- from gimmik.base import MatMul class HIPMatMul(MatMul): platform = 'hip' basemeta = {'block': (128, 1, 1), 'width': 1, 'shared': 0} def _kernel_generators(self, dtype, dsize, *, gcn_arch=None, 
warp_size=64): # B loading, C streaming kernel yield ('cstream', {}, {}) # B streaming, C accumulation kernel yield ('bstream', {}, {}) # Four-way m-split B streaming, C accumulation kernel ms, bsz, blkx = 4, 24, 64 args = {'msplit': ms, 'bsz': bsz, 'blockx': blkx} meta = {'block': (blkx, ms, 1), 'shared': 2*bsz*blkx*dsize} yield ('bstream-msplit', args, meta) # Two-way k-split B loading, C streaming kernel ks, csz, blkx = 2, 24, 64 args = {'ksplit': ks, 'csz': csz, 'blockx': blkx} meta = {'block': (blkx, ks, 1), 'shared': (ks - 1)*csz*blkx*dsize} yield ('cstream-ksplit', args, meta) def _process_meta(self, meta): if self.n is not None: div = meta['block'][0]*meta['width'] meta['grid'] = (-(-self.n // div), 1, 1) GiMMiK-3.2.1/gimmik/ispc.py000066400000000000000000000003171456313426700153730ustar00rootroot00000000000000# -*- coding: utf-8 -*- from gimmik.base import MatMul class ISPCMatMul(MatMul): platform = 'ispc' basemeta = {} def _kernel_generators(self, dtype, dsize): yield ('cstream', {}, {}) GiMMiK-3.2.1/gimmik/kernels/000077500000000000000000000000001456313426700155255ustar00rootroot00000000000000GiMMiK-3.2.1/gimmik/kernels/c-openmp/000077500000000000000000000000001456313426700172435ustar00rootroot00000000000000GiMMiK-3.2.1/gimmik/kernels/c-openmp/cstream.mako000066400000000000000000000014531456313426700215550ustar00rootroot00000000000000void % if n is None: ${kname}(int n, const ${dtype}* restrict b, int ldb, ${dtype}* restrict c, int ldc) { % else: ${kname}(const ${dtype}* restrict b, ${dtype}* restrict c) { const int n = ${n}; const ${'long long' if k*ldb >= 2**31 else 'int'} ldb = ${ldb}; const ${'long long' if m*ldc >= 2**31 else 'int'} ldc = ${ldc}; % endif #pragma omp parallel for simd private(dotp) for (int i = 0; i < n; i++) { % for j, jx in enumerate(A): % if beta == 0: c[i + ${j}*ldc] = ${dot(lambda kx: f'b[i + {kx}*ldb]', jx)}; % elif beta == 1: c[i + ${j}*ldc] += ${dot(lambda kx: f'b[i + {kx}*ldb]', jx)}; % else: c[i + ${j}*ldc] = ${dot(lambda 
kx: f'b[i + {kx}*ldb]', jx)} + ${beta}*c[i + ${j}*ldc]; % endif % endfor } } GiMMiK-3.2.1/gimmik/kernels/c/000077500000000000000000000000001456313426700157475ustar00rootroot00000000000000GiMMiK-3.2.1/gimmik/kernels/c/cstream.mako000066400000000000000000000014201456313426700202530ustar00rootroot00000000000000void % if n is None: ${kname}(int n, const ${dtype}* restrict b, int ldb, ${dtype}* restrict c, int ldc) { % else: ${kname}(const ${dtype}* restrict b, ${dtype}* restrict c) { const int n = ${n}; const ${'long long' if k*ldb >= 2**31 else 'int'} ldb = ${ldb}; const ${'long long' if m*ldc >= 2**31 else 'int'} ldc = ${ldc}; % endif #pragma omp simd for (int i = 0; i < n; i++) { % for j, jx in enumerate(A): % if beta == 0: c[i + ${j}*ldc] = ${dot(lambda kx: f'b[i + {kx}*ldb]', jx)}; % elif beta == 1: c[i + ${j}*ldc] += ${dot(lambda kx: f'b[i + {kx}*ldb]', jx)}; % else: c[i + ${j}*ldc] = ${dot(lambda kx: f'b[i + {kx}*ldb]', jx)} + ${beta}*c[i + ${j}*ldc]; % endif % endfor } } GiMMiK-3.2.1/gimmik/kernels/cuda/000077500000000000000000000000001456313426700164415ustar00rootroot00000000000000GiMMiK-3.2.1/gimmik/kernels/cuda/base.mako000066400000000000000000000016751456313426700202350ustar00rootroot00000000000000% if dtype.endswith('4'): inline __device__ ${dtype} operator+(${dtype} a, ${dtype} b) { return make_${dtype}(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w); } inline __device__ ${dtype} operator*(${dtype[:-1]} a, ${dtype} b) { return make_${dtype}(a*b.x, a*b.y, a*b.z, a*b.w); } inline __device__ void operator+=(${dtype} &a, ${dtype} b) { a.x += b.x; a.y += b.y; a.z += b.z; a.w += b.w; } inline __device__ ${dtype} make_zero() { return make_${dtype}(0, 0, 0, 0); } % elif dtype.endswith('2'): inline __device__ ${dtype} operator+(${dtype} a, ${dtype} b) { return make_${dtype}(a.x + b.x, a.y + b.y); } inline __device__ ${dtype} operator*(${dtype[:-1]} a, ${dtype} b) { return make_${dtype}(a*b.x, a*b.y); } inline __device__ void operator+=(${dtype} &a, ${dtype} b) { a.x 
+= b.x; a.y += b.y; } inline __device__ ${dtype} make_zero() { return make_${dtype}(0, 0); } % else: inline __device__ ${dtype} make_zero() { return 0; } % endif ${next.body()} GiMMiK-3.2.1/gimmik/kernels/cuda/bstream-msplit.mako000066400000000000000000000051231456313426700222560ustar00rootroot00000000000000<%inherit file='base'/> <% mx = partition(A, into=msplit, by='rows') bchunks = chunk(bix, bsz) %> __global__ void % if n is None: ${kname}(int n, const ${dtype}* __restrict__ b, int ldb, ${dtype}* __restrict__ c, int ldc) { % if width > 1: n = ((n + ${width} - 1) / ${width}) * ${width}; ldb /= ${width}; ldc /= ${width}; % endif % else: ${kname}(const ${dtype}* __restrict__ b, ${dtype}* __restrict__ c) { const int n = ${-(-n // width)}; const ${'long long' if k*ldb >= width*2**31 else 'int'} ldb = ${ldb // width}; const ${'long long' if m*ldc >= width*2**31 else 'int'} ldc = ${ldc // width}; % endif int i = blockDim.x*blockIdx.x + threadIdx.x; ${dtype} bv, csub[${-(-m // msplit)}]; __shared__ ${dtype} bsub[2][${bsz}][${blockx}]; if (i >= n) return; ## Iterate over each row-chunk of C % for cid, mcx in enumerate(mx): if (threadIdx.y == ${cid}) { ## Iterate over each row-chunk of B % for bb in range(len(bchunks)): ## Fill the initial shared memory block % if loop.first: % for kx in bchunks[0]: % if loop.index % msplit == cid: bsub[0][${loop.index}][threadIdx.x] = __ldcg(b + i + ${kx}*ldb); % endif % endfor __barrier_sync(0); % endif ## Start filling the next shared memory block % if not loop.last: % for kx in bchunks[bb + 1]: % if loop.index % msplit == cid: bsub[${(bb + 1) % 2}][${loop.index}][threadIdx.x] = __ldcg(b + i + ${kx}*ldb); % endif % endfor % endif ## Accumulate our dot products % for kx in bchunks[bb]: bv = bsub[${bb % 2}][${loop.index}][threadIdx.x]; % for j, jx in enumerate(A[mcx, kx]): % if jx != 0 and kx == afix[mcx[j]]: csub[${j}] = ${jx}*bv; % elif jx != 0: csub[${j}] += ${jx}*bv; % endif ## If we're done with this dot product then store to 
global % if kx == alix[mcx[j]] and beta == 0: __stcg(c + i + ${mcx[j]}*ldc, csub[${j}]); % elif kx == alix[mcx[j]] and beta == 1: c[i + ${mcx[j]}*ldc] += csub[${j}]; % elif kx == alix[mcx[j]]: c[i + ${mcx[j]}*ldc] = csub[${j}] + ${beta}*c[i + ${mcx[j]}*ldc]; % endif % endfor % endfor __barrier_sync(0); % endfor ## Handle rows of A which are all zero % for j, jx in enumerate(afix): % if jx == -1 and j % msplit == cid and beta == 0: __stcg(c + i + ${j}*ldc, make_zero()); % elif jx == -1 and j % msplit == cid and beta != 1: c[i + ${j}*ldc] *= ${beta}; % endif % endfor } % endfor } GiMMiK-3.2.1/gimmik/kernels/cuda/bstream.mako000066400000000000000000000027141456313426700207530ustar00rootroot00000000000000<%inherit file='base'/> __global__ void % if n is None: ${kname}(int n, const ${dtype}* __restrict__ b, int ldb, ${dtype}* __restrict__ c, int ldc) { % if width > 1: n = ((n + ${width} - 1) / ${width}) * ${width}; ldb /= ${width}; ldc /= ${width}; % endif % else: ${kname}(const ${dtype}* __restrict__ b, ${dtype}* __restrict__ c) { const int n = ${-(-n // width)}; const ${'long long' if k*ldb >= width*2**31 else 'int'} ldb = ${ldb // width}; const ${'long long' if m*ldc >= width*2**31 else 'int'} ldc = ${ldc // width}; % endif const int i = blockDim.x*blockIdx.x + threadIdx.x; if (i < n) { ${dtype} bv, csub[${m}]; ## Iterate through the used rows of B % for kx in bix: bv = __ldcg(b + i + ${kx}*ldb); % for j, jx in enumerate(A[:, kx]): % if jx != 0 and kx == afix[j]: csub[${j}] = ${jx}*bv; % elif jx != 0: csub[${j}] += ${jx}*bv; % endif ## % if kx == alix[j] and beta == 0: __stcg(c + i + ${j}*ldc, csub[${j}]); % elif kx == alix[j] and beta == 1: c[i + ${j}*ldc] += csub[${j}]; % elif kx == alix[j]: c[i + ${j}*ldc] = csub[${j}] + ${beta}*c[i + ${j}*ldc]; % endif % endfor % endfor ## Handle rows of A which are all zero % for j, jx in enumerate(afix): % if jx == -1 and beta == 0: c[i + ${j}*ldc] = make_zero(); % elif jx == -1 and beta != 1: c[i + ${j}*ldc] *= ${beta}; % 
endif % endfor } } GiMMiK-3.2.1/gimmik/kernels/cuda/cstream-ksplit.mako000066400000000000000000000045511456313426700222610ustar00rootroot00000000000000<%inherit file='base'/> <% kparts = partition(A, ksplit, by='cols') cchunks = chunk(range(m), csz) loaded = set() %> __global__ void % if n is None: ${kname}(int n, const ${dtype}* __restrict__ b, int ldb, ${dtype}* __restrict__ c, int ldc) { % if width > 1: n = ((n + ${width} - 1) / ${width}) * ${width}; ldb /= ${width}; ldc /= ${width}; % endif % else: ${kname}(const ${dtype}* __restrict__ b, ${dtype}* __restrict__ c) { const int n = ${-(-n // width)}; const ${'long long' if k*ldb >= width*2**31 else 'int'} ldb = ${ldb // width}; const ${'long long' if m*ldc >= width*2**31 else 'int'} ldc = ${ldc // width}; % endif const int i = blockDim.x*blockIdx.x + threadIdx.x; ${dtype} cv[${-(-csz // ksplit)}], bv[${-(-k // ksplit)}], dotp; __shared__ ${dtype} csub[${ksplit - 1}][${csz}][${blockx}]; if (i >= n) return; ## Iterate over the column-partitions of B % for bid, kbx in enumerate(kparts): if (threadIdx.y == ${bid}) { ## Iterate over the row-partitions of C % for cchunk in cchunks: ## Evaluate our partial dot products % for j in cchunk: ## Load in any missing parts of B % for kx in kbx: % if A[j, kx] != 0 and kx not in loaded: bv[${loop.index}] = __ldcg(b + i + ${kx}*ldb); <% loaded.add(kx) %> % endif % endfor % if (dotex := dot(lambda kx: f'bv[{kx}]', A[j, kbx])) != '0.0': dotp = ${dotex}; % else: dotp = make_zero(); % endif ## Save to a register % if loop.index % ksplit == bid: cv[${loop.index // ksplit}] = dotp; ## Save to shared memory % else: csub[${bid - (bid > loop.index % ksplit)}][${loop.index}][threadIdx.x] = dotp; % endif % endfor __barrier_sync(0); ## Sum and output the final set of dot products % for j in cchunk: % if loop.index % ksplit == bid: dotp = cv[${loop.index // ksplit}] + ${' + '.join(f'csub[{i}][{loop.index}][threadIdx.x]' for i in range(ksplit - 1))}; % if beta == 0: __stcg(c + i + ${j}*ldc, 
dotp); % elif beta == 1: c[i + ${j}*ldc] += dotp; % else: c[i + ${j}*ldc] = dotp + ${beta}*c[i + ${j}*ldc]; % endif % endif % endfor __barrier_sync(0); % endfor } % endfor } GiMMiK-3.2.1/gimmik/kernels/cuda/cstream.mako000066400000000000000000000021311456313426700207450ustar00rootroot00000000000000<%inherit file='base'/> <% ksplit = 2 if m < 36 else 1 %> __global__ void % if n is None: ${kname}(int n, const ${dtype}* __restrict__ b, int ldb, ${dtype}* __restrict__ c, int ldc) { % if width > 1: n = ((n + ${width} - 1) / ${width}) * ${width}; ldb /= ${width}; ldc /= ${width}; % endif % else: ${kname}(const ${dtype}* __restrict__ b, ${dtype}* __restrict__ c) { const int n = ${-(-n // width)}; const ${'long long' if k*ldb >= width*2**31 else 'int'} ldb = ${ldb // width}; const ${'long long' if m*ldc >= width*2**31 else 'int'} ldc = ${ldc // width}; % endif const int i = blockDim.x*blockIdx.x + threadIdx.x; ${dtype} dotp; if (i < n) { % for j, jx in enumerate(A): % if (dotex := dot(lambda kx: f'b[i + {kx}*ldb]', jx, maxsplit=ksplit)) != '0.0': dotp = ${dotex}; % else: dotp = make_zero(); % endif % if beta == 0: c[i + ${j}*ldc] = dotp; % elif beta == 1 and dotex != '0.0': c[i + ${j}*ldc] += dotp; % else: c[i + ${j}*ldc] = dotp + ${beta}*c[i + ${j}*ldc]; % endif % endfor } } GiMMiK-3.2.1/gimmik/kernels/hip/000077500000000000000000000000001456313426700163055ustar00rootroot00000000000000GiMMiK-3.2.1/gimmik/kernels/hip/base.mako000066400000000000000000000004641456313426700200740ustar00rootroot00000000000000% if dtype.endswith('4'): static inline __device__ ${dtype} make_zero() { return make_${dtype}(0, 0, 0, 0); } % elif dtype.endswith('2'): static inline __device__ ${dtype} make_zero() { return make_${dtype}(0, 0); } % else: static inline __device__ ${dtype} make_zero() { return 0; } % endif ${next.body()} GiMMiK-3.2.1/gimmik/kernels/hip/bstream-msplit.mako000066400000000000000000000052341456313426700221250ustar00rootroot00000000000000<%inherit file='base'/> <% mx = 
partition(A, into=msplit, by='rows') bchunks = chunk(bix, bsz) %> __global__ __launch_bounds__(${blockx*msplit}) void % if n is None: ${kname}(int n, const ${dtype}* __restrict__ b, int ldb, ${dtype}* __restrict__ c, int ldc) { % if width > 1: n = ((n + ${width} - 1) / ${width}) * ${width}; ldb /= ${width}; ldc /= ${width}; % endif % else: ${kname}(const ${dtype}* __restrict__ b, ${dtype}* __restrict__ c) { const int n = ${-(-n // width)}; const ${'long long' if k*ldb >= width*2**31 else 'int'} ldb = ${ldb // width}; const ${'long long' if m*ldc >= width*2**31 else 'int'} ldc = ${ldc // width}; % endif int i = blockDim.x*blockIdx.x + threadIdx.x; ${dtype} bv, csub[${-(-m // msplit)}]; __shared__ ${dtype} bsub[2][${bsz}][${blockx}]; ## Fill the initial shared memory block % for cid in range(msplit): if (i < n && threadIdx.y == ${cid}) { % for kx in bchunks[0]: % if loop.index % msplit == cid: bsub[0][${loop.index}][threadIdx.x] = b[i + ${kx}*ldb]; % endif % endfor } % endfor __syncthreads(); ## Iterate over each row-chunk of B % for bb in range(len(bchunks)): ## Iterate over each row-chunk of C % for cid, mcx in enumerate(mx): if (i < n && threadIdx.y == ${cid}) { ## Start filling the next shared memory block % if not loop.parent.last: % for kx in bchunks[bb + 1]: % if loop.index % msplit == cid: bsub[${(bb + 1) % 2}][${loop.index}][threadIdx.x] = b[i + ${kx}*ldb]; % endif % endfor % endif ## Accumulate our dot products % for kx in bchunks[bb]: bv = bsub[${bb % 2}][${loop.index}][threadIdx.x]; % for j, jx in enumerate(A[mcx, kx]): % if jx != 0 and kx == afix[mcx[j]]: csub[${j}] = ${jx}*bv; % elif jx != 0: csub[${j}] += ${jx}*bv; % endif ## If we're done with this dot product then store to global % if kx == alix[mcx[j]] and beta == 0: c[i + ${mcx[j]}*ldc] = csub[${j}]; % elif kx == alix[mcx[j]] and beta == 1: c[i + ${mcx[j]}*ldc] += csub[${j}]; % elif kx == alix[mcx[j]]: c[i + ${mcx[j]}*ldc] = csub[${j}] + ${beta}*c[i + ${mcx[j]}*ldc]; % endif % endfor % endfor ## 
Handle rows of A which are all zero % if loop.parent.last: % for j, jx in enumerate(afix): % if jx == -1 and j % msplit == cid and beta == 0: c[i + ${j}*ldc] = make_zero(); % elif jx == -1 and j % msplit == cid and beta != 1: c[i + ${j}*ldc] *= ${beta}; % endif % endfor % endif } % endfor __syncthreads(); % endfor } GiMMiK-3.2.1/gimmik/kernels/hip/bstream.mako000066400000000000000000000027221456313426700206160ustar00rootroot00000000000000<%inherit file='base'/> __global__ __launch_bounds__(128) void % if n is None: ${kname}(int n, const ${dtype}* __restrict__ b, int ldb, ${dtype}* __restrict__ c, int ldc) { % if width > 1: n = ((n + ${width} - 1) / ${width}) * ${width}; ldb /= ${width}; ldc /= ${width}; % endif % else: ${kname}(const ${dtype}* __restrict__ b, ${dtype}* __restrict__ c) { const int n = ${-(-n // width)}; const ${'long long' if k*ldb >= width*2**31 else 'int'} ldb = ${ldb // width}; const ${'long long' if m*ldc >= width*2**31 else 'int'} ldc = ${ldc // width}; % endif const int i = blockDim.x*blockIdx.x + threadIdx.x; if (i < n) { ${dtype} bv, csub[${m}]; ## Iterate through the used rows of B % for kx in bix: bv = b[i + ${kx}*ldb]; % for j, jx in enumerate(A[:, kx]): % if jx != 0 and kx == afix[j]: csub[${j}] = ${jx}*bv; % elif jx != 0: csub[${j}] += ${jx}*bv; % endif ## % if kx == alix[j] and beta == 0: c[i + ${j}*ldc] = csub[${j}]; % elif kx == alix[j] and beta == 1: c[i + ${j}*ldc] += csub[${j}]; % elif kx == alix[j]: c[i + ${j}*ldc] = csub[${j}] + ${beta}*c[i + ${j}*ldc]; % endif % endfor % endfor ## Handle rows of A which are all zero % for j, jx in enumerate(afix): % if jx == -1 and beta == 0: c[i + ${j}*ldc] = make_zero(); % elif jx == -1 and beta != 1: c[i + ${j}*ldc] *= ${beta}; % endif % endfor } } GiMMiK-3.2.1/gimmik/kernels/hip/cstream-ksplit.mako000066400000000000000000000047421456313426700221270ustar00rootroot00000000000000<%inherit file='base'/> <% kparts = partition(A, ksplit, by='cols') cchunks = chunk(range(m), csz) loaded = set() %> 
__global__ __launch_bounds__(${blockx*ksplit}) void % if n is None: ${kname}(int n, const ${dtype}* __restrict__ b, int ldb, ${dtype}* __restrict__ c, int ldc) { % if width > 1: n = ((n + ${width} - 1) / ${width}) * ${width}; ldb /= ${width}; ldc /= ${width}; % endif % else: ${kname}(const ${dtype}* __restrict__ b, ${dtype}* __restrict__ c) { const int n = ${-(-n // width)}; const ${'long long' if k*ldb >= width*2**31 else 'int'} ldb = ${ldb // width}; const ${'long long' if m*ldc >= width*2**31 else 'int'} ldc = ${ldc // width}; % endif int i = blockDim.x*blockIdx.x + threadIdx.x; ${dtype} cv[${-(-csz // ksplit)}], bv[${-(-k // ksplit)}], dotp; __shared__ ${dtype} csub[${ksplit - 1}][${csz}][${blockx}]; ## Iterate over the row-partitions of C % for cchunk in cchunks: ## Iterate over the row-partitions of B % for bid, kbx in enumerate(kparts): if (i < n && threadIdx.y == ${bid}) { ## Evaluate our partial dot products % for j in cchunk: ## Load in any missing parts of B % for kx in kbx: % if A[j, kx] != 0 and kx not in loaded: bv[${loop.index}] = b[i + ${kx}*ldb]; <% loaded.add(kx) %> % endif % endfor % if (dotex := dot(lambda kx: f'bv[{kx}]', A[j, kbx])) != '0.0': dotp = ${dotex}; % else: dotp = make_zero(); % endif ## Save to a register % if loop.index % ksplit == bid: cv[${loop.index // ksplit}] = dotp; ## Save to shared memory % else: csub[${bid - (bid > loop.index % ksplit)}][${loop.index}][threadIdx.x] = dotp; % endif % endfor } % endfor __syncthreads(); ## Iterate over the column-partitions of B % for bid, kbx in enumerate(kparts): if (i < n && threadIdx.y == ${bid}) { ## Sum and output the final set of dot products % for j in cchunk: % if loop.index % ksplit == bid: dotp = cv[${loop.index // ksplit}] + ${' + '.join(f'csub[{i}][{loop.index}][threadIdx.x]' for i in range(ksplit - 1))}; % if beta == 0: c[i + ${j}*ldc] = dotp; % elif beta == 1: c[i + ${j}*ldc] += dotp; % else: c[i + ${j}*ldc] = dotp + ${beta}*c[i + ${j}*ldc]; % endif % endif % endfor } % endfor 
__syncthreads(); % endfor } GiMMiK-3.2.1/gimmik/kernels/hip/cstream.mako000066400000000000000000000021601456313426700206130ustar00rootroot00000000000000<%inherit file='base'/> <% ksplit = 2 if m < 36 else 1 %> __global__ __launch_bounds__(128) void % if n is None: ${kname}(int n, const ${dtype}* __restrict__ b, int ldb, ${dtype}* __restrict__ c, int ldc) { % if width > 1: n = ((n + ${width} - 1) / ${width}) * ${width}; ldb /= ${width}; ldc /= ${width}; % endif % else: ${kname}(const ${dtype}* __restrict__ b, ${dtype}* __restrict__ c) { const int n = ${-(-n // width)}; const ${'long long' if k*ldb >= width*2**31 else 'int'} ldb = ${ldb // width}; const ${'long long' if m*ldc >= width*2**31 else 'int'} ldc = ${ldc // width}; % endif const int i = blockDim.x*blockIdx.x + threadIdx.x; ${dtype} dotp; if (i < n) { % for j, jx in enumerate(A): % if (dotex := dot(lambda kx: f'b[i + {kx}*ldb]', jx, maxsplit=ksplit)) != '0.0': dotp = ${dotex}; % else: dotp = make_zero(); % endif % if beta == 0: c[i + ${j}*ldc] = dotp; % elif beta == 1 and dotex != '0.0': c[i + ${j}*ldc] += dotp; % else: c[i + ${j}*ldc] = dotp + ${beta}*c[i + ${j}*ldc]; % endif % endfor } } GiMMiK-3.2.1/gimmik/kernels/ispc/000077500000000000000000000000001456313426700164635ustar00rootroot00000000000000GiMMiK-3.2.1/gimmik/kernels/ispc/cstream.mako000066400000000000000000000014541456313426700207760ustar00rootroot00000000000000export void % if n is None: ${kname}(uniform int n, const uniform ${dtype} b[], uniform int ldb, ${dtype} uniform c[], uniform int ldc) { % else: ${kname}(const uniform ${dtype} b[], ${dtype} uniform c[]) { const uniform int n = ${n}; const uniform ${'long long' if k*ldb >= 2**31 else 'int'} ldb = ${ldb}; const uniform ${'long long' if m*ldc >= 2**31 else 'int'} ldc = ${ldc}; % endif foreach (i = 0 ... 
n) { % for j, jx in enumerate(A): % if beta == 0: c[i + ${j}*ldc] = ${dot(lambda kx: f'b[i + {kx}*ldb]', jx)}; % elif beta == 1: c[i + ${j}*ldc] += ${dot(lambda kx: f'b[i + {kx}*ldb]', jx)}; % else: c[i + ${j}*ldc] = ${dot(lambda kx: f'b[i + {kx}*ldb]', jx)} + ${beta}*c[i + ${j}*ldc]; % endif % endfor } } GiMMiK-3.2.1/gimmik/kernels/metal/000077500000000000000000000000001456313426700166275ustar00rootroot00000000000000GiMMiK-3.2.1/gimmik/kernels/metal/base.mako000066400000000000000000000004721456313426700204150ustar00rootroot00000000000000#include <metal_stdlib> using namespace metal; % if dtype.endswith('4'): static inline ${dtype} make_zero() { return ${dtype}(0, 0, 0, 0); } % elif dtype.endswith('2'): static inline ${dtype} make_zero() { return ${dtype}(0, 0); } % else: static inline ${dtype} make_zero() { return 0; } % endif ${next.body()} GiMMiK-3.2.1/gimmik/kernels/metal/bstream-msplit.mako000066400000000000000000000055461456313426700224550ustar00rootroot00000000000000<%inherit file='base'/> <% mx = partition(A, into=msplit, by='rows') bchunks = chunk(bix, bsz) %> kernel void % if n is None: ${kname}(constant int& n_, device ${dtype}* b, constant int& ldb_, device ${dtype}* c, constant int& ldc_, uint2 tpig [[thread_position_in_grid]], uint2 tpitg [[thread_position_in_threadgroup]]) { const int n = ((n_ + ${width} - 1) / ${width}) * ${width}; const int ldb = ldb_ / ${width}; const int ldc = ldc_ / ${width}; % else: ${kname}(device const ${dtype}* b, device ${dtype}* c, uint2 tpig [[thread_position_in_grid]], uint2 tpitg [[thread_position_in_threadgroup]]) { const int n = ${-(-n // width)}; const ${'long' if k*ldb >= width*2**31 else 'int'} ldb = ${ldb // width}; const ${'long' if m*ldc >= width*2**31 else 'int'} ldc = ${ldc // width}; % endif const int i = tpig.x; ${dtype} bv, csub[${-(-m // msplit)}]; threadgroup ${dtype} bsub[2][${bsz}][${blockx}]; ## Fill the initial shared memory block % for cid in range(msplit): if (i < n && tpitg.y == ${cid}) { % for kx in bchunks[0]:
% if loop.index % msplit == cid: bsub[0][${loop.index}][tpitg.x] = b[i + ${kx}*ldb]; % endif % endfor } % endfor threadgroup_barrier(mem_flags::mem_threadgroup); ## Iterate over each row-chunk of B % for bb in range(len(bchunks)): ## Iterate over each row-chunk of C % for cid, mcx in enumerate(mx): if (i < n && tpitg.y == ${cid}) { ## Start filling the next shared memory block % if not loop.parent.last: % for kx in bchunks[bb + 1]: % if loop.index % msplit == cid: bsub[${(bb + 1) % 2}][${loop.index}][tpitg.x] = b[i + ${kx}*ldb]; % endif % endfor % endif ## Accumulate our dot products % for kx in bchunks[bb]: bv = bsub[${bb % 2}][${loop.index}][tpitg.x]; % for j, jx in enumerate(A[mcx, kx]): % if jx != 0 and kx == afix[mcx[j]]: csub[${j}] = ${jx}*bv; % elif jx != 0: csub[${j}] += ${jx}*bv; % endif ## If we're done with this dot product then store to global % if kx == alix[mcx[j]] and beta == 0: c[i + ${mcx[j]}*ldc] = csub[${j}]; % elif kx == alix[mcx[j]] and beta == 1: c[i + ${mcx[j]}*ldc] += csub[${j}]; % elif kx == alix[mcx[j]]: c[i + ${mcx[j]}*ldc] = csub[${j}] + ${beta}*c[i + ${mcx[j]}*ldc]; % endif % endfor % endfor ## Handle rows of A which are all zero % if loop.parent.last: % for j, jx in enumerate(afix): % if jx == -1 and j % msplit == cid and beta == 0: c[i + ${j}*ldc] = make_zero(); % elif jx == -1 and j % msplit == cid and beta != 1: c[i + ${j}*ldc] *= ${beta}; % endif % endfor % endif } % endfor threadgroup_barrier(mem_flags::mem_threadgroup); % endfor } GiMMiK-3.2.1/gimmik/kernels/metal/bstream.mako000066400000000000000000000027421456313426700211420ustar00rootroot00000000000000<%inherit file='base'/> kernel void % if n is None: ${kname}(constant int& n_, device ${dtype}* b, constant int& ldb_, device ${dtype}* c, constant int& ldc_, uint i [[thread_position_in_grid]]) { const int n = ((n_ + ${width} - 1) / ${width}) * ${width}; const int ldb = ldb_ / ${width}; const int ldc = ldc_ / ${width}; % else: ${kname}(device const ${dtype}* b, device ${dtype}* 
c, uint i [[thread_position_in_grid]]) { const int n = ${-(-n // width)}; const ${'long' if k*ldb >= width*2**31 else 'int'} ldb = ${ldb // width}; const ${'long' if m*ldc >= width*2**31 else 'int'} ldc = ${ldc // width}; % endif if (i < n) { ${dtype} bv, csub[${m}]; ## Iterate through the used rows of B % for kx in bix: bv = b[i + ${kx}*ldb]; % for j, jx in enumerate(A[:, kx]): % if jx != 0 and kx == afix[j]: csub[${j}] = ${jx}*bv; % elif jx != 0: csub[${j}] += ${jx}*bv; % endif ## % if kx == alix[j] and beta == 0: c[i + ${j}*ldc] = csub[${j}]; % elif kx == alix[j] and beta == 1: c[i + ${j}*ldc] += csub[${j}]; % elif kx == alix[j]: c[i + ${j}*ldc] = csub[${j}] + ${beta}*c[i + ${j}*ldc]; % endif % endfor % endfor ## Handle rows of A which are all zero % for j, jx in enumerate(afix): % if jx == -1 and beta == 0: c[i + ${j}*ldc] = make_zero(); % elif jx == -1 and beta != 1: c[i + ${j}*ldc] *= ${beta}; % endif % endfor } } GiMMiK-3.2.1/gimmik/kernels/metal/cstream-ksplit.mako000066400000000000000000000052601456313426700224450ustar00rootroot00000000000000<%inherit file='base'/> <% kparts = partition(A, ksplit, by='cols') cchunks = chunk(range(m), csz) loaded = set() %> kernel void % if n is None: ${kname}(constant int& n_, device ${dtype}* b, constant int& ldb_, device ${dtype}* c, constant int& ldc_, uint2 tpig [[thread_position_in_grid]], uint2 tpitg [[thread_position_in_threadgroup]]) { const int n = ((n_ + ${width} - 1) / ${width}) * ${width}; const int ldb = ldb_ / ${width}; const int ldc = ldc_ / ${width}; % else: ${kname}(device const ${dtype}* b, device ${dtype}* c, uint2 tpig [[thread_position_in_grid]], uint2 tpitg [[thread_position_in_threadgroup]]) { const int n = ${-(-n // width)}; const ${'long' if k*ldb >= width*2**31 else 'int'} ldb = ${ldb // width}; const ${'long' if m*ldc >= width*2**31 else 'int'} ldc = ${ldc // width}; % endif const int i = tpig.x; ${dtype} cv[${-(-csz // ksplit)}], bv[${-(-k // ksplit)}], dotp; threadgroup ${dtype} csub[${ksplit -
1}][${csz}][${blockx}]; ## Iterate over the row-partitions of C % for cchunk in cchunks: ## Iterate over the row-partitions of B % for bid, kbx in enumerate(kparts): if (i < n && tpitg.y == ${bid}) { ## Evaluate our partial dot products % for j in cchunk: ## Load in any missing parts of B % for kx in kbx: % if A[j, kx] != 0 and kx not in loaded: bv[${loop.index}] = b[i + ${kx}*ldb]; <% loaded.add(kx) %> % endif % endfor % if (dotex := dot(lambda kx: f'bv[{kx}]', A[j, kbx])) != '0.0': dotp = ${dotex}; % else: dotp = make_zero(); % endif ## Save to a register % if loop.index % ksplit == bid: cv[${loop.index // ksplit}] = dotp; ## Save to shared memory % else: csub[${bid - (bid > loop.index % ksplit)}][${loop.index}][tpitg.x] = dotp; % endif % endfor } % endfor threadgroup_barrier(mem_flags::mem_threadgroup); ## Iterate over the column-partitions of B % for bid, kbx in enumerate(kparts): if (i < n && tpitg.y == ${bid}) { ## Sum and output the final set of dot products % for j in cchunk: % if loop.index % ksplit == bid: dotp = cv[${loop.index // ksplit}] + ${' + '.join(f'csub[{i}][{loop.index}][tpitg.x]' for i in range(ksplit - 1))}; % if beta == 0: c[i + ${j}*ldc] = dotp; % elif beta == 1: c[i + ${j}*ldc] += dotp; % else: c[i + ${j}*ldc] = dotp + ${beta}*c[i + ${j}*ldc]; % endif % endif % endfor } % endfor threadgroup_barrier(mem_flags::mem_threadgroup); % endfor } GiMMiK-3.2.1/gimmik/kernels/metal/cstream.mako000066400000000000000000000022001456313426700211300ustar00rootroot00000000000000<%inherit file='base'/> <% ksplit = 2 if m < 36 else 1 %> kernel void % if n is None: ${kname}(constant int& n_, device ${dtype}* b, constant int& ldb_, device ${dtype}* c, constant int& ldc_, uint i [[thread_position_in_grid]]) { const int n = ((n_ + ${width} - 1) / ${width}) * ${width}; const int ldb = ldb_ / ${width}; const int ldc = ldc_ / ${width}; % else: ${kname}(device const ${dtype}* b, device ${dtype}* c, uint i [[thread_position_in_grid]]) { const int n = ${-(-n // 
width)}; const ${'long' if k*ldb >= width*2**31 else 'int'} ldb = ${ldb // width}; const ${'long' if m*ldc >= width*2**31 else 'int'} ldc = ${ldc // width}; % endif ${dtype} dotp; if (i < n) { % for j, jx in enumerate(A): % if (dotex := dot(lambda kx: f'b[i + {kx}*ldb]', jx, maxsplit=ksplit)) != '0.0': dotp = ${dotex}; % else: dotp = make_zero(); % endif % if beta == 0: c[i + ${j}*ldc] = dotp; % elif beta == 1 and dotex != '0.0': c[i + ${j}*ldc] += dotp; % else: c[i + ${j}*ldc] = dotp + ${beta}*c[i + ${j}*ldc]; % endif % endfor } } GiMMiK-3.2.1/gimmik/kernels/opencl/000077500000000000000000000000001456313426700170055ustar00rootroot00000000000000GiMMiK-3.2.1/gimmik/kernels/opencl/bstream-msplit.mako000066400000000000000000000052751456313426700226320ustar00rootroot00000000000000<% mx = partition(A, into=msplit, by='rows') bchunks = chunk(bix, bsz) %> __kernel __attribute__((reqd_work_group_size(${blockx}, ${msplit}, 1))) void % if n is None: ${kname}(int n, __global const ${dtype}* restrict b, int ldb, __global ${dtype}* restrict c, int ldc) { % if width > 1: n = ((n + ${width} - 1) / ${width}) * ${width}; ldb /= ${width}; ldc /= ${width}; % endif % else: ${kname}(__global const ${dtype}* restrict b, __global ${dtype}* restrict c) { const int n = ${-(-n // width)}; const ${'long' if k*ldb >= width*2**31 else 'int'} ldb = ${ldb // width}; const ${'long' if m*ldc >= width*2**31 else 'int'} ldc = ${ldc // width}; % endif int i = get_global_id(0); int lx = get_local_id(0), ly = get_local_id(1); ${dtype} bv, csub[${-(-m // msplit)}]; __local ${dtype} bsub[2][${bsz}][${blockx}]; ## Fill the initial shared memory block % for cid in range(msplit): if (i < n && ly == ${cid}) { % for kx in bchunks[0]: % if loop.index % msplit == cid: bsub[0][${loop.index}][lx] = b[i + ${kx}*ldb]; % endif % endfor } % endfor work_group_barrier(CLK_LOCAL_MEM_FENCE); ## Iterate over each row-chunk of B % for bb in range(len(bchunks)): ## Iterate over each row-chunk of C % for cid, mcx in 
enumerate(mx): if (i < n && ly == ${cid}) { ## Start filling the next shared memory block % if not loop.parent.last: % for kx in bchunks[bb + 1]: % if loop.index % msplit == cid: bsub[${(bb + 1) % 2}][${loop.index}][lx] = b[i + ${kx}*ldb]; % endif % endfor % endif ## Accumulate our dot products % for kx in bchunks[bb]: bv = bsub[${bb % 2}][${loop.index}][lx]; % for j, jx in enumerate(A[mcx, kx]): % if jx != 0 and kx == afix[mcx[j]]: csub[${j}] = ${jx}*bv; % elif jx != 0: csub[${j}] += ${jx}*bv; % endif ## If we're done with this dot product then store to global % if kx == alix[mcx[j]] and beta == 0: c[i + ${mcx[j]}*ldc] = csub[${j}]; % elif kx == alix[mcx[j]] and beta == 1: c[i + ${mcx[j]}*ldc] += csub[${j}]; % elif kx == alix[mcx[j]]: c[i + ${mcx[j]}*ldc] = csub[${j}] + ${beta}*c[i + ${mcx[j]}*ldc]; % endif % endfor % endfor ## Handle rows of A which are all zero % if loop.parent.last: % for j, jx in enumerate(afix): % if jx == -1 and j % msplit == cid and beta == 0: c[i + ${j}*ldc] = 0; % elif jx == -1 and j % msplit == cid and beta != 1: c[i + ${j}*ldc] *= ${beta}; % endif % endfor % endif } % endfor work_group_barrier(CLK_LOCAL_MEM_FENCE); % endfor } GiMMiK-3.2.1/gimmik/kernels/opencl/bstream.mako000066400000000000000000000024001456313426700213070ustar00rootroot00000000000000__kernel void % if n is None: ${kname}(int n, __global const ${dtype}* restrict b, int ldb, __global ${dtype}* restrict c, int ldc) { % else: ${kname}(__global const ${dtype}* restrict b, __global ${dtype}* restrict c) { const int n = ${n}; const ${'long' if k*ldb >= width*2**31 else 'int'} ldb = ${ldb // width}; const ${'long' if m*ldc >= width*2**31 else 'int'} ldc = ${ldc // width}; % endif int i = get_global_id(0); if (i < n) { ${dtype} bv, csub[${m}]; ## Iterate through the used rows of B % for kx in bix: bv = b[i + ${kx}*ldb]; % for j, jx in enumerate(A[:, kx]): % if jx != 0 and kx == afix[j]: csub[${j}] = ${jx}*bv; % elif jx != 0: csub[${j}] += ${jx}*bv; % endif ## % if kx == alix[j]
and beta == 0: c[i + ${j}*ldc] = csub[${j}]; % elif kx == alix[j] and beta == 1: c[i + ${j}*ldc] += csub[${j}]; % elif kx == alix[j]: c[i + ${j}*ldc] = csub[${j}] + ${beta}*c[i + ${j}*ldc]; % endif % endfor % endfor ## Handle rows of A which are all zero % for j, jx in enumerate(afix): % if jx == -1 and beta == 0: c[i + ${j}*ldc] = 0; % elif jx == -1 and beta != 1: c[i + ${j}*ldc] *= ${beta}; % endif % endfor } } GiMMiK-3.2.1/gimmik/kernels/opencl/cstream-ksplit.mako000066400000000000000000000050141456313426700226200ustar00rootroot00000000000000<% kparts = partition(A, ksplit, by='cols') cchunks = chunk(range(m), csz) loaded = set() %> __kernel __attribute__((reqd_work_group_size(${blockx}, ${ksplit}, 1))) void % if n is None: ${kname}(int n, __global const ${dtype}* restrict b, int ldb, __global ${dtype}* restrict c, int ldc) { % if width > 1: n = ((n + ${width} - 1) / ${width}) * ${width}; ldb /= ${width}; ldc /= ${width}; % endif % else: ${kname}(__global const ${dtype}* restrict b, __global ${dtype}* restrict c) { const int n = ${-(-n // width)}; const ${'long' if k*ldb >= width*2**31 else 'int'} ldb = ${ldb // width}; const ${'long' if m*ldc >= width*2**31 else 'int'} ldc = ${ldc // width}; % endif int i = get_global_id(0); int lx = get_local_id(0), ly = get_local_id(1); ${dtype} cv[${-(-csz // ksplit)}], bv[${-(-k // ksplit)}], dotp; __local ${dtype} csub[${ksplit - 1}][${csz}][${blockx}]; ## Iterate over the row-partitions of C % for cchunk in cchunks: ## Iterate over the row-partitions of B % for bid, kbx in enumerate(kparts): if (i < n && ly == ${bid}) { ## Evaluate our partial dot products % for j in cchunk: ## Load in any missing parts of B % for kx in kbx: % if A[j, kx] != 0 and kx not in loaded: bv[${loop.index}] = b[i + ${kx}*ldb]; <% loaded.add(kx) %> % endif % endfor % if (dotex := dot(lambda kx: f'bv[{kx}]', A[j, kbx])) != '0.0': dotp = ${dotex}; % else: dotp = 0; % endif ## Save to a register % if loop.index % ksplit == bid: cv[${loop.index // 
ksplit}] = dotp; ## Save to shared memory % else: csub[${bid - (bid > loop.index % ksplit)}][${loop.index}][lx] = dotp; % endif % endfor } % endfor work_group_barrier(CLK_LOCAL_MEM_FENCE); ## Iterate over the column-partitions of B % for bid, kbx in enumerate(kparts): if (i < n && ly == ${bid}) { ## Sum and output the final set of dot products % for j in cchunk: % if loop.index % ksplit == bid: dotp = cv[${loop.index // ksplit}] + ${' + '.join(f'csub[{i}][{loop.index}][lx]' for i in range(ksplit - 1))}; % if beta == 0: c[i + ${j}*ldc] = dotp; % elif beta == 1: c[i + ${j}*ldc] += dotp; % else: c[i + ${j}*ldc] = dotp + ${beta}*c[i + ${j}*ldc]; % endif % endif % endfor } % endfor work_group_barrier(CLK_LOCAL_MEM_FENCE); % endfor } GiMMiK-3.2.1/gimmik/kernels/opencl/cstream.mako000066400000000000000000000017201456313426700213140ustar00rootroot00000000000000__kernel void % if n is None: ${kname}(int n, __global const ${dtype}* restrict b, int ldb, __global ${dtype}* restrict c, int ldc) { % if width > 1: n = ((n + ${width} - 1) / ${width}) * ${width}; ldb /= ${width}; ldc /= ${width}; % endif % else: ${kname}(__global const ${dtype}* restrict b, __global ${dtype}* restrict c) { const int n = ${-(-n // width)}; const ${'long' if k*ldb >= width*2**31 else 'int'} ldb = ${ldb // width}; const ${'long' if m*ldc >= width*2**31 else 'int'} ldc = ${ldc // width}; % endif int i = get_global_id(0); if (i < n) { % for j, jx in enumerate(A): % if beta == 0: c[i + ${j}*ldc] = ${dot(lambda kx: f'b[i + {kx}*ldb]', jx)}; % elif beta == 1: c[i + ${j}*ldc] += ${dot(lambda kx: f'b[i + {kx}*ldb]', jx)}; % else: c[i + ${j}*ldc] = ${dot(lambda kx: f'b[i + {kx}*ldb]', jx)} + ${beta}*c[i + ${j}*ldc]; % endif % endfor } } GiMMiK-3.2.1/gimmik/metal.py000066400000000000000000000042651456313426700155450ustar00rootroot00000000000000# -*- coding: utf-8 -*- from gimmik.base import MatMul class MetalMatMul(MatMul): platform = 'metal' basemeta = {'threadgroup': (128, 1, 1), 'threadgroup_mem_size': 0, 
'width': 1} def _kernel_generators(self, dtype, dsize): # B loading, C streaming kernel yield ('cstream', {}, {}) # B streaming, C accumulation kernel yield ('bstream', {}, {}) # Four-way m-split B streaming, C accumulation kernel ms, bsz, blkx = 4, 16, 32 args = {'msplit': ms, 'blockx': blkx, 'bsz': bsz} meta = {'threadgroup': (blkx, ms, 1), 'threadgroup_mem_size': 2*blkx*bsz*dsize} yield ('bstream-msplit', args, meta) # Four-way m-split B streaming, C accumulation kernel ms, bsz, blkx = 4, 20, 32 args = {'msplit': ms, 'blockx': blkx, 'bsz': bsz} meta = {'threadgroup': (blkx, ms, 1), 'threadgroup_mem_size': 2*blkx*bsz*dsize} yield ('bstream-msplit', args, meta) # Two-way k-split B loading, C streaming kernel ks, csz, blkx = 2, 20, 32 args = {'ksplit': ks, 'csz': csz, 'blockx': blkx} meta = {'threadgroup': (blkx, ks, 1), 'threadgroup_mem_size': (ks - 1)*csz*blkx*dsize} yield ('cstream-ksplit', args, meta) if self.aligne is not None and self.aligne % 2 == 0: # Vector B loading, C streaming kernel args = {'dtype': 'float2', 'width': 2} meta = {'width': 2} yield ('cstream', args, meta) # Vector B streaming, C accumulation kernel yield ('bstream', args, meta) # Vector four-way m-split B streaming, C accumulation kernel ms, bsz, blkx = 4, 16, 32 args = {'dtype': 'float2', 'width': 2, 'msplit': ms, 'blockx': blkx, 'bsz': bsz} meta = {'threadgroup': (blkx, ms, 1), 'threadgroup_mem_size': 2*blkx*bsz*dsize, 'width': 2} yield ('bstream-msplit', args, meta) def _process_meta(self, meta): if self.n is not None: tg = meta['threadgroup'] meta['grid'] = (-(-self.n // meta['width']), tg[1], 1) GiMMiK-3.2.1/gimmik/opencl.py000066400000000000000000000043471456313426700157240ustar00rootroot00000000000000# -*- coding: utf-8 -*- from gimmik.base import MatMul class OpenCLMatMul(MatMul): platform = 'opencl' basemeta = {'local_work_size': None, 'local_mem_size': 0, 'width': 1} def _kernel_generators(self, dtype, dsize, *, local_mem_size=None): max_local_mem = local_mem_size or 1024**3 # 
B loading, C streaming kernel yield ('cstream', {}, {}) # B streaming, C accumulation kernel yield ('bstream', {}, {}) # Four-way m-split B streaming, C accumulation kernel ms, bsz, blkx = 4, 16, 64 args = {'msplit': ms, 'blockx': blkx, 'bsz': bsz} meta = {'local_work_size': (blkx, ms), 'local_mem_size': 2*blkx*bsz*dsize} if meta['local_mem_size'] < max_local_mem: yield ('bstream-msplit', args, meta) # Two-way k-split B loading, C streaming kernel ks, csz, blkx = 2, 32, 64 args = {'ksplit': ks, 'csz': csz, 'blockx': blkx} meta = {'local_work_size': (blkx, ks), 'local_mem_size': (ks - 1)*csz*blkx*dsize} if meta['local_mem_size'] < max_local_mem: yield ('cstream-ksplit', args, meta) # At single precision also consider vectorized kernels if (dtype == 'float' and self.aligne is not None and self.aligne % 2 == 0): # Vector B loading, C streaming kernel args = {'dtype': 'float2', 'width': 2} meta = {'width': 2} yield ('cstream', args, meta) # Vector four-way m-split B streaming, C accumulation kernel ms, bsz, blkx = 4, 16, 64 args = {'dtype': 'float2', 'width': 2, 'msplit': ms, 'blockx': blkx, 'bsz': bsz} meta = {'local_work_size': (blkx, ms), 'local_mem_size': 2*blkx*bsz*dsize, 'width': 2} if meta['local_mem_size'] < max_local_mem: yield ('bstream-msplit', args, meta) def _process_meta(self, meta): if self.n is not None: lws, width = meta['local_work_size'], meta['width'] if lws is not None: meta['global_work_size'] = (-(-self.n // width), lws[1]) else: meta['global_work_size'] = (-(-self.n // width),) GiMMiK-3.2.1/setup.py000077500000000000000000000036631456313426700143320ustar00rootroot00000000000000#!/usr/bin/env python # -*- coding: utf-8 -*- import re from setuptools import setup import sys # Python version if sys.version_info[:2] < (3, 9): print('GiMMiK requires Python 3.9 or newer') sys.exit(-1) # GiMMiK version vfile = open('gimmik/_version.py').read() vsrch = re.search(r"^__version__ = ['\"]([^'\"]*)['\"]", vfile, re.M) if vsrch: version = vsrch.group(1) else: 
 print('Unable to find a version string in gimmik/_version.py') sys.exit(-1) # Data package_data = { 'gimmik': ['kernels/*/*.mako'], } # Hard dependencies install_requires = [ 'mako', 'numpy >= 1.7' ] # Info classifiers = [ 'License :: OSI Approved :: BSD License', 'Programming Language :: Python :: 3.9', 'Programming Language :: Python :: 3.10', 'Programming Language :: Python :: 3.11', 'Topic :: Scientific/Engineering' ] # Long Description long_description = '''GiMMiK is a Python based kernel generator for matrix multiplication kernels for various accelerator platforms. For small operator matrices the generated kernels are capable of outperforming the state-of-the-art general matrix multiplication routines such as cuBLAS GEMM or clBLAS GEMM. GiMMiK was originally developed as part of Bartosz Wozniak's master's thesis in the Department of Computing at Imperial College London and is currently maintained by Freddie Witherden.''' # Keywords keywords = ['Matrix Multiplication', 'ISPC', 'GPU', 'CUDA', 'HIP', 'Metal', 'OpenCL'] setup(name='gimmik', version=version, # Packages packages=['gimmik'], package_data=package_data, install_requires=install_requires, # Metadata description='Generator of Matrix Multiplication Kernels', long_description=long_description, maintainer='Freddie Witherden', maintainer_email='freddie@witherden.org', url='https://github.com/vincentlab/GiMMiK', license='BSD', keywords=keywords, classifiers=classifiers)